When It’s Gone It’s Gone!

Let me say that I like log shipping: for me as a DBA it ensures I have a set of backups in a different location, hopefully a different site to my main data centre.
Mostly, though, I like log shipping because all of my backups get restored soon after being taken, and as we all should know, “the only good backup is one that’s been successfully restored”.

I’ve log shipped since SQL 6.5 and I don’t use a third party solution; I mean, it’s not difficult, is it?

1.    Take backup
2.    Copy to new location
3.    Restore backup
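
A minimal sketch of those three steps, assuming a hand-rolled job along the lines I describe; the database name, paths and file names are purely illustrative:

-- 1. Take a log backup on the primary (names and paths are examples only)
BACKUP LOG [SalesDB]
    TO DISK = N'D:\Backups\SalesDB_202001011200.trn'
    WITH INIT, CHECKSUM;

-- 2. Copy the file to the standby site, e.g. via robocopy from a scheduled job
-- robocopy D:\Backups \\DRServer\LogShip\SalesDB SalesDB_202001011200.trn

-- 3. Restore on the standby, leaving the database able to accept further log restores
RESTORE LOG [SalesDB]
    FROM DISK = N'E:\LogShip\SalesDB\SalesDB_202001011200.trn'
    WITH STANDBY = N'E:\LogShip\SalesDB\SalesDB_undo.dat';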

Yup, really is as easy as 1, 2, 3. However there are some serious issues with how I’ve seen log shipping implemented, and I’m often amused at how folk seem to miss the whole point of why we log ship; in fairness it quite often won’t be a DBA who has deployed log shipping in this manner.
The problem with not getting a DBA involved? Mostly I’d say not actually understanding log shipping and why you’d use it.

In fairness I usually find that differential backups are not fully understood by non-DBAs either.

So what is my gripe then? Well, almost every implementation of log shipping set up this way doesn’t ship the daily full backups and restore them – heaven forbid the system which only has one full backup a week and relies on differentials.
Meanwhile I’m told that “we’ve put in log shipping for our DR solution”.
Well … not really; not in my world. The really good point of log shipping over mirroring, replication or availability groups is that you can step back to a point in time restore, BUT you will need a full backup as the starting point for the log restores.
In a failover situation where the log shipped server has to become the live server I have seen a copied file cause the database to go suspect. Yes it’s rare, but I’ve seen it happen; the problem can arise if the file was being copied at the time of the failure.
Likewise I’ve seen a standby database turn out to be corrupt – only once, and on SQL 2000, but it did happen.
What I discovered was that the logs continued to restore quite happily, but the moment a full backup was restored the database threw an error and the whole process failed.
In this particular instance the database corruption had occurred a couple of weeks earlier, yet the logs carried on restoring without complaint; it was only when we attempted to restore a full backup that everything fell over.
The reason we were not copying the full backups every day was down to bandwidth: leased lines are costly even now, but back then there just wasn’t the bandwidth to sensibly copy the full backups every day.
And yes .. we increased our bandwidth.

Lack of bandwidth is the usual reason for leaving out the full backups. I’ve seen so many ways to attempt to get around a lack of bandwidth, and some of the explanations for slow copy times have been truly priceless – I’ll include just one to give a flavour: “Ah yes, you only get 500 Mb on a gigabit network as it’s full duplex”. 500 Mb in each direction then?

So log shipping protects against all types of problems, including accidental or malicious data damage. The problem with mirroring and replication: “when it’s gone it’s gone”. If I delete a table by mistake on my primary availability group node it’s gone on the secondary a few milliseconds later.

Log shipping allows for point in time restores, but you need your full backups in place.
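
For completeness, a hedged sketch of what that point in time restore looks like – full backup first, then logs up to just before the mistake (the database name, paths and STOPAT time are all illustrative):

RESTORE DATABASE [SalesDB]
    FROM DISK = N'E:\LogShip\SalesDB_full.bak'
    WITH NORECOVERY, REPLACE;

RESTORE LOG [SalesDB]
    FROM DISK = N'E:\LogShip\SalesDB_1300.trn'
    WITH NORECOVERY;

-- stop just before the accidental delete, then bring the database online
RESTORE LOG [SalesDB]
    FROM DISK = N'E:\LogShip\SalesDB_1315.trn'
    WITH STOPAT = '2020-01-01T13:07:00', RECOVERY;

Without a full backup at the standby site there is no valid starting point for that chain, which is the whole of my gripe.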


Oh the irony of it all!!

In my previous post I mentioned what I see as a fatal flaw in the “alert on error” scenario; well, no sooner had I posted than, to quote Shakespeare, I was “hoist with my own petard”.

HA! I have a log shipped setup where the server sends alerts using SMTP .. well, today I discovered that for some reason or another the alerts were not reaching me; it appears there is an issue raising them – no errors on the servers, just no emails arriving.

In my defence I have a visual display which shows the latency of the log shipped databases along with a list of the restored files; parts of the display change colour according to how far behind production the log shipped databases are, so I knew all was well. A failure of the display is also an indication of an issue, a network link being down for example.
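
As an aside, a hedged sketch of the sort of query a display like that could sit on. My own display reads from custom tables, but msdb’s restore history gives much the same picture for any restored database (the table and columns here are the real msdb ones; any thresholds you colour on are up to you):

SELECT  rh.destination_database_name                       AS database_name,
        MAX(rh.restore_date)                               AS last_restore,
        DATEDIFF(MINUTE, MAX(rh.restore_date), GETDATE())  AS minutes_behind
FROM    msdb.dbo.restorehistory AS rh
WHERE   rh.restore_type = 'L'      -- log restores only
GROUP BY rh.destination_database_name
ORDER BY minutes_behind DESC;      -- anything too far behind gets a red cell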

SO .. there you have it, a fully working example of the dangers of only alerting on an error.


How's your email today? No alerts?

When I first started as a SQL Server DBA / Developer in 1995 working with SQL Server 6.x one of the key steps was to set up alerting and monitoring.

I was working for a very rapidly growing mortgage provider and DR, monitoring and performance were critical to the ongoing success of the company.

I was very lucky to be working with some very clever and gifted people to whom SQL Server was something new, myself included as my prior experience had been Ingres and a little Oracle.

This meant exciting times, with plenty of functionality to be discovered and explored.

One of the things I implemented was log shipping. This didn't exist as a Microsoft option back then so it had to be hand written, and I still to this day use a variation of that original method over the Microsoft offering - yes, it's only a robocopy,

but back then even robocopy was new, and we needed to know the whole thing was working as expected.

This was also the start of my customised monitoring but that story doesn't belong here.

So this was a fundamental part of DR and business continuity, as well as safeguarding the company data from loss or damage; back then it wasn't unusual for thieves to break into companies to steal server hardware.

Although it all seems very strange now, a 9GB SCSI disk or a (4MB?) memory chip was very costly.

Anyway how do you monitor or alert? I remember the sometimes heated discussions regarding what was a reasonable backup strategy and why alerting on failure was a really bad idea and so on.

Q. Why would I say that alerting on failure is a bad idea?

( Answers on a postcard to ........ )

A. If you don't arrive at something similar to the answer below and you're a Production DBA, I'd suggest you're probably not quite paranoid enough!

The problem with only delivering alerts on failure is when there is a failure around your mail server .. sadly I have seen this situation fully demonstrated, with teams sat oblivious to an ongoing issue just because there were no alerts.

How many times over the years? Quite a number, to be honest.

So way back then there was a continuous string of alerts/emails to tell us all was well; a sudden lack of emails was an indication of a problem.

I have a presentation I gave to the SSUG whose main theme was "Don't let the users tell you your server is down", and I think this is a key way to look at supporting production systems. If the first you know of a problem is when a user, or in a few cases a company director, appears next to your desk to ask you what's wrong, you really do have a problem.

I use customised real time dashboards to monitor key systems; the dashboards are designed in such a way that any issues should show before they become serious, so at least I know before that phone call or visit to my desk.

The loss of a dashboard is obviously a symptom. I use alerts too, some I have written myself but most come through Idera SQL Diagnostic Manager which I also use as part of my monitoring. I've used this product since the late 1990s when it was known as SQL Probe.

This post is sponsored by .. an email issue .. surprise, surprise! Now I wasn't having any system issues, but I was monitoring a key business process which I watch using some reports and a dashboard. Whilst watching I received some alerts .. but the alerts didn't match what I was seeing visually.

At this point I logged into SQL Diagnostic Manager as this product allows me to step through historical performance data .. e.g. step back 60 mins and then advance one minute at a time.

( there's a wealth of information that can be viewed; Locks, deadlocks, blocking, sessions, cpu, users, io and so on )

https://www.idera.com/productssolutions/sqlserver/sqldiagnosticmanager

I wasn't able to see the indicated issues from the alerts .. hmmmm .. A check then showed that the alert had actually been raised an hour earlier, and a handful of checks later it was clear there was a hefty delay to the email alerts I was receiving.

And that is the problem with alerts vs monitoring: if you get an issue with the delivery you'll maybe never know until it's too late .. then you will be reactive and playing catch up rather than being proactive and hopefully ahead of the game.

A SQL Server note here: I raise SMTP alerts from SQL Server; I don't use Database Mail or any of the built-in mail options - force of habit I suppose.

However I don't raise mailed alerts directly; I have a "send message procedure" that alerts are sent to. This proc reads from a configuration table which flags where an alert is sent. There are two flags, "mail" and "file", and both can be toggled; if I have any email issues I toggle the "file" setting and the alerts get stored in a table instead.
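
A hedged sketch of that idea follows; the table and procedure names, columns and the use of sp_send_dbmail are all assumptions for illustration (as above, I actually raise SMTP alerts outside Database Mail), not my production code:

CREATE TABLE dbo.AlertConfig
(
    AlertName  sysname NOT NULL PRIMARY KEY,
    SendMail   bit     NOT NULL DEFAULT 1,   -- deliver by email
    SendToFile bit     NOT NULL DEFAULT 0    -- store in a table instead / as well
);

CREATE TABLE dbo.AlertHistory
(
    AlertId   int IDENTITY  PRIMARY KEY,
    AlertName sysname       NOT NULL,
    AlertText nvarchar(max) NOT NULL,
    RaisedAt  datetime2     NOT NULL DEFAULT SYSDATETIME()
);
GO

CREATE PROCEDURE dbo.SendMessage
    @AlertName sysname,
    @AlertText nvarchar(max)
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @mail bit, @file bit;

    SELECT @mail = SendMail, @file = SendToFile
    FROM   dbo.AlertConfig
    WHERE  AlertName = @AlertName;

    -- if email delivery is suspect, flip SendToFile on and the alert is kept in a table
    IF @file = 1
        INSERT dbo.AlertHistory (AlertName, AlertText)
        VALUES (@AlertName, @AlertText);

    IF @mail = 1
        EXEC msdb.dbo.sp_send_dbmail       -- stand-in here for my own SMTP routine
             @recipients = N'dba-team@example.com',
             @subject    = @AlertName,
             @body       = @AlertText;
END;
GO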

Because that's another issue with failed email alerts, you probably won't have an alert history.

The option I don't use is to write the alert to the SQL Error Log .. no particular reason for that but it's an option which could be used too.

As a footnote I do recommend a look at SQL Diagnostic Manager from Idera. I’ve used this product for very many years.

https://www.idera.com/productssolutions/sqlserver/sqldiagnosticmanager

Benchmarking Hardware and Environments

I suppose it would be fair to say I didn’t pick up the Grumpy Old DBA name for no reason. As well as the usual paranoia of a Production DBA, distrust of almost anyone who wants to go near a database, and a possibly over-protective nature regarding databases, I have tended to become more cynical and distrustful during my time in IT, especially if I’m told I wouldn’t understand or I believe I’m being fobbed off by other teams – networks, storage, app and web tier folks, developers – in fact almost everyone in my quest for performance or troubleshooting.

A year or so ago I was responsible for the hardware upgrade of a production cluster and this led to a conundrum; our existing cluster was a Dell R900, four socket quad core - what should we replace it with?
The DBA will obviously want as much bang for buck as possible, but as well as the cost of the hardware there's the obvious cost of the SQL licences to consider, along with cpu clock speed, number of cores and so on - this was Enterprise SQL, so any notion of a 40 core box was likely to meet some resistance!
The actual configuration of the chosen server is not the point of this blog post, but for the record I chose a dual socket 8 core server: 16 cores plus HT, so 32 threads.

So here's the rub .. 4 sockets down to 2 sockets: how do I present a justification for this "downsize" of a critical server which is the backbone of the company, runs a global system across more than 100 countries 24x7 and handles an annual income measured in billions (USD)?
( Actually one benefit was that the hosting cost of the 2 way box was 50% less than the Dell R900 - in ballpark figures this meant that most servers we replaced would pay for themselves over a 2 to 3 year period. )

I started to look around for a software product I could run on any server (or PC) which would produce a comparable set of figures.

I chose Passmark Software https://passmark.com/ and purchased a copy of PerformanceTest; at $27 I think this is quite a bargain.
For the record I have no connections or affiliations with Passmark Software.

Now I could produce benchmarks for cpu, memory and disk.
The other advantage is that there is a wealth of published data on the passmark.com website.

So I started to produce figures from our servers, and for the cpu benchmarks I compared them back to the published results on the web site; this part is important, as I'll explain later.

Old server cpu benchmark: 11,025
New server cpu benchmark: 20,907

Just to add a point of reference, I use a single Xeon quad core (no HT) workstation; this has a cpu benchmark of 5,944.

The memory benchmarks showed even larger percentage improvements.
Disk benchmarks are more tricky I'd say, and as the SAN wasn't changing there would be no improvements, but I benchmarked the storage too.

Having benchmarked my server I needed to see if my result matched other reported values for a dual socket box with the same Xeon processors .. and it did.

It's important not to be overwhelmed by the tables of stats; I'd suggest that unless your sql server is running most of its workload in parallel, clock speed is possibly more important than the number of cores. With hindsight I would actually have bought a dual 6 core, which has a higher clock speed than the dual 8 core.
We did in fact replace a number of servers in other systems with dual 6 cores, which have a cpu benchmark of 22,000.

Now, I previously said it was important both to refer to published benchmarks and to keep figures for your existing estate. As part of our hardware replacements we bought a new cluster - when the benchmarks were run the cpu value, at 19,072, was well below what was expected ( it may well have been less initially, as the process to resolve this was very protracted to say the least ).
Initially my findings were dismissed out of hand - this is one reason to buy a "known" software product and one with published results!
I'll perhaps blog about my experiences with benchmarking using SQL Server and how I got a similar reception.
This particular server had more cores and a higher clock speed, and once sorted ( it was an issue within the BIOS ) we got the right value of 26,200.

We're just building a new cluster for another system and I ran my checks; this showed that the cpu mark on the two nodes was very different, by about 33% - yet another BIOS setting. Again my test results were disputed, but I'm used to this so I just forwarded my document which has data for a very large number of our servers.

So should the DBA apply due diligence, and what is the scope of responsibility for the DBA? In my time I've been heavily involved in storage; after being told "You wouldn't understand" when I questioned the performance of a SAN, I promptly put myself through two certifications on Fibre Channel Architecture with SNIA. I'm reasonable with the o/s, having started with NT 4 training courses many years ago, although a technical interview recently showed my knowledge of windows clustering was slightly lacking :) And how about the integration with the various application tiers and the network?

Anyway, I'd suggest that getting a set of benchmark values for your hardware is important, if nothing else to check that all is as expected.
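
A hedged sketch of the sort of record I mean - nothing more than a simple table (a spreadsheet does just as well) holding a benchmark figure per server per test date, so results can be compared against the published numbers and against each other. The structure, names and dates are illustrative, not my actual document:

CREATE TABLE dbo.ServerBenchmark
(
    ServerName     sysname       NOT NULL,
    TestDate       date          NOT NULL,
    CpuMark        int           NOT NULL,
    MemoryMark     int           NULL,
    DiskMark       int           NULL,
    Sockets        tinyint       NULL,
    CoresPerSocket tinyint       NULL,
    HyperThreading bit           NULL,
    Notes          nvarchar(400) NULL,
    CONSTRAINT PK_ServerBenchmark PRIMARY KEY (ServerName, TestDate)
);

-- example rows using the figures quoted above (server names and dates are placeholders)
INSERT dbo.ServerBenchmark (ServerName, TestDate, CpuMark, Sockets, CoresPerSocket, HyperThreading, Notes)
VALUES (N'OLDCLUSTER01', '2014-01-01', 11025, 4, 4, 0, N'Dell R900, four socket quad core'),
       (N'NEWCLUSTER01', '2014-01-01', 20907, 2, 8, 1, N'dual socket 8 core, HT enabled');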

I wonder how many have been able to show any hard figures for the impact of Hyperthreading or Turbo Boost?

Here are some figures for a dual socket 8 core server.

Both cpus with Turbo Boost and HT: 21,491
Both cpus, no Turbo Boost, no HT: 16,916
Both cpus, no HT: 15,995
Single cpu with Turbo Boost and HT: 14,107

No I can't explain all the figures and I was only able to perform very limited testing.

Finally, a couple of examples of why this is important.
I was able to discover that the SAN volumes presented to one server were not as expected.
SAN storage is a subject in itself, but in very simple terms there are classes of san storage available to me at differing performance and cost ( faster = more expensive ). The storage presented was of a lower performance class than I expected - only because I had sets of benchmarks was I able to pick up on this.

And a somewhat obscure matter: in a particular server model, if you populate all the memory slots the memory speed drops down a notch, e.g. memory rated at 1866 runs at 1600. A benchmark showed two servers with supposedly the same memory speed returning different values - investigation showed that the memory config in the lower benchmark server wasn't as I had specified. We're back to due diligence again, you see, and although the memory speed step down isn't massive it will impact the throughput for SQL Server.

And yes, SQL tests and benchmarks are very important, but it's also important to make sure the hardware platform is optimal.
These are all physical servers .. I know there are benchmarks available for virtualised environments - I'd suggest the same types of tests should apply there too .. who's to know if a BIOS setting in your ESXi or Hyper-V host is degrading performance?

Conversations you wish you hadn't heard.

OK, so if you're a production DBA you are probably paranoid, or at least considered paranoid.
This isn't a bad thing: corporate data is valuable for all manner of reasons, and the Data Protection Act and any regulatory compliance, along with corporate auditors, can be enough to give a DBA a nervous twitch - and we haven't even mentioned fellow employees.

In no particular order and drawn from the last sixteen or so years:-

"The Server has two power supplies"

Setting up a production SQL Server ( 2008 R2 ) on a cluster, the supplier came back with "We don't support our application on a cluster"; when we said their single server recommendation didn't provide redundancy, the line above was their answer.

"It's Admin Admin."

Leaving a multi-tenanted office at home time - so very busy - a conversation between a person on the ground floor and another on the first floor open landing, concerning access to a server.

"This SQL thing is interesting"

The response from a new dev team member two weeks after joining a team responsible for writing a SOX financial application handling millions of GBP in revenue. The team recruited .NET developers but didn't require any SQL skills, despite the fact the team was writing code against a SQL Server database.

"I found it on someone's blog and thought it was cool"

Instead of using a lookup table for 6 values, the decision was taken to implement "constants" as sql functions.
Once this change hit production it killed performance - as you might guess, it turned set based ops into row by row!
The most ironic part of this particular fault was that when it was changed back to a lookup table the process then ran too fast and broke something else - it took very many man hours to slow it down enough not to break things.
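
A hedged illustration of the anti-pattern, not the actual code involved (table, column and function names are invented): a scalar "constant" function tends to be evaluated per row and blocks a parallel plan on the SQL Server versions of that era, where the lookup table join stays set based.

-- the "constant" implemented as a scalar function
CREATE FUNCTION dbo.fn_OrderStatus_Open() RETURNS int
AS
BEGIN
    RETURN 1;
END;
GO

-- called like this in a large query, the function is executed for every row examined
SELECT o.OrderId
FROM   dbo.Orders AS o
WHERE  o.StatusId = dbo.fn_OrderStatus_Open();

-- the lookup table alternative keeps the work set based
SELECT o.OrderId
FROM   dbo.Orders AS o
JOIN   dbo.OrderStatus AS s ON s.StatusId = o.StatusId
WHERE  s.StatusName = N'Open';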

"The largest file is xxxx.mdf can I delete it?"

Running low on disk space on a production SQL server - this was the support team's proposed solution.

"Go to Administrators and see who is in there"

Lengthy, this one .. I have been constantly amazed over the years at how many HR and financial applications run all their users as sysadmin or db_owner - it can be disconcerting to find numerous Office/Excel connections running as sa on your production system.
I was unfortunate enough to overhear parts of a phone conversation which started "We're doing a security audit...". The person making the call was explaining to someone who was a local admin on an accounting system how to navigate to Local Users and Groups to extract the required information. It was very clear that the person on the other end of the phone had little idea of what was being asked, but they were an administrator who could rdp to this production system.

And a few classics to finish

"We didn't put any indexes as it slows down inserts"

"Once we go to Azure we won't need DBAs"

"Windows 2003 must be more secure than Windows 2012 as it's been patched for so long"

"I need to reboot the server because SQL has taken 95% of the memory"

"There's no performance gains using Enterprise Edition"

The last point was around the disruption caused when indexes are rebuilt in a live database. It’s worth a whole blog post on its own, of course!!

 

Some essential reading material:

The Cuckoo's Egg by Cliff Stoll

The Anti-Hacker Tool Kit by Mike Shema

I read these books back in 2002 when I was involved in an online mortgage application project; in actual fact I bought three copies of the Anti-Hacker Tool Kit but had every copy confiscated because "We don't want you running that on our systems", after I'd demonstrated a few basic examples.
We did in fact construct what we considered to be a very secure and robust system.


Failover – and guess what!!!

Yup yet another instance of something breaking when a cluster runs on the other node.

I don’t wish to apportion blame here or point any fingers but .. following on from my last post I managed to have one of the production clusters that normally runs on node1 left running on node2 after a restart .. this time it was my own monitoring application which stopped working, sigh!

I have customised performance dashboards which I create with SSRS and these point to the cluster .. normally - well, I say normally, but that’s not really true as every cluster appears to be different. I’m essentially collecting o/s stats, so I don’t really want to connect to the SQL virtual server and I certainly don’t want to connect to the node, so in this instance I was connecting to the cluster name - this worked fine when the cluster was on node1 but would not connect when the cluster was on node2.

The first check was to see if the core cluster resources had been moved over to node2 - they hadn’t, but moving them across didn’t fix the issue. I tried to connect using the sql server instance name but that didn’t work either, so in the end I had to resort to using the cluster IP address.

The end result is that I have three failover clusters all running windows 2012 and I have to use a different method ( name ) to connect to each.

Just proves that if you don’t test you just don’t know what will happen.

A quick disclaimer .. I don’t build the clusters :)


Do you fail over your clusters?

This may sound somewhat strange but what I actually mean is do you actively run your production servers on alternate nodes?

Generally I will run alternate nodes based around application releases where we have an agreed outage.

Just this last week it was brought home just how you can get caught out when you fail over a cluster. Although I don’t actively manage this cluster and the precise details of the issues are of no significance, the fact remains that the decision to run on “the other node” caused a number of application processes to break.

So how can this come about? Well, our data centre has the concept of a preferred node for both failover clusters and availability groups, which means production systems only ever run on the same node, usually the lowest numbered; true, there will be failovers for quarterly patching, but the systems will always be put back on the “preferred” node after patching.

Many years ago I ran into a problem where we appeared to have an issue with one node of a cluster. We didn’t realise we had a problem until we had a component failure on the node we “always” used; the failover went well, but then every so often the server would go to 100% cpu and just sit there, and only a failover would resolve the issue. We never did figure out the problem, but my feeling ( never proved ) was that it was a fault on the nic daughterboard.

To cut a long story short we bought a new cluster – the CIO then asked me how I could be sure that we wouldn’t be caught out like this again – the only answer was to say that at each application release we’d switch to the other node, so that we’d run roughly 50 – 50 across the nodes over a year.

For the key systems I manage I’ve kept to this routine, this includes availability groups.

I thought I’d also recount some other issues I’ve encountered with clusters to make the post complete. So .. how long do you leave between patching the nodes? ( SQL patches ) Well, I generally leave only a day, because once upon a time we applied a service pack when the practice was to wait at least a week .. however there was a real issue with the mismatched nodes; sorry, I don’t remember the ins and outs of it, but essentially the active node ( a failover cluster btw ) kept racking up lots and lots of open handles which caused us real problems.

With Windows 2012 R2 you have to remember to take the core resources to the active node as they don’t automatically follow when you fail over, managed or not.

I haven’t found any gotchas with availability groups ( yet ).

The last issue I encounter is more directly related to process within our data centre: with windows patching they have a habit of scheduling based around their perceived “preferred nodes”, so if my nodes are numbered xxx13 and xxx14 they will patch 14, fail over to 14, patch 13, then fail back to 13 .. which is all well and good if xxx13 is actually the active node! But 50% of the time this isn’t the case, so I have to be careful when approving the patching schedules, notwithstanding that we generally set these up a month or so in advance and, as we always make sure releases don’t clash, sometimes we end up with the data centre patching our live clusters.

 


Community is Great

I have a great respect for so many who contribute to the community, without them I would often struggle in my role for sure.

When "strange events" happen in a busy production environment it can be quite daunting when it seems everyone around is expecting you to have the answer/solution at your finger tips.

I'm indebted to Paul White http://sqlblog.com/blogs/paul_white/default.aspx for confirming I'd found a bug and doing all the hard work, including raising a connect item https://connect.microsoft.com/SQLServer/feedback/details/797837/incorrect-results-partitioned-nonclustered-index-with-descending-key

Let me explain: some while back we had an issue with a report which every so often would not return a correct result set. There didn't seem to be a particular pattern, and the sql statement involved was very large and complex.

It appeared to be related to a secondary index sort order, however there wasn't time to investigate fully as the business was crying out for a "fix" and the developers provided one by rewriting the queries. On Friday last week the same sort of issue arose again, but this time the stored procedure was small, containing only a handful of statements. Essentially the report worked on one day of the week but the following day it did not.

To test this I restored historic backups and showed that although it worked on Thursday it didn't work on Wednesday - only 2 rows returned instead of 8; as I said, a simple report for once. To cut a long story short, the culprit was a single column select from a two table join, where one table was partitioned. It was easy to test this query, and typically it returned 2 rows when called in the procedure but 6 rows when the isolated statement was run in SSMS.

There were a number of variations that I worked through, including forcing joins, which changed the results; I was able to test this against SQL 2012 as well as SQL 2008, which is what this production system uses.

I've always been a big fan of Paul's work, so I dropped him a mail asking if he could have a look at my problem; I'd figured out that a fix was to change the sort order of an index and was under pressure to apply this. I was concerned that I had found a bug and, if so, what other issues we might have, and whether changing the index sort order might cause other problems - I'm talking about a billion+ GBP billing system here, so only slight pressure < grin >

You can read what Paul found and figured out in his blog and the connect - absolutely awesome and I want to publicly offer sincere thanks for all his hard work.

We will attempt to raise this through our corporate support to add weight to the connect item.

As a footnote to this post, since essentially it relates to partitioning: I have discovered issues if you make the partitioning index descending. I'll have to properly check this out and blog about it in detail.
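
For anyone unfamiliar with the shape of object involved, a hedged sketch follows; the table, partition function and index names are invented and this is not a reproduction of the bug, just the general structure - a partitioned table carrying a nonclustered index with a descending key, where changing the key's sort order was the fix I had identified.

CREATE PARTITION FUNCTION pf_BillingDate (date)
    AS RANGE RIGHT FOR VALUES ('2013-01-01', '2013-02-01', '2013-03-01');

CREATE PARTITION SCHEME ps_BillingDate
    AS PARTITION pf_BillingDate ALL TO ([PRIMARY]);

CREATE TABLE dbo.BillingDetail
(
    BillingId   bigint NOT NULL,
    BillingDate date   NOT NULL,
    AccountId   int    NOT NULL,
    CONSTRAINT PK_BillingDetail PRIMARY KEY (BillingId, BillingDate)
) ON ps_BillingDate (BillingDate);

-- the secondary index with a DESC key, partition aligned with the base table
CREATE NONCLUSTERED INDEX IX_BillingDetail_Account
    ON dbo.BillingDetail (AccountId DESC)
    ON ps_BillingDate (BillingDate);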

 

 


Not obvious then?

I requested that we service pack a third party app - here's the response

I have received a reply back from Engineering that SP4 is certified now with WLE 7.x.
Engineering ran some tests against SP4 and it runs fine and doesn't introduce any obvious problems.

You couldn't write it - Expired SA account

This is the stuff of DBA nightmares !

email trail:

Q. Can you reset the SA account on server XXXXX, we think it has expired and now no-one can work.

Connect to Server:
Surely no-one would set up a Server with an sa account which expires?
Thankfully not.
Find sa password and change connection to use SA account.
Connect without issue.

Me. Have checked Server and account is fine.

A. Thanks that's great, you've fixed it we can all work now.
