September 2016 - Posts

Oh the irony of it all!!

In my previous post I mentioned what I see as a fatal flaw in the “Alert on Error” scenario. Well, no sooner had I posted than, to quote Shakespeare, I was “hoist with my own petard”.

HA! I have a log shipping setup where the server communicates alerts using SMTP .. well, today I discovered that for some reason or another alerts were not reaching me. It appears there is an issue raising them: no errors on the servers, just no emails arriving.

In my defence I have a visual display which shows the latency of the log shipped databases along with a list of the restored files. Parts of the display change colour according to how far behind production the log shipped databases are, so I knew all was well. A failure of the display is also an indication of an issue, e.g. a network link down.
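The colour logic in a display like that is trivial to sketch. Here's a hypothetical Python version; the threshold minutes are purely illustrative, not the values I actually use:

```python
# Map log-shipping latency to a dashboard colour.
# Thresholds (15 / 60 minutes) are illustrative only.
def latency_colour(minutes_behind):
    if minutes_behind is None:
        return "grey"    # no data at all: the display failure is itself a warning
    if minutes_behind <= 15:
        return "green"
    if minutes_behind <= 60:
        return "amber"
    return "red"
```

The "grey" case matters: a dashboard that can't fetch its data should look broken, because that in itself points at a network or server issue.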

SO .. there you have it: a fully worked example of the dangers of only alerting on an error.

Posted by GrumpyOldDBA with no comments

How's your email today? No Alerts?

When I first started as a SQL Server DBA / Developer in 1995, working with SQL Server 6.x, one of the key steps was to set up alerting and monitoring.

I was working for a very rapidly growing mortgage provider and DR, monitoring and performance were critical to the ongoing success of the company.

I was very lucky to be working with some very clever and gifted people to whom SQL Server was something new, myself included, as my prior experience had been Ingres and a little Oracle.

This meant exciting times, with requests for functionality to be discovered and explored.

One of the things I implemented was log shipping. This didn't exist as a Microsoft option back then, so it had to be hand written, and to this day I still use a variation of that original method over the Microsoft offering - yes, it's only a robocopy,

but back then even robocopy was new, and we needed to know the whole thing was working as expected.

This was also the start of my customised monitoring but that story doesn't belong here.

So this was a fundamental part of DR and business continuity, as well as safeguarding the company data from loss or damage; back then it wasn't unusual for thieves to break into companies to steal server hardware.

Although it all seems very strange now, a 9 GB SCSI disk or a ( 4 MB ? ) memory chip was very costly.

Anyway, how do you monitor or alert? I remember the sometimes heated discussions about what constituted a reasonable backup strategy, why alerting on failure was a really bad idea, and so on.

Q. Why would I say that alerting on failure is a bad idea?

( Answers on a postcard to ........ )

A. If you don't arrive at something similar to this and you're a Production DBA, I'd suggest you're probably not quite paranoid enough!

The problem with only delivering alerts on failure comes when there is a failure around your mail server itself .. sadly I have seen this situation fully demonstrated, with teams sitting oblivious to an ongoing issue just because there were no alerts.

How many times over the years? Quite a number, to be honest.

So way back then there was a continuous string of alerts/emails to tell us all was well; a sudden lack of emails was an indication of a problem.
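The heartbeat principle can be sketched in a few lines; this is a hypothetical Python illustration, and the 30 minute gap is an assumed figure:

```python
from datetime import datetime, timedelta

# Heartbeat monitoring: regular "all is well" messages are expected,
# and it is their *absence* that raises the alarm.
def heartbeat_overdue(last_heartbeat, now, max_gap=timedelta(minutes=30)):
    """True when no all-is-well message has arrived within max_gap."""
    return (now - last_heartbeat) > max_gap
```

The key design point is that the check runs on the receiving side, so it still fires when the sending side (or the mail path) is completely dead.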

I have a presentation I gave to the SSUG whose main theme was "Don't let the users tell you your server is down", and I think this is a key way to look at supporting production systems. If the first you know of a problem is when a user, or in a few cases a company director, appears next to your desk to ask you what's wrong, you really do have a problem.

I use customised real time dashboards to monitor key systems. The dashboards are designed in such a way that any issues should show before they become serious, so at least I know before that phone call or visit to my desk.

The loss of a dashboard is obviously itself a symptom. I use alerts too; some I have written myself, but most come through Idera SQL Diagnostic Manager, which I also use as part of my monitoring. I've used this product since the late 1990s, when it was known as SQL Probe.

This post is sponsored by .. an email issue .. surprise, surprise! Now I wasn't having any system issues, but I was monitoring a key business process .. I watch this using some reports and a dashboard. Whilst watching I received some alerts .. but the alerts didn't match what I was seeing visually.

At this point I logged into SQL Diagnostic Manager as this product allows me to step through historical performance data .. e.g. step back 60 mins and then advance one minute at a time.

( there's a wealth of information that can be viewed: locks, deadlocks, blocking, sessions, CPU, users, IO and so on )

I wasn't able to see the indicated issues from the alerts .. hmmmm .. A check then showed that the alert had actually been raised an hour earlier, and a handful of checks later it was clear there was a hefty delay on the email alerts I was receiving.
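Once you have both timestamps, checking for that sort of skew is simple; a hypothetical sketch (the five minute tolerance is an assumption):

```python
from datetime import datetime, timedelta

# Compare when an alert was raised with when it actually arrived;
# a large gap means the alerting channel itself is in trouble.
def delivery_delayed(raised_at, received_at, tolerance=timedelta(minutes=5)):
    return (received_at - raised_at) > tolerance
```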

And that is the problem with alerts vs monitoring: if you get an issue with the delivery you may never know until it's too late .. then you will be reactive, having to play catch up, rather than being proactive and hopefully ahead of the game.

A SQL Server note here: I raise SMTP alerts from SQL Server, and I don't use Database Mail or any of the built in mail options - force of habit I suppose.

However I don't raise mailed alerts directly; I have a "send message" procedure that alerts are sent through. This proc reads from a configuration table which flags where an alert is sent. There are two flags, "mail" and "file", and both can be toggled; if there are any email issues I toggle the "file" setting and the alerts get stored in a table.
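My version is a T-SQL procedure, but the shape of it can be sketched in Python; the names and flag layout here are illustrative rather than the real ones:

```python
# Sketch of a "send message" dispatcher: a config row flags where
# each alert goes, and either route can be toggled independently.
config = {"mail": True, "file": False}   # stands in for the config table

alert_table = []                         # stands in for the alert history table

def send_message(text, send_mail=lambda t: None):
    if config["mail"]:
        send_mail(text)                  # SMTP in the real thing
    if config["file"]:
        alert_table.append(text)         # keeps a history even when mail fails
```

During an email outage, flipping the "file" flag on means alerts still land in the table, so the history survives even while delivery is broken.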

That's another issue with failed email alerts: you probably won't have an alert history.

The option I don't use is to write the alert to the SQL Server error log .. no particular reason for that, but it's an option which could be used too.

As a footnote I do recommend a look at SQL Diagnostic Manager from Idera. I've used this product for very many years.

Benchmarking Hardware and Environments

I suppose it would be fair to say I didn't pick up the Grumpy Old DBA name for no reason. As well as the usual paranoia of a Production DBA, a distrust of almost anyone who wants to go near a database and a possibly over protective nature regarding databases, I have tended to become more cynical and distrustful during my time in IT, especially if I'm told I wouldn't understand, or I believe I'm being fobbed off by other teams: networks, storage, app and web tier folks, developers .. in fact almost everyone in my quest for performance or troubleshooting.

A year or so ago I was responsible for the hardware upgrade of a production cluster, and this led to a conundrum: our existing cluster was a Dell R900, a four socket quad core box. What should we replace it with?
The DBA will obviously want as much bang for buck as possible, but as well as the cost of the hardware there's the obvious cost of the SQL Server licences to consider, along with CPU clock speed, number of cores and so on. This was Enterprise SQL, so any notion of a 40 core box was likely to meet some resistance!
The actual configuration of the chosen server is not the point of this blog post, but for the record I chose a dual socket 8 core server: 16 cores, plus HT, so 32 threads.

So here's the rub .. going from 4 sockets to 2 sockets, how could I justify this "downsize" of a critical server which is the backbone of the company, which runs a global system across more than 100 countries 24 x 7 and handles an annual income measured in billions ( USD )?
( Actually one benefit was that the hosting cost of the 2 way box was 50% less than that of the Dell R900 - in ballpark figures this meant that most servers we replaced would pay for themselves over a 2 to 3 year period. )
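The payback sum itself is trivial. The figures below are invented purely to show the arithmetic (the post only gives the 50% saving and the 2 to 3 year payback):

```python
# All inputs are hypothetical round numbers, chosen only to illustrate
# the 50%-cheaper-hosting payback calculation.
old_hosting_per_year = 20_000
new_hosting_per_year = old_hosting_per_year * 0.5   # the 2 way box cost 50% less to host
new_server_cost = 25_000

annual_saving = old_hosting_per_year - new_hosting_per_year
payback_years = new_server_cost / annual_saving     # 2.5 years with these inputs
```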

I started to look around for a software product I could run on any server ( or PC ) which would produce a comparable set of figures.

I chose Passmark Software and purchased a copy of PerformanceTest; at $27 I think this is quite a bargain.
For the record I have no connections or affiliations with Passmark Software.

Now I could produce benchmarks for CPU, memory and disk. The other advantage is that there is a wealth of published data on the Passmark website.

So I started to produce figures from servers, and for the CPU benchmarks I compared back to the published results on the website; this part is important, as I'll explain later.

Old server CPU benchmark: 11,025
New server CPU benchmark: 20,907

Just to add a point of reference: I use a single quad core Xeon ( no HT ) workstation, and this has a CPU benchmark of 5,944.
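Those three CPU marks make the comparison easy to quantify:

```python
old_cpu_mark = 11_025     # four socket quad core Dell R900
new_cpu_mark = 20_907     # dual socket 8 core replacement
workstation_mark = 5_944  # single quad core Xeon, no HT

ratio = new_cpu_mark / old_cpu_mark   # roughly 1.9x the old server
pct_gain = (ratio - 1) * 100          # roughly a 90% improvement
```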

The memory benchmarks showed even larger percentage improvements.
Disk benchmarks are trickier I'd say, and as the SAN wasn't changing there would be no improvement, but I benchmarked the storage too.

Having benchmarked my server I needed to see if my result matched other reported values for a dual socket box with the same Xeon processors .. and it did.

It's important not to be overwhelmed by the tables of stats. I'd suggest that unless your SQL Server is running most of its workload in parallel, clock speed is possibly more important than the number of cores. With hindsight I would actually have bought a dual 6 core, which has a higher clock speed than the dual 8 core.
We did in fact replace a number of servers in other systems with dual 6 core boxes, which have a CPU benchmark of 22,000.

Now I previously said it was important both to refer to published benchmarks and to keep figures for your existing estate. As part of our hardware replacements we bought a new cluster, and when the benchmarks were run the CPU value, 19,072, was well below what was expected ( it may well have been less initially, as the process to resolve this was very protracted to say the least ).
Initially my findings were dismissed out of hand - this is one reason to buy a "known" software product and one with published results!
I'll perhaps blog about my experiences with benchmarking using SQL Server and how I got a similar reception.
This particular server had more cores and a higher clock speed, and once sorted ( it was an issue within the BIOS ) we got the right value: 26,200.

We're just building a new cluster for another system and I ran my checks; these showed that the CPU mark on the two nodes was very different, by about 33%. Yet another BIOS setting. Again my test results were disputed, but I'm used to this, so I just forwarded my document, which has data for a very large number of our servers.

So should the DBA apply due diligence, and what is the scope of responsibility for the DBA? In my time I've been heavily involved in storage: after being told "You wouldn't understand" when I questioned the performance of a SAN, I promptly put myself through two certifications on Fibre Channel Architecture with SNIA. I'm reasonable with the O/S, having started with NT 4 training courses many years ago, although a technical interview recently showed my knowledge of Windows clustering was slightly lacking :) And how about integration with the various application tiers and the network?

Anyway, I'd suggest that getting a set of benchmark values for your hardware is important, if for no other reason than to check all is as expected.

I wonder how many have been able to show any hard figures for the impact of Hyper-Threading or Turbo Boost?

Here are some figures for a dual socket 8 core server.

Both CPUs with Turbo Boost and HT: 21,491
Both CPUs, no Turbo, no HT: 16,916
Both CPUs, no HT: 15,995
Single CPU with Turbo Boost and HT: 14,107
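Taking those figures at face value (and treating each difference as purely down to the named feature, which is a simplification), rough percentages fall out directly:

```python
both_turbo_ht = 21_491        # both CPUs, Turbo Boost and HT
both_no_turbo_no_ht = 16_916  # both CPUs, neither feature
single_turbo_ht = 14_107      # single CPU, Turbo Boost and HT

# Turbo Boost and HT together, both sockets populated: ~27% uplift.
combined_gain = (both_turbo_ht / both_no_turbo_no_ht - 1) * 100

# Contribution of the second socket with Turbo and HT on: ~52% uplift.
second_cpu_gain = (both_turbo_ht / single_turbo_ht - 1) * 100
```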

No I can't explain all the figures and I was only able to perform very limited testing.

Finally, a couple of examples of why this is important.
I was able to discover that the SAN volumes presented to one server were not as expected.
SAN storage is a subject in itself, but in very simple terms there are classes of SAN storage available to me at differing performance and cost ( faster = more expensive ). The storage presented was of lower performance than I expected; only because I had sets of benchmarks was I able to pick up on this.

And a somewhat obscure matter: in a particular server model, if you populate all the memory slots the memory speed drops down a notch, e.g. memory rated at 1866 runs at 1600. A benchmark showed two servers with the same specified memory producing different values; investigation showed that the memory config in the lower benchmark server wasn't as I had specified. We're back to due diligence again, you see, and although the memory speed step down isn't massive, it will impact the throughput for SQL Server.
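As a rough indication of the scale of that step down (raw transfer rate only; the real world SQL Server impact will be smaller):

```python
rated_speed = 1866     # memory speed as specified
stepped_down = 1600    # actual speed with all slots populated

# Roughly a 14% reduction in raw memory transfer rate.
reduction_pct = (1 - stepped_down / rated_speed) * 100
```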

And yes, SQL tests and benchmarks are very important, but it's also important to make sure the hardware platform is optimal.
These are all physical servers .. I know there are benchmarks available for virtualised environments, and I'd suggest the same types of tests should apply there .. who's to know if a BIOS setting in your ESXi or Hyper-V host is degrading performance?

Conversations you wish you hadn't heard.

OK, so if you're a production DBA you are probably paranoid, or considered paranoid.
This isn't a bad thing: corporate data is valuable for all manner of reasons, and the Data Protection Act, regulatory compliance and corporate auditors can be enough to give a DBA a nervous twitch .. and we haven't even mentioned fellow employees.

In no particular order and drawn from the last sixteen or so years:-

"The Server has two power supplies"

Setting up a production SQL Server ( 2008 R2 ) on a cluster, the supplier came back with "We don't support our application on a cluster"; we pointed out that their single server recommendation didn't provide redundancy.

"It's Admin Admin."

Overheard while leaving a multi-tenanted office at home time, so very busy: a conversation about access to a server, held between a person on the ground floor and one on the first floor open landing.

"This SQL thing is interesting"

The response from a new dev team member two weeks after joining a team responsible for writing a SOX financial application handling millions of GBP in revenue. The team recruited .NET developers but didn't require any SQL skills, despite the fact the team were writing code against a SQL Server database.

"I found it on someone's blog and thought it was cool"

Instead of using a lookup table for 6 values, the decision was taken to implement "constants" as SQL functions.
Once this change hit production it killed performance; as you might guess, it turned set based ops into row by row!
The most ironic part of this particular fault was that when it was changed back to a lookup table, the process then ran too fast and broke something else; it took very many man hours to slow it down enough not to break things.
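The shape of the fault can be sketched in Python terms. This is a deliberately simplified stand-in for the T-SQL, with invented values; the point is that a scalar "constant" function gets invoked once per row, whereas a lookup table lets the engine resolve the whole set in one join:

```python
# Six invented "constant" values, standing in for the real lookup table.
STATUS = {1: "new", 2: "open", 3: "hold", 4: "closed", 5: "error", 6: "void"}

def status_name(code):
    # The "constant as a function" pattern: in T-SQL a scalar UDF like
    # this is evaluated row by row, defeating set based execution.
    return STATUS[code]

rows = [1, 3, 3, 6, 2]

# Per row: one function invocation per row (the shape that killed performance).
per_row = [status_name(r) for r in rows]

# Set based: resolve against the lookup table in a single joined pass.
set_based = [STATUS[r] for r in rows]
```

In Python the two cost much the same; in SQL Server the difference is the execution model, which is why the function version was so slow.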

"The largest file is xxxx.mdf can I delete it?"

Running low on disk space on a production SQL server - the support solution.

"Go to Administrators and see who is in there"

Lengthy, this one .. I have been constantly amazed over the years at how many HR and financial applications run all their users as sysadmin or db_owner; it can be disconcerting to find numerous Office/Excel connections running as sa on your production system.
I was unfortunate enough to overhear parts of a phone conversation which started "We're doing a security audit...". The person making the call was explaining to someone who was a local admin on an accounting system how to navigate to Local Users and Groups to extract the required information. It was very clear that the person on the other end of the phone had little idea of what was being asked, but they were an administrator who could RDP to this production system.

And a few classics to finish

"We didn't put any indexes as it slows down inserts"

"Once we go to Azure we won't need DBAs"

"Windows 2003 must be more secure than Windows 2012 as it's been patched for so long"

"I need to reboot the server because SQL has taken 95% of the memory"

"There's no performance gains using Enterprise Edition"

The last point was around the disruption caused when indexes are rebuilt in a live database. It's worth a whole blog post on its own, of course!!


Some essential reading material:

The Cuckoo's Egg by Cliff Stoll

The Anti-Hacker Tool Kit by Mike Shema

I read these books back in 2002, when I was involved in an online mortgage application project. In fact I bought three copies of the Anti-Hacker Tool Kit, but had every copy confiscated because "We don't want you running that on our systems" after I'd demonstrated a few basic examples.
We did in fact construct what we considered to be a very secure and robust system.
