Failover – and guess what !!!

Yup yet another instance of something breaking when a cluster runs on the other node.

I don’t wish to apportion blame here or point any fingers but .. following on from my last post I managed to have one of the production clusters that normally runs on node1 left running on node2 after a restart .. this time it was my own monitoring application which stopped working, sigh!

I have customised performance dashboards I create with SSRS and these point to the cluster ..  normally, well I say normally but that’s not really true as every cluster appears to be different, I’m essentially collecting o/s stats so I don’t want to really connect to the SQL virtual server and I certainly don’t want to connect to the node, so in this instance I was connected to the cluster name – this worked fine when the cluster was on node1 but could not connect when the cluster was on node2.

First check was to see if the core resources had been moved over to node2 , nope, but that didn’t fix the issue. I tried to connect by using the sql server instance name but that didn’t work either so in the end I had to resort to using the cluster ip address.

The end result is that I have three failover clusters all running windows 2012 and I have to use a different method ( name ) to connect to each.

Just proves that if you don’t test you just don’t know what will happen.

A quick disclaimer .. I don’t build the clusters :)

Posted by GrumpyOldDBA with no comments

Do you fail over your clusters?

This may sound somewhat strange but what I actually mean is do you actively run your production servers on alternate nodes?

Generally I will run alternate nodes based around application releases where we have an agreed outage.

Just this last week it was brought home on just how you can get caught out when you fail over a cluster, although I don’t actively manage this cluster  and the precise details of the issues are of no significance the fact remains that the decision to run on “the other node” caused a number of application processes to break.

So how can this come about? Well our data centre have the concept of a preferred node for both failover clusters and availability groups so this means that production systems only run on the same node, usually the lowest numbered; true there will be failovers for quarterly patching but the systems will always be put back to the “preferred” node after patching.

Many years ago I ran into a problem where we appeared to have an issue with one node of a cluster, we didn’t realise we had a problem until we had a component failure on the node we “always” used, failover went well and then every so often the server would go to 100% cpu and just sit there, only a failover would resolve the issue – we never did figure out the problem but my feeling ( never proved ) was that it was a fault on the nic daughterboard.

To cut a long story short we bought a new cluster – the CIO then asked me how I could be sure that we wouldn’t be caught out like this again – the only solution was to say that at each application release we’d switch to the other node, thereby we should run roughly 50 – 50 on the nodes over a year.

For the key systems I manage I’ve kept to this routine, this includes availability groups.

I thought I’d also recount some other issues I’ve encountered with clusters to make the post complete. So .. how long do you leave between patching the nodes? ( SQL Patches ) Well I generally leave only a day because once upon a time we applied a service pack and it was the practice to wait at least a week .. however there was a real issue with the miss-matched nodes, sorry don’t remember the ins and outs of it but essentially the active node ( failover cluster btw ) kept racking up lots and lots of open handles which caused us real problems.

With Windows 2012 R2 you have to remember to take the core resources to the active node as they don’t automatically follow when you fail over, managed or not.

I haven’t found any gotchas with availability groups ( yet ).

The last issue I encounter is more directly related to process within our data centre, but with windows patching they have a habit of scheduling based around their perceived “preferred nodes”, so if my nodes are numbered xxx13  and  xxx14  they will patch 14, fail to 14, patch 13, fail back to 13 .. which is all well and good if xxx13 is actually the active node! but 50% of the time this isn’t the case so I have to be careful when approving the patching schedules, notwithstanding that we generally set these up a month or so in advance and as we always make sure releases don’t clash sometimes we end up with the data centre patching our live clusters.

 

Posted by GrumpyOldDBA with no comments

Community is Great

I have a great respect for so many who contribute to the community, without them I would often struggle in my role for sure.

When "strange events" happen in a busy production environment it can be quite daunting when it seems everyone around is expecting you to have the answer/solution at your finger tips.

I'm indebted to Paul White http://sqlblog.com/blogs/paul_white/default.aspx  in confirming I'd found a bug and doing all the hard work including raising a connect item https://connect.microsoft.com/SQLServer/feedback/details/797837/incorrect-results-partitioned-nonclustered-index-with-descending-key

Let me explain: Somewhile back we had an issue with a report which essentially every so often would not return a correct result set. There didn't seem to be a particular pattern and the sql statement involved was very large and complex.

It appeared to be related to a secondary index sort order, however there wasn't time to investigate fully as the business was crying out for a "fix" and the developers provided such by rewriting the queries. On Friday last week the same sort of issue arose again, however this time the stored procedure was small containing only a handful of statements. Essentially the report worked on one day of the week but the following day it did not.

To test this I restored historic backups and showed that although it worked on Thursday it didn't work on Wednesdy, only 2 rows returned instead of 8 - as I said a simple report for once. To cut a long story short the culprit was a single column select from a two table join - one table was partitioned. It was easy to test this query and typically it returned 2 rows when called in the procedure but 6 rows when the isolated statement was called in SSMS.

There were a number of variations that I worked through including forcing joins which changed the results, I was able to test this against SQL 2012 as well as SQL 2008 which is what this production system uses.

I've always been a big fan of Paul's work so I dropped him a mail asking if he could have a look at my problem, I'd figured out that a fix was to change the sort order of an index and was under pressure to apply this. I was concerned I had found a bug and if so what other issues we might have and if changing the index sort order might cause other problems, I'm talking a billion+ GBP billing system here, so only slight pressure < grin >

You can read what Paul found and figured out in his blog and the connect - absolutely awesome and I want to publicly offer sincere thanks for all his hard work.

We will attempt to raise this through our corporate support to add weight to the connect.

As a footnote to this post as essentially it relates to Partitioning I have discovered issues if you make the partitioning index descending, I'll have to properly check this out and blog in detail. 

 

 

Posted by GrumpyOldDBA with no comments

Not obvious then ?

I requested that we service pack a third party app - here's the response

I have received a reply back from Engineering that SP4 is 
certified now with WLE 7.x. 
Engineering ran some tests against SP4 and it runs fine and 
doesn't introduce any obvious problems.
Posted by GrumpyOldDBA with no comments
Filed under:

You couldn't write it - Expired SA account

This is the stuff of DBA nightmares !

email trail:

Q. Can you reset the SA account on server XXXXX, we think it has expired and now no-one can work.

Connect to Server:
Surely no-one would set up a Server with an sa account which expires?
Thankfully not.
Find sa password and change connection to use SA account.
Connect without issue.

Me. Have checked Server and account is fine.

A. Thanks that's great, you've fixed it we can all work now.

Posted by GrumpyOldDBA with no comments
Filed under:

You Couldn't Write it - Houston we have a problem!

Note identities changed to protect the innocent (sic ).

In a datacentre I have an iscsi san which provides storage for a SQL Cluster.
It developed a fault and required replacement of a few parts, all hot swappable.
Although we had suppport/warranty this did not include onsite so we arranged to have the parts delivered.
The datacentre did not want to carry out the work so we had to arrange for the manufacturer to send an engineer.
Times were arranged and interested/concerned parties put on standby.
However it appears that somehow the storage was registered at a different location to its physical location, a mere 1,000 miles apart!So the engineer turned up at the wrong location, ironically Houston.

There is a postscript to this however, when we actually managed to get all the interested parties in the correct location, replacing the hot swap parts caused the cluster to fail over.

You Couldn’t Write it !! ( part 1 )

This post was inspired by a developer and I think illustrates the gulf that can sometimes exist between IT and the business.
I should point out that this post is the diplomatic version!

Initially I was sent a simple search for a person with a question about why the query plan showed a sort when there was no sort in the query and why did the sort show it was 40% of the query.
( The point about the sort belongs to another post some time. )

Easy answer to the duration was that this was a leading wild card search, so I decided to remove most of the query and just concentrate on the table containing the peoples names we were searching on.
My production server isn't a shoddy bit of kit so in the interests of science I performed the cut down query on the actual server.

Even with the all the required data in cache it took over 3 seconds to return a single name.


Now that's not very impressive to my mind,there's only a few million rows in the table and I'm searching one single column in memory which is indexed so no lookups are required, yes it has to scan but it’s in memory for goodness sake!
Well that clearly provided the evidence for why the search was so slow, I couldn’t initially just turn round and say it was the leading wild card I have to show it is, sigh!
So what happens if we drop the leading wild card but leave the trailing wild card?  Answer - so quick I had to use profiler to get a time, something under 100 ms.

Some time later in one of our UAT environments there were similar issues, our test systems don't have the resource of production so you expect some degradation.
I asked the developers how these searches were presented to the users, just to add to the confusion we have several interfaces or applications which front the same database.
In one application the default search is "Begins with ... "  e.g.  trailing wild card, alternative is "Contains...." leading and trailing wild card.
In the other main application default is "Contains..." with alternative of "Exact Match ... " , so most users choose "Contains..." regardless.

So there's a simple solution here, the same users can use both systems let's make them the same and we'll get a definite improvement.
The solution was put to the business - there was a very quick response - "Please make all searches "Contains.." and remove other options.”

Posted by GrumpyOldDBA with no comments
Filed under:

Performance issues due to Inactive Terminal Server Ports

This is an interesting post and although I don't have any direct issues of performance with any of my Servers I can see the build up in the registry as described.

If you make extensive use of DPR / TS sessions you might want to check this out

http://blogs.technet.com/b/askperf/archive/2012/03/06/performance-issues-due-to-inactive-terminal-server-ports.aspx

Posted by GrumpyOldDBA with no comments

We all read the instructions first - right ?

I've been working with the bare metal Hyper V and it's been an interesting time, if you haven't ever encountered windows core then it's a bit of a culture shock, as I understand it SQL Server will be running on windows core very soon - or maybe I read it wrong?

  Anyway having built my Hyper V server I decided that I'd build a cluster on it, for this you need some shared storage and one way to achieve this is to use Windows Storage Server which in effect creates an iscsi san.

So I downloaded Storage Server, which is an iso image, moved it to the Hyper V box with the intention of building a VM from the iso image ( you can do this and it is stunningly fast with windows 2008 R2 - but appallingly slow for XP Now I assumed as it was an iso it was a Server Install - but no it's not.

So I unpacked the iso and lo and behold there was an exe. So I built a w2k8 r2 VM and ran the exe - which promptly created two iso images - now I'm getting confused!

Back to building a server from the iso image ( again ) - yet another failure.

OK unpack the iso and we have a bunch of files - this isn't a bootable image or anything.

  So in a nut shell the iso image holds an exe file which holds an iso image which holds a bunch of files - why not just zip the lot up together Microsoft? talk about making life hard! 

That was probably the most confusing part of building a cluster - I might suggest that some pictures in the docs would have made life easier. It assumes, I guess, that you know all about isci targets and such, however there's only so many variations and I essentially knew what I should end up with, once I figured out that the cluster was built very quickly. 

Am I lover of virtualisation? Not really, my experience is that SQL Server performance sucks under virtualisation, however I love Microsoft because without them I don't earn a living, so when Microsoft made the bare metal hypervisor available I thought I'd take in some training and investigate.

So I have a DL585 configured and as well as the cluster I have SQL2012 and windows 8 VMs to evaluate. 

http://www.microsoft.com/en-us/server-cloud/windows-server/hyper-v.aspx

x64 Memory Issues - revisited

It's funny how things seem to bob to the surface every so often, a bit like revivals of Musicals or the fact you know it's christmas becuase ET is on the TV again.

Blake Morrision from the Ask the Performance Team has blogged about issues with working sets and SQL Server, http://blogs.technet.com/b/askperf/archive/2012/01/25/real-life-sql-working-set-trimming-issue.aspx

( If you don't subscribe to this blog then you're missing a useful source of help )

I blogged about this back in March 2009  http://sqlblogcasts.com/blogs/grumpyolddba/archive/2009/03/18/x64-memory-problems.aspx and at the time deployed the dynamic cache service to deal with this exact problem.

Personally given the circumstances I'd never allow a third party backup tool onto a production SQL Server, my preference is to ship or replicate backups to another location where the streaming to tape can occur without any impact on the Production Server.

I also don't slavishly apply lock pages in memory, my view is that if you have to set this to stop memory paging it would be a far better option to find the root cause and fix that rather than sweeping it under the table - no doubt many will not agree with me on this!

More Posts Next page »