Do you fail over your clusters?
This may sound somewhat strange, but what I actually mean is: do you actively run your production servers on alternate nodes?
Generally I alternate nodes at application releases, where we already have an agreed outage.
Just this last week it was brought home to me how you can get caught out when you fail over a cluster. I don’t actively manage the cluster in question and the precise details of the issues are of no significance, but the fact remains that the decision to run on “the other node” caused a number of application processes to break.
So how can this come about? Well, our data centre has the concept of a preferred node for both failover clusters and availability groups, which means production systems only ever run on the same node, usually the lowest numbered. True, there will be failovers for quarterly patching, but the systems will always be put back on the “preferred” node after patching.
Many years ago I ran into a problem with one node of a cluster. We didn’t realise we had a problem until a component failed on the node we “always” used. The failover went well, but then every so often the server would go to 100% CPU and just sit there, and only a failover would resolve it. We never did figure out the cause, but my feeling (never proved) was that it was a fault on the NIC daughterboard.
To cut a long story short, we bought a new cluster. The CIO then asked me how I could be sure we wouldn’t be caught out like this again. The only answer I could give was that at each application release we’d switch to the other node, so that over a year we should run roughly 50-50 across the nodes.
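If you want to check how close you actually get to that 50-50 split, a rough tally is easy to sketch. The node names and release dates below are made up purely for illustration, and the function name is my own; it just adds up the days each node spent as the active node between switches.

```python
from datetime import date


def days_per_node(moves, end):
    """Total the days each node was active.

    moves: list of (date, node) tuples sorted by date, each meaning
           "node became the active node on this date".
    end:   the date at which the tally stops.
    """
    totals = {}
    # Pair each move with the next one (or with `end` for the last move).
    boundaries = moves[1:] + [(end, None)]
    for (start, node), (stop, _) in zip(moves, boundaries):
        totals[node] = totals.get(node, 0) + (stop - start).days
    return totals


# Hypothetical calendar: switch nodes at each of four releases in a year.
moves = [
    (date(2023, 1, 1), "node1"),
    (date(2023, 4, 1), "node2"),
    (date(2023, 7, 1), "node1"),
    (date(2023, 10, 1), "node2"),
]
totals = days_per_node(moves, date(2024, 1, 1))
print(totals)
```

With evenly spaced releases the split comes out close to even; with irregular releases it is worth looking at the numbers rather than assuming.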
For the key systems I manage I’ve kept to this routine, and that includes availability groups.
I thought I’d also recount some other issues I’ve encountered with clusters to make the post complete. So, how long do you leave between patching the nodes (SQL patches)? I generally leave only a day. Once upon a time we applied a service pack when it was the practice to wait at least a week before patching the second node, and there was a real issue with the mismatched nodes. Sorry, I don’t remember the ins and outs of it, but essentially the active node (a failover cluster, by the way) kept racking up lots and lots of open handles, which caused us real problems.
With Windows Server 2012 R2 you have to remember to move the cluster core resources to the active node, as they don’t automatically follow when you fail over, whether the failover is managed or not.
I haven’t found any gotchas with availability groups ( yet ).
The last issue I’ve encountered is more directly related to process within our data centre. With Windows patching they have a habit of scheduling around their perceived “preferred nodes”, so if my nodes are numbered xxx13 and xxx14 they will patch 14, fail to 14, patch 13, fail back to 13. That is all well and good if xxx13 is actually the active node, but 50% of the time it isn’t, so I have to be careful when approving the patching schedules. Even though we generally set these up a month or so in advance and always make sure releases don’t clash, we sometimes end up with the data centre patching our live clusters.
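The check I do by eye when approving a schedule can be written down as a small sketch. This is only an illustration of the ordering logic for a two-node cluster, not anything the data centre actually runs; the function names are my own and the node names are the placeholders used above. The idea is simply that the plan starts from whichever node is passive right now, not from a fixed “preferred” node.

```python
def patch_plan(nodes, active, fail_back=True):
    """Return the patching steps for a two-node cluster,
    always patching the currently passive node first."""
    if len(set(nodes)) != 2:
        raise ValueError("this sketch only handles two distinct nodes")
    if active not in nodes:
        raise ValueError(f"{active} is not one of {nodes}")
    passive = next(n for n in nodes if n != active)
    steps = [
        f"patch {passive}",
        f"fail over to {passive}",
        f"patch {active}",
    ]
    if fail_back:
        steps.append(f"fail back to {active}")
    return steps


def schedule_is_safe(first_patched, active):
    """A proposed schedule is only safe if its first patch target
    is not the node currently running production."""
    return first_patched != active


# The data centre's fixed schedule patches xxx14 first.
# That is fine when xxx13 is active, but not when xxx14 is.
print(schedule_is_safe("xxx14", active="xxx13"))
print(schedule_is_safe("xxx14", active="xxx14"))
print(patch_plan(["xxx13", "xxx14"], active="xxx14"))
```

Because the active node changes with each release, the same fixed schedule is safe one time and unsafe the next, which is why it has to be checked against the current owner every time.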