What's worse: losing money or losing data?
I mentioned the storage engine blog recently, and I was reading its latest post, "Common bad advice around disaster recovery", when I thought about a recent situation of ours that applies to another of the storage engine posts, about recovery or repair.
Paul talks about doing some Root Cause Analysis (RCA) of the problem. However, when your problem is stopping your system from operating, which is best:
- Carry on trying to solve the problem to avoid corruption, but prolong the outage.
- Do a quick, dirty fix that causes corruption but gets your service working again. This often means you can't do a full RCA.
Our situation was a replication one, where the log reader agent got stuck. It couldn't read past a certain transaction in the log, which meant that no data was being replicated. We run at close to the limit for some of our subscriptions, so if anything gets in the way, the latency can take quite some time to recover. In this situation we were convinced we could fix the issue and have everything working again. Three hours later we realised that we could not, and so we did the fix that resulted in inconsistent data across our replication topology.
The fix meant that data was being replicated again, but we had to spend considerable time reconciling the data that had been lost (put in the log) in the time we took trying to fix the problem.
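To see why running close to the limit makes latency so slow to recover, here is a rough back-of-the-envelope sketch. The function name and all the rates are made up for illustration (they are not our real figures); only the three-hour outage comes from the story above. It assumes a backlog that builds up at the normal incoming rate and can only be drained with whatever spare throughput is left.

```python
def catch_up_hours(outage_hours, incoming_rate, max_throughput):
    """Estimate the hours needed to clear the backlog built up during an outage.

    incoming_rate and max_throughput must use the same unit
    (for example, replicated commands per hour). Hypothetical model:
    new work keeps arriving while the backlog is being drained.
    """
    if max_throughput <= incoming_rate:
        # No spare capacity at all: the backlog never shrinks.
        return float("inf")
    backlog = outage_hours * incoming_rate
    spare_capacity = max_throughput - incoming_rate
    return backlog / spare_capacity


# Running at 90% of capacity (illustrative numbers), a 3 hour stall
# takes far longer to recover from than one at 50% of capacity.
near_limit = catch_up_hours(outage_hours=3, incoming_rate=900, max_throughput=1000)
half_loaded = catch_up_hours(outage_hours=3, incoming_rate=500, max_throughput=1000)
print(near_limit, half_loaded)
```

Under these assumed numbers the near-limit system needs many times the outage length to catch up, which is why every extra hour spent chasing the clean fix is so expensive.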
As a result of this, we realised that to meet our SLAs we are better off doing the dirty fix straight away. Whilst this means that we may have inconsistencies, these should be small (relative to the fix time), data continues to be replicated, and so SLAs are met and we shouldn't lose much money or many clients.
So the moral of the story is that you need to align ALL your SLAs with ALL your recovery procedures: fix, restore, failover, etc.
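One way to make that alignment concrete is to write down each recovery procedure with its worst-case duration and check it against the outage the SLA allows. The sketch below is only an illustration: the `Procedure` class, the `pick_procedure` helper, the 30 minute dirty fix, and the SLA values are all assumptions of mine; only the three hours we spent on the clean fix comes from the story above.

```python
from dataclasses import dataclass


@dataclass
class Procedure:
    name: str
    worst_case_hours: float      # how long it can take before service is back
    risks_inconsistency: bool    # True if it may leave data to reconcile


def pick_procedure(procedures, sla_max_outage_hours):
    """Pick a procedure that fits the SLA, preferring clean ones.

    Returns None if nothing fits, which means the SLA and the
    recovery procedures are not aligned.
    """
    fitting = [p for p in procedures if p.worst_case_hours <= sla_max_outage_hours]
    if not fitting:
        return None
    # False sorts before True, so clean procedures come first;
    # ties are broken by the shorter worst-case duration.
    fitting.sort(key=lambda p: (p.risks_inconsistency, p.worst_case_hours))
    return fitting[0]


# Hypothetical figures for illustration only.
procedures = [
    Procedure("investigate and repair", worst_case_hours=3.0, risks_inconsistency=False),
    Procedure("dirty fix, reconcile later", worst_case_hours=0.5, risks_inconsistency=True),
]

choice = pick_procedure(procedures, sla_max_outage_hours=1.0)
print(choice.name if choice else "no procedure meets the SLA")
```

With a one hour SLA only the dirty fix fits, so that is what gets chosen; with a more generous SLA the clean repair would win. If the function returns nothing, either the SLA or the procedures need to change.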