Not thinking it through – the post mortem

I just thought I’d round off my series of posts concerning my first experience of an outsourced data centre.

 

http://sqlblogcasts.com/blogs/grumpyolddba/archive/2007/01/04/not-thinking-it-through-the-saga-continues.aspx

 

It really got to the stage of near disbelief, which is why it ended.

From the previous posts you’ll remember the initial issue arose around the re-indexing of the production database each week, and the state the database was left in.

Well, we never did resolve that:

  • We switched the database to the simple recovery model to rebuild the indexes in a series of batches, simple recovery making sure the log didn’t grow and grow. Sadly the tables were not batched in a way that ensured we didn’t run out of disk space, so occasionally this still happened; some of the largest tables started with the same keyword.
  • There was no check in the job to switch the database back to full recovery in the event of an error, or to make sure a full or differential backup was taken before restarting the transaction log backups.
  • If disk space was low then the full backup failed and the database was not recoverable until the next backup, a full 24 hours and a working day later. This was an international application, so a day was actually 24 working hours.
    • If you switch to simple, then once you are back in full recovery you need at least a differential backup before you can recommence transaction log backups and so be able to recover your database.
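
To show the sort of check the job was missing, here is a minimal sketch in Python. It is not the job we actually ran. The `execute` callable, the database name, the table list and the backup path are all hypothetical stand-ins; the T-SQL strings are the standard `ALTER DATABASE ... SET RECOVERY`, `ALTER INDEX ... REBUILD` and `BACKUP DATABASE ... WITH DIFFERENTIAL` statements. The point is simply that the switch back to full recovery and the differential backup sit in a `finally` block, so they run even when a rebuild fails.

```python
from typing import Callable, Iterable


def rebuild_indexes_safely(
    execute: Callable[[str], None],
    database: str,
    tables: Iterable[str],
    backup_path: str,
) -> None:
    """Rebuild indexes under SIMPLE recovery, always returning to FULL.

    `execute` is a hypothetical callable that runs one T-SQL statement
    against a connection already using `database`.
    """
    execute(f"ALTER DATABASE [{database}] SET RECOVERY SIMPLE")
    try:
        for table in tables:
            execute(f"ALTER INDEX ALL ON {table} REBUILD")
    finally:
        # Runs whether or not a rebuild failed: go back to FULL, then take a
        # differential backup so transaction log backups can resume.
        execute(f"ALTER DATABASE [{database}] SET RECOVERY FULL")
        execute(
            f"BACKUP DATABASE [{database}] "
            f"TO DISK = N'{backup_path}' WITH DIFFERENTIAL"
        )
```

Any error from a rebuild still propagates to the caller, so the job fails visibly, but the database is not left sitting in simple recovery with a broken log chain.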

Some other events of note:

  • I discovered during a routine hardware check that the redundant power supplies had been disconnected on two production servers.
  • I asked for a server reboot; the data centre couldn’t find the server in the racks and phoned me to ask its location. Never having been in the data centre, I had no idea.
  • The data centre turned off a reporting server; it took them nearly three days to find it again to switch it back on.
  • They didn’t know how to detect or measure AWE memory on a Windows server.
  • There was an issue with server-to-server file copies, “path too deep” if you want to check. We established that the only way to move a database backup for a release was to use a memory stick. Sadly the data centre were unable to provide the resource, so a member of management had to drive a couple of hundred miles very early in the morning to do this.
  • The actual file copy issue was very interesting and I don’t know if it was ever resolved. Basically, copy times were inconsistent when moving the same file from server to server in the data centre, and I mean seriously inconsistent: from a couple of seconds to generally around 40 minutes, worst case a couple of hours. It also depended on direction, so a copy from A to B might take 2 seconds and from B to A 30 minutes. Cool, huh?
  • Alerts were dealt with on this basis: mark as fixed if, when you check the alert (which could be hours later), the event is no longer there, e.g. high CPU.
  • Disk free space was set to alert at 1 MB free remaining; not too sure who set this threshold.
  • When a production database ran out of log space the solution was to stop the server, delete the log and re-attach the database, creating a new log.
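
On the free space threshold, a rough sketch of what a more useful check might look like, again in Python. The 15% and 10 GB limits are made-up illustrative values, not a recommendation and not what anyone at the data centre used; `shutil.disk_usage` is the standard library call for reading a volume's total and free bytes.

```python
import shutil

GIB = 1024 ** 3


def free_space_alert(
    total_bytes: int,
    free_bytes: int,
    min_free_fraction: float = 0.15,  # hypothetical threshold
    min_free_bytes: int = 10 * GIB,   # hypothetical threshold
) -> bool:
    """Return True when free space is below either threshold."""
    return (
        free_bytes < min_free_bytes
        or free_bytes < total_bytes * min_free_fraction
    )


def check_volume(path: str) -> bool:
    """Apply the thresholds above to the volume containing `path`."""
    usage = shutil.disk_usage(path)
    return free_space_alert(usage.total, usage.free)
```

With a 1 MB limit a volume with 2 MB left raises nothing, which is far too late to do anything about a failing backup; a percentage or multi-gigabyte limit fires while there is still room to act.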

 

Maybe I was just unlucky, but it’s probably fair to ask “who watches the watchers?” and not to assume that because it’s a large data centre things will actually be good.

Published 26 June 2007 12:55 by GrumpyOldDBA
