Not thinking it through
I’m looking into performance problems on a box which is in a managed data centre, so there is the question of ownership of problems etc. One current issue has been an index rebuild job which has been regularly failing. The job is a simple cursor loop that applies a dbcc dbreindex to each table in turn, the database is in full recovery mode and tran logs get backed up hourly, it’s effectively a 7 x 24 system so options of changing database mode are not applicable. The server is also, in my view, short of disk space, we can only store one backup of the main database which is about 20gb in size, and the space available for transaction logs is also limited.
There were a number of server configuration changes to the out of the box settings, one which still exists is a setting to the Query Governor. Upon investigation it was discovered that the query governor setting was responsible for the failure of the dbcc dbreindex job.
Once this was resolved the reindex of the tables ran the transaction log drive out of space! I’m not going to relate how that was resolved, let’s just say the production database now has a different dbid.
The solution to this issue was to add a second loop inside the dbreindex loop to perform a transaction log backup every 10 or 20 tables, sounds fair, however, the log backup command is with an init and the time stamp for the file name is of the usual format dbname_log_yyymmddhhmm.bak, problem is if the dbreindex hits a batch of small tables the loop executes in less than a minute, this means the next log file overwrites the previous file and we have lost the continuity of the transaction logs and we can’t recover the database.
This was “tested” a number of times on the production server, including generating the logs without actually applying the reindex, 20 transaction log backups all with the same name in under a minute. As there were no before and after backups of the database, not enough disk space, the routine destroyed the recovery point until the next full backup – which didn’t occur until 36 hours later, putting us at risk of loss of a day’s business.
The Query Governor has been increased enough to presumably allow the job to run, although as we haven't had a complete run of the job data changes since may of course cause the job to fail next time it runs.
So the moral of this story is to beware changing server settings, you might not realise what is affected and think through clearly the logic of coding changes.