August 2011 - Posts

Mending the master database
19 August 11 02:11 PM | GavinPayneUK | with no comments

Having a corrupt master database in SQL Server could be said to be one of the worst issues a database server instance could ever have.

What makes protecting against master database corruption more difficult is that it can’t have any of the regular high-availability features available in SQL Server used with it, such as database mirroring, replication or log shipping.  The only protection available is to back the database up regularly.

Backing up the master database

How you back that database up depends on your team’s skillset.  While SQL Server professionals would probably prefer to use native T-SQL commands and flat files if your infrastructure team only know about Backup Exec and would be responsible for restores at 2am then the decision might be made for you.

Below are links to the two backup methods mentioned above:

Ola’s T-SQL housekeeping and backup scripts here
Symantec’s Backup Exec SQL Server agent here

Restoring the master database using T-SQL

If your SQL Server master database is consistent but has invalid configuration settings then restoring it is fairly straight forward.

Using T-SQL and a flat file backup of the database requires SQL Server to be re-started in single user mode and the database restored, instructions can be found here.

If your SQL Server master database is corrupt to the extent that SQL Server won’t start in single-user mode then you have to rebuild the master database using the original setup tool.  A PSS blog article can be found here on how to do that.  If you’re rebuilding master on a SQL Server 2005 cluster expect to type one of the longest command line commands known to man, more details here.

Restoring the master database using Backup Exec

If you’re using a tool like Backup Exec, then restoring the master database becomes more straight forward according to Symantec’s web support.  It has a clever file system copying method for making a backup of master, meaning if you ever want to restore it whether because of corruption or invalid settings all you apparently have to do is stop SQL Server and rename its copies of the master mdf and ldf files to replace those you no longer want.  You then follow some post-rename steps once SQL Server is running again. 

An article about this clever feature can be found here.

Recovery not backup

As with all business critical systems how you back your systems up isn’t what matters, its how you test those backups and know how to use them in the event of a recovery which matters.  So don’t believe any of what I’ve written until you’ve tested it for yourself!

Clustered Floppy Disks?
02 August 11 10:19 PM | GavinPayneUK | with no comments

Many years ago we used to use floppy disks, hundreds of the things, each normally holding 1.44MB, and they were all we had to copy data around on.

Fast-forward to 2011 and its been a long time since I saw a PC with a 3.5” floppy drive fitted, let alone have someone send me data on one, probably about 11 years is my guess.  So, what a surprise when I was building a cluster with Windows Server 2008 R2 today when I got the following error message:

The wrong diskette is in the drive
Insert %2 (Volume Serial Number: %3) into drive %1.

I got this error on the "List potential cluster disks” section of the Cluster Verification Report (CVR) when I was preparing Windows Failover Clustering. 

Some background.  Both servers in the soon-to-be cluster had several LUNs attached to them using the SAN vendor’s MPIO and Fibre Channel HBA drivers.  Each LUN was presented as a disk to Windows, from where Windows had initialized the disk and put a GPT on it as a Simple disk.  Each disk had a single partition created and formatted as NTFS.  All very normal.

The issue was escalated to Microsoft and the SAN vendor where the focus of investigation was on the SAN and LUN configuration.

The initial consideration was the “Host operating system” value assigned to each of the hosts within the SAN controllers configuration; was this set to Window Server 2008/with Clustering?  Interestingly, while recommended, apparently the actual value makes no difference to the storage’s underlying functionality, but was still “highly recommended”. 

Ultimately, the issue was resolved by clearing the Persistent Reservations for the LUNs assigned to the cluster nodes at the SAN controller level.  I borrow from Symantec the following definition for those not familiar with the term:

“SCSI-3 Persistent Reservations (SCSI-3 PR) are required for I/O fencing and resolve the issues of using SCSI reservations in a clustered SAN environment. SCSI-3 PR enables access for multiple nodes to a device and simultaneously blocks access for other nodes.”  The full link is available here

Essentially, PRs are like write locks on a file, they allow at a storage communications level for a specific host to reserve full control over a storage item, while their peers for the time being can only achieve a lower level of access to the same storage item.

Clearing the Persistent Reservations allowed the previously failing CVR to succeed from where I could then build the cluster knowing Microsoft would happily support it.

So in the end, the actual error had nothing to do with floppy disks.  Instead, as a Google search will show, the Win32 API used and the error message raised date back to when that was the most likely cause of the failure obviously being trapped.