We all make mistakes and I made one today. I was building a 3 node, 4 instance SQL Server 2005 failover cluster, and as part of the build process needed to lock it down so that it can run happily with a minimum set of permissions. I added the SQL Server DBAs Windows AD Group as a sysadmin on one of the SQL Server instances (instance D), and then removed BUILTIN\Administrators. The instance continued to run fine and I carried on with some other work. Later on, I needed to finish the build so did some testing by trying to fail over every instance to every node to be sure that all SQL Server instances will run on all nodes if necessary. Instances A, B and C failed over fine, but instance D failed to start and was marked as failed by the cluster service. Checking the SQL server error logs I found this:
Login failed for user 'WINDOWSDOMAIN\clusterservice'. [CLIENT: 10.x.x.x]
Error: 18456, Severity: 14, State: 11.
Cluster service had no login in SQL Server
It then dawned on me that I had forgotten to add the cluster service account as a login in SQL Server. The cluster service has a process called IsAlive which checks that the clustered SQL Server instances can be logged into. It had been logging in to SQL Server through BUILTIN\Administrators before, but now couldn't. I tried to take the resource offline, and start SQL Server in single user mode by adding the following switches in SQL Server Configuration Manager to the SQL Server service: -m;-c;-T3608. This didn't help and SQL Server failed to start.
I then tried to think if there were any AD groups that had a login to SQL Server. There were two - the SQL Server Admins AD group, and the SQL Agent AD group. When creating a clustered instance of SQL Server you need to create a Windows AD group for each SQL service for each instance in your cluster - an annoyance of mine, but there you go that's the way it works. So, now I had a chance to recover by simply adding the cluster service as a member of the SQL Agent Windows AD group for this instance. I added the account in, and waited a minute or so, but the SQL Server service still failed to start. I then decided to reboot all nodes in the cluster and wait 15-20 minutes for AD replication to replicate the group membership changes across all domain controllers. This did the trick, after waiting 15 minutes, the SQL Server instance started successfully.
Cleanup
I then added the cluster service as an explicit login in SQL Server and removed the membership from the SQLAgent service account.