Mark's SQL Server Blog

select experiences from workingInTheField

Cluster service login missing from SQL Server - how to recover

We all make mistakes and I made one today. I was building a 3 node, 4 instance SQL Server 2005 failover cluster, and as part of the build process needed to lock it down so that it can run happily with a minimum set of permissions. I added the SQL Server DBAs Windows AD Group as a sysadmin on one of the SQL Server instances (instance D), and then removed BUILTIN\Administrators. The instance continued to run fine and I carried on with some other work. Later on, I needed to finish the build so did some testing by trying to fail over every instance to every node to be sure that all SQL Server instances will run on all nodes if necessary. Instances A, B and C failed over fine, but instance D failed to start and was marked as failed by the cluster service. Checking the SQL server error logs I found this:

Login failed for user 'WINDOWSDOMAIN\clusterservice'. [CLIENT: 10.x.x.x]
Error: 18456, Severity: 14, State: 11.

Cluster service had no login in SQL Server

It then dawned on me that I had forgotten to add the cluster service account as a login in SQL Server. The cluster service has a process called IsAlive which checks that the clustered SQL Server instances can be logged into. It had been logging in to SQL Server through BUILTIN\Administrators before, but now couldn't. I tried to take the resource offline, and start SQL Server in single user mode by adding the following switches in SQL Server Configuration Manager to the SQL Server service: -m;-c;-T3608. This didn't help and SQL Server failed to start.

I then tried to think if there were any AD groups that had a login to SQL Server. There were two - the SQL Server Admins AD group, and the SQL Agent AD group. When creating a clustered instance of SQL Server you need to create a Windows AD group for each SQL service for each instance in your cluster - an annoyance of mine, but there you go that's the way it works. So, now I had a chance to recover by simply adding the cluster service as a member of the SQL Agent Windows AD group for this instance. I added the account in, and waited a minute or so, but the SQL Server service still failed to start. I then decided to reboot all nodes in the cluster and wait 15-20 minutes for AD replication to replicate the group membership changes across all domain controllers. This did the trick, after waiting 15 minutes, the SQL Server instance started successfully.

Cleanup

I then added the cluster service as an explicit login in SQL Server and removed the membership from  the SQLAgent service account.

Comments

SqlServerKudos said:

Kudos for a great Sql Server article - Trackback from SqlServerKudos

# September 1, 2009 6:48 PM

adamsjs said:

Good post... helped me out tonight.  Did the same thing, and already had two running SQL Server instances on the cluster that I could not afford to take down/failover.  I waited out the AD replication after adding the cluster service account to a domain group with access to the newly-installed third instance, but still had issues getting the service to run.  Figured out that the cluster service needed a restart, but I obviously didn't want to take down the running instances on the active node.  So, here's what I did:

The running instances were both running on the same node of a three-node cluster.  After adding the cluster service to a group that had access to the newly installed instance, I moved the resource group for the new instance to one of the non-active nodes, restarted the cluster service on that node (to pickup the new group membership), and verfied the instance could be brought online.  I then restarted the cluster service on the other non-active node, and moved the new instance that node.

To convince myself that I was correct in my thinking, I tried to move the resource group back to the node where the other two active instances were running (where the new instance had been running when I removed the cluster service account's access).  It would not come online (as expected).  I brought the instance online on one of the other nodes, created the login for the cluster service account, moved the instance to the original node, and it came online.  Lastly, I removed the cluster service account from the domain group to which it had been added.

Hope this helps the next person out.

# October 11, 2009 5:39 AM