We all make mistakes and I made one today. I was building a 3 node, 4 instance SQL Server 2005 failover cluster, and as part of the build process needed to lock it down so that it can run happily with a minimum set of permissions. I added the SQL Server DBAs Windows AD Group as a sysadmin on one of the SQL Server instances (instance D), and then removed BUILTIN\Administrators. The instance continued to run fine and I carried on with some other work. Later on, I needed to finish the build so did some testing by trying to fail over every instance to every node to be sure that all SQL Server instances will run on all nodes if necessary. Instances A, B and C failed over fine, but instance D failed to start and was marked as failed by the cluster service. Checking the SQL server error logs I found this:
Login failed for user 'WINDOWSDOMAIN\clusterservice'. [CLIENT: 10.x.x.x]
Error: 18456, Severity: 14, State: 11.
Cluster service had no login in SQL Server
It then dawned on me that I had forgotten to add the cluster service account as a login in SQL Server. The cluster service has a process called IsAlive which checks that the clustered SQL Server instances can be logged into. It had been logging in to SQL Server through BUILTIN\Administrators before, but now couldn't. I tried to take the resource offline, and start SQL Server in single user mode by adding the following switches in SQL Server Configuration Manager to the SQL Server service: -m;-c;-T3608. This didn't help and SQL Server failed to start.
I then tried to think if there were any AD groups that had a login to SQL Server. There were two - the SQL Server Admins AD group, and the SQL Agent AD group. When creating a clustered instance of SQL Server you need to create a Windows AD group for each SQL service for each instance in your cluster - an annoyance of mine, but there you go that's the way it works. So, now I had a chance to recover by simply adding the cluster service as a member of the SQL Agent Windows AD group for this instance. I added the account in, and waited a minute or so, but the SQL Server service still failed to start. I then decided to reboot all nodes in the cluster and wait 15-20 minutes for AD replication to replicate the group membership changes across all domain controllers. This did the trick, after waiting 15 minutes, the SQL Server instance started successfully.
Cleanup
I then added the cluster service as an explicit login in SQL Server and removed the membership from the SQLAgent service account.
I am making some improvements to a system that I didn't write. We have lots of tables which do not use the IDENTITY property - but these tables are using integer surrogate primary keys. The primary keys are assigned using a stored procedure from a key table. This is an ugly, horrible disgusting solution that was fairly common 20 years ago to get around limitations of database managers back then. But we're in the 21st century now so we should be using something more elegant and manageable.
So, I have a central table called Trade with lots of foreign keys into it on the surrogate column - tradeId. What I need to do is change the tradeId column to Identity via a script that can be applied in the dev environment, QA/UAT, and then production. An easy way around this is to harness the power of the scripting capabilities of SQL Server Management Studio (SSMS). So, to solve our problem right click on the table and select Design.

Where it says Identity Specification, expand it and then select the drop down next to (Is Identity) - select Yes. Change the Identity Seed value to the max value in your table. I've decided to do this via a script as I don't know what the production max seed value is so I will have to find that with something like:
select max(tradeId) from trade
before I change the table.
So, now that we have changed the Identity, from the Table Designer menu select Generate Change Script.

SSMS will in the background (via SMO) generate a script for you to make the change. I do not recommend applying the script as is though, we have further work to do yet. You should get a window popping up asking you to you want to save the change script to a file, select yes. Sometimes if you have a lot of data you will get a warning saying this process may take a long time - up to you what you want to do here. My advice is always test in a development environment first.
The script will have lots of ALTER TABLE DROP CONSTRAINT statements to drop any foreign keys into Trade. It will then create a new temp table with an Identity column. I recommend you edit this section to put in a dynamically allocated Identity seed value.
So for my table the script for the temp table looks like this (truncated):
CREATE TABLE dbo.Tmp_Trade
(
TradeId dbo.entityid NOT NULL IDENTITY (1, 1),
ModifiedBy dbo.entityid NOT NULL,
ModifiedAt datetime NOT NULL,
Version int NOT NULL,
...
)
What you should do is replace the CREATE TABLE above with something like this:
declare @newTradeSeed int
declare @stmt nvarchar(max)
select @newTradeSeed = max(tradeId) + 1 from dbo.trade
select @stmt = 'CREATE TABLE dbo.Tmp_Trade
(
TradeId dbo.entityid NOT NULL IDENTITY (' + cast(@newTradeSeed as nvarchar(10)) + ', 1),
ModifiedBy dbo.entityid NOT NULL,
ModifiedAt datetime NOT NULL,
Version int NOT NULL,
...
)
exec (@stmt)
Further down the script IDENTITY_INSERT is switched on for the new tmp table so that the data can be copied in with an INSERT..SELECT statement. Once the data is in, we can drop the original Trade table and then rename the tmp table back to Trade. Once this has been done, we simply add our foreign keys back which SSMS kindly scripted out for us.
It is important on busy systems to run all the above within a transaction to prevent consistency errors. I like to run it in a TRY..CATCH block within a transaction. So:
BEGIN TRY
BEGIN TRANSACTION
<my script>
COMMIT TRANSACTION
END TRY
BEGIN CATCH
if @@trancount > 0 ROLLBACK TRANSACTION
<some error logging code>
END CATCH
(Or something roughly like the above in structure).
I had a puzzling little problem this week which caused
our remote users unable to access our system in our remote location. We have two SQL Server 2005 SP2
servers on different continents. The link between them is extremely slow around
100kbit. We have a linked server set up on the remote server, let’s call that
REMOTESQL01 pointing to the central SQL Server, let’s call that CENTRALSQL01.
We also have a form of replication set up which I will go into in another blog
post, but that’s not the important point here.
On the Friday evening – great time for a failure, we had some network problems causing an outage of 15 minutes. Once this error had cleared everything appeared to be working fine.
On the Sunday evening we have a batch process which dumps
out a cut of the data from CENTRALSQL01 and then compresses it and sends it to
REMOTESQL01, where a job bulk loads the data in. This is controlled with stored procedures connecting through Linked
Servers.
The job failed because the linked server on REMOTESQL01
could not connect to CENTRALSQL01 – I believe this is related to the
network outage – although according to our infrastructure
dept, everything is back to normal. To illustrate the issue, if I ran the following
on REMOTESQL01:
select * from [CENTRALSQL01].MyDatabase.dbo.myTable
I got the following error:
OLE DB provider "SQLNCLI" for linked server "CENTRALSQL01"
returned message "Unable to complete login process due to delay in opening
server connection".
Msg 7303, Level 16, State 1, Line 1
Cannot initialize the data source object of OLE DB provider
"SQLNCLI" for linked server "CENTRALSQL01".
For completeness, my linked server settings are:

In the linked server properties, I decided to change the Connection Timeout and see what would happen if that was extended from the default of 20secs to 120secs. After several tests across the slow link, the new timeout period solved our problem. This is only a temporary fix until we find out the root cause on the network.
I hope this little tip is handy to people experiencing Linked Server connection errors across slow WAN links.