Last week i was
dragged kicking and screaming away from my allocated SQL 2005 days to
spend 2 days diagnosing performance problems on one of the firms
Systems and write a report.
Due to the sensitive nature of the
data on the system it has been managed by another team and this was the
first time i had looked at it. I thought i'd share my thought process
to identifying the problem on the server and reference tools i used.
Its the first iteration of the review and once my recommendations are
implemented there will be another review to see what else may need
dealing with. Please note however i have not included every single step
i took otherwise i'd be writing a novel!
Scenario
The
users of the system have been reporting intermitant time outs occuring
in the front end application. Current system administrators have noted
periods of high cpu usage. The server is Windows 2000 SP4 running SQL
2000 SP3a with 2 physical processors, hyperthreading enabled, 2gb
memory, SAN attached (EVA 5000).
Analysis
The first port of
call was to review the event logs and sql server logs for any errors
that may indicate what is happening. These logs revealed nothing. Next
port of call was to connect via Query Analyser and see what was going
on. A quick look at current activity showed nothing much going on, so i
did the standard SELECT * FROM SYSPROCESSES. Now this revealed that 80%
of the connected spids last wait times had been for a latch and
spefically the waitresource was 2:3:69 or 2:1:24.
This
waitresource translates as follows first number = Db ID, in this case
tempdb. second number = filegroup and 3rd number is the page.
So,
this made me suspicious and i moved on. Next thing, we have a timeout
at the front end so what is the server waiting for? To find this out i
created the track_waitstats and get_waitstats procedures
on the server and ran the track_waitstats for 8 hours with 5 minute
samples. At the same time i kicked off a server side trace and
performance monitor with all my standard counters and left this running
for 24hrs.
One last thing to do was take a look at how SQL was using its memory. For this i used DBCC MEMORYSTATUS and
DBCC MEMUSAGE. This did not reveal anything alarming but the DBCC
MEMUSAGE did show that a log table was the top buffer cache user so i
noted this and i would later suggest the table be reviewed and archived
as appropriate.
The next day it was time to start reviewing the information i had gathered. I loaded up the trace with my trusty trace analysis tool which
takes all the pain out of working with the trace. Within 30 minutes the
tool had analysed the trace, loaded it into a normalised database and
presented me with a summary report with the most interesting
information. Alot of the time i don't need to go any deeper than the
summary report, although in a small number of cases you do have to and
then its a case of querying the normalised database to extract more
information and in very rare cases going back to the trace (painful!).
The first thing i
looked at was the wait statistics i captured using track_waitstats.
This gave me the key to the problem and incdicated that the server
spent 45% of its time waiting for latches!
The other most significant wait @ 32% was for cxpackets which relate to
parallelism. Now with latching in my head i started thinking IO
bottleneck but a little voice said, "unlikely, you are attached to a
high performance SAN". So, i went to review my performance counters and
specifically the Average and Current disk queue lengths inline with
disk read & writes. Nothing! Queue lengths were negligable and read
writes were nothing to boast about. So what next? Well, i need to know
where the latching is occuring so i set up a loop to capture the output
of sysprocesses for 45 minutes. Now this was interesting, most of the
waiting is occuring in tempdb! Ahar! Everything now points to tempdb
contention. For a moment i thought to myself its classic contention and
follow Microsoft article about it. Never that easy though! I was not exhibiting waitresources of 2:1:1 or 2:1:3.
So, my wait resources are primarily 2:3:69. What is this? So i delve into the page to take a look. To do this i used DBCC PAGE to
find out whats on the page and what object it belongs to. Both the
pages being flagged belonged to SYSINDEXES in tempdb. The alarm bells
are now ringing, the contention is in tempdb around sysindexes.
Probably some dodgy code so back to the summary results of my trace and
lets take a look at the top queries by CPU, READS & WRITES. Voila!
There it is. There are 2 queries that consume 40% of all resouce
requested by SQL on the box.
So tuning time. The
primary problem was that one of the peices of code was doing a SELECT *
INTO #TBL from badly_written_view. To the best of my understanding its
was actually this that was causing the latch contention in tempdb. I
believe that when you do select * into its holding latches around the
appropriate tables including sysindexes but for the duration of the
transaction which is up to 3.5mins in this case. So i have removed the
select * into and changed it to create #table followed by insert
into. I then spent the next 5 hours optimising the code so i could send
it back to the developers. I did not have to optimise the code but i
was so horrified by the code i saw and a little part of me wanted
to say to them "And thats how you do it". I managed to half the
execution times on both bits of code but there was still room for
improvement but that was going to take to much time as i was not not
familiar enough with the dataset and my optimising was rusty.
The developers now
have my recommendations and new code for review, hopefully changes will
be implemented in the next week and i can confirm my belief regarding
select * into. Either way i`ll be doing another round of reviewing.
I've decided to stop
here for the blog but a number of other recommendations were made which
i've listed below. Every iteration of a performance review is different
and other tools i have not mentioned are used but i hope you find this
article gives you an insight into tuning and that its links are useful.
1) Increase tempdb files to match the number of processors. (Best practice)
2) Reduce degree of parallelism to n - 1 where n is the number of physical processors.
3) Review archive table.
Who knows, the next
review may turn up a whole new world of pain or confirm the server is
now running in a satisfactory state! One thing i can say for sure, the
way i approach the problem and the tools i use next time will be
similar but definately not the same.