A comment was made that it’s all very well writing about performance tuning but it’s rare for hard and fast figures to be quoted and we never write about what the system is we’re working on – the criticism being that the statement “ It just depends ” is a bit of a cop out.
Well I have some sympathy with this view, however if it was that easy I’d probably be out of work!
I had a think and decided to make some notes about some of the values I view as being an issue.
( disclaimer :- I’m talking servers with scsi or fibre channel disks, SAN or otherwise, server operating systems – not xp on a laptop or a £600 desktop with two sata disks and sql server developer installed )
Disk subsystem.
This is one of my first ports of call, for an OLTP system I don’t expect to see raid 5 and I don’t expect to see a physical array carved into logical drives or shared luns.
I generally use four counters as from experience these will usually give useful data regardless of where or how the disks are connected:-
- Avg. Disk sec/Read
- Avg. Disk sec/Write
- % idle Time
- Disk Transfers/sec
1) Avg. Disk sec/Read and Write – you want io to complete as fast as possible, on a decent subsystem the read value should be almost unreadable and the writes should be well under 10ms. I’ve seen values of 1ms reads and 5ms writes on a well set up SAN and DAS.
2) I’ve found disk time to give meaningless values whereas the idle time seems to be consistant regardless of subsystem – you wouldn’t want to see idle time of 0 consistently, I usually work on the basis average 85% usage or less is good, so > 15% for this counter
3) Transfers/sec just measures io, you should be able to use this figure with the knowedge of the raid and number of spindles to work out how much the system is saturated, although idle time will provide the same indication. This is always going to be relative figure as you won’t be able to tell which are random and which are sequential io.
i. A transaction log drive should not have any write latency at all – any latency is bad news.
Memory
Most systems I deal with have more databases then memory so there’s always likely to be an issue – the best performance is when your database is all in cache.
There’s really only one counter I look at and that’s page life expectancy – many figures are presented for this, a value of 300 indicates a 5 minute page life, is this sufficient, well I prefer to see over 1000 as that will hopefully minimise disk activity.
The real thing to view is the page life expectancy being taken down to zero or near zero, low page life means physical io.
What might you see by increasing memory, well surprisingly not what you might expect, as an example I had a system with 8Gb of ram which I took to 32Gb, true I got a great page expectancy but I still had performance issues, but what I did find was that the work throughput had increased, by about 50%, so the users were able to actually do more work. There was a total skew of physical io with reads decreasing and writes increasing.
This is why so much tuning is a “just depends” AND if you don’t have a proper baseline of collected performance data before you start making changes then you’ll really never know what you’ve improved. Incidentally this is one of the areas I question during technical interviews where a DBA claims to have performance tuned a server/database.
I tend not to make so much of hit ratio counters, with 32bit there is only so much memory you can allocate to procedure cache anyway, and if your application produces loads of ad-hoc sql there might not be much you can do.
Cpu
If your cpu is constantly flat lining or hitting periods of high use then most cpu counters you’d want to view will be high – or whatever. Context switches is one of the favourite counters but I find high values of this coincide with high cpu usage so you already know this! Important is to watch all your processor graphs to make sure there’s no inconstant loading.
With SQL Server I usually view umsstats, dbcc sqlperf(umsstats) you can get this through a dbcc or a dmv, I actually think the dbcc command is better. What you’re looking for here is the waiting workers when the system is busy – any significant value probably means you have a cpu problem. Solving it isn’t always about adding cpu, if the sql is poor or the database badly optimised resources are held for longer. For example I just added an index for a small report, the report only runs once a week, but it required 600k io and took a over 15 seconds to run, the index dropped the io to 272 and the time to a few ms. The point here is that the resource overall has been reduced, and this is the whole point and it also proves that most of your gains are in the software/code not the hardware.
Ums stats allows you to see allocated workers, active workers and waiting workers – this may be more informative and a better view of server activity.