In the last couple of weeks I've seen a few database servers where sqlserver.exe was almost constantly at 100% CPU usage, usually they were servers I’d never seen before which didn’t help!
If you've had your environment designed or tuned by the SQLCat team then you’d probably be happy that you were getting every ounce of capability out of your hardware. Reality for most people is that 100% CPU usage is bad and indicates something somewhere is wrong, at the same time your users will all have probably stopped working as well.
Sometimes the initial reaction is to either find/buy more CPUs, or even more tactically, re-start SQL Server. Neither of these are a long term solution, one just means more CPU resource will get burnt when the problem re-occurs, the other is a very short term fix giving perhaps a few minutes grace until the problem re-occurs.
Where to begin looking
The first place I look when I see a system in an ultra-high CPU state is outside of SQL Server and at what Windows and the hardware is doing. I perform a quick check of some vital system stats:
Windows Pages/sec, % Pagefile utilisation, Available Memory
Disk latencies, IOPs and KB/sec
NIC % utilisation
Sometimes I’ll also look at % Privileged Time if only to tell me whether to check out anti-virus, DoubleTake or other system software.
Then I’ll look at the SQL Server errorlog. This was where I found evidence of a recent cause of ultra-high CPU usage:
SQL Server’s memory being paged
On one server the backup software had a memory leak and was using over 3GB of memory. This caused Windows to page out everything and anything it could find, including SQL Server’s memory. Events were written to the SQL Server errorlog saying that “A significant part of sql server process memory has been paged out” and at the same time the % pagefile utilisation counter was up at 60%. Stopping the backup service freed up over 3GB of memory and almost immediately the CPU usage by sqlserver.exe dropped from 100% to…10%.
The second cause I’ve seen a couple of times recently again had very little to do with CPU cycles from the outside but is an area which we all know can have a huge impact on SQL Server’s performance:
After the two checks I mentioned above the next thing I do is open Activity Monitor within Management Studio. In an ideal world I’d use Perfstats to perform a more much detailed capture and then analyse it in SQLNexus but usually by this time you start hearing comments like “the system is down” and “no one can work”.
Within 60 seconds you can start to see trends in the top performing queries, whether ordered by CPU, executions or reads. With those top performing queries I then get their execution plans into SQL Sentry’s Plan Explorer.
Plan Explorer does what I need it to do for me in a hurry, it finds and highlights the most expensive operations in those queries and at this point I will often find:
- Missing clustered indexes
- Key/RID lookups
- Table and index scans
In some situations the creation of 4 or 5 carefully selected “emergency” indexes based on the observations from Plan Explorer can also drop the CPU usage down from 100% to single digit values. These can be turning heaps into clustered indexes to reduce locking and latch contention, implementing covering indexes to stop lookups or converting expensive table scans into index seeks.
If that doesn’t help then the next area I check is:
Statistics and Index Housekeeping
As we know SQL Server’s query optimiser loves having up-to-date statistics while its IO processes hate logical or physical fragmentation. On some systems there is only weekly re-indexing which in a database with missing clustered indexes or heavy page split rates can become heavily fragmented in a matter of hours, so after a few days performance can really struggle. In one case a check of the stats data using DBCC showed that running sp_updatestats was well overdue and fortunately 90 seconds later bought the CPU utilisation down from 100% to something much lower. Re-indexing on a production system isn’t something I’d normally want to do, but again if the data shows it needs running and the system is already “dead” to the business users it might be a decision you take.
Your Mileage Will Vary Along With Your Troubleshooting Methods
The observations and solutions I’ve written about above were point in time situations and remedies which I may never see again, however they are useful checks for when I see a poor performing system again. They’re not the only causes of ultra-high CPU usage and my remedies weren’t the only solutions to the issues we faced. There are also much better ways to more accurately diagnose high CPU usage, however if sharing my experiences means someone who does a Google search in a hurry finds some benefit from them I’ll feel this post was worthwhile.