August 2007 - Posts

Non yielding schedulers

Error: 17883, Severity: 1, State: 0 server Process 162:0 (0) UMS Context 0x11DE8B68 appears to be non-yielding on Scheduler 1.

 

I was browsing past Technet when I just happened to notice this KB being in the most recent trouble shooting posts.    http://support.microsoft.com/kb/922658/en-us

 

Actually quite strange as this is dated February, so not really recent.

 

The non yielding schedulers is something I’ve run up into a few times at other client sites and to be honest I can’t remember exactly the outcome but if I recollect correctly I think I wanted to apply sp2 to w2k3 but as it was an o/s change I was not able to get the change applied.

I’m also pretty sure I encountered this with x64 SQL2005, however there were a number of issues with the x64 boxes so this probably got overlooked. I’m sure there are some specific x64 fixes in sp2 ( for w2k3 ), funny how it’s often easy to get a SQL Server service pack or patch applied but a real problem to get a change to an operating system.

 

What is interesting is that this bug manifests itself as a lack of activity on the server;

 

In most cases, you may notice that the value in the "Batch Requests/sec" counter of the SQLServer:SQLStastics performance object drops significantly in the System performance monitor. Sometimes, the counter value drops to 0. Additionally, you may notice a similar decrease in CPU utilization and in disk utilization. “

 

 now that’s a tricky situation to monitor!

Posted by GrumpyOldDBA with no comments
Filed under:

What's in a counter ?

A comment was made that it’s all very well writing about performance tuning but it’s rare for hard and fast figures to be quoted and we never write about what the system is we’re working on – the criticism being that the statement “ It just depends ” is a bit of a cop out.

 

Well I have some sympathy with this view, however if it was that easy I’d probably be out of work!

 

I had a think and decided to make some notes about some of the values I view as being an issue.

 ( disclaimer :- I’m talking servers with scsi or fibre channel disks, SAN or otherwise, server operating systems – not xp on a laptop or a £600 desktop with two sata disks and sql server developer installed )

 

Disk subsystem.

 

This is one of my first ports of call, for an OLTP system I don’t expect to see raid 5 and I don’t expect to see a physical array carved into logical drives or shared luns.

 

I generally use four counters as from experience these will usually give useful data regardless of where or how the disks are connected:-

 

  1. Avg. Disk sec/Read
  2. Avg. Disk sec/Write
  3. % idle Time
  4. Disk Transfers/sec

 

1)     Avg. Disk sec/Read  and Write – you want io to complete as fast as possible, on a decent subsystem the read value should be almost unreadable and the writes should be well under 10ms. I’ve seen values of 1ms reads and 5ms writes on a well set up SAN and DAS.

2)     I’ve found disk time to give meaningless values whereas the idle time seems to be consistant regardless of subsystem – you wouldn’t want to see idle time of 0 consistently, I usually work on the basis average 85% usage or less is good, so  > 15% for this counter

3)     Transfers/sec just measures io, you should be able to use this figure with the knowedge of the raid and number of spindles to work out how much the system is saturated, although idle time will provide the same indication. This is always going to be relative figure as you won’t be able to tell which are random and which are sequential io.

                                                              i.      A transaction log drive should not have any write latency at all – any latency is bad news.

 

Memory

 

Most systems I deal with have more databases then memory so there’s always likely to be an issue – the best performance is when your database is all in cache.

 

There’s really only one counter I look at and that’s page life expectancy – many figures are presented for this, a value of 300 indicates a 5 minute page life, is this sufficient, well I prefer to see over 1000  as that will hopefully minimise disk activity.

The real thing to view is the page life expectancy being taken down to zero or near zero, low page life means physical io.

What might you see by increasing memory, well surprisingly not what you might expect, as an example I had a system with 8Gb of ram which I took to 32Gb, true I got a great page expectancy but I still had performance issues, but what I did find was that the work throughput had increased, by about 50%, so the users were able to actually do more work. There was a total skew of  physical io with reads decreasing and writes increasing.

 

This is why so much tuning is a “just depends” AND if you don’t have a proper baseline of collected performance data before you start making changes then you’ll really never know what you’ve improved. Incidentally this is one of the areas I question during technical interviews where a DBA claims to have performance tuned a server/database.

 

I tend not to make so much of hit ratio counters, with 32bit there is only so much memory you can allocate to procedure cache anyway, and if your application produces loads of ad-hoc sql there might not be much you can do.

 

Cpu

 

If your cpu is constantly flat lining or hitting periods of high use then most cpu counters you’d want to view will be high – or whatever. Context switches is one of the favourite counters but I find high values of this coincide with high cpu usage so you already know this! Important is to watch all your processor graphs to make sure there’s no inconstant loading.

 

With SQL Server I usually view umsstats,   dbcc sqlperf(umsstats)  you can get this through a dbcc or a dmv, I actually think the dbcc command is better. What you’re looking for here is the waiting workers when the system is busy – any significant value probably means you have a cpu problem. Solving it isn’t always about adding cpu, if the sql is poor or the database badly optimised resources are held for longer. For example I just added an index for a small report, the report only runs once a week, but it required 600k io and took a over 15 seconds to run, the index dropped the io to 272 and the time to a few ms. The point here is that the resource overall has been reduced, and this is the whole point and it also proves that most of your gains are in the software/code not the hardware.

 

Ums stats allows you to see allocated workers, active workers and waiting workers – this may be more informative and a better view of server activity.

Posted by GrumpyOldDBA with no comments
Filed under:

SQL 2005 Hotfixes

WesleyB kindly posted a link to allow me to see this list

http://support.microsoft.com/kb/937137

http://sqlug.be/blogs/wesleyb/archive/2007/08/01/tokenandpermuserstore-continued.aspx

Having been told I need a specific patch for the cache problems it seems maybe that info was incorrect < grin >

Posted by GrumpyOldDBA with no comments

dbmail puzzler

This had me scratching my head for a short while:-

 

The first query sends a mail with a return status of 0

 

exec @return = msdb.dbo.sp_send_dbmail @profile_name='Any profile Will Do',

@subject = @subject,

@body = @body,

@query = @query,

@query_result_separator = @separator,

@exclude_query_output = 1,

@recipients = 'colin.leversuch-roberts@xyz.co.uk;john.smith@abc.co.uk;jim.jones@abc.co.uk';

 

This query doesn’t send a mail but still returns a status of 0

 

exec @return = msdb.dbo.sp_send_dbmail @profile_name='Any profile Will Do',

@subject = @subject,

@body = @body,

@query = @query,

@query_result_separator = @separator,

@exclude_query_output = 1,

@recipients = 'colin.leversuch-roberts@xyz.co.uk;

john.smith@abc.co.uk;

jim.jones@abc.co.uk';

 

I had a few email addresses to send to, they’re a bit longer than these shown, and I only have a 15” monitor so I tend to drop code down a line so I can actually see it.

I’d tested the call using myself as recipient and it was only after adding the extra recipients the call stopped – but no error messages – well there was one in the log :-

 

Message

The mail could not be sent to the recipients because of the mail server failure. (Sending Mail using Account 2 (2007-08-01T08:58:17). Exception Message: Cannot send mails to mail server. (The specified string is not in the form required for an e-mail address.).

 

Which was very useful, I cut and pasted the email addresses into exchange and sent a mail with no problem. Had me puzzled for a while.

Posted by GrumpyOldDBA with 2 comment(s)
Filed under: