July 2010 - Posts

So I only seem to get inspired when I'm totally immersed in a project. Having been busy for a while with a few projects requiring me to either consolidate or upgrade servers to 2008 and 2008 R2, I finally decided to put my resistance aside and blog. This particular post is devoted to CPU configuration.

CPU affinity

One of the first things I like to do when I begin to consolidate a server is to calculate how many CPUs need to be allocated to it. Often I create a spreadsheet similar to the following to make this more visible:

             NUMA Node 0      NUMA Node 1      NUMA Node 2
             CPU 0   CPU 1    CPU 2   CPU 3    CPU 4   CPU 5
Instance 1    Y       Y
Instance 2                     Y       Y
Instance 3                                      Y       Y
Instance 4                     Y       Y        Y       Y

On hardware with 16-plus processors, my preference is to reserve CPU 0 for exclusive use by the operating system. Whenever I set CPU affinity for a server, I will also set I/O affinity to the same CPUs. I have not yet found a reason to change this.
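To make that concrete, here is a rough T-SQL sketch of the binding for Instance 2 in the table above, assuming bit n of the mask corresponds to CPU n (so CPUs 2 and 3 give a mask of 12). The values are illustrative only and depend on your own CPU layout:

-- Minimal sketch: bind an instance to CPUs 2 and 3 (bits 2 + 3 = 0x0C = 12)
-- and set I/O affinity to the same CPUs. Mask values are illustrative only.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sp_configure 'affinity mask', 12;
EXEC sp_configure 'affinity I/O mask', 12;
-- SQL Server may object when the two masks overlap; if it does,
-- RECONFIGURE WITH OVERRIDE forces the setting through.
RECONFIGURE WITH OVERRIDE;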

Now I know some of you may find CPU binding a bit stingy, but there are a few reasons why I do this:

Less is more

I like to set CPU and I/O affinity on consolidated servers and servers with many CPUs (12-16). My rule for this is: sometimes, less is more. We often assume that the more resources we give something, the faster it will run. Sometimes that simply isn't true.

I have seen batch jobs complete four times faster on old servers with 4 CPUs than on modern servers with 16. Why? NUMA and multiplexing. There is an overhead when accessing memory from foreign NUMA nodes (discussed in more detail below). Modern processors work most effectively when they are allocated a similar kind of workload, one where they can easily utilise and reuse the Level 1, 2 and 3 caches. Each time a new task is allocated to a processor, it has to reload its cache. It can also be the case (and often is) that the processor is not available to give its full attention to the process in hand.

The more you get the more you want

I have often seen on consolidation projects that developers quickly forget about being efficient with code. With the advent of new hardware, all the abysmal code that runs like a dog gets forgotten. For a while the problems go away, but instead of being contained on its own server, the problem has infected other databases with its contention for resources. That's why I like CPU affinity: I can easily silo off performance issues and again force developers to tune application issues as they arise.


Level 2 & 3 Cache

Careful examination of TPC benchmarks will give you a real insight into the effects of Level 2 and 3 cache. In essence, this cache gives the processor ultra-fast memory access for the temporary storage of calculations and lookup data. A larger cache in this area will often give better results for SQL Server processing than faster processors. Modern HP blade servers have even sacrificed Level 2 cache for the sake of much larger Level 3 caches. I believe this is a worthwhile sacrifice and worth considering when looking for the ultimate bang for your buck.

Non Uniform Memory Access (NUMA)

NUMA is now enabled on most modern hardware. The subject of NUMA is quite large, but in short, NUMA governs the way in which memory banks are shared between groups of cores. Memory access between NUMA nodes is quite slow, so there is a big advantage in having SQL Server processes access memory on their local NUMA node. Using CPU affinity to bind a SQL Server instance to a specific NUMA node (or group of nodes) increases the likelihood that its memory will be local, and therefore increases performance.

When configuring CPU affinity, you should always have in mind which NUMA nodes relate to which CPUs.
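As a rough aid, you can check that mapping from inside SQL Server by querying sys.dm_os_schedulers, which exposes the parent NUMA node of each CPU's scheduler:

-- Minimal sketch: list which NUMA node each online scheduler/CPU belongs to.
SELECT scheduler_id,
       cpu_id,
       parent_node_id   -- the NUMA node this scheduler sits on
FROM   sys.dm_os_schedulers
WHERE  status = 'VISIBLE ONLINE'
ORDER  BY parent_node_id, cpu_id;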


Hyperthreading

One of my clients recently asked me whether it was good to use hyperthreading in SQL Server 2008 R2, and while my initial answer was no, I decided to do a little research. Some people I spoke to mentioned it would not be an issue with modern chipsets, and others veered away from the subject. As usual with SQL Server, the answer came back: it depends... Ultimately, without any compelling reason to use hyperthreading, my advice to anyone would be to avoid it.
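For what it's worth, a quick way to see whether hyperthreading looks enabled on a box is sys.dm_os_sys_info, with the caveat that the ratio also counts multiple cores per socket, so on its own it cannot distinguish hyperthreading from multi-core:

-- Minimal sketch: logical CPUs per socket vs. total logical CPUs.
SELECT cpu_count,                                  -- logical CPUs visible to SQL Server
       hyperthread_ratio,                          -- logical CPUs per physical socket
       cpu_count / hyperthread_ratio AS sockets    -- physical processor packages
FROM   sys.dm_os_sys_info;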

UPDATE

See this blog post about my latest opinion on this issue:

@blakmk


I recently found myself having to fail over a database mirror due to hardware issues. I was pretty horrified that, to do this, one of the applications had to be reinstalled because it did not support the failover partner parameter in the connection string. It's not the first time I have encountered this situation with Java-based apps, even though the Java driver does support the failover partner specification.

Unhappy with this and wanting more flexibility, I started to implement a solution where all application database connections connected via a DNS alias rather than a specific server name. This enabled a transparent redirect of application connections simply by modifying the DNS alias.
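As a small sketch of the pattern (the alias name here is hypothetical): applications connect to something like sqlalias.mycompany.local instead of the physical host, and after repointing the alias you can confirm where connections are actually landing:

-- Minimal sketch: run this over a connection made via the DNS alias.
-- After the alias is repointed, it should return the new physical server.
SELECT @@SERVERNAME                    AS server_and_instance,
       SERVERPROPERTY('MachineName')   AS physical_machine;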

While it does not smoke out all the nasty little Excel/Access and bespoke apps that still make connections without a failover partner, once they are pointed at the DNS alias there are no more manual tasks to reconfigure ODBC aliases or INI files.

job done.....

@blakmk

I was recently asked to investigate a performance issue with a SQL Server that had gradually started to die. Performance was dire, connectivity had become sporadic, and services were regularly stopping and needing to be restarted.

The event log was littered with errors like the following being issued every second:

External dump process return code 0x20000001.

 
  * MSASN1                         76190000  761A1FFF  00012000
2010-06-29 17:22:37.95 spid8s      * USERENV                        76920000  769E1FFF  000c2000
2010-06-29 17:22:37.95 spid8s      * WINMM                          76AA0000  76ACCFFF  0002d000
2010-06-29 17:22:37.95 spid8s      * opends60                       333E0000  333E6FFF  00007000
2010-06-29 17:22:37.95 spid8s      * NETAPI32                       71C40000  71C96FFF  00057000
2010-06-29 17:22:37.95 spid8s      * BatchParser                    520C0000  520DEFFF  0001f000
2010-06-29 17:22:37.95 spid8s      * comctl32                       77420000  77522FFF  00103000
2010-06-29 17:22:37.96 spid8s      * odbcint                        00900000  00916FFF  00017000
2010-06-29 17:22:37.96 spid8s      * psapi                          76B70000  76B7AFFF  0000b000
2010-06-29 17:22:37.96 spid8s      * instapi10                      00A80000  00A89FFF  0000a000
2010-06-29 17:22:37.96 spid8s      * sqlevn70                       4F610000  4F80DFFF  001fe000

2010-06-29 17:22:37.96 spid8s      * -------------------------------------------------------------------------------
2010-06-29 17:22:37.96 spid8s      * Short Stack Dump
2010-06-29 17:22:37.96 spid8s      00000000 Module(UNKNOWN+00000000)
2010-06-29 17:22:37.96 spid8s      01792A21 Module(sqlservr+00792A21)
2010-06-29 17:22:37.96 spid8s      015CCD18 Module(sqlservr+005CCD18)
2010-06-29 17:22:37.96 spid8s      015CCB91 Module(sqlservr+005CCB91)
2010-06-29 17:22:37.96 spid8s      015CB9B7 Module(sqlservr+005CB9B7)
2010-06-29 17:22:37.96 spid8s      015CB817 Module(sqlservr+005CB817)
2010-06-29 17:22:37.96 spid8s      015CBA7B Module(sqlservr+005CBA7B)
2010-06-29 17:22:37.96 spid8s      015BE5A5 Module(sqlservr+005BE5A5)
2010-06-29 17:22:37.96 spid8s      015BE6D6 Module(sqlservr+005BE6D6)
2010-06-29 17:22:37.96 spid8s      015BE38F Module(sqlservr+005BE38F)
2010-06-29 17:22:37.96 spid8s      0112F47D Module(sqlservr+0012F47D)
2010-06-29 17:22:37.96 spid8s      0112E2FA Module(sqlservr+0012E2FA)
2010-06-29 17:22:37.96 spid8s      015943B9 Module(sqlservr+005943B9)
2010-06-29 17:22:37.97 spid8s      0112E8E8 Module(sqlservr+0012E8E8)
2010-06-29 17:22:37.97 spid8s      781329BB Module(MSVCR80+000029BB)
2010-06-29 17:22:37.97 spid8s      78132A47 Module(MSVCR80+00002A47)
2010-06-29 17:22:37.97 spid8s      Stack Signature for the dump is 0x50D4088A
2010-06-29 17:22:38.37 spid8s      External dump process return code 0x20000001.

Upon investigation I found that there were 5 active instances, none of them intensely loaded. This would normally not be a problem for a 4-core server. But then I examined the RAM: the server had only 2GB. One of the instances (the latest to be utilised) had over 1,000 active user connections, and NONE of the instances had max server memory set. After looking at the paging activity it was clear what was happening: memory contention between the various SQL Server instances had brought the server to its knees.

This server had gradually had more and more instances allocated to it, and because the CPU had never maxed out, it was considered to have more headroom. No one ever thought of looking at the RAM, or even the disk or network subsystems, to see if they were sufficient for the databases being hosted. The last instance to be consolidated, which actually turned out to be the biggest, was literally the straw that broke the camel's back.

Now I know we are all pushed to find cost savings and consolidate as much as possible onto existing hardware, but the lesson from all of this is to ask questions when you are requested to host a new instance (or even a database) on an existing server. It's not enough to work off predicted CPU and database size. You also need to ask questions like: how many users will it support, what kind of disk throughput will it generate, what is the network load likely to be, and, most importantly, what kind of memory signature is it likely to have?

Microsoft recommends setting max server memory on all SQL Servers, but it is absolutely essential when you have multiple instances running on the same server. Doing this kind of exercise will alert you immediately to any issues regarding the amount of actual physical memory, and make sure you never end up running 5 instances on 2GB of RAM.
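As a rough sketch (the figure is purely illustrative; size the caps so that the sum across all instances leaves headroom for the operating system):

-- Minimal sketch: cap this instance's memory. Repeat per instance with
-- values sized for your own hardware; 512 MB here is illustrative only.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sp_configure 'max server memory (MB)', 512;
RECONFIGURE;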

@Blakmk