I only seem to get inspired when I'm totally immersed in a
project. Having been busy for a while with a few projects requiring me to either
consolidate or upgrade servers to 2008 and 2008 R2, I finally decided to put my
resistance aside and blog. This particular post is devoted to CPU
configuration.
CPU affinity
One of the first things I like to do when I begin
to consolidate a server is to calculate how many CPUs it needs allocated to it.
I often create a spreadsheet similar to the following to make this more
visible:
|            | NUMA Node 0 |       | NUMA Node 1 |       | NUMA Node 2 |       |
|            | CPU 0       | CPU 1 | CPU 2       | CPU 3 | CPU 4       | CPU 5 |
| Instance 1 | Y           | Y     |             |       |             |       |
| Instance 2 |             |       | Y           | Y     |             |       |
| Instance 3 |             |       |             |       | Y           | Y     |
| Instance 4 |             |       | Y           | Y     | Y           | Y     |
On hardware with 16 or more processors, my preference is
to reserve CPU 0 for exclusive use by the operating system. Whenever I set CPU
affinity for a server, I also set I/O affinity to the same CPU. I have not
yet found a reason to change this.
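SQL Server's affinity mask setting is a bitmap in which bit n stands for CPU n, so a plan like the spreadsheet above can be turned into mask values with a few lines of code. The following is only a rough sketch: the function names and the `PLAN` dictionary are my own illustrative choices, and the CPU numbers are simply copied from the example table.

```python
def affinity_mask(cpus):
    """Return an integer with bit n set for every CPU number n in cpus."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return mask


# Hypothetical plan: CPU numbers per instance, copied from the example table.
PLAN = {
    "Instance 1": [0, 1],
    "Instance 2": [2, 3],
    "Instance 3": [4, 5],
    "Instance 4": [2, 3, 4, 5],
}


def plan_masks(plan):
    """Map each instance name to the mask for its list of CPUs."""
    return {name: affinity_mask(cpus) for name, cpus in plan.items()}


if __name__ == "__main__":
    for name, mask in plan_masks(PLAN).items():
        print(f"{name}: affinity mask = {mask} (binary {mask:06b})")
```

Each resulting value is what you would enter as the affinity mask for that instance. Check the numbers against your own CPU layout before applying anything.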
Now I know some of you may find CPU binding a bit stingy, but there
are a few reasons why I do this:
Less is more
I like to set CPU and I/O affinity on consolidated
servers and servers with many CPUs (12-16). My ultimate rule for this is
that sometimes less is more. We tend to think that the more resources we
give something, the faster it will run. Sometimes it does, and sometimes it does not.
I have seen batch jobs complete 4 times faster on old servers with
4 CPUs than on modern servers with 16. Why? NUMA and multiplexing. There is an
overhead when accessing memory from foreign NUMA nodes (discussed in more detail below).
Modern processors work most effectively when they are allocated a
similar kind of workload, one where they can easily utilise and reuse the Level
1, 2 and 3 cache. Each time a different task is scheduled onto a processor, it has to
reload its cache. It can also be the case (and often is) that the
processor is not available to give its full attention to the process in
hand.
The more you get the more you want
I have often seen with consolidation projects that
developers quickly forget about being efficient with code. With the advent of
new hardware, all the abysmal code that runs like a dog gets forgotten. For a
while the problems go away, but instead of being contained on its own server, the
problem has now infected other databases with its contention for resources. That's why
I like CPU affinity: I can easily silo off performance issues and
force developers to tune application issues as they arise.
Level 2 & 3 Cache
Careful examination of TPC benchmarks will give you a real insight into the
effects of Level 2 and 3 cache. In essence, this cache gives the processor
ultra-fast memory access for the temporary storage of calculations and
lookup data. A larger cache will often give better results for SQL Server
processing than faster processors. Modern HP blade servers have
even sacrificed Level 2 cache for the sake of much larger Level 3 caches. I
believe this is a worthwhile sacrifice and worth considering when looking for the
ultimate bang for your buck.
Non Uniform Memory Access (NUMA)
NUMA is now enabled on most modern hardware. The subject of NUMA
is quite large, but in short, NUMA governs the way in which memory banks
are shared between groups of cores. Memory access between NUMA nodes
is noticeably slower than local access, so there is a big advantage in having
SQL Server processes access memory on their local NUMA node. Using CPU affinity
to bind SQL Server instances to a specific NUMA node or group of NUMA nodes
increases the likelihood that the memory will be local, and therefore increases
performance.
When configuring CPU affinity you should always keep in mind which NUMA
nodes relate to which CPUs.
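One way to keep the CPU-to-node relationship in view is to write it down as a lookup and check each planned binding against it. This is a minimal sketch, assuming the layout from the example spreadsheet (two CPUs per NUMA node). `CPU_TO_NODE` and both function names are hypothetical, and the real mapping has to come from your own hardware.

```python
# Assumed layout (two CPUs per NUMA node), not read from real hardware.
CPU_TO_NODE = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}


def nodes_for(cpus, cpu_to_node=CPU_TO_NODE):
    """Return the sorted list of NUMA nodes touched by the given CPUs."""
    return sorted({cpu_to_node[cpu] for cpu in cpus})


def spans_multiple_nodes(cpus, cpu_to_node=CPU_TO_NODE):
    """True when a binding crosses a NUMA node boundary."""
    return len(nodes_for(cpus, cpu_to_node)) > 1


if __name__ == "__main__":
    for name, cpus in {"Instance 1": [0, 1], "Instance 4": [2, 3, 4, 5]}.items():
        print(name, "uses nodes", nodes_for(cpus),
              "| crosses nodes:", spans_multiple_nodes(cpus))
```

A binding that crosses nodes is not necessarily wrong, but it is worth knowing about before you commit to it.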
Hyperthreading
One of my clients asked me recently whether it was
good to use hyperthreading with SQL Server 2008 R2, and while my initial answer was
no, I decided to do a little research. Some people I spoke to mentioned it would not
be an issue with modern chipsets, and others veered away from the subject. As
usual with SQL Server, the answer came back: it depends. Ultimately, without
any compelling reason to use hyperthreading, my advice to anyone would be to
avoid it.
UPDATE
See this blog post about my latest opinion on this issue:
@blakmk
I recently found myself having to fail over a
database mirror due to hardware issues. I was pretty horrified that, to do this,
one of the applications had to be reinstalled because it did not support the
failover partner parameter in the connection string. It's not the first time I
have encountered this situation with Java-based apps, even though Java itself does
support the failover partner specification.
Unhappy with this and wanting more flexibility, I started to
implement a solution where all application database connections are made via a
DNS alias rather than a specific server name. This enables a transparent
redirect of application connections just by modifying the
DNS alias.
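As a rough illustration of the idea, here is how an application might build its ODBC connection string from an alias instead of a physical server name. Everything specific here is an assumption for the sake of the example: the alias `sales-db.example.internal`, the database name, the function name and the default driver name are all placeholders.

```python
def build_connection_string(server_alias, database,
                            driver="ODBC Driver 17 for SQL Server"):
    """Build an ODBC connection string that targets a DNS alias.

    The application never sees the physical server name, so repointing
    the alias in DNS redirects it without touching this configuration.
    """
    parts = [
        f"Driver={{{driver}}}",      # braces are part of the ODBC syntax
        f"Server={server_alias}",    # DNS alias, not the real host name
        f"Database={database}",
        "Trusted_Connection=yes",    # assumes Windows authentication
    ]
    return ";".join(parts) + ";"


if __name__ == "__main__":
    # Hypothetical alias and database name, for illustration only.
    print(build_connection_string("sales-db.example.internal", "Sales"))
```

After a failover, the only change is the DNS record behind the alias. Bear in mind that clients may hold on to the old address until their DNS cache expires.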
While it does not smoke out all the nasty little
Excel/Access and bespoke apps that still make connections without a failover
partner, once they are pointed at the DNS alias there are no more
manual tasks to reconfigure ODBC aliases or ini files.
Job done.
@blakmk
I was recently asked to investigate a performance issue
with a SQL Server that had gradually started to die. Performance was dire,
connectivity had become sporadic, and services were regularly stopping and needing
to be restarted.
The event log was littered with errors like the
following, issued every second:
External dump process return code 0x20000001.
* MSASN1 76190000 761A1FFF 00012000
2010-06-29 17:22:37.95 spid8s * USERENV 76920000 769E1FFF 000c2000
2010-06-29 17:22:37.95 spid8s * WINMM 76AA0000 76ACCFFF 0002d000
2010-06-29 17:22:37.95 spid8s * opends60 333E0000 333E6FFF 00007000
2010-06-29 17:22:37.95 spid8s * NETAPI32 71C40000 71C96FFF 00057000
2010-06-29 17:22:37.95 spid8s * BatchParser 520C0000 520DEFFF 0001f000
2010-06-29 17:22:37.95 spid8s * comctl32 77420000 77522FFF 00103000
2010-06-29 17:22:37.96 spid8s * odbcint 00900000 00916FFF 00017000
2010-06-29 17:22:37.96 spid8s * psapi 76B70000 76B7AFFF 0000b000
2010-06-29 17:22:37.96 spid8s * instapi10 00A80000 00A89FFF 0000a000
2010-06-29 17:22:37.96 spid8s * sqlevn70 4F610000 4F80DFFF 001fe000
2010-06-29 17:22:37.96 spid8s * -------------------------------------------------------------------------------
2010-06-29 17:22:37.96 spid8s * Short Stack Dump
2010-06-29 17:22:37.96 spid8s 00000000 Module(UNKNOWN+00000000)
2010-06-29 17:22:37.96 spid8s 01792A21 Module(sqlservr+00792A21)
2010-06-29 17:22:37.96 spid8s 015CCD18 Module(sqlservr+005CCD18)
2010-06-29 17:22:37.96 spid8s 015CCB91 Module(sqlservr+005CCB91)
2010-06-29 17:22:37.96 spid8s 015CB9B7 Module(sqlservr+005CB9B7)
2010-06-29 17:22:37.96 spid8s 015CB817 Module(sqlservr+005CB817)
2010-06-29 17:22:37.96 spid8s 015CBA7B Module(sqlservr+005CBA7B)
2010-06-29 17:22:37.96 spid8s 015BE5A5 Module(sqlservr+005BE5A5)
2010-06-29 17:22:37.96 spid8s 015BE6D6 Module(sqlservr+005BE6D6)
2010-06-29 17:22:37.96 spid8s 015BE38F Module(sqlservr+005BE38F)
2010-06-29 17:22:37.96 spid8s 0112F47D Module(sqlservr+0012F47D)
2010-06-29 17:22:37.96 spid8s 0112E2FA Module(sqlservr+0012E2FA)
2010-06-29 17:22:37.96 spid8s 015943B9 Module(sqlservr+005943B9)
2010-06-29 17:22:37.97 spid8s 0112E8E8 Module(sqlservr+0012E8E8)
2010-06-29 17:22:37.97 spid8s 781329BB Module(MSVCR80+000029BB)
2010-06-29 17:22:37.97 spid8s 78132A47 Module(MSVCR80+00002A47)
2010-06-29 17:22:37.97 spid8s Stack Signature for the dump is 0x50D4088A
2010-06-29 17:22:38.37 spid8s External dump process return code 0x20000001.
Upon investigation I found that there were 5 active
instances, none of them intensely loaded. This would not normally be a problem
for a 4-core server. But then I examined the RAM for the server and there was
only 2 GB. One of the instances
(the latest to be utilised) had over 1,000 active user connections, and NONE of the
instances had max server memory set. After looking at the paging activity it was
clear what was happening: memory contention between the various SQL Server
instances had brought the server to its knees.
This server had gradually had more and more instances allocated to it, and
because the CPU had not maxed out it was considered to have more headroom. No
one ever thought of looking at RAM, or even the disk or network subsystems,
to see if they were sufficient for the database that was going to be hosted. The
last instance to be consolidated, which actually turned out to be the biggest,
was literally the straw that broke the camel's back.
Now I know we are all pushed to find cost savings and consolidate as much as
possible onto existing hardware, but the lesson from all of this is to ask
questions when you are asked to host a new instance (or even a database) on an
existing server. It's not enough to work from predicted CPU and database size. You
also need to ask questions like: how many users is it going to support, what kind
of disk throughput is it going to experience, what is the network load likely to
be and, most importantly, what kind of memory footprint is it likely to have?
Microsoft recommends setting max server memory on all SQL Servers, and this is
definitely the case when you have multiple instances running on one server. Doing
this kind of exercise will alert you immediately to any issues regarding the
amount of actual physical memory and make sure you never end up running 5
instances on 2 GB of RAM.
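To show the kind of exercise I mean, here is a back-of-the-envelope sketch that splits physical RAM between instances after holding some back for the operating system. The 512 MB reserve, the equal weights, the instance names and the function name are all assumptions for illustration, not recommendations.

```python
def max_server_memory_plan(total_mb, os_reserve_mb, weights):
    """Split RAM left after the OS reserve between instances by weight.

    Returns a dict of instance name -> max server memory in whole MB.
    """
    if os_reserve_mb >= total_mb:
        raise ValueError("OS reserve leaves no memory for SQL Server")
    available = total_mb - os_reserve_mb
    total_weight = sum(weights.values())
    return {name: available * weight // total_weight
            for name, weight in weights.items()}


if __name__ == "__main__":
    # The server from this post: 2 GB of RAM and 5 instances.
    # Equal weights and a 512 MB OS reserve are illustrative assumptions.
    instances = {f"Instance {i}": 1 for i in range(1, 6)}
    for name, mb in max_server_memory_plan(2048, 512, instances).items():
        print(f"{name}: max server memory = {mb} MB")
```

With those assumed numbers each instance is left with roughly 300 MB, which makes the problem obvious long before the server falls over.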
@Blakmk