Consolidation, capacity planning and the art of asking questions
I was recently asked to investigate a performance issue
with a Sql server that had gradually started to die. Performance was dire and
connectivity had become sporadic and services were reguarly stopping and needed
to be restarted.
The event log was littered with errors like the
following being issued every second:
External dump process return code
0x20000001.
*
MSASN1
76190000 761A1FFF 00012000
2010-06-29 17:22:37.95
spid8s *
USERENV
76920000 769E1FFF 000c2000
2010-06-29 17:22:37.95
spid8s *
WINMM
76AA0000 76ACCFFF 0002d000
2010-06-29 17:22:37.95
spid8s *
opends60
333E0000 333E6FFF 00007000
2010-06-29 17:22:37.95
spid8s *
NETAPI32
71C40000 71C96FFF 00057000
2010-06-29 17:22:37.95
spid8s *
BatchParser
520C0000 520DEFFF 0001f000
2010-06-29 17:22:37.95
spid8s *
comctl32
77420000 77522FFF 00103000
2010-06-29 17:22:37.96
spid8s *
odbcint
00900000 00916FFF 00017000
2010-06-29 17:22:37.96
spid8s *
psapi
76B70000 76B7AFFF 0000b000
2010-06-29 17:22:37.96
spid8s *
instapi10
00A80000 00A89FFF 0000a000
2010-06-29 17:22:37.96
spid8s *
sqlevn70
4F610000 4F80DFFF 001fe000
2010-06-29 17:22:37.96
spid8s *
-------------------------------------------------------------------------------
2010-06-29
17:22:37.96 spid8s * Short Stack
Dump
2010-06-29 17:22:37.96 spid8s 00000000
Module(UNKNOWN+00000000)
2010-06-29 17:22:37.96
spid8s 01792A21
Module(sqlservr+00792A21)
2010-06-29 17:22:37.96
spid8s 015CCD18
Module(sqlservr+005CCD18)
2010-06-29 17:22:37.96
spid8s 015CCB91
Module(sqlservr+005CCB91)
2010-06-29 17:22:37.96
spid8s 015CB9B7
Module(sqlservr+005CB9B7)
2010-06-29 17:22:37.96
spid8s 015CB817
Module(sqlservr+005CB817)
2010-06-29 17:22:37.96
spid8s 015CBA7B
Module(sqlservr+005CBA7B)
2010-06-29 17:22:37.96
spid8s 015BE5A5
Module(sqlservr+005BE5A5)
2010-06-29 17:22:37.96
spid8s 015BE6D6
Module(sqlservr+005BE6D6)
2010-06-29 17:22:37.96
spid8s 015BE38F
Module(sqlservr+005BE38F)
2010-06-29 17:22:37.96
spid8s 0112F47D
Module(sqlservr+0012F47D)
2010-06-29 17:22:37.96
spid8s 0112E2FA
Module(sqlservr+0012E2FA)
2010-06-29 17:22:37.96
spid8s 015943B9
Module(sqlservr+005943B9)
2010-06-29 17:22:37.97
spid8s 0112E8E8
Module(sqlservr+0012E8E8)
2010-06-29 17:22:37.97
spid8s 781329BB
Module(MSVCR80+000029BB)
2010-06-29 17:22:37.97
spid8s 78132A47
Module(MSVCR80+00002A47)
2010-06-29 17:22:37.97
spid8s Stack Signature for the dump is
0x50D4088A
2010-06-29 17:22:38.37 spid8s
External dump process return code 0x20000001.
Upon investigation I found that there were 5 active
instances that were not intensly loaded. This would normally not be a problem
for a 4 core server. But then I examined the RAM for the server and there was
only 2Gb. One of the instances
(the latest to be utilised) had over a 1000 active user connections and NONE of the
instances had max server memory set.It was clear was what happening after looking at the
paging activity. Basically there was memory contention between the various Sql Server instances which had
brought the server to its knees.
This server had gradually had more and more instances allocated to it and
because the CPU had not maxed out it was considered to have more headroom. No
one ever thought of looking at RAM or even the Disk or Network subsystems
to see if they were sufficient to the database that was going to be hosted. The
last instance to be consolidated which actually turned out to be the biggest,
was litterally the straw that broke the camels back.
Now I know we are all pushed to find cost savings and consolidate as much as
possible onto existing hardware but the learning from all of this is to ask
questions when you get requested to host a new Instance (or even a DB) on an
existing server. Its not enough to work off predicted CPU and database size. You
also need to ask questions like how many users is it going to support, what kind
of disk throughput is it going to experience, whats the network load likely to
be and most importantly what kind of memory signature is it likely to have.
Microsoft recommends setting max server on all Sql Servers but this is
definatly the case when you have multiple instances running on the server. Doing
this kind of exercise will alert you immediatly to any issues regarding the
amount of actual physical memory and make sure you never end up running 5
instances on 2Gb of ram.
@Blakmk