23 July 2011 14:12
tonyrogerson
Discovery - when your IO Subsystem out performs processor Cores - how parallelisation really is our friend
CPU
Saturation – an over performing IO subsystem
It’s not often that you see the IO subsystem able to out
drive the cores in the box. To demonstrate this behaviour the
TestGuid_HeapInsert table (see http://sqlblogcasts.com/blogs/tonyrogerson/archive/2011/07/19/use-guid-s-uniqueidentifier-for-keys-when-storing-data-on-ssd-s-part-2-initial-database-schema.aspx
for set up scripts), a 50 million row table (a heap) 54GBytes in size will be
used, processor affinity will be used to lock SQL Server to a specific number
of available cores.
Note: the equipment for this test is described here: http://sqlblogcasts.com/blogs/tonyrogerson/archive/2011/07/22/use-guid-s-uniqueidentifier-for-keys-when-storing-data-on-ssd-s-part-3-initial-base-line-with-iometer-and-first-test-group.aspx. In summary, Windows 2008 R2 x64, SQL 2008 R2 x64, 16GB of RAM but SQL Server is limited to 1GB, processor is a AMD Phenom II X6 Six Core 1090T Black Edition.
Interestingly this behaviour will not show up when using the
COUNT(*) function without any WHERE clause, performance is consistent regardless
of the number of cores used, however, once you have a Compute Scalar in the
plan then the core scalability issue described kicks in.
select COUNT(*) from TestGuid_HeapINSERT with
( index = 0 )
Average Disk Bytes/read 404,396
Average Disk Sec/read 0.001
Average Disk Reads/sec 2,806
Average Read Bytes/sec 1,134,827,393
The SQL below has been used; the index hint used to simplify
the experiment and make sure a table scan is being performed:
select MAX( cast( expandrow as bigint ) ) from TestGuid_HeapINSERT with
( index = 0 )
|
Cores
|
Avg Disk Bytes/Read
|
Avg Disk Sec/Read
|
Avg Disk Reads/Sec
|
Avg Read Bytes/Sec
|
Logical Reads
|
Read-Ahead Reads
|
CPU Time (ms)
|
Elapsed (ms)
|
|
1
|
411,619
|
0.000
|
622
|
256,306,080
|
7,142,864
|
7,142,864
|
211,225
|
228,679
|
|
2
|
406,215
|
0.001
|
1,244
|
506,625,778
|
7,142,864
|
7,142,864
|
216,249
|
116,102
|
|
3
|
407,495
|
0.001
|
1,833
|
746,940,222
|
7,142,864
|
7,142,754
|
220,336
|
78,957
|
|
4
|
409,191
|
0.001
|
2,392
|
978,956,526
|
7,142,864
|
7,142,864
|
223,797
|
60,129
|
|
5
|
408,588
|
0.001
|
2,795
|
1,142,225,279
|
7,142,864
|
7,142,864
|
225,248
|
51,388
|
|
6
|
408,588
|
0.001
|
3,270
|
1,324,789,338
|
7,142,864
|
7,142,864
|
229,429
|
44,631
|
Reviewing the figures above Average Disk Bytes per read, Average Disk Sec/Read,
Logical Reads, Read-ahead reads and CPU Time (ms) remain fairly constant,
however Average Disk Reads/Sec and Average Read Bytes/Sec are stepped and
Elapsed (ms) reduces dramatically with a second core then savings diminish as
cores are added.
|
Cores
|
Avg Disk Reads/Sec
|
Avg Read Bytes/Sec
|
Elapsed (ms)
|
|
2
|
622
|
250,319,698
|
-112,577
|
|
3
|
589
|
240,314,444
|
-37,145
|
|
4
|
559
|
232,016,304
|
-18,828
|
|
5
|
403
|
163,268,753
|
-8,741
|
|
6
|
475
|
182,564,059
|
-6,757
|
It should be noted, this problem isn’t actually specific to
the IO subsystem, and the problem occurs if you have enough of your table in
the buffer pool too, essentially the core cannot get the data through quick
enough.
Relating to the real
world
The figures above show that given a query that has no
parallelisation and therefore runs on a single core the duration will be 512% of
the capability of the box should all cores be used and will use only 19% of the
availability Read Bytes/sec maximum.
In past versions of SQL Server we have been used to turning
parallelism off because it more often than not extended the duration of the query,
which was fixed in SQL Server 2008 where the parallelism does now work well.
However, legacy systems and code and a perpetual myth that needs breaking is
that MAXDOP should be removed and just let SQL Server get on with it.
It is true that on a system with a number of concurrent
connections that this problem will balance itself out, however, what about the
sequential overnight batch jobs that so many of us have?
No direct advice here (yet), may be that will come when I
finally do my conclusions around these SSD benchmarks, but you need to be aware
that this problem is here and as IO subsystems perform better (which mine does)
then these issues will need addressing.