Benchmarks
(SSD based)
In the following test I’ve tried to be realistic in terms of
what a real world query would do - joins! Ordinarily you’d have several tables
joining together; this is where you take the performance hit if a) you’ve an IO
subsystem that has high read latency and b) lots of fragmentation.
Benchmarks will be performed on SSD and then on rotational
disk, the first lot of benchmarks will focus on read queries, writing will be
done at a later date where I will discuss the hybrid approach of storage
design.
There are three versions of the table each held on its own
file group so that there is no object interleaving to interfere and give a
clear and clean set of numbers. The three versions are a) table populated with
50 million single INSERTS that has a clustered index on the guid column
(TestGuid_INSERT), b) a table populated from the table from a) but as a single
insert so there is no fragmentation (TestGuid_SingleINSERT) and finally c)
table populated with 50 million single INSERTS that is a heap and has a single
non-clustered index on the guid column (TestGUID_HeapINSERT), a padding column
of char(1024) is used to pad the row thus causing more page splitting.
The database set up script can be found in the blog article:
http://sqlblogcasts.com/blogs/tonyrogerson/archive/2011/07/19/use-guid-s-uniqueidentifier-for-keys-when-storing-data-on-ssd-s-part-2-initial-database-schema.aspx
Base lining the test
kit
To give benchmarks a back drop the capabilities of the kit
needs to be established, in this instance a number of measures need to be
established by placing the kit under specific load types. IOMeter has been used
to test raw grunt in terms of IO’s per second, throughput and read latency.
Kit
CPU and Memory
£154.16 - Gigabyte GA-890FXA-UD7 AMD 890FX (Socket AM3)
PCI-Express DDR3 Motherboard
£170.20 - AMD Phenom II X6 Six Core 1090T Black Edition
3.20GHz (Socket AM3)
£123.25 - 8GB Team Xtreem LV DDR3 PC3-16000 (9-11-9-27)
Dual Channel Low-Voltage kit
£123.25 - 8GB Team Xtreem LV DDR3 PC3-16000 (9-11-9-27)
Dual Channel Low-Voltage kit
Sub-Total: 570.86
IO Sub System
£108.36 - 80GB OCZ Vertex 2 SATA II 2.5" SSD Solid
State Disk
(285MB/sec read 275MB/sec
write)
£50.67 -
OCZSSDPX-1HSDL4P OCZ HSDL 3.5 INCH IBIS 4 port HSDL Card
£442.85 - 240GB OCZ IBIS 3.5-inch HSDL High-Speed Data Link
SSD
(read 740MB/s - write 720MB/s)
£442.85 - 240GB OCZ IBIS 3.5-inch HSDL High-Speed Data Link
SSD
(read 740MB/s - write 720MB/s)
Sub-Total: 1044.73
Build details
Operating System: Windows 2008 R2 x64
SQL Server 2008 R2 (10.50.1617) x64
C: on OCZ Vertex 2
E: as software RAID 0 over the two IBIS drives and formatted
with an allocation size of 64Kbytes, the two IBIS drives internally RAID 0 with
the default 64KB stripe.
Although there is 16GB of RAM in the box SQL Server 2008 R2
is fixed at min/max 1024 to put pressure on the buffer pool and thus stretch
the I/O system.
IOMeter tests on a 50GB file
104857600 512 sectors
Three transfer sizes have been chosen, 64KB (an extent), 8
KB (a page) and 256KB (read ahead reading multiple extents).
For the 8B and 64KB I’ve given figures for both ends of the
sequential and random read spectrum, for the 256KB test 50/50 was chosen
because that is more realistic.
|
Transfer Size
|
Read or Write
|
% Rand
|
% Seq
|
Outstanding IO
|
IO's
/sec
|
Mbytes /sec
|
Avg I/O (ms)
|
Max I/O (ms)
|
CPU %
|
|
64
|
Read
|
100
|
0
|
32
|
18644
|
1165
|
1.71
|
10.23
|
27.26
|
|
64
|
Read
|
100
|
0
|
64
|
21305
|
1331
|
3
|
5.91
|
26.68
|
|
64
|
Read
|
100
|
0
|
86
|
21523
|
1345
|
3.99
|
6.75
|
18.76
|
|
64
|
Read
|
0
|
100
|
32
|
22718
|
1419
|
1.4
|
2.39
|
36.57
|
|
64
|
Read
|
0
|
100
|
64
|
20122
|
1257
|
3.17
|
9.6
|
34.18
|
|
64
|
Read
|
0
|
100
|
86
|
21233
|
1327
|
4.05
|
6.83
|
44.39
|
|
8
|
Read
|
100
|
0
|
32
|
65627
|
512
|
0.48
|
13.05
|
30.49
|
|
8
|
Read
|
100
|
0
|
64
|
60915
|
475
|
1.05
|
3.13
|
46.24
|
|
8
|
Read
|
100
|
0
|
86
|
75519
|
590
|
1.13
|
22.73
|
32.95
|
|
8
|
Read
|
0
|
100
|
32
|
79506
|
621
|
0.4
|
1.9
|
29.79
|
|
8
|
Read
|
0
|
100
|
64
|
79039
|
617
|
0.81
|
2.52
|
55.61
|
|
8
|
Read
|
0
|
100
|
86
|
67485
|
527
|
1.27
|
19.27
|
21.24
|
|
256
|
Read
|
50
|
50
|
32
|
5211
|
1302
|
6.14
|
10.83
|
26.64
|
|
256
|
Read
|
50
|
50
|
64
|
5121
|
1280
|
12.46
|
21.5
|
31.56
|
|
256
|
Read
|
50
|
50
|
86
|
5200
|
1300
|
16.5
|
34.92
|
27.79
|
Populating the data
The clustered index with 50 million inserts took 15 hours 4
minutes to populate the table; the 50 million inserts into a heap took 7 hours
55 minutes.
The table below shows the results of
sys.dm_db_index_physical_stats, you can clearly see that TestGuid_INSERT
(having the clustered index) and built from 50 million inserts is somewhat
fragmented. The Heap contains almost exactly the same number of pages as the
leaf level of the TestGuid_SingleINSERT which has no fragmentation – that is
what I’d expect to see. The non-clustered index (index_id 2) which is the
primary key (nonclustered) is extremely badly fragmented which again is what is
expected because the insert order is purely random.
|
|
index id
|
index depth
|
index level
|
avg fragmentation in percent
|
fragment count
|
avg fragment size in pages
|
page count
|
avg page space used in percent
|
record count
|
min record size in bytes
|
max record size in bytes
|
avg record size in bytes
|
|
TestGuid_HeapInsert
|
0
|
1
|
0
|
3.4
|
31478
|
226.9
|
7142864
|
91
|
50000000
|
1047
|
1047
|
1047
|
|
TestGuid_HeapInsert
|
2
|
4
|
0
|
99.2
|
243713
|
1.0
|
243713
|
68
|
50000000
|
25
|
25
|
25
|
|
TestGuid_HeapInsert
|
2
|
4
|
1
|
99.9
|
1087
|
1.0
|
1087
|
69
|
243713
|
23
|
23
|
23
|
|
TestGuid_HeapInsert
|
2
|
4
|
2
|
50.0
|
4
|
1.0
|
4
|
84
|
1087
|
23
|
23
|
23
|
|
TestGuid_HeapInsert
|
2
|
4
|
3
|
0.0
|
1
|
1.0
|
1
|
1
|
4
|
23
|
23
|
23
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
TestGuid_SingleINSERT
|
1
|
4
|
0
|
0.7
|
48827
|
146.3
|
7142858
|
91
|
50000000
|
1047
|
1047
|
1047
|
|
TestGuid_SingleINSERT
|
1
|
4
|
1
|
0.0
|
47938
|
1.0
|
47938
|
46
|
7142858
|
23
|
23
|
23
|
|
TestGuid_SingleINSERT
|
1
|
4
|
2
|
0.0
|
321
|
1.0
|
321
|
46
|
47938
|
23
|
23
|
23
|
|
TestGuid_SingleINSERT
|
1
|
4
|
3
|
0.0
|
1
|
1.0
|
1
|
99
|
321
|
23
|
23
|
23
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
TestGuid_INSERT
|
1
|
4
|
0
|
99.1
|
10625949
|
1.0
|
10625950
|
61
|
50000000
|
1047
|
1047
|
1047
|
|
TestGuid_INSERT
|
1
|
4
|
1
|
100.0
|
47661
|
1.0
|
47661
|
69
|
10625950
|
23
|
23
|
23
|
|
TestGuid_INSERT
|
1
|
4
|
2
|
99.6
|
224
|
1.0
|
224
|
66
|
47661
|
23
|
23
|
23
|
|
TestGuid_INSERT
|
1
|
4
|
3
|
0.0
|
1
|
1.0
|
1
|
69
|
224
|
23
|
23
|
23
|
Simple Performance
Comparison (Test group 1)
A series of tests will be performed and blogged over the
coming weeks after which a conclusion will be drawn.
The first test is simply taking 50,000 of the 50 million rows;
each row is equally spaced throughout the table (see SQL below) and then used
to join back in to the big table in order to do a MAX. This tests performance
of random look ups, basically this test is a real world and what I’d expect in
most SQL Server databases with modest indexing strategies and maintenance.
--
-- TEST
1 (on fragmented INSERT with clustered index)
--
checkpoint
go
dbcc dropcleanbuffers
dbcc freeproccache
go
select guidcol
into #lookup
from ( select
guidcol, ROW_NUMBER() over( order by guidcol ) as rn
from TestGuid_INSERT )
as d
where d.rn %
1000 = 0
go
/*
perfmon:
avg
disk bytes/read 8192
avg
disk sec/read 0.000
avg
disk reads/sec 48,887
avg
read bytes/sec 400,501,565
Table 'TestGuid_INSERT'. Scan count 1, logical
reads 10673614, physical reads 25690, read-ahead reads 10709354, lob logical
reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server
Execution Times:
CPU time
= 73134 ms, elapsed time = 214243 ms.
*/
checkpoint
go
dbcc dropcleanbuffers
dbcc freeproccache
go
select max(
tg.expandrow )
from #lookup as l
inner join
TestGuid_INSERT as tg on
tg.guidcol = l.guidcol
go
/*
Table 'TestGuid_INSERT'. Scan count 0, logical
reads 351705, physical reads 6, read-ahead reads 196291, lob logical reads 0,
lob physical reads 0, lob read-ahead reads 0.
Table '#lookup'. Scan count 7, logical reads 156,
physical reads 0, read-ahead reads 5, lob logical reads 0, lob physical reads
0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0,
physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads
0, lob read-ahead reads 0.
SQL Server
Execution Times:
CPU time
= 2231 ms, elapsed time = 1756 ms.
*/
--
------
drop table #t,
#lookup
--
-- TEST
2 (on fragmented INSERT with heap)
--
checkpoint
go
dbcc dropcleanbuffers
dbcc freeproccache
go
select guidcol
into #lookup
from ( select
guidcol, ROW_NUMBER() over( order by guidcol ) as rn
from TestGuid_HEapINSERT )
as d
where d.rn %
1000 = 0
go
/*
perfmon:
avg
disk bytes/read 8211
avg
disk sec/read 0.000
avg
disk reads/sec 5,749
avg
read bytes/sec 47,213,526
Table 'TestGuid_HeapINSERT'. Scan count 1,
logical reads 244803, physical reads 444, read-ahead reads 278989, lob logical
reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server
Execution Times:
CPU time
= 22074 ms, elapsed time = 25424 ms.
*/
checkpoint
go
dbcc dropcleanbuffers
dbcc freeproccache
go
select max(
tg.expandrow )
from #lookup as l
inner join
TestGuid_HEapINSERT as tg on tg.guidcol = l.guidcol
go
/*
Table 'TestGuid_HeapINSERT'. Scan count 0,
logical reads 345645, physical reads 7, read-ahead reads 185834, lob logical
reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#lookup'. Scan count 7, logical reads 156,
physical reads 0, read-ahead reads 5, lob logical reads 0, lob physical reads
0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0,
physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads
0, lob read-ahead reads 0.
SQL Server
Execution Times:
CPU time
= 2571 ms, elapsed time = 1624 ms.
*/
--
------
drop table #t,
#lookup
--
-- TEST
3 (on single (non-fragmented) INSERT with clustered index)
--
checkpoint
go
dbcc dropcleanbuffers
dbcc freeproccache
go
select guidcol
into #lookup
from ( select
guidcol, ROW_NUMBER() over( order by guidcol ) as rn
from TestGuid_SingleINSERT )
as d
where d.rn %
1000 = 0
go
/*
perfmon:
avg
disk bytes/read 297,763
avg
disk sec/read 0.001
avg
disk reads/sec 3,221
avg
read bytes/sec 959,224,428
Table 'TestGuid_SingleINSERT'. Scan count 1,
logical reads 7190799, physical reads 21110, read-ahead reads 7190779, lob
logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server
Execution Times:
CPU time
= 44960 ms, elapsed time = 62299 ms.
*/
checkpoint
go
dbcc dropcleanbuffers
dbcc freeproccache
go
select max(
tg.expandrow )
from #lookup as l
inner join
TestGuid_SingleINSERT as tg on tg.guidcol = l.guidcol
go
/*
Table 'TestGuid_SingleINSERT'. Scan count 0,
logical reads 353439, physical reads 6, read-ahead reads 203706, lob logical
reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#lookup'. Scan count 7, logical reads 156,
physical reads 0, read-ahead reads 5, lob logical reads 0, lob physical reads
0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0,
physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads
0, lob read-ahead reads 0.
SQL Server
Execution Times:
CPU time
= 2386 ms, elapsed time = 1675 ms.
*/
Findings of Comparison
(Test group 1)
Population of #lookup
|
|
Avg Disk Bytes/read
|
Avg Disk Sec/Read
|
Avg Disk Reads/sec
|
Avg Read Bytes/sec
|
Logical Reads
|
Physical Reads
|
Read-ahead
|
CPU (ms)
|
Elapsed (ms)
|
|
TestGuid_INSERT
|
8,192
|
0.000
|
48,887
|
400,501,565
|
10,673,614
|
25,690
|
10,709,354
|
73,134
|
214,243
|
|
TestGuid_HeapINSERT
|
8,211
|
0.000
|
5,749
|
47,213,526
|
244,803
|
444
|
278,989
|
22,074
|
25,424
|
|
TestGuid_SingleINSERT
|
297,763
|
0.001
|
3,221
|
959,224,428
|
7,190,799
|
21,110
|
7,190,779
|
44,960
|
62,299
|
The Heap clearly comes out on top because SQL Server is
using the non-clustered index to get the list of rowguid’s to use so
significantly less data to read.
Interestingly the read-ahead isn’t reading anything greater
than a single page, this is what you’d expect because the read-ahead works best
on contiguous data hence the contiguous TestGuid_SingleINSERT table has a high
average bytes per read.
MAX( )
|
|
Logical Reads
|
Physical Reads
|
Read-ahead
|
CPU (ms)
|
Elapsed (ms)
|
|
TestGuid_INSERT
|
351,705
|
6
|
196,291
|
2,231
|
1,756
|
|
TestGuid_HeapINSERT
|
345,645
|
7
|
185,834
|
2,571
|
1,624
|
|
TestGuid_SingleINSERT
|
353,439
|
6
|
203,706
|
2,386
|
1,675
|
Given 50,000 rows (#lookup) against the 50 million row table
the results of the join are similar in all cases regardless of how fragmented
the data is.
Initial findings would suggest that random row lookups on
data stored on SSD have uniform seek performance regardless of fragmentation
and the size of the table because the nature of the join used in this case
Nested Loops.
Plan for lookup against TestGuid_INSERT
|--Stream
Aggregate(DEFINE:([Expr1005]=MAX([partialagg1006])))
|--Parallelism(Gather Streams)
|--Stream
Aggregate(DEFINE:([partialagg1006]=MAX([GUIDTest_SSD].[dbo].[TestGuid_INSERT].[expandrow]
as [tg].[expandrow])))
|--Nested Loops(Inner Join, OUTER REFERENCES:([l].[guidcol], [Expr1007])
OPTIMIZED WITH UNORDERED PREFETCH)
|--Table Scan(OBJECT:([tempdb].[dbo].[#lookup] AS [l]))
|--Clustered Index
Seek(OBJECT:([GUIDTest_SSD].[dbo].[TestGuid_INSERT].[PK__TestGuid__AB46272D7F60ED59]
AS [tg]), SEEK:([tg].[guidcol]=[tempdb].[dbo].[#lookup].[guidcol] as
[l].[guidcol]) ORDERED FORWARD)
Plan for lookup against TestGuid_HeapINSERT
|--Stream
Aggregate(DEFINE:([Expr1006]=MAX([partialagg1007])))
|--Parallelism(Gather Streams)
|--Stream
Aggregate(DEFINE:([partialagg1007]=MAX([GUIDTest_SSD].[dbo].[TestGuid_HeapINSERT].[expandrow]
as [tg].[expandrow])))
|--Nested Loops(Inner Join, OUTER REFERENCES:([Bmk1003], [Expr1011])
OPTIMIZED WITH UNORDERED PREFETCH)
|--Compute Scalar(DEFINE:([Expr1010]=BmkToPage([Bmk1003])))
| |--Nested Loops(Inner Join,
OUTER REFERENCES:([l].[guidcol], [Expr1009]) WITH UNORDERED PREFETCH)
| |--Sort(ORDER
BY:([l].[guidcol] ASC))
| | |--Table
Scan(OBJECT:([tempdb].[dbo].[#lookup] AS [l]))
| |--Index
Seek(OBJECT:([GUIDTest_SSD].[dbo].[TestGuid_HeapINSERT].[PK__TestGuid__AB46272C1273C1CD]
AS [tg]), SEEK:([tg].[guidcol]=[tempdb].[dbo].[#lookup].[guidcol] as
[l].[guidcol]) ORDERED FORWARD)
|--RID Lookup(OBJECT:([GUIDTest_SSD].[dbo].[TestGuid_HeapINSERT] AS [tg]),
SEEK:([Bmk1003]=[Bmk1003]) LOOKUP ORDERED FORWARD)
Plan for lookup against TestGuid_SingleINSERT
|--Stream
Aggregate(DEFINE:([Expr1005]=MAX([partialagg1006])))
|--Parallelism(Gather Streams)
|--Stream
Aggregate(DEFINE:([partialagg1006]=MAX([GUIDTest_SSD].[dbo].[TestGuid_SingleINSERT].[expandrow]
as [tg].[expandrow])))
|--Nested Loops(Inner Join, OUTER REFERENCES:([l].[guidcol], [Expr1007])
OPTIMIZED WITH UNORDERED PREFETCH)
|--Table Scan(OBJECT:([tempdb].[dbo].[#lookup] AS [l]))
|--Clustered Index
Seek(OBJECT:([GUIDTest_SSD].[dbo].[TestGuid_SingleINSERT].[PK__TestGuid__AB46272D08EA5793]
AS [tg]), SEEK:([tg].[guidcol]=[tempdb].[dbo].[#lookup].[guidcol] as
[l].[guidcol]) ORDERED FORWARD)
In the next blog instalment I’ll look at how the I/O system
can outperform the cores in the box creating a CPU bottleneck, we’ll then
return to more tests and finally a conclusion.
Any comments feel free to email me privately tonyrogerson@torver.net