July 2011 - Posts

CPU Saturation – an over-performing IO subsystem

It’s not often that you see the IO subsystem able to out-drive the cores in the box. To demonstrate this behaviour the TestGuid_HeapINSERT table (see http://sqlblogcasts.com/blogs/tonyrogerson/archive/2011/07/19/use-guid-s-uniqueidentifier-for-keys-when-storing-data-on-ssd-s-part-2-initial-database-schema.aspx for the set-up scripts) will be used – a 50 million row heap, 54GBytes in size; processor affinity is used to lock SQL Server to a specific number of the available cores.

Note: the equipment used for this test is described here: http://sqlblogcasts.com/blogs/tonyrogerson/archive/2011/07/22/use-guid-s-uniqueidentifier-for-keys-when-storing-data-on-ssd-s-part-3-initial-base-line-with-iometer-and-first-test-group.aspx. In summary: Windows 2008 R2 x64, SQL Server 2008 R2 x64, 16GB of RAM (SQL Server limited to 1GB) and an AMD Phenom II X6 Six Core 1090T Black Edition processor.
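The affinity commands themselves aren’t shown in this post; a minimal sketch of restricting SQL Server 2008 R2 to a given number of cores with sp_configure would look something like the following – 'affinity mask' is a bit mask, so 1 = core 0 only, 3 = cores 0 and 1, 63 = all six cores:

exec sp_configure 'show advanced options', 1;
reconfigure;
exec sp_configure 'affinity mask', 3;    -- e.g. lock SQL Server to two of the six cores
reconfigure;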

Interestingly, this behaviour does not show up when using COUNT(*) without a WHERE clause – performance is consistent regardless of the number of cores used. Once there is a Compute Scalar in the plan, however, the core scalability issue described here kicks in.

select COUNT(*) from TestGuid_HeapINSERT with ( index = 0 )

Average Disk Bytes/read              404,396
Average Disk Sec/read                  0.001
Average Disk Reads/sec                2,806
Average Read Bytes/sec               1,134,827,393

The SQL below has been used; the index hint is there to simplify the experiment and make sure a table scan is performed:

select MAX( cast( expandrow as bigint ) ) from TestGuid_HeapINSERT with ( index = 0 )

| Cores | Avg Disk Bytes/Read | Avg Disk Sec/Read | Avg Disk Reads/Sec | Avg Read Bytes/Sec | Logical Reads | Read-Ahead Reads | CPU Time (ms) | Elapsed (ms) |
|-------|---------------------|-------------------|--------------------|--------------------|---------------|------------------|---------------|--------------|
| 1 | 411,619 | 0.000 | 622   | 256,306,080   | 7,142,864 | 7,142,864 | 211,225 | 228,679 |
| 2 | 406,215 | 0.001 | 1,244 | 506,625,778   | 7,142,864 | 7,142,864 | 216,249 | 116,102 |
| 3 | 407,495 | 0.001 | 1,833 | 746,940,222   | 7,142,864 | 7,142,754 | 220,336 | 78,957  |
| 4 | 409,191 | 0.001 | 2,392 | 978,956,526   | 7,142,864 | 7,142,864 | 223,797 | 60,129  |
| 5 | 408,588 | 0.001 | 2,795 | 1,142,225,279 | 7,142,864 | 7,142,864 | 225,248 | 51,388  |
| 6 | 408,588 | 0.001 | 3,270 | 1,324,789,338 | 7,142,864 | 7,142,864 | 229,429 | 44,631  |


Reviewing the figures above, Average Disk Bytes/Read, Average Disk Sec/Read, Logical Reads, Read-Ahead Reads and CPU Time (ms) remain fairly constant; however, Average Disk Reads/Sec and Average Read Bytes/Sec step up with each core, and Elapsed (ms) drops dramatically with a second core, with diminishing savings as further cores are added. The table below shows the gain from each additional core:

| Cores | Δ Avg Disk Reads/Sec | Δ Avg Read Bytes/Sec | Δ Elapsed (ms) |
|-------|----------------------|----------------------|----------------|
| 2 | 622 | 250,319,698 | -112,577 |
| 3 | 589 | 240,314,444 | -37,145  |
| 4 | 559 | 232,016,304 | -18,828  |
| 5 | 403 | 163,268,753 | -8,741   |
| 6 | 475 | 182,564,059 | -6,757   |

It should be noted that this problem isn’t actually specific to the IO subsystem; it also occurs if you have enough of your table in the buffer pool – essentially a single core cannot get through the data quickly enough.

Relating to the real world

The figures above show that a query with no parallelisation, and which therefore runs on a single core, takes 512% of the elapsed time the box is capable of when all cores are used (228,679ms on one core against 44,631ms on six), while using only 19% of the available Read Bytes/sec (256MBytes/sec against the 1,324MBytes/sec demonstrated).

In past versions of SQL Server we became used to turning parallelism off because it more often than not extended the duration of a query; that was fixed in SQL Server 2008, where parallelism now works well. However, legacy systems and code still carry that habit, and it is a perpetual myth that needs breaking: the MAXDOP restriction should be removed and SQL Server left to get on with it.
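For reference, here is a sketch of the two places a degree-of-parallelism cap typically lives – the instance-wide setting and a per-query hint; the exact settings on my kit aren’t shown above, so treat this as illustrative only:

exec sp_configure 'show advanced options', 1;
reconfigure;
exec sp_configure 'max degree of parallelism', 0;   -- 0 = let SQL Server use all available cores
reconfigure;

select MAX( cast( expandrow as bigint ) )
from TestGuid_HeapINSERT with ( index = 0 )
option ( maxdop 1 )    -- legacy-style per-query cap; this is what forces the single-core behaviour measured above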

It is true that on a system with a number of concurrent connections this problem will balance itself out; however, what about the sequential overnight batch jobs that so many of us have?

No direct advice here (yet) – maybe that will come when I finally write up my conclusions on these SSD benchmarks – but you need to be aware that this problem exists, and as IO subsystems perform better (as mine does) these issues will need addressing.

Benchmarks (SSD based)

In the following test I’ve tried to be realistic in terms of what a real-world query would do – joins! Ordinarily you’d have several tables joined together; this is where you take the performance hit if a) your IO subsystem has high read latency and b) there is a lot of fragmentation.

Benchmarks will be performed on SSD and then on rotational disk; the first set of benchmarks focuses on read queries, with writes to be covered at a later date when I will discuss a hybrid approach to storage design.

There are three versions of the table, each held on its own file group so that there is no object interleaving to interfere, giving a clear and clean set of numbers: a) a table populated with 50 million single INSERTs that has a clustered index on the guid column (TestGuid_INSERT); b) a table populated from a) in a single insert so there is no fragmentation (TestGuid_SingleINSERT); and c) a table populated with 50 million single INSERTs that is a heap with a single non-clustered index on the guid column (TestGuid_HeapINSERT). A padding column of char(1024) is used to pad each row, causing more page splitting.

The database set up script can be found in the blog article: http://sqlblogcasts.com/blogs/tonyrogerson/archive/2011/07/19/use-guid-s-uniqueidentifier-for-keys-when-storing-data-on-ssd-s-part-2-initial-database-schema.aspx

Baselining the test kit

To give the benchmarks a backdrop, the capabilities of the kit need to be established by placing it under specific load types. IOMeter has been used to test raw grunt in terms of IOs per second, throughput and read latency.

Kit

CPU and Memory

£154.16 - Gigabyte GA-890FXA-UD7 AMD 890FX (Socket AM3)
          PCI-Express DDR3 Motherboard

£170.20 - AMD Phenom II X6 Six Core 1090T Black Edition
          3.20GHz (Socket AM3)

£123.25 - 8GB Team Xtreem LV DDR3 PC3-16000 (9-11-9-27)
          Dual Channel Low-Voltage kit

£123.25 - 8GB Team Xtreem LV DDR3 PC3-16000 (9-11-9-27)
          Dual Channel Low-Voltage kit

Sub-Total: £570.86

IO Sub System

£108.36 - 80GB OCZ Vertex 2 SATA II 2.5" SSD Solid State Disk
          (285MB/sec read 275MB/sec write)

 £50.67 - OCZSSDPX-1HSDL4P OCZ HSDL 3.5 INCH IBIS 4 port HSDL Card

£442.85 - 240GB OCZ IBIS 3.5-inch HSDL High-Speed Data Link SSD
          (read 740MB/s - write 720MB/s)

£442.85 - 240GB OCZ IBIS 3.5-inch HSDL High-Speed Data Link SSD
          (read 740MB/s - write 720MB/s)

Sub-Total: £1,044.73

Build details

Operating System: Windows 2008 R2 x64

SQL Server 2008 R2 (10.50.1617) x64

C: on OCZ Vertex 2

E: as software RAID 0 over the two IBIS drives, formatted with an allocation unit size of 64KBytes; the two IBIS drives are internally RAID 0 with the default 64KB stripe.

Although there is 16GB of RAM in the box, SQL Server 2008 R2 is fixed at min/max server memory of 1024MB to put pressure on the buffer pool and thus stretch the IO system.
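The exact configuration commands aren’t shown here, but fixing the instance at 1GB would have been done along these lines (a sketch, assuming sp_configure was used):

exec sp_configure 'show advanced options', 1;
reconfigure;
exec sp_configure 'min server memory (MB)', 1024;   -- keep the buffer pool pinned at 1GB
exec sp_configure 'max server memory (MB)', 1024;   -- and capped at 1GB
reconfigure;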

IOMeter tests on a 50GB file (104,857,600 x 512 byte sectors)

Three transfer sizes have been chosen: 64KB (an extent), 8KB (a page) and 256KB (read-ahead reading multiple extents).

For the 8KB and 64KB transfer sizes I’ve given figures for both ends of the sequential and random read spectrum; for the 256KB test a 50/50 split was chosen because that is more realistic.

| Transfer Size (KB) | Read or Write | % Rand | % Seq | Outstanding IO | IOs/sec | MBytes/sec | Avg I/O (ms) | Max I/O (ms) | CPU % |
|--------------------|---------------|--------|-------|----------------|---------|------------|--------------|--------------|-------|
| 64  | Read | 100 | 0   | 32 | 18644 | 1165 | 1.71  | 10.23 | 27.26 |
| 64  | Read | 100 | 0   | 64 | 21305 | 1331 | 3     | 5.91  | 26.68 |
| 64  | Read | 100 | 0   | 86 | 21523 | 1345 | 3.99  | 6.75  | 18.76 |
| 64  | Read | 0   | 100 | 32 | 22718 | 1419 | 1.4   | 2.39  | 36.57 |
| 64  | Read | 0   | 100 | 64 | 20122 | 1257 | 3.17  | 9.6   | 34.18 |
| 64  | Read | 0   | 100 | 86 | 21233 | 1327 | 4.05  | 6.83  | 44.39 |
| 8   | Read | 100 | 0   | 32 | 65627 | 512  | 0.48  | 13.05 | 30.49 |
| 8   | Read | 100 | 0   | 64 | 60915 | 475  | 1.05  | 3.13  | 46.24 |
| 8   | Read | 100 | 0   | 86 | 75519 | 590  | 1.13  | 22.73 | 32.95 |
| 8   | Read | 0   | 100 | 32 | 79506 | 621  | 0.4   | 1.9   | 29.79 |
| 8   | Read | 0   | 100 | 64 | 79039 | 617  | 0.81  | 2.52  | 55.61 |
| 8   | Read | 0   | 100 | 86 | 67485 | 527  | 1.27  | 19.27 | 21.24 |
| 256 | Read | 50  | 50  | 32 | 5211  | 1302 | 6.14  | 10.83 | 26.64 |
| 256 | Read | 50  | 50  | 64 | 5121  | 1280 | 12.46 | 21.5  | 31.56 |
| 256 | Read | 50  | 50  | 86 | 5200  | 1300 | 16.5  | 34.92 | 27.79 |

Populating the data

Populating the table with the clustered index via 50 million single inserts took 15 hours 4 minutes; the 50 million inserts into the heap took 7 hours 55 minutes.

The table below shows the results of sys.dm_db_index_physical_stats; you can clearly see that TestGuid_INSERT (the table with the clustered index, built from 50 million single inserts) is heavily fragmented. The heap contains almost exactly the same number of pages as the leaf level of TestGuid_SingleINSERT, which has no fragmentation – that is what I’d expect to see. The non-clustered index (index_id 2), which is the primary key (nonclustered), is extremely badly fragmented, which again is expected because the insert order is purely random.
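The exact DMV call isn’t reproduced here, but figures like those below come from sys.dm_db_index_physical_stats run in DETAILED mode; a sketch (the filter on the TestGuid tables is an assumption) would be:

select OBJECT_NAME( ps.object_id ) as table_name, ps.index_id, ps.index_depth, ps.index_level,
       ps.avg_fragmentation_in_percent, ps.fragment_count, ps.avg_fragment_size_in_pages,
       ps.page_count, ps.avg_page_space_used_in_percent, ps.record_count,
       ps.min_record_size_in_bytes, ps.max_record_size_in_bytes, ps.avg_record_size_in_bytes
from sys.dm_db_index_physical_stats( DB_ID(), NULL, NULL, NULL, 'DETAILED' ) as ps
where OBJECT_NAME( ps.object_id ) like 'TestGuid%'
order by table_name, ps.index_id, ps.index_level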

 

| Table | index id | index depth | index level | avg fragmentation (%) | fragment count | avg fragment size (pages) | page count | avg page space used (%) | record count | min record size (bytes) | max record size (bytes) | avg record size (bytes) |
|-------|----------|-------------|-------------|-----------------------|----------------|---------------------------|------------|-------------------------|--------------|-------------------------|-------------------------|-------------------------|
| TestGuid_HeapInsert   | 0 | 1 | 0 | 3.4   | 31478    | 226.9 | 7142864  | 91 | 50000000 | 1047 | 1047 | 1047 |
| TestGuid_HeapInsert   | 2 | 4 | 0 | 99.2  | 243713   | 1.0   | 243713   | 68 | 50000000 | 25   | 25   | 25   |
| TestGuid_HeapInsert   | 2 | 4 | 1 | 99.9  | 1087     | 1.0   | 1087     | 69 | 243713   | 23   | 23   | 23   |
| TestGuid_HeapInsert   | 2 | 4 | 2 | 50.0  | 4        | 1.0   | 4        | 84 | 1087     | 23   | 23   | 23   |
| TestGuid_HeapInsert   | 2 | 4 | 3 | 0.0   | 1        | 1.0   | 1        | 1  | 4        | 23   | 23   | 23   |
| TestGuid_SingleINSERT | 1 | 4 | 0 | 0.7   | 48827    | 146.3 | 7142858  | 91 | 50000000 | 1047 | 1047 | 1047 |
| TestGuid_SingleINSERT | 1 | 4 | 1 | 0.0   | 47938    | 1.0   | 47938    | 46 | 7142858  | 23   | 23   | 23   |
| TestGuid_SingleINSERT | 1 | 4 | 2 | 0.0   | 321      | 1.0   | 321      | 46 | 47938    | 23   | 23   | 23   |
| TestGuid_SingleINSERT | 1 | 4 | 3 | 0.0   | 1        | 1.0   | 1        | 99 | 321      | 23   | 23   | 23   |
| TestGuid_INSERT       | 1 | 4 | 0 | 99.1  | 10625949 | 1.0   | 10625950 | 61 | 50000000 | 1047 | 1047 | 1047 |
| TestGuid_INSERT       | 1 | 4 | 1 | 100.0 | 47661    | 1.0   | 47661    | 69 | 10625950 | 23   | 23   | 23   |
| TestGuid_INSERT       | 1 | 4 | 2 | 99.6  | 224      | 1.0   | 224      | 66 | 47661    | 23   | 23   | 23   |
| TestGuid_INSERT       | 1 | 4 | 3 | 0.0   | 1        | 1.0   | 1        | 69 | 224      | 23   | 23   | 23   |

 

Simple Performance Comparison (Test group 1)

A series of tests will be performed and blogged over the coming weeks after which a conclusion will be drawn.

The first test simply takes 50,000 of the 50 million rows; each row is equally spaced throughout the table (see SQL below) and the set is then joined back into the big table in order to do a MAX. This tests the performance of random lookups; basically it is a real-world test and what I’d expect to see in most SQL Server databases with modest indexing strategies and maintenance.

--

--          TEST 1 (on fragmented INSERT with clustered index)

--

 

checkpoint

go

dbcc dropcleanbuffers

dbcc freeproccache

go

 

select guidcol

into #lookup

from (      select guidcol, ROW_NUMBER() over( order by guidcol ) as rn

            from TestGuid_INSERT ) as d

where d.rn % 1000 = 0

go

/*

perfmon:

      avg disk bytes/read 8192

      avg disk sec/read   0.000

      avg disk reads/sec  48,887

      avg read bytes/sec  400,501,565

 

Table 'TestGuid_INSERT'. Scan count 1, logical reads 10673614, physical reads 25690, read-ahead reads 10709354, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 

 SQL Server Execution Times:

   CPU time = 73134 ms,  elapsed time = 214243 ms.

 

*/

checkpoint

go

dbcc dropcleanbuffers

dbcc freeproccache

go

 

select max( tg.expandrow )

from #lookup as l

      inner join TestGuid_INSERT as tg on tg.guidcol = l.guidcol

go

 

/*

 

Table 'TestGuid_INSERT'. Scan count 0, logical reads 351705, physical reads 6, read-ahead reads 196291, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#lookup'. Scan count 7, logical reads 156, physical reads 0, read-ahead reads 5, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 

 SQL Server Execution Times:

   CPU time = 2231 ms,  elapsed time = 1756 ms.

 

*/

 

--

------

 

drop table #lookup

 

 

--

--          TEST 2 (on fragmented INSERT with heap)

--

 

checkpoint

go

dbcc dropcleanbuffers

dbcc freeproccache

go

 

select guidcol

into #lookup

from (      select guidcol, ROW_NUMBER() over( order by guidcol ) as rn

            from TestGuid_HeapINSERT ) as d

where d.rn % 1000 = 0

go

/*

perfmon:

      avg disk bytes/read 8211

      avg disk sec/read   0.000

      avg disk reads/sec  5,749

      avg read bytes/sec  47,213,526

 

Table 'TestGuid_HeapINSERT'. Scan count 1, logical reads 244803, physical reads 444, read-ahead reads 278989, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 

 SQL Server Execution Times:

   CPU time = 22074 ms,  elapsed time = 25424 ms.

  

*/

checkpoint

go

dbcc dropcleanbuffers

dbcc freeproccache

go

 

select max( tg.expandrow )

from #lookup as l

      inner join TestGuid_HeapINSERT as tg on tg.guidcol = l.guidcol

go

 

/*

 

Table 'TestGuid_HeapINSERT'. Scan count 0, logical reads 345645, physical reads 7, read-ahead reads 185834, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#lookup'. Scan count 7, logical reads 156, physical reads 0, read-ahead reads 5, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 

 SQL Server Execution Times:

   CPU time = 2571 ms,  elapsed time = 1624 ms.

 

*/

 

--

------

 

drop table #lookup

 

 

 

--

--          TEST 3 (on single (non-fragmented) INSERT with clustered index)

--

 

checkpoint

go

dbcc dropcleanbuffers

dbcc freeproccache

go

 

select guidcol

into #lookup

from (      select guidcol, ROW_NUMBER() over( order by guidcol ) as rn

            from TestGuid_SingleINSERT ) as d

where d.rn % 1000 = 0

go

/*

perfmon:

      avg disk bytes/read 297,763

      avg disk sec/read   0.001

      avg disk reads/sec  3,221

      avg read bytes/sec  959,224,428

 

Table 'TestGuid_SingleINSERT'. Scan count 1, logical reads 7190799, physical reads 21110, read-ahead reads 7190779, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 

 SQL Server Execution Times:

   CPU time = 44960 ms,  elapsed time = 62299 ms.

  

*/

checkpoint

go

dbcc dropcleanbuffers

dbcc freeproccache

go

 

select max( tg.expandrow )

from #lookup as l

      inner join TestGuid_SingleINSERT as tg on tg.guidcol = l.guidcol

go

 

/*

 

Table 'TestGuid_SingleINSERT'. Scan count 0, logical reads 353439, physical reads 6, read-ahead reads 203706, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#lookup'. Scan count 7, logical reads 156, physical reads 0, read-ahead reads 5, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 

 SQL Server Execution Times:

   CPU time = 2386 ms,  elapsed time = 1675 ms.

 

*/

 

 

Findings of Comparison (Test group 1)

Population of #lookup

 

| Table | Avg Disk Bytes/read | Avg Disk Sec/Read | Avg Disk Reads/sec | Avg Read Bytes/sec | Logical Reads | Physical Reads | Read-ahead | CPU (ms) | Elapsed (ms) |
|-------|---------------------|-------------------|--------------------|--------------------|---------------|----------------|------------|----------|--------------|
| TestGuid_INSERT       | 8,192   | 0.000 | 48,887 | 400,501,565 | 10,673,614 | 25,690 | 10,709,354 | 73,134 | 214,243 |
| TestGuid_HeapINSERT   | 8,211   | 0.000 | 5,749  | 47,213,526  | 244,803    | 444    | 278,989    | 22,074 | 25,424  |
| TestGuid_SingleINSERT | 297,763 | 0.001 | 3,221  | 959,224,428 | 7,190,799  | 21,110 | 7,190,779  | 44,960 | 62,299  |

 

The Heap clearly comes out on top because SQL Server uses the non-clustered index to get the list of guids, so there is significantly less data to read.

Interestingly, for the fragmented tables the read-ahead isn’t reading anything greater than a single page; that is what you’d expect because read-ahead works best on contiguous data, hence the contiguous TestGuid_SingleINSERT table shows a high average bytes per read.

MAX( )

 

| Table | Logical Reads | Physical Reads | Read-ahead | CPU (ms) | Elapsed (ms) |
|-------|---------------|----------------|------------|----------|--------------|
| TestGuid_INSERT       | 351,705 | 6 | 196,291 | 2,231 | 1,756 |
| TestGuid_HeapINSERT   | 345,645 | 7 | 185,834 | 2,571 | 1,624 |
| TestGuid_SingleINSERT | 353,439 | 6 | 203,706 | 2,386 | 1,675 |

 

Given 50,000 rows (#lookup) joined against the 50 million row table, the results of the join are similar in all cases regardless of how fragmented the data is.

Initial findings would suggest that random row lookups on data stored on SSD have uniform seek performance regardless of fragmentation and the size of the table, because of the nature of the join used – in this case Nested Loops.

Plan for lookup against TestGuid_INSERT

  |--Stream Aggregate(DEFINE:([Expr1005]=MAX([partialagg1006])))

       |--Parallelism(Gather Streams)

            |--Stream Aggregate(DEFINE:([partialagg1006]=MAX([GUIDTest_SSD].[dbo].[TestGuid_INSERT].[expandrow] as [tg].[expandrow])))

                 |--Nested Loops(Inner Join, OUTER REFERENCES:([l].[guidcol], [Expr1007]) OPTIMIZED WITH UNORDERED PREFETCH)

                      |--Table Scan(OBJECT:([tempdb].[dbo].[#lookup] AS [l]))

                      |--Clustered Index Seek(OBJECT:([GUIDTest_SSD].[dbo].[TestGuid_INSERT].[PK__TestGuid__AB46272D7F60ED59] AS [tg]), SEEK:([tg].[guidcol]=[tempdb].[dbo].[#lookup].[guidcol] as [l].[guidcol]) ORDERED FORWARD)

Plan for lookup against TestGuid_HeapINSERT

  |--Stream Aggregate(DEFINE:([Expr1006]=MAX([partialagg1007])))

       |--Parallelism(Gather Streams)

            |--Stream Aggregate(DEFINE:([partialagg1007]=MAX([GUIDTest_SSD].[dbo].[TestGuid_HeapINSERT].[expandrow] as [tg].[expandrow])))

                 |--Nested Loops(Inner Join, OUTER REFERENCES:([Bmk1003], [Expr1011]) OPTIMIZED WITH UNORDERED PREFETCH)

                      |--Compute Scalar(DEFINE:([Expr1010]=BmkToPage([Bmk1003])))

                      |    |--Nested Loops(Inner Join, OUTER REFERENCES:([l].[guidcol], [Expr1009]) WITH UNORDERED PREFETCH)

                      |         |--Sort(ORDER BY:([l].[guidcol] ASC))

                      |         |    |--Table Scan(OBJECT:([tempdb].[dbo].[#lookup] AS [l]))

                      |         |--Index Seek(OBJECT:([GUIDTest_SSD].[dbo].[TestGuid_HeapINSERT].[PK__TestGuid__AB46272C1273C1CD] AS [tg]), SEEK:([tg].[guidcol]=[tempdb].[dbo].[#lookup].[guidcol] as [l].[guidcol]) ORDERED FORWARD)

                      |--RID Lookup(OBJECT:([GUIDTest_SSD].[dbo].[TestGuid_HeapINSERT] AS [tg]), SEEK:([Bmk1003]=[Bmk1003]) LOOKUP ORDERED FORWARD)

Plan for lookup against TestGuid_SingleINSERT

  |--Stream Aggregate(DEFINE:([Expr1005]=MAX([partialagg1006])))

       |--Parallelism(Gather Streams)

            |--Stream Aggregate(DEFINE:([partialagg1006]=MAX([GUIDTest_SSD].[dbo].[TestGuid_SingleINSERT].[expandrow] as [tg].[expandrow])))

                 |--Nested Loops(Inner Join, OUTER REFERENCES:([l].[guidcol], [Expr1007]) OPTIMIZED WITH UNORDERED PREFETCH)

                      |--Table Scan(OBJECT:([tempdb].[dbo].[#lookup] AS [l]))

                      |--Clustered Index Seek(OBJECT:([GUIDTest_SSD].[dbo].[TestGuid_SingleINSERT].[PK__TestGuid__AB46272D08EA5793] AS [tg]), SEEK:([tg].[guidcol]=[tempdb].[dbo].[#lookup].[guidcol] as [l].[guidcol]) ORDERED FORWARD)

 

In the next blog instalment I’ll look at how the I/O system can outperform the cores in the box, creating a CPU bottleneck; we’ll then return to more tests and finally a conclusion.

Any comments, feel free to email me privately: tonyrogerson@torver.net

In relation to my blog post on GUIDs and SSDs (part 1), the script below forms the basis of the permanent tables in the database.

TestGuid_INSERT forms a table that is 50 million rows and around 85.5GBytes in size.

The tests I will be doing as part of the benchmarking are real world tests for instance an INNER JOIN between the TestGuid_INSERT and a lookup table with 50K guids in it.

If you can think of any real-world tests you want me to run then email me at tonyrogerson@torver.net and I'll include them in the benchmarks; better still, create the test yourself and be ready to run the benchmarks to better understand the game changer I'm talking about.

I've created the contiguous table on its own file group so I know the data is as contiguous as I can get it.

The 1GB start size and growth increment on the original test were deliberate, to make the file even more fragmented on-drive (on an SSD the data is effectively fragmented anyway because of the internal mapping that goes on).

CREATE DATABASE [GUIDTest_SSD] ON  PRIMARY

( NAME = N'GUIDTest_SSD', FILENAME = N'E:\S2008R2\GUIDTest_SSD.mdf' , SIZE = 1GB , FILEGROWTH = 1GB )

 LOG ON

( NAME = N'GUIDTest_SSD_log', FILENAME = N'E:\S2008R2\GUIDTest_SSD_log.ldf' , SIZE = 1GB , FILEGROWTH = 1GB)

GO

alter database GUIDTest_SSD set recovery simple

go

USE GUIDTest_SSD

go

select * from sys.dm_io_virtual_file_stats( DB_ID(), null )

go

--    population script (2:55)

create table TestGuid_INSERT (

      guidcol uniqueidentifier not null default( newid() ) primary key clustered,

      expandrow char(1024) not null

      );

declare @i int = 1;

set nocount on;

while @i <= 50000000

begin

      insert TestGuid_INSERT ( guidcol, expandrow ) values( NEWID(), cast( @i as CHAR(1024) ) )

      set @i = @i + 1

end

go

 

select * from sys.dm_io_virtual_file_stats( DB_ID(), null )

go

USE [master]

GO

ALTER DATABASE [GUIDTest_SSD] ADD FILEGROUP [CONTIG_TBL]

GO

ALTER DATABASE [GUIDTest_SSD] ADD FILE ( NAME = N'GUIDTest_SSD_CONTIG', FILENAME = N'E:\S2008R2\GUIDTest_SSD_CONTIG.ndf' , SIZE = 100GB , FILEGROWTH = 1024KB ) TO FILEGROUP [CONTIG_TBL]

GO

USE GUIDTest_SSD

go

 

create   table TestGuid_SingleINSERT (

      guidcol uniqueidentifier not null default( newid() ) primary key clustered,

      expandrow char(1024) not null

      ) ON [CONTIG_TBL];

select * from sys.dm_io_virtual_file_stats( DB_ID(), null )

go

insert TestGuid_SingleINSERT ( guidcol, expandrow )

      select guidcol, expandrow

      from TestGuid_INSERT

      order by guidcol, expandrow

go

select * from sys.dm_io_virtual_file_stats( DB_ID(), null )

go   

USE [master]

GO

ALTER DATABASE [GUIDTest_SSD] ADD FILEGROUP [HEAP_TBL]

GO

ALTER DATABASE [GUIDTest_SSD] ADD FILE ( NAME = N'GUIDTest_SSD_HEAP_TBL', FILENAME = N'E:\S2008R2\GUIDTest_SSD_HEAP_TBL.ndf' , SIZE = 100GB , FILEGROWTH = 1024KB ) TO FILEGROUP [HEAP_TBL]

GO

USE GUIDTest_SSD

go

--    population script (2:55)

create table TestGuid_HeapINSERT (

      guidcol uniqueidentifier not null default( newid() ) primary key nonclustered,

      expandrow char(1024) not null

      ) on [heap_tbl];

declare @i int = 1;

set nocount on;

select * from sys.dm_io_virtual_file_stats( DB_ID(), null )

while @i <= 50000000

begin

      insert TestGuid_HeapINSERT ( guidcol, expandrow ) values( NEWID(), cast( @i as CHAR(1024) ) )

      set @i = @i + 1

end

go

select * from sys.dm_io_virtual_file_stats( DB_ID(), null )

go

Using GUIDs (uniqueidentifier) for surrogate keys has always been, and still is, frowned upon by database designers and optimisation consultants alike – and rightly so: the fragmentation and randomness of how rows are stored and accessed can cripple performance on traditional storage systems.

People on all sides are acutely aware of the benefits GUIDs bring for distributed processing, whereby the middle tier is contained on many servers or the database is spread across many physical machines.

Solid State Drives have been a game changer; they remove the problems associated with traditional rotational disk technology because they not only offer uniform seek times across the storage but the read latency for fully random reads is also consistent and extremely low. They do have a weakness – cells burn out on write-intensive workloads: for Multi-Level Cell (MLC) flash this is around 10K write cycles, for Single-Level Cell (SLC) around 100K cycles, and for the newer Phase Change Memory (PCM) around 100M (http://www.fastestssd.com/articles/pcm-based-storage-a-game-changer/).

This article seeks to demonstrate that we as database designers and optimisation specialists need to adjust our approach and realise that SSDs provide a unique opportunity to scale the database out as well as up, and to offer less complicated database implementations – especially as, now more than ever, there is a need to spread data across many physical machines. The article is split into two parts: the first gives background on SSDs, the second the actual benchmarks.

What is really important – IO size or query duration?

We should only be concerned with query duration: queries should return within an acceptable time given the query being performed, user expectation and the number of concurrent queries executing. Query speed is the outcome of a number of parameters; a rather simplistic formula goes something like queryduration = ( iolatency + iothroughput + cputime + blocking ) * concurrentusers.

I’ve presented a number of times now on the commodity “reporting brick” I’ve built around an AMD Phenom 1090T, 16GB RAM and 2 x 240GB OCZ IBIS drives in RAID 0. I’ve shown how all 6 cores are needed, and top out, in order to get the full potential out of the two IBIS drives; when limited to 1 or 2 cores, instead of 1.3GBytes per second the throughput drops off to a pitiful (though still good) 130-260MBytes per second.

SO – are we really bothered about the IO subsystem when it can outperform the CPU cores?

Query speed against data held on rotational disk is not consistent; it is subject to the non-uniform seek latency that is dramatically affected if the data is fragmented or held randomly across the sectors on the disk.

Let’s have a look at the hardware that relates to performance, namely a) the IO sub-system for getting the data off disk and b) the system RAM for the SQL Server buffer pool.

Hardware

Costs and Classification of Solid State Drives

SSDs can be classified as commodity or enterprise, and as SATA attached or PCI attached.

Commodity SSD tends to use the cheaper MLC, which has a lower write life; MLC is also found in some enterprise SSDs, but SLC predominates because of its higher write life. Cell burn issues are mitigated, and therefore the life of the drive extended, by a number of techniques – in the enterprise space these are simply a lot better, and there is also more redundancy over and above the stated drive size.

The cost of SSD varies dramatically. For the enterprise, PCI-attached SSD is a must because of the reduced latency and because the hardware is designed specifically around SSD, rather than having to use a legacy protocol such as SATA and the drivers thereof, which were designed around rotational disk technology and the performance problems associated with it.

The OCZ Z-Drive product line (http://www.ocztechnology.com/ocz-z-drive-r3-p84-pci-express-ssd.html) is currently the best price per GByte; you can buy a 1TB card for £4K (http://www.amazon.co.uk/OCZ-Z-Drive-PCI-Express-SSD-p84/dp/B003EEMIH2/ref=sr_1_4?ie=UTF8&qid=1310737655&sr=8-4), but hold out and wait for the R3 which is due shortly; a single card will give you 1GByte per second of throughput and is capable of 135K IOs/second – the read latency will also be a couple of milliseconds at those levels, and that’s fully random IO!

Just a thought to throw in the pot: consider that when you overwrite a piece of data on an SSD you aren’t necessarily overwriting – you are marking the previous cells as deleted and writing the information elsewhere. Write performance degrades if TRIM hasn’t caught up and reset those cells, so would it not be logical to buy an SSD with far more storage capacity than you actually need? For instance, if you only need 240GB for your database, buy one that is twice that; it would reduce the chance of hitting cells that need a full erase cycle before they can be written over.

Costs and Classification of Rotational Drives

We’ve had rotational speeds of 5.4K, 7.2K and 10K rpm, and then in 2001 the technology hit a wall at 15K rpm – that’s 10 years ago! Even now 15Krpm disks only go up to 147GB; at 10Krpm we have disks up to 900GB. A 15Krpm 147GB disk costs around £185 ex VAT, with an average access time of a whopping 3.4 milliseconds.

How many disks would you need to create a disk array that could keep up with a pair of OCZ IBIS Drives in a RAID 0 delivering over 80K 64Kbyte 100% random reads per second at a latency of 1.2milliseconds?

Rotational disks are just fine for the transaction log because, so long as the log is the only thing being written to on that array, the seek times are going to be extremely low.

Cost of memory and Limits

I recently costed up a DL385 G7 server; it will hold up to 256GB of RAM. 16GB of RAM costs £575, so 256GB would set you back £9.2K.

Summary of Part 1

Enterprise SSDs are still expensive when looked at on a price per GByte basis; however, measured as read/write latency per thousands of IOs/second they are extremely cheap.

Operational care should be taken if the data access pattern is heavy on overwriting (UPDATE) rather than being mostly INSERT, because you may quickly burn the SSD out. Design considerations come into play here: you can easily implement a type 2 Slowly Changing Dimension (SCD) in the operational system, as sketched below.
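A minimal sketch of that approach – the table and column names here are hypothetical, not from this series – closes off the current row and INSERTs a new version instead of UPDATEing the data in place, keeping the workload mostly INSERT:

create table CustomerAddress (
      customer_guid uniqueidentifier not null,
      address varchar(200) not null,
      valid_from datetime not null default( getdate() ),
      valid_to datetime null              -- NULL = the current version of the row
      );
go

declare @customer_guid uniqueidentifier = NEWID(),
        @new_address varchar(200) = '1 New Street';

-- a change of address: expire the current version rather than overwrite it...
update CustomerAddress
      set valid_to = getdate()
      where customer_guid = @customer_guid
        and valid_to is null;

-- ...then insert the new version as a fresh row
insert CustomerAddress ( customer_guid, address )
      values( @customer_guid, @new_address );
go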

In part 2 we will look at the benchmarks from various real-world tests against a 50 million row, 85GByte table which has over 99% fragmentation in the leaf and other levels of the index. You may well be surprised at what you see – it’s a game changer in terms of the way we currently think.