May 2008 - Posts

Covering Clustered Indexes

 

I've noticed that of late I've become a bit more critical of a well known publication that I suspect many DBAs read. I have subscriptions to a number of publications, and for the ones that I pay for I'm generally quite content.

I like paper / hard copy because I can make use of time where it's just not practical to have a laptop or PC. I spend my working day in front of a screen, so sometimes it's nice to escape to the garden, or the pub, or the train, say travelling to a SSUG meeting, without a PC.


So what's set GrumpyOldDba off now, you may ask? Well, an article has been published about removing duplicate indexes. The Microsoft guys blogged this, with code, ages ago, and I spent some time looking at this type of analysis as part of my index analysis series.

However it's not quite as easy as it sounds. Leaving aside duplicated single column non clustered indexes, which are obviously bad, the question of indexes that have the same columns in a different order, or are contained within another index, is quite complex and worth a whole lot of time on its own.

( Here’s the original blog from June 2007  http://blogs.msdn.com/mssqlisv/archive/2007/06/29/detecting-overlapping-indexes-in-sql-server-2005.aspx)


Idx_anytable_index1:  SalesID, PropertyID, SaleStartDate, SaleEndDate, UserID, IsCurrent

Idx_anytable_index2:  SalesID, UserID, PropertyID, SaleStartDate, SaleEndDate, IsCurrent

Idx_anytable_index3:  SalesID, PropertyID, SaleEndDate, IsCurrent

Idx_anytable_index4:  PropertyID, SalesID, SaleStartDate, SaleEndDate, UserID, IsCurrent


Taking the above indexes to be all standard non clustered indexes, we can say that there's a fair bit of duplication; in fact index 3 is already contained within index 1.
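As a rough way of spotting this sort of overlap ( this is my own sketch, not the code from the Microsoft blog above ), you can list every index's key columns in order and eyeball the results:

select  object_name(i.object_id) as table_name,
        i.name as index_name,
        ic.key_ordinal,
        col_name(ic.object_id, ic.column_id) as column_name
from    sys.indexes i
join    sys.index_columns ic on ic.object_id = i.object_id and ic.index_id = i.index_id
where   objectproperty(i.object_id, 'IsUserTable') = 1
        and ic.is_included_column = 0     -- key columns only, ignore INCLUDEd columns
order by table_name, index_name, ic.key_ordinal;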

Sadly, with anything like this it just depends. Generally I'd say that index 3 is probably unwanted; however it may cover a query exactly, in which case index 1 might be ignored, and in certain joins one index may actually be preferred over another.

So what I'm saying is that, other than basic duplicates, it's actually quite tricky. Obviously you can check the DMVs ( SQL 2005 onwards ) to see whether an index is actually being used; if you're using 2000 or 7.0 then you can't.
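Something along these lines against the usage stats DMV will do that check ( a minimal sketch, and bear in mind the counters are cleared when the instance restarts ):

select  object_name(i.object_id) as table_name,
        i.name as index_name,
        u.user_seeks, u.user_scans, u.user_lookups, u.user_updates
from    sys.indexes i
left join sys.dm_db_index_usage_stats u on u.database_id = db_id()
        and u.object_id = i.object_id and u.index_id = i.index_id
where   objectproperty(i.object_id, 'IsUserTable') = 1
order by table_name, index_name;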


So what really got me worked up was the statement that creating a secondary index over a clustered index was a bad idea. To me, that such an article appears in this magazine is very disappointing, and I'm also disappointed with the author; it shows a lack of editorial knowledge and, like the review of the backup utilities last month, isn't what I'd expect from such a respected magazine. I'm also frankly becoming less and less impressed by the fact that many articles are only partly published and you have to go to the website for the full article. Maybe it's a sign of the times that we don't actually want paper copy any more.


Anyway, here's a very simple example of secondary indexes over the top of clustered indexes. The table I used has 2 million rows and is just under 16GB in size ( let's do real world examples here <grin> ). It's a variation of the table used in "Analysing Index Part 4 – Size does matter"


http://www.grumpyolddba.co.uk/sql2005/working%20with%20indexes%204.mht


This is the test table and the code used to populate it. You might want to find other ways of populating the table if you're not running on a server, as I figure this will run for a few hours on a laptop ( a set-based alternative is sketched after the loop ). There were distinct reasons why I wanted to generate data this way for some previous tests; the only important factor for this test is that each row should exceed half a page ( just over 4,030 bytes ) so as to force one row per page.


create table dbo.TestTable
(
NumKey int identity(1,1) not null,
cGuid uniqueidentifier default newid() null,
cSguid uniqueidentifier default NEWSEQUENTIALID() not null,
cBig bigint null,
cInt int null,
cDate1 datetime not null,
cDate2 datetime null,
cVchar1 varchar(50) null,
cChar1 char(20) null,
cVchar2 varchar(500) null,
cChar2 char(100),
Pad1 char(4000) not null
);
go

--
--
set nocount on
--
declare @num bigint,@count int,@count2 int,@count3 int
set @count = 1
set @count2 = 12
set @count3 = 1
while @count<2000001
begin
    IF (@count%2)=0
        begin
            insert into dbo.TestTable(cBig,cInt,cDate1,cDate2,cVchar1,cChar1,cVchar2,cChar2,pad1)
            values( 36854775807+(36854775807/@count),@count2*3,dateadd(dd,@count3,'25 MAY 2007'),dateadd(mi,@count,getdate()),
            reverse(convert(varchar(50),dateadd(dd,@count3,'25 MAY 2007'))),'0000'+convert(varchar(7),@count)+convert(varchar(4),@count3),
            replicate(convert(char(36),newid()),10),convert(varchar(20),datename(dw,getdate()+@count)+datename(month,getdate()+@count)) +'blahblahblahblahblahblahblahblahblahblahblahblah',
            replicate(convert(char(36),newid()),110))
        end
    else
        begin
            insert into dbo.TestTable(cBig,cInt,cDate1,cDate2,cVchar1,cChar1,cVchar2,cChar2,pad1)
            values( 854775807-(36854775807/@count),@count2*2,dateadd(dd,@count3,'5 dec 2006'),dateadd(mi,@count,getdate()),
            reverse(convert(varchar(50),dateadd(dd,@count3,'25 MAY 2007'))),'0000'+convert(varchar(7),@count)+convert(varchar(4),@count3),
            replicate(convert(char(36),newid()),10),convert(varchar(20),datename(dw,getdate())+datename(month,getdate())) +'blahblahblahblahblahblahblahblahblahblahblahblah',
            replicate(convert(char(36),newid()),110))
        end
    set @count = @count+1;
    set @count2 = @count2+2;
    IF @count2>50
        set @count2 = 12;
    IF (@count%3000)=0
        set @count3 = @count3+1;
end
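If a few hours in a loop doesn't appeal, a set-based version along these lines should load the same volume far more quickly. It's only a sketch, and not what I actually ran ( the generated values are simplified compared to the loop above ), but each row still exceeds half a page so you still get one row per page:

-- set-based alternative: 2,000,000 rows in a single insert via a cross-joined row source
-- (assumes sys.all_objects cross joined to itself yields at least 2 million rows, which it will in any normal database)
;with nums as
(
    select top (2000000) convert(int, row_number() over (order by (select null))) as n
    from sys.all_objects a cross join sys.all_objects b
)
insert into dbo.TestTable (cBig, cInt, cDate1, cDate2, cVchar1, cChar1, cVchar2, cChar2, pad1)
select  36854775807 + n,
        12 + (n % 20) * 2,
        dateadd(dd, n / 3000, '25 MAY 2007'),
        dateadd(mi, n, getdate()),
        reverse(convert(varchar(50), dateadd(dd, n / 3000, '25 MAY 2007'))),
        '0000' + convert(varchar(7), n),
        replicate(convert(char(36), newid()), 10),
        convert(varchar(20), datename(dw, getdate()) + datename(month, getdate())),
        replicate(convert(char(36), newid()), 110)   -- pad1: 3,960 bytes, keeps each row over half a page
from nums;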

 


Here's the output from sp_spaceused and from sys.dm_db_index_physical_stats ( shown as screenshots in the original post ).

Here's my table after adding a clustered primary key on the integer identity column, NumKey.
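For reference, something along these lines produces that output and adds the key ( the constraint name here is just an example, it isn't shown in the post ):

-- space used by the table after the population run
exec sp_spaceused 'dbo.TestTable';
go
-- add the clustered primary key on the identity column
alter table dbo.TestTable
    add constraint PK_TestTable primary key clustered (NumKey);
go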



Extracting index information from sys.dm_db_index_physical_stats:


 select index_id,index_type_desc,index_depth,index_level,page_count,record_count,min_record_size_in_bytes, max_record_size_in_bytes,avg_record_size_in_bytes

            from sys.dm_db_index_physical_stats (db_id(), object_id('testtable'), null ,null ,'detailed' )

 


And the results:

index_id  index_type_desc  index_depth  index_level  page_count  record_count  min_record_size_in_bytes  max_record_size_in_bytes  avg_record_size_in_bytes
1         CLUSTERED        4            0            2000000     2000000       4577                      4577                      4577
1         CLUSTERED        4            1            3218        2000000       11                        11                        11
1         CLUSTERED        4            2            8           3218          11                        11                        11
1         CLUSTERED        4            3            1           8             11                        11                        11
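All the "Table 'TestTable'. Scan count ... logical reads ..." figures quoted below are the familiar SET STATISTICS IO output ( that's my assumption about how they were captured ), so to follow along turn it on first:

set statistics io on;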


Select 51 rows from the table by the clustered PK


select * from dbo.TestTable where NumKey between 50 and 100


(51 row(s) affected)

Table 'TestTable'. Scan count 1, logical reads 56, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


This is fair enough: at 4,577 bytes per row only one row fits on an 8KB page, so the 51 data rows need 51 page reads, and the remaining handful of logical reads come from navigating the levels of the clustered index.


 

 

( The original post shows screenshots at this point for three cases: the entire row, just the NumKey column, and the IN statement. )


If we just select the integer PK column it still requires the same io to return the result set


select numkey from dbo.TestTable where NumKey between 50 and 100

 

(51 row(s) affected)

Table 'TestTable'. Scan count 1, logical reads 56, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


Just out of pure interest, and as part of another test I was running, I dropped the 51 values into an IN statement


select numkey from dbo.testtable where numkey
in (50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,
71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,
92,93,94,95,96,97,98,99,100)


(51 row(s) affected)


Table 'TestTable'. Scan count 51, logical reads 204, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


The same for two values?


select numkey from dbo.testtable where numkey in (50,51)

 

(2 row(s) affected)

Table 'TestTable'. Scan count 2, logical reads 8, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


So that's making use of the clustered PK on the integer column, returning just the integer value. Each value in the IN list is a separate seek, and each seek traverses the four-level clustered index, which is why 51 values cost 51 x 4 = 204 logical reads and 2 values cost 2 x 4 = 8.


Now create a non clustered secondary index over the PK


create index idx_TestTable_Numkey on dbo.TestTable(Numkey);

go


index_id  index_type_desc     index_depth  index_level  page_count  record_count  min_record_size_in_bytes  max_record_size_in_bytes  avg_record_size_in_bytes
1         CLUSTERED INDEX     4            0            2000000     2000000       4577                      4577                      4577
1         CLUSTERED INDEX     4            1            3218        2000000       11                        11                        11
1         CLUSTERED INDEX     4            2            8           3218          11                        11                        11
1         CLUSTERED INDEX     4            3            1           8             11                        11                        11
2         NONCLUSTERED INDEX  3            0            1732        2000000       5                         5                         5
2         NONCLUSTERED INDEX  3            1            4           1732          11                        11                        11
2         NONCLUSTERED INDEX  3            2            1           4             11                        11                        11


Now the original query


select numkey from dbo.TestTable where NumKey between 50 and 100


(51 row(s) affected)

Table 'TestTable'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


Now in the overall scheme of things a drop from 56 logical reads to 3 may not seem very much ( a single seek through the three-level nonclustered index costs just 3 reads ), but if this query runs maybe several times a second then there could be some big savings.

 

select numkey from dbo.testtable where numkey
in (50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,
71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,
92,93,94,95,96,97,98,99,100)

 

(51 row(s) affected)

Table 'TestTable'. Scan count 51, logical reads 153, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


And for the IN query, reads are down 25%: 51 seeks through the three-level nonclustered index come to 153 logical reads, against 204 before.


 

 

 

 

Conclusion:



To assume that the Primary Key will only be used for singleton selects is "strange", and to assume that the PK will always be the clustered index is also unhelpful.

In real world applications it is not uncommon to find wide Primary Keys and wide clustered indexes, all on wide tables.

Whichever way you cut it, the leaf level of the clustered index is always the width of the entire row, and io will reflect that.

Where the Clustered Primary Key is several columns wide it's more than likely you'll find range selects against it, and here a covering non clustered index can save you considerable io.
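As a sketch of what I mean ( the table and column names are invented for illustration, they're not from any real system ), a narrow covering index sitting alongside a wide clustered primary key might look like this:

-- hypothetical wide table with a multi-column clustered primary key
create table dbo.Sales
(
SalesID int not null,
PropertyID int not null,
SaleStartDate datetime not null,
SaleEndDate datetime null,
UserID int not null,
IsCurrent bit not null,
Notes varchar(4000) null,
constraint PK_Sales primary key clustered (SalesID, PropertyID, SaleStartDate)
);
go
-- narrow nonclustered index that covers a common range query,
-- so the wide clustered leaf pages never need to be read
create index idx_Sales_StartDate_cover
    on dbo.Sales (SaleStartDate, SalesID)
    include (IsCurrent);
go
-- satisfied entirely from the nonclustered index
select SalesID, SaleStartDate, IsCurrent
from dbo.Sales
where SaleStartDate between '20080101' and '20080201';

The leaf rows of the nonclustered index are only a few bytes wide rather than the width of the whole row, so a range select like that touches far fewer pages.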


Not all databases are the same and what holds true for one may certainly not hold true for another, but sweeping generalisations are a problem, especially when they’re in a respected publication.


It's also difficult for the DBA who has to try to explain to a Manager why what's in such an article is wrong. I remember unsuccessfully spending over a year trying to explain to senior management what the lock waits counters actually represented, because someone had read that they meant something else and had convinced themselves this was true. It wasn't!


© www.grumpyolddba.co.uk

Posted by GrumpyOldDBA with 5 comment(s)

Information Overload ??

As I've remarked before, the profusion of blogs ( from the various Microsoft teams and individuals ), whilst exposing so much more information than we ever had before, also requires you to spend almost all your time just checking every blog you can find every day, and that's assuming you actually found the blog in the first place.

In the SQL Server world we tend to direct our attention to SQL Server blogs, and I'll list most of the ones I have bookmarked. However, as I've also pointed out before, SQL Server actually runs on Windows, which runs on various manufacturers' hardware, attaches to various types of storage and may or may not be clustered.

All of these elements may or may not affect SQL Server. There have been a number of w2k3 patches for "bugs" which affect SQL Server, and I'm sure anyone with HP hardware will be aware of the "lights out" problem; I know this isn't an isolated incident.

x64 deployments have also brought some people grief. It's been interesting to try to dispel some of the myths around x64, some of which are frankly very worrying; I still see an almost endless stream of posts about memory configuration for both x32 and x64, with the solutions/advice so often being totally incorrect. Personally I'd have thought x32 memory configuration wasn't a problem any more, but I know I'm wrong here.

Anyway, the purpose of this post is to direct you to some non SQL blogs which may prove both interesting and educational. You should read this if you have x64 ( along with my posts < grin > ):


http://blogs.technet.com/askperf/archive/2008/03/25/lock-pages-in-memory-do-you-really-need-it.aspx?CommentPosted=true#commentmessage

and you should bookmark these:-

http://blogs.technet.com/askperf/  Windows Performance
http://blogs.msdn.com/ntdebugging/  Windows Advanced troubleshooting
http://blogs.technet.com/askcore/  Windows Core
http://blogs.technet.com/networking/  Networking
http://blogs.technet.com/windowsserver/  Windows Server
http://blogs.msdn.com/clustering/  Clustering

SQL Blogs:-

http://denster.spaces.live.com/
http://blogs.msdn.com/craigfr/default.aspx
http://sqlblog.com/blogs/linchi_shea/default.aspx
http://blogs.msdn.com/mssqlisv/default.aspx
http://blogs.msdn.com/sqlcat/
http://blogs.msdn.com/michkap/
http://blogs.msdn.com/psssql/
http://blogs.msdn.com/sqlprogrammability/
http://blogs.msdn.com/sqlperf/
https://blogs.msdn.com/sqlserverstorageengine/default.aspx
http://blogs.msdn.com/sqlqueryprocessing/default.aspx
http://blogs.msdn.com/sqlrsteamblog/Default.aspx
http://blogs.msdn.com/bimusings/


This is only a selection of SQL blogs, so apologies if I've omitted your blog.