To cluster of not - SimonS Blog on SQL Server Stuff

To cluster of not

Tony's blogged about the use of heaps http://sqlblogcasts.com/blogs/tonyrogerson/archive/2007/06/24/row-fragmentation-hopscotch-heap-v-clustered-and-io-cost.aspx. Whilst this highlights data on a page being out of sequence and thus yoyoing when you scan the data, it doesn't mean you should always have a clustered index.

The stats in Tony's blog are only because the query is a full table scan in the first (heap) query. With a clustered index you are not doing a table scan your are only doing a partial scan and so the reads are different. If you remove the where clause thus forcing a table scan then the heap is qucker because it is more compact, it has to read less pages.

You can also address these issues with a covering index. An non clustered index is effectively the its own special table with a clustered index with the leaf pages also containing the Row id or Primary key column.

Also the issue here is highlighted by a scan. If you are performing an operation that does a seek heaps aren't bad. Also clustered indexes result in the whole row being read, thus if you only need a few rows they can also be worse than a heap with a covering index.

Finally having a clustered index is beneficial if you key size is < 8 bytes because the key is put in the leaf of non-clustered indexes. Whereas a heap the row id is and the row id is 8 bytes

Finally with clustered indexes you will end up with page splits, this can result in page fragmentation across the database file, the worst case would be that in order to read 8 pages (8k), 8 extents(64k) would have to be read because each page is on a different extent. The ALTER INDEX REORGANISE and REBUILD address this fragmentation by the former moving pages to be in the same order on disk as logically and also compacting pages in and the latter rebuilding the index entirely. REORGANISE is an online operation.

Following this post Tony and I have had a very long discussion. The bottom line is that if there is one thing you should learn its the internals of indexes. This would be my starter, heaps and b trees, covering indexes, bookmark lookups, included columns, page fragmentation, page splitting ...



-
Published 26 June 2007 13:51 by simonsabin

Comments

26 June 2007 18:04 by tonyrogerson

# re: To cluster of not

Hi Simon,

The whole point about my entry is that the rows are scattered and out of order when stored with a heap, that is not the case when you use a clustered index. Take the order detail table, the primary key would be order_id, say you had an index on order_entry_date; in the clustered index approach the order_id would implicitly be part of that clustered index and rows held in order_id order, however on the heap it would not, but if you include it you are still leap frogging throughout the database using RID's - its' not good and causes a lot of expensive physical IO's because in a heap the rows are out of order and stored scattered across pages in the database and you cannot defrag them without outage.

Absolutely agree Simon, you must know index internals but you must also understand how data is stored and accessed on disk and within the database, not to mention the pattern of data within your application that you are indexing.

Heaps still suck - just ask any Sage DBA.

Tony.

27 June 2007 08:55 by Colin Leversuch-Roberts

# re: To cluster of not

well the main thing about a clustered index is that it allows you to remove leaf( table ) level fragmentation, when I did my OU computer stuff many years ago I seem to remember the definition of a heap was a table that accepts data only, no updates and no deletes - sort of like a queue ( ish ) heaps are fine if they only ever insert - anything else and you end up with fragmentation which increases io and degrades performance. I'm actually writing a post on a slightly different aspect of this - held up by the test server being rebuilt - I have to agree with Tony, Heaps are just bad news in a database esp when you get into terabytes.

27 June 2007 13:32 by Simon Sabin

# Re: To cluster of not

Thats a very dangerous statement, firstly very few people deal with terabytes, secondly even fewer have tables of terabyte size. Heap tables have their place and can be more performant than clustered tables. Why give your granny a ferarri to do to the shops. You need to understand your workload of your application and how it is being addressed by the database design

27 June 2007 15:20 by Colin Leversuch-Roberts

# re: To cluster of not

perhaps a working example of where not having a clustered index gives better performance? On a table of say, at least 1 million rows? ( forget the terabytes for now! )

I do agree however about understanding workloads btw. ( granny + ferrari well that's a question of style surely - sorry ) Often databases and tables are not ideal too. When faced with queries I cannot alter I've sometimes found that defining a non unique clustered index ( yes i know what happens underneath the covers ) will produce less io by a clustered index scan vs a table scan , mainly beacuse the clustered index will be used despite poor selectivity - unlike a non clustered index - again this was on tables with many millions of rows and as always "it just depends" but for reasonable size tables I'd always want a clustered index.