February 2009 - Posts

Thin Provisioning and ntfs

I'm currently working on a SQL Server storage project and as part of that I'm touching a number of areas that may be of interest to the DBA.

The IT industry is full of buzz words, new fads, fashionable terms and so on; and Thin Provisioning was something I was aware of but not fully informed, not that I am now mind. Anyway sad to admit amongst the publications I read is Storage Magazine  http://www.storagemagazine.co.uk/  I like to keep abreast of the latest trends which is why I also buy MSDN magazine although I can't claim to understand a great deal of it, but at least I might get an insight into what's hot or not for developers with regard to anything which might touch SQL Server.

Anyway amongst the reading material was a comments about Thin Provisioning not really working with ntfs; now thin provisioning allows you to allocate storage you don't have but trick the applications into thinking you have. The theory is that shoudl you maybe need 1 TB for a database but currently it's only using 100GB you provision the 1TB but onlygive 100GB. The application thinks it has 1TB and as storage use increases the storage pool automatically increases the actual storage. Now doesn't that sound very much like what our banks were doing with their lending?

Anyway you might like to read a general article here http://searchstorage.techtarget.com/news/article/0,289142,sid5_gci1188117,00.html  and I also found a really good windows article by a Microsoft MVP, Dilip Naik, http://msftmvp.com/default.aspx   his article on thin provisioning is here  http://msftmvp.com/Documents/ThinNTFSv2.pdf

Now as many people know I'm fairly cynical about SAN storage and the apparent lack of performance, however I'm not going to make any comment about this subject matter - I suggest you read carefully and then be sure to ask the right questions if your storage provider has this facility. Some storage vendors have  different terms for it as you might guess!

Here's an article which describes Thin Provisioning in more detail http://searchstorage.techtarget.com/news/column/0,294698,sid5_gci1134713,00.html?mboxSession=1235686994531-668391&mboxSession=1235686994531-668391#

The links to searchstorage are somewhat messy with lots of ads and popups but if you can get through the rubbish there are some good articles. I should mention that Dilip has an excellent white paper on using SSDs too and Joe Chang  http://sqlblog.com/blogs/joe_chang/default.aspx     is also promising some research with SSDs and SQL Server. I'm planning to buy another 4 SSDs but I'm not sure I'll ahve enough for a meaningful test.

 

Posted by GrumpyOldDBA with no comments
Filed under: , ,

Can I Index a bit ? ( Part 1 )

The final part wil follow and a fully illustrated pdf will be on my website - I'll post the location when it's complete.

 

Can I index a bit?

VBUG  Presentation 27th January 2009 Bracknell.

 

Here's a brief summary of subject matter and the points raised.

 

Q.   Is indexing a black art: will zen help: will the queries work better with a full moon?

A.   Yes!

 

The presentation intends to show practical examples of how things are rather than how things should be.

The reality is that few databases match classroom best practices, this presentation is  about what actually works ( sometimes ) based upon real world  experience,  as to will these examples always be true, well,  to coin an oft used phrase  " It just depends "

 

NB. I struggle to sometimes make large blog posts which include pictures and code samples, I will illustrate and code a pdf version of this on my website www.grumpyolddba.co.uk

 

Bad Table:

  • This was the introduction where I showed a table with a wide clustered primary key and all other columns allowing nulls. I asked what, as a DBA, I might not like about this table, now read on.

 

Wide clustered indexes:

  • Bad because the width of the clustered index is added to every row of every secondary index.
  • If you need a wide Primary Key then make it non clustered and cluster on an integer if possible that makes up your PK.
  • For that reason GUIDs make bad clustered keys as they are 16 bytes wide compared to an int at 4 bytes.
  • If you must use a GUID then choose sequential guids as this will reduce table fragmentation.
  • e.g. You may calculate that a secondary index defined on an integer column would fit approximately 8,000/4 rows per page, 2,000 rows per page.
  • o However, if your clustered key consists a guid then each entry for each row for that index is actually 16+4+1 bytes, this actually gives approx 381 rows per page
  • o For an integer clustered key the capacity would be approx 890 rows per page. In real terms this is 11.8Mb difference for an integer index on a 1 million row table, and that's for each secondary index.
  • o This also equates to buffer storage for secondary indexes and page reads required for seeks and scans.

 

The Truth about clustered indexes:

  • Traditionally the clustered index has been marked for range scans, typically dates and so on.
  • What is often overlooked is that the clustered index is the entire table.
  • Showing a select of a range of integer values from an integer clustered primary key showed that despite the result set only being an integer the entire row for each value was placed in buffer cache.
  • o ( shown by the number of page reads )
  • The solution was to create a non clustered index which duplicated the clustered index, the same query now used the secondary index and in the example shown reduced page reads from 175 to 5.
  • The point being made here is that if the clustered primary key is being used in a query, say a six table join, every clustered index seek or scan will have returned the entire row regardless of how many columns were required to satisfy the query, this can be a big hit in terms of io and usage of the buffer cache.

 

A clustered index seek in a query plan is good:

  • Well actually "it just depends".
  • Depending upon the where clause a seek may actually be a scan.
  • I returned 1,999,999 integer values from the clustered key from a 2 million row table, this is shown quite clearly as a seek.
  • o ( using a where clause of < 2000000 )
  • I then returned all 2 million rows
  • o ( using a where clause of <= 2000000 )
  • The point here was that when looking at query plans don't just assume a seek is automatically good, you must also consider how many rows the seek is returning.

 

What should the Primary/Clustered key be:

  • I said that I was leaning towards the use of a sequential numeric key, a surrogate key if needed.
  • I tended to only see the clustered key as a means to defragging the table and of having limited use in data retrieval queries.
  • o Well how many single column( key) queries do you do on a clustered PK?
  • o How often do you actually want the entire row of every table in say a 5 table join.
  • Did I think the Primary key should be part of the real data?
  • Yes this is valid, but not always the clustered key.

 

The curse of the null:

  • This one is easy.
  • I ran a query in the background against a 400 million row table.
  • A nullable column makes every row in an index 3 bytes bigger than a non null column.
  • o This is 3 bytes per column, index 4 nullable columns you get 12 bytes per row extra
  • The 400 million table results showed that a single integer index was 1.14GB larger for the nullable column.
  • That's a big difference in storage, io, buffer usage, backups
  • With an integer clustered key the record size per row for an integer secondary index are 12 bytes for a nullable column and 9 bytes for a non null.
  • o In page allocation this is 896 rows per page vs 671 rows per page
  • Finally a table which has all nullable columns accepts an entry of nil data ( except the PK ) not a particularly good design concept.

 

Talking of nulls:

  • How do you deal with multiple nulls where the data must be unique?
  • For UK readers we have a National Insurance number allocated to us.
  • This number is unique and is a key part of data collected by employers, so a HR database which stores NI numbers must make sure the stored numbers ( which actually are alphanumeric ) are always unique.
  • But life isn't always like that and new employees might not immediately know their NI number so you store "Unknown" or null. Trouble is that you may only have one null or unknown in your column.
  • Prior to SQL 2008 you could solve this by using an indexed view defined on the NI number.

 

 

create view dbo.vw_People_NiNumber

with schemabinding

as

select NiNumber  from dbo.People where NINumber is not null;

--

-- now index the view

--

create unique clustered index idx_vw_People_NiNumber on dbo.vw_People_NiNumber(NiNumber);

--

-- show it has materialised

--

exec dbo.sp_spaceused  'dbo.vw_People_NiNumber';

 

 

  • This will allow you to have multiple nulls but the stored data will be unique
  • SQL 2008 provides filtered indexes to resolve this, below.

 

-- now in sql 2008

--

create unique index uk_People_NiNumber on dbo.People(NiNumber) where NiNumber is not null;

 

 

How do you create an index that spans more than one table?

  • Now there's a handy thought if I ever heard one, however, in SQL Server an index can only include columns from a single table.
  • Well that's actually not strictly true, you can do this with an indexed view.

Indexed Views to replace joins:

  • Indexed views can be useful, typically they may be used to store aggregated data.
  • As a tuning aid I will consider using an indexed view to replace tables in a join.
  • To explain, the view is used ONLY to satisfy part or all of a joined query, I don't call the view directly or wish to do so.
  • The optimiser is clever enough to recognise if an indexed view "covers" part of a query and will use it replacing the original tables.
  • This is a three table join

 

-- query to show additional data for People

--

select p.*,sp.name,cr.name

from dbo.People p join dbo.StateProvice sp on p.StateProvinceID = sp.StateProvinceID

join dbo.CountryRegion cr on cr.CountryRegionCode = sp.CountryRegionCode;

--

 

 

  • The view replaces two of the tables and the join ( StateProvince and CountryRegion )
  • When the query runs again the optimiser will use the indexed view instead of the two tables
  • ( this is better viewed on http://www.grumpyolddba.co.uk/ )

 

 

-- create view

--

create view dbo.vw_RegionStuff

with schemabinding

as

select sp.StateProvinceID, cr.CountryRegionCode,    sp.name as Province ,cr.name as Region

from dbo.StateProvice sp

join dbo.CountryRegion cr on cr.CountryRegionCode = sp.CountryRegionCode;

go

--

-- create index

--

create  unique clustered index  cix_RegionStuff on vw_RegionStuff(StateProvinceID,CountryRegionCode,Province,Region);

go

 

 

 

Posted by GrumpyOldDBA with 1 comment(s)
Filed under: , ,