Time dimension - What should the key be?
We are currently implementing a data warehouse and the
design of the time dimension was recently being done.
Firstly it was agreed that day and time should be split out. Combining the
two into one dimension is nuts your dimension would be huge and unworkable.
So considering the day dimension what should the key be.
There are a number of options
1. No surrogate key just use the date (with no time component as we don't
have a date datatype yet)
2. Use a generated surrogate key i.e. an identity
column
3. Convert the datetime to an integer. As the decimal
portion of a datetime represents the time, if you remove it then you
have a number that represents the day.
4. Use an integer representation
of the date as a string i.e. 1 july 2007 becomes 20070601
I am sure there are others we just considered the above.
The following are the pros and cons of each.
1.
Is simple because you don't have to translate the date in your ETL (if
you don't have time in your datetimes). However a datetime is 8 bytes compared
to a 4 byte integer.
2.
This requires you to lookup the relevant key in your ETL and has no
meaning when used in queries so to get the date you always have to join to the
time dimension. This can use a smallint, but do you realy need it
3.
This is simple because you can obtain the integer and remove the time
component easily by doing floor(cast(datecol as float)). However you are basing
your key on the internal representation of a datetime in SQL Server, an example
of the problems you have with this is using this number in Excel. Excel starts
at 31 Dec 1899 not 1 Jan 1900 and also has a different leap year at some point,
which means your dates can be 2 days out. In addition to get the date
information you need you need to join to the time dimension or convert the
number to a datetime.
4.
This is also simple to calculate by doing, cast(convert(char(8),@d
,112) as int). Ok so the overhead is greater than 3, however the calculation
still takes .00059ms compared to 0.00017ms for 3.
The big benefit is that the
key is a meaningful value that makes it much easier to query.
So option 4 is my preferred route and having spoke to a number of fellow MVPs
they all agreed that whilst meaningful surrogate keys are something to be
avoided, this is one case where the rule should be broken.
-