Time dimension - What should the key be?

We are currently implementing a data warehouse and the design of the time dimension was recently being done.

Firstly it was agreed that day and time should be split out. Combining the two into one dimension is nuts your dimension would be huge and unworkable.

So considering the day dimension what should the key be.

There are a number of options

1. No surrogate key just use the date (with no time component as we don't have a date datatype yet)
2. Use a generated surrogate key i.e. an identity column
3. Convert the datetime to an integer. As the decimal portion of a datetime represents the time, if you remove it then you have a number that represents the day.
4. Use an integer representation of the date as a string i.e. 1 july 2007 becomes 20070601

I am sure there are others we just considered the above.

The following are the pros and cons of each.

1.
Is simple because you don't have to translate the date in your ETL (if you don't have time in your datetimes). However a datetime is 8 bytes compared to a 4 byte integer.

2.
This requires you to lookup the relevant key in your ETL and has no meaning when used in queries so to get the date you always have to join to the time dimension. This can use a smallint, but do you realy need it

3.
This is simple because you can obtain the integer and remove the time component easily by doing floor(cast(datecol as float)). However you are basing your key on the internal representation of a datetime in SQL Server, an example of the problems you have with this is using this number in Excel. Excel starts at 31 Dec 1899 not 1 Jan 1900 and also has a different leap year at some point, which means your dates can be 2 days out. In addition to get the date information you need you need to join to the time dimension or convert the number to a datetime.

4.
This is also simple to calculate by doing, cast(convert(char(8),@d ,112) as int). Ok so the overhead is greater than 3, however the calculation still takes .00059ms compared to 0.00017ms for 3.
The big benefit is that the key is a meaningful value that makes it much easier to query.

So option 4 is my preferred route and having spoke to a number of fellow MVPs they all agreed that whilst meaningful surrogate keys are something to be avoided, this is one case where the rule should be broken.



-
Published Tuesday, June 26, 2007 9:04 AM by simonsabin

Comments

Tuesday, June 26, 2007 2:12 PM by Simon Sabin

# Re: Time dimension - What should the key be?

Thanks to Natalia in my team for highlighting that a better way is to use

year(@d)*10000 + month(@d) * 100 + day(@d)

Tuesday, June 26, 2007 7:18 PM by AdamMachanic

# re: Time dimension - What should the key be?

If it's just a date, why not use SMALLDATETIME?  Then you get the advantage of 4 bytes along with the advantages of a meaningful key and no translation.

Tuesday, June 26, 2007 9:18 PM by simonsabin

# re: Time dimension - What should the key be?

smalldatetime is limited to 2079, I personally don't like the limitation. What is needed is a date datatype :)

Wednesday, June 27, 2007 3:21 AM by dgosbell

# re: Time dimension - What should the key be?

Option 4 is what I normally use too. Dates are the *only* dimension where I use meaningful keys, it makes so many things easier and it is the one type of data where you can guarantee that the "business key" will never change.

Thursday, June 28, 2007 2:01 PM by furmangg

# re: Time dimension - What should the key be?

When I was first learning data warehousing, I was very tempted to just use a datetime or smalldatetime field in my fact tables. The thing that convinced me was when I saw the need for non-date date members. For instance, you almost always have a -1 DateKey which means Unknown. Maybe you have a -2 DateKey which means N/A. And you can also insert other rows which mean things like "before the first date we're tracking" or "after the last date we're tracking" or whatever you want. So definitely use a surrogate key int. And I agree it should be meaningful for real dates.