Late Arriving Facts & SSIS Contd
As discussed before if you have a type 2 dimension and late arriving facts you are in a messy situation. Most of us do it using SQL Statements.
Lets see we could do this using SSIS. One of my colleague the following way, which seems to be a very good option but SSIS data flow task might be bit messy.
Let me describe each steps.
1. Read the data from Stage using OLE DB Source.
2. Conditionally split into 2 pipeline. Conditionally split would be done on a key date column. For example if you loading orders on a daily basis and you have a control table which data you have loaded already. You could use that table as a source and map it to a variable. Then anything less than that variable would be sent down the late arriving facts pipeline, where the others will send down the default (New records) pipeline. Some of you would be happy would setting late arring facts as anythign less than 2 days old. That would work too as long as you are happy with that scenario.
3. New records pipeline would use full cache lookups to get their surrogate keys. They will run fast.
4. Late arriving facts need to do the Lookups with the range value. This will work but will be slow. The reason for that is it needs to go to database for each record. By splitting up this way we have suggested only few records should go to the database for the surrogate keys.
5. Please note if you do this on SQL 2005, it will definitely go to disk for each record. In SQL2008, you can use Partial Cache to build the cache.
Since I wrote this article Matt Masson has put a great blog together about this. You can read it here.