Working with Slowly Changing Dimensions in SSIS
One of the very nice enhancements of SQL Server 2005 Integration Services (SSIS) compared to DTS is the introduction of the wizard for slowly changing dimensions (SCD). It makes it easy to create dimension tables storing changing data and tracking these changes. There are some blogs about that, i.e. Jamie has a nice example for a package using the wizard… http://blogs.conchango.com/jamiethomson/archive/2005/06/06/1543.aspx
However I didn’t find anything about how to use a slowly changing dimension table in the fact load process. The problem is that you have to find the right record in the dimension table which fits to the fact record. This might be an exact match but more likely you have to deal with a “best fit”.
A “classical” example would be: You have a table tracking which sales representative is responsible for a customer. The table has the following fields:
Customer_ID
Salesperson_ID
Start_Date
End_Date
So the customer with the ID 1 can be related to salesperson ID 1 from 1/1/2002 to 12/31/2003 and to salesperson ID 2 from 1/1/2004 to 12/31/2004. This would end up in two records in this table.
The SCD wizard does a great job maintaining this table (I’ll probably blog about some problems with the wizard later…). But how do you find out which salesperson was responsible for customer ID 1 when you try to process an invoice issued on 5/5/2002? You can’t use a Lookup transform for that because this needs an exact match in the dimension table and what should be the key for that match?
I see two approaches for this problem, one I already tried, one I like to evaluate later.
OLE DB Command
The first way to work with SCDs is to use an OLE DB command to do the lookup. I did that using a small stored procedure which finds out the right ID corresponding salesperson for the fact record. It receives the Customer ID and the date of the fact record. The procedure executes a SQL command finding out the “best match” for the fact record and returns the correct Salesperson_ID. Something like a
CREATE PROCEDURE dbo.P_Salesperson
@Customer_ID [int],
@date [smalldatetime],
@Salesperson_ID [int] OUTPUT
WITH EXECUTE AS CALLER
AS
select @Salesperson_ID=Salesperson_ID
from Salesperson_SCD
where isnull(Start_Date,'1900-01-01')<= @date
and isnull(End_Date,'2078-12-31')>= @date
In the OLE DB command transform you issue a
exec P_Salesperson ?,?,? output
Certainly you have to take care about the correct parameter mapping. But that’s all. It works, however it works slowly. The reason for that is that the OLE DB transform executes a SQL command per record flowing through the SSIS pipeline. This makes it very slow…
Lookup & Cross Join
I didn’t try the second approach, yet. But this one might be quite fast, however it will be a solution only for some scenarios.
The Lookup transform is very fast, much faster than executing one command per record. But how do you create a lookup table in a scenario like I described before?
The trick would be to do a cross join for the salesperson table with the date table (I assume that you have one…). So the result would be a table where you have one record per customer per day. The SQL for the lookup would be something like a
Select customer_id, date (select salesperson_id from salesperson_SCD where salesperson_SCD.customer_id = customer.customer_id and start_date<=date and end_date>=date) from customer, date
This might be quite a resultset and when it’s completely loaded into memory it might be a problem. But if that works, this approach seams to me like a very good solution.
There is another problem with this solution. You need a date table to do this cross join and the granularity of the table is the same as the referenced data. What happens when you track the SCD not by day but by second? Would you really setup a date table with records for each second? That would be a huge resultset of the cross join…
Conclusion
So what do we learn out of that? The best thing I can imagine is that Microsoft (or someone else?) provides a more flexible Lookup transform. This would accept not only exact match but also ranges (like the start_date and the end_date), “less than”, “greater than”, …. All other solutions I can imagine have their cons. Or perhaps do YOU have a better approach? I would be happy to see some feedback on this post…
Thomas