Thomas Pagel BI Blog

Microsoft BI Technology & more

<September 2008>
SuMoTuWeThFrSa
31123456
78910111213
14151617181920
21222324252627
2829301234
567891011


Navigation

SSIS Blogs

SSRS Blogs

SSAS Blogs

Other SQL Server Links

Subscriptions

Post Categories



Working with Slowly Changing Dimensions in SSIS

One of the very nice enhancements of SQL Server 2005 Integration Services (SSIS) compared to DTS is the introduction of the wizard for slowly changing dimensions (SCD). It makes it easy to create dimension tables storing changing data and tracking these changes. There are some blogs about that, i.e. Jamie has a nice example for a package using the wizard… http://blogs.conchango.com/jamiethomson/archive/2005/06/06/1543.aspx

However I didn’t find anything about how to use a slowly changing dimension table in the fact load process. The problem is that you have to find the right record in the dimension table which fits to the fact record. This might be an exact match but more likely you have to deal with a “best fit”.

A “classical” example would be: You have a table tracking which sales representative is responsible for a customer. The table has the following fields:

Customer_ID

Salesperson_ID

Start_Date

End_Date

So the customer with the ID 1 can be related to salesperson ID 1 from 1/1/2002 to 12/31/2003 and to salesperson ID 2 from 1/1/2004 to 12/31/2004. This would end up in two records in this table.

The SCD wizard does a great job maintaining this table (I’ll probably blog about some problems with the wizard later…). But how do you find out which salesperson was responsible for customer ID 1 when you try to process an invoice issued on 5/5/2002? You can’t use a Lookup transform for that because this needs an exact match in the dimension table and what should be the key for that match?

I see two approaches for this problem, one I already tried, one I like to evaluate later.

OLE DB Command

The first way to work with SCDs is to use an OLE DB command to do the lookup. I did that using a small stored procedure which finds out the right ID corresponding salesperson for the fact record. It receives the Customer ID and the date of the fact record. The procedure executes a SQL command finding out the “best match” for the fact record and returns the correct Salesperson_ID. Something like a

CREATE PROCEDURE dbo.P_Salesperson

      @Customer_ID [int],

      @date [smalldatetime],

      @Salesperson_ID [int] OUTPUT

WITH EXECUTE AS CALLER

AS

select @Salesperson_ID=Salesperson_ID

from Salesperson_SCD

where isnull(Start_Date,'1900-01-01')<= @date

and isnull(End_Date,'2078-12-31')>= @date

In the OLE DB command transform you issue a

exec P_Salesperson ?,?,? output

Certainly you have to take care about the correct parameter mapping. But that’s all. It works, however it works slowly. The reason for that is that the OLE DB transform executes a SQL command per record flowing through the SSIS pipeline. This makes it very slow…

Lookup & Cross Join

I didn’t try the second approach, yet. But this one might be quite fast, however it will be a solution only for some scenarios.

The Lookup transform is very fast, much faster than executing one command per record. But how do you create a lookup table in a scenario like I described before?

The trick would be to do a cross join for the salesperson table with the date table (I assume that you have one…). So the result would be a table where you have one record per customer per day. The SQL for the lookup would be something like a

Select customer_id, date (select salesperson_id from salesperson_SCD where salesperson_SCD.customer_id = customer.customer_id and start_date<=date and end_date>=date) from customer, date

This might be quite a resultset and when it’s completely loaded into memory it might be a problem. But if that works, this approach seams to me like a very good solution.

There is another problem with this solution. You need a date table to do this cross join and the granularity of the table is the same as the referenced data. What happens when you track the SCD not by day but by second? Would you really setup a date table with records for each second? That would be a huge resultset of the cross join…

Conclusion

So what do we learn out of that? The best thing I can imagine is that Microsoft (or someone else?) provides a more flexible Lookup transform. This would accept not only exact match but also ranges (like the start_date and the end_date), “less than”, “greater than”, …. All other solutions I can imagine have their cons. Or perhaps do YOU have a better approach? I would be happy to see some feedback on this post…

 

Thomas

posted on Sunday, July 24, 2005 1:10 PM by tpagel


# SSIS: Lookups with no exact match @ Wednesday, August 31, 2005 10:53 AM

I wrote about that some weeks ago (http://sqljunkies.com/WebLog/tpagel/archive/2005/07/24/16195.aspx)....

Anonymous




Powered by Dot Net Junkies, by Telligent Systems