SSIS: On my whish list: A dynamic lookup transform
I just issued a suggestion for the next SSIS version… I decided to make it public here just to ensure that I get lots of votes for it… ;-) Here is the description and at the end of the post you’ll find the link for the Product Feedback Center where you can help to make it more important…
I know that this is quite an effort, but I guess it's worth that... Currently SSIS is very "generic" when it comes to common BI tasks. You can do virtually everything with SSIS what also means that you have do invent everything on your own on the other hand. I would be very happy if you can still work very "generic" but Microsoft provides a "standard" way to go ("best practice") that makes BI with SSIS/SQL Server much easier and faster.
One "standard" issue is the process of creating/updating dimension tables, including the assignment of surrogate keys and linking them with fact tables.
The Lookup transform currently fetches the reference data at the package start-up time. If you have i.e. a star schema to populate then you have to prepare the dimension tables first before you can work on the fact table because otherwise the lookup for surrogate keys will not work. This is even worse if you think of snowflake schemas. You have to work from the "outer leaves" table by table to get to the "base". There are workarounds for that (I blogged about that http://sqljunkies.com/WebLog/tpagel/archive/2005/08/19/16495.aspx) but they are very complex and a nightmare to maintain.
So this is my idea:
Make the Lookup transform "dynamic" by allowing updates to the underlying data. The lookup still would be populated at package start-up. But some things will be different....
-
You still define a connection manager
-
You define a table for the lookup data (better no query (only views), otherwise this would be more complicated...)
-
A mapping of the pipeline to that table (for lookups/inserts/updates)
-
Within the mapping you define
a) This is the lookup key for matching the records (may be more than one field, like today with the lookup
b) This is the data I'm looking for (the values returned from the lookup, in this case this would be the surrogate key in most cases, but it might not be limited to that). For existing data you get the data out of the cache. Otherwise the record is created. If you have a surrogate key in that table you can use an own mechanism (i.e. a script) to calculate it and populate it by mapping a pipeline field to the table's field storing the surrogate key (this would also mean that you have some kind of "feedback" that you need a new key for the next record, maybe a different output for created records or a "status" variable indicating that you have to calculate a new key) or you can use a "build-in" surrogate key which is calculated automatically (you should have an option to auto-increment the last issued surrogate key). Then the cache is automatically updated (see below).
c) This is a field I want to be saved in the table (for new records, this is not to be "cached" for lookups, it's only transferred to the database for new records, only if it's marked to be "returned", than it's cached, too! So the cache is permanently updated with new records!)
d) This is a field I want to be saved in the table (also for updates, it's still not cached but updated data will be pushed to the database (i.e. changes will be committed at the end of the package), only if it’s marked to be “returned”, than it’s cached, too (like the insert))
- Maybe c and d can be omitted if inserts/updates are always done for all mapped, this would make thinks easier for the users. Maybe you simply select if updates should be committed to the table or not…
I think this can be done by “extending” the current Lookup transform. Maybe it’s a good idea to create a new transform for it.
It might be a good idea to have that for SCD type 2, too. The SCD wizard is a great thing, however it adds complexity to the project and it lacks some functionality (surrogate key generation, updated cache, …). I would keep the wizard as it is today… I just won’t dare to ask to add SCD type 2 support to this new transform, too. It’s just giving “BI Best Practices” in an easy intuitive way to the users. This will not be a 100% solution for everyone but a 90% solution offering much benefit for a huge number of users. I didn’t think about that so much but it shouldn’t be too hard… It might be a combo box where you choose between “no updates”, “overwrite updates” and “track history”. Then you need something to “outdate” the “old” record. This might be something you define in the “role” of a field (if it’s integer, it might be a “actual” flag, if it’s datetime it might be “valid from” or “valid to” (which might come from the pipeline or might be auto assigned). This might be enough for the beginning and people could get packages much easier to maintain and the learning curve wouldn’t be as steep as today…
How should that be used?
Well, this might be something like a “Validate your pipeline against the dimension table”. So if you have a package for updating a dimension table you might have a “dynamic lookup” at the end of the data flow. Before that you prepare the dimension’s data. The “dynamic lookup” checks if the data is already present in the dimension table, assigns new surrogate keys (maybe also tracks the history if SCD type 2 is not impossible ;-) ) and updates the dimension table.
If you use it in the fact load process, you can have it to create inferred members (or late arriving dimension entries). You get that for free ;-) just while finding out the surrogate keys (foreign keys) to use in your fact table…
So if you like these ideas I would like to ask your for your vote at http://lab.msdn.microsoft.com/ProductFeedback/viewFeedback.aspx?feedbackId=FDBK43386.
Thomas