SQL Server 2005 Integration Services
SQL Server 2005 SP2 is finally available... http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=d07219b2-1e23-49c8-8f0c-63fa18f26d3a
Today I came across some issues with “Parallel Execution” within a dataflow. Did you ever use one source system for more than one data source in a dataflow? If the queries you send of are not trivial it might affect your source system that you run these queries in parallel, which is the default for Integration Services. In the last days we struggled with a dataflow sending off 10 quite complex queries to one server using up all the memory at the source. So I was looking for a solution for this problem.
However there seams to be no solution. The only way you can handle this is by splitting up the dataflow into many and save the result i.e. in RAW files which you later use instead of querying the source.
What I would appreciate is a way to control how many queries can be send off to a connection manager at one time. I opened a suggestion for that on Microsoft Connect and everybody is invited to vote for it... http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=178069
I also would like to point your attention to a suggestion Jamie posted. It’s the old issue that you can’t control how data is send to multiple destinations in a dataflow. So if you have two destinations (i.e. two tables connected by a foreign key constraint) you can’t enforce your data integrity by controlling that i.e. an order is saved before the details are added. Here’s the link to Jamie’s post and I assume that votes are appreciated as well... http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=178058
Thomas
Did you ever get this message: “Parameters cannot be extracted from the SQL command. The provider might not help to parse parameter information from the command. ...” My Avanade Colleague Marcel Franke invested some time and found out that this might be a problem with line breaks in the SQL command… Especially when you use /* comments */ in the SQL… We’re not ready for an official bug report, yet, but I came across that problem more than once and now I have an idea where to look for…
Thomas
We investigated about a solution for that… Finally my Avanade colleague Tom Lichtenstein came up with an approach weeks ago which worked perfectly… Now I finally find some time to blog about that and: well, I’m not the first one…
Did you ever want to do a Cross-Join with the Merge Join transform? There isn’t a feature like that… What you have to do is add a “dummy” column (i.e. by a Derived Column transform) in both pipelines, sort by that column and join them. That’s expensive (because of the sort) and more a hack than anything else… Perhaps you can do the Cross-Join in your SQL query of the source instead… Perhaps you can at least add the dummy column in your source queries (that works if you have two different sources you want to join) and change the sort property in the output so that SSIS things the source is already sorted by that column (so you don’t need the Sort transform anymore).
One thing you should do for sure: vote for a change in the Merge Join transform so that we’ll see a Cross-Join in the future… https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=161154
Thomas
Since nobody else on my bloglist reported this (at least I didn't notice)... Here you find the KB article with the fixed issues: http://support.microsoft.com/kb/913090/en-us. And here is the download: http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=cb6c71ea-d649-47ff-9176-e7cac58fd4bc
Wow, I really expected to see that download later... But I got some hints that it will we available in April since every vendor has to ship the version TPC benchmarks are based on a defined time after publishing the results. I'm not aware of that timeframe, but from that timeframe you could expect to see SP1 before May...
Let's see how things developed...
Thomas
I just issued a suggestion for the next SSIS version… I decided to make it public here just to ensure that I get lots of votes for it… ;-) Here is the description and at the end of the post you’ll find the link for the Product Feedback Center where you can help to make it more important…
I know that this is quite an effort, but I guess it's worth that... Currently SSIS is very "generic" when it comes to common BI tasks. You can do virtually everything with SSIS what also means that you have do invent everything on your own on the other hand. I would be very happy if you can still work very "generic" but Microsoft provides a "standard" way to go ("best practice") that makes BI with SSIS/SQL Server much easier and faster.
One "standard" issue is the process of creating/updating dimension tables, including the assignment of surrogate keys and linking them with fact tables.
The Lookup transform currently fetches the reference data at the package start-up time. If you have i.e. a star schema to populate then you have to prepare the dimension tables first before you can work on the fact table because otherwise the lookup for surrogate keys will not work. This is even worse if you think of snowflake schemas. You have to work from the "outer leaves" table by table to get to the "base". There are workarounds for that (I blogged about that http://sqljunkies.com/WebLog/tpagel/archive/2005/08/19/16495.aspx) but they are very complex and a nightmare to maintain.
So this is my idea:
Make the Lookup transform "dynamic" by allowing updates to the underlying data. The lookup still would be populated at package start-up. But some things will be different....
-
You still define a connection manager
-
You define a table for the lookup data (better no query (only views), otherwise this would be more complicated...)
-
A mapping of the pipeline to that table (for lookups/inserts/updates)
-
Within the mapping you define
a) This is the lookup key for matching the records (may be more than one field, like today with the lookup
b) This is the data I'm looking for (the values returned from the lookup, in this case this would be the surrogate key in most cases, but it might not be limited to that). For existing data you get the data out of the cache. Otherwise the record is created. If you have a surrogate key in that table you can use an own mechanism (i.e. a script) to calculate it and populate it by mapping a pipeline field to the table's field storing the surrogate key (this would also mean that you have some kind of "feedback" that you need a new key for the next record, maybe a different output for created records or a "status" variable indicating that you have to calculate a new key) or you can use a "build-in" surrogate key which is calculated automatically (you should have an option to auto-increment the last issued surrogate key). Then the cache is automatically updated (see below).
c) This is a field I want to be saved in the table (for new records, this is not to be "cached" for lookups, it's only transferred to the database for new records, only if it's marked to be "returned", than it's cached, too! So the cache is permanently updated with new records!)
d) This is a field I want to be saved in the table (also for updates, it's still not cached but updated data will be pushed to the database (i.e. changes will be committed at the end of the package), only if it’s marked to be “returned”, than it’s cached, too (like the insert))
- Maybe c and d can be omitted if inserts/updates are always done for all mapped, this would make thinks easier for the users. Maybe you simply select if updates should be committed to the table or not…
I think this can be done by “extending” the current Lookup transform. Maybe it’s a good idea to create a new transform for it.
It might be a good idea to have that for SCD type 2, too. The SCD wizard is a great thing, however it adds complexity to the project and it lacks some functionality (surrogate key generation, updated cache, …). I would keep the wizard as it is today… I just won’t dare to ask to add SCD type 2 support to this new transform, too. It’s just giving “BI Best Practices” in an easy intuitive way to the users. This will not be a 100% solution for everyone but a 90% solution offering much benefit for a huge number of users. I didn’t think about that so much but it shouldn’t be too hard… It might be a combo box where you choose between “no updates”, “overwrite updates” and “track history”. Then you need something to “outdate” the “old” record. This might be something you define in the “role” of a field (if it’s integer, it might be a “actual” flag, if it’s datetime it might be “valid from” or “valid to” (which might come from the pipeline or might be auto assigned). This might be enough for the beginning and people could get packages much easier to maintain and the learning curve wouldn’t be as steep as today…
How should that be used?
Well, this might be something like a “Validate your pipeline against the dimension table”. So if you have a package for updating a dimension table you might have a “dynamic lookup” at the end of the data flow. Before that you prepare the dimension’s data. The “dynamic lookup” checks if the data is already present in the dimension table, assigns new surrogate keys (maybe also tracks the history if SCD type 2 is not impossible ;-) ) and updates the dimension table.
If you use it in the fact load process, you can have it to create inferred members (or late arriving dimension entries). You get that for free ;-) just while finding out the surrogate keys (foreign keys) to use in your fact table…
So if you like these ideas I would like to ask your for your vote at http://lab.msdn.microsoft.com/ProductFeedback/viewFeedback.aspx?feedbackId=FDBK43386.
Thomas
Perhaps not everyone knows that it’s very easy to call stored procedures from a Data Flow and return values into the pipeline.
Just take this procedure:
CREATE PROCEDURE dbo.sp_test
@param1 int,
@param2 int,
@param3 int output
AS
BEGIN
select @param3=@param1+@param2
END
GO
Well, quite trivial, but it should work as an example…
So just imagine that you have two integer fields in your pipeline and want to have these added. The result should be in the pipeline for later use (i.e. to write it into a table). So what do you have to do?
Just add a field to the pipeline (i.e. use a Derived Column transform for that assigning a default value to it).
Then you can add an OLE DB transform to call the stored procedure. The command will look like this:
exec dbo.sp_test ?,?,? output
Then you simply have to do the column mapping for the three parameters. And that’s it…
I didn’t do any performance testing on that, but it would be quite interesting how good this performs… I’m quite curious if calling a stored procedure (perhaps even some C#/VB.net code) is faster than calling a VB.net script…
Thomas
Just a quick note: I got the confirmation that my colleague Markus and/or I will be presenting at Pass Europe in February (www.sqlpass.org/events/europe/2006). We will be talking about ETL in the enterprise, how we at Avanade do it and with some ideas how you can do it, too. I hope we’ll find some time to talk about thinks you should avoid, too... Just some traps we stepped in and some hints to get around them…
The sessions are not finished, yet, you can see a list here: http://www.sqlpass.org/events/Europe/2006/Sessions.cfm.
Chris Webb will be there, too (http://spaces.msn.com/members/cwebbbi), I guess that will get some very interesting days…
Thomas
I wrote about that some weeks ago (http://sqljunkies.com/WebLog/tpagel/archive/2005/07/24/16195.aspx). Using a OLE DB command transform to find out a matching record instead of a lookup transform when you don’t have an exact match but i.e. a range to compare is very slow. Using a cross join might be a solution (I still didn’t have time to test it…), but produces huge datasets to load into memory. So no perfect solution…
Some days ago my colleague Markus was in the BI Migration Lab in Redmond. It was a great event and we got some very valuable input there. Runying Mao came up with a very neat solution for this issue…
Back to the example of the employee<->department relationship. You have a fact table with employee IDs and another table having the employee ID and the department the employee is working in. Since employees can change their department you have a start and end date for this employee<->department assignment. Now you want to know for each fact record in which department the employee was at the given time.
Runying’s solution is to do an inner join between the fact and the relationship table. If you have an employee who was in different departments you get the same fact record for each department the employee ever was. So the only thing you have to do is filter out (by a conditional split) the “non fitting” records, what means the records where the date from the facts doesn’t fit into the data range coming from the relationship table.
Cool solution… Perhaps this is a good scenario for my next snippet…
Thomas
Well, what a headline... So many BI buzz words, wow, what a solution this has to be... It's nothing special, not very trivial, but also not very hard to understand... at least I hope so...
From a post in the Microsoft Technical Forums for SSIS (http://forums.microsoft.com/msdn/ShowPost.aspx?PostID=75211) I got the idea to write my second snippet which shows how to create new dimension records from a fact table (so you get facts for a dimension record which doesn't exist, yet) and handles all the surrogate key stuff you need...
So what does this package do? It's based on Marco's post http://sqljunkies.com/WebLog/sqlbi/archive/2005/05/30/15684.aspx about how to work with surrogate keys. This is a nice example for loading dimensions and assigning surrogate keys on the fly.
I took this and placed some transforms around it (see a screenshot here: http://tpagel.gmxhome.de/snippets/keygen.jpg). There is a source and I assign the surrogate keys by doing a "merge join" with the dimension table. The problem is now that the dimension table might have some records missing. The solution is that I split up the source by a "multicast". The one side only does a "sort" to prepare for the "merge join". The other path is aggregated by the application key. I lookup the dimension table to find out the dimension key for these application keys. If that fails I generate a new surrogate key (thanks to Marco's code) and add the record with this new key to the dimension table. At the same time ("multicast" again...) I join this dimension record (with a "union all") with the already existing records from the lookup.
So now I have both, the existing records who got their surrogate key from the dimension table and the new dimension records just created. I sort them by the application key and then I can merge them with the facts using an "inner join".
The result is an updated dimension table with new records (including surrogate keys) from the fact table and the fact table itself with the right surrogate keys assigned.
I hope that this helps people to understand SSIS a little better. I'm quite sure that this is no perfect design but it works...
You can download the package (including the backup of a very simple sample database) here: http://tpagel.gmxhome.de/snippets/keygen.zip. Just add a few records to the "factSource" table, run the package and see what happens to the "dim" and "factDestination" table...
Comments are (as always) very appreciated...
Thomas
SSIS brings a very cool feature with it... Expressions help you very much to dynamically change the behaviour of your package. There are already some posts about them around (i.e. search for “expressions” on Kirk’s blog http://sqljunkies.com/WebLog/knight_reign) so I don't want to talk too much about the feature itself. What I'm just thinking about (and posted on the technical forum http://forums.microsoft.com/msdn/ShowPost.aspx?PostID=74473) is a way to find out which property is affected by an expression... OK, you can just look in the list of expressions (the place where they are defined), but just imagine you get a package and you wonder about why it behaves differently from what you see on the first sight... A very simple example would be a SMPT task sending a different message than defined in it's properties. After a while you think "well, that could be an expression" and... yes, there is one defined. So why don't make thinks easier by marking properties affected by expressions i.e. with a different colour or font. That would help a lot...
I'll pass that to Microsoft, too... Perhaps someone will take care of that...
Thomas
Since “SSIS Nuggets” is a registered trademark ;-) by Jamie (http://blogs.conchango.com/jamiethomson) I had to invent my own brand of small SSIS samples you are invited to use in your own projects. I hope I’ll find some time to provide more snippets in the future…
So what do you get in this snippet: this is an example how to delete old files (created a given time ago) in a directory. My intension to build this snippet was that I didn’t find any way to “age out” backup files. In SQL Server 2000 you could define that backup files are deleted after i.e. 2 weeks. I’m not aware of any equivalent in SQL Server 2005 (perhaps you know how?!). Other uses of this snippet include deleting old log files or whatever you imagine…
This package has a “ForEach Loop” which looks for files in a directory (you define the path and the wildcard for the files to find in this component). When it finds one or more matching files a script task calculates the age of the file by reading out the creation date and comparing it with the actual date. The result is stored in a variable. Based on this variable a “Filesytem Task” is called (or not). This is done by an expression based constraint. So you can define how old the files to be deleted should be by changing the expression in the properties of the link between the script and the Filesystem task.
You can download this SSIS snippet here: http://tpagel.gmxhome.de/snippets/deleteoldfiles.zip
Comments are always appreciated…
Thomas
One of the very nice enhancements of SQL Server 2005 Integration Services (SSIS) compared to DTS is the introduction of the wizard for slowly changing dimensions (SCD). It makes it easy to create dimension tables storing changing data and tracking these changes. There are some blogs about that, i.e. Jamie has a nice example for a package using the wizard… http://blogs.conchango.com/jamiethomson/archive/2005/06/06/1543.aspx
However I didn’t find anything about how to use a slowly changing dimension table in the fact load process. The problem is that you have to find the right record in the dimension table which fits to the fact record. This might be an exact match but more likely you have to deal with a “best fit”.
A “classical” example would be: You have a table tracking which sales representative is responsible for a customer. The table has the following fields:
Customer_ID
Salesperson_ID
Start_Date
End_Date
So the customer with the ID 1 can be related to salesperson ID 1 from 1/1/2002 to 12/31/2003 and to salesperson ID 2 from 1/1/2004 to 12/31/2004. This would end up in two records in this table.
The SCD wizard does a great job maintaining this table (I’ll probably blog about some problems with the wizard later…). But how do you find out which salesperson was responsible for customer ID 1 when you try to process an invoice issued on 5/5/2002? You can’t use a Lookup transform for that because this needs an exact match in the dimension table and what should be the key for that match?
I see two approaches for this problem, one I already tried, one I like to evaluate later.
OLE DB Command
The first way to work with SCDs is to use an OLE DB command to do the lookup. I did that using a small stored procedure which finds out the right ID corresponding salesperson for the fact record. It receives the Customer ID and the date of the fact record. The procedure executes a SQL command finding out the “best match” for the fact record and returns the correct Salesperson_ID. Something like a
CREATE PROCEDURE dbo.P_Salesperson
@Customer_ID [int],
@date [smalldatetime],
@Salesperson_ID [int] OUTPUT
WITH EXECUTE AS CALLER
AS
select @Salesperson_ID=Salesperson_ID
from Salesperson_SCD
where isnull(Start_Date,'1900-01-01')<= @date
and isnull(End_Date,'2078-12-31')>= @date
In the OLE DB command transform you issue a
exec P_Salesperson ?,?,? output
Certainly you have to take care about the correct parameter mapping. But that’s all. It works, however it works slowly. The reason for that is that the OLE DB transform executes a SQL command per record flowing through the SSIS pipeline. This makes it very slow…
Lookup & Cross Join
I didn’t try the second approach, yet. But this one might be quite fast, however it will be a solution only for some scenarios.
The Lookup transform is very fast, much faster than executing one command per record. But how do you create a lookup table in a scenario like I described before?
The trick would be to do a cross join for the salesperson table with the date table (I assume that you have one…). So the result would be a table where you have one record per customer per day. The SQL for the lookup would be something like a
Select customer_id, date (select salesperson_id from salesperson_SCD where salesperson_SCD.customer_id = customer.customer_id and start_date<=date and end_date>=date) from customer, date
This might be quite a resultset and when it’s completely loaded into memory it might be a problem. But if that works, this approach seams to me like a very good solution.
There is another problem with this solution. You need a date table to do this cross join and the granularity of the table is the same as the referenced data. What happens when you track the SCD not by day but by second? Would you really setup a date table with records for each second? That would be a huge resultset of the cross join…
Conclusion
So what do we learn out of that? The best thing I can imagine is that Microsoft (or someone else?) provides a more flexible Lookup transform. This would accept not only exact match but also ranges (like the start_date and the end_date), “less than”, “greater than”, …. All other solutions I can imagine have their cons. Or perhaps do YOU have a better approach? I would be happy to see some feedback on this post…
Thomas