posted on Thursday, March 10, 2005 8:48 PM
by
ashvinis
SCD optimization
I ran across a scenario today where the doing the SCD on a few dozen thousand rows was taking an unreasonable amount of time (something's telling me I shouldn't be going into specifics with perf numbers).
So Wenyang, a colleague of mine in the SSIS team, and I started investigating what's up. We found two things, both of which are obvious in hind sight and discussed with a few customers but still worth writing down if just to have you make sure you're not bitten by it.
1. The SCD transform sends queries to the dimension table - one per row - to identify changes. We sped up that part 35 times by adding an index. Well, duh!
2. The second part was that in this scenario all of the rows were ending up in an OLEDB command destination with a statement that looked like:
Update [Customer] SET Age = ?, City = ? WHERE CustomerID = ?
I finally got to try out what the performance improvement was of using set based Update instead of sending an Update statement per row that came in. Changes:
a. push the data that was destined for the OLEDB Command out into a temporary table. The table had Age, City and CustomerID as columns.
b. having an execute sql task after the data flow task that had something like this:
Update [Customer] SET [Customer].Age = t.Age, [Customer].City = t.City FROM [Customer], t WHERE [Customer].CustomerID = t.CustomerID
Together with both of these changes, the performance improved by approximately 20 times.
That was the good news. Now the caveats.
i. Putting the index on the dimension table - well, it depends on how much of the incoming data will end up in the dimension table as inserts. BOL had a pretty good title on this entitled “Guidelines for Optimizing Bulk“ which I love to read periodically.
ii. For using a temporary table - this is an interesting problem. Lets first discuss what happens with the command destination. Due to the pipeline nature of the way rows are handled, SCD will merrily send SELECT statements to the dimension table while at the same time a downstream OLEDB Command might be updating the same table or an OLEDB Destination might be inserting rows into the same table. Lets take an example. Say there's one row in the dimension table for one customer:
CustomerID Age City
1 29 Bellevue
[I have no idea what font I was using earlier so lets pick one.]
Now say both Age and City are marked as type 1 and the two input rows that come in are:
CustomerID Age City
1 31 Redmond
1 32 Bellevue
SCD will think the existing row needs to be updated and send the statement to OLEDB Command destination that will first apply an update for the first row and then apply an update for the second. I end up with
CustomerID Age City
1 32 Bellevue
If I use an Update statement, though, what would happen? Am I guaranteed the order I inserted the rows in the database will win? It's probably an easy answer for a SQL Server expert. Which means I have no clue but will try to find out. If you do know, drop me a line.