May 2005 - Posts

Surrogate key generation in SSIS

In the last few days I tested several ways to generate surrogate keys when loading a dimension into a data mart.

The classical approach with SQL Server is to rely on a INT IDENTITY column. It works fine, but you don't know what surrogate key has been generated for a given application key until you read the row from the table. Not a big issue with DTS, because you can't do very much (in a way faster than SQL itself) even if you know this number, but with SSIS if you want to push performances to the limit, you need to load a fact table within the dimension table(s) without reading what you just writed into a table.

The first step in this direction is to generate surrogate keys inside a SSIS package. If you do a complete dimension load at each run, you can solve the whole problem using a scripting component inside a data flow task. Read this article from Jamie Thomson to see how you can implement it. But life is never as easy as you could desire, so when you have to incrementally load a dimension table then you need to give to this script the right initial value (the last ID currently used instead of 0).

After several approaches discussed on the newsgroup, I came to the conclusion that the most affordable way to implement it is a not-so-desirable one:

  • define a Execute SQL Task which execute a SELECT COALESCE( MAX( ID_Dimension ), 0 ) AS LastID FROM Dimension and put the single row ResultSet into a variable (let's say we use User::ID_Dimension)

  • define a Data Flow Task in the same way Jamie did in his article

  • put the User::ID_Dimension variable into the ReadOnly Variables collection in the Script Component

  • define PreExecute method inside the script to initialize the value for counter you will use to enumerate new surrogate keys (see code below)

Public Class ScriptMain

    Inherits UserComponent

    Dim counter As Int32

    Public Overrides Sub Input_ProcessInputRow(ByVal Row As InputBuffer)

        counter = counter + 1

        Row.IDDimension = counter

    End Sub

    Public Overrides Sub PreExecute()

        MyBase.PreExecute()

        counter = Variables.IDDimension

    End Sub

End Class

That's it. You can download the sample package here.

Now, don't pretend this solution to be faster than a SQL Server INT IDENTITY column... by now, it isn't, but we can't judge this with current CTPs. But don't desperate: it could be better in the future and, more important, if you really need to use the generated keys in other transformations you could already take advantage from this architecture.

Why it's a not-so-desirable way? Well, I don't like to rely on a variable defined in the control flow while all the dimension-load logic is inside a data flow task. Imagine how ugly could become a SSIS package if you load 20 or 30 dimensions; of course you can group two or more tasks together, but I consider this only as a work-around. There is another issue if you would like to update your somewhere-in-the-package variable with the last generated surrogate key, because you can't use the same variable as ReadOnly and ReadWrite (in the Script Transform) and you can't read a variable passed as ReadWrite outside the PostExecute method (I'm right: you can only read/write it inside PreExecute overridden method, you can read it in the PreExecute methos - as in my sample - only if variable is passed as ReadOnly).

I thought other ways to find a better solution (most of them has been suggested by kindly people on the newsgroup):

  • Script transformation with two inputs: not supported (I would like to pass the real data source as Input1 and the last used ID as Input2)
  • Cross Join between two inputs (defined as the previous idea)
  • Custom data flow task (with one input and a property with table/column to look for the last ID) - it needs to be written

There are many other possible variations on those themes. But I have to say that SSIS could be improved very much to make life easier for the BI developer: a real "definitive" surrogate key generation task could be the right answer to this kind of needs.

Cascading many-to-many dimensions: unsupported scenario?

I already opened a bug and written on newsgroup about this specfic scenario, but I still haven't received an answer. This morning I met Chris Webb in Munich (I'm attending SQL Pass Europe) and he told me that one year ago someone of MS told him that it's an unsupported scenario. It would be good to have a comment more or less official from someone of MS, because at this time the behavious of SSAS is very disappointing (you could design a model that really give you wrong data). The following is the description of the scenario.

I just posted a bug on this, anyway I'd like to know your opinion: is the "cascading many-to-may dimensions" a supported scenario?

Let me explain.
If I define two cascading many-to-many dimensions, the higher aggregation dimension doesn't work well and doesn't crossjoin measures with other dimensions.

Scenario.
You have a fact table (Sales) that links a Products dimension.
Each product has one or more Categories.
Each category has one or more market.
If you define a model of this, Cube wizard doesn't model well (should it be another bug?), but if you manually define relations between dimensions and fact table (and factless fact tables), Analysis Services doesn't warn you about a possibly unsupported scenario. So I think it should works, but it doesn't.
You can dice and slice cube across Product and Category dimensions, but if you try to use Market dimension you obtain NULL measures if define a member other than All for Product and Category, or you receive the total for All Products and All Categories for each Market member, independently from the Market you choose.

I tested it on April CTP.

Who wants to comment that?

UPDATE 2005-08-22: It will be supported, read my new post!

Database snapshots to feed a data warehouse?

I just attended the pre-conference of Kimberly Tripp at SQL Pass 2005 Europe. The topic was "New SQL Server 2005 features that a DBA need to learn" and I took advantage of this day to cover an area of SQL2005 I still haven't explored.

One consideration that worth a mention on this blog is the possible use of Database Snapshots to get a consistent state of a OLTP database and transforms its data into a data warehouse. Database Snapshot eliminates the possible locking scalability problem, while it still maintains the load on the OLTP server during query. If this kind of load is unacceptable for your user (but I imagine a lot of cases where, once eliminated locking problems, this is not a real thing to worry about) you can still use a replica of principal database (or a log shipping copy). A SQL2005 feature which apparently could be useful is the Database Mirroring in asynchronous configuration (so you don't slow down the principal database) but my understanding is that this configuration can't permit a parallel synchronous configuration (necessary if you want to achieve high availability and data protection) so I presume that a traditional approach still maintains some advantage. Database Mirroring is really orientated to offer data protection feature more than a database copy useful to build reporting solutions.

Comments about this are welcome.