Donald Farmer on SQL Server Integration Services

pamboli's naive perambulations in the world of data integration, ETL, metadata and data warehousing

<January 2009>
SuMoTuWeThFrSa
28293031123
45678910
11121314151617
18192021222324
25262728293031
1234567


Navigation

SSIS Links

Subscriptions

Post Categories



Sunday, April 03, 2005 - Posts

Business logic and data integration with SQL Server Integration Services

Loading fact tables and handling surrogate keys has been the subject of a great discussion between Marco Russo, Michael Barrett Johnson and Jamie Thomson. Their threads center around the handling of key values which cannot be found in the dimension when loading a fact table. This is an interesting and important discussion, and you'll see that they have got right into the details. You can read the posts and comments here:

 

 

Marco previously always used a star join to find dimension keys for incoming facts, but followed Jamie's advice and tried a series of pipelined lookups. There are considerable performance advantages to this approach. In fact, an SSIS developer can just about perfectly implement Ralph Kimball's Surrogate Key Pipeline subsystem which he has been recommending since 1998! See http://www.dbmsmag.com/9806d05.html (the original article has also been reused almost verbatim in his book The Data Warehouse ETL Toolkit.)

 

This pipelined system works extremely well for performance, but what to do about keys which are not found during lookup? Marco found handling the error stream cumbersome with many dimensions in your schema. Michael suggested using default values in the fact table definition, and Jamie preferred, as a matter of style, using a derived column to add an indicator value to the row to be inserted.

 

In the course of the conversation, two very important points were raised which I want to drill into in more detail, as I believe they are critical to this issue in particular, and to the design of integration processes in general. Michael calls out the important point that there may be unknown keys in the fact table due to poor data integrity in the source system and Jamie mentions in passing that he likes to manage all eventualities within his ETL process.

 

Michael's point is important because it should set off a little alarm bell for the integration developer: this is not just a technical matter but a business issue. Why is the integrity of the source system poor? Is it poor for technical reasons - perhaps there are transformation errors in the extract subsystem? Or is the quality poor for business reasons - perhaps information on new products is always delivered late to the IT department, or customer details are prone to be inaccurately recorded?

 

So in consideration of these problems, I am going to set out what I believe are two fundamental principles of integration process design:

 

  • Don't confuse business logic with technical logic;
  • If your integration process must compensate for business processes, do so in a manner that leaves an audit trail back to the business process;

 

This could end up being a whitepaper, so I'm not going to go into deep detail, but let's address them in relation to the surrogate key lookup problem that Marco, Jamie and Michael were discussing.

 

Don't confuse technical logic with business logic

Or the other way round for that matter. Your dimensional model is merely a model. Processes built on that model are only accurate to the extent that the model accurately represents reality.

 

A Product ID which is not found in the dimension table is just that - a key not found in the reference table. Do not forget that this is first and foremost a technical issue which models a business issue. There is plenty to go wrong before you can be sure that an individual technical case accurately reflects a individual business case.

 

Perhaps the incoming data quality is poor and there is an error in the key. Perhaps, if you are using a surrogate key table as Kimball suggests, the process of populating that surrogate key table from the dimension table has gone wrong. Or, if you have special logic such as a WHERE condition in the query to load the key cache, perhaps that has failed to cache the necessary key. In these, and a host of other cases, inserting a new dimension member would be a mistake - but a mistake which would bubble up into the presentation layer for business users making decisions with the warehouse.

 

A Product ID not found in a dimension table is almost certainly not an unknown product in business terms, and that can be an important distinction to make because there may be cases where dimension members are genuinely unknown. If you are a food company and a customer phones in to say that some frozen-meal, long since thrown in the trash compactor, has left them chained to the lavatory for 24 hours, you may really have a business problem with an unknown product. And if you are a bank you want to be very sure that you are not making capitalization decisions based on certificates from unknown customers.

 

You could perhaps address any potential confusion here with something as simple as choosing a good accurate term for the missing key. Unmatched, perhaps, rather than unknown. Even this has potential for problems. I well remember a case from my life before Microsoft where a customer complained to me that a subcontractor had not delivered a component despite being prompted for it. It turned out that the customer had indicated to the support team in an email that the component was outstanding - which they took to mean that he was delighted with it!

 

Then again, you could just require that consumers of data warehouse reports are aware of exactly what unknown means when it appears. But, ensuring that awareness, is itself a business process which you do not control and which may fail. The report could be forwarded to an accountant, or worse, an auditor, who has not had the training. And don't expect to shrug off the blame if a senior manager in finance makes a bad decision based on an ambiguity in your reports.

 

So it is essential that you distinguish between the technical issue of not matching a key, with the business process that may lead to that member arriving late and with other business issues which result in similar symptoms.

 

For unknown dimension members, SSIS enables special handling in the dimension load process for a boolean flag column which indicates whether the member is inferred or not. I like that term, of course, but I also like the use of a boolean flag as it easily enables table scans and filters for those members for subsequent processes.

 

If your integration process must compensate for business processes, do so in a manner that leaves an audit trail back to the business process;

In these days of increased corporate scrutiny and regulatory requirements, you want to be sure your that your integration architecture does not introduce hidden decisions that may affect your compliance. For example, your personnel database may be replicated and integrated on a schedule which is substantially slower than your sales system: perhaps monthly and daily respectively. If a sale has been made by a new salesperson before their personnel record arrives, then inserting an inferred dimension member may be technically sound, but you also need to implement a process which ensures those records are properly updated at the next refresh of the employee dimension - inferred members in SSIS can help you do that. But just as importantly you will likely need a process to ensure that if any inferred members still remain after that refresh, that this gets flagged as an issue for the business stakeholders. You do not want income recorded in your reporting system that cannot be fully accounted for.

 

In such cases I would suggest multicasting off the records which are to be inserted as inferred records and generating a report which ensures you have full and detailed history of the process which can be shared with the business stakeholders of the personnel system. In this way, they can know, in advance of the next refresh, whether these records are simply new employees who have not yet been updated, or the result of problems with the keys of existing employees.

 

I hope Jamie, Michael and Marco do not mind me jumping on the topic. I believe that, as SSIS becomes more and more established at the heart of enterprise data integration, that the business issues need to be considered in just as much detail as the technical tips and tricks. I guess I'll blogging more about this kind of thing in the future.

posted Sunday, April 03, 2005 12:52 PM by donald farmer with 2 Comments

SQL Summit in Minneapolis

Kent has already posted about the SQL Server Summit in Minneapolis: it was a great event, so I hope you don't mind me giving my own account.

 

My first trip to Minneapolis proper (previously I had only ever seen the airport) looked like it was going to get off to a good start. My flight arrived somewhat late, and it was already late evening, but the SuperShuttle staff were friendly and appeared, (deceptively) to be efficient.   The driver was listening to some Ethiopian religious music, so we got into a conversation about the wonderful Aster Aweke, one of my favourite musicians. However, from there on things went downhill rapidly. A few more people got on the shuttle and I don't know if it was the driver or the dispatcher who made the call, but we delivered them to the suburbs before dropping me off downtown. It's somewhat disconcerting to be on a downtown hotel shuttle driving through what looked like countryside in the middle of the night. I thought I was being kidnapped, but at least I now know where Minnetonka is.

 

Finally, after over an hour-and-a-half on the shuttle I got to the hotel, at 11.45. I was hungry, wanted a shower, and still had to run through my demo for the keynote in the morning. But room service in the Hyatt stops at midnight, and, bizarrely, all the hot water in the hotel was being shut off at midnight too for some repairs. So I had a lukewarm bath, worked on my demos until about 2am, and set my alarm for 6. I didn't even get to meet up with Kent for a pint. I shouldn't complain - in the morning I discovered that Paul Flessner had had an even worse trip - a cancelled flight meant he had to drive from Bloomington to Chicago in a massive hailstorm, which reduced Ilya Bukshteyn from marketing to a nervous wreck.

 

But the summit was well worth it. The local team are clearly doing an impressive job of building an enthusiastic community of SQL Users. The keynote room held about 1300 people and was packed. My sessions (on SSIS, and Analysis Services) were in rooms for 300 and were standing room only. The sponsors and exhibitors were delighted and all the speakers got good audiences who were knowledegable and appreciative.

 

Paul's keynote was a fine start to the day. There's a real candour and honesty throughout the SQL Server team about our product and our processes, and it really works to build a community of users who feel they can commit to us. Paul was able to pitch his keynote in such a way that everyone in the audience, from the VP of Unisys sitting beside me, to the overburdened DBA just behind be, recognized their own issues in what he was saying. You find this a lot in the SQL Server world - Dave Campbell and Bill Baker are experts at it, too - that the business vision is driven intricately with a detailed technical grasp of what is needed to fulfill that vision on the ground.

 

I think some SQL Server enthusiasts do not realize how rare this is. But try attending an IBM or Oracle conference, or Siebel or SAP, and you'll find that there are really two distinctive approaches there: a glossy, high level, business-driven marketing message, and a technical practical message. In the SQL Server world, it is very different: you come away with the feeling that Paul or Bill or Dave could come down from the podium and help you tune that query, or optimize that dimension hierarchy, just as readily as they could map out your enterprise's business relationship with Microsoft. And it's not just about technical knowledge - which could be merely academic - it's also about having a sense of what is needed to drive technical decisions to achieve really practical goals. This mix of vision and practicality is what attracted me to the SQL Server team in the first place, and it's good to see we still have it in bucketloads.

 

My own demo in the keynote went pretty well - the idea was to start from scratch and in only five minutes build an SSIS package, a mining model and a report to analyze some customer information from Excel into a hot leads report for a sales team.  Despite no rehearsal at all, Paul and I got the stagecraft right, I think, and I was able to complete the demo in only 4 1/2 minutes, which surprised even me. And the audience got it too, I think, seeing how BI can actually add real value to raw data easily and efficiently. Hopefully it also underscored the huge value to developers of having all these tools together in a single environment.

 

So from an inauspicious start, my visit to Minneapolis ended up being really worthwhile. I met a ton of good customers (I have about 30 detailed issues to follow up on), had good partnership conversations with the likes of Unisys, Sogeti and Panorama which will all lead to good things, met up with old friends (hi Mike from Appsmart!), had awesome audiences at my sessions, and was able to meet up with the local team who did such a great job of organizing the conference. The team also took me out to a pub in Minneapolis where I was able to have a nostalgic pint of McEwans, and my shuttle driver back to the airport turned out to be a real football fan (soccer for the yanks) so we chatted away merrily about Scotland's World Cup chances. (How long before we change our national motto from Nemo me impune lacessit to There is still a mathematical possibility we could qualify?)

 

If every one-day summit was like this, I could do 365 a year.

posted Sunday, April 03, 2005 9:00 AM by donald farmer with 2 Comments




Powered by Dot Net Junkies, by Telligent Systems