December 2004 - Posts

Fuzzy transformations whitepaper...

For those that are interested in the fuzzy grouping and fuzzy lookup transformations, there is an excellent whitepaper available here produced by the folks in Microsoft Research that created them.

It's a nice introduction to the technology and how to apply it.

Thanks,
Universe.Earth.Software.Microsoft.SQLServer.IS.KirkHaselden

SSIS Variable Scope - Some clarifications II

The reason we have parent package variable configurations is to decouple packages so it's possible to execute/debug the child package without the reference to parent package. However, it's still possible to have a child package modify a variable value in a parent package.

It's easy, just reference a variable in the parent package. If you create a variable on any ancestor of a container, it's visible to the child.
This still holds true for child packages. They are children of the taskhost container of the Execute Package Task that launches the child package.

I've created a couple of packages that do this. There's a deployment bundle at the link below that contains two packages, a parent package that has a variable ("FooVar") of type string and a child package that changes that variable. The parent contains a couple of script tasks that show message boxes with the values of the the variable before and after the child package changes it.

There is a connection that points to the child package, make sure when you deploy this to change the connection string property to the directory where you've dropped the child package.
Get the packages here.

Let me know how this works for you.

Universe.Earth.Software.Microsoft.SQLServer.IS.KirkHaselden

A little too much reminiscing...

Euan's post here brings back some of the memories I'd much rather forget...

The toenails oohoho, the horrible toenails.

Thanks a lot Euan! That was one part of DTS/SSIS history I was hoping to put behind us. :)

Universe.Earth.Software.Microsoft.SQLServer.IS.KirkHaselden

One configuration, multiple packages...

This question comes up a lot.

“Can you use one configuration for multiple packages?“ The answer is yes. In fact, from the start, this was the scenario that we were planning for and ultimately is the way to abstract your packages away from the differences in the machines they are running on. Here's a really simple example that shows how to configure a property on a task, in this case, the subject line of a send mail task in two packages with the same configuration file:

Here's how you do it:

  1. Create two new packages in your solution.
  2. Drop a send mail task into each one
  3. Rename them both, SendMail. I know, not much imagination. Bear with me, it's late.
  4. Change the subject lines for both Send Mail tasks to something random like “On these three breezy trees the cheezy flees flew“
  5. Right click on one of the packages and select configurations.
  6. Create a new XML configuration file and specify the subject property of the SendMail task to export to the configuration.
  7. Save the configuration somewhere.
  8. Go to the other package, right click, select package configurations, browse for the one you've just created and select it.
  9. The wizard will ask you if you want to reuse the existing one. Click on yes.
  10. Now you've got two packages that use the same configuration. So long as both tasks are named the exact same, and they remain in the same location hierarchically (so their package paths match), both will get configured from the same configuration you've created.

This will yield a configuration that should look something like this:

Configuration ConfiguredType="Property" Path="\Package\SendMail.Subject" ValueType="String"

If you try this, let me know how it works for you.

Thanks,

Universe.Earth.Software.Microsoft.SQLServer.IS.KirkHaselden

Standalone IS Installations

This came up on another thread in one of the newsgroups.

IS can be installed standalone and function without issue.* This makes it possible to segment your ETL processes away from OLTP and OLAP systems for pure ETL servers making it possible to fully dedicate all the resources of the machine to ETL. Of course, it will run on the same machine as your other servers, but if you want to dedicate a pure ETL machine, IS supports it.

Customers sometimes run complicated pipelines and a SQL Server on the same box. Both can be very memory intensive. And, we've seen cases where one starves the other of memory. The data flow task will attempt to use as much memory as it needs up to the limits of what the machine will provide. Then it will spool buffers to the temporary storage location the user specifies in the “BufferTempStoragePath“ properties on the DataFlow task. If possible, it's best to avoid this temporary caching because anytime the dataflow task does so, it has a drastic impact on throughput, performance will suffer.

*Disclaimer: I don't know about the licensing issues here. I believe those are still being worked out in other parts of the building. Those issues should be factored into any configuration decisions, of course.

Property expressions...

Part I

I've been meaning to blog a little about property expressions. A newsgroup question reminded me I need to do this.

There's a lot to talk about for Property Expressions because they're so flexible. Here's an example of how to use a property expression to configure a ForEach Enumerator with a value in a variable.

Create a variable at a scope visible to the ForEach Loop called FolderToIterate with the namespace "SSISRocks" :)
For the value, specify the folder, say "C:\data\droppedfiles\*.csv", for example.
Open the UI for the ForEachLoop by double clicking (or right click, edit)
Select the Collection tab, "For Each File Enumerator" should be selected in Enumerator
There should also be an Expressions editor there. Click on the property and then the "..." button.
You'll get another dialog with property and expression.
Specify FileSpec for the property
Then for the expression, use "@SSISRocks::FolderToIterate"
 
Now, when you run the package, it will use the folder specified in the variable.
 
I'll talk more about property expressions soon, but this is an example of how powerful they are. With SSIS, once you have a value in a variable, you can apply that value to just about any property in the package that is the same type as the value. (Unfortunately, this will not work for properties on components in the pipeline).
 
Tell me what you think. Let me know if this isn't clear. Find a good use for property expressions, something cool, interesting? Let us know.
 
Thanks,
Universe.Earth.Software.Microsoft.SQLServer.IS.KirkHaselden
 
 

Yes...

Well, we did it. We ZBB'ed. The halls filled with energy as we realized we've turned the corner on this thing and from here on out, we're truly polishing the product, getting it ready to deliver...

Thanks for your crossed fingers, feedback, testing efforts and kind words of support.

I promise the next blog will have something of substance. But for now, I'm just savoring the moment.

K

 

Inside stuff...

Well, if you haven't noticed, the SSIS dev. team have been pretty quiet on blogs and newsgroups. We're pushing for what we call ZBB, or zero bug bounce. Zero Bug Bounce is the day we reach a certain plateau of bugs. In our case, we will have no active bugs older than 7 days.

This is a big date for us and represents the first step toward shipping the product after a very, very long development cycle. This is the longest project I've ever worked on. I hope to never work on one this long again. :)

So, if we're a little slow to respond and a little thin on the blogs, this is probably why.

The bright spot is that we are extremely close. Cross your fingers for us. :)

We're going to try to hit ZBB today!

Universe.Earth.Software.Microsoft.SQLServer.IS.KirkHaselden

 

Some other blogs...

There are a few other blogs of SSIS interest.

Jamie Thomson has been very active and has some great blogs around SSIS. Notably, this one about containers explains the fundamental concept of containers: http://blogs.conchango.com/jamiethomson/archive/2004/12/13/445.aspx

Andrew Watt seems to have taken a real interest in SSIS and is very active on newsgroups and has some great explanatory posts. His blog is here: http://www.tfosorcim.org/cgi-bin/mt-tb.cgi/110

Scott Allen touches on SSIS occasionally here: http://odetocode.com/Blogs/scott/

Then there is the venerable Donald Farmer blog: http://sqljunkies.com/WebLog/donald%20farmer

 

 

 

Transactions in SSIS

If you're familiar with the way transactions are supported in DTS, it will help you to understand how transactions work in SSIS. In DTS, transaction properties were defined on the package and only one transaction was allowed per package. Steps/tasks were only in or out of a transaction. They didn't have any more options.

SSIS transactions are more flexible than that. It's possible to have more than one transaction per package and transactions scope is defined at the container level. Containers control transactions. This means that the container starts the transaction and either rolls it back or commits it. This can be a bit confusing if you don't understand that tasks are also considered containers. Actually, the taskhost is a container. It's a special container that only holds one object, the task. It's transparent in the Business Intelligence Design Studio (BIDS), but it's there and it's what controls transactions on a particular task.

Now, there are three “TransactionOptions” available per container.

  • NotSupported - The container won't enter the transaction.
  • Supported - The container will enlist in an existing transaction.
  • Required - The container will create a transaction if none exists and enlist in an existing one.

So, this begs the question, “How does one know if a transaction 'exists'?” You look at the parents. If the parent container, or the parent of the parent or any ancestor up to the package “Requires” a transaction, a transaction exists.

As usual Jamie Thomson, from Conchango, has been pushing around the product and come up with some great questions. Thanks Jamie.

He has four SQL tasks in a workflow. One creates a table, the second inserts a valid value into it, the third inserts an invalid value into it and the third inserts a valid value. He asked why the third value goes in even though the second value was invalid. Why didn't the transaction roll the insertion back?

The answer is because, for his package, there are potentially 3 different transactions. One for each insertion. To make this scenario work, in other words, to allow the failed insertion (operation) to fail the rest, you need to put all the insert SQL tasks in the same container. Drop a sequence on the designer, drop all the SQL Tasks you want in the same transaction into that sequence. Set the transaction options to required. Set the transaction option on the sequence to required and now, if any of the SQL Tasks fail, the entire transaction will fail and none of the values will remain in the table.

A few things you should remember.

  1. Make sure the DTC is started in the services control panel.
  2. Transaction scope is defined by the containing container.
  3. If a transaction is already started on a parent, the child won't start another. It will use the existing transaction.
  4. If there is no transaction already started in any ancestor, and the transaction option is set to required, the container will start a new one.
  5. If there is no transaction already started in any ancestor, and the transaction option is set to supported, no transaction will be started.

Hope this helps. Let me know if you have any questions.

Thanks,

K

Event Handlers

Way back, years now, I sat down with Euan Garden in his office for several days for literally hours and started talking about all the stuff we were going to do with DTS. Euan is a font of knowledge and those few days were like drinking from the proverbial fire hydrant. We talked about how we can make DTS better. We talked about all the problems and pain points people experienced with DTS. How much trouble it was to debug DTS. Loops and connections that transparently synchronize workflow. So many ideas flowed from those initial discussions, it's hard to qualify or quantify their importance in the design of what is now known as Integration Services. One of the problems we discussed was how to handle errors specifically and events generally inside the package. Well, Euan moved on to bigger and OK things. :) The DTS team continued thinking about this problem. We already had a workflow system, why not handle workflow events with workflow. This idea grew into what is now known as event handlers.

Event handlers are one of the lesser known features, but judging from the news groups, people are starting to realize their potential. Event handlers are documented in books online, so I won't give too much background here. But, I do want to point out a few interesting things about Event Handlers and help folks understand them a bit better.

This is the link to the books online topic.
ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/extran9/html/6f60cf93-35dc-431c-908d-2049c4ab66ba.htm

Event handlers are like small subpackages inside a package. The same could be said about all containers I suppose, but what sets event handlers apart is they only get called when an event is raised by a container or other object in the main package. For example, every task raises the PreExecute and PostExecute events. Even custom tasks that aren't explicitly coded to support events, raise events. Those events bubble up their parent and ancestor hierarchy until they reach the package. Each ancestor, in turn, gets a chance to execute their event handler. If there is no event handler defined, the event just bubbles up to the next parent. If no event handler is defined for the event anywhere in the hierarchy, the event is just returned to the event sink the client application provides. If the client application doesn't provide an event sink, the event is lost. This is how the designer gets all the information that it provides to the user while running a package. For example, when a task posts an Error event, the designer captures that event and shows the error in the progress view and the errors window.

Event handlers are containers just like a sequence or a foreach loop. They may hold other containers and tasks. They may be executed and they may succeed or fail. But, they are only executed in response to an event and aren't part of the regular workflow.

Event handlers created at package scope handle every event of that type for the entire package. An Error event handler placed on the package, will execute for every error event in the package. So, this isn't always the best approach to handling events. You should place event handlers strategically based on the event you wish to handle at the scope you wish to handle it. Even then, you may get unwanted events bubbling up to your event handler. In those cases, it's possible to control event propagation in a couple of ways. One way is to disable the propagation down the chain by setting the System::Propagate system variable on the event handlers of containers/tasks that you don't want to propagate. This is best because it's more efficient and cuts the event off at the source eliminating the fairly costly propagation through the chain. Another way is to filter at the event handler itself. You can do this by placing a script task in the event handler and filtering on the event handlers parameter system variables. Every event handler that handles events that take parameters, will have system variables that contain the values of those parameters. For example, every event handler has these system variables:System::SourceDescription, System::SourceID, and System::SourceName. If you name your containers distinctively you can effectively filter on these values. Specifically, you can use the Description property on all objects for just about anything you wish, since it's reserved for users and IS never uses that property for any purpose. You can group containers by giving them descriptions such as “Handle events: Error;PostExecute“ In your script task, you can then use this description to filter. Again, this isn't terribly efficient, but works in a pinch for some scenarios.

Play with event handlers a bit. They're very powerful. Some of the uses I've seen or heard of include handling errors, sending mail, auditing results, processing log files post execution, pre and post package execution cleanup and setup, pre-charging caches, executing contingent workflow based on variable value changes and so forth. I'd love to hear about any ways you've used event handlers, especially if you've used them to accomplish something that you couldn't have done easily in any other way.

Custom tasks may also create their own custom events. And, what's more, you may create event handlers that execute when those custom events are raised.