January 2005 - Posts

SQL Injection Attacks...

Just read this I found from a link on Vinod's blog.

It's a paper on SQL Injection attacks, how they're done and how to thwart them.

Great read!

Universe.Earth.Software.Microsoft.SQLServer.IS.KirkHaselden

WMI in SSIS

A new task in SSIS is the WMI Task. Actually, there are two tasks. One is the WMI Event Watcher task and the other is the WMI Query Task. They are fundamentally different tasks even though they use the same underlying technology, WMI.

WMI is Windows Management Instrumentation. Way back when I was working in the eHome group, I found out about it the hard way. I had just written the fundamental infrastructure for a mechanism to discover other machines on the home network and to discover information about the machines like machine name, hard drive space, disk drives and attached printers etc. Then someone came along and said, hey, you should check out this WMI stuff. It's pretty cool. He wrote a little web page that worked against the WMI scripting model in about half a day that did exactly what I had taken a week to do. And, he already had a web page UI. :(

Well, that was my rude awakening to WMI. I hated it then cuz it meant all my hard work was down the drain and I really was looking forward to that project. WMI can be used to find out just about anything about a machine on a network. Right down to the MAC address and remaining hard drive space available to the kinds of expansion cards in the machine.

I've built a package that uses the WMI event task to do something really simple. Basically, it sits and waits on a directory for a file to be created. And then it returns. I've put it all in a loop that could well be a forever loop. This is not considered best practice, of course. It's just a simple demonstration to show the WMI task in action. Basically, it waits for a file to show up in the E:\temp folder. When it does, it returns and a script task is executed, then it returns to waiting for the next file to show.

You can get the package here.

For those that would rather just build the package themselves, here's the WMI Query:

SELECT * FROM __InstanceCreationEvent WITHIN 10 WHERE Targetinstance ISA "CIM_DirectoryContainsFile" and TargetInstance.GroupComponent= "Win32_Directory.Name=\"e:\\\\temp\""

You can change the directoryname to something different, but it must be formatted this way.

Now, this doesn't wait until the file is completely written. So for very large files, if you attempt to write or otherwise open the file too soon, it will fail. The way you may work around this is to have whatever process copied the file, to also copy another file with the same name, zero length and a different extension after the actual file is transferred. Then, you can enumerate or open any file with the same name as the zero length file using the correct extension and be assured that the file is completely copied.

Give it a try, let me know what you think. See if you can find other uses for the WMI Task. Here are some we've thought of:

  • Detecting low memory conditions at the beginning of a package and alerting via email to the issue
  • Detecting full disk volume conditions before beginning large copy operations
  • Detecting email ariving to a particular recipient with a particular subject
  • Detecting the availability of another server machine before attempting to move files
  • Waiting for an existing memory intensive process to complete before beginning another

WMI is very powerful, but mostly what keeps people from using it is the learning curve. It's substantial for sure, but provides some really robust and elegant ways of doing things that are otherwise quite difficult or as I discovered a few years ago, require a lot of coding to accomplish.

Universe.Earth.Software.Microsoft.SQLServer.IS.KirkHaselden

Well, I did it...

I signed the contract...

Ragged right clarification

In my SQL Server Magazine article on the Import/Export Wizard, I used ragged right format for importing a flat file with the wizard. I mentioned ragged right format but didn't give a clear enough explanation for some people as is evident from the emails I've received asking for clarification. I'll try to clarify a little better here some of the advantages and attributes of each file format.

Fixed Width
First let's start with fixed width format files. Fixed width files, as their name implies, use the same width for all the columns regardless of the contents of those columns. Typically, this meens that there is a lot of wasted space for columns with data that varies, like names for example. So, these types of files tend to be a little larger on disk than delimited or ragged right. If you care, they're also more difficult to view in notepad or other text editors and more difficult to get right when importing into Excel, Access or other desktop data software. So why would someone use this format? Well, if you don't know why, you probably don't need to use it. Those who do need it know that they need it because they're using legacy software or mainframe software that takes the format. It's usually for backwards compatibility.

Delimited
Delimited files use a delimiter like a comma or tab to separate the columns and typically use a character combination to delimit rows/records. This format is usually the easiest to view in notepad etc. and is very easy to import into software packages like Excel because the delimiter makes it very obvious where the columns are. These are much more compact than fixed width files typically. The only case where these files will be larger is if the data in the fixed width file is exactly conformed and uniform in all columns. In other words, it's the same size for all columns, then there is no need for the extra delimiter or the row/record delimiters. This is rare. There is almost always at least one non-uniformly sized column. Which brings us to Ragged Right.

Ragged Right
Ragged right files are similar to fixed width except, they use a delimiter to mark the end of a row/record. This is a compromise between the two file types that eliminates the column delimiters and makes the file easier to work with, for example, when importing to a different data application or when viewing in a text editor. The difference between ragged right and fixed width means that the last column may be of variable size. Ragged right uses a row/record delimiter, usually <cr><lf>, to mark the end of a row. The row delimiter makes the file easier to deal with in common text editors and when importing, it's easier to find the end of the row. So it's ideal for cases where you only have one non-uniform column etc. or if you need such a compromise to save space.

File Sizes
Of course it really depends on the data, but generally speaking, delimited files come in smaller because less space is wasted on whitespace. There are cases like I mentioned above where that isn't true. I personally like working with delimited files the most, especially comma delimited files. Again, there's a reason for having so many different formats and depending on your requirements, one will work better than another. Many times it comes down to what your legacy software will support. Then you don't have a choice in the matter.

Package
I've created a package that illustrates the differences. It uses deployment, so you can configure it before you run it to point to the correct location for dropping the output flat files. Basically, it sends the employees table through a multicast to three flat file destination adapters. Each one uses one of the three different formats described here. Playing with the column widths in the flat file connection managers gives you a good feel for how setting column widths effects the resulting size and format of the files. I encourage you to use a larger dataset as well and get a feel for how changing the column types effects the performance.
Here's what the package looks like:

And here is where you can download the package.

Let me know how it goes.

Universe.Earth.Software.Microsoft.SQLServer.IS.KirkHaselden

 

Script task bug, nit and trick

In beta 2 and IDW 11 there is a bug in the script task that causes some odd behavior after so many iterations inside a for loop. Some of you have already experienced this. Essentially, the script task stops with an error like:

Error: There was an error while compiling the script: -2146226146

What's happening under the hood is the script task relies on a product called VSA or Visual Studio for Applications. The VSA engine uses a COM object and the shell and some other components are managed. When the script task gets put into a tight loop, the COM object gets created repeatedly but not released because managed code doesn't free the object quick enough and a system resource gets starved. (Garbage collection)

That's the bug. In IDW 13 there is a fix for this and you will no longer have to deal with the issue.

However, there's a nit that's worth mentioning with the script task. It's pretty fast once it compiles the code, but if the script task is in a tight loop, say as in a ForLoop container, the compiling step takes a substantial amount of time slices and slows down the ForLoop. Normally, you wouldn't notice this, like say, if you had a script task in the package outside a loop etc. However, it's noticeably slow in a ForLoop.

That's the nit. In IDW 13 there is a fix for this and you will be able to speed up the script task substantially.

Sergei, our wunderkind developer who wrote the task has added a property called PreCompile to the script task. If you set that property to true, it will precompile the code and save it as part of the package. The advantage is that the script task will run, in some scenarios, 10 times faster. Or, at least a tight loop with a script task in it as I've attached here. This package running on my machine took ~98 seconds to run. With precompile turned on, it took ~10 seconds.

The disadvantage is that it make the package larger. The attached package is roughly 73K, so I wouldn't recommend using this feature if you don't need to speed up a tight loop with a script task in it.

That's the trick. When you get IDW 13, give this a try. It's amazing how much faster it runs.

Universe.Earth.Software.Microsoft.SQLServer.IS.KirkHaselden

Comments : ON

Who gets the first spam?

It's a race Kent...

Package Configurations II

In an earlier blog I wrote a little about package configurations. IDW11 (December CTP) had limited support for SQL Configurations. That will be changing in the next CTP build (IDW13), slated for sometime in Feb. You will no longer need to script a table for the configurations. Support will be built into the Package Configuration wizard. It looks something like this:
Package Configuration Wizard

You'll still need to insert the configuration values as described in the earlier blog post, but the wizard enhancement should make it much easier to set these up.

Someone also asked how the configuration name is used. It's placed in a ConfigurationFilter column in the SQL Configuration Table. Essentially, the SQL Configuration uses this to query the table and extract a subset of configurations. Using this field you can create multiple configurations for the same package or set of packages. Or, use it to simply group configurations for a particular solution etc.

Look for this change in the IDW13 build and let us know what you think.

Thanks,

Universe.Earth.Software.Microsoft.SQLServer.IS.KirkHaselden

 

Font sizes in BIDS

A feature that is perhaps a little difficult to find in the BI Dev. Studio is how to modify the size and style of the font used for displaying the names and other text. 

  • You can change the font size by selecting Tools | Options and under the environment tab in the Options dialog, select Fonts and Colors.
  • In the Show settings for combo box, select Business Intelligence Designers.
  • Change the size of text with the Size combo box.

If you're struggling with package formatting, this may help.

Also, another semi-hidden feature in the designer is the ability to “auto-connect” component as you double click on it in the toolbox. In the same dialog, select the Business Intelligence Designers tab and select the Data Transformation Designers node. Those settings allow you to automatically connect the new task/transform to the existing selected task/transform.

Note also the Package Signing node there. That's a whole 'nother blog...

Universe.Earth.Software.Microsoft.SQLServer.IS.KirkHaselden

Outer objects, inner objects, and the IS object model...

Hitachi consulting has a blog here. They talk a little about the IS object model and refer to its complexity. Notably they talk about the inner vs. outer objects.

Anyone that starts coding against the IS object model will see this almost immediately. I'm going to try to give some real quick pointers here about it, explain some of the rationale behind the inner/outer object design. Later in this blog I'll try to drill down on it a bit more.

The inner/outer object discussion is really only related to the runtime. The pipeline doesn't support wrapping objects for transformations because of a number of reasons, most notably, to make transforms extremely light and fast so that the overall pipeline will be fast. Looks like it worked. :)

So, when discussing inner and outer objects, we're really talking about the runtime, which controls everything you see in the workflow window in the designer and a few other things like logging, connections and events, etc. It's easiest to think about outer objects from the perspective of the inner objects, the pluggable components. When you the user, create a custom task, or foreachenumerator, you put properties on it that the runtime cannot know about. In other words, the runtime may discover that the custom component has certain properties, but it cannot know what those properties are for, usually. On the other hand, there are features that you'd like your component to always take advantage of. Say, you'd like to have your task support the OnPreExecute breakpoint. But, that code is somewhat complicated and, besides, why should you have to write it if Microsoft has already written it for their components. Why doesn't Microsoft just provide a way for you to expose that breakpoint so you don't have to write all that code. Well this is the beginnning of the rationale for having outer objects. We call them hosts. Essentially, they are wrapper objects that perform a lot of default behaviour for stock and custom components.

Custom components or stock components (Ones Microsoft writes and provide in the box) are all instatiated inside a host. That is, all components that may be created by third parties, explicitly, Tasks, Log Providers, Connection Managers, and ForEachEnumerators are hosted. The rationale behind this is many fold, but it begins with the premise that Microsoft wished to make it very simple to write custom components, (The notion of a platform). With the host pattern, Microsoft is able to do things like provide a common implementation of property expressions, breakpoints, properties collections, transaction interaction, variable scoping, persistence and other features without one line of code in the custom component.

So, when working in the IS object model (Still IDTS... interfaces however), the outer or host objects for tasks and other components will always be the same. Generally, you should never need to access the inner objects since you can configure them through the properties collections on their hosts. But, in the case where you do need to modify the custom/stock component directly, you can use the "InnerObject" property on all hosted objects to retrieve a reference to it.

For tasks, the host carries the additional burden of being a special container. It inherits some of the same behaviours as sequences, loops and the package. It provides variables scope and transaction scope for example. This makes it possible to do things like the Execute Package task because the container hierarchy principle holds true even for the task.

Understanding the rationale and principles behind the design I hope will help to better understand how to work with the object model. Keeping the notion of inner and outer object in mind, it should be easier to see how to create programs programatically etc.
Thanks to the folks over at Hitachi for a great segue. :)

Universe.Earth.Software.Microsoft.SQLServer.IS.KirkHaselden

Normalizing flat files with Integration Services...

A common question asked on newsgroups is how to normalize a flat file with DTS.

With DTS the answer was usually, you load the file into a table and then perform queries on the table to extract the colums into normalized tables. For large flat files this can be a real pain because you must use a different set of tools to perform a simple operation and it may be unnecessarily time consuming because you have to load the data before you can even begin to transform it. Yes, ELT.  Finally, you would need to use one process to extract the data and another to transform it, decentralizing the process.

It would be much better if it was possible to normalize the data as it flows through the pipeline for a one time load sequence. With SSIS, you can transform the data as you load it, ETL.

For example, maybe a flat file contains header/detail information in the same row. Likely, since the flat file isn't normalized there's some redundant header data. The way to do this in Integration Services is by splitting the flow from a flat file source adapter. For the simple case, where you only need to split the flat file into two tables, use the multi-cast to split the flow into two. On the one flow, the one to extract header information, use the sort t\ransform to remove duplicates. Then, use a lookup that references the header destination table. If the header data