February 2005 - Posts

Data warehousing, data quality, and data vs. information

An interesting day in all of the e-newsletters I subscribe to.

SQLServerCentral's "Database Daily" newsletter included a link to this article, which states that 50% of all data warehousing projects are doomed to failure! Why? "The main problems ... centre on a lack of attention to data quality issues."

This is interesting, in that one of the first things I learned when studying about data warehousing is that second letter in "ETL" -- TRANSFORM. Part of that transformation is data clensing! The warehouse itself should be the final arbiter of what data gets in, and what data gets out -- and "garbage in/garbage out" certainly applies. It's humorous to me that SQL databases have had features like CHECK constraints for so long, and yet they're so rarely used in most shops. How many constraints do you have on the tables in your current project? Are you really constraining to make sure no trash gets in, or are you letting the application handle some of the burden?

On to the SQL Server Worldwide Users Group newsletter, which included three interesting links dealing with data vs. information. What is data, and when does it become information? Some very good reading here:

Interesting that both newsletters posted such related topics. Why is it that we have all of these data warehouses being developed with no conception of what a data warehouse is? In my opinion, a data warehouse is nothing more -- or less -- than a tool by which data becomes information. Of what use is dirty, flawed information? And how will it help business owners make the kinds of sound decisions that will drive up your paycheck? Data quality is of utmost importance in any database project -- not just a data warehouse -- yet many database professionals I encounter never implement CHECK constraints, and many seem to fear even basic declarative relational integrity (foreign key constraints? natural primary keys?)

I'm saddened by what I see as a trend towards lower data quality, not the other way around. As more software engineers jump into database work without paying attention to the foundations, without thinking about data as more than just a collection of bytes, this problem will continue to intensify. So if you have some time today, read some of these articles and spend a bit of time thinking about data quality within your organization. There is no reason that we as database professionals shouldn't be doing all we can to combat bad data, at every possible level.

Teetering on the edge: TSQL "reflection"

I've just spent the afternoon writing a new UDF -- yes, I certainly know how to have a fun weekend! This particular UDF has caused me to almost lose my mind, but I'm quite pleased to announce that it seems to work well at this point... And I didn't quite lose it.

This UDF returns the text of any routine (UDF/stored procedure/trigger). Sounds simple, but there were a few interesting twists along the way.

So check it out! And stay tuned for why I would bother creating something like this...

SQL Server 2005 editions and feature comparisons announced

Microsoft has announced the editions and feature comparisons for SQL Server 2005.

Interesting is the introduction of a "Workgroup" edition, which appears to replace SQL Server 2000's "Personal" edition as one step above the MSDE/SQL Express version. Also announced is a SQL Server 2000 version of the Workgroup edition, but I can't find any details on what exactly that includes. Stay tuned for further updates....

New script posted: Script out PK/UNIQUE constraints and referencing FKs

To help ease some of the annoyance of modifying primary keys heavily referenced by other tables, I've created this script.

Hopefully it will help someone...

SQL Server 2005: Query Plan Guides

Thanks to Bob Beauchemin for pointing out the new query plan guides feature in SQL Server 2005.

This is a very cool sounding feature. Apparently we will be able to define query hints for queries WITHOUT modifying the queries themselves (i.e. stored procedures -- OR ad hoc SQL, I'm hoping). This means that:

A) Stored procedures will not have to be edited for different environments. If a query hint is needed in a certain environment, it can be added without modifying the code -- and without risk of damaging the code's integrity and symmetry with the code checked in to the source control system (you do use source control, right?).

B) If ad hoc SQL is being generated from applications, adding query hints will not require a recompile. And again, this will greatly assist with configuration within different environments.

Based on my current understanding of this feature, I'm really looking forward to it. Hopefully the implementation will be as good as it sounds!

New UDF - Pattern-based split string

I just posted a new UDF in the library. This one is a modification of the split string UDF previously posted, but takes a pattern for the delimiter instead of a single character. This was written to handle requests such as, "how do I find all the distinct words in a text column?"

The UDF can be found here.

Bitmask Handling, part 4: Left-shift and right-shift

Quick installment this time. Left-shift and right-shift operators.

Left-shift and right-shift are integral to binary mathematical operations as they have two important qualities: Left-shifting a bitmask once multiplies by two. Right-shifting once divides by two. For example:

0011 (base 2) = 1 + 2 = 3

3 << 1 = 0110 (base 2) = 4 + 2 = 6

-- Note that << is the C++ left-shift operator

Or we could go the other way and divide:

0110 (base 2) = 4 + 2 = 6

6 >> 1 = 0011 (base 2) = 2 + 1 = 3


-- But what happens if we do a second right-shift?

3 >> 1 = 0001 (base 2) = 1


-- Now we've lost a bit (it "fell off the edge") -- causing rounding.
-- Luckily, that's the same way SQL Server treats this situation:

SELECT 3 / 2 AS [3_Right_1]


3_Right_1
---------
1

So now you've seen the basics of left-shifting and right-shifting. But how easy is it to implement given the bitmask handling framework already established in previous installments?

Very, very easy!

Remember that 0011 (base 2) is 3 (base 10) or 0x0003 (base 16), and has bits 1 and 2 turned on. Left-shifting once should produce 0110 / 6 / 0x0006 -- bits 2 and 3 turned on.

DECLARE @Bitmask VARBINARY(4096)
SET @Bitmask = 0x0003

DECLARE @LeftShiftNum INT
SET @LeftShiftNum = 1


SELECT Number + @LeftShiftNum AS Number
FROM dbo.splitBitmask(@Bitmask)


Number
------
2
3

And that's literally all there is to it. Right-shift is just as easy, but we subtract:

DECLARE @Bitmask VARBINARY(4096)
SET @Bitmask = 0x0006

DECLARE @RightShiftNum INT
SET @RightShiftNum = 1


SELECT Number - @RightShiftNum AS Number
FROM dbo.splitBitmask(@Bitmask)


Number
------
1
2

Right-shifting twice will produce an out-of-bounds bit, 0. But that's not an issue, because the bitmask re-constitution pattern uses the bitmaskNumbers table, which conveniently doesn't contain a 0. A bit of accidental foresight on my part, perhaps.

I have nothing more to say on this issue. Plug everything back into the re-constitution pattern and we're done:

CREATE FUNCTION bitwiseLeftShift
(
	@Bitmask VARBINARY(4096),
	@LeftShiftNum INT
)
RETURNS VARBINARY(4096)
AS
BEGIN
	DECLARE @BitsInBitmask TABLE(Number SMALLINT)
	INSERT @BitsInBitmask
	SELECT Number
	FROM
	(
		SELECT Number + @LeftShiftNum AS Number
		FROM dbo.splitBitmask(@Bitmask)
	) x
	
	SET @Bitmask = 0x
	
	SELECT @Bitmask = @Bitmask +
		CONVERT(VARBINARY(1), 
			SUM(CASE
				WHEN x.Number IS NULL THEN 0
				ELSE BitmaskNumbers.BitValue
				END)
			)
	FROM @BitsInBitmask x
	RIGHT JOIN BitmaskNumbers ON BitmaskNumbers.Number = x.Number
	WHERE BitmaskNumbers.Byte <=
		(SELECT
			CASE MAX(Number) % 8
				WHEN 0 THEN (MAX(Number) - 1) / 8
				ELSE  MAX(Number) / 8
			END + 1
		FROM @BitsInBitmask)
	GROUP BY BitmaskNumbers.Byte
	ORDER BY BitmaskNumbers.Byte DESC

	RETURN(@Bitmask)
END
GO


-- Here are some examples to prove that everything works:


SELECT dbo.bitwiseLeftShift(0x03, 1) AS [3_times_two]
GO


3_times_two
-----------
0x06



SELECT dbo.bitwiseLeftShift(0x03, 3) AS [3_times_eight]
GO


3_times_eight
-------------
0x18



SELECT CONVERT(int, 0x18) AS [int_0x18]


int_0x18
--------
24



SELECT dbo.bitwiseLeftShift(0x18, 12) AS [24_times_4096]


24_times_4096
-------------
0x018000


SELECT CONVERT(int, 0x018000) AS [int_0x018000]


int_0x018000
------------
98304



SELECT 98304 / 4096 AS [this_will_be_24]


this_will_be_24
---------------
24

... and the same for right-shift:

CREATE FUNCTION bitwiseRightShift
(
	@Bitmask VARBINARY(4096),
	@RightShiftNum INT
)
RETURNS VARBINARY(4096)
AS
BEGIN
	DECLARE @BitsInBitmask TABLE(Number SMALLINT)
	INSERT @BitsInBitmask
	SELECT Number
	FROM
	(
		SELECT Number - @RightShiftNum AS Number
		FROM dbo.splitBitmask(@Bitmask)
	) x
	
	SET @Bitmask = 0x
	
	SELECT @Bitmask = @Bitmask +
		CONVERT(VARBINARY(1), 
			SUM(CASE
				WHEN x.Number IS NULL THEN 0
				ELSE BitmaskNumbers.BitValue
				END)
			)
	FROM @BitsInBitmask x
	RIGHT JOIN BitmaskNumbers ON BitmaskNumbers.Number = x.Number
	WHERE BitmaskNumbers.Byte <=
		(SELECT
			CASE MAX(Number) % 8
				WHEN 0 THEN (MAX(Number) - 1) / 8
				ELSE  MAX(Number) / 8
			END + 1
		FROM @BitsInBitmask)
	GROUP BY BitmaskNumbers.Byte
	ORDER BY BitmaskNumbers.Byte DESC

	RETURN(@Bitmask)
END
GO


--Right-shifting the same number we just left-shifted 12 bits should yeild the same input

SELECT dbo.bitwiseRightShift(0x018000, 12) AS [should_be_0x18]


should_be_0x18
--------------
0x18



--Is overflow handled the same way that SQL Server handles it?

SELECT 
	CONVERT(INT, dbo.bitwiseRightShift(0x018000, 16)) AS [overflow],
	98304 / 65536 AS [98304_divide_65536]

overflow	98304_divide_8192
--------	-----------------
1		1


--Apparently :)

Enough for now. Next installment, the long awaited mathematical operators. Special thanks to Steve Kass for smacking me around a bit in the comments section of the last installment and teaching me how to properly implement signed numbers in a binary system. So the next installment will actually be able to deal with negative integers too. Thanks, Steve!!!

VB.NET SQL Server stress tool

I'm not sure how I missed this one -- it was posted in April, 2004 -- but I will pass it along here since I'm sure others will find it useful, too.

Alberto Venditti shares a simple VB.NET stress utility for SQL Server 2000. It's kind of similar to the Database Hammer that came with the SQL 2000 resource kit, but looks a lot easier to set up and use.

Great job, Alberto!

Want to create a file with some SQL Server output?

First read this article by Zach Nichter of the Scalability Experts. Some interesting performance tests.

What I would really like to see is one more test, vs. an extended stored procedure, although perhaps that doesn't qualify as a solution that most people would employ for solving the problem...

Anyway, a good read at any rate.

Bitmask Handling, part 3: Logical operators

It's been longer than I hoped since my last installment on bitmask / big number handling. Life caught up with me and I've had many thankless tasks to catch up on. But that's over now and I'm back to the general slacking that typifies my days, so welcome to Part 3, handling logical operators.

I'll be discussing four operators in this post. AND, OR, XOR, and NOT. The first three are extremely easy given the framework already built out in the previous two posts. The last one has some problems -- so I'll discuss those first.

I haven't been able to find any mathematical or computer science texts that discuss how to deal with variable-length bitmasks, which is what I'm attempting to define here. Most texts keep the discussion to, at most, two or four bytes, and even then those two or four bytes are stable. But the problem with using a variable number is that it creates some inconsistencies with regards to signing. Take this example:

SELECT 1 AS Oops
WHERE 
	-1 <> 
	CONVERT(INT, CONVERT(VARBINARY, CONVERT(SMALLINT, -1)))


Oops
----
1

So what's happening here? The -1 on the right side of the comparison is being cast into a 2-byte SMALLINT, then converted to VARBINARY (which produces 0xFFFF). Then it's being converted back to INT. But guess what..?

SELECT CONVERT(INT, 0xFFFF) AS Arrgh


Arrgh
-----
65535

So what is SQL Server telling us? That a small -1 is not the same as a bigger -1. -1 <> -1!!!

You're probably asking yourself, "what is this guy talking about, and what does any of this have to do with the topic at hand, implementing a binary NOT operation?" And if that's what you were asking yourself, then good job, because you've asked the correct question. So what do they have to do with each other?

The truth-table for NOT is quite simple:

InputResult
01
10

But let's delve a little deeper. The SMALLINT representation of the decimal number 1 in hex is 0x0001. In binary, that's 0000000000000001. If you run that through a logical NOT, the result is 1111111111111110, or 0xFFFE. That's -2. But remember, that's only -2 if you're a SMALLINT. If you're either an INT or a BIGINT, it's 65534. And that's just not consistent. I want to know that any equivalent number into my function yeilds an equivalent number on the way out. So NOT(0x000001) should yeild the same result as NOT(0x000000000001).

... and that result, in the function I provide, will be: 0xFE. One byte in, one byte out. Similar to some prison gang mottos, but that's a topic for a later post.

You'll notice that this is similar to the way I handled the bitmask re-constitution in the last post. So I feel pretty good about this. Sparsity is a good thing.

But this has a very important side-effect. These numbers are now officially and permanently un-signed. We can't deal with sign because we can't know which byte corresponds to the highest byte -- that's variable. And since we can't determine the highest byte, we also can't determine the highest bit in that byte, and so can't know whether or not our number is negative.

But I can live with that. And I hope you can, too. And if you can't, write a solution and send it to me and I'll post it.

So on that note, and without further ado, here's how you figure out which bit positions should be output by a NOT operation:

DECLARE @Bitmask VARBINARY(4096)
SET @Bitmask = 0x01

SELECT x.Number 
FROM BitmaskNumbers x
LEFT JOIN dbo.SplitBitmask(@Bitmask) y ON y.Number = x.Number
WHERE y.Number IS NULL
	AND x.Byte <= 
		(SELECT MAX(Byte)
		FROM BitmaskNumbers z
		WHERE z.Number =
			(SELECT MAX(Number)
			FROM dbo.SplitBitmask(@Bitmask)))


Number
-------
2
3
4
5
6
7
8

Pretty simple, really: Split the bitmask and take any numbers within the same byte range that aren't in the bitmask. To reconstitute it, simply modify the reconstitution pattern a bit, stuff it all into a function, and you get:

CREATE FUNCTION bitwiseNot
(
	@Bitmask VARBINARY(4096)
)
RETURNS VARBINARY(4096)
AS
BEGIN
	DECLARE @BitsInBitmask TABLE(Number SMALLINT)
	INSERT @BitsInBitmask
	SELECT Number
	FROM dbo.splitBitmask(@Bitmask)
	
	SET @Bitmask = 0x
	
	SELECT @Bitmask = @Bitmask +
		CONVERT(VARBINARY(1), 
			SUM(CASE
				WHEN x.Number IS NOT NULL THEN 0
				ELSE BitmaskNumbers.BitValue
				END)
			)
	FROM @BitsInBitmask x
	RIGHT JOIN BitmaskNumbers ON BitmaskNumbers.Number = x.Number
	WHERE BitmaskNumbers.Byte <=
		(SELECT
			CASE MAX(Number) % 8
				WHEN 0 THEN (MAX(Number) - 1) / 8
				ELSE  MAX(Number) / 8
			END + 1
		FROM @BitsInBitmask)
	GROUP BY BitmaskNumbers.Byte
	ORDER BY BitmaskNumbers.Byte DESC

	RETURN(@Bitmask)
END
GO


SELECT dbo.BitwiseNot(0x01) AS Not01


Not01
-----
0xFE

And of course, given the properties I described above:

SELECT dbo.BitwiseNot(0x0000000001) AS Not0000000001


Not0000000001
-------------
0xFE

And now on to the other three logical operations, which are much simpler...

The easiest is OR, which has the following truth table:

+01
001
111

And what is that similar to, in relational parlance..? A UNION, perhaps?

SELECT Number
FROM
(
	SELECT Number
	FROM dbo.splitBitmask(0x01)

	UNION

	SELECT Number
	FROM dbo.splitBitmask(0x03)
) x

... and how about exclusive OR (XOR)?

+01
001
110

Similar to the UNION, but we only want intersections with exactly one bitmask position... Luckily, SQL is equipped for that:

SELECT Number
FROM
(
	SELECT Number FROM dbo.SplitBitmask(0x01)

	UNION ALL

	SELECT Number FROM dbo.SplitBitmask(0x02)
) x
GROUP BY Number
HAVING COUNT(*) = 1

Finally, the AND operation:

+01
000
101

Just like XOR, but you need exactly two bit positions in each intersection:

SELECT Number
FROM
(
	SELECT Number FROM dbo.SplitBitmask(0x01)

	UNION ALL

	SELECT Number FROM dbo.SplitBitmask(0x02)
) x
GROUP BY Number
HAVING COUNT(*) = 2

Putting it all together, I present the following OR, XOR, and AND UDFs:

OR

CREATE FUNCTION bitwiseOr
(
	@Bitmask1 VARBINARY(4096),
	@Bitmask2 VARBINARY(4096)
)
RETURNS VARBINARY(4096)
AS
BEGIN
	DECLARE @BitsInBitmask TABLE(Number SMALLINT)
	INSERT @BitsInBitmask
	SELECT Number
	FROM
	(
		SELECT Number
		FROM dbo.splitBitmask(@Bitmask1)

		UNION

		SELECT Number
		FROM dbo.splitBitmask(@Bitmask2)
	) x
	
	SET @Bitmask1 = 0x
	
	SELECT @Bitmask1 = @Bitmask1 +
		CONVERT(VARBINARY(1), 
			SUM(CASE
				WHEN x.Number IS NULL THEN 0
				ELSE BitmaskNumbers.BitValue
				END)
			)
	FROM @BitsInBitmask x
	RIGHT JOIN BitmaskNumbers ON BitmaskNumbers.Number = x.Number
	WHERE BitmaskNumbers.Byte <=
		(SELECT
			CASE MAX(Number) % 8
				WHEN 0 THEN (MAX(Number) - 1) / 8
				ELSE  MAX(Number) / 8
			END + 1
		FROM @BitsInBitmask)
	GROUP BY BitmaskNumbers.Byte
	ORDER BY BitmaskNumbers.Byte DESC

	RETURN(@Bitmask1)
END


SELECT dbo.bitwiseOr(0x01, 0x03) AS Or_01_03


Or_01_03
--------
0x03

XOR

CREATE FUNCTION bitwiseXOr
(
	@Bitmask1 VARBINARY(4096),
	@Bitmask2 VARBINARY(4096)
)
RETURNS VARBINARY(4096)
AS
BEGIN
	DECLARE @BitsInBitmask TABLE(Number SMALLINT)
	INSERT @BitsInBitmask
	SELECT Number
	FROM
	(
		SELECT Number
		FROM dbo.splitBitmask(@Bitmask1)

		UNION ALL

		SELECT Number
		FROM dbo.splitBitmask(@Bitmask2)
	) x
	GROUP BY Number
	HAVING COUNT(*) = 1
	
	SET @Bitmask1 = 0x
	
	SELECT @Bitmask1 = @Bitmask1 +
		CONVERT(VARBINARY(1), 
			SUM(CASE
				WHEN x.Number IS NULL THEN 0
				ELSE BitmaskNumbers.BitValue
				END)
			)
	FROM @BitsInBitmask x
	RIGHT JOIN BitmaskNumbers ON BitmaskNumbers.Number = x.Number
	WHERE BitmaskNumbers.Byte <=
		(SELECT
			CASE MAX(Number) % 8
				WHEN 0 THEN (MAX(Number) - 1) / 8
				ELSE  MAX(Number) / 8
			END + 1
		FROM @BitsInBitmask)
	GROUP BY BitmaskNumbers.Byte
	ORDER BY BitmaskNumbers.Byte DESC

	RETURN(@Bitmask1)
END


SELECT dbo.bitwiseXOr(0x01, 0x03) AS XOr_01_03


XOr_01_03
--------
0x02

... And finally, AND

CREATE FUNCTION bitwiseAnd
(
	@Bitmask1 VARBINARY(4096),
	@Bitmask2 VARBINARY(4096)
)
RETURNS VARBINARY(4096)
AS
BEGIN
	DECLARE @BitsInBitmask TABLE(Number SMALLINT)
	INSERT @BitsInBitmask
	SELECT Number
	FROM
	(
		SELECT Number
		FROM dbo.splitBitmask(@Bitmask1)

		UNION ALL

		SELECT Number
		FROM dbo.splitBitmask(@Bitmask2)
	) x
	GROUP BY Number
	HAVING COUNT(*) = 2
	
	SET @Bitmask1 = 0x
	
	SELECT @Bitmask1 = @Bitmask1 +
		CONVERT(VARBINARY(1), 
			SUM(CASE
				WHEN x.Number IS NULL THEN 0
				ELSE BitmaskNumbers.BitValue
				END)
			)
	FROM @BitsInBitmask x
	RIGHT JOIN BitmaskNumbers ON BitmaskNumbers.Number = x.Number
	WHERE BitmaskNumbers.Byte <=
		(SELECT
			CASE MAX(Number) % 8
				WHEN 0 THEN (MAX(Number) - 1) / 8
				ELSE  MAX(Number) / 8
			END + 1
		FROM @BitsInBitmask)
	GROUP BY BitmaskNumbers.Byte
	ORDER BY BitmaskNumbers.Byte DESC

	RETURN(@Bitmask1)
END


SELECT dbo.bitwiseAnd(0x01, 0x03) AS And_01_03


And_01_03
--------
0x01

... And that's enough for today's installment. Enjoy..!

DB2, CLR hosting, and DB2 fans up in arms...

Michael Otey discusses the fact that IBM beat Microsoft to CLR hosting in a database product. That's right, DB2 can host CLR stored procedures and functions in its latest version. SQL Server 2005 is yet to be available. So IBM beat MS.

But really, who cares? DB2, Oracle, PostgreSQL, and other platforms have had this kind of functionality for years. Extenders, cartridges, blades, or whatever people call them. Usually implemented in Java. So is this really the coup d'etat that IBM wants you to think it is? Probably not. Microsoft was already clearly behind in this technology area and is now catching up.

But the real entertainment isn't from Otey's article. Rather, check out the comments at the bottom. Disgruntled anti-Microsoft people angry with Otey for pointing out flaws in the DB2 implementation.

Maybe they'd be happier if they hadn't had to use Java for so long...

SQL Server 2000 metadata article

Mindy Curnutt shares this article on SQL Server Central, introducing some concepts for using the SQL Server 2000 metadata repository.

This is one of my favorite almost-hidden and almost totally unused features in SQL Server 2000. I have also never used it, but since SQL Server 2000 came out I've been thinking about various ways that it would make life so much easier. Alas, deadlines always get in the way and so most shops never have any data dictionary, especially not one implemented within the database!

Anyway, hopefully this article will get some more people thinking about how to use this feature in interesting ways. A really cool idea would be for someone out there to write a series of Reporting Services reports that would output the metadata info in a nice format... Any takers on that?