December 2006 - Posts

Happy Holidays!

I just want to wish everyone out there in the SQL Server community a happy, relaxing, and hopefully not at all productive holiday!  I'm betting things will be pretty quiet here at SQLBlog over the next several days, but make sure to say with us--there's a lot more content in store for 2007!

In the meantime, make sure to check out the Simple-Talk Cookbook, a collection of recipes from various members of the SQL Server community, including Andy Kelly and me. There are some interesting recipes in there, and I certainly hope mine are up to snuff!

Again, have a great holiday season, and thank you all for a fantastic 2006!!




Cross-posted from SQLBlog! - http://www.sqlblog.com


Medians, ROW_NUMBERs, and performance

A couple of days ago, Aaron Bertrand posted about a method for calculating medians in SQL Server 2005 using the ROW_NUMBER function in conjunction with the COUNT aggregate. This method (credited to Itzik Ben-Gan) is interesting, but I discovered an even better way to attack the problem in Joe Celko's Analytics and OLAP in SQL.

Rather than using a COUNT aggregate in conjunction with the ROW_NUMBER function, Celko's method uses ROW_NUMBER twice: Once with an ascending sort, and again with a descending sort. The output rows can then be matched based on the ascending row number being within +/- 1 of the descending row number.  This becomes clearer with a couple of small examples:

A

1

4

B

2

3

C

3

2

D

4

1

 

A

1

5

B

2

4

C

3

3

D

4

2

E

5

1

In the first table (even number of rows), the median rows are B and C. These can be matched based on [Ascending Column] IN ([Descending Column] + 1, [Descending Column] - 1). In the second table (odd number of rows), the median row is C, which is matched where [Ascending Column] = [Descending Column]. Note that in the second table, the match criteria for the first table does not apply -- so the generic expression to match either case is the combination of the two:  [Ascending Column] IN ([Descending Column], [Descending Column] + 1, [Descending Column] - 1).

We can apply this logic within the AdventureWorks database to find the median of the "TotalDue" amount in the Sales.SalesOrderHeader table, for each customer:

SELECT
   CustomerId,
   AVG(TotalDue)
FROM
(
   SELECT
      CustomerId,
      TotalDue,
      ROW_NUMBER() OVER (
         PARTITION BY CustomerId
         ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,
      ROW_NUMBER() OVER (
         PARTITION BY CustomerId
         ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
   FROM Sales.SalesOrderHeader SOH
) x
WHERE
   RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CustomerId
ORDER BY CustomerId;

The equivalent logic using Itzik Ben-Gan's method follows:

SELECT
   CustomerId,
   AVG(TotalDue)
FROM
(
   SELECT
      CustomerId,
      TotalDue,
      ROW_NUMBER() OVER (
         PARTITION BY CustomerId
         ORDER BY TotalDue) AS RowNum,
       COUNT(*) OVER (
          PARTITION BY CustomerId) AS RowCnt
   FROM Sales.SalesOrderHeader
) x
WHERE
   RowNum IN ((RowCnt + 1) / 2, (RowCnt + 2) / 2)
GROUP BY CustomerId
ORDER BY CustomerId;

Taking a look at the estimated execution plans for these two queries, we might believe that Ben-Gan's method is superior: Celko's algorithm requires an expensive intermediate sort operation and has an estimated cost of 4.96, compared to 3.96 for Ben-Gan's.

Remember that these are merely estimates. And as it turns out, this is one of those times that the Query Optimizer's cost estimates are are totally out of line with the reality of what happens when you actually run the queries. Although the performance difference is not especially noticeable on a set of data as small as that in Sales.SalesOrderHeader, check out the STATISTICS IO output. Celko's version does 703 logical reads; Ben-Gan's does an astonishing 140110!

There is a good lesson to be learned from this: Cost-based optimization is far from perfect!
Never completely trust what estimates tell you; they've come a long way, but clearly there is still some work to do in this area. The only way to actually determine that one query is better than another is to run it against a realistic set of data and look at how much IO and CPU time is actually used.

In this case, Ben-Gan's query probably should perform better than Celko's. It seems odd that the Query Processor can't collect the row counts at the same time it processes the row numbers. Regardless, as of today this is the best way to solve this problem... Not that I've ever needed a median in any production application I've worked on. But I suppose that's beside the point!



Cross-posted from SQLBlog! - http://www.sqlblog.com


Log Buffer #21: A Carnival of the Vanities for DBAs

Hello, there!  You’ve somehow managed to navigate your way through the blogosphere and into the 21st edition of Log Buffer.

 

It’s fitting that this is the 21st edition, because that just so happens to be the legal drinking age here in the United States. And the folks over at Oracle sure need a drink or three this week. Computerworld’s Jaikumar Vijayan reported on a study showing that Oracle has more security flaws than SQL Server. And over at IT Toolbox, Chris Eaton was nice enough to link us to the actual study, and in his post also mentioned that a security firm called Argeniss had promised to release one Oracle security flaw every day this week.

 

The Oracle Security Blog's update on the topic, posted a few days after Chris's post, reveals that the week of disclosure is not happening -- at least, not quite yet (Argeniss apparently temporarily "suspended" the project).  But don’t worry!  If you just can’t live without that sticky-sweet feeling of bliss that accompanies finding a particularly nasty hole in someone else's software, head on over to Eddie Awad's blog, where you can learn how to snarf a dangling cursor. And now I will sit back and watch as my inclusion of that phrase gets this post banned by all of my readers' corporate indecency filters. Snarf on!

 

For those serve up your customers' data to hackers on a silver platter live-on-the-edge types in the audience who have time to worry about anything non-security-related amidst all of the concerns being raised,  there were a few interesting tidbits posted this week. Edgar Hoover dished up some tips on using functional indexes in 9i. And Lucas Jellema showed us a way to avoid long strings of UNION ALL'd queries when trying to create "dummy" data. But if you're running on Linux, good luck using these tips at all! Brian Aker clued us into the fact that most installations are incredibly instable and that there just aren't enough experts around to fix the problems. "Free" operating system. Expensive consultant needed to fix it. What doesn't add up here?

 

For better or for worse, life in the MySQL camp was not nearly so exciting this week. Ronald wrote an interesting post about Pluggable Storage Engines, but to me it sounds like it's just metadata management given a snazzy new name. And over at the So What Co-operative blog, Jeff Hunter got some flack because he implied in a post that MySQL isn’t ready for prime time (i.e., no mission critical apps). For shame, Jeff! Rounding things out was a good post by Kevin Burton all about Ethernet latency and how it can affect database performance.

 

It's life as usual for the PostgreSQL and DB2 crowds. Josh the PostgreSQL lead posted not one, but two articles containing performance tips. And Willie Favero pointed out the availability of a couple of new IBM Red Books: One covers the oh-so-popular topic of SOA, and the other is all about LOB data.

 

Which brings us nicely around to the world of SQL Server. You didn’t think I'd forgotten, did you? A good post came from Denis the SQL Menace, who brought to light a not very well publicized feature of SQL Server 2005: the ability to tell the engine to update statistics asynchronously. But this week's highlight came from Mladen Prajdic, who discussed an interesting way to get high-precision performance metrics in SQL Server, via a few SQLCLR routines. The only problem with Mladen's technique is that it appears to only work on a single thread at a time. Still, it's definitely an interesting technique to consider, and a great SQLCLR sample to keep around.

 

In the SQL Server tools department, we were informed by Bart Reed of Red Gate that SQL Prompt 3 might actually be coming some day! I was in the beta for this product, and I have to say that it's looking pretty interesting -- I definitely hope that some day is sooner rather than later, so get back to work and stop blogging, Bart! Luckily, we don't have to wait for Red Gate to get some enhancements to SQL Server Management Studio. Paul A. Mestemaker II provided detailed information on how to add a custom report to Management Studio in SQL Server 2005 SP2. And low and behold, Jasper Smith has already published a very useful report, a recreation of the much-missed Taskpad View from Enterprise Manager.

 

Since not everything in our industry is product-dependent (well, it's not supposed to be), let's not forget that database design is a more or less transferable skill -- at least, if you have any clue about what you're doing. The first consideration is usually data types, but as Tom Kyte pointed out, some people just don't get the difference between strings and all of those other newfangled types (as an aside: it was an Oracle person… should we be surprised?)  Apparently it was character types vs. the world week in DB2 land as well, because Craig Mullins also weighed in with a post on the topic.

 

To close, I'd like to point out that despite Eddie Awad's concerns about laziness, I believe that all of the best computer professionals are lazy at heart. That's why we use macros, create shortcuts, and script/automate everything. We don't like working any more than we have to! And that is why I'll leave it to not-so-lazy people -- like Joe Celko -- to come up with information on techniques such as an additive congruenital method of generating values in pseudo-random order. Uh, yeah.

 

So with pseudo-randomness in mind, I bid you, kind reader, adieu. And remember to watch out for those pesky dangling cursors! There's no worse feeling than coming in on a Monday morning and settling down with that first hot cup of coffee, only to discover that you've been snarfed over the weekend.




Cross-posted from SQLBlog! - http://www.sqlblog.com