Exploring the secrets of intermediate materialization
When working with SQL Server 2000, I used to have this little trick I'd pull out after exhausting all other ideas for tuning a query. And I thought that my little trick was dead in SQL Server 2005, but thanks to fellow SQL Server MVP
Rob Farley, I am officially reviving my trick from the dead here and now, in this blog post.
... But first, let's start with an example query. Here's the scenario: You work for AdventureWorks, and management has asked you to create a report to find out how many peers each employee in the company has. You see, AdventureWorks management seems to believe that if two employees are
managed by the same person, they must have exactly the same job
function, and they can do each others' jobs equally well. So what they want to do is find out which employees have too many peers (might as well downsize some of that extraneous fluff), and at the same time find out which employees, should they be hit by a bus tomorrow, could be immediately substituted for by a colleague. Now, whether or not management's belief is utterly moronic or not is beyond the scope of this post, so dash any such thoughts from your head until you've read to the end, and then resume pondering along those lines, which I'm sure will end up putting a smile on your face.
But smiles are for later. At this point I've managed to go off on a horribly involved tangent, so let's get back to the query at hand:
SELECT
x.EmployeeId,
COUNT(*) AS TheCount
FROM HumanResources.Employee x
JOIN HumanResources.Employee y on y.ManagerId = x.ManagerId
GROUP BY x.EmployeeId
What we're doing here is finding all employees managed by the same manager, and then taking a count of those employees. Yes, it would have made more sense to simply find out how many employees are managed by each manager, but that's not what management asked for, and management clearly thinks better than you do. So go run the report!
But what does any of this have to do with query tuning tricks, you ask (while tidying up your resume a bit)? To answer that question, let's take a quick peek at the I/Os our query is using, in addition to the query plan. First, a baseline for the I/Os:
SELECT *
FROM HumanResources.Employee
Result, obtained via Profiler: 20 logical reads. OK, and now the test query in question: 827 logical reads. Quite a jump for a query that only uses the one table -- we're clearly wasting a lot of resources. And looking at the query plan, it's obvious we can do better. An outer table scan, looped to find the count for each employee -- that's a lot of index operations.
A common way to start tuning this kind of query is to move the aggregation into a derived table. After peering at this query for a while, one might come to the conclusion that there's no reason to aggregate on ManagerId more than once per manager. Why do it once per employee?
SELECT
x.EmployeeId,
y.theCount
FROM HumanResources.Employee x
JOIN
(
SELECT
p.ManagerId,
COUNT(*)
FROM HumanResources.Employee p
GROUP BY p.ManagerId
) y (ManagerId, theCount) ON y.ManagerId = x.ManagerId
Seems better, but upon running it, we see that it produces 819 logical reads and almost exactly the same query plan. Not much of an improvement. And alas, there's not much more we can do here. There just aren't too many ways to skin this query, and each of them requires some kind of loop to get the count, either implied or otherwise... Right?
And now we're almost to "dirty trick" territory. But let's first try a not-so-dirty trick. A temp table might eliminate some of the overhead, right? Then we'll only have to query the base tables once...
SELECT
p.ManagerId,
COUNT(*) AS theCount
INTO #y
FROM HumanResources.Employee p
GROUP BY p.ManagerId
SELECT
x.EmployeeId,
y.theCount
FROM HumanResources.Employee x
JOIN #y y ON y.ManagerId = x.ManagerId
207 logical reads. Quite an improvement! But the temp table is still using a nested loop, and a merge would be so much nicer, wouldn't it? A MERGE JOIN hint drops the number of reads to 115, but I still feel that we can do even
better.
Now in SQL Server 2000 at about this point in my query tuning excercise, I might try forcing
intermediate materialization of the derived table, sans the temp table, by using TOP 100 PERCENT in conjunction with ORDER BY. Unfortunately, the SQL Server query optimizer team
decided that this wasn't a good idea, and the optimizer now ignores such attempts.
And until earlier today, I thought the game was over. Until I was reminded by Rob that TOP takes a number of rows in addition to a percent. The trick, then? Use a bigger number of rows than you'll ever actually get back... Say, the maximum value for SQL Server's INTEGER type (2147483647)?
By applying TOP and ORDER BY within the derived table, we can force SQL Server to perform intermediate materialization of the results. And by playing with the ORDER BY properly, we can even prompt the optimizer to choose a merge...
SELECT
x.EmployeeId,
y.theCount
FROM HumanResources.Employee x
JOIN
(
SELECT TOP (2147483647)
p.ManagerId,
COUNT(*)
FROM HumanResources.Employee p
GROUP BY p.ManagerId
ORDER BY p.ManagerId
) y (ManagerId, theCount) on y.ManagerId = x.ManagerId
And the result of all of this hard labor?
10 logical reads (1000% improvement over the next best method), a merge operation, and if I do say so myself, a very good topic for a blog post.
The usual caveats apply. Do not try this at home. Do not rely on this undocumented behavior. Do not pass Go. Do not fail to hire me to tune your databases if this trick doesn't fix all of your problems. And, lest I forget, do not waste time reading this blog when management needs that report
yesterday!
Anyway, until next time, enjoy!
Cross-posted from SQLBlog! -
http://www.sqlblog.com