Yesterday I was relaxing a bit at the Red Carpet Club (D7 gate) as the Dulles Airport when an Instant Message popped up. It was from one of my fellow DevelopMentor instructors. He was wondering why this query was complaining that RN is an invalid column:
SELECT ROW_NUMBER() OVER (ORDER BY IMAGEID) AS RN,IMAGEID,IMG FROM DBO.IMAGES WHERE RN BETWEEN 10 AND 19
Unless you know how these ranking functions like ROW_NUMBER work, you might find yourself asking the same question. The trick here is that the value of the any of the ranking functions is computed after the underlying query against the table is completed. Therefore, the column RN isn't available in the WHERE predicate. So the next question is how do we work around this for making pages of data? That, in turn produced what seemed like a good from my cohort. Here's two variations on a solution:
SELECT RN,IMAGEID,IMG FROM (SELECT ROW_NUMBER() OVER(ORDER BY IMAGEID) AS RN,* FROM DBO.IMAGES) AS T WHERE RN BETWEEN 10 AND 19
This solution uses a sub-query to feed the computed result -- the query within the parens -- up to another query -- where the computed value for the row number is available. Another solution for the problem is:
WITH C(RN,IMAGEID,IMG) AS (SELECT ROW_NUMBER() OVER (ORDER BY IMAGEID),IMAGEID,IMG FROM DBO.IMAGES) SELECT IMAGEID,IMG FROM C WHERE RN BETWEEN 10 AND 19
This uses a Common Table Experssion (the with... as() part) to do the same thing as the previous query. So the question becomes, is this a more efficient query than doing something like this:
SELECT TOP(10) IMAGEID,IMG FROM DBO.IMAGES WHERE IMAGEID >= 10 ORDER BY IMAGEID
The reason for the question was that the first two queries have two select within them, and that might lead one to believe that two queries over the data are required to produce the desired result. However this is not the case. Both the sub-query and the CTE generate operationally identical query plans, namely:
select <- filter <- top <- sequence project (compute scalar) <- compute scalar <- segment <- clustered index scan
meaning that first SQL Server finds a batch of row values from the primary key index on the table, reduces that to a set rows and then select that out to the ten records of interest. So there's while there's no second recordset created, there is an intermediate working set created. The last query, the SELECT TOP(10)..., avoid this the intermediate result set. This generates a more efficient query plan (8.5% lower cost) of:
select <- top <- clustered index scan
For a query over a four-thousand row example table.
The most important thing to note about the query efficiency here that unless these queries are definetly benefiting from having a cluster primary index key on the table. Repeating the cost analysis without these indexes. In the CTE example, the estimated query cost is roughly 25-times more expensive than the query over the indexed table. The once cheaper SELECT TOP() query becomes just as expense without the index.