Many presentations at events such as VLDB, SIGMOD or XSym present interesting and novel ideas. As is commonly required from academic papers, the presentation of these ideas needs to include a comparison with related work. This has several benefits:
- It makes sure that the presenters check that their idea is indeed novel
- It gives readers a way to quickly understand the new idea based on the differences to old ideas
- It gives readers the ability to learn about other work in the same area
- It provides the researchers baselines to judge their own work against
Normally, papers that get published have a good amount of comparison with related academic work. However, in some areas, one needs to look beyond the related academic work and look at what has been done by the industry in released products.
A case in point is a presentation on querying XML data using SQL. The presenter talked about mapping SQL into XQuery and finally presented a prototype that extended a competitors component to map XML into tables that is very close to our OpenXML mechanism. E.g., his paper can be described in the following way in SQLServer 2000/2005 (similar functionality will be/is available in DB2 and Oracle):
select POId, PODate, …
from OpenXML(@XMLdoc, '//PurchaseOrder')
with (POId nvarchar(10) '@id', PODate datetime 'orderdate', ...)
Now, I am quite happy to see somebody else provide a similar approach to query XML in a relational context using SQL as we provided already in SQL Server 2000 (it validates our approach), but if the authors and/or the reviewers would have looked at the functionality provided by us and our competitors, the available related work would have shown that the idea of generating tables out of XML and then using SQL to query them was not really novel. And maybe the current implementations would have given the authors ideas about improving what is available: e.g., how to push SQL predicates into the XQuery/XPath predicates for better performance…
Often researchers complain about the lack of availability of information about what products really do (the marketing material does not cut it). This used to be a valid criticism when you needed to buy expensive manual libraries. But in today's world, where technical websites provide a vast array of technical resources about features and people from the product teams are publishing at conferences (for example, Shankar presented our indexing mechanism on the XML datatype at VLDB) and are otherwise accessible, I don't buy that argument. Especially not in cases such as the one above. There may be some internal details that companies will not disclose, but a general technical description of publicly accessible features is always available (and sometimes even more).
If you are a researcher and want to know more about the XML functionality that we provide, feel free to send me a note or check out the MSDN resources.
Many researchers provide some form of performance evaluation of their algorithms in comparison to other approaches. This is very laudable, but unfortunately, many researchers do not provide the necessary statistical background and real-world information to allow the reader to judge the relevance of the results.
As a case in point (and take this only as one representative examples of many such cases that I have seen while reviewing papers submitted for publication), one researcher (name withheld to protect the guilty) presented two ways to optimize SQL statements that are generated by mapping XQuery or XPath expressions. Many of such SQL statements have common subexpressions, left outer joins and outer unions (think FOR XML EXPLICIT mode queries). The optimizations assume that they need to rewrite the queries on the midtier when generating the SQL statements (and not let the query optimizer do it).
This is interesting research, but I don't want to bore you with the detail of the approach. What I want to focus on is the way the performance evaluation was done. The three approaches (the "naïve" query and the two optimized queries) were compared by running different types of queries against a TPC-H based data set which two data set sizes. So far so good. However, the resulting performance figures show the problem of presenting relevant results.
The performance figures on the smaller data set showed that the "naïve" queries in all but one case were outperforming the "optimizations". The one case where an optimization was more efficient, the performance improvement was about 60%. Note: Knowing that the optimizations do not work in all/most cases is a valuable result as well.
The performance figures on the larger data set showed that the "naïve" queries was always worse than one of the optimizations. That sounds great, doesn't it? Well, if you looked closely, all but the same query in the small case, were less than 5% more efficient.
The importance of statistical significance and confidence intervals
So what is the problem with that? First, the evaluation does not provide statistical significance information (the so-called confidence interval).
Why is knowing the confidence interval important? In cases, where you provide not-100% repeatable measurements such as elapsed time or throughput, you will get measurement results that are spread over a range of values. A performance evaluation often reports only the average value of all the measured values (or a subset). A confidence interval is a statistical determination which says that the actual value is with a very high probability within a certain percentage of the reported average value. For example a reported value of 100 with a +-5% confidence interval says that any value smaller than 95 or greater than 105 are considered to be different in a statistically significant way, but that value between 95 and 105 have to be considered to be non-different since the actual value can be anywhere within that range.
So how do I get to determine the confidence interval? Basically, you need to run your experiments at least as many times as necessary until the statistical formulas that calculate the significance of a set of measurements reach the confidence interval that you want in your experiment.
In the example above, the measurements were run exactly once for each measurement. The formulas would result in a confidence interval of way more than 20%. Thus, any results that have a difference less than the confidence interval have no importance. In order to make a 4% difference relevant, the confidence interval would have to be at least +-3%.
So by not providing a confidence interval, the result was that potentially not even the single obviously better performing query can be considered statistically significantly different. And since no confidence interval was provided, we can also not conclude that there is no difference (which is also a useful result, since it would say that one does not need to perform the cost of applying these optimizations). The evaluation results become basically useless…
Other performance relevance issues
So let's assume for a moment that the measurements where statistically significant. Then there are three other aspects to be taken into account. The first one is scalability, the second one is a benefit/cost analysis, and the third one is Moore's law.
Scalability is important in that the solution should scale both in number of requests, data size, complexity of expression etc. For example, knowing that an approach that is significantly faster than another on a small set of data, but does not scale to a larger set of data is important for judging the quality of the work and to understand when it makes sense to apply the approach and when not.
Similarly, just knowing how much better an approach is without understanding the additional cost of providing the better approach is not useful. Having a clear understanding of the cost in comparison to the benefit allows one to understand the applicability of the approach and when for example a diminishing return on the investment may make an approach less usable. For example, if an optimization takes 4 units and scales logarithmically in cost to the size of the problem and provides a performance benefit of 1 unit but scales linearly, we can determine at what size of the problem it becomes valuable to employ the optimization. OTOH, we may also learn that for example adding more resources (disk, processors) may not be worth the cost since the approaches return on resource investment does not outweigh the benefits (e.g., the solution does not scale well enough).
Finally, even if we get a significant improvement in performance, that scales and has an acceptable benefit/cost ratio given the available resources, one has to consider Moore's law (and related "laws"). For example, if I can show that today the investigated approach makes some small part of a database query 20% faster (with a confidence of +-5%) and scales linearly to the size of the problem, I can thus execute this part in 48s instead of a minute. However, assuming that in 2 or 3 years, the naïve solution will take only 10s instead of a minute and only 1s in another 2 to 3 years, the benefit/cost ratio may become less favorable for the approach since the absolute difference may become so small (2s and then 0.2s) that it will disappear in the general execution environment. So the benefit/cost model and performance discussions should also take the over the longterm potentially diminishing benefits and the benefit and costs with respect to the overall environment into account.
To summarize:
- Performance evaluations should always provide a confidence interval for the measurements to help determine the relevance of the results. Exceptions are cases, where the measurements are repeatable (number of logical disk accesses) or the figures are so far apart that it is obvious (orders of magnitudes)
- Provide information about the scalability of the approach among the relevant dimensions
- Provide a benefit/cost model
- Understand the longterm value of the approach
- Understand the contribution in the larger context
Currently I am attending VLDB 2004 (one of the premier academic database conferences) in Toronto and just co-chaired the XML Symposium Workshop that was held in conjunction with VLDB (the proceedings will soon be available from Springer). There are lots of very interesting XML focused talks including topics on concurrency control for XML data manipulation, XQuery view management, implementation and indexing strategies for XML Query languages etc..
Most presentations have improved considerably in their real-world applicability compared to presentations a couple of years ago that tended to abstract the problems (such as the XML data model or the XPath semantics) to a level where they were not relevant to the real-life situation anymore.
Unfortunately a small amount of presentations show a lack of knowledge of what the commercial systems actually are already capable of doing and some presentations show the importance of a rigorous presentation of performance evaluation results. I will discuss two cases as examples (follow the links for more details). Note that while these are indeed critiques of the presentations, I want use these critiques to make more general points in the hope that future research work will be taking them into account.
In the near future, you will have a chance to meet me/hear me at SQL PASS 2004 in Orlando and at the XML 2004 conference. I still have to do the presentations, so I will be busy :-).