Michael Rys

Musings on XML, XQuery and more...

<October 2008>
SuMoTuWeThFrSa
2829301234
567891011
12131415161718
19202122232425
2627282930311
2345678


Navigation

Papers

SQL Server XML Whitepapers

Weblogging Links

MS Bloggers

Recommended Books

Other Blogs

Recommended Links

Presentations (Upcoming)

Presentations (Recent)

Subscriptions

News


Upcoming Presentations


TechEd 2007, Orlando, June 4 to June 8, 2007


Books I co-authored



www.flickr.com
This is a Flickr badge showing public photos from Michael Rys. Make your own badge here.
eXTReMe Tracker

Post Categories

Article Categories



On "native" XML support in databases

A couple of months ago, I provided feedback on an article on XML support in DB2 written by Philip Howard of Bloor Research. I invited him to comment on my analysis, so I am assuming his recent article can be taken as his response. In the new article, he claims that DB2 is the only relational database product to provide native XML support. He also states that SQL Server 2005 does not offer this support and alleges that Microsoft is abusing the term. Let me cite [reprinted claiming fair use]:

Microsoft, in its presentations about SQL Server 2005, claims that the database will have a "native XML store". It justifies this by stating that it will have a "native XML datatype" and support for XML schemas, XML indexes, XQuery, SQLXML and so on – all very well and good except: what is a "native datatype" and what does the company mean by a "native XML store"? Actually, what it means by the latter is that the database has an XML datatype. [...], but this is a (deliberate?) misunderstanding of the terminology.

It would have been helpful if Mr. Howard would have given us his definition of a "native XML store." For a definition, let me quote from the introduction of the upcoming ACM SIGMOD 2005 tutorial "XML and Relational Database Management Systems: the Inside Story" that I co-authored with Don Chamberlin (IBM) and Dana Florescu (Oracle) [reprinted with acknowledging the ACM copyright]:

[...] a more general storage technique is needed to take advantage of the structural information in unconstrained XML documents. For this purpose, the SQL-2003 standard provides a new datatype called XML for storing well-formed XML documents and fragments, based on the XML Information Set and–in an upcoming revision of the standard–on the XQuery Data Model. Since this data model can represent all aspects of a well-formed XML document, systems based on the XML datatype are said to provide XML fidelity. These systems are able to preserve XML-centric information such as document order and namespace bindings, and to exploit this information using an XML-based query language such as XQuery. For these reasons, this level of support is often referred to as native XML.

The above definition can be summarized as “native” XML support is:

  1. A way to store XML documents that preserves the Infoset or XQuery data model and thus providing XML fidelity
  2. A way to query (and update) all aspects of such stored XML

Based on this definition, I can assure you that SQL Server's claim of supporting XML natively is deliberate and accurate and not based on any misunderstanding of the technology.

The Bloor article continues:

to [sic!] understand this you need to appreciate that the way that a database physically stores data on disk is different from the way in which the data is logically presented. A datatype definition is, in essence, the definition of a mapping from the physical storage mechanisms used by the database to the logical representation of that data, so that it can be recognised and used automatically by standard database facilities such as the optimiser.

This is a clear and correct distinction between logical and physical model...

When you define a new datatype you only create a new mapping, you do not impact the underlying storage mechanisms used by the database. Thus the fact that you have an XML datatype does not mean that the data is stored in XML format – it is still stored using whatever physical mechanisms are employed by that database.

However here I have to disagree with the author. Defining a new data type at some level impacts how information is being stored by the underlying storage mechanism since it needs to preserve the correct semantics on the physical level. Equally important is that as long as the logical abstraction of the new datatype provides the logical semantics (in our case the XML fidelity), it will not matter what exactly the underlying physical storage model looks like. Let me cite from the tutorial again:

The logical data model on which the XML datatype is based does not specify any particular organization for physical storage. Many physical storage techniques are possible, providing various tradeoffs for time and space efficiency under query and update. In addition, access aids such as indexes may be created to improve query performance. Index creation and maintenance are more complex in a native XML system than in a pure relational system, since the XML data model is less constrained than the relational data model. XML indexes may support access to data at various levels of the element hierarchy, and the objects indexed may vary in cardinality and datatype. In general, native XML databases raise new challenges for all aspects of query optimization, including join planning, index selection, and cost estimation.

Mr. Howard concludes:

So, let's be clear about this: IBM has announced that the next version of DB2 will have native XML storage capabilities. IBM means what it says – it really is changing the way that it physically stores data on disk. Microsoft and others that are claiming "native" XML capabilities, based on support for XML datatypes, are abusing the language.

The above conclusion looks to me like a non-sequitur. Both DB2 and SQL Server (and others) expose or will expose at the logical level an XML datatype that provides XML fidelity plus query and update functionality. Thus all of them provide “native” XML capabilities (without abusing the language). What IBM's physical design is irrelevant. Whether you store it as a string, store it in some internal binary format making use of existing storage facilities provided by the relational database system or design a complete new storage engine does not matter.

By the way, if Mr. Howard (or any other reader for that matter) is interested in better understanding the details of how SQL Server 2005 is managing XML natively, please feel free to contact me. I would also highly recommend the above mentioned tutorial and my chapter on XQuery and relational databases in the “XQuery from the Expert“ book.

posted on Wednesday, April 20, 2005 7:25 PM by mrys


# On &amp;quot;native&amp;quot; XML support in databases @ Thursday, April 21, 2005 3:37 PM

A couple of months ago, I provided&nbsp;feedback on&nbsp;an article on XML support in DB2 written by...

mrys

# What is "native" XML support? @ Thursday, April 21, 2005 5:59 PM

Michael Rys responds to pro-IBM FUD on the meaning of "native" XML support in a database. By any reasonable definition, SQL Server 2005 provides "native" support for XML datatypes....

mrys

# Authentic Analysis and Argumentation? @ Friday, April 22, 2005 7:33 AM

I've been a bit out of the habit of writing here - for the last couple of months most of my free time...

mrys

# http://weblogs.goshaky.com/weblogs/page/lars/20050421 @ Friday, April 22, 2005 7:48 AM

http://weblogs.goshaky.com/weblogs/page/lars/20050421

mrys

# Native XML Databases, SQL Server and the future of databases @ Monday, May 02, 2005 7:14 PM

Native XML Databases, SQL Server and the future of databases

mrys

# http://sql.ru/subscribe/2005/250.shtml @ Thursday, May 05, 2005 10:10 AM

http://sql.ru/subscribe/2005/250.shtml

mrys

# re: On &amp;quot;native&amp;quot; XML support in databases @ Thursday, April 06, 2006 12:48 PM

But it is still true that MS SQL server 2005 stores XML documents as BLOB thus requiring parsing and reparsing each time you want to access data within that XML document, is it not?

"In SQL Server 2005, XML data is stored as binary large objects (BLOBs) in an internal representation"
http://www.microsoft.com/sql/prodinfo/overview/whats-new-in-sqlserver2005.mspx

Any one who cares about performance might want to consider that. how do you deal with big XML documents, and frequemt access?

amelia

# re: On &quot;native&quot; XML support in databases @ Monday, May 22, 2006 7:05 PM

Actually, the internal format in which the XML is being represented is much easier to traverse than the original XML. And its "parsing" therefore is much better performing and less problematic than the XML parsing.

And performance of frequent access and/or handling of large documents is very good if you do full document retrieval (much better than a secondary storage structure) while for node-level queries using XQuery, we have efficient mechanisms for dealing with queries over the blob both with and without the XML indexing framework.

mrys

# re: On &amp;quot;native&amp;quot; XML support in databases @ Tuesday, November 28, 2006 9:13 AM

I felt you challenge to the author of the other paper a little unfair. It's clear that in order to get any performance, SQL Server requires a primary index which is an index on all elements, nodes, attributes for on the XML document to facilitate sub-document retrieval. This is then stored in a relational table. If you want to index your XML data you require a primary key on that table to reunite the data. SS is really shredding data and storing it in a LOB. Then you have to create up to 3 other secondary indexes for path specific indexing. What's the logging of all this. The FLOWR expression of XQuery (which doesn't support the LET clause) MUST be wrapped in SQL. This isn't pure XML.

It matters what's under the covers. I don't usually comment like this, but the suggestion and challenge seemed obsurd. Coming from PASS2006 a lot of my collegues wondered about SS support for XML (albeit it is better than Oracle). If what's under the hood didn't matter, Gates wouldn't drive a Porshe

:)

I don't mean to sound like a jerk -- but let's be fair. SQL Server is forcing that XML into the relational model and that could have an effect on performance - no?

Paul

# re: On &amp;quot;native&amp;quot; XML support in databases @ Tuesday, January 09, 2007 11:29 PM

I am not sure that I would agree with exact wording such as "abusing the language" but I do believe that there is a very significant difference in the way DB2 9 and SQL Server 2005 implement XML data type. DB2 XML data type is stored as a pre-parsed tree while SQL Server stores XML as a LOB. We measured the difference during query and retrieval of pre-parsed XML vs. LOB to be as high as 44 times on some queries. 44 times faster queries are not just a matter of semantics.

Leon Katsnelson

# re: On &quot;native&quot; XML support in databases @ Tuesday, January 23, 2007 10:24 PM

Hi Paul and Leon.

Thanks for the comments.

Let me first address Paul's comments:

Again, I am pointing out that the purpose of this posting was to dispell some of the FUD that the Bloor Researcher was trying to spread about SQL Server 2005 not having native XML support. Given that three technology experts from the three major database vendors agree on the definitions that I quote from the SIGMOD article, I think it is fair to say that all have native support.

Now I agree with Paul that I would prefer SQL Server to also have what I called top-level XQuery in my chapter in the <a href="http://www.amazon.com/exec/obidos/ASIN/0321180607/musionxmlxque-20?creative=125581&camp=2321&link_code=as1">“XQuery from the Expert“ book</a>. But that is more of a convenience than a necessity and the feature got postponed.

SQL Server's primary XML index and the XQuery evaluation on top of both the primary LOB storage and the XML index is not completely based on relational processing in the classical sense and XML is not forced in a relational streight-jacket but attempts of using relational technology where we felt it would be beneficial for certain reasons. The internal binary XML format actually has quite a bit of internal structure and instead of doing region encoding, we use OrdPath for node positioning in the tree (see some of the VLDB and SIGMOD papers in my recommended Link section). So while we leverage the relational framework more than DB2 does, we are not getting necessarily bad or worse performance due to that approach. It can be that we perform worse for some queries and better for others, but given that there is no industry wide accepted XML benchmark, it is hard to make any clear, unbiased statement.

The same comment regarding missing benchmarks also applies to Leon's point. It is easy to find cases where one solution outperforms the other at this stage of the technology. And of course I take it as a challenge to get more performant over time (as I am sure does the DB2 team and Oracle team). Having said that, I would love to see the data set and queries that you ran where you had a 44 times perf difference.

There are differences between how the XML is stored and implemented between DB2 and SQL Server. Semantically they provide both the same functionality. For example, DB2's approach automatically generates some indices as part of their storage which gives better support for certain types of queries at the cost trade off of using more storage. We decided to give the user the choice to add indices as needed and may not provide the performance benefit out of the box. So comparing a solution with some tuning to one without is not necessarily a fair comparison.

Best regards
Michael

PS: "Pure XML" seems to be trademarked by IBM, so they can define that marketing term in whichever way they want. However "native XML" processing has a more widely understood definition.

mrys




Powered by Dot Net Junkies, by Telligent Systems