On "native" XML support in databases
A couple of months ago, I provided feedback on an article on XML support in DB2 written by Philip Howard of Bloor Research. I invited him to comment on my analysis, so I am assuming his recent article can be taken as his response. In the new article, he claims that DB2 is the only relational database product to provide native XML support. He also states that SQL Server 2005 does not offer this support and alleges that Microsoft is abusing the term. Let me cite [reprinted claiming fair use]:
Microsoft, in its presentations about SQL Server 2005, claims that the database will have a "native XML store". It justifies this by stating that it will have a "native XML datatype" and support for XML schemas, XML indexes, XQuery, SQLXML and so on – all very well and good except: what is a "native datatype" and what does the company mean by a "native XML store"? Actually, what it means by the latter is that the database has an XML datatype. [...], but this is a (deliberate?) misunderstanding of the terminology.
It would have been helpful if Mr. Howard would have given us his definition of a "native XML store." For a definition, let me quote from the introduction of the upcoming ACM SIGMOD 2005 tutorial "XML and Relational Database Management Systems: the Inside Story" that I co-authored with Don Chamberlin (IBM) and Dana Florescu (Oracle) [reprinted with acknowledging the ACM copyright]:
[...] a more general storage technique is needed to take advantage of the structural information in unconstrained XML documents. For this purpose, the SQL-2003 standard provides a new datatype called XML for storing well-formed XML documents and fragments, based on the XML Information Set and–in an upcoming revision of the standard–on the XQuery Data Model. Since this data model can represent all aspects of a well-formed XML document, systems based on the XML datatype are said to provide XML fidelity. These systems are able to preserve XML-centric information such as document order and namespace bindings, and to exploit this information using an XML-based query language such as XQuery. For these reasons, this level of support is often referred to as native XML.
The above definition can be summarized as “native” XML support is:
- A way to store XML documents that preserves the Infoset or XQuery data model and thus providing XML fidelity
- A way to query (and update) all aspects of such stored XML
Based on this definition, I can assure you that SQL Server's claim of supporting XML natively is deliberate and accurate and not based on any misunderstanding of the technology.
The Bloor article continues:
to [sic!] understand this you need to appreciate that the way that a database physically stores data on disk is different from the way in which the data is logically presented. A datatype definition is, in essence, the definition of a mapping from the physical storage mechanisms used by the database to the logical representation of that data, so that it can be recognised and used automatically by standard database facilities such as the optimiser.
This is a clear and correct distinction between logical and physical model...
When you define a new datatype you only create a new mapping, you do not impact the underlying storage mechanisms used by the database. Thus the fact that you have an XML datatype does not mean that the data is stored in XML format – it is still stored using whatever physical mechanisms are employed by that database.
However here I have to disagree with the author. Defining a new data type at some level impacts how information is being stored by the underlying storage mechanism since it needs to preserve the correct semantics on the physical level. Equally important is that as long as the logical abstraction of the new datatype provides the logical semantics (in our case the XML fidelity), it will not matter what exactly the underlying physical storage model looks like. Let me cite from the tutorial again:
The logical data model on which the XML datatype is based does not specify any particular organization for physical storage. Many physical storage techniques are possible, providing various tradeoffs for time and space efficiency under query and update. In addition, access aids such as indexes may be created to improve query performance. Index creation and maintenance are more complex in a native XML system than in a pure relational system, since the XML data model is less constrained than the relational data model. XML indexes may support access to data at various levels of the element hierarchy, and the objects indexed may vary in cardinality and datatype. In general, native XML databases raise new challenges for all aspects of query optimization, including join planning, index selection, and cost estimation.
Mr. Howard concludes:
So, let's be clear about this: IBM has announced that the next version of DB2 will have native XML storage capabilities. IBM means what it says – it really is changing the way that it physically stores data on disk. Microsoft and others that are claiming "native" XML capabilities, based on support for XML datatypes, are abusing the language.
The above conclusion looks to me like a non-sequitur. Both DB2 and SQL Server (and others) expose or will expose at the logical level an XML datatype that provides XML fidelity plus query and update functionality. Thus all of them provide “native” XML capabilities (without abusing the language). What IBM's physical design is irrelevant. Whether you store it as a string, store it in some internal binary format making use of existing storage facilities provided by the relational database system or design a complete new storage engine does not matter.
By the way, if Mr. Howard (or any other reader for that matter) is interested in better understanding the details of how SQL Server 2005 is managing XML natively, please feel free to contact me. I would also highly recommend the above mentioned tutorial and my chapter on XQuery and relational databases in the “XQuery from the Expert“ book.