Michael Rys

Musings on XML, XQuery and more...

<August 2008>
SuMoTuWeThFrSa
272829303112
3456789
10111213141516
17181920212223
24252627282930
31123456


Navigation

Papers

SQL Server XML Whitepapers

Weblogging Links

MS Bloggers

Recommended Books

Other Blogs

Recommended Links

Presentations (Upcoming)

Presentations (Recent)

Subscriptions

News


Upcoming Presentations


TechEd 2007, Orlando, June 4 to June 8, 2007


Books I co-authored



www.flickr.com
This is a Flickr badge showing public photos from Michael Rys. Make your own badge here.
eXTReMe Tracker

Post Categories

Article Categories



Thursday, June 10, 2004 - Posts

"What's new for FOR XML in SQL Server 2005" is now live on the MSDN XML DevCenter.
My whitepaper on the new FOR XML functionality is now available on the MSDN XML DevCenter. Thanks for all the questions and comments on my earlier posts on this topic which served as input to this paper. Community in action! Pat yourselves on your back and go check it out! :-)

posted Thursday, June 10, 2004 4:25 PM by mrys with 4 Comments

My comments on the Infoworld article "Databases flex their XML"

Sean McCown wrote this analysis (PDF version) in Apr 2004. In the article, he compares the XML capabilities of the 4 major relational database systems (comparing publicly available versions) both in terms of functionality, ease, flexibility and speed, and adds a sidebar on Yukon. Before I start giving my comments on the article, let me disclose that I talked to Sean during his research for the article and answered his questions on SQL Server 2000 and Yukon. Thus, some of the comments below are just my attempts to make Sean's translation of my answers clearer, because I was not answering his questions clear enough :-).

First, let me point out that this is one of the best comparative articles that I have read on this topic in the press. I really liked the following statement:

“I’ve never seen an XML document even half that size [Ed.: 2GB]. XML is about finding records in a file, not dumping an entire database to a file and trying to run an XQuery search against it.“

I am not going to critique his evaluation of the competing products (although I have my reservations about some of his ratings based on my own experience and customer feedback, and don't find the sentence “just create the XSLT“ for doing schema evolution very reassuring). I would like to point out however, that comparing SQL Server 2000 with Oracle 10g is like comparing a BMW build in 2000 with a Mercedes build in 2004: SQL Server 2000's main focus (like its contemporary Oracle 8i) was to help move relational data in XML format and not provide a native XML storage solution (see my earlier comment on this issue). And I also would like to point out that WebDAV support is not really an XML functionality (and has its own protocol-level issues).

I will organize my comments into comments on methodology and terminology and comments that are SQL Server specific.

Comments on Methodology/Terminology

Storing XML

Sean writes: “There are three methods for physically storing XML data in a relational database: shredded, unstructured, and structured. Shredded and unstructured are useful methods but limited. The structured method allows you to leverage the power of both relational data and XML hierarchies.“

At this stage it is hard (for me) to understand what he means with unstructured and structured. Let me first define my own, more detailed terminology and then compare his definitions with them.

XML structures:

“highly-structured“: data is highly regularly structured. It may be hierarchical, but easily represented using a relational structure or an XML schema that has very little variety. The XML markup itself is important for conveying the semantics of the data. An example: Relational data encoded in XML (customer elements containing order elements).

“semi-structured“: The XML markup is still important for conveying the semantics of the data. But the structure is looser, may change from instance to instance. The schema is heterogeneous and may not be easily represented using a relational structure. Examples: Integrated data from several data sources, data with fast evolving schema.

“markup-structured“: The XML is representing documents where the primary semantic meaning existed before the markup. This means that most of the understandability of the data still exists even if the markup is removed. The data is highly order-dependent and has lots of mixed markup. Example: Business documents, Hamlet.

Note that XML can mix these different structures into one document!

XML Storage fidelity on the logical level:

preserving relational fidelity“: Preserve the mappability of relational data encoded in XML. This means, that values are being preserved but no order of elements, cannot deal with mixed-content, comments etc.

preserving InfoSet fidelity“: Preserves the Information Set information of an XML document. This will preserve element order, comments, etc., but does not care about attribute order, or the quotes used for attribute values or entity preservation. Most XML databases provide this.

preserving textual fidelity“: Preserves the XML document on a code-point level. You get exactly what you put in.

Preserving textual fidelity preserves InfoSet information which in turn preserves relational fidelity, but not the other way around.

Physical XML storage (there are more, but the two are enough for this discussion):

relational storage”: XML data is stored decomposed in some relational form. This could be in a variety of ways ranging from fully decomposed models over node-table formats to universal table formats.

blob storage”: XML data is stored in a non-relational byte stream format.This can range from a Unicode byte stream to some internal binary format representing the XML.

It should be clear, that by making this distinction, the terms “shredding“, “unstructured“ and “structured“ are confusing. XML's structure can be highly structured, semi-structured or markup-structured, but it is always structured. And either of these formats can be stored in a way to provide relational, InfoSet or textual fidelity using either relational or blob storage.

So let's see whether we can find out which of the 3x3x2 cells Sean is actually looking at.

He continues with “Shredding puts XML data into relational columns but strips it of its XMLness, meaning the hierarchical relationships among the data in the original XML document are lost. Shredding is useful when you're not concerned about keeping the data in XML format.“

This seems to refer to a mapping process that uses relational storage to achieve some level of relational fidelity (although often such shreddings allow to preserve the relationships expressed via the hierarchy). It still can be applied to any of the three XML structures, although you will loose information in the markup case and may not be doable in the semi-structured case (depending on the shredding).

Sean defines “unstructured storage“ as: “The unstructured method uses a data type called a CLOB (Character Large Object) to store an entire XML document as a single unit. ... The unstructured method provides limited search capabilities, but it is still quite useful. You can't base queries on it, but the structure of the original data is preserved.“

This clearly maps to a blob storage using a textual stream that can preserve the textual fidelity. Note however, that not all blob storage needs to be limited in its queryability.

Finally, he defines “structured storage“: “The structured method allows you to store XML data inside the database and preserve the hierarchy of the data. Structured storage, also known as "native XML" storage, is what every vendor is trying to achieve. The most obvious benefit of preserving the hierarchical relationships of XML data is being able to receive an XML document, combine it or manipulate it with relational data, and produce XML as a result. “

This definition is covers a different dimension than the previous two terms. What he is describing here can be achieved by either a blob or relational storage. The main parts here are that it should provide at least InfoSet fidelity (based on the notion “native XML” storage), and support all three of XML structured documents.

So the main distinction between “unstructured” and “structured” store is not really the storage, but the ability to query the data. And if we take his later comments into account, it is not even the queryability but the ease of queryability that he is concerned about.

Let's keep this in mind for the SQL Server specific comments below.

XQuery

The next comment I would like to make is about including XQuery into the comparison. XQuery does not exist yet in its final version. Even tough SQL Server 2005 provides a subset of XQuery, I would not blame any implementation to not yet support it. If the first release of a database still does not support it after XQuery becomes a final recommendation, then I would give it some weight. Maybe he really meant XML Query capabilities which can include XPath, XSLT or XQuery?

There are also some minor inaccurate and comment-worthy tidbits about XQuery in the article:

  • “XQuery, which is a language used for crawling through structured data paths”: XQuery is a declarative language that among other things provide the capabilities to use path expressions on the markup structure to identify parts of an XML document. It is not the only thing it can do (it can for example construct and re-order). And it certainly does not crawl :-).
  • “offers significant query functionality beyond what SQL can do“: Well SQL operates on the relational data model, XQuery on the XML data model. XQuery actually offers less functionality (no grouping). But they are not really comparable like this. I would rather say that they complement each other.
  • “Because XQuery is an identity-based query language, whereas SQL is value-based, you can use XQuery to find out whether a certain element exists; SQL can tell you whether an element that's known to exist holds a certain value.”: It is not really the language that is identity or value based. It is the data model. And the XML data model for XQuery provides both values and nodes with identities. Since the relational model does not have elements but rows and columns, SQL obviously cannot tell anything about elements :-). So again, this sentence mixes aspects of the data models, and the language capabilities. SQL can use existential checks to see whether a row exists and can check for NULL values (which are often used to say a certain column does not apply for the given row). However more important is the distinction on the data model level (where the relational database model is much less flexible to leave properties away).
  • “Unlike SQL, XQuery also has knowledge of attributes inside element tags,”: Again, this is somewhat non-sensical, given that the relational model does not need to make such a distinction and thus the language does not need to know about. This is like saying that SQL and XQuery have no idea about how to make a good cheese fondue :-).
  • “it has stronger data types than SQL, providing you greater control over the type of data being written or queried.”: It certainly has different types, some of which are more strongly typed that their cousins in the SQL type system. 

SQL Server specific comments

Sean writes: “Only Microsoft SQL Server 2000, among the four databases tested, does not support structured XML storage.“

This comment is somewhat incorrect. SQL Server 2000 provides relational fidelity with its mapping mechanisms into a relational storage format, textual fidelity (and thus InfoSet fidelity) with storing it in a large string field (basically using a blob storage format). And you can take such a string, pass it to sp_xml_preparedocument (thus generating an internal format that preserves InfoSet fidelity) and using OpenXML to query the data using XPath 1.0 and SQL. And to Sean's credit, he actually points this out toward the end of the article where he presents OpenXML.

However, it is harder to do this than for example applying an XQuery or XPath expression directly on the native XML data type. Thus, I understand his perception, but his scorecard wording is too absolute and untrue.

Sean also writes: “You can also use SQL Server as a sort of bidirectional XML repository by mapping XML documents directly to relational tables through the use of Updategrams, “

The mapping is actually provided by our mapping schema technology: annotated XML schema documents. Updategrams is just one of the functions can can perform their effect on the underlying relational data using the mapping schemas. The other functions are XPath queries to get data out in XML format and the XML Bulkload object.

Regarding performance, he writes: “I found that SQL Server speedily imported XML files, but it was the slowest of the group at creating XML files. It was also slowest at reading large XML files but not by a significant margin. “

I find this interesting and would like to know how he created the files. My own investigations show quite the opposite when comparing FOR XML statements against for example the XMLELEMENT constructor syntax. With respect to the reading performance, SQL Server 2000's sp_xml_preparedocument (or better MSXML 2.6) could indeed perform better and it already does so in the 64-bit version of SQL Server 2000 and will in SQL Server 2005/Yukon (both are using the MSXML 3.0 parser for sp_xml_preparedocument, and the parser for the XML data type should be even faster).

Now, let's go on to the Yukon sidebar.

Sean writes: “Yukon’s approach to preserving the hierarchical data relationships in XML documents differs from that of Oracle, IBM, and Sybase. Instead of using nested tables for storing structured XML, Yukon features what I refer to as a managed BLOB (Binary Large Object) data type.”

While this is indeed a difference on the physical level, it is interesting to note that this approach (blob storage that provides InfoSet fidelity with queryability, and the ability to define indices on the XML that provide materialized “relational storage” views that the query processor can take advantage of) is hiding the complexity of the physical design and the dependency management of approaches where your data has to be stored one way if a schema of a certain form is present and another way if your data does not. Especially if such approaches also allow access to the physical level using SQL statements that may have to be changed when the XML structure changes.

Furthermore, he writes: “Yukon also offers a simple version of schema evolution. ... it will allow you to update existing documents by changing the namespace instead of attaching the new schema and scrubbing the existing data through XSLT (XSL Transformation).“

Well, users certainly can write their own scrubbing using either an XSLT stylesheet in a CLR UDF or an XQuery transform. For more information about the schema evolution capabilities, I would like to point to my earlier blog post and my TechEd presentation.

Sean continues: “Microsoft’s method ignores validation for existing data, while Oracle’s method allows you to fill in missing data based on the new schema. In that respect, at least, Oracle’s implementation is more flexible. “

I do not understand this sentence. If I add a new schema to a schema collection, the existing data cannot be affected (otherwise the schema change would not be allowed). Often schema changes are more complex than just adding a new sub element. Again, I would like to point to my discussion of schema evolution. Sean, if you are reading this, can you please forward me an example?

Next, he writes: “So far, from what I can tell using the Yukon beta, there’s no way to turn off schema validation once you’ve turned it on“. Actually, you can turn it off again, using ALTER TABLE ALTER COLUMN. But we are still missing the direct ISVALID capability. You would have to use the new try/catch mechanism to catch the validation error that occurs during a dynamic CAST to a schema constrained XML. If you find this an important feature, please let me know. I will file the DCR (although as always, no promises can be made for this release etc. pp.).

To conclude, I urge you to tell us whether we are delivering on the hope that “there’s gold in those hills.“

 

posted Thursday, June 10, 2004 4:09 PM by mrys with 10 Comments




Powered by Dot Net Junkies, by Telligent Systems