A frequent question on one the newsgroups I frequent is "why, when I store and XML document in a column doesn't the amount of space used exactly match the size of my document?" Usually, there's some thought that because less space is consumed than expected that part of the XML has been lost in flight to the database. Rest assured that's usually not strong. What you are actually seeing in these cases is a compression of the data during storage.
But that's not "compression" in the sense that we might normally think of it: Like RLE or LZW. Rather, when XML data is stored inside of SQL Server 2005, the internal representation is not an exact image of the source XML. In fact, the XML is "shredded" into a other form (what that is isn't relevant to this discussion). As part of that process, XML QNames are tokenized. The more repeating QNames you have, the greater this "compression effect" is.
Now, there are times where the "compression effect" is just the opposite. Suppose your XML text node data was composed mostly of 1-byte character strings. When that goes into storage, it takes up two bytes: Remember -- all XML string data is stored in UTF-16 encoding today. Another common case is when you have typed XML. Suppose you have a node typed as xs:double, but it contains the string value "1." Well, in an typed scenario, the amount of space taken up is the size of the type class: 8-bytes in this case, rather than the 2-bytes it would take to store "1."
Mad props to Eugene Kogan from the SQL Server Team for enlightening me about this.
I just want to make five things really clear about this whole issue:
- The other Express Products are free to and usually do distribute a different version of SQL Server 2005 Express Edition than the "straight" download version of SQL Server 2005 Express. This means they can -- and usually do -- run on diffent runtimes. While all of these products go through the CTP and Beta cycle, you should NOT expect them to interoperate on anything other than the least common level. For example, you shouldn't have the expectation that you can install Visual C# Express from April, October's SQL Express and December's XM all on the same machine. They require different runtimes. Along the same lines, don't expect that the December CTP for SQL Server 2005 Developer Edition will be compatible with any Visual Studio 2005 CTP for December UNLESS MICROSOFT SPECIFICALLY SAYS OTHERWISE.
- XM works best with the December 2004 of SQL Server 2005 Express Edition and its associated CLR.
- I'm providing a list of what versions *should* work with what at: http://sqljunkies.com/WebLog/ktegels/articles/4895.aspx
- No, I don't expect this behavior is going to change. It pre-Beta and Beta software folks. While I feel your pain and agree, yes, it would be nice to have everything in sync, the cost of doing that is prohibitive for MS. The choice for MS is simple: Let the teams distribute what they want and get bits in our hands more frequently and sooner or have a coordinated process that means much less frequent releases to us. They decided to opt for the first which I think gives us a better picture of the products and more opportunity for feedback.
- Please don't complain about the process, understand it. Microsoft took a risk putting these bits out early. If I was them and I was reading all the push-back we're seeing in the NGs, I'd really have to question if it was worth it. It has to be frustrating for them when we don't RTFM and then complain about the products when they've been very upfront about the process and the need to be more careful with these bits than would be with RTMs. I'd have to think that the next time the teams come back to working on VS.next and SQL.next, they'll all still have scars from this and we'll go back to being more shut out of the process. Let's give them a reason to decide to stick with us and the "early and often model" by providing useful feedback about what we like and what we don't, why we do or don't like and sharing ideas for making it better, PLEASE!