Michael Rys

Musings on XML, XQuery and more...

<December 2008>
SuMoTuWeThFrSa
30123456
78910111213
14151617181920
21222324252627
28293031123
45678910


Navigation

Papers

SQL Server XML Whitepapers

Weblogging Links

MS Bloggers

Recommended Books

Other Blogs

Recommended Links

Presentations (Upcoming)

Presentations (Recent)

Subscriptions

News


Upcoming Presentations


TechEd 2007, Orlando, June 4 to June 8, 2007


Books I co-authored



www.flickr.com
This is a Flickr badge showing public photos from Michael Rys. Make your own badge here.
eXTReMe Tracker

Post Categories

Article Categories



"An invalid character was found" errors when upgrading sp_xml_preparedocument from SQL Server 2000 SP3 to SP4 or SQL Server 2005

Recently, I received several customer reports, that sp_xml_preparedocument started rasing the following errors after upgrading their XML-based application from SQL Server 2000 SP3a (or earlier) to either SP4 or SQL Server 2005:

Msg 6602, Level 16, State 2, Procedure sp_xml_preparedocument, Line 1
The error description is 'An invalid character was found in text content.'.
Msg 6607, Level 16, State 3, Procedure sp_xml_removedocument, Line 1
sp_xml_removedocument: The value supplied for parameter number 1 is invalid.

This error is now raised because of a stricter error-discovery during parsing. When we moved from MSXML 2.6 to MSXMLSQL for SQL Server 2000 SP4 and SQL Server 2005, we fixed a couple of bugs in the parser that could lead to data corruption (invalid data being parsed). As a consequence, you are now receiving this error code instead of having invalid characters accepted.

The consequence is that one has to be more explicit with setting the encoding. This will also help mitigate against involuntary data corruption (see below for an example). There are two ways to fix an application (besides moving to an XML datatype and the nodes() method):

  1. Make sure that the XML document when passed in a TEXT or (VAR)CHAR argument is compatible with the default code page of the database (or the string type) by either
    1. setting the encoding property in the XML declaration (e.g., when the code page is ISO-Latin1), or
    2. by making sure that the code page of the database (or string type) can preserve all UTF-8 code points that will ever be passed in through the string.
  2. Change the type of the argument to NTEXT (or N(VAR)CHAR) and pass in the XML in UTF-16 encoding (an XML declaration is optional).

Here are the technical details:

sp_xml_preparedocument takes either a single-byte character string (TEXT, (VAR)CHAR) or a two-byte character string (NTEXT, N(VAR)CHAR). In the first case, the string is associated with a code page (normally the database’s default code page). In the second case the string is assumed to be either UCS-2 or UTF-16 encoded. Sp_xml_preparedocument (unlike the newer XML datatype parser) will only pick up the strings code page in the second case to detect the encoding of the XML document, but not in the first case. Instead it will look at the string and follow the XML 1.0 spec detection rules. That means that for a single-byte character string unless there is an XML declaration saying otherwise, the data will be parsed as UTF-8. This works fine as long as your instance documents happen to only use characters that share the same code points on the UTF-8 and the database code page (e.g., the ASCII range in ISO-Latin1 and UTF-8), but it will lead to problems, if you use characters that are mapped to different code pages.

For example, the character ® (Unicode U+00AE, represented as 0xAE00 in SQL Server) will be represented in an ISO-LATIN1 code page (SQL_Latin1_General_CP1_CS_AS) as 0xAE. However if you do not specify an encoding on the XML document that contains the character ® and pass it as a single-byte character string, the XML parser will interpret the code point 0xAE not as the character ® but as an invalid starting character of a multi-byte UTF-8 encoding: UTF-8 characters bit sequences start either with 0 (1-byte encoded characters), 110 (two byte encoded characters), 1110 (three byte encoded characters), or 11110 (4-byte encoded surrogate pairs), while 0xAE is 10101110.

In MSXML 2.6 (and thus SQL Server 2000 SP3 and earlier), the XML parser would preserve such invalid characters and thus corrupt the data, while in MSXML 3.0/MSXMLSQL (and thus SQL Server 2000 SP4 and SQL Server 2005), the parser will reject them.

Note that you will still have to watch out for data corruption if the ISO-Latin characters make up a valid UTF-8 encoded character (solutions are as outlined above). For example,

declare @h int
exec sp_xml_preparedocument @h output, ''
select * from openxml(@h, '/root') with (attr nvarchar(200) '@attr')
exec sp_xml_removedocument @h

will parse in SQL Server 2005 but return the character ב (U+05D1) since the two input characters ב (0xD791 in an ISO-Latin1 encoding) form the bit stream: 11010111 10010001 which is translated into 00000101 11010001 according to the UTF-8 encoding rules which is a valid UTF-8 2-byte encoded character representing U+05D1.

I hope you agree, that detecting such issues is better, even if we break "backwards-bug-compatibility". If this is a problem for you, please contact me.

posted on Tuesday, September 06, 2005 7:28 PM by mrys


# &amp;amp;quot;An invalid character was found&amp;amp;quot; errors when upgrading sp_xml_preparedocument from SQL Server 2000 SP3 to SP4 or SQL Server 2005 @ Tuesday, September 06, 2005 7:57 PM

Recently, I received several customer reports, that sp_xml_preparedocument started rasing the following...

Anonymous




Powered by Dot Net Junkies, by Telligent Systems