Hexadecimal value 0x is an invalid character
Posted May 7, 2009 by Chris | Filed under .NET, XML
Ever get a "Hexadecimal value 0x[whatever] is an invalid character" error when trying to load an XML document using one of the .NET XML API objects? These are the characters that trigger it:
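| 0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 |
| 0x0B 0x0C 0x0E 0x0F |
| 0x10 0x11 0x12 0x13 0x14 0x15 0x16 0x17 0x18 0x19 |
| 0x1A 0x1B 0x1C 0x1D 0x1E 0x1F |
| 0x7F |
Note that 0x7F (delete) falls inside the 0x20 to 0xD7FF range that XML 1.0 permits, so the IsLegalXmlChar method shown later does not strip it.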
The problem that causes these "invalid character" errors is that the XML data contains illegal characters.
Most of these illegal characters are in the ASCII control character range (think wacky characters like null, bell, and backspace). These characters have no business being in XML data. They usually find their way in through file format conversions, such as when someone creates an XML file from Excel data, or exports data to XML from a format that may be stored as binary, like PDF. In fact, if XML data contains the character '\a' (bell), your motherboard will actually make the bell sound before the exception is thrown.
Although most ASCII control characters are disallowed, the formatting characters '\n', '\r', and '\t' are legal in XML (1.0 and 1.1), and therefore do not cause this error.
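Here is a minimal sketch of how the error surfaces, assuming LINQ to XML (XDocument) is the API in use. The class and method names are illustrative and not from the original post. A control character such as 0x01 is rejected at parse time, while tab, line feed, and carriage return are accepted.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class InvalidCharDemo
{
    // Returns true if the parser accepts the text,
    // false if it throws an XmlException.
    public static bool Parses(string xml)
    {
        try
        {
            XDocument.Parse(xml);
            return true;
        }
        catch (XmlException ex)
        {
            // The message names the offending hexadecimal value.
            Console.WriteLine(ex.Message);
            return false;
        }
    }
}
```

Calling InvalidCharDemo.Parses("<a>\u0001</a>") should report the invalid character, while InvalidCharDemo.Parses("<a>\t\r\n</a>") should succeed.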
Sanitizing Strings
If you're encountering XML data that is causing an invalid character error, you can strip the illegal characters from the string before parsing it:
    using System.Text;

    /// <summary>
    /// Remove illegal XML characters from a string.
    /// </summary>
    public string SanitizeXmlString(string xml)
    {
        if (string.IsNullOrEmpty(xml))
        {
            return xml;
        }

        var buffer = new StringBuilder(xml.Length);

        // Note: each char is one UTF-16 code unit, so surrogate halves
        // (0xD800-0xDFFF) fail the check below and characters above
        // U+FFFF are removed along with the control characters.
        foreach (char c in xml)
        {
            if (IsLegalXmlChar(c))
            {
                buffer.Append(c);
            }
        }

        return buffer.ToString();
    }

    /// <summary>
    /// Whether a given character is allowed by XML 1.0.
    /// </summary>
    public bool IsLegalXmlChar(int character)
    {
        return
        (
            character == 0x9 /* == '\t' == 9 */ ||
            character == 0xA /* == '\n' == 10 */ ||
            character == 0xD /* == '\r' == 13 */ ||
            (character >= 0x20 && character <= 0xD7FF) ||
            (character >= 0xE000 && character <= 0xFFFD) ||
            (character >= 0x10000 && character <= 0x10FFFF)
        );
    }
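One caveat with a per-char loop: a C# char is a UTF-16 code unit, so characters above U+FFFF arrive as two surrogate halves (0xD800 to 0xDFFF), and each half fails the legality check on its own. The following is a sketch, not the original author's code, of a variant that keeps valid surrogate pairs. The class name SurrogateAwareSanitizer and its method names are hypothetical.

```csharp
using System.Text;

public static class SurrogateAwareSanitizer
{
    // Same XML 1.0 ranges as the post's check, applied to a code point.
    public static bool IsLegalXmlCodePoint(int cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static string Sanitize(string xml)
    {
        if (string.IsNullOrEmpty(xml))
        {
            return xml;
        }

        var buffer = new StringBuilder(xml.Length);

        for (int i = 0; i < xml.Length; i++)
        {
            if (char.IsSurrogatePair(xml, i))
            {
                // A valid high/low pair always decodes to a code point
                // in U+10000..U+10FFFF, which XML 1.0 allows.
                buffer.Append(xml[i]).Append(xml[i + 1]);
                i++; // skip the low surrogate just copied
            }
            else if (IsLegalXmlCodePoint(xml[i]))
            {
                buffer.Append(xml[i]);
            }
            // Anything else (control character or unpaired
            // surrogate) is dropped.
        }

        return buffer.ToString();
    }
}
```

With this variant, SurrogateAwareSanitizer.Sanitize("a\u0001b") yields "ab", and a string containing an emoji encoded as a surrogate pair passes through unchanged.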
Useful as these methods are, don't go pasting them into your code just anywhere. Create a class instead. Say you use the routine to sanitize a string in one section of code, and another section of code then uses that same string. How does the other section know for certain that the string no longer contains illegal characters, without checking again? It doesn't.
Who knows where that string has been, or whether it has been sanitized, before it reaches a different routine further down the processing pipeline. Program defensively and agnostically. If the sanitized string isn't a plain string but a distinct type that represents sanitized strings, the type itself guarantees that the value doesn't contain illegal characters. Use something like this instead:
    using System.Text;

    public class XmlSanitizedString
    {
        private readonly string value;

        public XmlSanitizedString(string s)
        {
            this.value = XmlSanitizedString.SanitizeXmlString(s);
        }

        /// <summary>
        /// Get the XML-sanitized string.
        /// </summary>
        public override string ToString()
        {
            return this.value;
        }

        /// <summary>
        /// Remove illegal XML characters from a string.
        /// </summary>
        private static string SanitizeXmlString(string xml)
        {
            if (string.IsNullOrEmpty(xml))
            {
                return xml;
            }

            var buffer = new StringBuilder(xml.Length);

            foreach (char c in xml)
            {
                if (XmlSanitizedString.IsLegalXmlChar(c))
                {
                    buffer.Append(c);
                }
            }

            return buffer.ToString();
        }

        /// <summary>
        /// Whether a given character is allowed by XML 1.0.
        /// </summary>
        private static bool IsLegalXmlChar(int character)
        {
            return
            (
                character == 0x9 /* == '\t' == 9 */ ||
                character == 0xA /* == '\n' == 10 */ ||
                character == 0xD /* == '\r' == 13 */ ||
                (character >= 0x20 && character <= 0xD7FF) ||
                (character >= 0xE000 && character <= 0xFFFD) ||
                (character >= 0x10000 && character <= 0x10FFFF)
            );
        }
    }
Sanitizing Streams
Now, if the strings that need to be sanitized are being retrieved from a stream, the reading code might look like this:
    string xml;

    using (WebClient downloader = new WebClient())
    {
        using (TextReader reader = new StreamReader(downloader.OpenRead(uri)))
        {
            xml = reader.ReadToEnd();
        }
    }

    // xml potentially contains illegal characters
You could use the XmlSanitizedString class from above to clean the text after it has been read:
    string xml;

    using (WebClient downloader = new WebClient())
    {
        using (TextReader reader = new StreamReader(downloader.OpenRead(uri)))
        {
            xml = reader.ReadToEnd();
        }
    }

    // Sanitize the XML
    XmlSanitizedString safeXml = new XmlSanitizedString(xml);

    // Do something with safeXml.ToString()
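But that reads everything first and sanitizes afterwards. It is tidier to have the reader itself skip illegal characters as they are read: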
    string xml;

    using (WebClient downloader = new WebClient())
    {
        using (var reader = new XmlSanitizingStream(downloader.OpenRead(uri)))
        {
            xml = reader.ReadToEnd();
        }
    }

    // xml contains no illegal characters
The declaration for this XmlSanitizingStream class:
    public class XmlSanitizingStream : StreamReader
    {
        // Pass 'true' to automatically detect encoding using BOMs.
        // BOMs: http://en.wikipedia.org/wiki/Byte-order_mark
        public XmlSanitizingStream(Stream streamToSanitize)
            : base(streamToSanitize, true)
        { }

        /// <summary>
        /// Whether a given character is allowed by XML 1.0.
        /// </summary>
        public static bool IsLegalXmlChar(int character)
        {
            return
            (
                character == 0x9 /* == '\t' == 9 */ ||
                character == 0xA /* == '\n' == 10 */ ||
                character == 0xD /* == '\r' == 13 */ ||
                (character >= 0x20 && character <= 0xD7FF) ||
                (character >= 0xE000 && character <= 0xFFFD) ||
                (character >= 0x10000 && character <= 0x10FFFF)
            );
        }

        // ...
To get this reader to skip illegal characters, override Read() and Peek():
    private const int EOF = -1;

    public override int Read()
    {
        // Read each char, skipping ones XML has prohibited
        int nextCharacter;

        do
        {
            // Read a character
            if ((nextCharacter = base.Read()) == EOF)
            {
                // If the char denotes end of file, stop
                break;
            }
        }
        // Skip char if it's illegal, and try the next
        while (!XmlSanitizingStream.IsLegalXmlChar(nextCharacter));

        return nextCharacter;
    }

    public override int Peek()
    {
        // Return next legal XML char w/o reading it
        int nextCharacter;

        do
        {
            // See what the next character is
            nextCharacter = base.Peek();
        }
        while
        (
            // If it's illegal, skip over and try the next.
            !XmlSanitizingStream.IsLegalXmlChar(nextCharacter) &&
            (nextCharacter = base.Read()) != EOF
        );

        return nextCharacter;
    }
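Next, we'll need to override the other Read* methods (such as the ReadToEnd() used above), because StreamReader implements them against its internal buffer rather than by calling Read().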
To make life easy and avoid writing these other Read* methods from scratch, we can disassemble the
The complete version of
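As a rough sketch of what the extra overrides might look like, the class below routes the buffered read and ReadToEnd() through the single-character Read(). This assumes simplicity matters more than speed, since it bypasses StreamReader's faster buffered paths. It is not the author's complete version, and the class name SketchSanitizingReader is hypothetical.

```csharp
using System.IO;
using System.Text;

public class SketchSanitizingReader : StreamReader
{
    private const int EOF = -1;

    public SketchSanitizingReader(Stream stream) : base(stream, true) { }

    // Same XML 1.0 ranges as the post's IsLegalXmlChar.
    private static bool IsLegalXmlChar(int c)
    {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD)
            || (c >= 0x10000 && c <= 0x10FFFF);
    }

    // Return the next legal character, or EOF.
    public override int Read()
    {
        int next;
        do
        {
            next = base.Read();
        }
        while (next != EOF && !IsLegalXmlChar(next));
        return next;
    }

    // Fill the caller's buffer one sanitized character at a time.
    // Argument validation is omitted to keep the sketch short.
    public override int Read(char[] buffer, int index, int count)
    {
        int written = 0;
        int next;
        while (written < count && (next = Read()) != EOF)
        {
            buffer[index + written] = (char)next;
            written++;
        }
        return written;
    }

    // Drain the stream through the sanitizing Read().
    public override string ReadToEnd()
    {
        var result = new StringBuilder();
        int next;
        while ((next = Read()) != EOF)
        {
            result.Append((char)next);
        }
        return result.ToString();
    }
}
```

ReadLine() would need the same treatment before a reader like this could be handed to line-oriented code.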
October 31st, 2009 at 1:31 am
Awesome piece of code! Helped a lot! Thanks.
November 2nd, 2009 at 2:39 pm
I process a lot of XML files per day and have encountered these bad characters dozens of times each month.
Can you demonstrate how one might read a corrupt XML file and write to another?
Thanks,
UDT
January 19th, 2010 at 3:47 pm
Hi UpsideDownTire, did you find a solution for reading a corrupt XML file and writing it to another file, excluding those invalid characters?
Please share your solution if you have it.
Tons of Thanks in Advance!
Muskan