
July 24, 2008

.NET Regular Expressions and Unicode

A fundamental limitation of .NET regular expressions when it comes to processing Unicode text is that the regex engine apparently operates on UTF-16 code units (the 16-bit values, one or two of which encode a single Unicode character), not on code points (the values between 0 and 0x10FFFF that are assigned to characters encoded in the Unicode standard).
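To make the distinction concrete, here is a minimal sketch of counting code points rather than code units. The class and method names (CodeUnitDemo, CountCodePoints) are hypothetical and exist only for illustration; char.IsSurrogatePair is the framework call doing the real work.

```csharp
using System;

static class CodeUnitDemo
{
    // Hypothetical helper: counts Unicode code points by treating each
    // well-formed surrogate pair as a single unit.
    public static int CountCodePoints(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            // A high surrogate followed by a low surrogate encodes one
            // supplementary code point, so skip the second code unit.
            if (char.IsSurrogatePair(text, i))
                i++;
            count++;
        }
        return count;
    }
}
```

For the single Linear B character "\U00010000", the string's Length property reports 2 (code units), while CodeUnitDemo.CountCodePoints reports 1 (code points).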

This limitation can be inferred from the list of named blocks for character classes, which claims to be based on Unicode 4.0 but only goes up to FFF0–FFFF, IsSpecials. (Unicode 4.0 defines many blocks of supplementary characters, starting with Linear B Syllabary, U+010000..U+01007F.) It can be confirmed by checking the results of the following code snippet:

// requires: using System.Text.RegularExpressions;

// search a string containing two Linear B letters for a letter
bool isMatch = Regex.IsMatch("\U00010000\U00010001", @"\p{L}");
// isMatch should be true, but is actually false

// search a string containing two Linear B letters for a surrogate code point
isMatch = Regex.IsMatch("\U00010000\U00010001", @"\p{Cs}");
// isMatch is true

The fundamental problem here is that \p{L} doesn't match sequences of chars that encode a character that’s defined as a letter by Unicode. Ideally, \p{L} would match supplementary characters, and \p{Cs} would match nothing because regular expressions would operate on characters, not code units (and there’s no such thing as a surrogate character).

Because of this problem, none of the 46,982 supplementary characters encoded in Unicode 5.1 can be matched by specifying Unicode properties. Furthermore, many other regular expression language elements (such as the period and quantifiers) do not correctly handle these characters, which are encoded with two UTF-16 chars.
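The behavior of the period and of quantifiers can be checked in the same way. The following is a small sketch; the class and member names (SupplementaryRegexDemo, DotMatchLength, and so on) are made up for this example, and only Regex.Match and Regex.IsMatch are real framework calls.

```csharp
using System.Text.RegularExpressions;

static class SupplementaryRegexDemo
{
    // One supplementary character (U+10000, the first character of the
    // Linear B Syllabary block), stored as two UTF-16 code units.
    public const string LinearB = "\U00010000";

    // Number of chars consumed by a single "." at the first match.
    public static int DotMatchLength(string text)
    {
        return Regex.Match(text, ".").Length;
    }

    // True if the whole string is matched by exactly one ".".
    public static bool IsSingleDot(string text)
    {
        return Regex.IsMatch(text, @"\A.\z");
    }

    // True if the whole string is matched by exactly two "."s.
    public static bool IsTwoDots(string text)
    {
        return Regex.IsMatch(text, @"\A.{2}\z");
    }
}
```

With SupplementaryRegexDemo.LinearB as input, DotMatchLength returns 1 (the period consumes only the high surrogate), IsSingleDot returns false, and IsTwoDots returns true: the quantifier counts code units, so one character is treated as two.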

It's unfortunate that the implementation details of UTF-16 encoding leak out into what is otherwise an excellent regular expression engine. I don't know of any .NET-based workaround for this issue; with native code, this problem can be solved by using the regular expression engine of ICU.
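Short of replacing the regex engine, a narrow question such as "does this string contain a letter?" can at least be answered outside it. The sketch below is not a general workaround and does not make \p{L} work; it simply walks the string one code point at a time and asks CharUnicodeInfo.GetUnicodeCategory(string, int), which understands surrogate pairs, for the category. The CodePointScan class and ContainsLetter method are hypothetical names, and the result for any given character depends on the Unicode data shipped with the runtime.

```csharp
using System.Globalization;

static class CodePointScan
{
    // Hypothetical helper: true if any code point in the text belongs to one
    // of the Unicode letter categories (Lu, Ll, Lt, Lm, Lo), i.e. \p{L}.
    public static bool ContainsLetter(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            // This overload reads a surrogate pair starting at index i as a
            // single supplementary code point.
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(text, i);
            if (category == UnicodeCategory.UppercaseLetter ||
                category == UnicodeCategory.LowercaseLetter ||
                category == UnicodeCategory.TitlecaseLetter ||
                category == UnicodeCategory.ModifierLetter ||
                category == UnicodeCategory.OtherLetter)
            {
                return true;
            }

            // Skip the low surrogate so it is not examined on its own.
            if (char.IsSurrogatePair(text, i))
                i++;
        }
        return false;
    }
}
```

On a runtime whose Unicode data covers the Linear B block, CodePointScan.ContainsLetter("\U00010000\U00010001") should report a letter where the \p{L} regex does not.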

Update (25 July): I've filed a suggestion on Microsoft Connect, asking that the regex engine be extended to process Unicode characters, not UTF-16 code units.

Update 2 (25 July): Michael Kaplan also blogged about regular expressions and Unicode today, and included a link to Unicode Technical Standard #18: Unicode Regular Expressions. The essence of my Microsoft Connect suggestion is that .NET regular expressions be improved to have “Basic Unicode Support” as defined by that document.

Posted by Bradley Grainger at July 24, 2008 11:34 AM
