July 28, 2008
Casting delegates
One of the annoying things about delegates in .NET is that delegates with exactly the same parameters and return type are not compatible. Specifically, you cannot cast a delegate to a delegate of another type even if they have the same parameters and return type.
Predicate<int> isPositive = n => n > 0;
Func<int, bool> isPositive2 = (Func<int, bool>) isPositive; // COMPILER ERROR
This problem is mitigated somewhat in .NET 3.5, which defines generic delegates that take arbitrary parameters and return types and encourages their use: Action, Action<T>, Action<T1, T2>, ..., Func<TR>, Func<T, TR>, Func<T1, T2, TR>, ...
However, all of the "old" delegates still exist and are in use: AsyncCallback, Comparison<T>, and Predicate<T>, to name a few.
The biggest source of delegate types is event handlers. There's plain old EventHandler and the newer EventHandler<T>, but there are still lots of non-generic event handlers like CancelEventHandler. Neither WPF nor Windows Forms uses EventHandler<T>, so they are chock full of unique delegate types that take an object and some EventArgs-derived class.
Usually this doesn't present a problem, but occasionally you'd like to convert between compatible delegates. If both types are known at compile-time, you can just use a lambda:
Predicate<int> isPositive = n => n > 0;
Func<int, bool> isPositive2 = n => isPositive(n);
But sometimes the types aren't known at compile-time. We primarily find this to be the case when trying to write generic utility code that can work with arbitrary event handlers. Fortunately, it is possible to cast between arbitrary delegate types, though it isn't as efficient as you might like. Here is DelegateUtility.Cast:
public static class DelegateUtility
{
    public static T Cast<T>(Delegate source) where T : class
    {
        return Cast(source, typeof(T)) as T;
    }

    public static Delegate Cast(Delegate source, Type type)
    {
        if (source == null)
            return null;

        Delegate[] delegates = source.GetInvocationList();
        if (delegates.Length == 1)
            return Delegate.CreateDelegate(type,
                delegates[0].Target, delegates[0].Method);

        Delegate[] delegatesDest = new Delegate[delegates.Length];
        for (int nDelegate = 0; nDelegate < delegates.Length; nDelegate++)
            delegatesDest[nDelegate] = Delegate.CreateDelegate(type,
                delegates[nDelegate].Target, delegates[nDelegate].Method);
        return Delegate.Combine(delegatesDest);
    }
}
There is a generic version and a non-generic version. Note that the null case is handled first, followed by the single-invocation case, followed by the rare multiple-invocation case.
It is quite straightforward to use. (We'd have made it an extension method, but converting delegates isn't really a common enough need to justify it.)
CancelEventHandler handler = (source, e) => e.Cancel = OnCancel();
EventHandler<CancelEventArgs> handler2 =
    DelegateUtility.Cast<EventHandler<CancelEventArgs>>(handler);
The types used by the two delegate types must be exactly the same for DelegateUtility.Cast to work. Supporting compatible types is left as an exercise for the reader; we certainly haven't needed it.
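The multiple-invocation case is the one least likely to be exercised, so here is a minimal sketch of it. DelegateUtility is repeated from above so the program compiles on its own; CastDemo, Run, and the two lambdas are hypothetical names made up for this illustration.

```csharp
using System;
using System.Collections.Generic;

// DelegateUtility as defined in the post, repeated so the sketch is self-contained.
public static class DelegateUtility
{
    public static T Cast<T>(Delegate source) where T : class
    {
        return Cast(source, typeof(T)) as T;
    }

    public static Delegate Cast(Delegate source, Type type)
    {
        if (source == null)
            return null;

        Delegate[] delegates = source.GetInvocationList();
        if (delegates.Length == 1)
            return Delegate.CreateDelegate(type,
                delegates[0].Target, delegates[0].Method);

        Delegate[] delegatesDest = new Delegate[delegates.Length];
        for (int nDelegate = 0; nDelegate < delegates.Length; nDelegate++)
            delegatesDest[nDelegate] = Delegate.CreateDelegate(type,
                delegates[nDelegate].Target, delegates[nDelegate].Method);
        return Delegate.Combine(delegatesDest);
    }
}

// Hypothetical demo class, not part of the original utility.
public static class CastDemo
{
    public static List<string> Run()
    {
        List<string> log = new List<string>();

        // Build a multicast Predicate<int> with two entries in its invocation list.
        Predicate<int> checks = n => { log.Add("first"); return n > 0; };
        checks += n => { log.Add("second"); return n % 2 == 0; };

        // Convert it to a Func<int, bool>; both entries should be carried over in order.
        Func<int, bool> converted = DelegateUtility.Cast<Func<int, bool>>(checks);

        // Invoking a multicast delegate returns the result of the last entry.
        bool result = converted(3);
        log.Add(result.ToString());
        return log;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Run())); // prints "first, second, False"
    }
}
```

Both lambdas run in their original order, and the converted delegate returns the result of the last one, just as the original multicast delegate would.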
Posted by Ed Ball at 4:00 PM
July 24, 2008
.NET Regular Expressions and Unicode
A fundamental limitation of .NET regular expressions when it comes to processing Unicode text is that the regex engine apparently operates on UTF-16 code units (i.e., the 16-bit values that are used to encode a single Unicode character), not code points (the values between 0 and 0x10FFFF that are assigned to characters encoded in the Unicode standard).
This limitation can be inferred from the list of named blocks for character classes, which claims to be based on Unicode 4.0 but only goes up to FFF0–FFFF, IsSpecials. (Unicode 4.0 defines many blocks of supplementary characters, starting with Linear B Syllabary, U+010000..U+01007F.) It can be confirmed by checking the return values in the following code snippet:
// search a string containing two Linear B letters for a letter
bool isMatch = Regex.IsMatch("\U00010000\U00010001", @"\p{L}");
// isMatch should be true, but is actually false

// search a string containing two Linear B letters for a surrogate code point
isMatch = Regex.IsMatch("\U00010000\U00010001", @"\p{Cs}");
// isMatch is true
The fundamental problem here is that \p{L} doesn't match sequences of chars that encode a character that’s defined as a letter by Unicode. Ideally, \p{L} would match supplementary characters, and \p{Cs} would match nothing because regular expressions would operate on characters, not code units (and there’s no such thing as a surrogate character).
Because of this problem, none of the 46,982 supplementary characters encoded in Unicode 5.1 can be matched by specifying Unicode properties. Furthermore, many other regular expression language elements (such as the period and quantifiers) do not correctly handle these characters, which are encoded with two UTF-16 chars.
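The effect on the period and on quantifiers can be seen with a small sketch like the one below. The class and method names are hypothetical. The grouping in the last method is only a way of showing what the quantifier binds to, not a general fix for the problem.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class showing that regex elements act on UTF-16 code units.
public static class CodeUnitDemo
{
    // U+10000 (LINEAR B SYLLABLE B008 A) is stored as the surrogate pair D800 DC00.
    public const string LinearB = "\U00010000";

    // The period matches each surrogate separately, so one character gives two matches.
    public static int CountPeriodMatches()
    {
        return Regex.Matches(LinearB, ".").Count;
    }

    // "^.$" cannot match the single character, because it is two chars long.
    public static bool SinglePeriodMatchesWholeString()
    {
        return Regex.IsMatch(LinearB, "^.$");
    }

    // A quantifier written after the character binds only to the low surrogate,
    // so the pattern does not match two repetitions of the character.
    public static bool BareQuantifierMatchesRepeats()
    {
        return Regex.IsMatch(LinearB + LinearB, "^" + LinearB + "+$");
    }

    // Grouping both code units makes the quantifier apply to the whole pair.
    public static bool GroupedQuantifierMatchesRepeats()
    {
        return Regex.IsMatch(LinearB + LinearB, "^(?:" + LinearB + ")+$");
    }

    public static void Main()
    {
        Console.WriteLine(LinearB.Length);                    // 2
        Console.WriteLine(CountPeriodMatches());              // 2
        Console.WriteLine(SinglePeriodMatchesWholeString());  // False
        Console.WriteLine(BareQuantifierMatchesRepeats());    // False
        Console.WriteLine(GroupedQuantifierMatchesRepeats()); // True
    }
}
```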
It's unfortunate that the implementation details of UTF-16 encoding leak out into what is otherwise an excellent regular expression engine. I don't know of any .NET-based workaround for this issue; with native code, this problem can be solved by using the regular expression engine of ICU.
Update (25 July): I've filed a suggestion on Microsoft Connect, asking that the regex engine be extended to process Unicode characters, not UTF-16 code units.
Update 2 (25 July): Michael Kaplan also blogged about regular expressions and Unicode today, and included a link to Unicode Technical Standard #18: Unicode Regular Expressions. The essence of my Microsoft Connect suggestion is that .NET regular expressions be improved to have “Basic Unicode Support” as defined by that document.
Posted by Bradley Grainger at 11:34 AM
July 18, 2008
SetData actually adds data
Incredibly, MSDN does not make it clear that the SetData method of IDataObject does not replace the data, but actually adds the data to the data object.
So, if you want multiple formats, just call SetData multiple times.
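As a rough sketch, assuming the Windows Forms DataObject implementation of IDataObject (the post does not say which implementation is meant), calling SetData twice leaves both formats available. SetDataDemo and BuildMultiFormatData are hypothetical names, and the HTML string is only a placeholder rather than a complete clipboard HTML payload.

```csharp
using System;
using System.Windows.Forms;

// Hypothetical demo class; assumes the Windows Forms DataObject.
public static class SetDataDemo
{
    public static DataObject BuildMultiFormatData()
    {
        DataObject data = new DataObject();

        // Each call adds another format; the earlier one is not replaced.
        data.SetData(DataFormats.UnicodeText, "Hello");
        data.SetData(DataFormats.Html, "<b>Hello</b>");
        return data;
    }

    public static void Main()
    {
        DataObject data = BuildMultiFormatData();
        Console.WriteLine(data.GetDataPresent(DataFormats.UnicodeText));
        Console.WriteLine(data.GetDataPresent(DataFormats.Html));
    }
}
```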
Update: The SetData method of the Clipboard class replaces the data. No wonder I'm so confused!
Posted by Ed Ball at 1:15 PM