July 24, 2008

.NET Regular Expressions and Unicode

A fundamental limitation of .NET regular expressions when it comes to processing Unicode text is that the regex engine apparently operates on UTF-16 code units (i.e., the 16-bit values that are used to encode a single Unicode character), not code points (the values between 0 and 0x10FFFF that are assigned to characters encoded in the Unicode standard).

This limitation can be inferred from the list of named blocks for character classes, which claims to be based on Unicode 4.0 but only goes up to FFF0–FFFF, IsSpecials. (Unicode 4.0 defines many blocks of supplementary characters, starting with Linear B Syllabary, U+010000..U+01007F.) It can be confirmed by verifying the return values of the following code snippet:

// search a string containing two Linear B letters for a letter

bool isMatch = Regex.IsMatch("\U00010000\U00010001", @"\p{L}");

// isMatch should be true, but is actually false

 

// search a string containing two Linear B letters for a surrogate code point

isMatch = Regex.IsMatch("\U00010000\U00010001", @"\p{Cs}");

// isMatch is true

The fundamental problem here is that \p{L} doesn't match sequences of chars that encode a character that’s defined as a letter by Unicode. Ideally, \p{L} would match supplementary characters, and \p{Cs} would match nothing because regular expressions would operate on characters, not code units (and there’s no such thing as a surrogate character).

Because of this problem, none of the 46,982 supplementary characters encoded in Unicode 5.1 can be matched by specifying Unicode properties. Furthermore, many other regular expression language elements (such as the period and quantifiers) do not correctly handle these characters, which are encoded with two UTF-16 chars.
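To make the code unit versus code point distinction concrete, here is a minimal sketch that walks a string by code point rather than by char. It is not a regex workaround, just an illustration; the CodePoints class and its Enumerate method are hypothetical names introduced for this example, while char.IsSurrogatePair and char.ConvertToUtf32 are the framework methods doing the real work.

```csharp
using System;
using System.Collections.Generic;

public static class CodePoints
{
    // Enumerates the Unicode code points in a UTF-16 string,
    // combining each valid surrogate pair into a single value.
    public static IEnumerable<int> Enumerate(string text)
    {
        for (int index = 0; index < text.Length; index++)
        {
            if (char.IsSurrogatePair(text, index))
            {
                yield return char.ConvertToUtf32(text, index);
                index++; // skip the low surrogate that was just consumed
            }
            else
            {
                // BMP character (or an unpaired surrogate, returned as-is)
                yield return text[index];
            }
        }
    }
}
```

For the two-letter Linear B string above, this yields two code points even though the string's Length is 4. As far as I know, CharUnicodeInfo.GetUnicodeCategory(string, int) can then classify a character from the index of its high surrogate, but that still has to be done outside the regex engine.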

It's unfortunate that the implementation details of UTF-16 encoding leak out into what is otherwise an excellent regular expression engine. I don't know of any .NET-based workaround for this issue; with native code, this problem can be solved by using the regular expression engine of ICU.

Posted by Bradley Grainger at 11:34 AM

July 18, 2008

SetData actually adds data

Incredibly, MSDN does not make it clear that the SetData method of IDataObject does not replace the data, but actually adds the data to the data object.

So, if you want multiple formats, just call SetData multiple times.

Update: The SetData method of the Clipboard class replaces the data. No wonder I'm so confused!
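As a rough sketch of calling SetData multiple times on a data object, assuming the Windows Forms DataObject class (the MultiFormatData helper is a hypothetical name used only for illustration):

```csharp
using System;
using System.Windows.Forms;

public static class MultiFormatData
{
    // Builds a data object that offers the same content in two formats.
    public static DataObject Create(string plainText, string html)
    {
        DataObject data = new DataObject();
        data.SetData(DataFormats.Text, plainText); // first format
        data.SetData(DataFormats.Html, html);      // adds a second format; the text is kept
        return data;
    }
}
```

After both calls, GetDataPresent should report true for DataFormats.Text as well as DataFormats.Html.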

Posted by Ed Ball at 01:15 PM

June 10, 2008

Salsa20 Implementation in C#

Salsa20 is a stream cipher submitted to eSTREAM, the ECRYPT Stream Cipher Project, by Daniel Bernstein. (Salsa20/12, a version of the algorithm that uses fewer rounds, was one of four software implementations to be included in the final eSTREAM portfolio.) The algorithm can use either 128-bit or 256-bit keys, and is designed to be secure and efficient. For more information, see the Wikipedia article and the algorithm homepage.

There is a .NET port of this algorithm in the Bouncy Castle Crypto Library. Being a port from a Java library, however, that version doesn't interoperate with the System.Security.Cryptography APIs.

The code attached to this post implements Salsa20 using a subclass of SymmetricAlgorithm (with the actual encryption class implementing ICryptoTransform), so it can be used with CryptoStream and other .NET cryptography classes.
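To show what "can be used with CryptoStream" buys you, here is a small sketch of a helper that pushes a byte array through any ICryptoTransform. The CryptoHelper class is a hypothetical name, not part of the attached code; it only uses standard System.Security.Cryptography types.

```csharp
using System.IO;
using System.Security.Cryptography;

public static class CryptoHelper
{
    // Runs input through any ICryptoTransform (encryptor or decryptor)
    // and returns the transformed bytes.
    public static byte[] Transform(ICryptoTransform transform, byte[] input)
    {
        using (MemoryStream output = new MemoryStream())
        {
            using (CryptoStream crypto = new CryptoStream(output, transform, CryptoStreamMode.Write))
            {
                crypto.Write(input, 0, input.Length);
            } // disposing the CryptoStream flushes the final block

            // ToArray still works after the MemoryStream has been closed
            return output.ToArray();
        }
    }
}
```

The transform would come from calling CreateEncryptor or CreateDecryptor on whichever SymmetricAlgorithm subclass you are using, including the one in the attached file.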

The focus is not on efficiency (for that, one should probably use a hand-coded SSE2 implementation), but on being a straightforward port to C# from the reference implementation in C. There is also a suite of tests (that use the eSTREAM test vectors) to verify the correctness of the implementation.

Like the reference C implementation, this code is in the public domain. Download it here: Salsa20.cs, Salsa20Tests.cs.

Posted by Bradley Grainger at 07:00 AM

June 09, 2008

Implementing Clone

Now, before you get too excited, I'm not suggesting that you implement the all-but-deprecated ICloneable interface. Rather, this post is about how best to implement a method that duplicates an object – such a method is commonly named "Clone".

If you like, you can skip the long-winded commentary below and start reading the recommendations at the bottom.

Some would argue that Copy is a better name than Clone, since it avoids the “smell” of ICloneable, but I think Clone is more discoverable. Obviously the patterns of this post can be used regardless of whether you like Clone, Copy, Duplicate, or Replicate.

Any Clone method should document its semantics if they aren't obvious; in particular, it should be clear whether a “shallow” or a “deep” clone will be used. (A “deep” clone clones its “children”; a “shallow” clone simply copies the references of its children.)
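For anyone who hasn't run into the distinction, here is a minimal sketch of the difference. The Playlist class and its two method names are made up for this illustration; a real class would normally expose just one documented Clone.

```csharp
using System.Collections.Generic;

public sealed class Playlist
{
    public Playlist()
    {
        Tracks = new List<string>();
    }

    public List<string> Tracks { get; private set; }

    // Shallow: the copy shares the same Tracks list as the original.
    public Playlist ShallowClone()
    {
        return (Playlist) MemberwiseClone();
    }

    // Deep: the copy gets its own list with the same contents.
    public Playlist DeepClone()
    {
        Playlist copy = new Playlist();
        copy.Tracks.AddRange(Tracks);
        return copy;
    }
}
```

Adding a track to a shallow clone also changes the original, because both objects reference one list; adding a track to a deep clone does not.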

In some circumstances, it may be useful to add a parameter to Clone that can change the behavior. For example, proposals to improve the ICloneable interface included a parameter that would indicate whether a shallow or deep clone is desired. Adding a parameter to the Clone method should be a simple extension to the patterns described below.

The simplest kind of clonable class is a sealed class, because we don't need to support derived classes. Even in this simple case, the cleanest approach is to delegate the cloning to a private “copy constructor”:

public sealed class Vector

{

    public Vector(int length)

    {

        m_array = new int[length];

    }

 

    public int Length

    {

        get { return m_array.Length; }

    }

 

    public int this[int index]

    {

        get { return m_array[index]; }

        set { m_array[index] = value; }

    }

 

    public Vector Clone()

    {

        return new Vector(this);

    }

 

    private Vector(Vector v)

    {

        m_array = (int[]) v.m_array.Clone();

    }

 

    int[] m_array;

}

Now, suppose we wanted Vector to be an abstract class so that derived classes could decide how the items are stored. Furthermore, we determine that the length needs to be cached in the base class for performance reasons:

public abstract class Vector

{

    protected Vector(int length)

    {

        m_length = length;

    }

 

    protected Vector(Vector v)

    {

        m_length = v.m_length;

    }

 

    public int Length

    {

        get { return m_length; }

    }

 

    public abstract int this[int index] { get; set; }

 

    public abstract Vector Clone();

 

    int m_length;

}

The protected “copy constructor” is provided to simplify the Clone override:

public sealed class ArrayVector : Vector

{

    public ArrayVector(int length)

        : base(length)

    {

        m_array = new int[length];

    }

 

    private ArrayVector(ArrayVector v)

        : base(v)

    {

        m_array = (int[]) v.m_array.Clone();

    }

 

    public override int this[int index]

    {

        get { return m_array[index]; }

        set { m_array[index] = value; }

    }

 

    public override Vector Clone()

    {

        return new ArrayVector(this);

    }

 

    int[] m_array;

}

In the ArrayVector implementation above, I'd like the overridden Clone method to be more type-safe – that is, I'd like it to return an ArrayVector instead of a Vector. Unfortunately, while some languages (e.g. C++) allow overrides to return a more-derived class than the method they override, C# does not. Not to be deterred in my quest for type safety, my preferred pattern is to use CloneCore as the overridable and define Clone as a separate non-overridable method in each concrete class.

public abstract class Vector

{

    // ...

    protected abstract Vector CloneCore();

    // ...

}

 

public sealed class ArrayVector : Vector

{

    // ...

    public ArrayVector Clone()

    {

        return (ArrayVector) CloneCore();

    }

 

    protected override Vector CloneCore()

    {

        return new ArrayVector(this);

    }

    // ...

}

Incidentally, if you really want to implement ICloneable:

public class Vector : ICloneable

{

    // ...

    object ICloneable.Clone()

    {

        return Clone();

    }

    // ...

}

Okay, enough commentary. Here are the recommendations:

Recommendations

There are three parts to a clonable class: (1) the copy constructor, (2) the CloneCore method, and (3) the Clone method.

The copy constructor is always defined, and does all of the copying, being sure to call the base class copy constructor if available. It is always protected (unless the class is sealed, in which case it is private).

The CloneCore and Clone methods are only implemented on non-abstract classes, and always have the same definitions. (Exception: if the root class is abstract, CloneCore is an abstract method on that class.) The CloneCore method uses the copy constructor to clone the instance. The Clone method calls CloneCore and casts the result to the correct type. If the Clone method hides a base class method, use the “new” keyword.

Clear? No? Let's try sample code. Consider the clonable classes Base and Derived, where Derived derives from Base. The Base class should always have a protected copy constructor:

protected Base(Base x)

{

    // copy members from x

}

If Base is abstract, it only has an abstract CloneCore:

protected abstract Base CloneCore();

If Base is not abstract, it defines both Clone and CloneCore:

public Base Clone()

{

    return (Base) CloneCore();

}

 

protected virtual Base CloneCore()

{

    return new Base(this);

}

Similarly, the Derived class should always have a copy constructor:

protected Derived(Derived x)

    : base(x)

{

    // copy members from x

}

But if Derived is sealed, it will need to be private:

private Derived(Derived x)

    : base(x)

{

    // copy members from x

}

If Derived is abstract, it is done. If Derived is not abstract, it defines both Clone and CloneCore:

public new Derived Clone()

{

    return (Derived) CloneCore();

}

 

protected override Base CloneCore()

{

    return new Derived(this);

}

If no ancestors have defined a Clone method (e.g. Base is abstract), you'll have to omit the “new” keyword from the Derived Clone method.
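Putting the pieces together, here is a complete sketch for the case where Base is concrete and Derived is sealed. The Id and Name members are invented just to give the copy constructors something to copy.

```csharp
public class Base
{
    public Base(int id)
    {
        m_id = id;
    }

    // protected copy constructor: does all of the copying for this level
    protected Base(Base x)
    {
        m_id = x.m_id;
    }

    public int Id
    {
        get { return m_id; }
    }

    public Base Clone()
    {
        return (Base) CloneCore();
    }

    protected virtual Base CloneCore()
    {
        return new Base(this);
    }

    readonly int m_id;
}

public sealed class Derived : Base
{
    public Derived(int id, string name)
        : base(id)
    {
        m_name = name;
    }

    // private because Derived is sealed; chains to the base copy constructor
    private Derived(Derived x)
        : base(x)
    {
        m_name = x.m_name;
    }

    public string Name
    {
        get { return m_name; }
    }

    // hides Base.Clone so that callers get the more specific type
    public new Derived Clone()
    {
        return (Derived) CloneCore();
    }

    protected override Base CloneCore()
    {
        return new Derived(this);
    }

    readonly string m_name;
}
```

Note that calling Clone through a Base reference still produces a Derived instance, because Base.Clone dispatches to the overridden CloneCore.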

Whew. It feels like I wrote too much, but I'll just publish and move on. Hope this is useful!

Posted by Ed Ball at 09:15 AM

May 29, 2008

Events and Threads (Part 3)

We've discussed reasonable mechanisms for subscribing to events and for raising events, but we skirted the issue of "thread-safe" events until now.

What is a thread-safe event? A good definition would be "an event that may be subscribed, unsubscribed, and/or raised simultaneously on arbitrary threads." In that case, what must we do to create a thread-safe event?

Certainly it must be true that if you add an event handler, it is added, and if you remove an event handler, it is removed. As discussed earlier, the default implementation of the add and remove methods accomplishes this by locking the object, but I'd recommend using your own lock:

public event EventHandler Click

{

    add

    {

        lock (m_lockClick)

            m_click += value;

    }

    remove

    {

        lock (m_lockClick)

            m_click -= value;

    }

}

 

EventHandler m_click;

object m_lockClick = new object();

It is also certain that a thread-safe event must not throw a null reference exception when raising the event. The problem is that another thread could remove the last event handler at any moment, which sets the event delegate to null. In the following naïve implementation, m_click could become null after the check but before the call:

private void RaiseClick()

{

    if (m_click != null)

        m_click(this, EventArgs.Empty);

}

The most common solution is to make a copy of the event delegate before calling it:

private void RaiseClick()

{

    EventHandler handler = m_click;

    if (handler != null)

        handler(this, EventArgs.Empty);

}

However, I learned from Juval Lowy's book that aggressive compiler inlining could theoretically eliminate the copy, which would bring us back to the same problem. His solution is to write a non-inlined method that raises the event, something like this:

private void RaiseClick()

{

    RaiseEvent(m_click);

}

 

// MethodImpl requires: using System.Runtime.CompilerServices;

[MethodImpl(MethodImplOptions.NoInlining)]

private void RaiseEvent(EventHandler handler)

{

    if (handler != null)

        handler(this, EventArgs.Empty);

}

Another good solution is to add a do-nothing event handler, so that the event delegate is never null; Part 2 shows that approach.

Of course, the most "correct" solution is probably to use the lock that's already there:

private void RaiseClick()

{

    EventHandler handler;

    lock (m_lockClick)

        handler = m_click;

    if (handler != null)

        handler(this, EventArgs.Empty);

}

Perhaps the last solution helped you think of another aspect of thread-safe events that isn't discussed very often. A problem common to all of these solutions is that a subscriber's event handler may be called even after it has been unsubscribed!

I found this behavior very surprising when I was writing thread-safe objects with events. For example, the Dispose method of one object might unsubscribe from an event of another object, assuming that the event handler won't be called again; but, in fact, that event handler might actually be called after the object has been disposed, which can obviously cause problems.
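The race is hard to reproduce on demand, so here is a single-threaded sketch that forces the same interleaving: the raiser copies the delegate, the subscriber unsubscribes, and then the copy is invoked. The Publisher and Listener classes, and the afterSnapshot hook, are hypothetical and exist only to make the ordering deterministic. The Listener guards itself with a disposed flag, which is one way a subscriber might cope.

```csharp
using System;

public sealed class Publisher
{
    public event EventHandler Click;

    // afterSnapshot stands in for whatever another thread might do
    // between the copy of the delegate and the call.
    public void RaiseClick(Action afterSnapshot)
    {
        EventHandler handler = Click;
        afterSnapshot();
        if (handler != null)
            handler(this, EventArgs.Empty); // may call an unsubscribed handler
    }
}

public sealed class Listener : IDisposable
{
    public Listener(Publisher publisher)
    {
        m_publisher = publisher;
        m_publisher.Click += OnClick;
    }

    public int HandledCalls { get { return m_handledCalls; } }
    public int IgnoredCalls { get { return m_ignoredCalls; } }

    public void Dispose()
    {
        m_publisher.Click -= OnClick;
        m_isDisposed = true;
    }

    private void OnClick(object sender, EventArgs e)
    {
        // guard: the handler can still be invoked after Dispose
        if (m_isDisposed)
        {
            m_ignoredCalls++;
            return;
        }
        m_handledCalls++;
    }

    readonly Publisher m_publisher;
    volatile bool m_isDisposed;
    int m_handledCalls;
    int m_ignoredCalls;
}
```

Calling publisher.RaiseClick(listener.Dispose) invokes OnClick after the listener has been disposed. With real threads the flag only narrows the window, since Dispose could still run while the handler is partway through its work.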

If you want to guarantee that an event handler won't be called after it is unsubscribed, as well as guarantee that an event handler can't be unsubscribed until the event is done being raised, the most direct solution is to call the event handler from within the lock:

private void RaiseClick()

{

    lock (m_lockClick)

    {

        if (m_click != null)

            m_click(this, EventArgs.Empty);

    }

}

This is a bit hair-raising, of course, because you're calling arbitrary code from within a lock, which is a good recipe for deadlock. I don't have enough experience with this pattern to know how common a problem that might be.

One final note about thread-safe events – make sure that your clients understand that their event handler will be invoked on an arbitrary thread, so that they know to dispatch to their UI thread if necessary.

I wish I had more solid conclusions as regards thread-safe events, but I'm still working through these issues. Hopefully I've at least given you some things to think about when you're considering adding events to a thread-safe class – it might be easier to just avoid them altogether.

Posted by Ed Ball at 02:46 PM

May 23, 2008

Events and Threads (Part 2)

It's time to continue our discussion of events and threads. You'll note in the last post that I didn't say much about "thread safe" events, because it's not clear what that would mean, particularly as regards the raising of an event. You won't see much in this post about "thread safe" events, either, though I do hope to get to that eventually.

We've already talked about adding and removing an event handler, so it's only natural that we would now talk about raising the event. The most commonly discussed problem that we face when raising an event in C# is that the event delegate is null if there are no subscribers.

public event EventHandler Click;

 

private void RaiseClick()

{

    // throws NullReferenceException if no subscribers

    Click(this, EventArgs.Empty);

}

In fact, I touched on this subject back in March, where I noted that assigning a do-nothing event handler to the event delegate avoids that problem entirely, though it does add a bit of inefficiency.

public event EventHandler Click = delegate { };

 

private void RaiseClick()

{

    // never throws NullReferenceException

    Click(this, EventArgs.Empty);

}

If your class has thread affinity, you must only raise the event from the UI thread, so you can safely do a null check without worrying about another thread removing the last event handler between the check and the call.

public event EventHandler Click;

 

private void RaiseClick()

{

    VerifyAccess();

 

    if (Click != null)

        Click(this, EventArgs.Empty);

}

If your class is thread-compatible, it must be assumed that you only raise an event from the thread that is currently accessing your instance, so, again, you can safely do a null check without worrying about other threads.

public event EventHandler Click;

 

private void RaiseClick()

{

    if (Click != null)

        Click(this, EventArgs.Empty);

}

But what if you want to raise an event in response to background work on a worker thread? In the case of a thread-affined class, there is usually a way to submit work to the UI thread, allowing you to raise the event from the UI thread. In WPF, you can use the Dispatcher for the UI thread.

public event EventHandler Click;

 

private void RaiseClick()

{

    Dispatcher.Invoke(DispatcherPriority.Send, new SendOrPostCallback(

        delegate

        {

            if (Click != null)

                Click(this, EventArgs.Empty);

        }), null);

}

In Windows Forms or WPF, you can use the SynchronizationContext of the UI thread.

public event EventHandler Click;

 

private void RaiseClick()

{

    m_context.Send(

        delegate

        {

            if (Click != null)

                Click(this, EventArgs.Empty);

        }, null);

}

 

SynchronizationContext m_context = SynchronizationContext.Current;

Raising an event in response to background work on a worker thread for a thread-compatible class is more interesting, because subscribers to the event will be called on an arbitrary thread. Therefore, for all intents and purposes, the event must be thread-safe, because it could be subscribed or unsubscribed on one thread and raised on another thread at the same time.

Which means that it's time to talk about thread-safe events, but I think I'll save that discussion for a future post.

Posted by Ed Ball at 01:15 PM

May 09, 2008

Events and Threads (Part 1)

Once upon a time, I mentioned that I'd like to blog about thread-safety as it relates to events, so I figured I'd better get moving on that.

There are so many issues with .NET events and threads that it's hard to know where to begin, but let's start with the adding and removing of event handlers.

Unless documentation specifies otherwise, one must assume that adding and removing an event handler falls under the same thread safety requirements as any other method of the class. So, if the class has thread affinity (Windows Forms controls, WPF elements, etc.), assume that events can only be added and removed from the UI thread. If the class is thread-compatible (most non-UI classes in .NET), assume that events can be added and removed from any thread, but no two threads can add or remove events (or call any other method, for that matter) at the same time.

When authoring an event, if you allow C# to implement the add and remove methods (by not including your own), the default implementation attempts to be thread-safe by locking "this" before adding or removing the handler from the event delegate. In other words, these two events are implemented the same way:

public event EventHandler Event1;

 

public event EventHandler Event2

{

    add { lock (this) m_event2 += value; }

    remove { lock (this) m_event2 -= value; }

}

private EventHandler m_event2;

If your event has thread affinity or is thread-compatible, the lock is unnecessary overhead, so you're better off with a lock-free implementation:

public event EventHandler Event3

{

    add { m_event3 += value; }

    remove { m_event3 -= value; }

}

private EventHandler m_event3;

Better yet, if your event has thread affinity, make sure that the caller is on the UI thread.

public event EventHandler Event4

{

    add { VerifyAccess(); m_event4 += value; }

    remove { VerifyAccess(); m_event4 -= value; }

}

private EventHandler m_event4;

Furthermore, locking "this" is not recommended (see the MSDN documentation on the lock statement and on MethodImplOptions.Synchronized), so you might consider always implementing your own add and remove methods anyway.

While we're on the subject of adding and removing event handlers, if your class has more than a few events, consider using the EventHandlerList class to manage all of the event handlers, or manage the event handlers in a similar way with your own collection. This will save memory when many of the events have no subscribers. The EventHandlerList class is not thread-safe, which makes it most suitable for thread-affined and thread-compatible events.
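Here is a rough sketch of the EventHandlerList approach for a thread-compatible class. The Widget class and its event names are made up for illustration; EventHandlerList itself lives in System.ComponentModel.

```csharp
using System;
using System.ComponentModel;

public class Widget
{
    public event EventHandler Click
    {
        add { m_events.AddHandler(s_clickKey, value); }
        remove { m_events.RemoveHandler(s_clickKey, value); }
    }

    public event EventHandler DoubleClick
    {
        add { m_events.AddHandler(s_doubleClickKey, value); }
        remove { m_events.RemoveHandler(s_doubleClickKey, value); }
    }

    // public only so that the sketch is easy to exercise
    public void RaiseClick()
    {
        EventHandler handler = (EventHandler) m_events[s_clickKey];
        if (handler != null)
            handler(this, EventArgs.Empty);
    }

    // one static key object per event, shared by all instances
    static readonly object s_clickKey = new object();
    static readonly object s_doubleClickKey = new object();

    // a single field per instance, no matter how many events there are
    readonly EventHandlerList m_events = new EventHandlerList();
}
```

Each instance pays for one list reference instead of one delegate field per event, and the indexer returns null for an event that has no subscribers.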

There's obviously much more to discuss, not the least of which is a discussion of what it would mean for an event to be entirely thread-safe; hopefully part 2 won't be so long in coming!

Posted by Ed Ball at 10:25 AM

April 09, 2008

Exception 0xc0020001 in C++/CLI assembly

After reorganising some code in a C++/CLI assembly, I started getting exception "0xc0020001: The string binding is invalid" when shutting down the C# application that loaded that assembly.

When the program was run under the debugger, it would throw the exception from a function in crtdll.c that was processing the DLL_PROCESS_DETACH notification sent to DllMain. The error occurred when attempting to call the function pointer function_to_call (on line 444).

  437 /* cache the function to call. */

  438 function_to_call = (_PVFV)_decode_pointer(*onexitend);

  439 

  440 /* mark the function pointer as visited. */

  441 *onexitend = (_PVFV)_encoded_null();

  442 

  443 /* call the function, which can eventually change __onexitbegin and __onexitend */

  444 (*function_to_call)();

  445 

  446 onexitbegin_new = (_PVFV *)_decode_pointer(__onexitbegin);

  447 onexitend_new = (_PVFV *)_decode_pointer(__onexitend);

Here's where a feature of the Visual Studio debugger that I hadn't seen before came in very handy. If I set a breakpoint on line 444 and simply hovered my mouse over function_to_call, the debugger tooltip showed the full decorated name of the function, in this case, "_t2m@???__FstaticNativeObject@?1??NativeMethod@@YAHH@Z@YAXXZ@?A0x754dd9c9@@YAXXZ". 

Chris Brumme explains error C0020001 and identifies one of the causes as "trying to call into managed code … after the runtime has started shutting down". According to a forum post (about this same error), "t2m" stands for "transition to managed". The information in the decorated function name ("staticNativeObject" and "NativeMethod") was enough to piece together the rest of the puzzle. I had written code much like the following:

#pragma unmanaged

 

class NativeClass

{

public:

    NativeClass() { }

    ~NativeClass() { }

};

 

bool NativeMethod()

{

    static NativeClass staticNativeObject;

    return true;

}

Even though NativeMethod is emitted as native code, the disassembly showed that it registers a managed entry point for the NativeClass destructor (for staticNativeObject) with the atexit function. But by the time atexit ran this destructor (from DllMain when the C++/CLI assembly was unloaded), the CLR had already started shutting down, and the function call failed.

This problem can be solved by removing the static variable. Either make it non-static, or move it to class or file (or global!) scope. (Slightly more complex workarounds may, of course, be necessary depending on the expense or difficulty of initialising the object.)

It seems like the compiler is emitting incorrect code here (it should register a native entry point for the destructor, or call the managed version from AppDomain.DomainUnload, instead), so I filed a bug report with Microsoft Connect on this problem.

Posted by Bradley Grainger at 12:08 PM

April 08, 2008

Finalizers called from partially constructed objects

Did you know that finalizers are called from partially constructed objects? I certainly didn't. If an exception is thrown from an object's constructor, that object is considered “partially constructed” – and its finalizer is still run when the object is garbage collected. Chris Brumme mentioned this four years ago when he helped us understand that it’s hard to implement Finalize properly: “Your Finalize method must tolerate partially constructed instances.”

A coworker discovered this fact when he was unit testing a class that called Debug.Fail in its finalizer to make sure that its instances were being disposed properly. He passed an invalid argument to the constructor to verify that an exception would be thrown – but then found that the call to Debug.Fail in the finalizer was causing tests to fail.

We couldn't figure out a good way to determine whether an object is partially constructed, so we just had to hack around the problem. Any better ideas for detecting undisposed objects?
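For what it's worth, one possible hack (a sketch only, and not necessarily what we ended up doing) is to set a flag as the very last statement of the constructor and have the finalizer check it. The Resource class below is hypothetical, and this only covers a sealed class; a derived class constructor could still throw after the base constructor has set the flag.

```csharp
using System;
using System.Diagnostics;

public sealed class Resource : IDisposable
{
    public Resource(string name)
    {
        if (name == null)
            throw new ArgumentNullException("name");

        m_name = name;

        // must be the last statement: anything that throws above leaves it false
        m_isConstructed = true;
    }

    ~Resource()
    {
        // only complain about instances that were fully constructed
        if (m_isConstructed)
            Debug.Fail("Resource was not disposed: " + m_name);
    }

    public void Dispose()
    {
        // a disposed instance no longer needs its finalizer
        GC.SuppressFinalize(this);
    }

    readonly string m_name;
    readonly bool m_isConstructed;
}
```

An alternative with the same effect would be to call GC.SuppressFinalize(this) from a catch block in the constructor before rethrowing.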

Posted by Ed Ball at 03:49 PM

April 05, 2008

“Memory leak” with BitmapImage and MemoryStream

The code snippet below has a small “memory leak”:

BitmapImage bitmap = new BitmapImage();

 

byte[] buffer = GetHugeByteArray(); // from some external source

using (MemoryStream stream = new MemoryStream(buffer, false))

{

    bitmap.BeginInit();

    bitmap.CacheOption = BitmapCacheOption.OnLoad;

    bitmap.StreamSource = stream;

    bitmap.EndInit();

    bitmap.Freeze();

}

 

// use bitmap...

The BitmapImage keeps a reference to the source stream (presumably so that you can read the StreamSource property at any time), so it keeps the MemoryStream object alive. Unfortunately, even though MemoryStream.Dispose has been invoked, it doesn't release the byte array that the memory stream wraps. So, in this case, bitmap is referencing stream, which is referencing buffer, which may be taking up a lot of space on the large object heap. Note that there isn't a true memory leak; when there are no more references to bitmap, all these objects will (eventually) be garbage collected. But since bitmap has already made its own private copy of the image (for rendering), it seems rather wasteful to have the now-unnecessary original copy of the bitmap still in memory.

The solution here is fairly straightforward: create an implementation of Stream that wraps another stream (in this example, the MemoryStream). The Dispose method of this wrapper class needs to release the wrapped stream, so that it can be garbage collected. Once the BitmapImage is initialised with this wrapper stream, the wrapper stream can be disposed, releasing the underlying stream, and allowing the large byte array itself to be freed.
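A minimal sketch of such a wrapper might look like the following. The WrappingStream name and the decision to dispose the wrapped stream as well as release it are my assumptions for this example; only the standard abstract members of Stream are forwarded.

```csharp
using System;
using System.IO;

public sealed class WrappingStream : Stream
{
    public WrappingStream(Stream stream)
    {
        if (stream == null)
            throw new ArgumentNullException("stream");
        m_stream = stream;
    }

    public override bool CanRead { get { return m_stream != null && m_stream.CanRead; } }
    public override bool CanSeek { get { return m_stream != null && m_stream.CanSeek; } }
    public override bool CanWrite { get { return m_stream != null && m_stream.CanWrite; } }
    public override long Length { get { return Inner.Length; } }

    public override long Position
    {
        get { return Inner.Position; }
        set { Inner.Position = value; }
    }

    public override void Flush() { Inner.Flush(); }
    public override int Read(byte[] buffer, int offset, int count) { return Inner.Read(buffer, offset, count); }
    public override long Seek(long offset, SeekOrigin origin) { return Inner.Seek(offset, origin); }
    public override void SetLength(long value) { Inner.SetLength(value); }
    public override void Write(byte[] buffer, int offset, int count) { Inner.Write(buffer, offset, count); }

    protected override void Dispose(bool disposing)
    {
        if (disposing && m_stream != null)
        {
            m_stream.Dispose();
            // drop the reference so the wrapped stream (and its buffer) can be collected
            m_stream = null;
        }
        base.Dispose(disposing);
    }

    // Returns the wrapped stream, or throws if this wrapper has been disposed.
    private Stream Inner
    {
        get
        {
            if (m_stream == null)
                throw new ObjectDisposedException(GetType().Name);
            return m_stream;
        }
    }

    Stream m_stream;
}
```

In the snippet at the top of this post, the using statement would then create new WrappingStream(new MemoryStream(buffer, false)) and assign that to StreamSource. The BitmapImage still holds the wrapper, but after Dispose the wrapper no longer holds the MemoryStream or its byte array.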

Posted by Bradley Grainger at 12:40 PM