« Detecting Bindings that should be OneTime | Main | Concurrency Pre-Con Highlights »

October 22, 2008

How to Reverse a Unicode String in C#

Perhaps due to the lack of a built-in String.Reverse method in the .NET Framework, it's very common (1, 2, 3, 4, 5, 6, 7) for implementations of such a method to be posted.

Unfortunately, most of these implementations do not handle characters outside Unicode's Basic Multilingual Plane correctly. These supplementary characters have code points between U+10000 and U+10FFFF and so cannot be represented with one 16-bit char. In UTF-16 (which is how .NET strings are encoded), these Unicode characters are represented as two C# chars, a high surrogate followed by a low surrogate. When the string is reversed, the order of these two chars has to be preserved.

Here's our method that reverses a string while handling surrogate code units correctly:

/// <summary>

/// Reverses the specified string.

/// </summary>

/// <param name="input">The string to reverse.</param>

/// <returns>The input string, reversed.</returns>

/// <remarks>This method correctly reverses strings containing supplementary characters

/// (which are encoded with two surrogate code units).</remarks>

public static string Reverse(this string input)

{

    if (input == null)

        throw new ArgumentNullException("input");

 

    // allocate a buffer to hold the output

    char[] output = new char[input.Length];

    for (int outputIndex = 0, inputIndex = input.Length - 1; outputIndex < input.Length; outputIndex++, inputIndex--)

    {

        // check for surrogate pair

        if (input[inputIndex] >= 0xDC00 && input[inputIndex] <= 0xDFFF &&

            inputIndex > 0 && input[inputIndex - 1] >= 0xD800 && input[inputIndex - 1] <= 0xDBFF)

        {

            // preserve the order of the surrogate pair code units

            output[outputIndex + 1] = input[inputIndex];

            output[outputIndex] = input[inputIndex - 1];

            outputIndex++;

            inputIndex--;

        }

        else

        {

            output[outputIndex] = input[inputIndex];

        }

    }

 

    return new string(output);

}

Posted by Bradley Grainger at October 22, 2008 9:07 PM

Trackback Pings

TrackBack URL for this entry:
http://blog.logos.com/mt-cgi/mt-tb.cgi/261

Comments

Thanks Bradley - how does the speed of it compare to the others? Are there ways to optimize this?

Posted by: Guy Ellis at October 22, 2008 9:35 PM

I haven't profiled this function for performance--mostly because I've never been reversing strings in a performance-critical section of my code.

I suspect that there would be performance penalties due to the repeated array index checks (that are automatically inserted by the CLR) each time through the loop; it's possible that unsafe code could avoid these checks and run faster.

When writing utility libraries, however, I prefer to stick to safe code (because it may not always be used in a Full Trust application).

Posted by: Bradley Grainger Author Profile Page at October 22, 2008 10:19 PM

Hey don't u think you need to increment outputIndex and decrement inputIndex twice if you get a surrogate pair?

This code
output[outputIndex + 1] = input[inputIndex];
output[outputIndex] = input[inputIndex - 1];
outputIndex++;
inputIndex--;

Posted by: Sid at July 17, 2009 10:31 PM

Yes: the variables are incremented (decremented) once in the if statement and then again in the for statement's loop expression.

Posted by: Bradley Grainger Author Profile Page at July 18, 2009 7:49 AM

Post a comment




(you may use HTML tags for style)