Alternate Lookups in .NET 9.0

Introduction

Buried in the .NET 9 Preview 6 release notes is the bullet point “Collection lookups with spans”. This is an exciting new feature that I’ve wanted in .NET for a long time. (And based on the age of issue #27229, so has the .NET team.)

Motivating Example

Our Faithlife.Utility library provides the StringCache class. If you have a set of strings that will be used repeatedly, you can add them to the StringCache to avoid having duplicate strings in memory. This might happen when reading lines from files, for example. (It’s a little bit like String.Intern but with the benefit that the whole StringCache and all its strings can be garbage collected when no longer needed.)

However, one big problem with StringCache (from an allocation perspective) is that you first have to allocate a string before you can look it up in the cache (to get the cached version). That’s where IAlternateEqualityComparer<TAlternate, T> comes in.

IAlternateEqualityComparer

IAlternateEqualityComparer<TAlternate, T> is a new interface that allows .NET collections (such as Dictionary and HashSet) to look up items in a collection of type T by comparing them to items of type TAlternate. If T is string, a useful TAlternate type is ReadOnlySpan<char>, which lets you look up existing strings without first allocating a new one.

How to Implement

You implement this interface on the equality comparer that the collection is initialized with. In .NET 9 Preview 6, all StringComparer implementations implement IAlternateLookup<ReadOnlySpan<char>, string?>. (EqualityComparer<string>.Default isn’t handled yet, but it will be by the time .NET 9 ships.)

For demonstration purposes, we’ll build a worked example here.

First, let’s build a simple implementation of IEqualityComparer<string> that performs ordinal comparison:

class OurStringComparer : IEqualityComparer<string>
{
    public bool Equals(string? x, string? y) => string.Equals(x, y);

    public int GetHashCode([DisallowNull] string str)
    {
        if (str is null)
            return 0;

        // use the djb2 hash function for simplicity: http://www.cse.yorku.ca/~oz/hash.html
        uint hash = 5381;
        foreach (var ch in str)
            hash = hash * 33u + ch;
        return (int) hash;
    }
}

Next, we’ll implement the new IAlternateEqualityComparer<ReadOnlySpan<char>, string> interface. This adds three new methods (and we’ll rewrite the existing GetHashCode to delegate to the new one):

  • string Create(ReadOnlySpan<char> alternate) — creates a string from the alternate representation; this is used to add a new item (which has to be of type string) to the collection given an alternate representation
  • bool Equals(ReadOnlySpan<char> alternate, string other) — compares the alternate representation to a string; this is used when looking up an item in the collection
  • int GetHashCode(ReadOnlySpan<char> alternate) — computes a hash code for the alternate representation; used for looking up items
class OurStringComparer : IEqualityComparer<string>,
    IAlternateEqualityComparer<ReadOnlySpan<char>, string>
{
    public string Create(ReadOnlySpan<char> alternate) => new(alternate);

    public bool Equals(string? x, string? y) => string.Equals(x, y);

    public bool Equals(ReadOnlySpan<char> alternate, string other) =>
        other?.AsSpan().SequenceEqual(alternate) ?? false;

    public int GetHashCode([DisallowNull] string str) =>
        str is null ? 0 : GetHashCode(str.AsSpan());

    public int GetHashCode(ReadOnlySpan<char> alternate)
    {
        // use the djb2 hash function for simplicity: http://www.cse.yorku.ca/~oz/hash.html
        uint hash = 5381;
        foreach (var ch in alternate)
            hash = hash * 33u + ch;
        return (int) hash;
    }
}

How to Use

The primary mechanism for using this feature is the GetAlternateLookup extension method. This provides a “view” of the collection that’s indexed by the alternate representation. The methods on this “view” will differ based on the underlying collection (e.g., Dictionary vs HashSet) but generally include functions like Contains, TryGetValue, and Add.

This code snippet shows how to use our custom OurStringComparer with HashSet<string>:

// create a HashSet that uses our custom equality comparer and add a string to it
var set = new HashSet<string>(new OurStringComparer());
set.Add("test");

// get a "view" that can look up items by the alternate representation
var spanLookup = set.GetAlternateLookup<string, ReadOnlySpan<char>>();

// look up a string by the alternate representation
ReadOnlySpan<char> test = "test";
Console.WriteLine(spanLookup.Contains(test)); // True

// find the existing string given the alternate representation
spanLookup.TryGetValue(test, out var testString);

// add a new string to the HashSet via the alternate representation
ReadOnlySpan<char> test2 = "test2";
spanLookup.Add(test2);

Console.WriteLine(set.Contains("test2")); // True

UTF-8 Example

The above example is useful if, for example, you had a large char[] and were parsing ReadOnlySpan<char> slices out of it (e.g., by splitting on a delimiter). But in real-world usage, it would be much more likely for the input data to be a byte[] (or ReadOnlySpan<byte>) containing UTF-8 encoded text. Fortunately, IAlternateEqualityComparer is generic over the alternate type, so you can use ReadOnlySpan<byte> as the alternate representation.

We just have to make sure that our GetHashCode method returns the same hash code for the same string, regardless of whether it’s represented as ReadOnlySpan<char> or ReadOnlySpan<byte>. We can no longer process the raw characters or bytes, because é is stored as 0x00E9 in UTF-16, but as 0xC3 0xA9 in UTF-8.

One solution would be to decode every input ReadOnlySpan<byte> to Span<char> and then call the existing GetHashCode method. Alternatively, we can walk the Runes in the input spans and hash those.

For string and ReadOnlySpan<char>, .NET provides EnumerateRunes methods that efficiently return the runes from a string. For the UTF-8 ReadOnlySpan<byte> data, we’ll have to implement our own:

public ref struct Utf8SpanRuneEnumerator
{
    private ReadOnlySpan<byte> _remaining;
    private Rune _current;

    public Utf8SpanRuneEnumerator(ReadOnlySpan<byte> buffer)
    {
        _remaining = buffer;
        _current = default;
    }

    public Rune Current => _current;

    public Utf8SpanRuneEnumerator GetEnumerator() => this;

    public bool MoveNext()
    {
        if (_remaining.IsEmpty)
        {
            _current = default;
            return false;
        }

        Rune.DecodeFromUtf8(_remaining, out _current, out var bytesConsumed);
        _remaining = _remaining.Slice(bytesConsumed);
        return true;
    }
}

Now we can implement a second IAlternateEqualityComparer interface:

class OurStringComparer : IEqualityComparer<string>,
    IAlternateEqualityComparer<ReadOnlySpan<char>, string>,
    IAlternateEqualityComparer<ReadOnlySpan<byte>, string>
{
    // ... same methods as above ...

    public string Create(ReadOnlySpan<byte> alternate) => Encoding.UTF8.GetString(alternate);

    public bool Equals(ReadOnlySpan<byte> alternate, string other)
    {
        // walk the strings one rune at a time, comparing them for equality
        var utf16Enumerator = other.EnumerateRunes();
        var utf8Enumerator = new Utf8SpanRuneEnumerator(alternate);
        while (utf16Enumerator.MoveNext())
        {
            if (!utf8Enumerator.MoveNext())
                return false;
            if (utf16Enumerator.Current != utf8Enumerator.Current)
                return false;
        }
        return !utf8Enumerator.MoveNext();
    }

    public int GetHashCode(ReadOnlySpan<char> alternate)
    {
        uint hash = 5381;
        foreach (var rune in alternate.EnumerateRunes())
            hash = hash * 33u + (uint) rune.Value;
        return (int) hash;
    }

    public int GetHashCode(ReadOnlySpan<byte> alternate)
    {
        uint hash = 5381;
        foreach (var rune in new Utf8SpanRuneEnumerator(alternate))
            hash = hash * 33u + (uint) rune.Value;
        return (int) hash;
    }
}

It can be used as follows (adding on to the previous example):

// get a lookup for UTF-8 bytes
var utf8SpanLookup = set.GetAlternateLookup<string, ReadOnlySpan<byte>>();

// look up UTF-8 bytes in the HashSet (without allocating a string)
Console.WriteLine(utf8SpanLookup.Contains("test"u8)); // True

// add a new string from UTF-8 bytes
utf8SpanLookup.Add("\U0001F600"u8);
Console.WriteLine(spanLookup.Contains("\U0001F600")); // True
Console.WriteLine(set.Contains("\U0001F600")); // True

Summary

The new IAlternateEqualityComparer<TAlternate, T> interface in .NET 9.0 is a powerful tool for reducing allocations in your code. It allows you to look up items in collections without having to allocate a new object of type T. This can be especially useful when working with strings, because you can use ReadOnlySpan<char> or ReadOnlySpan<byte> as the alternate representation. By implementing this interface on your custom equality comparers, you can take advantage of this feature in your own code.

Posted by Bradley Grainger on July 18, 2024