Buried in the .NET 9 Preview 6 release notes is the bullet point “Collection lookups with spans”. This is an exciting new feature that I’ve wanted in .NET for a long time. (And based on the age of issue #27229, so has the .NET team.)
Our Faithlife.Utility library provides the StringCache
class.
If you have a set of strings that will be used repeatedly, you can add them to the StringCache
to avoid having duplicate strings in memory.
This might happen when reading lines from files, for example.
(It’s a little bit like String.Intern
but with the benefit that the whole StringCache
and all its strings can be garbage collected when no longer needed.)
However, one big problem with StringCache
(from an allocation perspective) is that you first have to allocate a string
before you can look it up in the cache (to get the cached version).
That’s where IAlternateEqualityComparer<TAlternate, T>
comes in.
IAlternateEqualityComparer<TAlternate, T>
is a new interface that allows .NET collections (such as Dictionary
and HashSet
) to look up items in a collection of type T
by comparing them to items of type TAlternate
.
If T
is string
, a useful TAlternate
type is ReadOnlySpan<char>
, which lets you look up existing strings without first allocating a new one.
You implement this interface on the equality comparer that the collection is initialized with.
In .NET 9 Preview 6, all StringComparer
implementations implement IAlternateLookup<ReadOnlySpan<char>, string?>
.
(EqualityComparer<string>.Default
isn’t handled yet, but it will be by the time .NET 9 ships.)
For demonstration purposes, we’ll build a worked example here.
First, let’s build a simple implementation of IEqualityComparer<string>
that performs ordinal comparison:
class OurStringComparer : IEqualityComparer<string>
{
public bool Equals(string? x, string? y) => string.Equals(x, y);
public int GetHashCode([DisallowNull] string str)
{
if (str is null)
return 0;
// use the djb2 hash function for simplicity: http://www.cse.yorku.ca/~oz/hash.html
uint hash = 5381;
foreach (var ch in str)
hash = hash * 33u + ch;
return (int) hash;
}
}
Next, we’ll implement the new IAlternateEqualityComparer<ReadOnlySpan<char>, string>
interface.
This adds three new methods (and we’ll rewrite the existing GetHashCode
to delegate to the new one):
string Create(ReadOnlySpan<char> alternate)
— creates a string
from the alternate representation; this is used to add a new item (which has to be of type string
) to the collection given an alternate representationbool Equals(ReadOnlySpan<char> alternate, string other)
— compares the alternate representation to a string
; this is used when looking up an item in the collectionint GetHashCode(ReadOnlySpan<char> alternate)
— computes a hash code for the alternate representation; used for looking up itemsclass OurStringComparer : IEqualityComparer<string>,
IAlternateEqualityComparer<ReadOnlySpan<char>, string>
{
public string Create(ReadOnlySpan<char> alternate) => new(alternate);
public bool Equals(string? x, string? y) => string.Equals(x, y);
public bool Equals(ReadOnlySpan<char> alternate, string other) =>
other?.AsSpan().SequenceEqual(alternate) ?? false;
public int GetHashCode([DisallowNull] string str) =>
str is null ? 0 : GetHashCode(str.AsSpan());
public int GetHashCode(ReadOnlySpan<char> alternate)
{
// use the djb2 hash function for simplicity: http://www.cse.yorku.ca/~oz/hash.html
uint hash = 5381;
foreach (var ch in alternate)
hash = hash * 33u + ch;
return (int) hash;
}
}
The primary mechanism for using this feature is the GetAlternateLookup
extension method.
This provides a “view” of the collection that’s indexed by the alternate representation.
The methods on this “view” will differ based on the underlying collection (e.g., Dictionary
vs HashSet
) but generally include functions like Contains
, TryGetValue
, and Add
.
This code snippet shows how to use our custom OurStringComparer
with HashSet<string>
:
// create a HashSet that uses our custom equality comparer and add a string to it
var set = new HashSet<string>(new OurStringComparer());
set.Add("test");
// get a "view" that can look up items by the alternate representation
var spanLookup = set.GetAlternateLookup<string, ReadOnlySpan<char>>();
// look up a string by the alternate representation
ReadOnlySpan<char> test = "test";
Console.WriteLine(spanLookup.Contains(test)); // True
// find the existing string given the alternate representation
spanLookup.TryGetValue(test, out var testString);
// add a new string to the HashSet via the alternate representation
ReadOnlySpan<char> test2 = "test2";
spanLookup.Add(test2);
Console.WriteLine(set.Contains("test2")); // True
The above example is useful if, for example, you had a large char[]
and were parsing ReadOnlySpan<char>
slices out of it (e.g., by splitting on a delimiter).
But in real-world usage, it would be much more likely for the input data to be a byte[]
(or ReadOnlySpan<byte>
) containing UTF-8 encoded text.
Fortunately, IAlternateEqualityComparer
is generic over the alternate type, so you can use ReadOnlySpan<byte>
as the alternate representation.
We just have to make sure that our GetHashCode
method returns the same hash code for the same string, regardless of whether it’s represented as ReadOnlySpan<char>
or ReadOnlySpan<byte>
.
We can no longer process the raw characters or bytes, because é is stored as 0x00E9 in UTF-16, but as 0xC3 0xA9 in UTF-8.
One solution would be to decode every input ReadOnlySpan<byte>
to Span<char>
and then call the existing GetHashCode
method.
Alternatively, we can walk the Runes in the input spans and hash those.
For string
and ReadOnlySpan<char>
, .NET provides EnumerateRunes
methods that efficiently return the runes from a string.
For the UTF-8 ReadOnlySpan<byte>
data, we’ll have to implement our own:
public ref struct Utf8SpanRuneEnumerator
{
private ReadOnlySpan<byte> _remaining;
private Rune _current;
public Utf8SpanRuneEnumerator(ReadOnlySpan<byte> buffer)
{
_remaining = buffer;
_current = default;
}
public Rune Current => _current;
public Utf8SpanRuneEnumerator GetEnumerator() => this;
public bool MoveNext()
{
if (_remaining.IsEmpty)
{
_current = default;
return false;
}
Rune.DecodeFromUtf8(_remaining, out _current, out var bytesConsumed);
_remaining = _remaining.Slice(bytesConsumed);
return true;
}
}
Now we can implement a second IAlternateEqualityComparer
interface:
class OurStringComparer : IEqualityComparer<string>,
IAlternateEqualityComparer<ReadOnlySpan<char>, string>,
IAlternateEqualityComparer<ReadOnlySpan<byte>, string>
{
// ... same methods as above ...
public string Create(ReadOnlySpan<byte> alternate) => Encoding.UTF8.GetString(alternate);
public bool Equals(ReadOnlySpan<byte> alternate, string other)
{
// walk the strings one rune at a time, comparing them for equality
var utf16Enumerator = other.EnumerateRunes();
var utf8Enumerator = new Utf8SpanRuneEnumerator(alternate);
while (utf16Enumerator.MoveNext())
{
if (!utf8Enumerator.MoveNext())
return false;
if (utf16Enumerator.Current != utf8Enumerator.Current)
return false;
}
return !utf8Enumerator.MoveNext();
}
public int GetHashCode(ReadOnlySpan<char> alternate)
{
uint hash = 5381;
foreach (var rune in alternate.EnumerateRunes())
hash = hash * 33u + (uint) rune.Value;
return (int) hash;
}
public int GetHashCode(ReadOnlySpan<byte> alternate)
{
uint hash = 5381;
foreach (var rune in new Utf8SpanRuneEnumerator(alternate))
hash = hash * 33u + (uint) rune.Value;
return (int) hash;
}
}
It can be used as follows (adding on to the previous example):
// get a lookup for UTF-8 bytes
var utf8SpanLookup = set.GetAlternateLookup<string, ReadOnlySpan<byte>>();
// look up UTF-8 bytes in the HashSet (without allocating a string)
Console.WriteLine(utf8SpanLookup.Contains("test"u8)); // True
// add a new string from UTF-8 bytes
utf8SpanLookup.Add("\U0001F600"u8);
Console.WriteLine(spanLookup.Contains("\U0001F600")); // True
Console.WriteLine(set.Contains("\U0001F600")); // True
The new IAlternateEqualityComparer<TAlternate, T>
interface in .NET 9.0 is a powerful tool for reducing allocations in your code.
It allows you to look up items in collections without having to allocate a new object of type T
.
This can be especially useful when working with strings, because you can use ReadOnlySpan<char>
or ReadOnlySpan<byte>
as the alternate representation.
By implementing this interface on your custom equality comparers, you can take advantage of this feature in your own code.
Posted by Bradley Grainger on July 18, 2024