Jon Skeet recently discussed what it means to be “lazy” (in the context of LINQ to Objects). He introduced some new definitions based on how the operator processes the source sequence. Another aspect (which Jon didn’t cover) is how lazily the operator produces output when its results are enumerated.
Let’s take a concrete example: what’s the difference in running time between these two statements?
var count1 = bigCollection.Count(); var count2 = bigCollection.Take(10).Count();
Assuming “bigCollection” doesn’t implement ICollection
var count1 = bigCollection.OrderBy(x => x).Count(); var count2 = bigCollection.OrderBy(x => x).Take(10).Count();
The running time of both these statements is actually about the same. Even though the OrderBy documentation states that “This method is implemented by using deferred execution. … The query represented by this method is not executed until the object is enumerated”, what that actually means is that the sort isn’t performed until the first call to MoveNext on the enumerator returned by OrderBy. But when that first call to MoveNext happens, the entire input is sorted, even though only the smallest ten items are required.
There’s no reason this needs to be the case; Wikipedia has an extensive discussion on efficiently selecting the k smallest elements from a collection. Quicksort is a natural choice for implementing such an algorithm: since it divides the array in half around a pivot element, it’s easy to defer sorting the right-hand side (i.e., elements greater than the pivot) until we know that the consumer needs them.
yield statement will make implementing a lazy algorithm easy, but
quicksort is typically implemented recursively, which is hard to combine with
yield (without using foreach statements that simply yield the recursive
output). Instead, we can use a stack to maintain the state of the recursion
and implement it in an iterative fashion. As we divide the array in half, we
push the pieces that are not yet sorted onto a stack. When we detect that a
section of the array is completely sorted, we yield those elements to the
caller. This ensures that the sorting work is not done until it’s needed.
In practice, this implementation performs very well; the following graph shows
how long it takes to call
.OrderBy(x => x).Take(100) on arrays (of integers)
of various sizes. The blue line shows the performance of the standard .NET
implementation, while the red line shows how the new lazy implementation out-
Not only that, but the algorithm is no slower if you end up enumerating the
entire sequence (and thus sorting every element). The full source code is
available at GitHub
(specifically in the OrderedEnumerable.cs
and ElementComparer.cs files);
the following excerpt illustrates the core
concepts. (Note that this simplified version doesn’t maintain compatibility
OrderBy: specifically, the sort is not stable, and you can’t chain a
.ThenBy to the resulting sequence. The full implementation at GitHub does
not have these limitations.)
Posted by Bradley Grainger on April 08, 2010