Don't let your babies grow up to be microbenchmarkers!

I never wanted to be a microbenchmarker. I was OK with a modest but respectable career choice, such as programming in RPG II or IBM 1604 assembler.

But then I went and wrote several posts on the machine-language efficiency of the code compiled from several STL-based idioms. Which pretty much demanded a follow-up actually comparing the performance of the various algorithms in practice. Which in turn meant microbenchmarks. And here we are.

Microbenchmarks are influenced by many sources of variability

The pitfalls of microbenchmarks are legion. Oh, it's easy enough to run a few timing tests and compute some comparative measure, but how broadly can you generalize from those numbers? There are plenty of sources of variability underlying a microbenchmark, as the example below demonstrates.

An example

Recently, I microbenchmarked two C++ implementations of a simple text manipulation problem (the one given to Albino Tonnina for his job interview). One was an O(n²) iterative algorithm featuring a pair of nested loops, the other an O(n) algorithm using a hash table. For my initial benchmarks, I used a test dataset with n=740 and ran them under nonius, which computes confidence intervals for the mean and standard deviation using the accelerated bias-corrected bootstrap, a nonparametric estimator.
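
For concreteness, here is a minimal sketch of the general shape of the two implementations. The stand-in task (reporting which query strings appear in a corpus), the function names, and the data layout are all illustrative assumptions rather than the actual interview problem, but the structural contrast is the same: nested linear scans versus a single pass over a hash table built up front.

    #include <string>
    #include <unordered_set>
    #include <vector>

    // Hypothetical stand-in task: report which query strings appear in a corpus.

    // O(n^2) shape: for every query, scan the entire corpus.
    std::vector<std::string> find_nested(const std::vector<std::string>& corpus,
                                         const std::vector<std::string>& queries) {
        std::vector<std::string> found;
        for (const auto& q : queries) {
            for (const auto& w : corpus) {
                if (w == q) {
                    found.push_back(q);
                    break;
                }
            }
        }
        return found;
    }

    // O(n) shape: build a hash table once, then do (amortized) constant-time lookups.
    std::vector<std::string> find_hashed(const std::vector<std::string>& corpus,
                                         const std::vector<std::string>& queries) {
        const std::unordered_set<std::string> index(corpus.begin(), corpus.end());
        std::vector<std::string> found;
        for (const auto& q : queries) {
            if (index.count(q) != 0) {
                found.push_back(q);
            }
        }
        return found;
    }

Each implementation would be registered as a separate nonius benchmark, so that the two means and their confidence intervals could be compared directly.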

The initial implementations performed broadly as their respective asymptotic behaviour suggested: On this dataset, the hash table implementation was typically about 100 times as fast as the nested-loop implementation.

But I wanted more than simple confirmation of their expected asymptotic performance. As each implementation was a first draft, I wanted to remove inefficiencies and see how the performance of each improved. The uftrace user-space tracer revealed a number of extraneous string constructor calls in the nested-loop algorithm, as well as some superfluous calls to other expensive routines.
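
To illustrate the kind of inefficiency such a trace exposes (the actual call sites in my code are not shown here, so treat this as a hypothetical pattern rather than the real change): a helper that takes a std::string by value constructs a fresh string on every call, and in an inner loop those constructor calls add up. Passing by const reference makes them disappear from the trace.

    #include <string>

    // Hypothetical "before": pass-by-value means every call runs a
    // std::string copy constructor, which shows up in the uftrace output.
    bool contains_by_value(std::string needle, const std::string& line) {
        return line.find(needle) != std::string::npos;
    }

    // Hypothetical "after": a const reference avoids the extra constructor
    // call without changing the function's behaviour.
    bool contains_by_ref(const std::string& needle, const std::string& line) {
        return line.find(needle) != std::string::npos;
    }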

The first few performance fixes for the nested-loop implementation yielded sufficiently large improvements that it became only 2–3 times as slow as the hash table implementation. At this point I hit a wall: I could no longer get repeatable benchmark results. Consecutive runs of the benchmark would yield means that rarely fell within each other's computed confidence intervals.

Here are some runs using the nonius default of 100 samples per run (times in microseconds):

Mean     95% CI lower bound    95% CI upper bound
662.5    660.6                 666.0
675.0    671.9                 680.3
678.5    669.3                 698.9
742.7    738.0                 749.3

Of the six pairwise combinations of runs, only one pair has means that lie within each other's confidence intervals (678.5 lies in 675.0's interval, and 675.0 lies in 678.5's); none of the other means falls within the interval estimated for any other run.

How do I interpret this variation?

The bootstrap method is powerful but requires care. It can be sensitive to sample size (the nonius default of 100 samples may well be too small). Rerunning the benchmark with 1000 samples yields estimates within the range of those above, but no greater consistency between means and confidence intervals, though the intervals themselves tend to be tighter.

I am unsure how to interpret these numbers. The confidence intervals are computed from the sample of runs, and so each is subject to the particularities of its underlying sample. If a sample happens to include values from the extreme low or high tail (and for timing data, extremely high values are far more likely than extremely low ones, since timings are bounded below but not above), those values will skew the computed interval. The intervals will not necessarily overlap; some distance between them is expected. But how far apart is too far? How tight must the means be for us to consider them “consistent”?
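
To make the mechanics concrete, here is a toy percentile bootstrap for the mean of a sample of timings. It is a deliberate simplification: nonius uses the accelerated bias-corrected variant, which adjusts the percentiles for bias and skewness, but the core idea (resample the observed runs with replacement, recompute the statistic each time, and read the interval off the distribution of recomputed statistics) is the same. The function and parameter names here are my own, not nonius's.

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    double mean(const std::vector<double>& xs) {
        return std::accumulate(xs.begin(), xs.end(), 0.0) / xs.size();
    }

    // Toy percentile bootstrap: resample the observed timings with
    // replacement, recompute the mean of each resample, and take the 2.5th
    // and 97.5th percentiles of those means as a 95% confidence interval.
    // (nonius uses the accelerated bias-corrected variant instead, which
    // corrects these percentiles for bias and skewness.)
    std::pair<double, double> bootstrap_ci_95(const std::vector<double>& timings,
                                              int resamples = 100000) {
        std::mt19937 gen{std::random_device{}()};
        std::uniform_int_distribution<std::size_t> pick(0, timings.size() - 1);

        std::vector<double> means;
        means.reserve(resamples);
        for (int r = 0; r < resamples; ++r) {
            std::vector<double> resample;
            resample.reserve(timings.size());
            for (std::size_t i = 0; i < timings.size(); ++i)
                resample.push_back(timings[pick(gen)]);
            means.push_back(mean(resample));
        }
        std::sort(means.begin(), means.end());
        return {means[static_cast<std::size_t>(0.025 * resamples)],
                means[static_cast<std::size_t>(0.975 * resamples)]};
    }

A resample that happens to draw several of the slow outliers produces a noticeably higher mean, which is how the upper bound of the interval gets stretched; and with only 100 observed samples, how many outliers the original sample contains is itself quite variable, which goes some way towards explaining why consecutive intervals land in different places.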

What are microbenchmarks good for?

Ultimately, the microbenchmarks are consistent enough to indicate that the nested-loop implementation is substantially slower than the hash table implementation for this dataset. But they are not consistent enough to say precisely how much faster the hash table implementation is. When I get such variance between benchmark runs, which means do I present as the “results”?

I’ve wrestled with this topic for months, revising this post. I don’t have a firm conclusion. Microbenchmarks are variable, difficult to interpret, even unreliable, but what else do we have? My goal was to gain a sense of the relative performance of several C++ features, as implemented by a specific compiler and standard library, and I achieved that. The microbenchmarks could distinguish substantial performance improvements from trivial ones, a useful result. But in the end they were incapable of making a definitive estimate of the sizes of those improvements. Perhaps I should just set my expectations to a more realistic level.