Microbenchmarks are highly sensitive to the heap context!
10 Jun 2017 Tags: C++
Over two weeks ago, I posted an article about by-value parameters in the loop and transform idioms. Included in one of the plots was an anomaly that, when I considered it more closely, blew up the analysis. In fact, I’ve spent most of the past two weeks following up on the implications of this seemingly small anomaly. Ultimately, it’s given me reason to question the reliability of microbenchmarks in a wide range of circumstances. And after two weeks of work, using a variety of tools, I remain unclear about the underlying phenomenon. I’m writing a series of posts to clarify what I’ve learned to date and to set my future agenda.
The anomaly
But this is getting ahead of the story. I’ll begin with the results that started it all: The upper-left corner plot in the overview section of that two-week-old post. I’ve isolated the relevant data in the following plot, together with some new replication runs that demonstrate the effect is reliable:
The plot compares the results for the two versions of the loop, which differ only in using either emplace_back() or push_back() in their body. The by-value version of the loop parameter was used in both cases:
// Append each qualifying pair to res, constructing the new element in place.
std::size_t str_loop_emplace (plist& res) {
    assert ( ! res.size());
    for (auto p : src) {
        if (p.first != exclude)
            res.emplace_back(p.first, p.second*p.second);
    }
    return res.size();
}

// The same loop, but construct a temporary ppair and push_back() it.
std::size_t str_loop_push (plist& res) {
    assert ( ! res.size());
    for (auto p : src) {
        if (p.first != exclude)
            res.push_back(ppair{p.first, p.second*p.second});
    }
    return res.size();
}
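These fragments rely on definitions that live elsewhere in the benchmark harness. For readers following along, here is a minimal sketch of what they might look like; the names plist, ppair, src, and exclude come from the post, but the element type std::pair<std::string, int> is my assumption rather than the post’s actual definition:

#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Assumed shapes of the benchmark's types and globals (hypothetical).
using ppair = std::pair<std::string, int>;   // long-string key plus a numeric payload
using plist = std::vector<ppair>;

plist src;            // source data set, filled in by the benchmark setup
std::string exclude;  // key value whose elements are *not* copied to the result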
Each version has been compiled with -O2 and run twice on random data sets, with between 0% and 100% of the source data set copied to the result.
At first glance, the result wasn’t surprising. The loop using push_back() was slower, which I would expect, given that it constructs an extra pair compared to the one using emplace_back(). This is borne out by an analysis of the generated object code. Summarizing their object code in pseudo-C:
// str_loop_emplace_long_by_copy:
goto begin
loop: V <- src_i->second ^ 2
if (res.cap_end() == res_i) goto expand_capacity [never taken]
res_i->first <- p.first
res_i->second <- V
++res_i
next: p.first.~string()
if (++src_i == src.end()) goto end
begin: p <- *src_i
if (exclude != p.first) goto loop
else goto next
end:
// str_loop_push_long_by_copy:
goto begin
loop: T.first <- p.first
T.second <- src_i->second ^ 2
res.emplace_back(T)
T.first.~string()
next: p.first.~string()
if (++src_i == src.end()) goto end
begin: p <- *src_i
if (exclude != p.first) goto loop
else goto next
end:
Spot the problem? I’ll take solace if you didn’t; I looked at the original plot for weeks before seeing it myself.
Here’s the anomaly: For a data set in which no values are copied (a 0% data set), the executed code is identical, but in the above plot the push_back() loop is 75% slower on that data set. (Note that the push_back() loop could be slower on other data sets that actually require copying data. The two loops only execute identical code for the 0% case.)
The order effect
Results such as this strongly suggest an order effect: The emplace_back() loop is faster because it is executed first. Given that the nonius benchmarking framework executes complex code as it analyses the results after each benchmark, there could easily be such an effect. And in fact, when the order of the two benchmarks is reversed, the push_back() loop becomes faster than the emplace_back() loop.
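The post doesn’t show the benchmark driver, but the consecutive structure is roughly of the following shape. This is a sketch using a hypothetical hand-rolled timer in place of nonius; run_samples and the sample count are my illustration, building on the declarations sketched earlier. The point is that both benchmarks share one process and therefore one heap, so the second benchmark inherits whatever allocation pattern the first benchmark and its analysis left behind:

#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical mini-harness: time one benchmark over many samples.
template <typename F>
std::vector<double> run_samples(F&& bench, std::size_t samples) {
    std::vector<double> times;
    times.reserve(samples);
    for (std::size_t i = 0; i != samples; ++i) {
        plist res;                       // fresh, empty result each sample
        auto t0 = std::chrono::steady_clock::now();
        bench(res);                      // str_loop_emplace or str_loop_push
        auto t1 = std::chrono::steady_clock::now();
        times.push_back(std::chrono::duration<double>(t1 - t0).count());
    }                                    // res is destroyed each iteration,
    return times;                        // reshaping the heap for whatever runs next
}

int main() {
    // Consecutive structure: both benchmarks run in one process, one heap.
    auto em = run_samples(str_loop_emplace, 100);
    auto pu = run_samples(str_loop_push, 100);   // sees the heap state 'em' left behind
    // ... statistical analysis of em and pu ...
}

In the revised structure described next, a process invocation makes only one such call before doing its analysis, so no benchmark ever sees another benchmark’s heap.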
I revised the structure of the benchmark calls so that a given run executes only a single benchmark, followed by computing the analysis. When the loops are re-benchmarked under the revised structure, we get the following results (new values in right column of each lane, in green):
The emplace_back() loop has the same times whether run individually or first in a sequence, whereas the push_back() loop is much slower than the other loop when run second in a sequence but achieves comparable speed when measured in its own run. A similarly dramatic speedup occurs (shown later) when the transform idiom is measured in its own run rather than following the loop benchmarks. Given such clear evidence of an order effect, the benchmarks should all be run individually rather than in sequence.
The confounding effect of order means that all the microbenchmark results in my previous posts in this series should be considered tentative until they can be replicated using individual runs. Most likely only the long-string, by-value runs will require revision, but this needs to be verified. I will do those tests later. For now, let us continue exploring the order effect.
The likely source of the order effect
When does order matter? The effect doesn’t appear on any runs with short or unique strings, nor on runs with long strings and by-reference parameters; it appears only on runs with long strings and by-value parameters. These runs make the heaviest use of the heap: short and unique strings do not use the heap at all, and the by-reference code for long strings uses the heap far less than the by-value code does.
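The reason short strings never touch the heap is the small-string optimization: the string body is stored inline in the string object itself until it exceeds the implementation’s buffer (15 characters in libstdc++, 22 in libc++). A quick way to see this is the standard trick of checking whether data() points inside the string object; this is my own illustration, not code from the benchmark:

#include <iostream>
#include <string>

// True when the string body is stored inline (SSO), false when it is on the heap.
bool body_is_inline(const std::string& s) {
    const char* obj = reinterpret_cast<const char*>(&s);
    return s.data() >= obj && s.data() < obj + sizeof(s);
}

int main() {
    std::cout << body_is_inline(std::string(5,  'x')) << '\n';   // short string: 1 (inline)
    std::cout << body_is_inline(std::string(27, 'x')) << '\n';   // long string:  0 (heap)
}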
The traditional first response to an outcome like this is “heap fragmentation”. But when considered in detail, fragmentation doesn’t seem to be a likely cause of this slowdown. Fragmentation has two observable outcomes:
- An allocation request fails despite there being sufficient available space to satisfy it (because that available space is spread across multiple smaller blocks, none of which is individually large enough), or
- Allocation requests become progressively slower (because the allocator has to consider progressively more available blocks before finding one sufficiently large for the request).
The first result is manifestly not occurring—all requests succeed. The second result is unlikely to arise in these benchmarks because, for our test data sets, every long string body is exactly 27 bytes long. Thus any previously-freed block for a string body is sufficient to satisfy a new request.
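That uniformity is easy to confirm by instrumenting the global allocation function and watching the request sizes. This is a quick hack of my own for illustration, not the tooling used for the benchmarks; every 27-character body produces an identical request, so a block freed by one string can always be reused by the next:

#include <cstdio>
#include <cstdlib>
#include <new>
#include <string>

// Log every heap request made through operator new (illustration only).
void* operator new(std::size_t n) {
    std::printf("operator new(%zu)\n", n);
    if (void* p = std::malloc(n)) return p;
    throw std::bad_alloc{};
}

int main() {
    std::string a(27, 'a');   // each of these prints the same request size...
    std::string b(27, 'b');
    std::string c(27, 'c');   // ...so any freed body fits the next allocation
}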
If classical heap fragmentation is not the source of the order effect, what is? The cause most likely lies in some interaction between the heap layout and the sequence of heap requests, just not in the sequence of progressively larger requests that induces classical fragmentation. The cause of the slowdown seen in these benchmarks is related but more subtle.
Consider the following plot:
The left-hand facet (BY_REF=Ref) shows the original by-reference results, run using the old structure, with all four benchmarks consecutive in a single run. The right-hand facet (BY_REF=Copy) shows the by-value results, including both the older consecutive structure (blue) and the newer individual structure (green). The plot compares results for the loop with emplace_back() and the transform idiom with push_back(). The two idioms are run on four data sets, including the random (R) and sorted (S) sets used in previous benchmarks, together with two new data sets, the shuffled (H) and in-situ sorted (I). We’ll get to the new data sets in a bit; for now, we’ll focus on the random and sorted data sets, specifically the left four lanes (lp_em(R), lp_em(S)) on the right facet.
These four lanes are timings for the emplace_back() loop, for random and sorted data sets, run first in a consecutive run (followed by three other benchmarks) and individually (no benchmarks following). Previous results suggest that:
- The times for a first-in-consecutive run and an individual run on the same data set should be identical. There is no prior benchmark affecting the heap.
- For the random and sorted data sets, the times for 0% and 100% copies should be the same, but the times for 25%, 50%, and 75% copies should be shorter for sorted data.
But these are not the patterns that we see in the above plot. Instead:
- For sorted data, the individual runs are slower than the first-in-consecutive runs, despite the nominal similarity of the setup preceding them.
- For individual runs, sorted data sets with 75% and 100% copies are slower than random data sets, despite the branch prediction benefits of sorted data.
Both of these results seem to stem from heap effects:
- The setup sequence for the individual runs is not identical to that for consecutive runs but in fact does more heap operations. This seems to slow down the benchmark code that follows, at least on sorted data.
- The sorted data set is created by applying std::sort to the random data set. Moving the source array values around seems to slow down the code, at least when it is copying large parts of the source array into the result (75% and 100%).
In short, performing more heap operations before running the benchmarks seems to slow the code. To test this with more extreme cases, I added two new data sets:
- Shuffled (H): The random data set is fully shuffled via std::shuffle. As a full shuffle, this does many more moves than a sort on a data set with only two distinct values.
- In-situ sorted (I): Where the "sorted" data set was created by sorting the random data set, which required moving some values, this data set was created by building the values in sorted order directly in their original locations. Consequently, no element in this data set has been moved. (A sketch of how the four data sets might be constructed follows this list.)
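The post doesn’t show the data-set construction code, but based on the descriptions above it is roughly of the following shape. This is my reconstruction, building on the declarations sketched earlier: build_data_sets, the kept key value, and the exact element counts are assumptions, and the real harness presumably parameterizes the copy fraction differently:

#include <algorithm>
#include <random>

// Hypothetical reconstruction of the four source data sets described above.
struct data_sets { plist random, sorted, shuffled, insitu; };

data_sets build_data_sets(std::size_t n, double copy_fraction) {
    std::mt19937 gen{42};
    std::bernoulli_distribution keep{copy_fraction};
    const std::string kept(27, 'k');   // hypothetical non-excluded key, 27-byte body

    data_sets d;
    // Random (R): keys chosen at random between the kept value and exclude.
    d.random.reserve(n);
    for (std::size_t i = 0; i != n; ++i)
        d.random.emplace_back(keep(gen) ? kept : exclude, static_cast<int>(i));

    // Sorted (S): std::sort applied to a copy of the random set; many elements
    // (and so the handles to their heap-allocated string bodies) get moved.
    d.sorted = d.random;
    std::sort(d.sorted.begin(), d.sorted.end());

    // Shuffled (H): std::shuffle moves essentially every element.
    d.shuffled = d.random;
    std::shuffle(d.shuffled.begin(), d.shuffled.end(), gen);

    // In-situ sorted (I): the same key mix built directly in sorted order, so
    // no element is moved after construction (assumes exclude sorts before kept).
    const std::size_t n_excluded = static_cast<std::size_t>(n * (1.0 - copy_fraction));
    d.insitu.reserve(n);
    for (std::size_t i = 0; i != n; ++i)
        d.insitu.emplace_back(i < n_excluded ? exclude : kept, static_cast<int>(i));
    return d;
}

The essential distinction between the sets is not the values themselves but how many times each element has been moved after its string body was first allocated.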
As you can see in the above plot (lanes suffixed (H) and (I)), the number of element moves is strongly predictive of time: the in-situ sorted data covers the same range as the random data (no moves in either data set), while the shuffled data set (with many moves) is the slowest by far, for both idioms.
What next?
These results highlight two potentially-related effects that arise when the by-value versions of these idioms process strings whose bodies lie on the heap, along with one clarifying observation:
- The number and type of heap operations in the setup modestly affect the idioms' speed.
- The number of times elements in the source vector have been moved in the setup strongly affects the idioms' speed.
- The uniform length of the string bodies (27 bytes) manipulated by the idioms indicates that the timing effects are not due to classical heap fragmentation.
What is the problem and what tools might we use to identify it? That, reader, is the tale to come.