Benchmarking the transform and loop idioms
19 May 2017 Tags: C++

Update: Mon Aug 20, 2018: Also see this semi-rant about the limitations of microbenchmarking.
The last posts compared the machine code generated by the `transform` idiom with that generated by the equivalent idiom using an explicit `for` loop and `if` statement. The `transform` idiom generated more intermediate copies of the `std::string` member than the loop.
As I noted in the posts, code differences indicate the degree to which the optimizer was able to reduce the standard library abstractions to the essential underlying algorithm but only approximately predict differences in performance. The performance effects must ultimately be assessed by benchmarks.
There are a wide range of methods and levels at which benchmarks might be used to compare these two idioms. I will focus on microbenchmarking, timing short code sequences under several conditions of underlying data type and order. This does not predict the effect of choosing these idioms in a full application, where their contribution will typically be small. But it does provide insight into the effectiveness of the different machine code sequences that the optimizer generated for each idiom.
I emphasize that the most important outcome of these benchmarks is the relative performance of the various constructs, not their absolute values. Your performance will certainly vary. See the Caveats and Appendix sections for further discussion of the limits of these results.
Conditions
I ran initial tests exploring the effect of various factors on performance and settled on the following factors as most important:
- Type of the elements: The samples that I presented in earlier posts used elements that were `std::pair`s, with the first member, the `std::string`, used for the filter. The `libstdc++` implementation of a `string` is complex, offering different performance profiles for short strings (15 characters or fewer) versus long strings, and for operations that only use the string length versus those that require accessing the string body. The second member, the `int`, generated simple code that had no performance impact.

  To sample the performance range of string data, I ran microbenchmarks with three different string representations as the first member of the pair:

  - Short: `std::string` values exactly 3 characters long. Strings of this length are stored in the string handle and can only be distinguished by comparing the actual string bodies.
  - Long: `std::string` values exactly 26 characters long. Strings of this length are stored in heap-allocated storage and can only be distinguished by comparing the actual string bodies.
  - Unique: Null-terminated `char const *` values that are uniquely identified by their address. Copies, moves, and comparisons of these values are all trivial, requiring only manipulation of eight-byte pointers.

  There are further possibilities, such as short `std::string` values that can be distinguished simply by their differing lengths, as well as far more complicated types, but the above three represent an initial sample of three distinct performance profiles.
- Using `vector::push_back()` versus using `vector::emplace_back()`: Both the loop and the `transform` idiom can be written to push a previously-constructed temporary value onto the end of the `vector` (`push_back()`) or to construct a value directly in uninitialized storage at the `vector`'s end (`emplace_back()`). Typically, `emplace_back()` is faster, as it does not generate a temporary value.

  The sample idioms that I presented in previous posts differed in this. The `transform` idiom used `push_back()`. I verified that `gcc` 6.2 generated the same machine code when the idiom was written using `emplace_back()`. By contrast, I presented the loop code using `emplace_back()`, and the code generated for that was simpler than when `push_back()` was used. In this post, I microbenchmark both versions of both idioms.

- Proportion of elements copied: The source vector contains pairs whose first member is one of two string values of the given representation. One of those values is the one to be filtered out, while the other is the one to be copied. The benchmarks were run with five different source data sets, with different fractions of values to be copied: 0%, 25%, 50%, 75%, and 100%. The string comparison is performed for every element in the source vector, but only the copied proportion of the elements are appended to the result vector.
- Order of the data elements: Both idioms are at heart a branch located inside a loop. For such algorithms, the order of the elements affects the branch prediction rate. For example, if 50% of the values are copied, then 50% of the branches will be taken and 50% will not. If the successful branches are all collected together at the start or end of the source vector, the hardware branch predictor will predict the branch almost perfectly (with a small number of failed predictions at the transition), whereas if the copied and not-copied elements are randomly mingled, the predictor will consistently fail. The benchmarks were run with the five data sets presented in random order and in sorted order.
Method
I chose the Nonius C++ microbenchmarking framework, due to its strong statistical design. The two idioms were placed in functions called by the framework. Each idiom was written in both `push_back()` and `emplace_back()` versions:
`str_loop_emplace`: Basic loop using `emplace_back()`

```cpp
std::size_t str_loop_emplace (plist& res) {
    assert ( ! res.size());
    for (auto& p : src) {
        if (p.first != exclude)
            res.emplace_back(p.first, p.second*p.second);
    }
    return res.size();
}
```
`str_loop_push`: Basic loop using `push_back()`

```cpp
std::size_t str_loop_push (plist& res) {
    assert ( ! res.size());
    for (auto& p : src) {
        if (p.first != exclude)
            res.push_back(ppair{p.first, p.second*p.second});
    }
    return res.size();
}
```
`str_option_emp`: `transform` idiom using an output iterator calling `emplace_back()`

```cpp
std::size_t str_option_emp(plist& res) {
    assert ( ! res.size());
    std::transform(src.cbegin(), src.cend(), opt_back_emplacer(res),
        [=](auto& p) {
            if (p.first == exclude)
                return oppair();
            else
                return oppair(ppair{p.first, p.second*p.second});
        } );
}
```
`str_option`: `transform` idiom using an output iterator calling `push_back()`

```cpp
std::size_t str_option(plist& res) {
    assert ( ! res.size());
    std::transform(src.cbegin(), src.cend(), opt_back_inserter(res),
        [=](auto& p) {
            if (p.first == exclude)
                return oppair();
            else
                return oppair(ppair{p.first, p.second*p.second});
        } );
}
```
There are two blemishes in the above code that I only noticed after gathering all the data. Neither of them substantively affects the benchmark results:

- The initial `assert()` calls, checking that the result vector is empty, were a legacy of initial testing that the benchmarking code worked as expected. Left in the benchmark, they add a small amount of unnecessary computation. This will have no effect on the relative performance, as the same cost is paid for every benchmark and the dominant cost is the processing of the 50,000 vector elements. Nonetheless, in future runs, the `assert` calls should be deleted.
- The two `transform` benchmark functions, `str_option` and `str_option_emp`, declare a result but do not return one. Accepting such functions without warning is a "feechure" of gcc. The result actually returned from these functions is gibberish (by chance, it is the square of the second member for the last copied element). The only purpose of the results of the benchmark functions is to force the optimizer to call the functions; the return value is never used. Reviewing the generated code, the function logic is generated and the function called, so the declared return value serves its purpose despite being gibberish, and the performance results are unaffected. Nonetheless, in future runs, these two functions should have explicit `return` statements.
Reviewing the generated code, the two implementations of the `transform` idiom produced identical code, both calling `emplace_back()`. This arose because the `libstdc++` implementation of `push_back()` includes an overload that simply maps to `emplace_back()`, and this overload was the one resolved for the `opt_back_insert_iterator()` used as the output iterator for `str_option`. In the benchmark results, the two `transform` implementations performed identically.
Refactoring the code into functions changed the generated code in small ways from the code discussed in the two prior posts: the `res` vector is now a by-reference parameter rather than a local variable, while the `src` vector and `exclude` constant are now global rather than local. The actual changes to the generated code are modest; the performance results for these functions should correlate well with the performance of the code described earlier.
The source array always had 50,000 elements. Each benchmark was run 10,000 times and their arithmetic mean is reported. For every reported result, the size of the 95% confidence interval reported by Nonius was less than 3% of the mean. Full details of the benchmark conditions are given in the Appendix.
Overview of results
I’ll start with an overview of the benchmark results:
Each dot represents the mean of 10,000 samples. The number of dots varies by condition because I ran the benchmark varying numbers of times, from one run for each condition with unique strings, to five runs for the sorted datasets of short strings.
The largest effect is due to string type (comparing rows of plots): Using the short strings (middle row) as a reference point, long strings (top row) are roughly twice as slow and unique strings (bottom row) are twice as fast.
The next largest effect is the proportion of items actually copied (order of colours within each lane; see legend correlating colours to proportions). Within each string type, copying more values into the result vector requires more time. This is hardly surprising, but it does emphasize the cost of moving data, even for simple data types such as short strings (32 bytes to copy in a 64-bit implementation) and `char const *` (8 bytes to copy in a 64-bit implementation).
Using `vector::emplace_back()` versus `vector::push_back()` had a substantial effect for the loop (leftmost lane is the loop with `emplace_back()`; second lane from left is the loop with `push_back()`) but, as expected, had no effect for `std::transform()` (rightmost lane is `transform` with `emplace_back()`; second lane from right is `transform` with `push_back()`).
Finally, consider the question motivating this analysis: How did the performance of `transform` compare with that of the loop? I will take the `emplace_back()` implementation as representative of the loop's performance (it is trivially easy to use `emplace_back()` in the loop, and doing so is widely recommended; see, for example, Item 42 of Scott Meyers's Effective Modern C++). The choice is arbitrary for `transform`, given the two implementations' equivalent performance and identical code, so I will take the `push_back()` implementation. These choices produce a comparison between the leftmost lane of the plots (the loop with `emplace_back()`) and the second lane from the right (`transform` with `push_back()`).
Within each subplot, the loop is faster than `transform` for the cases where at least some elements are copied to the result vector. The two idioms offer equivalent performance for the case where no data elements are copied.
Finally, I note some anomalies in the plots:

- For the case of 75% copied on sorted data (upper right subplot), three of the implementations have a single run that was anomalously slow. I speculate that the conditions on my machine were substantively different during those runs; perhaps some background OS housekeeping process was running. The other two runs for this case also seem unexpectedly high, the only cases in all plots where algorithms ran more slowly on sorted data. I do not know what the actual cause was, and it does not seem to have arisen during the `str_loop_emplace` benchmark, which was run first.
- For the loop processing unsorted short strings and unique strings (the left lane of the two bottom left plots), the performance is not predicted by the proportion of copies. I will discuss this in detail below.
For the rest of the post, I will focus on the `emplace_back()` version of the loop and the `push_back()` version of the `transform` idiom.
Comparison of long `std::string`
The results for long strings are the most straightforward to interpret:
The long strings are two 26-character strings, differing only in the last character. As such, they require a utility routine call to compare the string bodies, which are stored in the heap and are likely to cause cache misses. Move constructions and assignments remain cheap, because the heap pointer can simply be moved to the target, which accepts ownership of the heap object. By contrast, copy constructions and assignments are expensive, requiring a heap allocation for the copy of the string body, again incurring cache misses.
Given the high cost of string operations, the dominant factor in performance is the number of items to copy, which predicts the order of results: More copies mean slower performance.
We see a mild effect of branch misprediction in the case of 50% copied: The loop processes sorted input (second lane from left) about 0.25 ms faster than unsorted input (leftmost lane). This effect is barely present for `transform`.
The loop's advantage over `transform` is directly proportional to the number of copies. The two idioms differ most for 100% and 75% copies, while they are equivalent for 0%. For long strings, the slower performance of `transform` appears to be due to the extra temporary values constructed when copying into the result vector.
I consider the results for `transform` on the 75% copied case (top of two right lanes) more tentative than the others, as they are anomalous in two ways. First, the three means for the sorted data are not clustered; one 75% mean is instead clustered with the two means for the 100% copied data. Second, all three means for the `transform` of sorted data are higher than the corresponding means for unsorted data. This is the only instance of processing sorted data more slowly than unsorted, and I cannot think of a reasonable explanation. Due to these anomalies, I accord these data points less confidence than the others.
Comparison of short `std::string`
The results for short strings are more complex:
The short strings are both 3 characters long, differing in their final character. All operations can be performed by access to the string handle (`libstdc++` stores string bodies of less than 16 characters in the handle rather than the heap), avoiding heap accesses and greatly reducing the likelihood of cache misses. As noted in the overview, the improvement is clear: short strings are about twice as fast as long strings.
The reduced effect of string copying allows other effects to appear, particularly the effect of branch misprediction. As the misprediction effect is most noticeable for the loop, I will begin with the more straightforward `transform` results (the two right lanes).
The dominant factor in the performance of `transform` on short strings is the percentage of values copied to the result. On sorted data, the means are directly proportional to this percentage. For unsorted data, branch misprediction effects appear. For the 100% and 0% copied cases, branch prediction should be perfect irrespective of data order, as the branch will succeed or fail for every element. This is confirmed by the means for these cases, which are the same for unsorted and sorted data sets.
However, for the 50% copied case, branch prediction will nearly always be wrong for unsorted data but essentially perfect for sorted data, only mispredicting at the transition from excluded to included values. The means for 50% copies show a substantial improvement for sorted data.
For the data sets that have 75% and 25% of their values copied, there will be two contending effects. For both cases, branch prediction will be modestly effective even for unsorted data, as occasional runs of inclusion or exclusion in the data will support successful prediction. However, the 25% data set requires substantially fewer copies than the 75% data set, making it faster. These effects are apparent in the means: `transform` is substantially faster on sorted data than unsorted for the 25% case, where branch misprediction is a large contributor, but the idiom is only slightly faster for sorted data in the 75% case, where the cost of copying values dominates.
Branch misprediction has a much larger effect on the loop (two left lanes). When this idiom processes unsorted data (leftmost lane), the proportion of values copied no longer predicts the relative performance of the data sets. The two fastest unsorted data sets are those for which branch prediction is perfect, with 0% and 100% copied. All datasets for which branch prediction is imperfect, the 25%, 50%, and 75%, are slower. Indeed, the 50% and 75% cases have almost the same performance, due to near-total branch misprediction (in the 50% case) having almost as much effect as half again more string copies (in the 75% case).
When the loop runs on sorted data (second lane from left), branch prediction is perfect, leaving only the proportion of copies as a factor. The more copies required, the slower the processing.
Finally, comparing the results for the two idioms, `transform` is 25–50% slower than the loop, with the gap proportional to the number of values copied. For 0% copied, the two idioms are equivalent.
Comparison of unique `char const *` strings
The unique `char const *` strings are trivially fast to copy or compare, requiring only manipulation of 8-byte pointers. Consequently, the influence of dataset order is even stronger for these strings:
For both idioms, order once again has no effect on performance for datasets where 0% or 100% of the elements are copied, due to perfect branch prediction. For `transform` (right two lanes), the 50% copied case suffered the greatest loss due to branch misprediction, with the unsorted data about a third slower than the sorted data. Misprediction had no discernible effect on the 75% copied data and only a slight effect on the 25% copied data. For `transform`, the relative ordering of the means was always determined by the proportion of copies.
For the loop, branch misprediction had a larger effect (left two lanes). As with the short string case, the order of the results for the unsorted data (leftmost lane) corresponded to the number of correctly-predicted branches, not the number of copies. The datasets with perfect prediction (0% and 100% copied) were fastest, while the unpredictable dataset (50% copied) was slowest, with the partially-predictable datasets (25% and 75%) in between. Within each level of branch predictability, the number of copies determined the relative ranking.
For the sorted case, the loop performance was directly proportional to the number of copies made, as branch prediction was near-perfect for every dataset.
Comparing the two idioms, `transform` was surprisingly slow, ranging from 2.5 times slower down to the same speed. Once again, the number of extraneous temporaries generated by `transform` seems to be the source of the difference, as the two algorithms take identical times for datasets where no elements are copied.
Caveats
These results reflect a small number of samples (only a single sample in the case of the unique strings) on a single OS/machine configuration (see Appendix for details). For the cases where I have multiple samples, they seem reliable, with the lone exception of the sorted data set of long strings with 75% copies for `transform`. My confidence is increased by the consistency between these results, my analysis of the machine code, and well-understood theories of performance on modern processors. Nonetheless, the numbers are more suggestive than definitive. I would like to run more tests in future but I have put enough time into this analysis already and want to post it. These results at least tell a consistent, believable story.
Conclusion
Is the `transform` idiom a useful alternative to the loop? I confess that the analysis of the code and benchmarks is making me question the `transform` idiom's utility. The loop idiom is immediately understandable to anyone who has programmed in any language, is robust, and is a more direct expression of the underlying algorithm, permitting the compiler to generate code without extraneous temporaries. More familiar and faster: a hard pair to beat.
The only advantages of the `transform` idiom, such as they may be, include increased familiarity with the standard algorithm family. I still believe that there are many cases where a standard algorithm is preferable to a hand-crafted alternative. I don't want to write my own sort or even my own unique-values algorithm, thank you. But the filter-and-transform embodied by the basic Python list comprehension maps cleanly to a range `for` loop and `if` statement. Introducing a `std::optional` value and an optional-aware output iterator just to use `transform` now seems too clever by half. In my own code, I expect I'll stick with a `for` loop in this case.
In future posts, I’ll extend this analysis to a few more cases but my focus will likely shift to the impact of temporary values on execution speed. The biggest lesson of these benchmarks is how much impact extraneous temporaries have.
Appendix: Detailed method description
This appendix presents the technical specifics of the benchmarks.
Nonius calls
The code for running the `str_loop_emplace` benchmark:
```cpp
using ppair = std::pair<string,int>;
using oppair = optional<ppair>;
using plist = std::vector<ppair>;

plist src {};
plist res {};    // NOT USED IN BENCHMARKS
int included {};
string exclude {};

void check(const plist& res) {
#if PRINT_SIZE
    std::cerr << "LEFT " << LEFT << " res.size() " << res.size() << std::endl;
#endif
    if (res.size() != included) {
        std::cerr << "res.size() " << res.size() << ", included " << included << '\n';
        for (auto& v : res) {
            std::cerr << v.first << ", " << v.second << std::endl;
        }
    }
    assert(res.size() == included);
}

#ifdef DO_LOOP_EMPLACE
std::size_t str_loop_emplace (plist& res) {
    assert ( ! res.size());
    for (auto BY p : src) {
        if (p.first != exclude)
            res.emplace_back(p.first, p.second*p.second);
    }
    return res.size();
}

NONIUS_BENCHMARK("str_loop_emplace", [] (nonius::chronometer meter) {
    std::vector<plist> res(meter.runs());
    for (auto& r : res) {
        r.reserve(src.size());
    }
    meter.measure( [&] (int i) {return str_loop_emplace(res[i]);} );
    for (const auto& r : res)
        check(r);
});
#endif
```
The code for the other three benchmarks was the same, substituting the appropriate function from the beginning of this post for `str_loop_emplace`.

Vector `src` was loaded before any benchmarks were run and re-used for every benchmark. The code is given below.
The `NONIUS_BENCHMARK` function may execute a given benchmark several times if the measured clock resolution is insufficiently accurate to measure just one execution. The number of runs is available via `meter.runs()`. To ensure the executions are independent, a separate result vector is created for each, represented as vector `res`, local to the function that ran the benchmark. Each benchmark was passed its own `res[i]` to receive the result.
The global `res` vector declared in Line 6 is a remnant of earlier versions and was not used in these benchmarks. The local `res` vector of vectors defined in Line 36 was used instead.
Several preprocessor macros determined the final code:

- `PRINT_SIZE`: Set to `1` when testing the benchmark code, `0` for actual benchmarks.
- `BY`: Set to `&` for actual benchmarks. The functions given in the main post present the code after this macro had been expanded to an ampersand. In future, it will be set to empty to test the performance impact of pass-by-value.
- `DO_LOOP_EMPLACE`: Set to `1` when running actual benchmarks. In testing, can be set to `0` to not compile a benchmark.
- `NONIUS_BENCHMARK`: Standard macro for defining a Nonius benchmark harness.
Initialization of `src` vector

The source vector was initialized and Nonius called by the following code:
```cpp
std::array<char const * const, 13> regions = {
    "British Columbia",
    "Alberta",
    "Saskatchewan",
    "Manitoba",
    "Ontario",
    "Quebec",
    "Newfoundland and Labrador",
    "New Brunswick",
    "Prince Edward Island",
    "Nova Scotia",
    "Yukon",
    "Northwest Territories",
    "Nunuvat"
};

std::array<char const * const, 2> long_regs = {
    "Newfoundland and Labrador0",
    "Newfoundland and Labrador1"
};

std::array<char const * const, 2> short_regs = {
    "NL0",
    "NL1"
};

template<unsigned long N>
void set_copied(int v, int proportion, std::array<char const * const, N>& regs,
                int value, int& included, plist& vec) {
    if (0 <= v && v < proportion) {
        vec.push_back(ppair{regs[1], value});
        included++;
    }
    else
        vec.push_back(ppair{regs[0], value});
}

void init(int size, SeedType v) {
    auto rbits = make_unique<RBits>(v);
    // Range of random #s includes both bounds: [0, 99]
    MRandInt rnd(rbits.get(), 99);
#if LEFT == 1
    exclude = regions[0];
#elif LEFT == 2 || LEFT == 6 || LEFT == 7 || LEFT == 10 || LEFT == 11
    exclude = long_regs[0];
#elif LEFT == 3 || LEFT == 4 || LEFT == 5 || LEFT == 8 || LEFT == 9
    exclude = short_regs[0];
#else
    static_assert(false, "Missing case in init");
#endif
    src.reserve(size);
    for (int i = 0; i < size; ++i) {
        auto v = rnd();
#if LEFT == 1
        if (1 <= v && v <= 12) {
            src.push_back(ppair{regions[v], i});
            included++;
        }
        else
            src.push_back(ppair{regions[0], i});
#elif LEFT == 2
        set_copied(v, 50, long_regs, i, included, src);
#elif LEFT == 3
        set_copied(v, 50, short_regs, i, included, src);
#elif LEFT == 4
        set_copied(v, 0, short_regs, i, included, src);
#elif LEFT == 5
        set_copied(v, 100, short_regs, i, included, src);
#elif LEFT == 6
        set_copied(v, 0, long_regs, i, included, src);
#elif LEFT == 7
        set_copied(v, 100, long_regs, i, included, src);
#elif LEFT == 8
        set_copied(v, 25, short_regs, i, included, src);
#elif LEFT == 9
        set_copied(v, 75, short_regs, i, included, src);
#elif LEFT == 10
        set_copied(v, 25, long_regs, i, included, src);
#elif LEFT == 11
        set_copied(v, 75, long_regs, i, included, src);
#else
        static_assert(false, "src proportions undefined");
#endif
    }
#if SORTED == 1
    std::sort(src.begin(), src.end());
#elif SORTED == 2
    std::shuffle(src.begin(), src.end(), *rbits);
#endif
    /*
    // Following just for ensuring correctness
    for (const auto& p : src)
        std::cerr << '(' << p.first << ", " << p.second << ")\n";
    std::cerr << std::endl;
    */
    res.reserve(size);
}

int main(int argc, char** argv) {
    const auto size = SIZE;
    init(size, MRandInt::getSeed(true,std::cout));
    return nonius::main(argc, argv);
}
```
The following macros determined the compiled code:

- `LEFT`: Selected the combination of string type and proportion of copies for a given run. Ranged from 1 to 11.
- `SIZE`: Number of entries in the `src` vector. Set to 50,000 for all benchmarks.
- `SORTED`: Defined whether the `src` vector was in random order (`SORTED==0`), sorted (`SORTED==1`), or shuffled (`SORTED==2`). Both the random and sorted cases were used in benchmarks; the shuffled case was not used.
Class `MRandInt` generates random integers over a specified inclusive range. For these benchmarks, it generated values between 0 and 99 (inclusive), which were used to randomly select whether an element received the included or the excluded value. A different random seed was used for every run. The proportions of values copied consequently changed from run to run, but all were close to the target.
I made a slight variation of the main Nonius function to allow me to call `init` before calling `nonius::main`.
Run conditions
All benchmarks were run on a guest Ubuntu 16.04 system, running under VirtualBox 5.0.20r106931, on a host Mac OS 10.12.4. The hardware was a MacBook Pro with a 2.4 GHz Intel Core i5 and 8 GB memory.
No other user-facing applications were running on either the guest or host system (though both had ample numbers of background daemons running) and the host machine had its network turned off.
Each run specified a Nonius sample size of 10,000 and the default bootstrap resample size of 100,000.
For every benchmark, the 95% confidence interval returned by the bootstrap was less than or equal to 3% of the sample mean.