Benchmarking the transform and loop idioms

Update: Mon Aug 20, 2018: Also see this semi-rant about the limitations of microbenchmarking.

The previous posts compared the machine code generated for transform with the code generated for the equivalent idiom, an explicit for loop with an if statement. The transform idiom generated more intermediate copies of the std::string member than the loop did.

As I noted in those posts, differences in the code indicate the degree to which the optimizer was able to reduce the standard library abstractions to the essential underlying algorithm, but they only approximately predict differences in performance. The performance effects must ultimately be assessed by benchmarks.

There is a wide range of methods and levels at which benchmarks might compare these two idioms. I will focus on microbenchmarking: timing short code sequences under several conditions of underlying data type and order. This does not predict the effect of choosing these idioms in a full application, where their contribution will typically be small. But it does provide insight into the effectiveness of the different machine code sequences that the optimizer generated for each idiom.

I emphasize that the most important outcome of these benchmarks is the relative performance of the various constructs, not their absolute values. Your performance will certainly vary. See the Caveats and Appendix sections for further discussion of the limits of these results.

Conditions

I ran initial tests exploring the effect of various factors on performance and settled on the following factors as most important:

  1. Type of the elements: The samples that I presented in earlier posts used elements that were std::pair, with the first member, a std::string, used for the filter. The libstdc++ implementation of a string is complex, offering different performance profiles for short strings (15 characters or fewer) versus long strings, and for operations that only use the string length versus those that require accessing the string body.

    The second member, the int, generated simple code that had no performance impact.

    To sample the performance range of string data, I ran microbenchmarks with three different string representations as the first member of the pair:
    • Short: std::string values exactly 3 characters long. Strings of this length are stored in the string handle and can only be distinguished by comparing the actual string bodies.
    • Long: std::string values exactly 26 characters long. Strings of this length are stored in heap-allocated storage and can only be distinguished by comparing the actual string bodies.
    • Unique: Null-terminated char const * values that are uniquely identified by their address. Copies, moves, and comparisons of these values are all trivial, requiring only manipulation of eight-byte pointers.

    There are further possibilities, such as short std::string values that can be distinguished simply by their differing lengths, as well as far more complicated types, but the above three represent an initial sample of three distinct performance profiles.

  2. Using vector::push_back() versus using vector::emplace_back(): Both the loop and the transform idiom can be written to push a previously-constructed temporary value onto the end of the vector (push_back()) or to construct a value directly in uninitialized storage at the vector's end (emplace_back()). Typically, emplace_back() is faster, as it does not generate a temporary value (see the sketch following this list).

    The sample idioms that I presented in previous posts differed in this respect. The transform idiom used push_back(); I verified that gcc 6.2 generated the same machine code when the idiom was written using emplace_back(). By contrast, I presented the loop using emplace_back(), and the code generated for that was simpler than when push_back() was used. In this post, I microbenchmark both versions of both idioms.

  3. Proportion of elements copied: The source vector contains pairs whose first member value is one of two string values of the given representation. One of those values is the one to be filtered out, while the other value is the one to be copied. The benchmarks were run with five different source data sets, with different fractions of values to be copied: 0%, 25%, 50%, 75%, and 100%. The string comparison is performed for every element in the source vector, but only the copied proportion of the elements are appended to the result vector.

  4. Order of the data elements: Both idioms are at heart a branch located inside a loop. For such algorithms, the order of the elements affects the branch prediction rate. For example, if 50% of the values are copied, then 50% of the branches will be taken and 50% will not. If the successful branches are all collected together at the start or end of the source vector, the hardware branch predictor will predict the branch almost perfectly (with a small number of failed predictions at the transition), whereas if the copied and not-copied elements are randomly mingled, the predictor will fail about half the time. The benchmarks were run with the five data sets presented in random order and in sorted order.
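
To make the distinction in item 2 concrete, here is a minimal sketch of the two ways of appending, using the same pair-of-string-and-int element type as the benchmarks. This is an illustration, not the benchmark code:

#include <string>
#include <utility>
#include <vector>

using ppair = std::pair<std::string, int>;

void append_both_ways(std::vector<ppair>& res, const ppair& p) {
  // push_back: construct a temporary ppair, then move it into the
  // vector's storage (one extra move construction).
  res.push_back(ppair{p.first, p.second * p.second});

  // emplace_back: forward the arguments to the ppair constructor,
  // building the element directly in the vector's storage (no temporary).
  res.emplace_back(p.first, p.second * p.second);
}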

Method

I chose the Nonius C++ microbenchmarking framework, due to its strong statistical design. The two idioms were placed in functions called by the framework. Each idiom was written in both push_back() and emplace_back() versions:

str_loop_emplace: Basic loop using emplace_back()

std::size_t str_loop_emplace (plist& res) {
  assert ( ! res.size());
  for (auto& p : src) {
    if (p.first != exclude)
      res.emplace_back(p.first, p.second*p.second);
  }
  return res.size();
}

str_loop_push: Basic loop using push_back()

std::size_t str_loop_push (plist& res) {
  assert ( ! res.size());
  for (auto& p : src) {
    if (p.first != exclude)
      res.push_back(ppair{p.first, p.second*p.second});
  }
  return res.size();
}

str_option_emp: transform idiom using an output iterator calling emplace_back()

std::size_t str_option_emp(plist& res) {
  assert ( ! res.size());
  std::transform(src.cbegin(), src.cend(), opt_back_emplacer(res),
    [=](auto& p) {
      if (p.first == exclude)
        return oppair();
      else
        return oppair(ppair{p.first, p.second*p.second});
    } );
}

str_option: transform idiom using an output iterator calling push_back()

std::size_t str_option(plist& res) {
  assert ( ! res.size());
  std::transform(src.cbegin(), src.cend(), opt_back_inserter(res),
    [=](auto& p) {
      if (p.first == exclude)
        return oppair();
      else
        return oppair(ppair{p.first, p.second*p.second});
    } );
}
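
The optional-aware output iterators, opt_back_inserter and opt_back_emplacer, were defined in the earlier posts. For readers joining here, the following is a minimal sketch of roughly what the push_back() flavour does; the details are my reconstruction, not the exact code from those posts. Assigning an empty optional discards the element, while assigning an engaged optional pushes its value onto the container:

#include <iterator>
#include <optional>
#include <utility>

template <typename Container>
class opt_back_insert_iterator {
public:
  // Output-iterator boilerplate, as for std::back_insert_iterator.
  using iterator_category = std::output_iterator_tag;
  using value_type = void;
  using difference_type = void;
  using pointer = void;
  using reference = void;

  explicit opt_back_insert_iterator(Container& c) : c_(&c) {}

  // The interesting part: only engaged optionals reach the container.
  opt_back_insert_iterator&
  operator=(std::optional<typename Container::value_type>&& v) {
    if (v)
      c_->push_back(std::move(*v));
    return *this;
  }

  // No-ops, as required of an output iterator.
  opt_back_insert_iterator& operator*() { return *this; }
  opt_back_insert_iterator& operator++() { return *this; }
  opt_back_insert_iterator operator++(int) { return *this; }

private:
  Container* c_;
};

template <typename Container>
opt_back_insert_iterator<Container> opt_back_inserter(Container& c) {
  return opt_back_insert_iterator<Container>(c);
}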

There are two blemishes in the above code that I only noticed after gathering all the data. Neither of them substantively affects the benchmark results:

  1. The initial assert() calls, checking that the result vector is empty, were a legacy of initial tests verifying that the benchmarking code worked as expected. Left in the benchmark, they add a small amount of unnecessary computation. This has no effect on the relative performance, as the same cost is paid in every benchmark and the dominant cost is the processing of the 50,000 vector elements. Nonetheless, in future runs, the assert calls should be deleted.
  2. The two transform benchmark functions, str_option and str_option_emp, declare a result but do not return one. Accepting such functions without warning is a "feechure" of gcc. The result actually returned from these functions is gibberish (by chance, it is the square of the second member of the last copied element). The only purpose of the benchmark functions' results is to force the optimizer to call the functions; the return value is never used. Reviewing the generated code, the function logic is generated and the functions are called, so the declared return value serves its purpose despite being gibberish, and the performance results are unaffected. Nonetheless, in future runs, these two functions should have explicit return statements.

Reviewing the generated code, the two implementations of the transform idiom produced identical code, both calling emplace_back(). This arose because the libstdc++ implementation of push_back() includes an overload that simply maps to emplace_back() and this overload was the one resolved for the opt_back_insert_iterator() used as the output iterator for str_option. In the benchmark results, the two transform implementations performed identically.
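
For reference, the overload in question looks roughly like this (a paraphrase of libstdc++'s <vector> source with internal names simplified, not a verbatim quote):

void push_back(value_type&& x)   // rvalue overload
{ emplace_back(std::move(x)); }  // forwards to in-place construction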

Refactoring the code into functions changed the generated code in small ways from the code discussed in the two prior posts: The res vector is now a by-reference parameter rather than a local variable, while the src vector and exclude constant are now global rather than local. The actual changes to the generated code are modest; the performance results for these functions should correlate well with the performance of the code described earlier.

The source array always had 50,000 elements. Each benchmark was run 10,000 times and the arithmetic mean of those runs is reported. For every reported result, the size of the 95% confidence interval reported by Nonius was less than 3% of the mean. Full details of the benchmark conditions are given in the Appendix.

Overview of results

I’ll start with an overview of the benchmark results:

overview_plot

Each dot represents the mean of 10,000 samples. The number of dots varies by condition because I ran the benchmark varying numbers of times, from one run for each condition with unique strings, to five runs for the sorted datasets of short strings.

The largest effect is due to string type (comparing rows of plots): Using the short strings (middle row) as a reference point, long strings (top row) are roughly twice as slow and unique strings (bottom row) are twice as fast.

The next largest effect is the proportion of items actually copied (the order of colours within each lane; see the legend correlating colours to proportions). Within each string type, copying more values into the result vector requires more time. This is hardly surprising but it does emphasize the cost of moving data, even for simple data types such as short strings (32 bytes to copy in a 64-bit implementation) and char const * (8 bytes to copy in a 64-bit implementation).
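
These sizes are easy to confirm directly; assuming libstdc++ on x86-64, the following assertions hold:

#include <string>

static_assert(sizeof(std::string) == 32, "string handle, libstdc++ on x86-64");
static_assert(sizeof(char const*) == 8, "pointer size on x86-64");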

Using vector::emplace_back() versus vector::push_back() had a substantial effect for the loop (leftmost lane is loop with emplace_back(), second lane from left is loop with push_back()) but as expected had no effect for std::transform() (rightmost lane is transform with emplace_back(), second lane from right is transform with push_back()).

Finally, consider the question motivating this analysis: How did the performance of transform compare with that of the loop? I will take the emplace_back() implementation as representative of the loop’s performance (it is trivially easy to use emplace_back() in the loop, and doing so is widely recommended—for example, see Item 42 of Scott Meyers’s Effective Modern C++). The choice is arbitrary for transform, given the two implementations’ equivalent performance and identical code, so I will take the push_back() implementation. These choices produce a comparison between the leftmost lane of the plots (the loop with emplace_back()) and the second lane from the right (transform with push_back()).

Within each subplot, the loop is faster than transform in the cases where at least some elements are copied to the result vector. The two idioms offer equivalent performance in the case where no elements are copied.

Finally, the plots contain a few anomalies; I will note them in the detailed comparisons below.

For the rest of the post, I will focus on the emplace_back() version of the loop and the push_back() version of the transform idiom.

Comparison of long std::string

The results for long strings are the most straightforward to interpret:

long_plot

The long strings are two 26-character strings, differing only in the last character. As such, they require a utility routine call to compare the string bodies, which are stored in the heap and likely cause cache misses. Move constructions and assignments remain cheap, because the heap pointer can simply be moved to the target, which accepts ownership of the heap object. By contrast, copy constructions and assignments are expensive, requiring a heap allocation for the copy of the string body, again incurring cache misses.
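
A minimal sketch of this asymmetry (an illustration, not the benchmark code):

#include <string>
#include <utility>

void copy_vs_move() {
  std::string original("Newfoundland and Labrador0"); // 26 characters: body on the heap
  std::string copied = original;            // new heap allocation plus a byte-by-byte copy
  std::string moved  = std::move(original); // pointer transfer; no allocation
}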

Given the high cost of string operations, the dominant factor in performance is the number of items to copy, which predicts the order of results: More copies mean slower performance.

We see a mild effect of branch misprediction in the case of 50% copied: The loop processes sorted input (second lane from left) about 0.25 ms faster than unsorted input (leftmost lane). This effect is barely present for transform.

The loop’s advantage over transform is directly proportional to the number of copies. The two idioms differ most for 100% and 75% copies, while they are equivalent for 0%. For long strings, the slower performance of transform appears to be due to the extra temporary values constructed when copying into the result vector.

I consider the results for transform on the 75% copied case (top of the two right lanes) more tentative than the others, as they are anomalous in two ways. First, the three means for the sorted data are not clustered; one 75% mean is instead clustered with the two means for the 100% copied data. Second, all three means for the transform of sorted data are higher than the corresponding means for unsorted data. This is the only instance of sorted data being processed more slowly than unsorted, and I cannot think of a reasonable explanation. Due to these anomalies, I accord these data points less confidence than the others.

Comparison of short std::string

The results for short strings are more complex:

short_plot

The short strings are both 3 characters long, differing in their final character. All operations can be performed by access to the string handle (libstdc++ stores string bodies of fewer than 16 characters in the handle rather than on the heap), avoiding heap accesses and greatly reducing the likelihood of cache misses. As noted in the overview, the improvement is clear: Short strings are about twice as fast as long strings.
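
The handle-versus-heap distinction is easy to probe. The sketch below is my illustration, assuming libstdc++'s small-string layout; it reports whether a string's body lies within the string object itself:

#include <iostream>
#include <string>

bool stored_in_handle(const std::string& s) {
  // If the character buffer lies within the string object's own
  // footprint, the body is in the handle; otherwise it is on the heap.
  auto* obj = reinterpret_cast<const char*>(&s);
  return s.data() >= obj && s.data() < obj + sizeof(std::string);
}

int main() {
  std::cout << stored_in_handle(std::string("NL0")) << '\n';                        // 1 on libstdc++
  std::cout << stored_in_handle(std::string("Newfoundland and Labrador0")) << '\n'; // 0: heap-allocated
}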

The reduced effect of string copying allows other effects to appear, particularly the effect of branch misprediction. As the misprediction effect is most noticeable for the loop, I will begin with the more straightforward transform results (the two right lanes).

The dominant factor in the performance of transform on short strings is the percentage of values copied to the result. On sorted data, the means are directly proportional to this percentage. For unsorted data, branch misprediction effects appear. For the 100% and 0% copied cases, branch prediction should be perfect irrespective of data order, as the branch goes the same way for every element. This is confirmed by the means for these cases, which are the same for unsorted and sorted data sets.

However, for the 50% copied case, branch prediction will fail about half the time for unsorted data but will be essentially perfect for sorted data, only mispredicting at the transition from excluded to included values. The means for 50% copies show a substantial improvement for sorted data.

For the data sets that have 75% and 25% of their values copied, there will be two contending effects. For both cases, branch prediction will be modestly effective even for unsorted data, as occasional runs of inclusion or exclusion in the data will support successful prediction. However, the 25% data set requires substantially fewer copies than the 75% data set, making it faster. These effects are apparent in the means: transform is substantially faster on sorted data than unsorted for the 25% case, where branch misprediction is a large contributor, but the idiom is only slightly faster for sorted data in the 75% case, where the cost of copying values dominates.

Branch misprediction has a much larger effect on the loop (two left lanes). When this idiom processes unsorted data (leftmost lane), the proportion of values copied no longer predicts the relative performance of the data sets. The two fastest unsorted data sets are those for which branch prediction is perfect, with 0% and 100% copied. All the datasets for which branch prediction is imperfect, the 25%, 50%, and 75%, are slower. Indeed, the 50% and 75% cases have almost the same performance: heavy branch misprediction (in the 50% case) has almost as much effect as half again more string copies (in the 75% case).

When the loop runs on sorted data (second lane from left), branch prediction is perfect, leaving only the proportion of copies as a factor. The more copies required, the slower the processing.

Finally, comparing the results for the two idioms, transform is 25–50% slower than the loop, with the gap proportional to the number of values copied. For 0% copied, the two idioms are equivalent.

Comparison of unique char const * strings

The unique char const * strings are trivially fast to copy or compare, requiring only manipulation of 8-byte pointers. Consequently, the influence of dataset order is even stronger for these strings:

unique_plot
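
As a reminder of why these operations are trivial: assuming the element type for this condition pairs a char const * with the int, the filter test reduces to a single pointer comparison. The sketch below is an illustration, not the benchmark code:

#include <utility>

using cpair = std::pair<char const*, int>; // assumed element type for this condition

bool keep(const cpair& p, char const* exclude) {
  // The two distinct values live at distinct addresses, so inequality is
  // a single 8-byte pointer comparison; no characters are examined.
  return p.first != exclude;
}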

For both idioms, order once again has no effect on performance for datasets where 0% or 100% of the elements are copied, due to perfect branch prediction. For transform (right two lanes), the 50% copied case suffered the greatest loss due to branch misprediction, with the unsorted data about a third slower than the sorted data. Misprediction had no discernible effect on the 75% copied data and only a slight effect on the 25% copied data. For transform, the relative ordering of the means was always determined by the proportion of copies.

For the loop, branch misprediction had a larger effect (left two lanes). As with the short string case, the order of the results for the unsorted data (leftmost lane) corresponded to the number of correctly-predicted branches, not the number of copies. The datasets with perfect prediction (0% and 100% copied) were fastest, while the unpredictable dataset (50% copied) was slowest, with the partially-predictable datasets (25% and 75%) in between. Within each level of branch predictability, the number of copies determined the relative ranking.

For the sorted case, the loop performance was directly proportional to the number of copies made, as branch prediction was near-perfect for every dataset.

Comparing the two idioms, transform was surprisingly slow, taking from 2.5 times the loop’s time down to the same time. Once again, the extraneous temporaries made by transform seem to be the source of the difference, as the two idioms take identical times for datasets where no elements are copied.

Caveats

These results reflect a small number of samples (only a single sample in the case of the unique strings) on a single OS/machine configuration (see Appendix for details). For the cases where I have multiple samples, they seem reliable, with the lone exception of the sorted data set of long strings with 75% copies for transform. My confidence is increased by the consistency between these results, my analysis of the machine code, and well-understood theories of performance on modern processors. Nonetheless, the numbers are more suggestive than definitive. I would like to run more tests in future but I have put enough time into this analysis already and want to post it. These results at least tell a consistent, believable story.

Conclusion

Is the transform idiom a useful alternative to the loop? I confess that the analysis of the code and benchmarks is making me question the transform idiom’s utility. The loop idiom is immediately understandable to anyone who’s programmed in any language, is robust, and expresses the underlying algorithm directly enough that the compiler can generate code without extraneous temporaries. More familiar and faster: a hard pair to beat.

The advantages of the transform idiom, such as they are, amount to increased familiarity with the standard algorithm family. I still believe that there are many cases where a standard algorithm is preferable to a hand-crafted alternative. I don’t want to write my own sort or even my own unique-values algorithm, thank you. But the filter-and-transform embodied by the basic Python list comprehension maps cleanly to a range for loop and an if statement. Introducing a std::optional value and an optional-aware output iterator just to use transform now seems too clever by half. In my own code, I expect I’ll stick with a for loop in this case.

In future posts, I’ll extend this analysis to a few more cases but my focus will likely shift to the impact of temporary values on execution speed. The biggest lesson of these benchmarks is how much impact extraneous temporaries have.

Appendix: Detailed method description

This appendix presents the technical specifics of the benchmarks.

Nonius calls

The code for running the str_loop_emplace benchmark:

using ppair = std::pair<string,int>;
using oppair = optional<ppair>;
using plist = std::vector<ppair>;

plist src {};
plist res {}; // NOT USED IN BENCHMARKS

int included {};
string exclude {};

void check(const plist& res) {
  #if PRINT_SIZE
  std::cerr << "LEFT " << LEFT << " res.size() " << res.size() << std::endl;
  #endif
  if (res.size() != included) {
    std::cerr << "res.size() " << res.size() << ", included " << included << '\n';
    for (auto& v : res) {
      std::cerr << v.first << ", " << v.second << std::endl;
    }
  }
  assert(res.size() == included);
}

#ifdef DO_LOOP_EMPLACE

std::size_t str_loop_emplace (plist& res) {
  assert ( ! res.size());
  for (auto BY p : src) {
    if (p.first != exclude)
      res.emplace_back(p.first, p.second*p.second);
  }
  return res.size();
}

NONIUS_BENCHMARK("str_loop_emplace", [] (nonius::chronometer meter) {
  std::vector<plist> res(meter.runs());
  for (auto& r : res) {
    r.reserve(src.size());
  }

  meter.measure( [&] (int i) {return str_loop_emplace(res[i]);} );
  for (const auto& r : res)
    check(r);
});

#endif

The code for the other three benchmarks was the same, substituting the appropriate function from the beginning of this post for str_loop_emplace.

Vector src was loaded before any benchmarks were run and re-used for every benchmark. The code is given below.

The NONIUS_BENCHMARK function may execute a given benchmark several times if the measured clock resolution is too coarse to time a single execution accurately. The number of runs is available via meter.runs(). To ensure the executions are independent, a separate result vector is created for each run: the vector res, local to the benchmark lambda. Each run was passed its own res[i] to receive the result.

The global res vector declared near the top of the listing is a remnant of earlier versions and was not used in these benchmarks. The local res vector of vectors defined inside the NONIUS_BENCHMARK lambda was used instead.

Several preprocessor macros determined the final code:

PRINT_SIZE
1 when testing the benchmark code. 0 for actual benchmarks.
BY
Set to & for actual benchmarks. The functions given in the main post show the code after this macro has been expanded to an ampersand. In future runs, it will be set to empty to test the performance impact of pass-by-value.
DO_LOOP_EMPLACE
Defined when running actual benchmarks. In testing, it can be left undefined to skip compiling a benchmark.
NONIUS_BENCHMARK
Standard macro for defining a Nonius benchmark harness.

Initialization of src vector

The source vector was initialized and Nonius called by the following code:

std::array<char const * const, 13> regions = {
  "British Columbia",
  "Alberta",
  "Saskatchewan",
  "Manitoba",
  "Ontario",
  "Quebec",
  "Newfoundland and Labrador",
  "New Brunswick",
  "Prince Edward Island",
  "Nova Scotia",
  "Yukon",
  "Northwest Territories",
  "Nunuvat"
};

std::array<char const * const, 2> long_regs = {
  "Newfoundland and Labrador0",
  "Newfoundland and Labrador1"
};

std::array<char const * const, 2> short_regs = {
  "NL0",
  "NL1"
};

template<unsigned long N>
void set_copied(int v, int proportion, std::array<char const * const, N>& regs, int value, int& included, plist& vec) {
  if (0 <= v && v < proportion) {
    vec.push_back(ppair{regs[1], value});
    included++;
  }
  else
    vec.push_back(ppair{regs[0], value});      
}

void init(int size, SeedType v) {
  auto rbits = make_unique<RBits>(v);
  // Range of random #s includes both bounds: [0, 99]
  MRandInt rnd(rbits.get(), 99);
  
#if LEFT == 1
  exclude = regions[0];
#elif LEFT == 2 || LEFT == 6 || LEFT == 7 || LEFT == 10 || LEFT == 11 
  exclude = long_regs[0];
#elif LEFT == 3 || LEFT == 4 || LEFT == 5 || LEFT == 8 || LEFT == 9
  exclude = short_regs[0];
#else
  static_assert(false, "Missing case in init " ## LEFT);
#endif

  src.reserve(size);
  for (int i = 0; i < size; ++i) {
    auto v = rnd();
    #if LEFT == 1
    if (1 <= v && v <= 12) {
      src.push_back(ppair{regions[v], i});
      included++;
    }
    else
      src.push_back(ppair{regions[0], i});
    #elif LEFT == 2
    set_copied(v, 50, long_regs, i, included, src);
    #elif LEFT == 3
    set_copied(v, 50, short_regs, i, included, src);
    #elif LEFT == 4
    set_copied(v, 0, short_regs, i, included, src);
    #elif LEFT == 5
    set_copied(v, 100, short_regs, i, included, src);
    #elif LEFT == 6
    set_copied(v, 0, long_regs, i, included, src);
    #elif LEFT == 7
    set_copied(v, 100, long_regs, i, included, src);
    #elif LEFT == 8
    set_copied(v, 25, short_regs, i, included, src);
    #elif LEFT == 9
    set_copied(v, 75, short_regs, i, included, src);
    #elif LEFT == 10
    set_copied(v, 25, long_regs, i, included, src);
    #elif LEFT == 11
    set_copied(v, 75, long_regs, i, included, src);
    #else
    static_assert(false, "src proportions undefined");
    #endif
  }

  #if SORTED == 1
  std::sort(src.begin(), src.end());
  #elif SORTED == 2
  std::shuffle(src.begin(), src.end(), *rbits);
  #endif
  /* 
  // Following just for ensuring correctness
  for (const auto& p : src)
    std::cerr << '(' << p.first << ", " << p.second << ")\n";
  std::cerr << std::endl;
  */ 
  res.reserve(size);
}

int main(int argc, char** argv) {
  const auto size = SIZE;
  init(size, MRandInt::getSeed(true,std::cout));
  return nonius::main(argc, argv);
}

The following macros determined the compiled code:

LEFT
Selected the combination of string type and proportion of copies for a given run. Ranged from 1 to 11.
SIZE
Number of entries in src vector. Set to 50,000 for all benchmarks.
SORTED
Defined whether the src vector was in random order (SORTED==0), sorted (SORTED==1), or shuffled (SORTED==2). Both the random and sorted cases were used in benchmarks; the shuffled case was not used.

Class MRandInt generates random integers over a specified inclusive range. For these benchmarks, it generated values between 0 and 99 (inclusive), which were used to randomly select whether each element received the included or the excluded value. A different random seed was used for every run. The proportion of values copied consequently changed from run to run, but all runs were close to the target.
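
MRandInt and RBits are my own utility classes. A rough standard-library equivalent of what this setup does (an assumption for illustration, not the actual classes) is:

#include <random>

std::mt19937_64 engine{std::random_device{}()};  // plays the role of RBits
std::uniform_int_distribution<int> dist(0, 99);  // inclusive range [0, 99]
// Each call to dist(engine) plays the role of rnd() in init().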

I made a slight variation of the main Nonius function to allow me to call init before calling nonius::main.

Run conditions

All benchmarks were run on a guest Ubuntu 16.04 system, running under VirtualBox 5.0.20r106931, on a host Mac OS 10.12.4. The hardware was a MacBook Pro with a 2.4 GHz Intel Core i5 and 8 GB memory.

No other user-facing applications were running on either the guest or host system (though both had ample numbers of background daemons running) and the host machine had its network turned off.

Each run specified a Nonius sample size of 10,000 and the default bootstrap resample size of 100,000.

For every benchmark, the 95% confidence interval returned by the bootstrap was less than or equal to 3% of the sample mean.