Performance of the STL idiom for list comprehension: Introduction
02 May 2017 · Tags: C++

In my last post, I promised to consider the machine code generated from the STL idiom equivalent to a basic Python list comprehension. As I proceeded, the project expanded to a broader analysis of the idiom’s performance. I am shocked at how much effort this project took. I have spent much of the last five weeks generating and reading the output from g++ 6.2. So. Many. Alternatives. In this post, I’ll describe my baseline for comparison and the criteria I’ll use. In future posts, I’ll present the results.
All performance analyses are fraught. Conference speakers frequently argue that the gold standard is to run your actual code on your actual data, or as close to actual code and data as you can get. I think this claim is overly strong, but whatever its merits, by design this method provides no general comparison of two approaches to writing code, only specific results for your context.
My focus for this series has been more general: combining the standard algorithms and lambda expressions to create C++ expressions that approximate the concision of basic Python list comprehensions. The design of the C++ language and the STL components of the Standard Library purports to support abstractions that can be compiled to efficient object code. In principle, you get the benefits of abstraction with the efficiency of more concrete, machine-specific code. The general idiom for std::transform() that I presented in the last post had its share of abstractions: the transform algorithm itself, the std::optional type, and a custom OutputIterator to conditionally append values. What price do we pay for all that abstraction? And how might we best estimate the price?
In this series, I will address these questions using two approaches: analysis of the generated code and microbenchmarks. The machine code is the definitive indicator of how well the compiler was able to infer the simple structure underlying the abstractions and generate code for that essential structure. Microbenchmarks indicate how well the compiler was able to organize that structure into an operation sequence that executes efficiently on a specific microarchitecture and memory hierarchy. These approaches miss some aspects of performance, but together they capture many of the important ones.
Criteria for analysis of generated machine code
The criteria for quality of generated code are various and potentially contradictory. The most obvious criterion, code length, is only the most indirect approximation of execution efficiency. Execution time is heavily influenced by locality of reference, given the roughly 100-fold penalty of accessing main memory instead of L1 cache. Locality of reference can be improved both by tightening the range of locations accessed by the code and by making such accesses more amenable to hardware prefetch.
Furthermore, the proper focus of the analysis is the loop code. Assuming the idiom transforms long sequences, the bulk of its time will be consumed by the loop. Accordingly, I’ll emphasize the following metrics when assessing the machine code for the transform loop:
- Number of instruction cache lines to represent the code. For loops such as the code of interest in this analysis, the lines will be cache-resident for every iteration after the first, so their access cost will typically not be a major factor, although if one or more external routines are called mid-loop, these lines could be flushed and require reloading on the next iteration.
- Number of data cache lines accessed for local values. As with the instruction cache lines, for a loop these lines will typically be cache-resident for every iteration except the first, so their access cost will not contribute substantially to performance. Note: In the rest of this series, I will use "cache" for the data cache, reserving "instruction cache" for the few places where I mean the instruction cache.
- Number of heap allocations and deallocations, each of which will typically require access to multiple lines not currently in the cache. This count is an indirect measure of the cost of those main memory accesses.
- Number of data cache lines accessed for values on the heap. Unlike accesses to local variables, accesses to heap values have a far higher likelihood of requiring access to main memory.
- Predictability of data accesses for hardware prefetch. This feature can partially mitigate the cost of heap accesses. If the hardware prefetch unit can load heap data into the cache before the references occur, a stall can be averted.
I will assume that instruction and data cache lines are 64 bytes, the typical size for processors implementing the x86_64 instruction set. The number of cache lines required for a block of data or instructions ranges from ceil(size/64) to ceil(size/64)+1, depending upon the block’s alignment.
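To make the alignment dependence concrete, here is a minimal sketch of that arithmetic (the function and names are mine, for illustration):

#include <cstddef>
#include <cstdint>

constexpr std::size_t kLineSize = 64;  // assumed x86_64 cache line size

// Number of 64-byte lines touched by a block of `size` > 0 bytes starting
// at address `addr`: between ceil(size/64) and ceil(size/64)+1, depending
// on where the block starts within its first line.
std::size_t lines_touched(std::uintptr_t addr, std::size_t size) {
    std::uintptr_t first = addr / kLineSize;               // line holding first byte
    std::uintptr_t last  = (addr + size - 1) / kLineSize;  // line holding last byte
    return last - first + 1;
}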
Factors contributing to these metrics
The metrics listed above are influenced by the algorithm, the encoding in C++, the library implementation of the data types, and the compiler code generator. My focus in these posts is to compare two C++ encodings of the same algorithm, one using the STL paradigm of standard algorithms and iterators, the other using basic C++ facilities (but still using the Standard Library string and vector classes). Thus the effects of the algorithm are held constant. The effects of the encoding in C++ and the efficiency of its generated code are direct questions in this work.
Finally, the library implementation of the data types is orthogonal to the above. As we will see, the choice of data type has major effects on the generated code, the pattern of cache misses, and its execution speed. In short, the answer to the question “Can the compiler generate code for the transform idiom as efficient as for the idiom using basic control structures?” depends strongly on what is being transformed. A source vector whose province names are represented as char const * null-terminated strings is easier for the optimizer than a source vector whose province names are represented as std::string instances as implemented by libstdc++ 6.2, and both generate substantially different code from, say, Version 4.x of the same library, which lacked a short-string optimization.
Due to the strong effects of data type implementation, I will analyze and benchmark versions of the two idioms operating on different data types. Initially, I will use the type I presented in the earlier posts, a std::pair<std::string,int>. In future posts, I will extend the analysis to pairs of other types.
Implementation of std::string in libstdc++ 6.2
As we will see, the implementation of std::string has a large effect on the efficiency of the code and even on the optimizer’s ability to compile down the std::transform and std::optional<> abstractions. To understand the machine code and estimate its accesses to main memory, we need to understand the string implementation in the library.
The 64-bit-address implementation of std::string in libstdc++ 6.2 represents a string using a combination of a 32-byte handle and an optional heap block. The handle has four fields:
Hex offset | Name used in this series | Field name in library code | Purpose
--- | --- | --- | ---
0x0 | str_buff | _M_p | Pointer to string body, either _M_local_buf or a heap buffer
0x8 | size | _M_string_length | String length, not counting the null terminator
0x10 | zstring | _M_local_buf | String body for strings of fewer than 16 characters (null-terminated; union with _M_allocated_capacity)
0x10 | capacity | _M_allocated_capacity | Capacity of the heap buffer, if allocated (union with _M_local_buf)
The handle includes an 8-byte pointer to the string body, which is null-terminated, and an 8-byte count of the characters (not including the terminating null).
Strings of less than 16 characters are directly stored in the remaining 16 bytes of the handle. For these strings, the pointer at the head of the handle points to the first of these bytes. Strings of 16 or more characters are stored in a heap-allocated block, which can be larger than the string. The pointer in the handle indicates the first byte of this block, while the second half of the handle includes an 8-byte count of the heap block capacity.
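To make the layout concrete, here is a simplified sketch of the handle as a C++ struct (an illustration based on the table above, not the actual libstdc++ source, which wraps these fields in accessors and an allocator):

#include <cstddef>

// Simplified sketch of the libstdc++ 6.2 string handle described above.
struct string_handle {
    char*       _M_p;              // 0x00: pointer to the string body
    std::size_t _M_string_length;  // 0x08: length, excluding the null terminator
    union {                        // 0x10: the last 16 bytes do double duty
        char        _M_local_buf[16];       // body for strings < 16 chars
        std::size_t _M_allocated_capacity;  // heap buffer capacity otherwise
    };
};
static_assert(sizeof(string_handle) == 32, "handle occupies 32 bytes");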
This structure is designed to minimize heap accesses and, by implication, cache misses. It supports a fast check for string inequality: if two strings have unequal lengths, which can be determined simply by looking at the handles, there is no need to access the heap values. If two strings have the same length and fewer than 16 characters, their bodies can be compared with fast accesses within the string handles, but comparing equal-length strings of 16 or more characters requires access to heap values, typically loading two or more cache lines.
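In code, the inequality check looks roughly like this (an illustrative sketch of the logic, using my own function name, not the actual libstdc++ source):

#include <cstring>
#include <string>

// Unequal lengths are detected from the handles alone; only equal-length
// strings need their bodies compared, and only long bodies live on the heap.
bool strings_equal(const std::string& a, const std::string& b) {
    if (a.size() != b.size())       // handle-only check, no body access
        return false;
    return std::memcmp(a.data(), b.data(), a.size()) == 0;  // body access
}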
Move constructors and move assignments are fast for all strings, even those whose bodies are on the heap, as the operation only requires copying the fields in the handle.
Copy constructors and copy assignments are fast for short strings, as the operation simply copies the local values in the handles. For longer strings, however, both operations will require an expensive heap allocation, potentially incurring multiple cache line loads. Once this has completed, the actual copy operations should access the already-loaded lines.
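A short illustration of that asymmetry (the names and values here are my own examples, chosen to fall on either side of the 16-character threshold):

#include <string>
#include <utility>

std::string short_name = "Saskatchewan";               // 12 chars: body in the handle
std::string long_name  = "Newfoundland and Labrador";  // 25 chars: body on the heap

std::string c1 = short_name;            // copy: 32-byte handle copy, no heap work
std::string c2 = long_name;             // copy: requires a new heap allocation
std::string m1 = std::move(long_name);  // move: copies handle fields, no allocation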
The comparison baseline
The original idiom embedded conditionals into abstractions (the std::transform algorithm, the std::optional type, and the opt_back_insert_iterator) to allow a filter-and-transform algorithm to be expressed declaratively. The baseline comparison exposes the conditionals directly as standard C++ control structures. Here is the filter-and-transform algorithm implemented using basic control structures:
using ppair = std::pair<std::string, int>;
using plist = std::vector<ppair>;

plist res {};
res.reserve(src.size());              // size res once; no reallocation in the loop
for (auto& p : src) {
    if (p.first != us)                // filter: skip the excluded province
        res.emplace_back(p.first, p.second * p.second);  // transform: square the value
}
This version does a range for over src and an if to select the elements to add to res. The for index is by reference and res is extended using vector::emplace_back(), minimizing intermediate copies. The code isn’t particularly long and is arguably simpler than the transform idiom, at least if you have never seen the transform before.
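For readers who haven’t seen it, here is a self-contained sketch of that transform idiom, reconstructed from the description above (the opt_back_insert_iterator here is my minimal stand-in for the custom OutputIterator from the last post; names and details may differ from the original):

#include <algorithm>
#include <iterator>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using ppair = std::pair<std::string, int>;
using plist = std::vector<ppair>;

// Minimal stand-in for the custom OutputIterator: assigning an engaged
// optional appends its value; assigning an empty optional is a no-op.
template <typename Container>
struct opt_back_insert_iterator {
    Container* c;
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = void;
    using pointer = void;
    using reference = void;
    opt_back_insert_iterator& operator*() { return *this; }
    opt_back_insert_iterator& operator++() { return *this; }
    opt_back_insert_iterator operator++(int) { return *this; }
    opt_back_insert_iterator&
    operator=(std::optional<typename Container::value_type>&& v) {
        if (v) c->push_back(std::move(*v));
        return *this;
    }
};

template <typename Container>
opt_back_insert_iterator<Container> opt_back_inserter(Container& c) {
    return {&c};
}

plist filter_and_square(const plist& src, const std::string& us) {
    plist res;
    res.reserve(src.size());
    std::transform(src.begin(), src.end(), opt_back_inserter(res),
                   [&us](const ppair& p) -> std::optional<ppair> {
                       if (p.first == us) return std::nullopt;      // filter
                       return ppair{p.first, p.second * p.second};  // transform
                   });
    return res;
}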
Machine code for the basic C++
Here is the annotated gdb disassembly of the machine code generated at -O2 for the loop (the range for and its body) of the basic C++ source presented above. The code was generated by g++ 6.2 and its associated version of libstdc++, compiled with only the -g -O2 options, for 64-bit Ubuntu 16.04. I found that -O3 optimization increased code length substantially (40% more bytes). I will include -O3 builds in the benchmarks but use the shorter -O2 output for annotation.
// LOOP
// 654 - 536 = 118 bytes = 2-3 instruction cache lines
// All (non-terminating) branches are local to loop body
// Register usage
// %rbx == src_i
// %r13 == res_i
// %r14 == src.end()
// Local variables
// Range for local variables used in loop: 0x70 = 112 bytes = 2-3 data cache
// lines
// (includes 0xc+0x8+0x8 = 0x1c = 28 bytes = 0-1 extra cache lines for
// alignment padding)
// Local variable offsets (hexadecimal) from %rsp:
// c Rvalue SQUARE (value ^ 2)
// 10 src
// 10 begin()
// 18 end()
// 20 cap_end() Pointer to byte following the reserved capacity
// 30 res
// 30 begin()
// 38 end()
// 40 cap_end() Pointer to byte following the reserved capacity
// 50 us
// 50 str_buff
// 58 size
// 60 Union
// zstring 16-byte buffer for null-terminated strings of length < 16
// capacity 8-byte size of heap buffer for strings of length >= 16
// Heap references (arrays for src and res, strings longer than 15 chars)
// src and res: Single pass in monotonically-increasing order
// (data prefetching should work)
// strings: single memcpy call for every string > 15 chars
// (most likely two distinct heap locations = 2 data cache lines)
// When us.len > 15:
// single memcmp call for every string of same length as us
// (most likely two distinct heap locations = 2 data cache lines)
// rvalue SQUARE <- src_i->value ^ 2
<+536>: mov 0x20(%rbx),%eax
<+539>: imul %eax,%eax %eax <- src_i->value ^ 2
<+542>: cmp %r13,0x40(%rsp) res_i ==? res.cap_end()
<+547>: mov %eax,0xc(%rsp) SQUARE <- value ^ 2
<+551>: je 0x4010fe <main()+942> Jump if need to extend res (never taken in this idiom)
<+557>: test %r13,%r13
<+560>: je 0x400fa6 <main()+598> Jump if res has not been allocated (never taken in this idiom)
// *res_i <- (src_i->province, rvalue SQUARE)
<+562>: lea 0x10(%r13),%rax %rax <- & res_i->province.zstring
<+566>: mov %r13,%rdi %rdi <- & res_i->province
<+569>: mov %rax,0x0(%r13) res_i->str_buff <- & res_i->province.zstring
<+573>: mov (%rbx),%rsi %rsi <- src_i->province.str_buff
<+576>: lea (%rsi,%rbp,1),%rdx %rdx <- src_i->province.str_buff + src_i->province.size+1
<+580>: callq std::string::_M_construct<char*>(char*, char*, std::forward_iterator_tag) HEAP STRING REFERENCE (for string > 15 chars)
<+585>: mov 0xc(%rsp),%eax
<+589>: mov %eax,0x20(%r13) res_i->value <- SQUARE
// Increment res_i, src_i, check strings, take branches
<+593>: mov 0x38(%rsp),%r13 (Unnecessary load, %r13 already contains res_i)
<+598>: add $0x28,%r13 res_i++
<+602>: mov %r13,0x38(%rsp) res.end() <- res_i
<+607>: add $0x28,%rbx src_i++
<+611>: cmp %rbx,%r14 src_i ==? src.end()
<+614>: je 0x400fe0 <main()+656> At end, start sort
<+616>: mov 0x8(%rbx),%rbp %rbp <- src_i->province.size
<+620>: cmp 0x58(%rsp),%rbp us.size ==? src_i->province.size
<+625>: jne 0x400f68 <main()+536> Jump if strings different sizes
<+627>: test %rbp,%rbp
<+630>: je 0x400faf <main()+607> Go to next src_i if src_i->province.size == 0
<+632>: mov 0x50(%rsp),%rsi %rsi <- us.str_buff
<+637>: mov (%rbx),%rdi %rdi <- src_i->province.str_buff
<+640>: mov %rbp,%rdx %rdx <- src_i->province.size
<+643>: callq 0x400cd0 <memcmp@plt> Compare string contents HEAP STRING REFERENCE (for string > 15 chars)
<+648>: test %eax,%eax
<+650>: jne 0x400f68 <main()+536> Strings not equal: Store value
<+652>: jmp 0x400faf <main()+607> Strings equal: Go to next value
The code is annotated by lines beginning with // and by comments starting in column 41, many of which require horizontal scrolling to read completely.
The generated code is short and straightforward, requiring 2–3 lines of the instruction cache. Other than branches to terminate the loop, all conditionals in the loop body branch to locations within the body.
The local variables require only 112 bytes, 2–3 data cache lines. These lines are likely to be loaded during the first iteration and remain in the cache for the loop duration.
Heap references have the greatest potential to slow down the code, as they may require access to locations that have not been recently accessed, causing last-level cache misses.
For the two vectors, whose elements are stored in blocks on the heap, the machine code makes the fewest possible such accesses, walking through them in a single pass each, a pattern amenable to hardware prefetch.
As described above, the std::string type can incur multiple cache misses, depending upon the lengths of the strings and how many long strings have the same length but different contents. In every case, the above code makes the minimum number of such accesses.
This baseline demonstrates that the filter-and-transform algorithm, written with basic C++ control structures and the Standard Library string and vector, can generate straightforward code. Indeed, most of the complexity in the above performance analysis is the complexity of the performance of the std::string type, which stems in turn from optimizations in the string algorithms. As we will see, using char const * types generates simpler code, but that code can produce more cache misses. But before we consider alternative data types, we need to explore the code produced by the std::transform idiom for the data types used in the baseline. Spoiler: It’s more complex.