Performance of the STL idiom for list comprehension: Introduction

In my last post, I promised to consider the machine code generated from the STL idiom equivalent to a basic Python list comprehension. As I proceeded, the project expanded to a broader analysis of the idiom’s performance. I am shocked at how much effort this project took. I have spent much of the last five weeks generating and reading the output from g++ 6.2. So. Many. Alternatives. In this post, I’ll describe my baseline for comparison and the criteria I’ll use. In future posts, I’ll present the results.

All performance analyses are fraught. Conference speakers frequently argue that the gold standard is to run your actual code on your actual data, or as close to actual code and data as you can get. I think this claim is overly strong, but whatever its merits, by design this method provides no general comparison of two approaches to writing code, only specific results for your context.

My focus for this series has been more general, on combining the standard algorithms and lambda expressions to create C++ expressions that approximate the concision of basic Python list comprehensions. The design of the C++ language and the STL components of the Standard Library purports to support abstractions that can be compiled to efficient object code. In principle, you get the benefits of abstraction with the efficiency of more concrete, machine-specific code. The general idiom for std::transform() that I presented in the last post had its share of abstractions: the transform algorithm itself, the std::optional type, and a custom OutputIterator to conditionally append values. What price do we pay for all that abstraction? And how might we best estimate the price?

In this series, I will address these questions using two approaches: analysis of the generated code and microbenchmarks. The machine code is the definitive indicator of how well the compiler was able to infer the simple structure underlying the abstractions and generate code for that essential structure. Microbenchmarks indicate how well the compiler was able to organize that structure into an operation sequence that executes efficiently on a specific microarchitecture and memory hierarchy. There are aspects that these approaches miss, but together they capture much of what matters for performance.

Criteria for analysis of generated machine code

The criteria for quality of generated code are various and potentially contradictory. The most obvious criterion, code length, is only the most indirect approximation of execution efficiency. Execution time is heavily influenced by locality of reference, given the roughly 100-fold penalty of accessing main memory instead of L1 cache. Locality of reference can be improved both by tightening the range of locations accessed by the code and by making such accesses more amenable to hardware prefetch.

Furthermore, the proper focus of the analysis is the loop code. Assuming the idiom transforms long sequences, the bulk of its time will be consumed by the loop. Accordingly, I'll emphasize the following metrics when assessing the machine code for the transform loop:

- Length of the loop code, in bytes and instruction cache lines
- Locality of branches (whether branches in the loop body stay within the body)
- Range of locations accessed for local variables, in bytes and data cache lines
- Number and pattern of heap references, the principal source of potential cache misses

I will assume that instruction and data cache lines are 64 bytes, the typical size for processors implementing the x86_64 instruction set. The number of cache lines required for a block of data or instructions ranges from ceil(size/64) to ceil(size/64) + 1, depending upon the block's alignment.
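This bound can be checked with a small helper (my own illustration, not from the posts) that counts the 64-byte lines spanned by a block, given its starting address:

```cpp
#include <cstddef>
#include <cstdint>

// Number of 64-byte cache lines touched by a block of `size` bytes
// beginning at address `addr`.
std::size_t lines_touched(std::uintptr_t addr, std::size_t size) {
    std::uintptr_t first = addr / 64;               // line holding the first byte
    std::uintptr_t last  = (addr + size - 1) / 64;  // line holding the last byte
    return last - first + 1;
}
```

A 100-byte block touches ceil(100/64) = 2 lines when it starts on a 64-byte boundary, but 3 lines when it straddles an extra boundary (for example, when it starts at offset 60).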

Factors contributing to these metrics

The metrics listed above are influenced by the algorithm, the encoding in C++, the library implementation of the data types, and the compiler code generator. My focus in these posts is to compare two C++ encodings of the same algorithm, one using the STL paradigm of standard algorithms and iterators, the other using basic C++ facilities (but still using the Standard Library string and vector classes). Thus the effects of the algorithm are held constant. The effects of the encoding in C++ and the efficiency of its generated code are direct questions in this work.

Finally, the library implementation of the data types is orthogonal to the above. As we will see, the choice of data type has major effects on the generated code, the pattern of cache misses, and the execution speed. In short, the answer to the question "Can the compiler generate code for the transform idiom as efficient as for the idiom using basic control structures?" depends strongly on what is being transformed. A source vector whose province names are represented using char const * null-terminated strings is easier for the optimizer than a source vector whose province names are represented using std::string instances as implemented by libstdc++ 6.2, and both generate substantially different code from, say, version 4.x of the same library, which lacked a short-string optimization.

Due to the strong effects of data type implementation, I will analyze and benchmark versions of the two idioms operating on different data types. Initially, I will use the type I presented in the earlier posts, a std::pair<std::string,int>. In future posts, I will extend the analysis to pairs of other types.

Implementation of std::string in libstdc++ 6.2

As we will see, the implementation of std::string has a large effect on the efficiency of the code and even of the optimizer’s ability to compile down the std::transform and std::optional<> abstractions. To understand the machine code and estimate its accesses to main memory, we need to understand the string implementation in the library.

The 64-bit-address implementation of std::string in libstdc++ 6.2 represents a string using a combination of a 32-byte data type handle and an optional heap block. The handle has four fields:

Fields for std::string in libstdc++ 6.2
Hex offset  Name used in this series  Field name in library code  Purpose
0x0         str_buff                  _M_p                        Pointer to string body, either _M_local_buf or heap buffer
0x8         size                      _M_string_length            String length, not counting null terminator
0x10        zstring                   _M_local_buf                String body for strings < 16 characters (null-terminated; union with _M_allocated_capacity)
0x10        capacity                  _M_allocated_capacity       Capacity of heap buffer, if allocated (union with _M_local_buf)

The handle includes an 8-byte pointer to the string body, which is null-terminated, and an 8-byte count of the characters (not including the terminating null).

Strings of less than 16 characters are directly stored in the remaining 16 bytes of the handle. For these strings, the pointer at the head of the handle points to the first of these bytes. Strings of 16 or more characters are stored in a heap-allocated block, which can be larger than the string. The pointer in the handle indicates the first byte of this block, while the second half of the handle includes an 8-byte count of the heap block capacity.

This structure is designed to minimize heap accesses and, by implication, cache misses. It supports a fast check for string inequality: if two strings have unequal lengths, which can be determined by looking only at the handles, there is no need to access the heap. If two strings have the same length and are shorter than 16 characters, their bodies can be compared with fast accesses to the handles, but strings of equal length of 16 or more characters require access to heap values, typically loading two or more cache lines.
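The short-string layout can be observed indirectly without peeking at private fields: for a short string, data() points inside the handle itself, while for a long string it points to a heap block. The check below is my own illustration (the exact short-string threshold varies by library, so I use clearly short and clearly long strings):

```cpp
#include <string>

// True when the string body lives inside the string object itself
// (the short-string case), false when it lives on the heap.
bool body_in_handle(std::string const& s) {
    char const* obj = reinterpret_cast<char const*>(&s);
    return s.data() >= obj && s.data() < obj + sizeof(s);
}
```

On libstdc++ 6.2, `body_in_handle(std::string("Ontario"))` is true (7 characters fit in the 16-byte local buffer), while a 40-character string gets a heap-allocated body.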

Move constructors and move assignments are fast for all strings, even those whose bodies are on the heap, as the operation only requires copying the fields in the handle.

Copy constructors and copy assignments are fast for short strings, as the operation simply copies the local values in the handles. For longer strings, however, both operations will require an expensive heap allocation, potentially incurring multiple cache line loads. Once this has completed, the actual copy operations should access the already-loaded lines.
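The difference is visible from the body pointer: a move hands the existing heap buffer to the destination, while a copy must allocate a fresh one and copy the bytes. This small demonstration is my own sketch, relying on the libstdc++ behavior described above (the standard does not strictly guarantee pointer transfer on move):

```cpp
#include <string>
#include <utility>

// Returns true when moving a long string transfers its heap buffer
// and copying it allocates a new one.
bool move_transfers_copy_allocates() {
    std::string orig(40, 'x');            // long enough for a heap body
    char const* body = orig.data();

    std::string moved = std::move(orig);  // move: handle fields copied,
                                          //   heap buffer changes owners
    bool transferred = (moved.data() == body);

    std::string copied = moved;           // copy: fresh allocation,
                                          //   then the bytes are copied
    bool allocated = (copied.data() != moved.data()) && (copied == moved);

    return transferred && allocated;
}
```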

The comparison baseline

The original idiom embedded conditionals into abstractions (the std::transform algorithm, the std::optional type, and the opt_back_insert_iterator) to allow a filter-and-transform algorithm to be expressed declaratively. The baseline comparison exposes the conditionals directly as standard C++ control structures. Here is the filter-and-transform algorithm implemented using basic control structures:

using ppair = std::pair<std::string,int>;
using plist = std::vector<ppair>;

plist res {};
res.reserve(src.size());

for (auto& p : src) {
  if (p.first != us)
    res.emplace_back(p.first, p.second*p.second);
}

This version does a range for over src and an if to select the elements to add to res. The loop variable p is a reference, and res is extended using vector::emplace_back(), minimizing intermediate copies. The code isn't particularly long and is arguably simpler than the transform idiom, at least if you have never seen the transform before.
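For readers who did not see the earlier posts, the competing transform idiom looks roughly like the sketch below. The opt_back_insert_iterator here is my reconstruction of the idea (an output iterator that appends only engaged optionals), not necessarily the exact class from those posts, and I use C++17 std::optional where g++ 6.2 offered std::experimental::optional:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using ppair = std::pair<std::string,int>;
using plist = std::vector<ppair>;

// Output iterator that appends a value to a container only when the
// assigned std::optional is engaged.
template <typename Container>
struct opt_back_insert_iterator {
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    explicit opt_back_insert_iterator(Container& cont) : c(&cont) {}

    opt_back_insert_iterator&
    operator=(std::optional<typename Container::value_type> const& v) {
        if (v) c->push_back(*v);              // drop disengaged values
        return *this;
    }
    opt_back_insert_iterator& operator*()     { return *this; }
    opt_back_insert_iterator& operator++()    { return *this; }
    opt_back_insert_iterator  operator++(int) { return *this; }

    Container* c;
};

// The filter-and-transform algorithm, expressed declaratively.
plist filter_and_square(plist const& src, std::string const& us) {
    plist res;
    res.reserve(src.size());
    std::transform(src.begin(), src.end(),
                   opt_back_insert_iterator<plist>(res),
                   [&us](ppair const& p) -> std::optional<ppair> {
                       if (p.first == us) return std::nullopt;
                       return ppair{p.first, p.second * p.second};
                   });
    return res;
}
```

The conditional lives inside the lambda's return type: elements equal to us map to std::nullopt, which the output iterator silently discards.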

Machine code for the basic C++

Here is the annotated gdb disassembly of the machine code generated at -O2 for the loop (Lines 7–9 of the listing above) of the basic C++ source. The code was generated by g++ 6.2 and its associated version of libstdc++, compiled with only the -g -O2 options, for 64-bit Ubuntu 16.04. I found that -O3 optimization increased the code length substantially (40% more bytes). I will include the -O3 version in the benchmarks but use the shorter -O2 output for annotation.

// LOOP
// 654 - 536 = 118 bytes = 2-3 instruction cache lines
// All (non-terminating) branches are local to loop body

// Register usage
// %rbx == src_i
// %r13 == res_i
// %r14 == src.end()

// Local variables
// Range for local variables used in loop: 0x70 = 112 bytes = 2-3 data cache
//   lines
// (includes 0xc+0x8+0x8 = 0x1c = 28 bytes = 0-1 extra cache lines for
//   alignment padding)
// Local variable offsets (hexadecimal) from %rsp:
//  c Rvalue SQUARE (value ^ 2)
// 10 src
//   10 begin()
//   18 end()
//   20 cap_end() Pointer to byte following the reserved capacity
// 30 res
//   30 begin()
//   38 end()
//   40 cap_end() Pointer to byte following the reserved capacity
// 50 us
//   50 str_buff
//   58 size
//   60 Union
//      zstring 16-byte buffer for null-terminated strings of length < 16
//      capacity 8-byte size of heap buffer for strings of length >= 16

// Heap references (arrays for src and res, strings longer than 15 chars)
// src and res: Single pass in monotonically-increasing order
// (data prefetching should work)
// strings: single memcpy call for every string > 15 chars
// (most likely two distinct heap locations = 2 data cache lines)
//          When us.len > 15:
//            single memcmp call for every string of same length as us
//            (most likely two distinct heap locations = 2 data cache lines)

// rvalue SQUARE <- src_i->value ^ 2
<+536>:	mov    0x20(%rbx),%eax          
<+539>:	imul   %eax,%eax                 %eax <- src_i->value ^ 2
<+542>:	cmp    %r13,0x40(%rsp)           res_i ==? res.cap_end()
<+547>:	mov    %eax,0xc(%rsp)            SQUARE <- value ^ 2
<+551>:	je     0x4010fe <main()+942>     Jump if need to extend res (never taken in this idiom)
<+557>:	test   %r13,%r13
<+560>:	je     0x400fa6 <main()+598>     Jump if res has not been allocated (never taken in this idiom)
// *res_i <- (src_i->province, rvalue SQUARE)
<+562>:	lea    0x10(%r13),%rax           %rax <- & res_i->province.zstring
<+566>:	mov    %r13,%rdi                 %rdi <- & res_i->province
<+569>:	mov    %rax,0x0(%r13)            res_i->str_buff <- & res_i->province.zstring
<+573>:	mov    (%rbx),%rsi               %rsi <- src_i->province.str_buff
<+576>:	lea    (%rsi,%rbp,1),%rdx        %rdx <- src_i->province.str_buff + src_i->province.size (one past the last character)
<+580>:	callq  std::string::_M_construct<char*>(char*, char*, std::forward_iterator_tag) HEAP STRING REFERENCE (for string > 15 chars)
<+585>:	mov    0xc(%rsp),%eax
<+589>:	mov    %eax,0x20(%r13)           res_i->value <- SQUARE

// Increment res_i, src_i, check strings, take branches
<+593>:	mov    0x38(%rsp),%r13           (Unnecessary load, %r13 already contains res_i)
<+598>:	add    $0x28,%r13                res_i++  
<+602>:	mov    %r13,0x38(%rsp)           res.end() <- res_i

<+607>:	add    $0x28,%rbx                src_i++
<+611>:	cmp    %rbx,%r14                 src_i == src.end()
<+614>:	je     0x400fe0 <main()+656>     At end, start sort
<+616>:	mov    0x8(%rbx),%rbp            %rbp <- src_i->province.size
<+620>:	cmp    0x58(%rsp),%rbp           us.size ==? src_i->province.size
<+625>:	jne    0x400f68 <main()+536>     Jump if strings different sizes
<+627>:	test   %rbp,%rbp
<+630>:	je     0x400faf <main()+607>     Go to next src_i if src_i->province.size == 0
<+632>:	mov    0x50(%rsp),%rsi           %rsi <- us.str_buff
<+637>:	mov    (%rbx),%rdi               %rdi <- src_i->province.str_buff
<+640>:	mov    %rbp,%rdx                 %rdx <- src_i->province.size
<+643>:	callq  0x400cd0 <memcmp@plt>     Compare string contents HEAP STRING REFERENCE (for string > 15 chars)
<+648>:	test   %eax,%eax
<+650>:	jne    0x400f68 <main()+536>     Strings not equal: Store value
<+652>:	jmp    0x400faf <main()+607>     Strings equal: Go to next value

The code is annotated by lines beginning with // and comments starting in Column 41, many of which require horizontal scrolling to read completely.

The generated code is short and straightforward, requiring 2–3 lines of the instruction cache. Other than branches to terminate the loop, all conditionals in the loop body branch to locations within the body.

The local variables require only 112 bytes, 2–3 data cache lines. These lines are likely to be loaded during the first iteration and remain in the cache for the loop duration.

Heap references have the greatest potential to slow down the code, as they may require access to locations that have not been recently accessed, causing last-level cache misses.

For the two vectors, whose elements are stored in blocks on the heap, the machine code makes the fewest possible such accesses, walking through them in a single pass each, a pattern amenable to hardware prefetch.

As described above, the std::string types can incur multiple cache misses, depending upon the lengths of the strings and how many long strings have the same length but different contents. In every case, the above code makes the minimum such accesses.

This baseline demonstrates that the filter-and-transform algorithm, written with basic C++ control structures and the Standard Library string and vector, can compile to straightforward code. Indeed, most of the complexity in the above analysis is the complexity of std::string's performance characteristics, which stems in turn from optimizations in the string algorithms. As we will see, using char const * types generates simpler code but can produce more cache misses. But before we consider alternative data types, we need to explore the code produced by the std::transform idiom for the data types used in the baseline. Spoiler: It's more complex.