Analysis of code generated from the transform idiom

The last post presented the baseline C++ code and its generated machine code. Now it’s time to compare the code generated by the std::transform idiom and see how well it fares.

Removing two inefficiencies

I’m going to analyze a slightly improved version of the transform idiom. When I examined the code generated by the version I presented originally, I saw two simple inefficiencies I’d left in, two rookie errors:

  1. The lambda parameter `p` was passed by copy rather than by reference, constructing a new string (and possibly making a heap allocation) for every invocation.
  2. The `string("US")` expression inside the lambda filter converted a `char const *` to a `std::string` on every invocation, just to create a constant string.

I improved the code by eliminating these conversions from the loop. I eliminated the first by simply passing the lambda parameter by-reference.

I could not eliminate the second conversion altogether, so I moved it outside the transform call, declaring `const string us` and then capturing it by-copy in the lambda.
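
For reference, the original lambda looked roughly like this (reconstructed from the two points above; the exact code is in the previous post):

std::transform(src.cbegin(), src.cend(), opt_back_inserter(res),
  [](auto p) {                      // p taken by copy: a string is copy-constructed
                                    // (possibly with a heap allocation) per element
    if (p.first == string("US"))    // a temporary std::string is built per call
      return oppair();
    else
      return oppair(ppair{p.first, p.second*p.second});
  } );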

Here is the revised code, with the changed lines highlighted:

using ppair = std::pair<string,int>;
using oppair = optional<ppair>;
using plist = std::vector<ppair>;

const string us = "US";

std::transform(src.cbegin(), src.cend(), opt_back_inserter(res),
  [=](auto& p) {
    if (p.first == us)
      return oppair();
    else
      return oppair(ppair{p.first, p.second*p.second});
  } );

The copy-capture of us in the lambda expression only generates a single copy-constructor call. The above code compiles to something like:

{
  struct lambda_type {
    string l_us;
    lambda_type(const string& us) : l_us(us) {}
    oppair operator()(const ppair& p) const { /* ... l_us ... */ }
  };
  lambda_type lambda {us};
  std::transform(..., lambda);
}

The member variable l_us is constructed once, before the call to transform, and its destructor is called immediately after transform returns. The calls to lambda_type::operator()(const ppair& p) inside transform reference l_us efficiently as a local variable, internally represented as offsets 0x80–0x9f from the stack pointer %rsp in the machine code. The extra l_us variable does increase the stack requirements by 32 bytes, though.

Machine code generated by the more efficient version

With those changes made, the transform call is compiled to the following machine code (again, gcc 6.2 with libstdc++, compiled with -O2, as disassembled by gdb). It’s a bit of a slog, so just glance at it and I’ll see you on the other side:

[code gutter="false"]
// LOOP
// 941 - 600 = 341 bytes (6-7 instruction cache lines)
//         PLUS 69 bytes in outside branch targets (2 instruction cache lines)
//           = 410 bytes (8-9 instruction cache lines)

// Register usage
// %rax == res_i (for most of the loop body, lines +721 to end)
// %rbx == src_i
// %r12 == & RT.province.zstring
// %r13 == src.end()
// %r14 == & RT
// %r15 == & T1.province.zstring

// Local variables
// Range for local variables used in loop: 0x100 = 256 bytes = 4-5 data cache
//                                                             lines
// (includes 0x18+0x8+0x28 = 0x48 = 72 bytes = 1-2 extra cache lines for
//                                                 padding and locals unused
//                                                 in the loop)
// Local variable offsets (hexadecimal) from %rsp:
//  0 TEMP0 (saves %rcx across call to memcmp)
//(0x18 bytes of padding and other locals)
// 20 src
//   20 begin()
//   28 end()
//   30 cap_end() Pointer to byte following the reserved capacity
//( 8 bytes of padding)
// 40 res
//   40 begin()
//   48 end()
//   50 cap_end() Pointer to byte following the reserved capacity
//(0x28 bytes of padding and other locals)
// 80 l_us lambda capture by copy (copy-constructed once, before loop body,
//                                 destructor called after loop body)
//   80 str_buff Pointer to string value (null-terminated)
//   88 string length (not counting null)
//   90 Union
//      zstring 16-byte buffer for null-terminated strings of length < 16
//      capacity 8-byte size of heap buffer for strings of length >= 16
// a0 T1 Rvalue ppair constructed in lambda via aggregate initialization
//   a0 province
//     a0 str_buff Pointer to string value (null-terminated)
//     a8 string length (not counting null)
//     b0 Union
//      zstring 16-byte buffer for null-terminated strings of length < 16
//      capacity 8-byte size of heap buffer for strings of length >= 16
//   c0 value
// d0 RT move-constructed parameter (type oppair) t of
//      opt_back_emplace_iterator::operator=(T&& t)
//   d0 province
//     d0 str_buff Pointer to string value (null-terminated)
//     d8 string length (not counting null)
//     e0 Union
//      zstring 16-byte buffer for null-terminated strings of length < 16
//      capacity 8-byte size of heap buffer for strings of length >= 16
//   f0 value
//   f8 has_value

// Heap references (arrays for src and res, strings longer than 15 chars)
// src and res: Single pass in monotonically-increasing order (data prefetching
//   should work)
// strings: single memcpy call for every string > 15 chars (most likely two
//   distinct heap locations = 2 data cache lines)
//          When us.len > 15:
//            single memcmp call for every string of same length as us
//            (most likely two distinct heap locations = 2 data cache lines)

// Body: Build optional<pair<string,int>> and append to res (reached only when the optional holds a value)
// T1 = (src_i->province, src_i->value ^ 2)
<+600>:	mov    0x20(%rbx),%ebp           %ebp <- src_i->value
<+603>:	mov    %r15,0xa0(%rsp)           T1.province.addr <- &T1.province.zstring
<+611>:	lea    0xa0(%rsp),%rdi           %rdi <- &T1.province
<+619>:	mov    (%rbx),%rsi               %rsi <- &src_i->province.zstring
<+622>:	imul   %ebp,%ebp                 %ebp <- src_i->value ^ 2
<+625>:	lea    (%rsi,%rcx,1),%rdx        %rdx <- &src_i->province.zstring[len] (one past the last character)
<+629>:	callq  0x401550 std::string::_M_construct<char*>(char*, char*, std::forward_iterator_tag) HEAP STRING REFERENCE (for string > 15 chars)
<+634>:	mov    0xa0(%rsp),%rax           %rax <- & T1.province.zstring
<+642>:	mov    %ebp,0xc0(%rsp)           T1.value <- src_i->value ^ 2
// RT = (T1.province, src_i->value ^ 2, true)
<+649>:	mov    %r12,0xd0(%rsp)           RT.province.str_buff <- & RT.province.zstring
<+657>:	cmp    %r15,%rax                 Was T1.str_buff changed?
<+660>:	je     0x401238 <main()+1256>    Jump if no change (taken when string is of length < 16)
/*
  Direct move of T1.province.str_buff to
  RT.province.str_buff. T1.province has no destructor called. RT is
  directly initialized and T1 "never really existed". This approach
  allows RT.province to be initialized without any heap accesses.
 */
<+666>:	mov    %rax,0xd0(%rsp)           RT.province.str_buff <- T1.province.str_buff
<+674>:	mov    0xb0(%rsp),%rax
<+682>:	mov    %rax,0xe0(%rsp)           RT.province.capacity <- T1.province.capacity
<+690>:	mov    0xa8(%rsp),%rax           %rax <- T1.province.len
<+698>:	mov    %ebp,0xf0(%rsp)           RT.value <- src_i->value ^ 2
<+705>:	movb   $0x1,0xf8(%rsp)           RT.optional <- true
<+713>:	mov    %rax,0xd8(%rsp)           RT.province.len <- T1.province.len
// *res_i = (RT.province, RT.value)
// For rest of loop body, %rax == res_i
<+721>:	mov    0x48(%rsp),%rax           %rax <- res.end()
<+726>:	cmp    0x50(%rsp),%rax           res.cap_end() ==? res.end()
<+731>:	je     0x401295 <main()+1349>    Jump if equal; extend res (never taken, sufficient space reserved)
<+737>:	test   %rax,%rax                 
<+740>:	je     0x40108f <main()+831>     Jump if res.end() == nullptr (never taken, space has already been reserved)
<+742>:	lea    0x10(%rax),%rdx           %rdx <- &res.end().zstring
<+746>:	mov    %rdx,(%rax)               res.end()->str_buff <- &res.end().zstring

<+749>:	mov    0xd0(%rsp),%rdx           %rdx <- RT.province.str_buff
<+757>:	cmp    %r12,%rdx                 Is RT buffer local?
<+760>:	je     0x401260 <main()+1296>    Jump if equal (string length < 16): copy the 16 bytes of
                                         RT.province.zstring to res_i->province.zstring;
                                         otherwise fall through and move-assign the heap buffer:
                                         res_i->province <- RT.province
/* 
   Unlike RT <- T1 assignment above, res_i->province <- RT.province is
   an actual move assignment.  RT.str_buff is copied to
   res_i->province.str_buff (transferring ownership of the heap
   object) and later (see below) RT.province will be reset to an empty
   string and still later have its destructor called.  The destructor
   will not reference the heap because RT.province is a null string.
*/
  <+766>:	mov    %rdx,(%rax)               res_i->province.str_buff <- RT.province.str_buff
  <+769>:	mov    0xe0(%rsp),%rdx           
  <+777>:	mov    %rdx,0x10(%rax)           res_i->province.capacity <- RT.province.capacity
                                           // endif
<+781>:	mov    0xd8(%rsp),%rdx
<+789>:	mov    %rdx,0x8(%rax)            res.end()->province.len <- RT.province.len
<+793>:	mov    0xf0(%rsp),%edx           %edx <- RT.value
// RT.province <- "" due to move assignment to res.end()->province
<+800>:	mov    %r12,0xd0(%rsp)           RT.province.str_addr <- & RT.province.zstring
<+808>:	movq   $0x0,0xd8(%rsp)           RT.province.strlen <- 0
<+820>:	movb   $0x0,0xe0(%rsp)           RT.province.zstring <- '\000'
<+828>:	mov    %edx,0x20(%rax)           res_i->value <- RT.value
// res_i++
<+831>:	addq   $0x28,0x48(%rsp)          res_i++
<+837>:	cmpb   $0x0,0xf8(%rsp)           (! RT.has_value())?  VESTIGIAL---never true
<+845>:	je     0x4010b1 <main()+865>     Branch if no value---never taken
// RT.province.~string()
<+847>:	mov    0xd0(%rsp),%rdi           %rdi <- RT.province.str_addr
<+855>:	cmp    %r12,%rdi                 &RT.province.zstring ==? RT.province.str_buff (always true due to move assignment)
<+858>:	je     0x4010b1 <main()+865>     Always taken (due to move assignment)
 <+860>:	callq  0x400c00 <_ZdlPv@plt> operator delete()@plt Delete non-local string buffer for RT.province (never necessary)

// Increment counter and check for loop completion
// %rbx      == src_i
// %r13      == src.end()

// Increment and check src_i
<+865>:	add    $0x28,%rbx                src_i++
<+869>:	cmp    %rbx,%r13                 src_i ==? src.end()
<+872>:	je     0x401100 <main()+944>     Cleanup: l_us.~string()

// Process *src_i: Check src_i->province ==? l_us
<+874>:	mov    0x8(%rbx),%rcx            %rcx <- src_i->province.len()
<+878>:	cmp    0x88(%rsp),%rcx           l_us.len() ==? src_i->province.len()
<+886>:	jne    0x400fa8 <main()+600>     => Not equal: Build result
<+892>:	test   %rcx,%rcx                 src_i->province ==? "" (and also l_us, because equal lengths)
<+895>:	je     0x4010b1 <main()+865>     => Equal null: Move to next (never taken)
<+897>:	mov    0x80(%rsp),%rsi           %rsi <- &l_us.zstring
<+905>:	mov    (%rbx),%rdi               %rdi <- &src_i->province.zstring
<+908>:	mov    %rcx,%rdx                 %rdx <- src_i->province.len()
<+911>:	mov    %rcx,(%rsp)               TEMP0 <- %rcx
<+915>:	callq  0x400cd0 <memcmp@plt>     Compare string contents HEAP STRING REFERENCE (for string > 15 chars)
<+920>:	test   %eax,%eax                 (%eax != 0 => not equal)
<+922>:	mov    (%rsp),%rcx               %rcx <- TEMP0
<+926>:	jne    0x400fa8 <main()+600>     Not equal: Build result

// src_i->province == l_us: increment and check src_i, loop to process new value
<+932>:	add    $0x28,%rbx                src_i++
<+936>:	cmp    %rbx,%r13                 src_i ==? src.end()
<+939>:	jne    0x4010ba <main()+874>     Not equal => Go to next
[/code]

The above is the main loop body.  We're not done though, as there are these two external branch targets 317 bytes later:

[code gutter="false"]
// RT.province.zstring <- T1.province.zstring (Short string optimization)
<+1256>:	mov    0xb0(%rsp),%rax           %rax <- zstring[0:7]
<+1264>:	mov    0xb8(%rsp),%rdx           %rdx <- zstring[8:15]
<+1272>:	mov    %rax,0xe0(%rsp)           RT.province.zstring <- (%rax, %rdx)
<+1280>:	mov    %rdx,0xe8(%rsp)
<+1288>:	jmpq   0x401002 <main()+690>

// Align next branch target
<+1293>:	nopl   (%rax)

// res_i->province.zstring <- RT.province.zstring (Short string optimization)
<+1296>:	mov    0xe0(%rsp),%rsi           %rsi <- zstring[0:7]
<+1304>:	mov    0xe8(%rsp),%rdi           %rdi <- zstring[8:15]
<+1312>:	mov    %rsi,0x10(%rax)           res_i->province.zstring <- (%rsi, %rdi)
<+1316>:	mov    %rdi,0x18(%rax)
<+1320>:	jmpq   0x40105d <main()+781>
[/code]

Whoa, that’s a lot more code than the basic C++ version generated! Just over 3.4 times more code, in fact, and 2.3 times more stack storage for local variables and temporaries.

The increased stack storage points to the key contributor to the increased code size: there are a lot more local variables and temporaries in this version. Given that these locals typically include a std::string member, a lot of code goes into managing that complex type. The basic version uses a single 4-byte temporary for the squared integer value, while the transform idiom generates an 8-byte temporary to save a register across the memcmp call, plus 128 bytes of temporary values (l_us, T1, and RT).
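
To put those 128 bytes in perspective, the sizes follow directly from the types involved. Here is a small check (assuming x86-64, libstdc++’s post-C++11 string layout, and C++17’s std::optional standing in for the optional type used above):

[code gutter="false"]
#include <cstdio>
#include <optional>
#include <string>
#include <utility>

using ppair  = std::pair<std::string, int>;
using oppair = std::optional<ppair>;

int main() {
    // libstdc++ string: pointer + length + 16-byte SSO union = 32 bytes
    std::printf("string: %zu\n", sizeof(std::string)); // 32
    // pair<string,int>: 32 + 4 + 4 bytes of padding   = 40 bytes
    std::printf("ppair:  %zu\n", sizeof(ppair));        // 40
    // optional<ppair>: 40 + has_value flag + padding   = 48 bytes
    std::printf("oppair: %zu\n", sizeof(oppair));       // 48
    // l_us (32) + T1 (40, padded to 48 in this stack layout) + RT (48)
    // account for the 128 bytes at offsets 0x80-0xff from %rsp.
    return 0;
}
[/code]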

The temporaries increase register pressure, as well. In the basic version, src_i and res_i, the iterators for src and res, both live in registers, as does src.end(), used in the loop termination test. The transform version needs all of these but also dedicates registers (%r12 and %r15) to the addresses of the internal string buffers of temporaries RT (the result of the lambda, of type optional<ppair>) and T1 (the temporary ppair created to pass to the RT constructor). In fact, the register pressure is strong enough that the register holding iterator res_i has to be refreshed from local storage in Line 721.

Temporary T1 seems to enjoy the odd semi-existence of Eric the Half-a-Bee, constructed in Lines 600–642 but never in fact destroyed. Instead its values are simply moved whole-cloth into temporary result RT in Lines 649–713 and the branch target at Lines 1256–1288, without any destructor code generated. This is distinctly different from RT, which is move-constructed into the element of res referenced by iterator res_i (held in register %rax) in Lines 721–828. In this sequence, Lines 800–820 reset RT.province to the null string, as libstdc++’s string move operations do to their source (the standard itself guarantees only a valid but unspecified state). Later, Lines 847–860 implement the destructor RT.province.~string(). Neither of these steps, the reset to the null string nor the destructor, is generated for T1.
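
A quick illustration of that reset (the final assert holds on libstdc++ but is not guaranteed by the standard):

[code gutter="false"]
#include <cassert>
#include <string>
#include <utility>

int main() {
    std::string a = "a province name long enough to defeat the small-string optimization";
    std::string b = std::move(a);  // ownership of the heap buffer transfers to b
    // libstdc++ resets the moved-from string to empty, pointing back at its
    // own SSO buffer -- the same reset performed at Lines 800-820 above.
    assert(a.empty());             // true on libstdc++; not guaranteed portably
    assert(b.size() > 15);
    return 0;
}
[/code]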

For the potentially slowest operations, references to the heap causing cache misses, the two implementations are identical, making the same sequence of references to the elements of src and res.

The compiler has done a marvellous job of collapsing function calls. The transform idiom uses only a single stack frame, the same as the basic code. Function inlining has successfully collapsed all the layers of function calls imposed by the Standard Library’s abstractions. This is no small feat: in the unoptimized (-O0) version of the machine code, the deepest call stack has eight levels, as reported by gdb.

The costs of so many temporaries and parameters

But the many layers of function abstraction obstruct the optimizer in another way: Each layer requires that a parameter be passed, whether by move- or copy-construction or by reference. I considered this problem when comparing different idioms for appending values. Consider the steps initiated by the body of the loop in std::transform, the following expression:

*res_i = lambda(*src_i); // Using my names for the iterators

where res_i (called __result inside libstdc++’s transform) is an iterator of type opt_back_insert_iterator<vector<optional<ppair>>>.

The lambda function terminates with the statement

return oppair(ppair{p.first, p.second*p.second})

to compute the result of lambda(*src_i).

In libstdc++ 6.2, the return statement initiates the following cascade of function calls:

  1. The temporary ppair is aggregate-initialized. This in turn requires copy-constructing the std::string component and copying the int component of the ppair at *src_i. Move constructors cannot be used because the value in src must not be modified.
  2. The temporary oppair (of optional<ppair> type) is explicitly constructed using the Standard Library optional<ppair>::optional(ppair&& p).
  3. The function opt_back_insert_iterator<vector<optional<ppair>>>::operator=(optional<ppair>&&t) is invoked.
  4. This invokes the move constructor optional<ppair>::optional(optional&&) to construct the parameter.
  5. The optional move constructor in turn invokes the move constructor for std::string and copies the int value.
  6. The rvalue has its member function optional<pair>::value() invoked, which returns an rvalue reference to the pair constructed in the first step.
  7. The resulting rvalue reference is passed as the parameter to vector<ppair>::push_back(ppair&&).
  8. vector<ppair>::push_back(ppair&&) simply calls vector<ppair>::emplace_back(ppair&&), which constructs the ppair value in the element at res.end(), incrementing the end marker.
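
For readers who have not seen the earlier post, the inserter looks roughly like the following sketch. The real opt_back_insert_iterator differs in its details (for one, this simplified version binds its parameter by rvalue reference and so does not reproduce the extra parameter construction of steps 4 and 5), and it uses C++17’s std::optional to stay self-contained:

[code gutter="false"]
#include <iterator>
#include <optional>
#include <utility>

// Rough sketch of an optional-filtering back-inserter; the real class is
// defined in an earlier post and its details may differ.
template <typename Container>
class opt_back_insert_iterator {
public:
    using elem_type = typename Container::value_type;

    // Output-iterator boilerplate.
    using iterator_category = std::output_iterator_tag;
    using value_type        = void;
    using difference_type   = void;
    using pointer           = void;
    using reference         = void;

    explicit opt_back_insert_iterator(Container& c) : container_(&c) {}

    // Step 3: receive the optional produced by the lambda.
    opt_back_insert_iterator& operator=(std::optional<elem_type>&& t) {
        // Steps 6-8: when a value is present, move the contained pair
        // into the destination vector; otherwise drop it.
        if (t)
            container_->push_back(std::move(*t));
        return *this;
    }

    // No-ops required of an output iterator.
    opt_back_insert_iterator& operator*()     { return *this; }
    opt_back_insert_iterator& operator++()    { return *this; }
    opt_back_insert_iterator& operator++(int) { return *this; }

private:
    Container* container_;
};

template <typename Container>
opt_back_insert_iterator<Container> opt_back_inserter(Container& c) {
    return opt_back_insert_iterator<Container>(c);
}
[/code]

In this sketch, with res declared as a plist (vector<ppair>), opt_back_inserter(res) yields an iterator whose assignment drops the disengaged optionals and appends the contained pairs.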

Ultimately, all this machinery is invoked to do a near-trivial operation: Copy a std::string and an int from an element of src to an element of res. The core C++ language and the Standard Library introduce many features in the hope of allowing the programmer to write at the abstract level and have the code compile down to the simple level. But given all the above layers, I am unsurprised that gcc 6.2 could not eliminate them all. The optimizer successfully compressed the hierarchy of function calls into a single stack frame but could not compress the hierarchy of constructors, leaving residue like the unnecessary T1 temporary. How much did this extra code affect performance? In the next post, I will begin presenting microbenchmarks comparing the basic and the transform implementations of the filter-and-transform algorithm.