The elegant machine code for char const *
24 May 2017 Tags: C++Unique strings—null-terminated strings uniquely identified by their char const *—are the simplest type considered in this series of posts. Constructing, copying, moving, and assigning these strings are trivial operations on a single pointer. No destructor code is required.
This simplicity was reflected in the benchmarks. Unique strings were
substantially faster than std::string and they
generated identical code and the same performance whether passed by-value or by-reference.
Even when presented this simple data type, however, the transform idiom was two times slower than the basic loop.
In this post, I explore the machine code generated by the two idioms for unique strings. For both idioms, the generated code is much simpler for unique strings than for std::string.  On the other hand, the ability of the loop idiom to inline vector::emplace_back() appears to give the loop its speed advantage over transform, which does not inline the call.
Machine code for the loop
The benchmark form of the loop idiom was:
std::size_t str_loop_emplace (plist& res) {
  assert ( ! res.size());
  for (auto& p : src) {
    if (p.first != exclude)
      res.emplace_back(p.first, p.second*p.second);
  }
  return res.size();
}
The loop (Lines 3–6) generates the following elegant machine code (the same code is generated for both by-value and by-reference versions):
// Loop
// %rbp: res_i
// %rbx: src_i
// %r14: & res
// %r15: src.end()
<+80>:	mov    (%rbx),%rdx               %rdx <- res_i->province
<+83>:	cmp    0x26c656(%rip),%rdx        # 0x673f70 <exclude>
<+90>:	je     0x40793f <+127>
<+92>:	mov    0x8(%rbx),%ecx            %ecx <- src_i->value
<+95>:	mov    0x10(%r14),%rsi           %rsi <- res.cap_end()
<+99>:	imul   %ecx,%ecx                 %ecx <- src_i->value ^ 2
<+102>:	cmp    %rbp,%rsi                 res_i ==? res.cap_end()
<+105>:	je     0x407968 <+168>           if at capacity, branch to
                                         reallocate vector
<+107>:	test   %rbp,%rbp                 res_i ==? 0
<+110>:	je     0x407937 <+119>
<+112>:	mov    %rdx,0x0(%rbp)            res_i->province <-
                                                res_i->province
<+116>:	mov    %ecx,0x8(%rbp)            res_i->value<-srt_i->value^2
<+119>:	add    $0x10,%rbp                ++res_i
<+123>:	mov    %rbp,0x8(%r14)            res.end() <- res_i
<+127>:	mov    %r8,%rax                  %rax <- res.begin()
<+130>:	add    $0x10,%rbx                ++src_i
<+134>:	cmp    %rbx,%r15                 src_i ==? src.end()
<+137>:	jne    0x407910 <+80>
The code allocates %rbx to the source iterator to and %rbp to the result iterator, walking each through their respective vectors. Instructions – (highlighted) test whether the result iterator has reached the vector’s capacity, branching to  if it has.  The code at  (not shown) allocates a larger array, copies the values, and deletes the old array.
The caller of this function has already reserved enough space in the result vector, so the branch to  is never taken.  In actual runs, the instructions from – are the only ones executed for the loop.  This tight, simple code accounts for the high performance observed in the unique string benchmarks.
Machine code for transform
The benchmark form of transform had two parts. The loop part:
std::size_t str_option(plist& res) {
  assert ( ! res.size());
  std::transform(src.cbegin(), src.cend(), opt_back_inserter(res),
    [=](auto& p) {
      if (p.first == exclude)
        return oppair();
      else
        return oppair(ppair{p.first, p.second*p.second});
    } );
}
and the part extending the result vector (identical code was produced from a call to vector::push_back(), which simply expanded to a call to vector::emplace_back()):
  template<typename T, typename = _not_self<T>>
    opt_back_emplace_iterator<Container>&
    operator=(T&& t)
    {
      if ((t.has_value())
        container->emplace_back(std::forward<T>(t).value());
      return *this;
    }
There are two crucial differences between this source and the loop source:
- The two aspects, filtering and appending, are separate functions communicating via the temporary result from the lambda expression.  The loop code, by contrast, uses the single value pfor both filtering and appending.
- The transformidiom resolves to a different instance ofvector::emplace_back()from the instance resolved by the loop. Thetransformidiom resolves to the form taking a single value (in this case, the rvalue temporary result of the lambda), while the loop resolves to the form taking the arguments to thestd::pair()constructor, which is called to directly construct the pair at the result vector's next available location.
These differences combine to generate different code from the loop. The code merges transform and the output iterator into a single routine, but it retains a call to emplace_back() rather than inlining the routine. This code is twice as slow as its loop counterpart:
// Loop
// %rsp: RT
//    0x0  char const *
//    0x8  value
//    0x10 has_value
// %rbx: src_i
// %rbp: src.end()
// %r12: & res
<+64>:  mov    (%rbx),%rdx               %rdx <- char const *
<+67>:  cmp    0x26cf56(%rip),%rdx        # 0x674f70 <exclude>
<+74>:  mov    0x8(%rbx),%eax            %eax <- src_i->value
<+76>:  je     0x40803a <+106>
<+79>:  imul   %eax,%eax                 %eax <- src_i->value ^ 2
<+82>:  mov    %rsp,%rsi                 %rsi <- & RT
<+85>:  mov    %r12,%rdi                 %rdi <- & res
<+89>:  mov    %rdx,(%rsp)               RT.province <- char const * 
<+94>:  movb   $0x1,0x10(%rsp)           RT.has_value <- 1
<+97>:  mov    %eax,0x8(%rsp)            RT.value <- src_i->value ^ 2
<+101>: callq  0x413590 <std::vector::emplace_back(std::pair<char const*, int>&&)>
<+106>: add    $0x10,%rbx                ++src_i
<+110>: cmp    %rbx,%rbp                 src_i ==? src.end()
<+113>: jne    0x408010 <+64>
The above code calls the single-argument version of emplace_back() (highlighted line), which is straightforward for the case where the vector has remaining capacity:
// vec: this (std::vector<std::pair<char const *, int>>*)
// RT: parameter to emplace_back (std::pair<char const*, int>&&)
// %rax: vec.end()
<+0>:  mov    0x8(%rdi),%rax            %rax <- vec.end()
<+4>:  cmp    0x10(%rdi),%rax           vec.cap_end() ==? vec.end()
<+8>:  je     0x4131b0 <+48>            call emplace_back_aux to allocate larger buffer, copy, and delete old
<+10>: test   %rax,%rax                 vec.end() ==? 0
<+13>: je     0x41319d <+29>            Skip emplace if == 0 [Never taken in this program]
<+15>: mov    (%rsi),%r9                %r9 <- RT.province (8-byte char const *)
<+18>: mov    0x8(%rsi),%r10            %r10 <- RT.value (8 bytes moved, of which 4 are value)
<+22>: mov    %r9,(%rax)                vec.end()->province <- RT.province
<+25>: mov    %r10,0x8(%rax)            vec.end()->value <- RT.value
<+29>: add    $0x10,%rax                
<+33>: mov    %rax,0x8(%rdi)            ++vec.end()
<+37>: retq   
<+38>: nopw   %cs:0x0(%rax,%rax,1)
<+48>: jmpq   0x413060 <std::vector<>::_M_emplace_back_aux<std::pair<char const*, int> >(std::pair<char const*, int>&&)>
Most of the machine code for the transform loop matches that for the for loop.  This code also allocates %rbx to the source vector iterator.  However, in this case no register is allocated to the iterator for the result vector. Instead,  %r12 is allocated to the address of the result vector.
The different register allocation results from the lack of inlining.  Where the loop code inlined the call to the multiple-argument version of emplace_back(), the transform idiom’s single-argument version is retained as a function call in at Instruction  (highlighted).
In emplace_back(), the next available location is loaded into %rax, the values are stored indirectly through that register, and vec.end() is incremented.  For every copied value, this code incurs an extra subroutine call, an extra load from memory, and an extra store from memory.  Of these three, the subroutine call is likely the largest contributor to transform’s slower performance, as the memory references will consistently resolve to the L1 cache. All three would have been eliminated by inlining the function.
Conclusion
The simple structure of the char const * datatype allows the gcc optimizer to strut its stuff.  The code for both the loop and transform is simple and elegant, their respective algorithms pared to essentials.  For this case, the C++ metaprogramming facilities allow the Standard Template Library to fulfill its promise of an abstract notation that nonetheless can generate code precisely targeted to the provided type. The range for over a std::vector<std::pair> generates code as efficient as any C program directly iterating pointers.
Yet there remains a cost for a higher level of abstraction, the std::transform algorithm returning a std::optional temporary to a conditional output iterator. In this case, the optimizer did not inline the call to emplace_back() and the cost of that call appreciably slowed the resulting code. Even for this simple data type, the STL abstraction imposed a cost.