The elegant machine code for char const *
24 May 2017 Tags: C++

Unique strings—null-terminated strings uniquely identified by their char const *—are the simplest type considered in this series of posts. Constructing, copying, moving, and assigning these strings are trivial operations on a single pointer. No destructor code is required.
This simplicity was reflected in the benchmarks. Unique strings were
substantially faster than std::string
and they
generated identical code and the same performance whether passed by-value or by-reference.
Even when presented with this simple data type, however, the transform idiom was two times slower than the basic loop.
In this post, I explore the machine code generated by the two idioms for unique strings. For both idioms, the generated code is much simpler for unique strings than for std::string
. On the other hand, the ability of the loop idiom to inline vector::emplace_back()
appears to give the loop its speed advantage over transform
, which does not inline the call.
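The code in this post refers to several names whose definitions do not appear here: plist, ppair, oppair, src, and exclude. A minimal sketch of plausible definitions, inferred from the listings below (the exact types are my assumption, not taken from the original):

#include <optional>
#include <utility>
#include <vector>

// Assumed definitions, inferred from the listings in this post.
using ppair  = std::pair<char const*, int>;  // (unique string, value)
using plist  = std::vector<ppair>;           // source and result container type
using oppair = std::optional<ppair>;         // temporary returned by the lambda

plist src;                      // source data, populated by the benchmark harness
char const* exclude = nullptr;  // the unique string filtered out of the results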
Machine code for the loop
The benchmark form of the loop idiom was:
std::size_t str_loop_emplace (plist& res) {
    assert ( ! res.size());
    for (auto& p : src) {
        if (p.first != exclude)
            res.emplace_back(p.first, p.second*p.second);
    }
    return res.size();
}
The loop (Lines 3–6) generates the following elegant machine code (the same code is generated for both by-value and by-reference versions):
// Loop
// %rbp: res_i
// %rbx: src_i
// %r14: & res
// %r15: src.end()
<+80>: mov (%rbx),%rdx                  %rdx <- src_i->province
<+83>: cmp 0x26c656(%rip),%rdx # 0x673f70 <exclude>
<+90>: je 0x40793f <+127>
<+92>: mov 0x8(%rbx),%ecx %ecx <- src_i->value
<+95>: mov 0x10(%r14),%rsi %rsi <- res.cap_end()
<+99>: imul %ecx,%ecx %ecx <- src_i->value ^ 2
<+102>: cmp %rbp,%rsi res_i ==? res.cap_end()
<+105>: je 0x407968 <+168>              if at capacity, branch to reallocate vector
<+107>: test %rbp,%rbp res_i ==? 0
<+110>: je 0x407937 <+119>
<+112>: mov %rdx,0x0(%rbp)              res_i->province <- src_i->province
<+116>: mov %ecx,0x8(%rbp)              res_i->value <- src_i->value^2
<+119>: add $0x10,%rbp ++res_i
<+123>: mov %rbp,0x8(%r14) res.end() <- res_i
<+127>: mov %r8,%rax %rax <- res.begin()
<+130>: add $0x10,%rbx ++src_i
<+134>: cmp %rbx,%r15 src_i ==? src.end()
<+137>: jne 0x407910 <+80>
The code allocates %rbx to the source iterator and %rbp to the result iterator, walking each through their respective vectors. Instructions <+95>–<+105> test whether the result iterator has reached the vector’s capacity, branching to <+168> if it has. The code at <+168> (not shown) allocates a larger array, copies the values, and deletes the old array.

The caller of this function has already reserved enough space in the result vector, so the branch to <+168> is never taken. In actual runs, instructions <+80>–<+137> are the only ones executed for the loop. This tight, simple code accounts for the high performance observed in the unique string benchmarks.
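In other words, the calling convention assumed above amounts to something like the following sketch (the actual benchmark harness is not shown in this post):

std::size_t run_loop_benchmark() {
    plist res;
    res.reserve(src.size());      // capacity for every source element, so the
                                  // reallocation branch to <+168> never fires
    return str_loop_emplace(res);
}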
Machine code for transform
The benchmark form of transform
had two parts. The loop part:
std::size_t str_option(plist& res) {
    assert ( ! res.size());
    std::transform(src.cbegin(), src.cend(), opt_back_inserter(res),
        [=](auto& p) {
            if (p.first == exclude)
                return oppair();
            else
                return oppair(ppair{p.first, p.second*p.second});
        } );
    return res.size();
}
and the part extending the result vector (identical code was produced from a call to vector::push_back()
, which simply expanded to a call to vector::emplace_back()
):
template<typename T, typename = _not_self<T>>
opt_back_emplace_iterator<Container>&
operator=(T&& t)
{
    if (t.has_value())
        container->emplace_back(std::forward<T>(t).value());
    return *this;
}
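Only the assignment operator appears above. For context, here is a minimal sketch of how the surrounding conditional output iterator and its opt_back_inserter() factory could look; the operator= is from the post, while the rest (including using std::enable_if in place of the post's _not_self trait) is my assumption:

#include <iterator>
#include <type_traits>
#include <utility>

// Sketch of a conditional back-emplace iterator: assigning an empty optional is
// a no-op; assigning an engaged optional emplaces its value into the container.
template<typename Container>
class opt_back_emplace_iterator {
public:
    using iterator_category = std::output_iterator_tag;
    using value_type        = void;
    using difference_type   = void;
    using pointer           = void;
    using reference         = void;

    explicit opt_back_emplace_iterator(Container& c) : container(&c) {}

    // Stand-in for the post's _not_self<T>: reject the iterator's own type so
    // that ordinary copy assignment still works.
    template<typename T,
             typename = std::enable_if_t<
                 !std::is_same_v<std::decay_t<T>, opt_back_emplace_iterator>>>
    opt_back_emplace_iterator& operator=(T&& t)
    {
        if (t.has_value())
            container->emplace_back(std::forward<T>(t).value());
        return *this;
    }

    // Output-iterator boilerplate: dereference and increment are no-ops.
    opt_back_emplace_iterator& operator*()     { return *this; }
    opt_back_emplace_iterator& operator++()    { return *this; }
    opt_back_emplace_iterator& operator++(int) { return *this; }

private:
    Container* container;
};

template<typename Container>
opt_back_emplace_iterator<Container> opt_back_inserter(Container& c)
{
    return opt_back_emplace_iterator<Container>(c);
}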
There are two crucial differences between this source and the loop source:

- The two aspects, filtering and appending, are separate functions communicating via the temporary result from the lambda expression. The loop code, by contrast, uses the single value p for both filtering and appending.
- The transform idiom resolves to a different instance of vector::emplace_back() from the instance resolved by the loop. The transform idiom resolves to the form taking a single value (in this case, the rvalue temporary result of the lambda), while the loop resolves to the form taking the arguments to the std::pair() constructor, which is called to directly construct the pair at the result vector's next available location (see the sketch after this list).
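Concretely, the two call sites resolve to different emplace_back() instantiations. A side-by-side sketch, using the assumed aliases from the top of the post (p stands for one element of src):

#include <utility>

void show_emplace_overloads(plist& res, ppair const& p)
{
    // Loop idiom: forwards the std::pair constructor's arguments, so the pair
    // is constructed directly in the vector's next available slot.
    res.emplace_back(p.first, p.second * p.second);

    // transform idiom: the lambda materializes an optional<pair> temporary and
    // the output iterator hands its value on as a single rvalue pair.
    oppair tmp{ppair{p.first, p.second * p.second}};
    res.emplace_back(std::move(tmp).value());
}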
These differences combine to generate different code from the loop. The code merges transform
and the output iterator into a single routine, but it retains a call to emplace_back()
rather than inlining the routine. This code is twice as slow as its loop counterpart:
// Transform loop
// %rsp: RT
// 0x0 char const *
// 0x8 value
// 0x10 has_value
// %rbx: src_i
// %rbp: src.end()
// %r12: & res
<+64>: mov (%rbx),%rdx %rdx <- char const *
<+67>: cmp 0x26cf56(%rip),%rdx # 0x674f70 <exclude>
<+74>: mov 0x8(%rbx),%eax %eax <- src_i->value
<+76>: je 0x40803a <+106>
<+79>: imul %eax,%eax %eax <- src_i->value ^ 2
<+82>: mov %rsp,%rsi %rsi <- & RT
<+85>: mov %r12,%rdi %rdi <- & res
<+89>: mov %rdx,(%rsp) RT.province <- char const *
<+94>: movb $0x1,0x10(%rsp) RT.has_value <- 1
<+97>: mov %eax,0x8(%rsp) RT.value <- src_i->value ^ 2
<+101>: callq 0x413590 <std::vector::emplace_back(std::pair<char const*, int>&&)>
<+106>: add $0x10,%rbx ++src_i
<+110>: cmp %rbx,%rbp src_i ==? src.end()
<+113>: jne 0x408010 <+64>
The above code calls the single-argument version of emplace_back() (the callq at Instruction <+101>), which is straightforward for the case where the vector has remaining capacity:
// vec: this (std::vector<std::pair<char const *, int>>*)
// RT: parameter to emplace_back (std::pair<char const*, int>&&)
// %rax: vec.end()
<+0>: mov 0x8(%rdi),%rax %rax <- vec.end()
<+4>: cmp 0x10(%rdi),%rax vec.cap_end() ==? vec.end()
<+8>: je 0x4131b0 <+48> call emplace_back_aux to allocate larger buffer, copy, and delete old
<+10>: test %rax,%rax vec.end() ==? 0
<+13>: je 0x41319d <+29> Skip emplace if == 0 [Never taken in this program]
<+15>: mov (%rsi),%r9 %r9 <- RT.province (8-byte char const *)
<+18>: mov 0x8(%rsi),%r10 %r10 <- RT.value (8 bytes moved, of which 4 are value)
<+22>: mov %r9,(%rax) vec.end()->province <- RT.province
<+25>: mov %r10,0x8(%rax) vec.end()->value <- RT.value
<+29>: add $0x10,%rax
<+33>: mov %rax,0x8(%rdi) ++vec.end()
<+37>: retq
<+38>: nopw %cs:0x0(%rax,%rax,1)
<+48>: jmpq 0x413060 <std::vector<>::_M_emplace_back_aux<std::pair<char const*, int> >(std::pair<char const*, int>&&)>
Most of the machine code for the transform
loop matches that for the for
loop. This code also allocates %rbx
to the source vector iterator. However, in this case no register is allocated to the iterator for the result vector. Instead, %r12
is allocated to the address of the result vector.
The different register allocation results from the lack of inlining. Where the loop code inlined the call to the multiple-argument version of emplace_back(), the transform idiom’s single-argument version is retained as a function call at Instruction <+101>.
In emplace_back()
, the next available location is loaded into %rax
, the values are stored indirectly through that register, and vec.end()
is incremented. For every copied value, this code incurs an extra subroutine call, an extra load from memory, and an extra store to memory. Of these three, the subroutine call is likely the largest contributor to transform’s slower performance, as the memory references will consistently resolve to the L1 cache. All three would have been eliminated by inlining the function.
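In C++ terms, the fast path of that call corresponds roughly to the sketch below; the member names are mine, not libstdc++'s, and the slow path is elided:

#include <new>
#include <utility>

struct mini_vec {
    using value_type = std::pair<char const*, int>;

    value_type* begin_   = nullptr;
    value_type* end_     = nullptr;   // vec.end()
    value_type* cap_end_ = nullptr;   // vec.cap_end()

    void emplace_back(value_type&& v)
    {
        if (end_ == cap_end_) {              // <+4>: capacity check
            grow_and_emplace(std::move(v));  // <+48>: reallocate, copy, delete old
            return;
        }
        ::new (static_cast<void*>(end_)) value_type(std::move(v));  // <+22>/<+25>
        ++end_;                              // <+29>/<+33>: bump vec.end()
    }

    // Slow path (corresponds to _M_emplace_back_aux): allocate a larger buffer,
    // move the elements across, delete the old buffer, then emplace. Omitted.
    void grow_and_emplace(value_type&&) {}
};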
Conclusion
The simple structure of the char const *
datatype allows the gcc optimizer to strut its stuff. The code for both the loop and transform
is simple and elegant, their respective algorithms pared to essentials. For this case, the C++ metaprogramming facilities allow the Standard Template Library to fulfill its promise of an abstract notation that nonetheless can generate code precisely targeted to the provided type. The range for
over a std::vector<std::pair>
generates code as efficient as any C program directly iterating pointers.
Yet there remains a cost for a higher level of abstraction, the std::transform
algorithm returning a std::optional
temporary to a conditional output iterator. In this case, the optimizer did not inline the call to emplace_back()
and the cost of that call appreciably slowed the resulting code. Even for this simple data type, the STL abstraction imposed a cost.