Analysis of code generated from the transform idiom
03 May 2017 Tags: C++

The last post presented the baseline C++ code and its generated machine code. Now it’s time to compare the code generated by the `std::transform` idiom and see how well it fares.
Removing two inefficiencies
I’m going to analyze a slightly improved version of the `transform` idiom. When I examined the code generated by the version I presented originally, I saw two simple inefficiencies I’d left in, two rookie errors:
- The lambda parameter `p` is passed by copy rather than by reference, constructing a new string, including a heap allocation, for every invocation.
- The `string("us")` expression inside the lambda filter inserted a conversion from `char const *` to `std::string` for every invocation, just to create a constant string.
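For reference, the lambda those two points describe would have looked something like this (a reconstruction from the description above, not the earlier post’s exact code, and using the same type aliases as the revised code below):

std::transform(src.cbegin(), src.cend(), opt_back_inserter(res),
    [](auto p) {                        // p taken by value: copy-constructs a string on every call
        if (p.first == string("us"))    // converts the literal to a std::string on every call
            return oppair();
        else
            return oppair(ppair{p.first, p.second*p.second});
    });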
I improved the code by eliminating these conversions from the loop. I eliminated the first by simply passing the lambda parameter by reference. I could not eliminate the second conversion altogether, so I moved it outside the `transform` call, declaring `const string us` and then capturing it by copy in the lambda.
Here is the revised code with the revised lines highlighted:
using ppair = std::pair<string,int>;
using oppair = optional<ppair>;
using plist = std::vector<ppair>;

const string us = "US";
std::transform(src.cbegin(), src.cend(), opt_back_inserter(res),
    [=](auto& p) {
        if (p.first == us)
            return oppair();
        else
            return oppair(ppair{p.first, p.second*p.second});
    });
The copy-capture of `us` in the lambda expression only generates a single copy-constructor call. The above code compiles to something like:
{
    struct lambda_type {
        string l_us;
        lambda_type(const string& us) : l_us(us) {}
        oppair operator()(const ppair& p) const { /* ... l_us ... */ }
    };
    lambda_type lambda {us};
    std::transform(..., lambda);
}
The member variable `l_us` is constructed once, before the call to `transform`, and its destructor is called immediately following. The calls to `lambda_type::operator()(const ppair& p)` inside `transform` are efficient references to a local variable, internally represented as offsets 0x80–0x9f from the stack pointer `%rsp` in the machine code. The extra `l_us` variable does increase the stack requirements by 32 bytes, though.
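Those 32 bytes reflect libstdc++’s string representation: a pointer to the character data, the length, and a union holding either the 16-byte small-string buffer or the heap capacity. A rough sketch of that layout (the field names are mine, taken from the annotations below, not the library’s):

#include <cstddef>

struct sso_string_sketch {
    char*       str_buff;   // points at zstring below, or at a heap buffer
    std::size_t len;        // length, not counting the terminating '\0'
    union {
        char        zstring[16];  // holds the characters when len < 16
        std::size_t capacity;     // heap buffer size when len >= 16
    };
};
static_assert(sizeof(sso_string_sketch) == 32, "matches the 32-byte l_us slot");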
Machine code generated by the more efficient version
With those changes made, the `transform` call is compiled to the following machine code (again, gcc 6.2 with `libstdc++`, compiled with `-O2`, as disassembled by gdb). It’s a bit of a slog, so just glance at it and I’ll see you on the other side:
// LOOP
// 941 - 600 = 341 bytes (6-7 instruction cache lines)
// PLUS 69 bytes in outside branch targets (2 instruction cache lines)
// = 410 bytes (8-9 instruction cache lines)
// Register usage
// %rax == res_i (for most of the loop body, lines +721 to end)
// %rbx == src_i
// %r12 == & RT.province.zstring
// %r13 == src.end()
// %r14 == & RT
// %r15 == & T1.province.zstring
// Local variables
// Range for local variables used in loop: 0x100 = 256 bytes = 4-5 data cache
// lines
// (includes 0x18+0x8+0x28 = 0x48 = 72 bytes = 1-2 extra cache lines for
//  padding and locals unused in the loop)
// Local variable offsets (hexadecimal) from %rsp:
// 0 TEMP0 (saves %rcx across call to memcmp)
//(18 bytes of padding and other locals)
// 20 src
// 20 begin()
// 28 end()
// 30 cap_end() Pointer to byte following the reserved capacity
//( 8 bytes of padding)
// 40 res
// 40 begin()
// 48 end()
// 50 cap_end() Pointer to byte following the reserved capacity
//(28 bytes of padding and other locals)
// 80 l_us lambda capture by copy (copy-constructed once, before loop body,
// destructor called after loop body)
// 80 str_buff Pointer to string value (null-terminated)
// 88 string length (not counting null)
// 90 Union
// zstring 16-byte buffer for null-terminated strings of length < 16
// capacity 8-byte size of heap buffer for strings of length >= 16
// a0 T1 Rvalue ppair constructed in lambda via aggregate initialization
// a0 province
// a0 str_buff Pointer to string value (null-terminated)
// a8 string length (not counting null)
// b0 Union
// zstring 16-byte buffer for null-terminated strings of length < 16
// capacity 8-byte size of heap buffer for strings of length >= 16
// c0 value
// d0 RT move-constructed parameter (type oppair) t of
// opt_back_emplace_iterator::operator=(T&& t)
// d0 province
// d0 str_buff Pointer to string value (null-terminated)
// d8 string length (not counting null)
// e0 Union
// zstring 16-byte buffer for null-terminated strings of length < 16
// capacity 8-byte size of heap buffer for strings of length >= 16
// f0 value
// f8 has_value
// Heap references (arrays for src and res, strings longer than 15 chars)
// src and res: Single pass in monotonically-increasing order (data prefetching
// should work)
// strings: single memcpy call for every string > 15 chars (most likely two
// distinct heap locations = 2 data cache lines)
// When us.len > 15:
// single memcmp call for every string of same length as us
// (most likely two distinct heap locations = 2 data cache lines)
// Body: Build optional<pair<string,int>> and append to res (value is ensured)
// T1 = (src_i->province, src_i->value ^ 2)
<+600>: mov 0x20(%rbx),%ebp %ebp <- src_i->value
<+603>: mov %r15,0xa0(%rsp) T1.province.addr <- &T1.province.zstring
<+611>: lea 0xa0(%rsp),%rdi %rdi <- &T1.province
<+619>: mov (%rbx),%rsi %rsi <- &src_i->province.zstring
<+622>: imul %ebp,%ebp %ebp <- src_i->value ^ 2
<+625>: lea (%rsi,%rcx,1),%rdx %rdx <- &src_i->province.zstring[len+1]
<+629>: callq 0x401550 std::string::_M_construct<char*>(char*, char*, std::forward_iterator_tag) HEAP STRING REFERENCE (for string > 15 chars)
<+634>: mov 0xa0(%rsp),%rax %rax <- & T1.province.zstring
<+642>: mov %ebp,0xc0(%rsp) T1.value <- src_i->value ^ 2
// RT = (T1.province, src_i->value ^ 2, true)
<+649>: mov %r12,0xd0(%rsp) RT.province.str_buff <- & RT.province.zstring
<+657>: cmp %r15,%rax Was T1.str_buff changed?
<+660>: je 0x401238 <main()+1256> Jump if no change (taken when string is of length < 16)
/*
Direct move of T1.province.str_buff to
RT.province.str_buff. T1.province has no destructor called. RT is
directly initialized and T1 "never really existed". This approach
allows RT.province to be initialized without any heap accesses
*/
<+666>: mov %rax,0xd0(%rsp) RT.province.str_buff <- T1.province.str_buff
<+674>: mov 0xb0(%rsp),%rax
<+682>: mov %rax,0xe0(%rsp) RT.province.capacity <- T1.province.capacity
<+690>: mov 0xa8(%rsp),%rax %rax <- T1.province.len
<+698>: mov %ebp,0xf0(%rsp) RT.value <- src_i->value ^ 2
<+705>: movb $0x1,0xf8(%rsp) RT.optional <- true
<+713>: mov %rax,0xd8(%rsp) RT.province.len <- T1.province.len
// *res_i = (RT.province, RT.value)
// For rest of loop body, %rax == res_i
<+721>: mov 0x48(%rsp),%rax %rax <- res.end()
<+726>: cmp 0x50(%rsp),%rax res.cap_end() ==? &res.end()
<+731>: je 0x401295 <main()+1349> Jump if equal; extend res (never taken, sufficient space reserved)
<+737>: test %rax,%rax
<+740>: je 0x40108f <main()+831> Jump if res.end() == nullptr (never taken, space has already been reserved)
<+742>: lea 0x10(%rax),%rdx %rdx <- &res.end().zstring
<+746>: mov %rdx,(%rax) res.end()->str_buff <- &res.end().zstring
<+749>: mov 0xd0(%rsp),%rdx %rdx <- RT.province.str_buff
<+757>: cmp %r12,%rdx Is RT buffer local?
<+760>: je 0x401260 <main()+1296> If equal (string length < 16): move the 16 bytes from RT.province.zstring to res_i->province.zstring; else move-assign the heap buffer res_i->province <- RT
/*
Unlike RT <- T1 assignment above, res_i->province <- RT.province is
an actual move assignment. RT.str_buff is copied to
res_i->province.str_buff (transferring ownership of the heap
object) and later (see below) RT.province will be reset to an empty
string and still later have its destructor called. The destructor
will not reference the heap because RT.province is a null string.
*/
<+766>: mov %rdx,(%rax) res_i->province.str_buff <- RT.province.str_buff
<+769>: mov 0xe0(%rsp),%rdx
<+777>: mov %rdx,0x10(%rax) res_i->province.capacity <- RT.province.capacity
// endif
<+781>: mov 0xd8(%rsp),%rdx
<+789>: mov %rdx,0x8(%rax) res.end()->province.len <- RT.province.len
<+793>: mov 0xf0(%rsp),%edx %edx <- RT.value
// RT.province <- "" due to move assignment to res.end()->province
<+800>: mov %r12,0xd0(%rsp) RT.province.str_addr <- & RT.province.zstring
<+808>: movq $0x0,0xd8(%rsp) RT.province.strlen <- 0
<+820>: movb $0x0,0xe0(%rsp) RT.province.zstring <- '\000'
<+828>: mov %edx,0x20(%rax) res_i->value <- RT.value
// res_i++
<+831>: addq $0x28,0x48(%rsp) res_i++
<+837>: cmpb $0x0,0xf8(%rsp) (! RT.has_value())? VESTIGIAL---never true
<+845>: je 0x4010b1 <main()+865> Branch if no value---never taken
// RT.province.~string()
<+847>: mov 0xd0(%rsp),%rdi %rdi <- RT.province.str_addr
<+855>: cmp %r12,%rdi &RT.province.zstring ==? RT.province.str_buff (always true due to move assignment)
<+858>: je 0x4010b1 <main()+865> Always taken (due to move assignment)
<+860>: callq 0x400c00 <_ZdlPv@plt> operator delete()@plt Delete non-local string buffer for RT.province (never necessary)
// Increment counter and check for loop completion
// %rbx == src_i
// %r13 == src.end()
// Increment and check src_i
<+865>: add $0x28,%rbx src_i++
<+869>: cmp %rbx,%r13 src_i ==? src.end()
<+872>: je 0x401100 <main()+944> Cleanup: l_us.~string()
// Process *src_i: Check src_i->province ==? l_us
<+874>: mov 0x8(%rbx),%rcx %rcx <- src_i->province.len()
<+878>: cmp 0x88(%rsp),%rcx l_us.len() ==? src_i->province.len()
<+886>: jne 0x400fa8 <main()+600> => Not equal: Build result
<+892>: test %rcx,%rcx src_i->province ==? "" (and also l_us, because equal lengths)
<+895>: je 0x4010b1 <main()+865> => Equal null: Move to next (never taken)
<+897>: mov 0x80(%rsp),%rsi %rsi <- &l_us.zstring
<+905>: mov (%rbx),%rdi %rdi <- &src_i->province.zstring
<+908>: mov %rcx,%rdx %rdx <- src_i->province.len()
<+911>: mov %rcx,(%rsp) TEMP0 <- %rcx
<+915>: callq 0x400cd0 <memcmp@plt> Compare string contents HEAP STRING REFERENCE (for string > 15 chars)
<+920>: test %eax,%eax (%eax != 0 => not equal)
<+922>: mov (%rsp),%rcx %rcx <- TEMP0
<+926>: jne 0x400fa8 <main()+600> Not equal: Build result
// src_i->province == l_us: increment and check src_i, loop to process new value
<+932>: add $0x28,%rbx src_i++
<+936>: cmp %rbx,%r13 src_i ==? src.end()
<+939>: jne 0x4010ba <main()+874> Not equal => Go to next
The above is the main loop body. We're not done though, as there are these two external branch targets 317 bytes later:
// RT.province.zstring <- T1.province.zstring (Short string optimization)
<+1256>: mov 0xb0(%rsp),%rax %rax <- zstring[0:7]
<+1264>: mov 0xb8(%rsp),%rdx %rdx <- zstring[8:15]
<+1272>: mov %rax,0xe0(%rsp) RT.province.zstring <- (%rax, %rdx)
<+1280>: mov %rdx,0xe8(%rsp)
<+1288>: jmpq 0x401002 <main()+690>
// Align next branch target
<+1293>: nopl (%rax)
// res_i->province.zstring <- RT.province.zstring (Short string optimization)
<+1296>: mov 0xe0(%rsp),%rsi %rsi <- zstring[0:7]
<+1304>: mov 0xe8(%rsp),%rdi %rdi <- zstring[8:15]
<+1312>: mov %rsi,0x10(%rax) res_i->province.zstring <- (%rsi, %rdi)
<+1316>: mov %rdi,0x18(%rax)
<+1320>: jmpq 0x40105d <main()+781>
Whoa, that’s a lot more code than the basic C++ version generated! Just over 3.4 times more code, in fact, as well as 2.3 times more stack storage for local variables and temporaries.
The increased stack storage indicates the key contributor to the increased code size: there are a lot more local variables and temporaries in this version. Given that these locals typically include a `std::string` member, there is a lot of code managing that complex type. The basic version uses one 4-byte temporary for the result of the squared integer value, while the `transform` idiom generates one 8-byte temporary to save a register and 128 bytes of temporary values (`l_us`, `T1`, and `RT`).
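For concreteness, here is a quick check of the object sizes behind those temporaries, assuming x86-64 and libstdc++ with the C++11 ABI, and using `std::optional` where the original may have used `std::experimental::optional`:

#include <optional>
#include <string>
#include <utility>

using ppair  = std::pair<std::string, int>;
using oppair = std::optional<ppair>;

// l_us occupies 32 bytes; T1's 40-byte ppair gets a 48-byte slot in the frame;
// RT is a 48-byte oppair: 32 + 48 + 48 = 128 bytes of temporaries.
static_assert(sizeof(ppair) == 40, "32-byte string + int, padded to 8-byte alignment");
static_assert(sizeof(oppair) == 48, "ppair + has_value flag, padded");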
The temporaries increase register pressure, as well. In the basic version, `src_i` and `res_i`, the iterators for `src` and `res`, are both in registers, as is `src.end()`, used in the loop termination test. The `transform` version uses all of these but also dedicates registers for temporaries `RT` (the result of the lambda, of type `optional<pair>`) and `T1` (the temporary `ppair` created to pass to the `RT` constructor). In fact, the register pressure is strong enough that the register holding iterator `res_i` is refreshed from local storage in Line 721.
Temporary `T1` seems to enjoy the odd semi-existence of Eric the Half-a-Bee, constructed in Lines 600–642 but never in fact destroyed. Instead its values are simply moved whole-cloth into temporary result `RT` in Lines 649–713 and the branch target Lines 1256–1288, without any destructor code generated. This is distinctly different from `RT`, which is move-constructed into the element of `res` referenced by iterator `res_i` (held in register `%rax`) in Lines 721–828. In this sequence, Lines 800–828 implement the assignment of the null string to `RT.province`, as required by the standard. Later, Lines 847–860 implement the destructor `RT.province.~string()`. Neither of these steps, the assignment of the null string nor the destructor, is generated for `T1`.
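In C++ terms, the RT-to-res step behaves like this small illustration; it relies on libstdc++ leaving a moved-from string empty, which is exactly what Lines 800–828 show happening to RT.province:

#include <string>
#include <utility>

void move_demo() {
    std::string rt = "a province name longer than fifteen characters";  // heap-allocated
    std::string element;
    element = std::move(rt);  // element takes ownership of rt's heap buffer
    // rt is now empty (libstdc++ resets it to its local buffer), so rt's
    // destructor at the end of the scope has no heap buffer to free.
}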
For the potentially slowest operations, references to the heap causing cache misses, the two implementations are identical, making the same sequence of references to the elements of `src` and `res`.
The compiler has done a marvellous job of collapsing function calls. The `transform` idiom uses only a single stack frame, the same as the basic code. Techniques of function inlining have successfully collapsed all the layers of function call imposed by the Standard Library’s abstractions. This is no small feat: in the unoptimized (`-O0`) version of the machine code, the deepest call stack has eight levels, as reported by gdb.
The costs of so many temporaries and parameters
But the many layers of function abstraction obstruct the optimizer in another way: each layer requires that a parameter be passed, whether by move construction, by copy construction, or by reference. I considered this problem when comparing different idioms for appending values. Consider the steps initiated by the body of the loop in `std::transform`, the following expression:

*res_i = lambda(*src_i); // Using my names for the iterators

where `res_i` (named `__result` inside libstdc++) is an iterator of type `opt_back_insert_iterator<vector<optional<ppair>>>`.
The lambda function terminates with the statement

return oppair(ppair{p.first, p.second*p.second});

to compute the result of `lambda(*src_i)`. In `libstdc++` 6.2, the return statement initiates the following cascade of function calls:
- The temporary `ppair` is aggregate-initialized. This in turn requires copy-constructing the `std::string` and `int` components of the `ppair` at `*src_i`. Move constructors cannot be used because the value in `src` must not be modified.
- The temporary `oppair` (of `optional<ppair>` type) is explicitly constructed using the Standard Library `optional<ppair>::optional(ppair&& p)`.
- The function `opt_back_insert_iterator<vector<optional<ppair>>>::operator=(optional<ppair>&& t)` is invoked.
- This invokes the move constructor `optional<ppair>::optional(optional&&)` to construct the parameter.
- The `optional` move constructor in turn invokes the move constructor for `std::string` and copies the `int` value.
- The rvalue has its member function `optional<pair>::value()` invoked, which returns an rvalue reference to the `pair` constructed in the first step.
- The constructed rvalue reference is move-constructed into the parameter for `vector<ppair>::push_back(ppair&&)`.
- `vector<ppair>::push_back(ppair&&)` simply calls `vector<ppair>::emplace_back(ppair&&)`, which constructs the `ppair` value in the element at `res.end()`, incrementing the end marker.
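To make the cascade concrete, here is a minimal sketch of what a filtering back-insert iterator of this kind might look like. It is consistent with the calls above, but it is not the actual `opt_back_insert_iterator` from the earlier post, and it uses `std::optional` where the original may have used `std::experimental::optional`:

#include <optional>
#include <utility>
#include <vector>

template <typename Container>
class opt_back_insert_iterator_sketch {
public:
    using value_type = typename Container::value_type;

    explicit opt_back_insert_iterator_sketch(Container& c) : c_(&c) {}

    // Accept the optional produced by the lambda; append only when it holds a value.
    opt_back_insert_iterator_sketch& operator=(std::optional<value_type>&& t) {
        if (t)
            c_->push_back(std::move(*t));
        return *this;
    }

    // Output-iterator boilerplate: dereference and increment are no-ops.
    opt_back_insert_iterator_sketch& operator*() { return *this; }
    opt_back_insert_iterator_sketch& operator++() { return *this; }
    opt_back_insert_iterator_sketch& operator++(int) { return *this; }

private:
    Container* c_;
};

// Convenience factory, analogous to std::back_inserter.
template <typename Container>
opt_back_insert_iterator_sketch<Container> opt_back_inserter_sketch(Container& c) {
    return opt_back_insert_iterator_sketch<Container>(c);
}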
Ultimately, all this machinery is invoked to do a near-trivial operation: copy a `std::string` and an `int` from an element of `src` to an element of `res`. The core C++ language and the Standard Library introduce many features in the hope of allowing the programmer to write at the abstract level and have the code compile down to the simple level. But given all the above layers, I am unsurprised that gcc 6.2 could not eliminate them all. The optimizer successfully compressed the hierarchy of function calls into a single stack frame but could not compress the hierarchy of constructors, leaving residue like the unnecessary `T1` temporary.

How much did this extra code affect performance? In the next post, I will begin presenting microbenchmarks comparing the basic and the `transform` implementations of the filter-and-transform algorithm.