Sidling up to the long string slowdown

In the last post, I highlighted an anomaly in the results for filtering long strings using the by-value implementation of the loop. The usual suspect for such outcomes, heap fragmentation, didn’t seem to apply in this case. What might be the actual source of the problem?

The slowdown arises only for long strings, which suggests it is related to the heap, since long strings are the only form of string considered in these posts that uses the heap. Unique char const * strings only move pointers, while short strings (std::string instances of fewer than 16 characters) are stored entirely within the string handle by libstdc++ 6.2. The degree of slowdown correlates well with the number of elements that have been moved in the source vector after it was initially created. This suggests that the slowdown is related to some measure of the “degree of disorder” in the heap.
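To make the long/short distinction concrete, here is a small check, written for this post rather than taken from the benchmark code, that guesses whether a std::string is using the small-string optimization by testing whether its data() pointer falls inside the string object itself (a heuristic that happens to work for libstdc++):

```cpp
#include <iostream>
#include <string>

// Heuristic: under the small-string optimization, data() points into a buffer
// inside the string object itself; for long strings it points into the heap.
// This relies on implementation details of libstdc++ and is for illustration only.
bool uses_sso(const std::string& s) {
    const char* body  = s.data();
    const char* first = reinterpret_cast<const char*>(&s);
    const char* last  = first + sizeof(std::string);
    return body >= first && body < last;
}

int main() {
    std::string short_str("short");                       // 5 characters: fits in the handle
    std::string long_str("abcdefghijklmnopqrstuvwxyz0");  // 27 characters: needs a heap body
    std::cout << std::boolalpha
              << "short uses SSO: " << uses_sso(short_str) << '\n'   // expected: true
              << "long uses SSO:  " << uses_sso(long_str)  << '\n';  // expected: false
}
```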

For some reason, moving elements of the source vector slows performance. What might be the mechanisms? If memory had uniform access cost, moving the elements would not have any effect. The effect must be due to nonuniform access costs incurred by the memory hierarchy. There are several possible mechanisms, in increasing order of execution cost:

- additional cache misses, at any level of the cache hierarchy, when successive accesses touch string bodies that are no longer adjacent;
- additional TLB misses, when successively accessed bodies fall on different virtual memory pages;
- in the worst case, page faults, if the working set no longer fits in resident physical memory.

Note that I am running the benchmarks in an Ubuntu 16.04 guest operating system under VirtualBox, which in turn runs on a macOS 10.12 host. This may confound measurements involving the actual, physical cache and TLB, as both are multiplexed between the host and guest systems by VirtualBox.

In addition to memory hierarchy effects, there is the possible effect of heap fragmentation: the allocator may have to walk through more candidate blocks to find one of acceptable size (the classical slowdown due to fragmentation). Beyond the basic cost of walking the allocator's data structures, the extra heap walking could itself incur any of the memory hierarchy costs listed above.
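To illustrate what “walking through more candidate blocks” means, here is a minimal sketch of a first-fit free-list search. It shows only the classical mechanism; it is not how the libstdc++ allocator (ultimately malloc) is actually implemented:

```cpp
#include <cstddef>
#include <iostream>

// One node per free block in a singly linked free list.
struct FreeBlock {
    std::size_t size;
    FreeBlock*  next;
};

// First-fit search: walk the free list until a block at least as large as the
// request is found. Every too-small block visited is wasted work, and each
// visit may touch a different cache line or TLB page.
FreeBlock* first_fit(FreeBlock* head, std::size_t request) {
    for (FreeBlock* b = head; b != nullptr; b = b->next)
        if (b->size >= request)
            return b;
    return nullptr;  // nothing large enough; the heap would have to grow
}

int main() {
    // A tiny hand-built free list: two fragments too small for the request
    // must be inspected before the usable block is reached.
    FreeBlock c{64, nullptr};
    FreeBlock b{8,  &c};
    FreeBlock a{16, &b};
    FreeBlock* found = first_fit(&a, 32);
    std::cout << "found block of size " << (found ? found->size : 0) << '\n';  // prints 64
}
```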

The actual benchmarks make some causes unlikely

We can argue against some causes based upon how the specific benchmarks are written. As noted in the last post, the uniform size of the string bodies, all exactly 27 bytes long, ensures that classical fragmentation will not be a problem. Fragmentation arises from sequences of allocation requests of differing sizes; in this benchmark every request is the same size, so any released block can be reused to satisfy the next request.

The benchmark structure also ensures that the string bodies in the source vector are in fact not moved when the vector is sorted or shuffled. The std::vector code can swap elements by calling the move assignment operator, std::string::operator=(string&&) (although I have not verified this), and move assignment copies only the pointer to the string body, not the body itself. Consequently, sorting or shuffling the source vector should not affect the heap layout at all. The “degree of dispersal” of the string bodies in the heap will be exactly the same whether or not the source data has been reordered.
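One way to check that claim (not something the benchmark itself does) is to record the set of data() addresses before sorting and verify that the same set is present afterward:

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

int main() {
    // Strings of 27 characters, long enough to force heap-allocated bodies.
    std::vector<std::string> v;
    for (int i = 0; i < 1000; ++i)
        v.push_back("abcdefghijklmnopqrstuvwxyz" + std::to_string(i % 10));

    // Record the heap addresses of the string bodies before sorting.
    std::unordered_set<const void*> before;
    for (const auto& s : v) before.insert(s.data());

    std::sort(v.begin(), v.end());

    // The same addresses should still be present: only the handles moved.
    std::unordered_set<const void*> after;
    for (const auto& s : v) after.insert(s.data());

    std::cout << (before == after ? "bodies not moved\n" : "bodies moved\n");
}
```

On libstdc++ this should print “bodies not moved”, since sorting only permutes which handle owns which body.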

A more subtle form of dispersal remains possible, however. Although the string bodies are never moved, the pointers to them are moved. After reordering, sequential access to the pointers in the source vector produces nonsequential access to the string bodies in the heap. This might well result in more cache misses when processing sorted or shuffled data. I will consider the evidence for this in the next post.
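A small illustration of that effect, again separate from the benchmark itself: after shuffling, printing the body addresses in vector order shows the order in which the heap would be visited during a sequential pass, which in general no longer matches the order in which the bodies were allocated:

```cpp
#include <algorithm>
#include <iostream>
#include <random>
#include <string>
#include <vector>

int main() {
    // 27-character strings, so every body lives on the heap.
    std::vector<std::string> src;
    for (int i = 0; i < 8; ++i)
        src.push_back("abcdefghijklmnopqrstuvwxyz" + std::to_string(i));

    std::shuffle(src.begin(), src.end(), std::mt19937{42});

    // The handles are visited sequentially, but the bodies they point to are
    // no longer in allocation order, so a sequential pass over the vector
    // jumps around the heap (exact addresses depend on the allocator).
    for (const auto& s : src)
        std::cout << static_cast<const void*>(s.data()) << '\n';
}
```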