Sidling up to the long string slowdown
14 Jun 2017
Tags: C++

In the last post, I highlighted an anomaly in the results for filtering long strings using the by-value implementation of the loop. The usual suspect for such outcomes, heap fragmentation, didn't seem to apply in this case. What might be the actual source of the problem?
The slowdown arises only for long strings, which suggests it is related to the heap: long strings are the only form of string considered in these posts that uses the heap. Unique `char const *` strings only move pointers, while short strings, `std::string` instances of fewer than 16 characters, are stored in the string handle itself by libstdc++ 6.2. The degree of slowdown correlates well with the number of elements that have been moved in the source vector after it was initially created. This suggests that the slowdown is related to some measure of the "degree of disorder" in the heap.
For some reason, moving elements of the source vector slows performance. What might be the mechanism? If memory had uniform access cost, moving the elements would have no effect, so the slowdown must be due to nonuniform access costs incurred by the memory hierarchy. There are several possible mechanisms, in increasing order of execution cost:
- Data cache misses due to the heap entries being more dispersed. For example, if shuffling the array copies the string bodies into a more dispersed range of locations, the pass through the array by the `for` loop will no longer be a consecutive pass through heap memory but will instead require accessing widely separated parts of memory. Such accesses require 1–2 cache line loads per string body, with no benefit from data prefetching.
- TLB cache misses due to the heap entries being even more widely dispersed. The TLB cache can only accommodate so many pages. For these benchmarks, the page size is 4,096 B. If the 50,000 string bodies are stored contiguously (ignoring any metadata required by the allocator), they require 330 pages. If the strings are stored in the worst possible layout, they require 50,000 pages. If the total number of pages required to access the code plus all data exceeds the TLB cache capacity, TLB misses will slow the benchmark. Given that the Haswell architecture of my chip has an L2 TLB with 1,024 8-way associative entries, I consider this possibility less likely than data cache misses.
- Swap costs due to accesses of data pages that have been swapped out. `vmstat` reports that no swaps occurred for these benchmarks, so swapping is not a factor in this case.
Note that I am running the benchmarks in an Ubuntu 16.04 guest operating system under VirtualBox, which in turn runs on a macOS 10.12 host. This may add confounding effects on the actual, physical cache and TLB, as they are multiplexed between the host and guest systems by VirtualBox.
In addition to memory hierarchy effects, there is the possible effect of heap fragmentation: the allocator may have to walk through more candidate blocks to find one of acceptable size (the classical slowdown due to fragmentation). Beyond the basic cost of walking the allocator's data structures, the increased heap walking could itself incur any of the memory hierarchy costs above.
The actual benchmarks make some causes unlikely
We can argue against some causes based upon how the specific benchmarks are written. As noted in the last post, the consistent size of the string bodies, all exactly 27 bytes long, ensures that classical fragmentation will not be a problem. Fragmentation is caused by sequences of allocation requests of differing sizes, whereas in these benchmarks any released block can be reused for the next request.
The benchmark structure also ensures that the string bodies in the source vector will in fact not be moved when the vector is sorted or shuffled. The `std::vector` code can call the move assignment operator, `std::string::operator=(string&&)`, to swap elements (although I have not verified this). The move assignment only copies the pointer to the string body, not the body itself. Consequently, sorting or shuffling the source vector should not affect the heap layout at all. The "degree of dispersal" of the string bodies will be exactly the same whether the source data has been reordered or not.
A more subtle form of dispersal remains possible, however. Although the string bodies are never moved, the pointers to them are moved. After reordering, sequential access to the pointers in the source vector produces nonsequential access to the string bodies in the heap. This might well result in more cache misses when processing sorted or shuffled data. I will consider the evidence for this in the next post.