Random access to long string bodies incurs cache misses
29 Jun 2017 Tags: C++

As I described in the last post, shuffling the items in the source vector leaves the string handles in contiguous sequential order while introducing disorder into the sequence of string bodies. The NShuffled “data set” is really a family of 11 related data sets, differing in the fraction of their items that were shuffled. All sets start with the same vector src of 50,000 elements; std::shuffle() is then run on the first N items of the vector, where N ranges from 0 (no items shuffled) to 50,000 (all items shuffled) in increments of 5,000. All the data sets have 37,500 (75% of 50,000) of their source elements copied to the result vector.
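For concreteness, here is a minimal sketch of how such a family of data sets can be built; the plist element type, the integer payload, and the fixed seed are my assumptions, not the post's exact code:

#include <algorithm>
#include <cstddef>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Assumed element type: a long string paired with an integer payload.
using plist = std::vector<std::pair<std::string, int>>;

// Return a copy of src with its first n_shuffled elements shuffled.
// Copying src allocates the copy's string bodies in iteration order, so
// they start out roughly sequential on the heap; std::shuffle then swaps
// the handles, disordering the bodies of the shuffled prefix relative to
// iteration order.
plist make_nshuffled(plist src, std::size_t n_shuffled) {
    std::mt19937 rng{42};  // fixed seed keeps each data set reproducible
    std::shuffle(src.begin(),
                 src.begin() + static_cast<plist::difference_type>(n_shuffled),
                 rng);
    return src;
}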
These data sets allow us to correlate the degree of disorder in the string bodies with the execution time of the filter-and-transform algorithm: more disorder in the string bodies produces more cache misses, slowing performance. I will use the NShuffled data sets to demonstrate this.
Shuffling items proportionately slows performance
The first step is to verify that shuffling produces the expected slowdown. The following graph shows the times for all 11 NShuffled data sets (right), together with the original Random data set (left), which is not shuffled:
The execution time strongly tracks the number of shuffled items. The original unshuffled data set and the 0% NShuffled data set have the same time (3.5 ms), and each increment of 5,000 additional shuffled items adds about 0.25–0.30 ms. The NShuffled data sets exemplify the performance slowdown we want to understand.
Shuffling items leaves long string bodies in disorder
The next step is to verify that this data set produces the data-structure effects that we claim are causing the slowdown. I created an approximate measure of “disorder” in the string bodies of the src vector by counting “jumps”: the number of string bodies that do not follow their preceding string body within a single 64-byte cache line:
void check_addresses(const plist& src) {
    constexpr plist::size_type gap = 64;  // one cache line
    unsigned jumps = 0u;
    auto last = src.cbegin()->first.data();  // body address of the first string
    for (const auto& p : src) {
        // A "jump": this body starts below the previous body's address,
        // or more than one cache line beyond it.
        if (p.first.data() < last || last + gap < p.first.data()) {
            jumps++;
        }
        last = p.first.data();
    }
    *addr_out << "Total of " << jumps << " jumps\n";
}
The std::string::data() member returns the address of the string body.
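Whether data() points into the heap at all depends on the small-string optimization: libstdc++ stores short bodies inline in the string object and only long bodies on the heap. A rough, implementation-dependent way to check, offered here only as a diagnostic sketch:

#include <iostream>
#include <string>

// True if the body lies outside the string object itself, i.e. on the heap.
// This relies on implementation details; treat it as a diagnostic only.
bool body_on_heap(const std::string& s) {
    const char* body = s.data();
    const char* obj  = reinterpret_cast<const char*>(&s);
    return body < obj || body >= obj + sizeof(std::string);
}

int main() {
    std::string short_str{"short"};
    std::string long_str(100, 'x');  // long enough to defeat the SSO
    std::cout << std::boolalpha
              << body_on_heap(short_str) << '\n'   // typically false
              << body_on_heap(long_str)  << '\n';  // typically true
}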
The number of these “jumps” closely tracks the number of shuffled items:
| n_shuffled | Jumps  |
|-----------:|-------:|
|          0 |      1 |
|      5,000 |  5,000 |
|     10,000 |  9,999 |
|     15,000 | 14,998 |
|     20,000 | 19,997 |
|     25,000 | 24,997 |
|     30,000 | 29,999 |
|     35,000 | 34,999 |
|     40,000 | 39,999 |
|     45,000 | 45,000 |
|     50,000 | 49,999 |
Jumps in string body location predict benchmark times
A simple least-squares regression of benchmark time on the number of jumps shows a strong linear fit:
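For reference, such a fit is just an ordinary least-squares line; a minimal sketch of the computation (the function is mine, not from the post's benchmark code):

#include <cstddef>
#include <utility>
#include <vector>

// Ordinary least-squares fit of y = a + b*x; returns {a, b}.
std::pair<double, double> ols_fit(const std::vector<double>& x,
                                  const std::vector<double>& y) {
    const auto n = static_cast<double>(x.size());
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    const double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);  // slope
    const double a = (sy - b * sx) / n;                          // intercept
    return {a, b};
}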
There are measurable performance consequences to scrambling the order in which we read the string bodies. Are they due to cache misses?
Jumps in string body location predict L1 cache read misses
The jump count only estimates cache misses. For example, any string body located at a lower memory address than its predecessor's is counted as a jump, even though the two might share a cache line. We need to verify that the jump count is a reasonable proxy for cache misses.
The callgrind tool in the valgrind suite estimates cache misses by simulating execution of the program binary against a specified cache hierarchy. The tool reports L1 and last-level (L3 on the Haswell architecture used in these tests) data cache read and write misses, as well as instruction read misses.
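For reference, a typical invocation looks something like the following (the benchmark binary name is a placeholder). The --cache-sim=yes option enables the cache simulation, and callgrind_annotate reports the event counts, where D1mr and DLmr are the L1 and last-level data read misses:

valgrind --tool=callgrind --cache-sim=yes ./filter_benchmark
callgrind_annotate callgrind.out.<pid>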
Using callgrind, we see that L1 cache read misses are perfectly correlated with the number of string body jumps in the source vector:
Each core on the Haswell chip on which these tests were run has a 32 KB level-1 data cache and a 32 KB level-1 instruction cache. There are negligible instruction cache misses (under 50 in every condition) because the loop is tight and the cache lines holding its instructions stay hot for the entire run. In contrast, each string body jump requires loading an unread cache line from a nonsequential location, defeating prefetching. Given the small size of the L1 cache, every jump causes about 5.5 read misses. These misses are resolved from the L2 cache (not monitored by callgrind), with L2 misses resolved from the last-level L3 cache. On this chip, the L3 cache is 3 MB, shared across both cores. The benchmarks were run on a quiet system, so nearly all L3 cache activity came from the benchmarks rather than from processes on the other core.
More than 20,000 string body jumps increase L3 cache read misses
The larger size of the last-level cache allows it to accommodate more distinct string bodies before read misses arise:
Up to around 20,000 jumps, the number of L3 cache read misses is constant; beyond that, it increases linearly with the number of jumps.
Cache write misses are harder to interpret
Increasing the number of source vector string body jumps directly affects read misses (because the string body of the loop parameter has to be copied from an unpredictable heap location) but only indirectly affects write misses. The loop writes to two heap locations:
- The string body of the loop parameter: as every string body is exactly the same length, the heap block freed for the previous parameter body can potentially be reused for the next. If so, these writes land on a hot cache location.
- The result vector: the 37,500 strings that pass the filter are copied from the parameter to the next available spot in the result vector. Because the loop parameter is an lvalue, copy-assignment must be used, which requires allocating a new heap location for the string body and copying the parameter's body into it. As these are the only heap operations performed in the loop, the blocks allocated for the result vector's string bodies should be in close-to-ascending, semi-contiguous order.
This analysis suggests that the string body writes themselves will land either on hot cache lines (the loop parameter) or on prefetchable cache lines (appending to the result vector). However, write misses might be caused by interference from the read operations, as cache lines are evicted to make space for reads.
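In outline, the loop under discussion has roughly this shape (reusing the plist alias sketched earlier; the by-value loop parameter and the specific predicate are my stand-ins, not the post's exact code):

plist filter_and_transform(const plist& src) {
    plist result(src.size());     // pre-sized with empty strings (no heap use yet)
    plist::size_type out = 0;
    for (const auto p : src) {    // by-value parameter: its string body is
                                  // allocated and freed every iteration
        if (p.second % 4 != 0) {  // hypothetical filter passing ~75% of items
            result[out++] = p;    // p is an lvalue, so this copy-assigns,
                                  // allocating a new body in the result
        }
    }
    result.resize(out);           // drop the unused tail
    return result;
}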
The callgrind results for L1 cache write misses show little variation by jump count:
The cache-miss values occupy only a small part of the y-axis range; this is deliberate, to emphasize how little the L1 write misses vary. The values agree to three significant figures, differing only over a range of about 5,000 misses.
The results for L3 cache write misses indicate several distinct write regimes (0 jumps, 5,000–20,000 jumps, and 25,000–50,000 jumps), with a range of 200,000 misses:
The causes of the behaviour at each level are unclear. L1 write misses seem independent of the number of jumps, but jumps, and in particular large jump counts, do affect L3 write misses. The L3 effect may be due to cache evictions or to greater disorder in the overall heap; we do not have sufficient data to tell at this point.
Conclusion
The order in which we access the bodies of long strings, which libstdc++ stores on the heap, substantially affects the overall performance of the loop processing the vectors. Reading 50,000 string bodies in random order of address is 2.5 times slower than reading them in sequential order of address. Randomness in the string body addresses incurs more L1 cache read misses and, at the larger counts of random accesses, more L3 cache read misses as well.
Random access to string bodies has a more complicated relationship to cache write performance. In any case, the read miss performance is the dominant predictor.
In my next post, I will use callgrind to locate the statement where the L1 read misses arise.