For long strings, access order determines performance

In the last post, I described several possible causes of the slowdown for long strings in the by-value version of the idioms. I concluded the most likely actual cause was the order in which the string bodies are accessed.

Structure of long strings in libstdc++

To understand how the order of long strings affects performance, we must first understand the memory layout of long strings in libstdc++ 6.2, the version used in this series. This implementation of std::string stores a long string in two parts: the handle and the string body.

The string handle occupies 32 bytes (in an environment with 64-bit pointers). The handle has the storage duration of the string object itself: automatic, static, thread, or dynamic. For a long string, the handle comprises four 8-byte fields (sketched as a struct after this list):

  1. An 8-byte pointer to the string body.
  2. An 8-byte string length.
  3. An 8-byte capacity of the string body.
  4. An 8-byte block that is unused for long strings. (For short strings, this block and the capacity field together serve as the in-handle character buffer.)
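
The following struct is a minimal sketch of that layout, not the actual libstdc++ source; the field names are mine. The capacity field and the unused block together occupy the 16 bytes that hold the in-handle buffer for short strings.

#include <cstddef>

// Simplified sketch of the 32-byte string handle (field names are approximations,
// not the identifiers used in the libstdc++ source).
struct string_handle {
    char*       body;         // 1. pointer to the string body
    std::size_t length;       // 2. string length, excluding the terminating null byte
    union {
        std::size_t capacity;        // 3. capacity of the string body (long strings)
        char        local_buf[16];   // 3+4. in-handle buffer (short strings only)
    };
};
static_assert(sizeof(string_handle) == 32,
              "32 bytes in an environment with 64-bit pointers");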

For the algorithms studied in these posts, the handles for the strings in the src vector are stored in the vector body (itself allocated on the heap), while the handle for the loop parameter string is stored in a local variable on the stack.

The string body is stored in a separate block, always allocated on the heap, regardless of the storage duration of its handle. Its size is stored in the capacity field of the handle, and it must be large enough to hold the current string value plus a terminating null byte. In the case of the algorithms studied in these posts, all the strings have a capacity of 27, the minimum necessary for a 26-character string.

The memory layout of the "random" condition

Now consider how the source vector is built in the basic ("random") case, using the libstdc++ implementations of std::vector and std::string. For the data set in which 75% of the entries hold the "included" string, the algorithm is as follows (a code sketch follows the list):

  1. Define two strings: an "included" string, which will be copied to the result vector, and an "excluded" string, which will not. Both strings are 26 characters long, so each body requires 27 bytes (26 characters plus a null terminator), too long for the in-handle short-string buffer; both bodies will therefore be stored on the heap.
  2. The source vector is prereserved to a capacity of 50,000 entries.
  3. A random number generator is called 50,000 times, generating integers in the range [0,99].
  4. For a random integer in the range [0,74], a copy of the "included" string is constructed at the next available entry in the source vector. For an integer in the range [75,99], a copy of the "excluded" string is constructed in the next available source entry. In either case, the string's 32-byte handle will be stored directly in the vector, while its body will be copied to a 27-byte block obtained from the heap.
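
Here is a hedged sketch of that construction. The element type (a string paired with an integer) matches the loop shown later in this post; the function name, the pct_included parameter, and the rbits generator are assumptions, not the benchmark's actual code.

#include <cstdint>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Hypothetical reconstruction of the "random" source-vector construction.
std::vector<std::pair<std::string, std::int64_t>>
make_random_src(std::size_t n, unsigned pct_included, std::mt19937_64& rbits) {
    const std::string included(26, 'i');   // step 1: 26-character strings, so their
    const std::string excluded(26, 'x');   //         bodies are allocated on the heap
    std::vector<std::pair<std::string, std::int64_t>> src;
    src.reserve(n);                        // step 2: prereserve the vector (n = 50,000)
    std::uniform_int_distribution<unsigned> dist(0, 99);   // step 3: integers in [0,99]
    for (std::size_t i = 0; i < n; ++i) {
        // step 4: copy-construct the chosen string into the next entry; the 32-byte
        // handle lands in the vector body, the 27-byte string body on the heap
        const std::string& s = dist(rbits) < pct_included ? included : excluded;
        src.emplace_back(s, static_cast<std::int64_t>(i));
    }
    return src;
}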

Each string body is stored in a separate block. The layout of these blocks depends upon the underlying allocator, which for these tests is the glibc 2.23 implementation of malloc. Rather than get into the specifics of this implementation, I will describe the general issues raised by all malloc-style allocators.

Each string body is allocated via a separate call to malloc(). Because most allocators insert metadata and perhaps padding between blocks, the bodies will not be contiguous. However, the blocks will typically be near one another, in increasing order of memory address, although in principle they can be placed at arbitrary locations. For example, the glibc 2.23 allocator places the 50,000 source string bodies in adjacent 48-byte blocks, with one or two longer jumps in the sequence.
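
A small probe along these lines can be used to observe the spacing on a particular system. The roughly 48-byte stride described above is specific to glibc malloc for 27-byte requests; other allocators may space or order the blocks differently.

#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main() {
    std::int64_t prev = 0;
    for (int i = 0; i < 8; ++i) {
        void* p = std::malloc(27);   // one 27-byte "string body" per call
        const auto addr = static_cast<std::int64_t>(reinterpret_cast<std::uintptr_t>(p));
        if (i > 0)
            std::printf("block %d is %+lld bytes from block %d\n",
                        i, static_cast<long long>(addr - prev), i - 1);
        prev = addr;
    }
    return 0;
}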

The resulting memory layout of the src vector will look like:

[Figure: coherent long string bodies, 75% included]

Now consider the code (loop_emplace() version) that reads that vector:

for (auto p : src) { // by-value version
  if (p.first != exclude)
    res.emplace_back(p.first, p.second*p.second);
}

The loop accesses each element of the source vector src in sequential order. The string portion of the source element is copy-constructed into p.first. (The element cannot be move-constructed because src must not be modified.) Consequently, the string handles (stored contiguously and in sequence in the vector body, located on the heap) and the string bodies (stored semi-contiguously and in sequence as individual blocks on the heap) are accessed in sequentially increasing address order. This maximizes the benefit from hardware prefetching, which will pull the next handle and string body into the cache before we request them.

The memory layout of the "shuffled" condition

Now consider the memory layout in the “shuffled” condition. This condition begins with the same initialization as the “random” condition and then adds a fifth step:

  5. Call std::shuffle on src:

    std::shuffle(src.begin(), src.end(), *rbits);
    

     The libstdc++ implementation of the std::shuffle algorithm swaps string values by calling std::string::swap(std::string&), which swaps the string handles but leaves the string bodies in their original heap locations. (This behaviour is enabled by the library standard but not required, so we have to check whether a given implementation in fact does this; a quick check is sketched after this list.)
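
One way to verify this on a particular implementation is a check along the following lines (a minimal sketch; whether the assertion holds is implementation-specific):

#include <cassert>
#include <string>
#include <utility>

int main() {
    std::string a(26, 'a');    // long enough that the bodies live on the heap
    std::string b(26, 'b');
    const void* a_body = a.data();
    const void* b_body = b.data();
    std::swap(a, b);           // what std::shuffle does for each pair of elements it swaps
    // If only the handles were exchanged, each string now points at the other's old body.
    assert(a.data() == b_body && b.data() == a_body);
    return 0;
}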

After the shuffle, access to the string handles in the source vector remains sequential but access to the string bodies in the heap is random:

[Figure: incoherent long string bodies, 75% included]

When the string bodies are accessed in random order, copy-constructing the loop parameter will generate many cache misses.

Specifying the level of disorderly heap access

The test data sets used so far in this series test only the extremes of heap disorder: the original ("Random") data set and the in-situ sorted data set access all string bodies sequentially, while the "Shuffle" data set accesses all of them randomly. We want to be able to specify a particular degree of randomness in the source-vector accesses and measure its effect on runtime.

We will add a fifth data set, "NShuffle", in which we specify the proportion of source strings whose bodies are accessed randomly; the remaining strings are accessed sequentially. This data set is created like Shuffle, except that only the first n_shuffle elements are shuffled, while the rest are left in their original order. The n_shuffle parameter can range from 0 (all elements accessed sequentially) to 50,000 (all elements accessed randomly):

std::shuffle(src.begin(), src.begin()+n_shuffle, *rbits);
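
For completeness, here is a hedged sketch of that construction, reusing the assumed element type from the earlier construction sketch and clamping n_shuffle to the vector size:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper for the "NShuffle" data set: shuffle only the first
// n_shuffle elements of an already-built "random" source vector.
void make_nshuffle(std::vector<std::pair<std::string, std::int64_t>>& src,
                   std::size_t n_shuffle, std::mt19937_64& rbits) {
    n_shuffle = std::min(n_shuffle, src.size());   // 0 = fully sequential layout
    std::shuffle(src.begin(),
                 src.begin() + static_cast<std::ptrdiff_t>(n_shuffle),
                 rbits);                           // src.size() = fully shuffled
}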

Running the benchmarks on NShuffle data sets with n_shuffle ranging from 0 to 50,000 in steps of 5,000 yields very interesting results. I'll leave that here as a teaser and present the actual results in the next post.