Three idioms for appending values—a damaging decision?

07 Mar 2017 Tags: C++

Update: See also the next post, which adds a fourth idiom and mitigates some of the criticisms in the conclusion of this post.

The STL code for squaring the positive values, included the following lines to create the intermediate result t1 containing the positive values of the original list a:

vector<int> t1 {};
copy_if (a.cbegin(),
         a.cend(),
         back_inserter(t1),
         [](auto i) { return i > 0; });

This is a typical idiom when writing sequences of STL transformations. It declares an empty vector in Line 1 then appends new values to it via a back_inserter() adaptor (Line 4).

As an idiom, this works well. It fits the STL pattern for algorithms

algorithm (start, end, sink [, other options])

that we use in the sequence and the back_inserter() adapter can be applied to a std::vector, std::deque, or std::list. But the idiom also has some problems: For some kinds of objects, it may waste a lot of CPU time on unnecessary operations. It can also fragment the heap.

There are in fact three distinct idioms that you might use to build the temporary result, of increasing efficiency. Unfortunately, the most efficient idiom does not fit the above pattern for using the STL algorithms. Instead, the programmer must construct their algorithm from a for loop and basic code. As of C++ 2017, There is no way in the Standard Library to combine the standard algorithms with the most efficient idiom for constructing results.

Allocation, initialization, and assignment

To understand the different idioms for building a result in a collection, we have to first distinguish storage duration and variable lifetime. As I mentioned several posts ago, I am planning a whole series on how C++ approaches these topics, differentiating that approach from the ones taken by managed languages. I don’t need to go into as much detail for this post, so I’ll focus the discussion on the specific points necessary to describe the stages for values being appended to STL collections.

C++ carefully distinguishes the storage management from the object life cycle. Storage management has two steps and is performed by the collection as its capacity requirements expand and contract:

Allocation: A block of raw, uninitialized bytes is reserved by the collection, typically from the C++ runtime via operator new(). The C++ runtime in turn requests large blocks of memory from the operating system.
Deallocation: A block of previously-allocated bytes is returned, typically to the C++ runtime via operator delete(). The runtime typically retains the storage for potential reallocation, returning it to the operating system when the application terminates.

Allocation and deallocation deal simply in raw bytes that have no meaning in the language. Only object values have meaning. Values appended to a collection have a distinct life cycle, with steps occurring in the following order:

Initialization: When a value is appended to the collection it is built in storage that has been allocated by the collection but not yet used to store a value. The value is built by the constructor for the object's class.
Assignment: If the value can be accessed via a non-const type, it may be overwritten via an assignment.
Destruction: When the collection is destroyed, it destroys all the values it holds, calling the destructor for each value. The value destructor in turn releases any resources that value holds.

This simple model is sufficient to describe the case of interest here, where values are appended to a collection in a single pass, perhaps to be read back as input in a subsequent pass. More complicated uses, such as erasing objects from a collection, require a more sophisticated description.

A class with a transparent life cycle

To compare the different idioms for appending results, we need to make the above stages visible. The following class instruments its member functions to print a message whenever an instance is created, assigned, or deleted:

class C {
public:
  C (int a, int b) : m1 {a}, m2{b} {
    cout << "Basic constructor for ";
    cout << collection_name(this) << '\n';
  }

  C (const C& c) : m1 {c.m1}, m2 {c.m2} {
    cout << "Copy constructor for ";
    cout << collection_name(this) << " from ";
    cout << collection_name(&c) << '\n';
  }
  // Replace above with this to test with only move constructor
  //C (const C& c) = delete;

  C (C&& c) : m1 {c.m1}, m2 {c.m2} {
    cout << "Move constructor for ";
    cout << collection_name(this) << " from ";
    cout << collection_name(&c) << '\n';
  }

  C () {
    cout << "Default constructor for ";
    cout << collection_name(this) << '\n';
  }

  C& operator= (const C& c) {
    cout << "Copy assignment for ";
    cout << collection_name(this) << " from ";
    cout << collection_name(&c) << '\n';
    if (this == &c)
      return *this;
    m1 = c.m1;
    m2 = c.m2;
    return *this;
  }

  C& operator= (C&& c) {
    cout << "Move assignment for ";
    cout << collection_name(this) << " from ";
    cout << collection_name(&c) << '\n';
    if (this == &c)
      return *this;
    m1 = c.m1;
    m2 = c.m2;
    return *this;
  }

  ~C() {
    cout << "Destructor for ";
    cout << collection_name(this) << '\n';
    temps_del(this);
  }

  friend ostream& operator<< (ostream& os, const C& c);

private:
  int m1 {};
  int m2 {};
};

The class simply stores two integers, m1 and m2. Its member functions call instrumentation functions collection_name() and temp_del, which track details of the allocations. The full code for this post, including instrumentation, is presented in the appendix.

Let’s see what happens when we use different idioms to append instances of this class to a std::vector.

Incremental append: Multiple allocations, multiple per-item operations

Consider a simple case where we want to use a vector init containing the integers 0 through 4, inclusive, to build a vector, using the back_inserter() approach:

vector<C> c1 {};
transform (init.cbegin(), init.cend(), back_inserter(c1),
           [] (auto i) { return C(i, i); });

Here is an annotated sequence of the constructions, copies, and destructions required to append the five instances of C:

New collection c1
// Append C(0, 0)
Basic constructor for T0
[vector allocates space for C value, c1[0]]
Move constructor for c1[0] from T0
Destructor for T0

// Append C(1, 1)
Basic constructor for T0
[Vector allocates space for C values, c1[0]' to c1[1]']
Move constructor for c1[1]' from T0
Copy constructor for c1[0]' from c1[0]
Destructor for c1[0]
Destructor for T0
[Space for c1[0] deallocated]

// Append C(2, 2)
Basic constructor for T0
[Vector allocates space for C values, c1[0]'' to c[3]'']
Move constructor for c1[2]'' from T0
Copy constructor for c1[0]'' from c1[0]'
Copy constructor for c1[1]'' from c1[1]'
Destructor for c1[0]'
Destructor for c1[1]'
[Space for c1[0]' to c1[1]' deallocated]
Destructor for T0

// Append C(3, 3)
Basic constructor for T0
Move constructor for c1[3]'' from T0
Destructor for T0

// Append C(4, 4)
Basic constructor for T0
[Vector allocates space for C values, c1[0]''' to c1[4]''']
Move constructor for c1[4]''' from T0
Copy constructor for c1[0]''' from c1[0]''
Copy constructor for c1[1]''' from c1[1]''
Copy constructor for c1[2]''' from c1[2]''
Copy constructor for c1[3]''' from c1[3]''
Destructor for c1[0]''
Destructor for c1[1]''
Destructor for c1[2]''
Destructor for c1[3]''
[Space for c1[0]'' to c1[3]'' deallocated]
Destructor for T0

Whoa! That’s a lot of action. Every append requires creating a temporary C value, T0, which is destroyed once the append has completed. The vector allocates four progressively larger blocks (enough for 1 C value, 2 C values, 4 C values, and a final block at least 5 C values large) as its required capacity grows. Each time it expands, it copies the old values into the new block, then destroys the originals and deallocates the old block. For this class, the constructors, copies, and destructors are efficient, but they might require substantial CPU time for larger classes. And for any class size, the allocation / deallocation pairs contribute to memory fragmentation. Can we do better?

Initialize and assign: Single allocation, two per-item constructions

The first idiom required multiple allocations and deletions because we built the vector an item at a time, expanding as more values were added. Given that the number of items of the result will be exactly the number in init, we could initialize the result vector to its full length and then build the new values there. When we construct the initial vector of five values, however, it will construct default values for every entry. When we build the result, we will be overwriting those default values with an assignment. Here is the code:

vector<C> c2 (init.size());
auto c2_it = c2.begin();
for (auto in_it = init.cbegin(); in_it != init.cend(); ++in_it, ++c2_it)
  *c2_it = C(*in_it, *in_it);

The algorithm needs to maintain two iterators, one each for the source init and the sink c2, so I chose a for loop rather than the std::copy algorithm. With some effort, you could build a custom assign-to-the-end-iterator and use std::copy.

Here’s the resulting operation sequence:

[Vector allocates space for at least 5 values, c2[0] to c2[4]]
Default constructor for c2[0]
Default constructor for c2[1]
Default constructor for c2[2]
Default constructor for c2[3]
Default constructor for c2[4]
New collection c2 // The collection is now initialized

// Append C(0, 0)
Basic constructor for T0
Move assignment for c2[0] from T0
Destructor for T0

// Append C(1, 1)
Basic constructor for T0
Move assignment for c2[1] from T0
Destructor for T0

// Append C(2, 2)
Basic constructor for T0
Move assignment for c2[2] from T0
Destructor for T0

// Append C(3, 3)
Basic constructor for T0
Move assignment for c2[3] from T0
Destructor for T0

// Append C(4, 4)
Basic constructor for T0
Move assignment for c2[4] from T0
Destructor for T0

This idiom requires substantially fewer operations on C values and has exactly one allocation, of exactly the required size. Because we’re assigning an r-value in Line 4 and class C includes a move assignment member, move assignment is used. For this class whose members are all ints, there is no performance improvement from move assignment, but it might be much more efficient for instances of a class that owns resources such as dynamic memory.

This looks like a win, overall. We’ve saved a lot of operations and used the bare minimum of allocations. The “extra” operations left, construction of a default value and move assignment, are both typically fast. The only downside is that we have to maintain two iterators, using a more complex for or a custom adaptor to subsume the iterator. Is there a way we can do even better?

Reserve and emplace: Single allocation, single per-item construction

The remaining inefficiency in the second idiom is that we construct a sequence of default values that we never need and will overwrite as soon as the collection is ready. What we really want to do is simpler than that: Have the collection allocate all the storage we need and then construct our values directly in that storage.

This approach became viable in 2011 with the addition of emplace members to the STL collections (this feature in turn was enabled by the C++ 2011 core language addition of template parameter packs). In particular, the emplace_back() member constructs a value directly after the last value in the collection. If there is allocated space available after the last value, the appended value is simply constructed there, but if the vector is full, the same allocate-copy-deallocate sequence as the first method must be used.

This suggests the following pattern:

reserve space in the vector sufficient to hold all items
build the items using emplace_back()

In C++:

vector<C> c3 {};
c3.reserve(init.size());
for (auto v : init)
  c3.emplace_back(v, v);

Once again, rather than build a custom iterator, I have used a for loop to iterate over the source and used emplace_back() in the loop body to append to the sink. Here’s the operations this code produces:

New collection c3
[Vector allocates space for at least 5 values, c3[0] to c3[4]]
Basic constructor for c3[0]
Basic constructor for c3[1]
Basic constructor for c3[2]
Basic constructor for c3[3]
Basic constructor for c3[4]

Using emplace_back(), we have built the result using the minimum number of allocations and operations on the C values. In tradeoff, we can no longer use simple STL algorithms unless we use some form of back_emplacer adaptor. Although none is provided in the 2017 STL, Andre Tomazos has presented a solution in a Stack Overflow discussion.

Explicit storage management increases programmer load

The current version of the STL forces programmers to make a Faustian choice when implementing list comprehension-style algorithms: Use the simple, well-supported longstanding approach and accept potential inefficiencies, or use the more recent emplace members, gaining efficiency at some cost in code complexity. The biggest cost of all of this is the decision itself. Where the Python or Haskell programmer simply specifies list operations, the C++ programmer must stop for every sink, whether an intermediate result or the final value, and choose an append idiom.

As I noted in an earlier post, it is not clear to me that good modern compilers, whether static or just-in-time, cannot produce equally efficient code from list comprehensions as a C++ compiler produces from the more complex STL approach. I have the sad suspicion that this is a case where C++’s emphasis on explicit storage management increases programmer load for little to no efficiency benefit.

Appendix: Full source

Here is the full source from which the annotated operation displays were produced. The instrumentation code is just functional enough to produce useful results with one specific compiler and library release. It is far from general code. Specific caveats about the instrumentation code:

The actual output from this code is more raw than the annotated versions given above. I derived the presentation versions from the output of this code.
The is_stack() function is highly specific to the implementation and specific runtime options such as stack size. It is nothing close to general production use.
The instrumentation functions call std::vector::data() during updates to the vector. The function's behaviour in such cases is undefined. For this compiler and library version, it returned the address of the old allocation. In other circumstances, it might behave differently.

/*
  Demonstration of different approaches to building a vector.
 */

#include <algorithm>
#include <cassert>
#include <iostream>
#include <list>
#include <sstream>
#include <string>
#include <vector>

#include <experimental/iterator>

using std::copy;
using std::cout;
using std::find_if;
using std::generate;
using std::ostream;
using std::ostringstream;
using std::string;
using std::to_string;
using std::transform;
using std::vector;

using std::experimental::make_ostream_joiner;

class C;
void temps_del(const C* cp);
string collection_name(const C* loc);

/*
  Class to make creation, assignment, and destruction visible.
 */
class C {
public:
  C (int a, int b) : m1 {a}, m2{b} {
    cout << "Basic constructor for ";
    cout << collection_name(this) << '\n';
  }

  C (const C& c) : m1 {c.m1}, m2 {c.m2} {
    cout << "Copy constructor for ";
    cout << collection_name(this) << " from ";
    cout << collection_name(&c) << '\n';
  }
  // Replace above with this to test with only move constructor
  //C (const C& c) = delete;

  C (C&& c) : m1 {c.m1}, m2 {c.m2} {
    cout << "Move constructor for ";
    cout << collection_name(this) << " from ";
    cout << collection_name(&c) << '\n';
  }

  C () {
    cout << "Default constructor for ";
    cout << collection_name(this) << '\n';
  }

  C& operator= (const C& c) {
    cout << "Copy assignment for ";
    cout << collection_name(this) << " from ";
    cout << collection_name(&c) << '\n';
    if (this == &c)
      return *this;
    m1 = c.m1;
    m2 = c.m2;
    return *this;
  }

  C& operator= (C&& c) {
    cout << "Move assignment for ";
    cout << collection_name(this) << " from ";
    cout << collection_name(&c) << '\n';
    if (this == &c)
      return *this;
    m1 = c.m1;
    m2 = c.m2;
    return *this;
  }

  ~C() {
    cout << "Destructor for ";
    cout << collection_name(this) << '\n';
    temps_del(this);
  }

  friend ostream& operator<< (ostream& os, const C& c);

private:
  int m1 {};
  int m2 {};
};

ostream& operator<< (ostream& os, const C& c) {
  cout << "C (" << c.m1 << ", " << c.m2 << ")";
  return os;
}

using Coll = vector<C>;

/*
  Instrumentation to attribute locations to owning collection.
  One-off code---specific to this demonstration and tool chain.
 */
string to_string(void const * p) {
  ostringstream os {};
  os << p;
  return os.str();
}

struct Cpair {
  const Coll* cp {};
  const string name {};
};

vector<Cpair> colls {};

struct Tpair {
  const C* cp {};
  string name {};
};

vector<Tpair> temps {};

auto temps_find(const C* cp) {
  return find_if(temps.begin(), temps.end(),
		 [=](auto tp) { return tp.cp == cp; });
}

void colls_add(const Coll* coll, const char* name) {
  colls.push_back(Cpair {coll, name});
  cout << "New collection " << name << ", base " << coll->data() << '\n';
}

void temps_add(const C* cp, const string& name) {
  auto it = temps_find(cp);
  if (it != temps.end())
    return;
  temps.push_back(Tpair{cp, name});
}

/*
  Highly specific to implementation, memory model, and so forth!
 */
bool is_stack(const C* cp) {
  return reinterpret_cast<unsigned long>(cp) > 0x700'000'000'000UL;
}

void temps_del(const C* cp) {
  if ( ! is_stack(cp))
    return;
  auto it = temps_find(cp);
  assert (it != temps.end());
  temps.erase(it);
}

string temps_name(const C* cp) {
  auto it = temps_find(cp);
  if (it == temps.end()) {
    if (is_stack(cp)) {
      auto nm = string("T") + to_string(temps.size());
      temps_add(cp, nm);
      return nm;
    }
    else {
      return to_string(cp);
    }
  }
  else
    return it->name;
}

string collection_name(const C* loc) {
  for (auto cp : colls) {
    auto base = cp.cp->data();
    if (base <= loc && loc < base+cp.cp->capacity())
      return cp.name +"[" + to_string(loc-base) +"]" +
	(loc == base ? string(" [")+to_string(loc)+string("]") : string(""));
  }
  return temps_name(loc);
}

/*
  Demonstrate three ways to build a result vector.
 */
constexpr int len = 5;

int main () {
  vector<int> init (len);
  generate(init.begin(), init.end(), [i=0] () mutable { return i++; });
  cout << "Source vector: ";
  copy (init.cbegin(), init.cend(), make_ostream_joiner(cout, ", "));
  cout << '\n';

  cout << "\nDefault vector, back_inserter:\n";
  Coll c1 {};
  colls_add (&c1, "c1");
  transform (init.cbegin(), init.cend(), back_inserter(c1),
	     [] (auto i) { return C(i, i); });
  cout << "Result: ";
  copy (c1.cbegin(), c1.cend(), make_ostream_joiner(cout, ", "));
  cout << '\n';

  cout << "\nDefault values of required length, assigned\n";
  Coll c2 (init.size());
  colls_add (&c2, "c2");
  auto c2_it = c2.begin();
  for (auto in_it = init.cbegin(); in_it != init.cend(); ++in_it, ++c2_it)
    *c2_it = C(*in_it, *in_it);
  cout << "Result: ";
  copy (c2.cbegin(), c2.cend(), make_ostream_joiner(cout, ", "));
  cout << '\n';

  cout << "\nReserved vector, emplace_back\n";
  Coll c3 {};
  colls_add (&c3, "c3");
  c3.reserve(init.size());
  for (auto v : init)
    c3.emplace_back(v, v);
  cout << "Result: ";
  copy (c3.cbegin(), c3.cend(), make_ostream_joiner(cout, ", "));
  cout << '\n';
  cout << "\nDestructors called as collections go out of scope:\n";
}

All my marbles in one place