Extending "The Tail at Scale"

A companion post goes into more detail on constructing the percentile function from a density function.

In their classic article, “The Tail at Scale”, Dean and Barroso formulate an important principle of service design: Reduce the latency variance of low-level services, because it has a large impact on the performance of their callers. This effect arises from the high number of subservices called by a typical main service, which makes it likely that at least one subservice will respond extremely slowly, holding back the main service. Dean and Barroso focus on the very largest latencies, but extending their argument to smaller, more frequent latencies provides even more insight into service design.

Subservice latencies increase main service latency

Let’s begin with some definitions and Dean and Barroso’s basic argument.

Describing latencies

Latencies form a distribution, typically one too complex to characterize by specifying the parameters of a common distribution such as the Gaussian or log-normal. Instead, a latency distribution is characterized by its percentiles. The latency percentile function maps a fraction \(p \in [0,1)\) to a latency,

\[LP(p) = v, \;\mbox{such that for a randomly chosen latency}\; l, \;Pr(l \leq v) = p\]

The value of \(LP(p)\) specifies a boundary and can be seen two ways: It is the upper bound of fraction \(p\) of all latencies and the lower bound of fraction \(1-p\) of all latencies.

A note about percentiles: Percentiles can be written equivalently as either a fraction between 0 and 1 or a percentage between 0 and 100. In all formulas below, I present percentiles in their fractional form. Occasionally, for interpreting results, I will write them as percents.
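To make the definition concrete, here is a minimal sketch of computing \(LP(p)\) from observed latencies with a simple rank-based estimate. The sample distribution and its parameters are hypothetical, chosen so the median lands near 100 ms.

```python
import random

def latency_percentile(latencies, p):
    """LP(p): the value v such that a fraction p of the observed
    latencies satisfy l <= v (simple rank-based estimate)."""
    ranked = sorted(latencies)
    index = min(int(p * len(ranked)), len(ranked) - 1)
    return ranked[index]

# Hypothetical sample: log-normal latencies with a median near 100 ms.
random.seed(42)
sample = [random.lognormvariate(4.6, 0.5) for _ in range(100_000)]

print(latency_percentile(sample, 0.50))  # ~100 ms
print(latency_percentile(sample, 0.99))  # ~320 ms; the tail sits well above the median
```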

The impact of lower-level services on their caller

Consider the case of a main service that calls multiple instances of a subservice in parallel and then waits for them all to complete before itself proceeding. We want to explore how the latency distribution of subservices affects the latencies of the main service. Informally, there clearly is an effect: Because the main has to wait for all the subservices to complete, slow subservices will slow down the main service. More precisely, the slowest subservice will set a lower bound on how quickly the main can complete.

At first glance, this doesn’t seem like much of a problem. Just take the Service Level Objective (SLO) for the subservices, add time for the main to operate on their results, and set the SLO for the main.

The problems arise when deciding which percentile to use in computing the SLO for the caller. How much time should we set aside for the subservice calls to return? How about the median, the .50 latency percentile, \(LP_{sub}(.50)\)? Let’s say the subservice’s \(LP_{sub}(.50) = 100\) ms. Should we just allocate 100 ms for the group of subservice calls?

Unfortunately, the resulting probabilities guarantee that the subservice calls will never come close to meeting such an SLO. If the main calls just five subservices, the probability that they will all return within 100 ms is just .03 (I’ll explain how I got this value in a moment).

The latencies of this simple case, a main service calling subservices whose results it must await before proceeding with further processing, are counterintuitive. There are two important lessons:

  1. All it takes is one slow subservice to drag the main down to its pace—and the probability of this occurring is surprisingly high.
  2. In particular, high-percentile latencies of the subservice will set extremely high lower bounds for the extreme latencies—yet such extreme latencies will arise far more commonly than their low probability might suggest.

Let’s walk through the details to see what causes these effects.

The basic model

Dean and Barroso present a combination of a short probabilistic analysis and some actual results from a large-scale Google service. Because I do not manage a global network of datacentres from which to present data, I am just going to extend their basic probabilistic model. By itself, the model is sufficient to reveal important issues.

The assumptions of the model are straightforward:

  1. The main service calls some number of subservice instances in parallel and must wait for all of them to respond before proceeding.
  2. The latencies of the subservice calls are independent of one another.
  3. The latencies of the subservice calls are identically distributed.

These assumptions aren’t restrictive. Although actual systems may not conform to this model, they will suffer from the issues I describe below. At the end of this post, I’ll consider the effects of relaxing these assumptions.

Effective percentile: Latency distribution of a collection of subservices

Using the basic model, calculating the impact of subservice latencies on the main service is straightforward. In the above example of a main calling five subservices, the probability that all subservices will complete below the \(LP_{sub}(.50)\) boundary is the product of the individual probabilities:

\[.50^5 = .03125\]

In effect, the \(LP_{sub}(.50)\) value of an individual service is the \(LP(.03)\) value of the latency for five instances called in parallel.

Generalizing, the probability of all subservices meeting their target can be seen as the effective percentile of the subservices taken as a group. For the \(p\) percentile of a called service and a caller fanout (the number of subservices called) of \(f\), the effective percentile of the caller is given by:

\[EP_{service}(p, f) = p^f\]

The effective percentile is a probability. The latency boundary associated with this probability is the original \(LP(p)\). In effect, fanning out to \(f\) instances moves the probability of this boundary down from \(p\) to the lower \(EP(p, f)\).

The graph presented by Dean and Barroso uses the above computation but reports the fraction for which \(LP(p)\) sets the lower bound, \(1 - EP(p,f)\). For example, the value called out on the lower right of their graph (\(LP_{sub}(.9999)\), fanout = 2000) is

\[1 - EP(.9999,2000) = .18\]
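These numbers are easy to verify. A minimal sketch, assuming only the independence of the basic model (the function name is my own):

```python
def effective_percentile(p: float, fanout: int) -> float:
    """EP(p, f): probability that all `fanout` parallel calls
    finish within the single-call LP(p) boundary."""
    return p ** fanout

print(effective_percentile(0.50, 5))           # 0.03125, the example above
print(1 - effective_percentile(0.9999, 2000))  # ~0.18, Dean and Barroso's value
```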

The effective percentile of high-percentile subservice latencies

Dean and Barroso focus on high percentiles such as \(LP(.98)\) or \(LP(.99)\), so let’s begin there. High percentile latencies tend to set disproportionately large lower bounds for extreme response times. For example, their Table 1 shows measurements of an actual Google service with \(LP(.50) = 1\) ms but \(LP(.99) = 10\) ms, an order of magnitude higher. Statistically, high percentile latencies represent samples from a different distribution than the lower-percentile values. Calls with high-percentile latencies might have occurred during garbage collection or a burst of network contention. The long latencies characteristic of high percentiles will in turn increase the latency of their caller.

You might take solace in the rarity of these high latencies—after all, they only occur 1–2% of the time. But when we call multiple instances of a service and wait for all of them to complete, the effective percentile is much lower, meaning that the probability of a high latency occurring for at least one of the subservices is surprisingly high.

The following table demonstrates this point. Each row represents a percentile, with each entry showing the minimal fanout necessary to reduce the effective percentile to the value in the column header.

| Subservice \(LP(p)\) | \(EP(p,f) = .95\) | \(.90\) | \(.80\) | \(.75\) | \(.50\) |
|---|---|---|---|---|---|
| .999 | 51 | 105 | 223 | 287 | 693 |
| .990 | 5 | 10 | 22 | 29 | 69 |
| .980 | 2 | 5 | 11 | 14 | 34 |
| .900 | – | 1 | 2 | 3 | 6 |

For example, for a service with \(LP(.99) = 10\) ms, if 10 instances of that service are called, the effective percentile is \(EP(.99,10)=.90.\) Viewing this percentile as a lower bound, this indicates that in 10% of the executions the last response will take at least 10 ms to arrive. If instead 69 instances were called, \(EP(.99, 69)=.50\), meaning half of all responses would take 10 ms or longer.
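The table can be generated mechanically. Here is a sketch that searches for the smallest fanout meeting each target; a few entries may differ by one from the table above depending on the rounding convention used.

```python
def min_fanout(p: float, target: float) -> int:
    """Smallest fanout f such that EP(p, f) = p**f <= target."""
    f = 1
    while p ** f > target:
        f += 1
    return f

# One row per subservice percentile, one column per target EP.
for p in (0.999, 0.990, 0.980, 0.900):
    row = [min_fanout(p, t) for t in (0.95, 0.90, 0.80, 0.75, 0.50)]
    print(p, row)
```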

This is the main point of the first half of “The Tail at Scale”: When you call multiple subservices, the inexorable math of effective percentiles makes high latencies far more common. These high latencies in turn increase the latency of the main service because it has to wait so long before it can begin its own processing of the results.

The rest of their article presents methods for alleviating this problem, including ways of making the high-percentile latencies less extreme and ways of issuing redundant requests to reduce the probability that the main service will have to wait for an extreme response. I’m not going to repeat Dean and Barroso’s discussion here but want to extend their argument to a case they did not address: the effective percentiles for lower percentiles such as the median (\(LP(.50)\)).

Lower percentiles drop even further due to fanout

In addition to increasing the risk of an extremely slow response from a subservice, fanout also makes the common case slower than you would expect. Indeed, this effect on common cases is even more pronounced than the one for the rarely occurring high percentiles.

Consider again the formula for an effective percentile, \(p^f\). Raising a probability substantially less than 1 to the power \(f\) drives it toward 0 far more quickly than raising a probability close to 1. Revisiting the example above, \(EP(.50, 5) = .031\). Considering this as a lower bound, it indicates that for 97% of the calls to the 5 subservices, the response time for the last replica will exceed the median for an individual service.

This demonstrates the problem with budgeting the calls to the subservices at \(LP(.50) = 100\) ms. In all but 3% of the cases, they will require more than 100 ms for the slowest response.
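A quick Monte Carlo check of this claim. The log-normal shape here is a hypothetical stand-in; only the median of the distribution matters for the result.

```python
import math
import random

random.seed(7)
TRIALS = 100_000
FANOUT = 5
MEDIAN = math.exp(4.6)  # exact median of the hypothetical distribution, ~99.5 ms

slow = 0
for _ in range(TRIALS):
    calls = [random.lognormvariate(4.6, 0.5) for _ in range(FANOUT)]
    if max(calls) > MEDIAN:  # the slowest of the five calls sets the pace
        slow += 1

print(slow / TRIALS)  # ~0.97: the slowest of five calls almost always exceeds the median
```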

This effect of fanout on lower percentiles is shown in the following table. For a given subservice percentile, it shows the fanout sufficient to bring the effective percentile below 10%:

| Percentile \(p\) | Fanout \(f\) for \(EP(p, f) \lt .10\) |
|---|---|
| .50 | 4 |
| .60 | 5 |
| .70 | 7 |
| .80 | 11 |

These results astonished me when I first calculated them. If I call as few as 11 subservices, the odds that they will all respond within their 80th percentile latency are under 10%.

The conclusion of these simple calculations is stark: In a microservices architecture, where a main service calls even a moderate number of subservices, the variance of the subservice latencies must be kept as small as possible if you want reasonable performance from the main service.

Revisiting assumptions

The above tables assume independent and identical distributions. Although these assumptions often do not hold in practice, the probabilities for actual systems will typically be as bad as or worse than this. Let’s consider relaxing each assumption in turn.

Correlated distributions are unlikely to improve performance

The \(p^f\) formula assumes all latencies are uncorrelated. In practice, latencies may be correlated due to one or more of the following causes:

  1. Shared hardware: For example, when a network switch experiences a load surge, all replicas served by it will have degraded response.
  2. Shared operation sequences: When two replicas process very similar sequences of operations, they have a tendency to begin garbage collection around the same time.
  3. Anticorrelations due to hot spots: In a sharded storage service, if one shard is slow due to receiving a disproportionate share of requests, the other shards will be faster due to their lighter load.
  4. Direct dependency: If one service calls another, the caller’s latencies will directly correlate to those of its subservice.

Ultimately, most of these correlations will increase the incidence of high latencies rather than reduce them. Anticorrelations will typically aggravate the problem rather than mitigate it: the effective percentile is determined by the slowest response, and anticorrelation implies that the slowest response is even slower.

The only form of correlated distribution that will mitigate the effective-percentile effect is positive correlation in which a latency reduction in one service predicts a latency reduction in all the others. While such circumstances may be possible in modern data centers, they are unlikely. Most of the time, correlated latencies will not mitigate effective percentiles and may even aggravate them.

Nonidentical distributions

The assumption of identical distribution simplifies the computation but different services and even different replicas of the same service will typically have different distributions. For distributions that are only slightly different, the results will be the same as above. On the other hand, if one service has a much larger boundary value for a given percentile than the others, that service will dominate the latency for that percentile and all higher percentiles. The designer should probably focus on managing the spread of latencies for the slow service, reducing its boundary values before managing the latencies for the faster services.
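Under the same independence assumption, the heterogeneous case is simply a product: the probability that all subservices finish within a shared deadline is the product of each service’s own probability of meeting that deadline. A sketch with hypothetical probabilities shows how a single slow service dominates:

```python
from math import prod

def ep_heterogeneous(probs):
    """Probability that every call finishes within a shared deadline,
    given each service's own probability of doing so."""
    return prod(probs)

# Hypothetical: four fast services and one slow one at a fixed deadline.
print(ep_heterogeneous([0.99, 0.99, 0.99, 0.99, 0.60]))  # ~0.58
# Improving the slow service helps far more than tightening the fast ones:
print(ep_heterogeneous([0.99, 0.99, 0.99, 0.99, 0.95]))  # ~0.91
```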

Calls in sequence rather than parallel

This analysis has focused on parallel calls to services but this is not essential. The mathematics is the same if the services are called in sequence: The odds of one service exceeding its percentile latency increase dramatically as more services are called. The effective percentile dilemma arises for any type of fanout, sequential or parallel.

Function calls will have the same effect as service calls

These effects are extremely general. I presented this analysis in terms of microservices but could rephrase it in terms of a main routine and the subroutines it calls, with identical results. The analysis also applies to a main Web page that loads components from many other sources.

Quorums keep effective percentiles high

Quorum-based algorithms, where \(n\) replicas are called and only the fastest \(k < n\) responses are required for the computation to proceed, avoid the effective-percentile trap by pruning the slowest responses. The “tied request” and “hedged request” techniques described by Dean and Barroso are variants of this approach.

The math for this case is beyond the scope of this post, but the informal logic is straightforward: If you do not wait for the slowest response, you will consistently draw from the lower-percentile latencies, and consequently the extreme latencies associated with high percentiles will be even less likely than their percentiles indicate.
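As a rough sketch of that logic, under the basic model’s assumptions (the function name is my own): the \(k\)-th fastest of \(n\) responses arrives within \(LP(p)\) exactly when at least \(k\) of the \(n\) calls do, which is a binomial tail probability.

```python
from math import comb

def ep_quorum(p: float, n: int, k: int) -> float:
    """Probability that at least k of n parallel calls finish
    within the single-call LP(p) boundary."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

print(ep_quorum(0.50, 5, 5))  # 0.03125: waiting for all five
print(ep_quorum(0.50, 5, 3))  # 0.5: waiting for only the fastest three
```

Waiting for three of five responses restores the median behavior that waiting for all five destroys.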

Conclusion

The probability of extreme latencies increases when a main service calls multiple subservices. The effective percentile of a given subservice latency is much lower than the stated percentile when that service is called multiple times. Dean and Barroso highlighted this outcome for high-percentile latencies, pointing out that the extreme boundary values associated with high percentiles become common enough to routinely affect the performance of the caller. When the argument is extended to middle percentiles such as the median, the effect becomes still more pronounced: If as few as five instances of the subservice are called, 97% of the call groups will include at least one call that takes longer than the median latency of a single call. System designers need to keep the variance of lower-level services as low as possible. The effective percentiles induced by multiple calls will make latencies longer than the median far more common than their single-call percentile would indicate.