Guarantees vs. metrics: What the designers intend, what the clients see

In my distributed systems design space, I distinguish guarantees from metrics. A guarantee is a property a system is designed to ensure in all cases, such as consistency between replicas of a data store, or the number of copies written to persistent storage before a result is returned to the client. In contrast, a metric is a measurable quantity of a system, such as median latency or annual availability. The guarantee/metric distinction arises from the designer’s perspective and clearly separates the two categories. But when we consider the categories instead from the perspective of the system’s clients, this distinction becomes fuzzier. Is it real? Is it valuable?

more ...

Flambloozlement and the student enthusiasm for crunch mode

How do students assess the value of what they learn in class? On the one hand, years of attending school have made them sophisticated judges of teaching effectiveness: They know whether an approach is working for their circumstances and purposes, although these may diverge far from the instructor’s goals. At the same time, students can harbour a curious enthusiasm for a course precisely because it overwhelms them. Why do they respond this way?

more ...

How hard is it to implement replicated state machines?

Note: Over the past eight months, I have been working with the code for Tencent’s Phxpaxos, an open-source implementation of replicated state machines, with consensus enforced by Paxos. I have learned a lot from studying this code and from comparing it with the more focussed libpaxos, which does not include state machine code.

Recently I have isolated several topics well suited to the smaller size of blog posts, and I will be adding them here.

Distributed systems are often made highly available by implementing them as replicated state machines, coordinating their operations via a consensus algorithm such as Paxos. Two popular tutorials from the 1990s describe the basic theory, but how hard is it to implement this theory? The code from the Phxpaxos project suggests that the general solution requires more work than it might seem.

more ...

Applying resilient design to my own systems

When I teach a course in modern distributed systems design, I spend a lot of time on topics such as engineering for resilience and automating production processes. It turns out that applying those big topics in the small matters of my life is a big challenge.

more ...

"Distributed Systems" are not what they used to be

The phrase “distributed systems” has had several meanings over the last 50 years, from a mathematical topic to a style of engineering practice to a suite of technologies. What can we learn from these changes and how do we reconcile these perspectives?

more ...