The service design space (2020 Edition)
31 Dec 2019 Tags: Course design

Post in an ongoing series on the issues constraining service design for datacentres. A previous post presented 2019’s version, a design space for data engineering.
Starting January 2020, I will teach a course on service design for datacentres. The roster indicates that nearly all students will be in a professional master’s program. Most will be in their second semester of a program in Big Data, some will be in their fifth semester (including a two-semester paid internship), with a few in other programs. What are the key principles that should structure such a course?
This will be a revision of the course that I taught in Spring 2019. I was generally satisfied with that version and do not want to mess it up through excessive revision. “Second System Effect” is as much an issue in course design as it is in system design.
Defining the course focus as “service design for datacentres”
I want this course to provide students with the essential skills for designing and analyzing service-based architectures that will run in contemporary datacentre environments. This goal is broad, encompassing both practice using exemplary technologies and the underlying principles.
I have taught variations of this topic since 2014. Each year, I have struggled to define a succinct characterization. There doesn’t seem to be an accepted term for this material in either the academic or industrial literatures. Most troubling, I have not found a phrase that both appeals to students and accurately indicates what they are going to learn.
Calling it “cloud computing” sets the wrong expectation. Cloud computing typically refers to managed services offered by such providers as Alibaba, Amazon, Google, or Microsoft. These services are often categorized according to the Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS) model. A cloud computing course would emphasize that designers should first consider all the managed service offerings from their chosen provider, as well as aftermarket providers, before designing a custom solution. Why build and operate a data store in-house when DynamoDB, BigTable, or Azure Storage could meet your needs? By contrast, this course is concerned with how you connect services together and how you will know when you need to purpose-build your own.
Calling it “distributed systems” misleads in a different way. Such courses focus on the design of distributed systems in general, often with a focus on the correctness and complexity of distributed algorithms. Nancy Lynch’s 1996 Distributed Algorithms is a common text. This theory provides the essential underpinning of the material I want to teach, but it would take an entire semester to cover in its own right, and the material is not of strong interest to students in a professional master’s program. This course focuses instead on the implications of the above theoretical results for actual system design.
Last year, I attempted to make the material relevant to Big Data students by framing it in terms of “data engineering”. This now seems incomplete to me. Although descriptions of data engineering include core concepts from this course:
Most of the issues that you’ll run into will be around reliability and distributed systems. … you’ll need a system that can automatically scale your server count up and down … data engineering is about learning to deal with scale and efficiency.
—What is a Data Engineer?
[Data engineers] need some understanding of distributed systems in general and how they are different from traditional storage and processing systems.
—Data Engineering: A Quick and Simple Definition
the descriptions also list many other topics:
A good data engineer has extensive knowledge on databases and best engineering practices. These include … building human-fault-tolerant pipelines, … knowledge of database administration, maintaining data cleaning, and ensuring a deterministic pipeline.
—Data Science vs. Data Engineering
I would now say that this course is more about the infrastructure underlying data engineering than data engineering itself.
Ultimately, I am left with the phrase, “service design for datacentres”. It has the merit of precisely specifying the course focus but the oh-so-considerable demerit of being opaque to everyone else, both prospective students and fellow faculty.
The topic phrase comprises theory and practice
So what does “service design for datacentres” imply? Broadly, it is the design of services intended to run on a network of large-scale datacentres. A high proportion of modern software involves the design of such services, from edge computing in embedded devices, to middleware that processes and stores the data relayed from the edge, to the interfaces by which end users interact with that data and administrators manage the services. In the near future, it is likely that virtually every program written will provide a service interface to other programs and call other programs through their service interface, all mediated by networks.
Most of these programs will run inside large datacentres. This execution context has major constraints for the architecture running upon it:
- A service will scale by adding or removing replicas
- A service will need to tolerate failures of services it calls
- A service will need to be able to provide degraded or approximate service to its callers when it fails partially
- A service will have performance objectives, the most important of which will typically specify permissible latencies
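As a tiny sketch of the third constraint, consider a service that degrades to a generic cached result when a dependency fails (all names here are hypothetical):

```python
# Hypothetical sketch: a recommendation service degrades to a
# cached, generic result when its personalization dependency fails,
# rather than returning an error to its caller.

CACHED_FALLBACK = ["top-seller-1", "top-seller-2"]

def fetch_personalized(user_id):
    # Stand-in for a network call that can fail.
    raise ConnectionError("personalization service unreachable")

def recommendations(user_id):
    try:
        return fetch_personalized(user_id)
    except ConnectionError:
        # Degraded service: generic results instead of an error.
        return CACHED_FALLBACK

print(recommendations("u42"))
```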
Service interface but not necessarily microservice
This does not necessarily imply a microservice architecture. Although that is an important design choice in contemporary systems, and one that will be covered in the course, it is ultimately an internal choice, one of several defensible designs for a given application. The key points that will characterize virtually all designs are that they expose their functionality through a service-based interface, hide their internal decisions behind that interface, and communicate over a network. The granularity of those components is an implementation choice.
Student background
Most of the students will fit the following profile:
- Program entry requirements
- Bachelor’s degree in CS or related field.
- Courses students took in prior semesters
- In their first semester, students will already have taken a course
on machine learning and a lab introducing them to technologies of
large-scale data analysis, including HDFS, Hadoop, Spark, Cassandra,
and HBase. Some may have taken electives in visualization or natural
language processing.
Students taking the course in their second year will have at least two academic semesters and a two-semester internship behind them. This group will have taken a range of electives in their second semester.
- Courses students take concurrently
- Concurrently with this course, they will be taking a second lab, focussed on applied statistical analysis and machine learning. They will also be taking an algorithms course including Paxos and the Gilbert/Lynch CAP Theorem.
Approach
Given the lengthy topic list that follows, the course can only provide a broad survey, not deep understanding. But if the students become comfortable with these concepts, they will be prepared for a broad range of roles in modern computing environments.
The course runs on two tracks:
- A tool-based track, introducing a specific implementation approach, running Docker containers on the OpenShift distribution of Kubernetes.
- The principles of designing distributed, resilient services for a datacentre environment.
The tools track is neither in-depth nor exhaustive. It simply introduces one very large suite of components that is widely used for implementing services on a datacentre. It is far from the only approach, but the widespread use of Docker and Kubernetes means that there is a good likelihood that students will be able to apply their course experience immediately in the internships that follow this course.
The design principles are organized into the following design space.
The design space of cloud services
Designs can be analyzed according to a variety of criteria. Taken together, these criteria establish a design space within which to evaluate and contrast designs. These properties also establish measures of system success.
These properties can be applied to the system as a whole; many also characterize system components.
Metrics—measured properties of the system
Metrics are measurable properties of the system. They reflect the match between system architecture and use.
In addition to learning the following concepts, students need to become familiar with representing the distributions of metrics via key percentiles (50th, 75th, 90th, …). Students also need to understand the impact of “long tail” high percentiles, such as the 99th, on aggregate system performance.
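The impact of the long tail can be made concrete with a little arithmetic: a request that fans out to N backends in parallel is as slow as its slowest call, so the chance of encountering at least one 99th-percentile latency grows quickly with N:

```python
# Why the 99th percentile matters: a request that fans out to N
# backends in parallel is as slow as the slowest backend, so the
# chance of hitting at least one p99-or-worse latency grows with N.

def p_tail_hit(n_backends, tail_fraction=0.01):
    """Probability that at least one of n parallel calls lands in
    the slowest `tail_fraction` of the latency distribution."""
    return 1 - (1 - tail_fraction) ** n_backends

for n in (1, 10, 100):
    print(n, round(p_tail_hit(n), 3))
# With 100 parallel backend calls, roughly 63% of requests
# experience at least one p99 latency.
```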
Metrics can be part of objectives.
System performance metrics
These metrics characterize the system itself, running on a representative workload:
- Latency
- Throughput
- Scalability
- Availability
Data centre fabric metrics
These metrics characterize the design of the data centre:
- Processor performance
- Clock variability
- Bisection bandwidth
- Likelihood of component failures
Inter-data centre round trip metrics
For systems running on multiple data centres, perhaps in different continents, we need to consider the round trip times:
- Between availability zones in one region
- Between regions
Business metrics
As described in Chaos Engineering, pp. 22–23, business metrics are specific to your application. They measure how successfully the application is serving clients and meeting its purpose. The measurement is often indirect, using some metric as a proxy for the actually desired indicator, such as “real time customer satisfaction”. For example, Netflix measures the number of times the “start” button is pressed per second. As their service ultimately is about streaming video, an unexpected drop in start presses indicates the service is not meeting its purpose.
- Specific to application, measuring “service success” or “client satisfaction” (often indirectly)
Guarantees and invariants: Inherent properties of the system’s design
Whereas metrics are measurable attributes of a system that can vary with such factors as system load, properties are fundamental to the system’s design. For example, the design either replicates durable data or it does not.
Data durability
Durable data is stored by the system on behalf of the users, typically on disk or solid-state storage. What sort of guarantees does the system provide on this data?
- Replication for durability
- Replication for availability
- Synchronous vs. asynchronous writes
- Consistency guarantees
In principle, consistency guarantees can be made for ephemeral data that only ever resides in memory, but in most cases only eventual consistency is provided, minimizing processing cost.
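The synchronous-versus-asynchronous write choice can be sketched in a toy model (an illustration, not a real storage engine):

```python
# Toy sketch of the synchronous-vs-asynchronous write choice:
# a synchronous write is acknowledged only after every replica has
# stored the value; an asynchronous write acknowledges immediately
# and replicates later, leaving a window in which data can be lost.

class ReplicatedStore:
    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # writes not yet replicated

    def write_sync(self, key, value):
        for replica in self.replicas:   # durable everywhere first
            replica[key] = value
        return "ack"                    # slower, stronger guarantee

    def write_async(self, key, value):
        self.replicas[0][key] = value   # local write only
        self.pending.append((key, value))
        return "ack"                    # fast, but a data-loss window

    def flush(self):
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()
```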
Data security and privacy
When a service stores data, the designers must consider what techniques they will adopt to ensure its security and privacy. Although these concepts are related and partially overlap, it is useful to distinguish techniques for each:
- Security
- Techniques that ensure that data will only be accessed by authorized individuals and services.
- Privacy
- Techniques that ensure that the data will only be used for the purposes for which it was granted to the storing service.
For example, if you provide your address to a service for them to ship you a product, the service’s security guarantees would ensure that only employees in the shipping department could see that address, while their privacy guarantees would ensure that the employees who can see your address would only use it to ship the product, not to send harassing mail.
Security and privacy methods include both technical choices and organizational policy. Ultimately, while the technology can guarantee that the data will be processed according to rules such as “all records on persistent storage will be encrypted”, and the policies can limit access, perfect security and privacy are ideals that can only be approached, never achieved. A design establishes barriers to misuse but sufficiently dedicated individuals with sufficient resources will always be able to circumvent those barriers.
Data security techniques
Security techniques include:
- Encryption, at rest and in transit
- Key management
- Key revocation
- Access roles and rules
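Access roles and rules can be illustrated with a minimal sketch (the role, user, and resource names are hypothetical):

```python
# Minimal sketch of role-based access rules (all names hypothetical):
# a rule grants a role an action on a resource, and access is denied
# unless some rule matches one of the user's roles.

RULES = {
    ("shipping-clerk", "read", "customer-address"),
    ("billing", "read", "payment-method"),
}

USER_ROLES = {"alice": {"shipping-clerk"}, "bob": {"billing"}}

def allowed(user, action, resource):
    return any((role, action, resource) in RULES
               for role in USER_ROLES.get(user, ()))

assert allowed("alice", "read", "customer-address")
assert not allowed("bob", "read", "customer-address")
```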
Data privacy techniques
Privacy techniques include:
- Differential privacy
- Cell size
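Both techniques can be sketched in a few lines (the parameters here are purely illustrative): the Laplace mechanism adds calibrated noise to a count, and cell-size suppression withholds counts too small to release safely.

```python
import math
import random

# Sketch of two privacy techniques (illustrative parameters only):
# the Laplace mechanism adds noise calibrated to the query's
# sensitivity, and cell-size suppression withholds small counts.

def laplace_noise(scale, rng=random):
    """Draw one sample from a Laplace(0, scale) distribution."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def noisy_count(true_count, epsilon=1.0):
    # The sensitivity of a counting query is 1, so scale = 1/epsilon.
    return true_count + laplace_noise(1.0 / epsilon)

def publishable(count, min_cell_size=5):
    # Suppress cells too small to release safely.
    return count if count >= min_cell_size else None
```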
Instrumentation
How well does the system support monitoring its current state and diagnosing problems:
- Logs
- Traces
- Dashboards
- Probes
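Two of these, structured log lines and a liveness probe, can be sketched as follows (the field names are illustrative):

```python
import json
import logging
import time

# Sketch: structured (JSON) log lines are easier to aggregate and
# query than free-form text, and a probe handler lets the platform
# check that the process is still serving. Field names are illustrative.

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("orders")

def log_event(event, **fields):
    log.info(json.dumps({"ts": time.time(), "event": event, **fields}))

def liveness_probe():
    # A real probe would check critical dependencies; this one
    # simply reports that the process is up.
    return {"status": "ok"}

log_event("order_accepted", order_id="o-123", latency_ms=42)
```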
Fault tolerance
How well does the system tolerate failure of its own components and the services and data centre fabric upon which it depends:
- Tolerance of component failures within the system
- Tolerance of failure of external services upon which this depends
- Tolerance of larger system failures (availability zone, region, transoceanic partition, …)
How well does it support failure diagnosis, repair, and restart:
- Diagnostic logs
- Canary deployments
- Admission control
- Feature switches
- Approximate results
- Other engineering techniques
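Two of these engineering techniques, bounded retries with exponential backoff and a simple failure-count circuit breaker, can be sketched as follows (an illustration, not a production implementation):

```python
import time

# Sketch of two techniques for tolerating a failing dependency:
# bounded retries with exponential backoff, and a circuit breaker
# that stops calling the dependency after repeated failures.

class CircuitOpen(Exception):
    pass

class Breaker:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, fn, *args):
        if self.failures >= self.max_failures:
            raise CircuitOpen("dependency marked unhealthy")
        try:
            result = fn(*args)
            self.failures = 0  # any success resets the breaker
            return result
        except Exception:
            self.failures += 1
            raise

def retry(fn, attempts=3, base_delay=0.01):
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)  # exponential backoff
```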
Business logic invariants
Business logic invariants are the business property counterparts to business metrics: The guarantees the system makes for entities that the system’s clients care about. For example, banks maintain the invariant that if you transfer money between two accounts, the amount taken from one equals the amount added to the other.
Eric Brewer notes that many business invariants are only implicitly specified for single-processor systems, which typically guarantee strong consistency. When consistency guarantees are loosened (see “Data durability” above) to migrate the system to a distributed implementation, the business invariants need to be specified explicitly.
Business invariants often must specify a mitigation procedure, to resolve any violations (see the Brewer article cited above). For example, what is the process for redressing an error in an account’s balance, whether due to system defect or human error?
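The bank example can be expressed directly as an invariant check (a sketch using hypothetical in-memory accounts):

```python
# The bank-transfer invariant from the example above, as a sketch:
# the transfer applies either both sides or neither, so the total
# across accounts is unchanged. Account names are hypothetical.

accounts = {"chequing": 100, "savings": 50}

def transfer(src, dst, amount):
    if accounts[src] < amount:
        raise ValueError("insufficient funds")  # neither side applied
    accounts[src] -= amount
    accounts[dst] += amount

before = sum(accounts.values())
transfer("chequing", "savings", 30)
assert sum(accounts.values()) == before  # invariant holds
```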
Automated deployment
In addition to readying the initial release for production, the service will require automated support for deploying updates.
Indicators and objectives
The above metrics, guarantees, and invariants shape the design space of the architecture for modern cloud systems. As such, they are direct concerns of the development and operations staff. In addition, they may be exposed to clients as indicators and even contractual obligations. As defined in Site Reliability Engineering, Ch. 4:
- Service Level Indicators (SLI)
- Service Level Objectives (SLO)
- Service Level Agreements (SLA)
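A sketch of how an indicator relates to an objective, using availability as the SLI and illustrative numbers:

```python
# Sketch: an availability SLI is the fraction of requests served
# successfully over a window, and an SLO is a target for that
# indicator. The numbers here are illustrative.

def availability_sli(successful, total):
    return successful / total

def meets_slo(sli, slo=0.999):
    return sli >= slo

sli = availability_sli(successful=999_100, total=1_000_000)
print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli)}")
```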
Development process metrics
The above metrics and properties characterize the system’s design and performance. Other metrics characterize the development process:
- Velocity of feature development (Chaos Engineering, p. 9)
Inherent limitations of distributed systems
Decades of study and practice in distributed systems have yielded principles and rules of thumb characterizing all such systems. As modern cloud systems are a species of distributed system, their designers must account for how these issues arise in this context.
Properties of distributed systems that must be accounted for (from Waldo et al., 1994):
- Latency
- Message-passing architecture (no shared memory)
- True concurrency
- Partial failure (process, service, and machine failure)
No common system clock (Lamport, 1978):
- Systems must use some variant of logical clocks, vector clocks, or interval clocks
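Lamport’s logical clock rule is compact enough to sketch directly: increment on each local event or send, and on receipt take the maximum of the local and received timestamps, plus one.

```python
# Sketch of Lamport's logical clock rule: tick on each local event
# or send; on receiving a message, take max(local, received) + 1 so
# that causally related events are consistently ordered.

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event or message send.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # msg_time is the sender's clock value carried on the message.
        self.time = max(self.time, msg_time) + 1
        return self.time
```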
“The Eight Fallacies of Distributed Computing” (rephrased below as statements about how distributed systems actually work):
- The network is unreliable (messages can be out of order or lost, or connectivity may be dropped altogether).
- Latency is non-zero.
- Bandwidth is finite.
- The network is insecure.
- Topology changes.
- There are multiple administrators.
- Transport cost is non-zero.
- The network is heterogeneous.
Conclusion
The design space for production cloud-based data services is huge. The service architect and implementation team must trade off between many conflicting goals and build a service that integrates well with the operations of the organization as a whole. This design process turns a robust, accurate data model—the kernel of a service but not an actual service—into a production service.