The data engineering design space (2019 Edition)
13 Oct 2018 · Tags: Course design, Big data
First in a series on designing a course on data engineering. The next post specifies the learning outcomes. There is a version updated for 2020.
I will teach a course on Data Engineering in Spring 2019. The course will be the second semester of a professional graduate program in Big Data.
Student background
- Program entry requirements
  - Bachelor’s degree in CS or related field.
- Courses students took in prior semesters
  - In their first semester, students will already have taken a course on machine learning and a lab introducing them to technologies of large-scale data analysis, including HDFS, Hadoop, Spark, Cassandra, and HBase. Some may have taken electives in visualization or natural language processing.
- Courses students take concurrently
  - Concurrently with the data engineering course, they will be taking a second lab, focussed on applied statistical analysis and machine learning. They will also be taking an algorithms course including Paxos and the Gilbert/Lynch CAP Theorem.
Many students will enter the course straight from their first semester. A few will take it in their final semester, as an elective, after completing an eight-month paid internship in a big data-related role.
Data engineering: From data model to cloud-ready data service
Data engineering is the practice of transforming a proof-of-concept data science model into a production-quality service. Amongst other things, this may include:
- Hardening the system against failure
- Making the model more efficient
- Parallelizing a single-process model
- Reimplementing the model in a language more suited to production environments
- Automating the handling of exceptional cases such as data in the wrong format
- Adding instrumentation to monitor system performance, input data drift, gray failures, and other issues
- Automating deployment of revisions, including backout in the case of failure
- Adding admission controls and feature switches for restarting after failure
- Working with the data centre scheduler and orchestration services to ensure that the service is compatible with any co-tenant services on its instance and with the data centre as a whole
Most of these practices are central to any component of a service-oriented architecture running in a large data centre. Although these data centres may be hosted in a variety of ways (by a commercial service, on premises, or a hybrid of the two), I will group them under the single term “cloud”.
The one-sentence summary of all these improvements is making a data model cloud-ready.
Approach
Given the lengthy topic list that follows, the course can only provide a broad survey, not deep understanding. The importance of these topics extends beyond data engineering. If the students become comfortable with these concepts, they will be prepared for a broad range of roles in cloud computing, well beyond services for data analysis.
The design space of cloud services
I have taught cloud computing to undergraduates three times, and every time I regretted not setting out clear definitions of the foundational properties that characterize such systems. Beyond giving students a language for discussing the topics, these definitions establish the design space for course discussions and a frame within which to evaluate designs. These properties are also the measures of system success.
These properties can be applied to the system as a whole; many also characterize system components.
Metrics—measured properties of the system
Metrics are measurable properties of the system. They reflect the match between system architecture and use.
In addition to learning the following concepts, students need to become familiar with representing the distributions of metrics via key percentiles (50th, 75th, 90th, …). Students also need to understand the impact of “long tail” high percentiles, such as the 99th, on aggregate system performance.
Metrics can be part of objectives.
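To make the percentile vocabulary concrete, here is a minimal sketch; the simulated workload is my own invention, not course material. It summarizes a latency distribution by its key percentiles and shows why a “long tail” 99th percentile dominates once a single request fans out to a hundred backends:

```python
import random

# Simulate 100,000 request latencies in milliseconds: mostly fast,
# with a long right tail. The workload shape is illustrative only.
random.seed(42)
latencies = sorted(random.lognormvariate(3.0, 0.7) for _ in range(100_000))

def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list."""
    index = min(len(sorted_values) - 1, int(p / 100 * len(sorted_values)))
    return sorted_values[index]

for p in (50, 75, 90, 99, 99.9):
    print(f"p{p}: {percentile(latencies, p):8.1f} ms")

# Tail at scale: a request fanned out to 100 backends is only as fast
# as its slowest backend, so the p99 backend latency dominates.
print("P(100-way fan-out includes a >p99 response):", 1 - 0.99 ** 100)
```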
System performance metrics
These metrics characterize the system itself, running on a representative workload:
- Latency
- Throughput
- Scalability
- Availability
Data centre fabric metrics
These metrics characterize the design of the data centre:
- Processor performance
- Clock variability
- Bisection bandwidth
- Likelihood of component failures
Inter-data centre round trip metrics
For systems running in multiple data centres, perhaps on different continents, we also need to consider round-trip times (a measurement sketch follows this list):
- Between availability zones in one region
- Between regions
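The rough magnitudes span orders: fractions of a millisecond within a zone, single-digit milliseconds between zones in a region, and tens to hundreds of milliseconds between regions. One simple way for students to measure this themselves is to time TCP handshakes; the addresses below are placeholders, not real hosts:

```python
import socket
import time

# Placeholder addresses; substitute hosts you control in each location.
ENDPOINTS = {
    "same availability zone": ("10.0.1.15", 443),
    "same region, other AZ":  ("10.0.2.15", 443),
    "other region":           ("203.0.113.7", 443),
}

def tcp_rtt_ms(host, port, samples=5):
    """Estimate round-trip time by timing TCP handshakes."""
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=2):
            pass  # connection established; handshake time is about one RTT
        times.append((time.perf_counter() - start) * 1000)
    return min(times)  # min discards scheduling and retransmit noise

for label, (host, port) in ENDPOINTS.items():
    print(f"{label}: ~{tcp_rtt_ms(host, port):.1f} ms")
```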
Business metrics
As described in Chaos Engineering, pp. 22–23, business metrics are specific to your application. They measure how successfully the application serves its clients and meets its purpose. The measurement is often indirect, using an observable metric as a proxy for the indicator you actually want, such as “real-time customer satisfaction”. For example, Netflix measures the number of times the “start” button is pressed per second. Because their service is ultimately about streaming video, an unexpected drop in start presses indicates the service is not meeting its purpose (sketched after this list).
- Specific to application, measuring “service success” or “client satisfaction” (often indirectly)
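Netflix has not published its implementation, but the idea is easy to sketch: track the proxy metric against a trailing baseline and alert on an unexpected drop. The window and threshold below are arbitrary choices of mine:

```python
from collections import deque

class ProxyMetricAlarm:
    """Alert when a business proxy metric, such as stream starts per
    second, drops well below its recent trailing average."""

    def __init__(self, window=60, drop_fraction=0.5):
        self.history = deque(maxlen=window)   # last `window` readings
        self.drop_fraction = drop_fraction

    def observe(self, starts_per_second):
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            if starts_per_second < self.drop_fraction * baseline:
                print(f"ALERT: {starts_per_second:.0f}/s "
                      f"vs. baseline {baseline:.0f}/s")
        self.history.append(starts_per_second)

alarm = ProxyMetricAlarm()
for t in range(120):
    alarm.observe(1000 if t < 90 else 300)   # simulated outage at t = 90
```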
Guarantees and invariants: Inherent properties of the system’s design
Whereas metrics are measurable attributes of a system that vary with factors such as system load, guarantees and invariants are inherent to the system’s design. For example, the design either replicates durable data or it does not.
Guarantees for durable data
Durable data is stored by the system on behalf of the users, typically on disk or solid-state storage. What sort of guarantees does the system provide on this data?
- Replication for durability
- Replication for availability
- Synchronous vs. asynchronous writes
- Consistency guarantees
In principle, consistency guarantees can also be made for ephemeral data that only ever resides in memory, but in most cases only eventual consistency is provided, minimizing processing cost.
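A minimal sketch of the synchronous/asynchronous trade-off, with a sleep standing in for the network and hypothetical replica names: synchronous replication buys durability at the cost of latency, asynchronous the reverse:

```python
import concurrent.futures
import time

REPLICAS = ["replica-a", "replica-b", "replica-c"]  # hypothetical nodes

def send_to_replica(replica, record):
    """Stand-in for a network write; real code would RPC to the node."""
    time.sleep(0.05)
    return f"{replica} acked {record!r}"

pool = concurrent.futures.ThreadPoolExecutor()

def write_sync(record):
    # Synchronous replication: acknowledge the client only after every
    # replica has stored the record. Slower, but a crash immediately
    # after the ack cannot lose the write.
    futures = [pool.submit(send_to_replica, r, record) for r in REPLICAS]
    for f in concurrent.futures.as_completed(futures):
        print(f.result())
    return "committed"

def write_async(record):
    # Asynchronous replication: acknowledge immediately and replicate
    # in the background. Lower latency, but the write can be lost if
    # the primary fails before replication completes.
    for r in REPLICAS:
        pool.submit(send_to_replica, r, record)
    return "accepted (replication in flight)"

print(write_sync({"id": 1}))
print(write_async({"id": 2}))
```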
Instrumentation
How well does the system support monitoring its current state and diagnosing problems:
- Logs
- Dashboards
- Probes
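A toy illustration of all three items, assuming nothing beyond the standard library: structured log lines a log indexer can query, a counter a dashboard could scrape, and a liveness probe built on them. Names like `healthz` are conventions, not requirements:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("svc")

REQUEST_COUNT = 0              # a counter a dashboard could scrape
LAST_REQUEST_TS = time.time()  # feeds the liveness probe below

def handle_request(payload):
    """Stand-in request handler, instrumented with a structured log line."""
    global REQUEST_COUNT, LAST_REQUEST_TS
    start = time.perf_counter()
    result = payload.upper()   # placeholder for real work
    REQUEST_COUNT += 1
    LAST_REQUEST_TS = time.time()
    # JSON log lines are easy for a log indexer to parse and query.
    log.info(json.dumps({
        "event": "request",
        "latency_ms": round((time.perf_counter() - start) * 1000, 3),
    }))
    return result

def healthz():
    """Probe: is the service alive and handling traffic recently?"""
    return {"ok": time.time() - LAST_REQUEST_TS < 60,
            "requests": REQUEST_COUNT}

handle_request("hello")
print(healthz())
```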
Fault tolerance
How well does the system tolerate failure of its own components and the services and data centre fabric upon which it depends:
- Tolerance of component failures within the system
- Tolerance of failure of external services upon which the system depends
- Tolerance of larger system failures (availability zone, region, transoceanic partition, …)
How well does it support diagnosis, repair, and restart after failure (several of these techniques are sketched after this list):
- Diagnostic logs
- Canary deployments
- Admission control
- Feature switches
- Approximate results
- Other engineering techniques
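Admission control, feature switches, and approximate results fit in a few lines. This sketch (all names and limits are hypothetical) sheds load above a concurrency limit and serves a cheap fallback while an expensive feature is switched off:

```python
FEATURE_FLAGS = {"recommendations": False}  # switched off while recovering
MAX_IN_FLIGHT = 100                         # admission-control limit
in_flight = 0

def handle(request):
    global in_flight
    # Admission control: shed load rather than collapse under the
    # thundering herd of retries that follows a restart.
    if in_flight >= MAX_IN_FLIGHT:
        return "503: retry later"
    in_flight += 1
    try:
        if FEATURE_FLAGS["recommendations"]:
            return expensive_recommendations(request)
        # Approximate result: a cheap fallback keeps the core service
        # available while the expensive path is switched off.
        return "generic top-10 list"
    finally:
        in_flight -= 1

def expensive_recommendations(request):
    return f"personalized list for {request}"

print(handle("user-42"))                 # fallback while the flag is off
FEATURE_FLAGS["recommendations"] = True
print(handle("user-42"))                 # full path once healthy again
```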
Business logic invariants
Business logic invariants are the business property counterparts of business metrics: The guarantees the system makes for entities that the system’s clients care about. For example, banks maintain the invariant that if you transfer money between two accounts, the amount taken from one equals the amount added to the other.
Eric Brewer notes that many business invariants are only implicitly specified for single-processor systems, which typically guarantee strong consistency. When consistency guarantees are loosened (see “Guarantees for durable data” above) to migrate the system to a distributed implementation, the business invariants need to be specified explicitly.
Business invariants often must specify a mitigation procedure, to resolve any violations (see the Brewer article cited above). For example, what is the process for redressing an error in an account’s balance, whether due to system defect or human error?
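On a single node, the transfer invariant falls out of an ordinary transaction, which is exactly Brewer’s point about implicit specification. A minimal sketch with an in-memory SQLite ledger (the schema and amounts are illustrative):

```python
import sqlite3

# In-memory ledger; the schema and amounts are illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
db.executemany("INSERT INTO accounts VALUES (?, ?)",
               [("alice", 100), ("bob", 50)])

def transfer(src, dst, amount):
    """Debit and credit in one transaction, so the invariant
    'total money is conserved' holds even across failures."""
    with db:  # both updates commit together or not at all
        db.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                   (amount, src))
        db.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                   (amount, dst))

total = "SELECT SUM(balance) FROM accounts"
before = db.execute(total).fetchone()[0]
transfer("alice", "bob", 30)
after = db.execute(total).fetchone()[0]
assert before == after == 150   # the invariant held
```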
Automated deployment
In addition to readying the initial release for production, the service will require automated support for deploying updates.
Indicators and objectives
The above metrics, guarantees, and invariants shape the design space of the architecture for modern cloud systems. As such, they are direct concerns of the development and operations staff. In addition, they may be exposed to clients as indicators and even contractual obligations. As defined in Site Reliability Engineering, Ch. 4:
- Service Level Indicators (SLI)
- Service Level Objectives (SLO)
- Service Level Agreements (SLA)
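A toy illustration of the distinction, over a hypothetical request log with thresholds invented for the example. The computation is the same in each case; what differs is who the number is promised to:

```python
# Hypothetical request log: (latency_ms, succeeded) pairs.
requests = [(12, True)] * 990 + [(450, True)] * 6 + [(30, False)] * 4

# SLI: a measured indicator, here the fraction of requests that were
# both successful and under a 300 ms latency threshold.
good = sum(1 for ms, ok in requests if ok and ms < 300)
sli = good / len(requests)

SLO = 0.99   # internal objective the team commits to
SLA = 0.95   # external contract, typically looser, with penalties

print(f"SLI = {sli:.3f}")
print(f"SLO ({SLO}) met: {sli >= SLO}")
print(f"SLA ({SLA}) met: {sli >= SLA}")
```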
Development process metrics
The above metrics and properties characterize the system’s design and performance. Other metrics characterize the development process:
- Velocity of feature development (Chaos Engineering, p. 9)
Inherent limitations of distributed systems
Decades of study and practice in distributed systems have yielded principles and rules of thumb characterizing all such systems. As modern cloud systems are a species of distributed system, their designers must account for how these issues arise in this context.
Properties of distributed systems that must be accounted for (from Waldo et al., 1994):
- Latency
- Message-passing architecture (no shared memory)
- True concurrency
- Partial failure (process, service, and machine failure)
No common system clock (Lamport, 1978):
- Systems must use some variant of logical clocks, vector clocks, or interval clocks
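Lamport’s clock itself is only a few lines; a minimal sketch:

```python
class LamportClock:
    """Lamport's logical clock (Lamport, 1978): a counter that orders
    events causally without any shared physical clock."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time      # the timestamp travels with the message

    def receive(self, message_time):
        # Jump past the sender's clock so causes precede effects.
        self.time = max(self.time, message_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()           # a's clock: 1
b.receive(t)           # b's clock jumps to 2, after the send
b.local_event()        # b's clock: 3
print(a.time, b.time)  # 1 3
```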
“The Eight Fallacies of Distributed Computing” (rephrased as statements about how distributed systems actually work):
- The network is unreliable (messages can be out of order or lost, or connectivity may be dropped altogether).
- Latency is non-zero.
- Bandwidth is finite.
- The network is insecure.
- Topology changes.
- There are multiple administrators.
- Transport cost is non-zero.
- The network is heterogeneous.
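Restated this way, the fallacies dictate defensive patterns. The first two alone force every remote call to carry a timeout budget, retries, and jittered backoff, as in this sketch (the failing call is simulated):

```python
import random
import time

def unreliable_call():
    """Stand-in for a remote request that sometimes fails."""
    if random.random() < 0.3:
        raise ConnectionError("message dropped")
    return "ok"

def call_with_retries(max_attempts=5, base_delay=0.1):
    # The network is unreliable and latency is non-zero, so every
    # remote call needs retries and exponential backoff with jitter
    # (to avoid synchronized retry storms). The retried operation
    # must be safe to repeat, i.e. idempotent.
    for attempt in range(max_attempts):
        try:
            return unreliable_call()
        except ConnectionError:
            time.sleep(base_delay * 2 ** attempt * random.uniform(0.5, 1.5))
    raise TimeoutError("gave up after retries")

print(call_with_retries())
```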
Conclusion
The design space for production cloud-based data services is huge. The service architect and implementation team must make trade-offs among many conflicting goals and build a service that integrates well with the operations of the organization as a whole. This design process turns a robust, accurate data model—the kernel of a service but not an actual service—into a production service.