Minimal OS and networking knowledge for a data engineering course

23 Oct 2018 Tags: Big data, Course design

Third in a series on designing a data engineering course. The previous post presented draft learning outcomes. The next post considers what exercises will best achieve the learning outcomes.

Although the students in this course are expected to have a CS degree or equivalent, most of them will have spent several years in industry, long enough for the lessons from that degree to have faded.

Many of the concepts listed in the design space presume familiarity with the topics in undergraduate operating systems and networking courses. Some of these topics, such as the basics of TCP, are likely to have been kept fresh by use at work. Anyone building a simple Web application has to know the basics of IP addresses and ports. Other concepts, especially those concerned with synchronization and communication between concurrent processes, will have become unfamiliar from lack of use, to the extent students ever understood them. Honestly, most students (including me!) complete their OS course with an incomplete and imperfect understanding of that material.

On the other hand, some students will enter the course with a strong understanding of these concepts and an eagerness to learn them more deeply and apply them to data engineering. These students will be bored by repeating material they already know and will see the time spent on it as a lost opportunity to learn something new. We don’t want to bore these students, who are often the most competent and motivated in the cohort.

Which raises the questions:

What is the OS and networking knowledge that I should consider prerequisite to the course and unnecessary to review? If students find this material unfamiliar, I simply point them to outside resources as remedial learning.
What is the minimal OS and networking knowledge that I should review in the course, and in how much depth should I cover it?
What is the best way to review this material, such that students who have forgotten it can pick it up quickly while students who already know it will remain interested?

I’ll begin by simply enumerating the required concepts from each domain.

OS concepts

Fortunately, only a subset of the concepts from classical operating systems is essential to data engineering:

Process model
- Process isolation
- Virtual memory
- Threads
Kernel versus applications
Concurrency
- Critical section and critical resource
- Race conditions
- Semaphore
- Deadlock
- Fairness
- Starvation
Interprocess communication
- Shared memory
- Message passing
Scheduler
Priority
Resource allocation
Memory management
- Internal versus external fragmentation
Memory hierarchy
- Caching
Context switching
Asynchronous vs. synchronous calls
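
To make the concurrency items concrete, here is a minimal Python sketch of a critical section guarded by a lock (playing the role of a binary semaphore). The names `run_counter` and `worker` are hypothetical, chosen only for this illustration. The shared counter is the critical resource; without the lock, the read-modify-write in `counter += 1` can interleave across threads and updates may be lost, which is the race condition.

```python
import threading


def run_counter(n_threads: int, n_increments: int) -> int:
    """Increment one shared counter from several threads and return its final value."""
    counter = 0
    lock = threading.Lock()  # guards the critical section below

    def worker() -> None:
        nonlocal counter
        for _ in range(n_increments):
            # Critical section: only one thread at a time may update the counter.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(run_counter(n_threads=4, n_increments=10_000))
```

With the lock in place the result is always `n_threads * n_increments`. Removing the `with lock:` line is a quick way for students to see whether lost updates show up on their machine, though the outcome then depends on the interpreter and scheduler.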

Bear in mind that the students will be familiar with using tools such as file systems and even a distributed file system such as HDFS. I am listing here OS internals that they need to understand.

New concepts in operating systems

These concepts may have been touched upon in their undergraduate course, but they will not have been covered in depth and may never have been mentioned at all:

Virtualization
- Virtual machine
- Hypervisor
Container
- Leaks in process isolation
Orchestration
Multi-level scheduler

Networking concepts

The basic concepts from an undergraduate networking course that they need to know to do data engineering:

IP address
- Port number
- Well-known ports
- Loopback address
TCP and UDP
Delivery guarantees
- None (zero to unlimited copies delivered)
- At most once
- At least once
- Exactly once
Flow control
Remote procedure calls
Message buffering
Marshalling and serialization
Topology
Distinction between wide-area and local-area networks
Client-server architecture
Message signing
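
As one illustration of marshalling and message buffering, here is a small Python sketch that serializes a record to JSON and frames it with a 4-byte length prefix in network byte order, so a receiver can split a buffered byte stream back into messages. The framing format and the function names `marshal` and `unmarshal` are assumptions made for this example, not a standard protocol.

```python
import json
import struct
from typing import Tuple


def marshal(record: dict) -> bytes:
    """Serialize a record to JSON and prefix it with its length (big-endian uint32)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return struct.pack("!I", len(payload)) + payload


def unmarshal(buffer: bytes) -> Tuple[dict, bytes]:
    """Decode one framed record from the front of the buffer; return it and the remainder."""
    (length,) = struct.unpack("!I", buffer[:4])
    payload = buffer[4:4 + length]
    return json.loads(payload.decode("utf-8")), buffer[4 + length:]


if __name__ == "__main__":
    # Two messages arrive back to back in one receive buffer.
    stream = marshal({"id": 1, "op": "put"}) + marshal({"id": 2, "op": "get"})
    first, rest = unmarshal(stream)
    second, rest = unmarshal(rest)
    print(first, second, rest)
```

The length prefix matters because TCP delivers a byte stream, not discrete messages: the receiver needs some way to find message boundaries in whatever has accumulated in its buffer.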

New concepts in networking

These concepts specifically relate to the current cloud architectures of data centres, availability zones, and regions:

Data centre networking
- Bisection bandwidth
- “Top-of-rack” (usually located in the middle of the rack) switch
Relative latencies of within-centre, within-region, and inter-region round trips
Shuffling (which exercises both the file system and the network)
Idempotent messages
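
The link between idempotent messages and the delivery guarantees listed earlier can be sketched in a few lines of Python. This is a simulation under stated assumptions, not a real messaging API: `Account`, `apply_deposit`, and `deliver_at_least_once` are hypothetical names, and lost acknowledgements are modelled by a simple counter. The sender retries until it sees an acknowledgement (at least once), and the receiver remembers message IDs so that duplicates have no further effect.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    balance: int = 0
    seen_ids: set = field(default_factory=set)

    def apply_deposit(self, msg_id: str, amount: int) -> None:
        """Apply a deposit once per message ID; repeats of the same ID are ignored."""
        if msg_id in self.seen_ids:
            return
        self.seen_ids.add(msg_id)
        self.balance += amount


def deliver_at_least_once(account: Account, msg_id: str, amount: int, acks_lost: int) -> int:
    """Resend until an acknowledgement gets through; the first `acks_lost` acks are dropped.

    Returns the number of times the message was delivered to the receiver.
    """
    attempts = 0
    while True:
        attempts += 1
        account.apply_deposit(msg_id, amount)  # the message itself always arrives
        if attempts > acks_lost:               # this ack finally reaches the sender
            return attempts


if __name__ == "__main__":
    account = Account()
    deliveries = deliver_at_least_once(account, "m1", 100, acks_lost=2)
    print(deliveries, account.balance)
```

The message is delivered three times here, yet the balance changes only once. At-least-once delivery combined with an idempotent receiver gives the effect of exactly-once processing, at the cost of the receiver storing the IDs it has seen.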

Topics intrinsic to the data engineering course

The above lists comprise prerequisite knowledge of other domains. The bulk of the data engineering course will consist of topics specific to this domain, such as those listed in the design space post and the learning outcomes post.
