Minimal OS and networking knowledge for a data engineering course
23 Oct 2018 Tags: Big data, Course design
Third in a series on designing a data engineering course. The previous post presented draft learning outcomes. The next post considers what exercises will best achieve the learning outcomes.
Although the students in this course are expected to have a CS degree or equivalent, most of them will have spent several years in industry, long enough for the lessons from that degree to have faded.
Many of the concepts listed in the design space presume familiarity with the topics in undergraduate operating systems and networking courses. Some of these topics, such as the basics of TCP, are likely to have been kept fresh by use at work. Anyone building a simple Web application has to know the basics of IP addresses and ports. Other concepts, especially those concerned with synchronization and communication between concurrent processes, will have become unfamiliar from lack of use, to the extent students ever understood them. Honestly, most students (including me!) complete their OS course with an incomplete and imperfect understanding of that material.
On the other hand, some students will enter the course with a strong understanding of these concepts and an eagerness to learn them more deeply and apply them to data engineering. These students will be bored by repeating material they already know and will see the time spent on it as a lost opportunity to learn something new. We don’t want to bore these students, who are often the most competent and motivated in the cohort.
Which raises the questions:
- What is the OS and networking knowledge that I should consider prerequisite to the course and unnecessary to review? If students find this material unfamiliar, I will simply point them to outside resources for remedial learning.
- What is the minimal OS and networking knowledge that I should review in the course, and in how much depth should I cover it?
- What is the best way to review this material, such that students who have forgotten it can pick it up quickly while students who already know it will remain interested?
I’ll begin by simply enumerating the required concepts from each domain.
OS concepts
Fortunately, only a subset of the concepts from classical operating systems is essential to data engineering:
- Process model
- Process isolation
- Virtual memory
- Threads
- Kernel versus applications
- Concurrency
- Critical section and critical resource
- Race conditions
- Semaphore
- Deadlock
- Fairness
- Starvation
- Interprocess communication
- Shared memory
- Message passing
- Scheduler
- Priority
- Resource allocation
- Memory management
- Internal versus external fragmentation
- Memory hierarchy
- Caching
- Context switching
- Asynchronous vs. synchronous calls
Bear in mind that the students will be familiar with using tools such as file systems and even a distributed file system such as HDFS. I am listing here OS internals that they need to understand.
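As a reminder of what several of the concurrency items above look like in practice, here is a minimal Python sketch of a race condition and the critical section that removes it. The function name `run_counter` and its parameters are hypothetical, chosen only for this illustration. The unprotected path may or may not lose updates on a given run, depending on the interpreter and its scheduling, which is exactly what makes race conditions hard to test for.

```python
import threading


def run_counter(n_threads: int, increments: int, use_lock: bool) -> int:
    """Increment a shared counter from several threads; return the final value."""
    counter = 0
    lock = threading.Lock()  # guards the critical section below

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            if use_lock:
                # Critical section: only one thread at a time may
                # read, add to, and write back the shared counter.
                with lock:
                    counter += 1
            else:
                # Unprotected read-modify-write: two threads can read the
                # same old value, and one of the two updates is then lost.
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print("with lock:   ", run_counter(4, 100_000, use_lock=True))
    print("without lock:", run_counter(4, 100_000, use_lock=False))
```

With the lock, the result is always `n_threads * increments`. Without it, the result can be smaller, although some interpreter versions happen to hide the problem for a loop this simple.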
New concepts in operating systems
These concepts may have been touched upon in the students’ undergraduate courses, but they will not have been covered in depth and may never have been mentioned at all:
- Virtualization
- Virtual machine
- Hypervisor
- Container
- Leaks in process isolation
- Orchestration
- Multi-level scheduler
Networking concepts
These are the basic concepts from an undergraduate networking course that students need to know to do data engineering:
- IP address
- Port number
- Well-known ports
- Loopback address
- TCP and UDP
- Delivery guarantees
  - None (zero to unlimited copies delivered)
  - At most once
  - At least once
  - Exactly once
- Flow control
- Remote procedure calls
- Message buffering
- Marshalling and serialization
- Topology
- Distinction between wide-area and local-area networks
- Client-server architecture
- Message signing
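Several of the items above (loopback address, port number, TCP, client-server architecture, marshalling) fit into one small sketch. The following is a minimal, assumption-laden illustration in Python: the names `serve_once` and `remote_sum`, the JSON wire format, and the newline framing are all hypothetical choices for this example, not a recommended protocol. The server binds to port 0 so that the operating system picks a free port.

```python
import json
import socket
import threading


def serve_once(server: socket.socket) -> None:
    """Accept one connection, read one JSON request line, reply with the sum."""
    conn, _addr = server.accept()
    with conn, conn.makefile("rb") as reader:
        request = json.loads(reader.readline())   # unmarshal the request
        reply = {"sum": sum(request["values"])}
        conn.sendall(json.dumps(reply).encode("utf-8") + b"\n")  # marshal the reply


def remote_sum(values: list) -> int:
    """Send `values` to a server on the loopback address and return its answer."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))      # loopback address, OS-chosen port
        server.listen(1)
        port = server.getsockname()[1]     # the port the OS actually assigned
        worker = threading.Thread(target=serve_once, args=(server,))
        worker.start()

        # Client side: connect over TCP, send one request, read one reply.
        with socket.create_connection(("127.0.0.1", port)) as client:
            client.sendall(json.dumps({"values": values}).encode("utf-8") + b"\n")
            with client.makefile("rb") as reader:
                reply = json.loads(reader.readline())
        worker.join()
    return reply["sum"]


if __name__ == "__main__":
    print(remote_sum([1, 2, 3]))
```

The caller of `remote_sum` sees an ordinary function call, which is the essential idea behind a remote procedure call. Everything a real RPC layer adds (timeouts, retries, schemas, authentication) is omitted here.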
New concepts in networking
These concepts specifically relate to the current cloud architectures of data centres, availability zones, and regions:
- Data centre networking
- Bisection bandwidth
- “Top-of-rack” switch (usually located in the middle of the rack)
- Relative latencies of within-centre, within-region, and inter-region round trips
- Shuffling (which exercises both the file system and the network)
- Idempotent messages
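To make the connection between idempotent messages and at-least-once delivery concrete, here is a minimal Python sketch. It assumes each message carries a unique identifier, and the receiver remembers which identifiers it has already applied. The class `Account` and the helper `deliver_at_least_once` are hypothetical names for this illustration only. A real system would also need to persist and eventually expire the set of seen identifiers.

```python
class Account:
    """A receiver that applies each message at most once, keyed by message ID."""

    def __init__(self) -> None:
        self.balance = 0
        self._seen = set()  # IDs of messages already applied

    def apply(self, message_id: str, amount: int) -> bool:
        """Apply a deposit; return False if this message was a duplicate."""
        if message_id in self._seen:
            return False        # duplicate delivery: ignore it
        self._seen.add(message_id)
        self.balance += amount
        return True


def deliver_at_least_once(account: Account, messages, duplicates: int = 1) -> None:
    """Simulate at-least-once delivery by sending every message more than once."""
    for message_id, amount in messages:
        for _ in range(1 + duplicates):
            account.apply(message_id, amount)


if __name__ == "__main__":
    account = Account()
    deliver_at_least_once(account, [("m1", 10), ("m2", 5)], duplicates=2)
    print(account.balance)
```

Because duplicates are ignored, at-least-once delivery combined with an idempotent receiver gives the same result as if each message had been processed exactly once.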
Topics intrinsic to the data engineering course
The above lists comprise prerequisite knowledge of other domains. The bulk of the data engineering course will consist of topics specific to this domain, such as those listed in the design space post and the learning outcomes post.