Minimal OS and networking knowledge for a data engineering course

Third in a series on designing a data engineering course. The previous post presented draft learning outcomes. The next post considers what exercises will best achieve the learning outcomes.

Although the students in this course are expected to have a CS degree or equivalent, most of them will have spent several years in industry, long enough for the lessons from that degree to have faded.

Many of the concepts listed in the design space presume familiarity with the topics in undergraduate operating systems and networking courses. Some of these topics, such as the basics of TCP, are likely to have been kept fresh by use at work. Anyone building a simple Web application has to know the basics of IP addresses and ports. Other concepts, especially those concerned with synchronization and communication between concurrent processes, will have become unfamiliar from lack of use, to the extent students ever understood them. Honestly, most students (including me!) complete their OS course with an incomplete and imperfect understanding of that material.
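To make "synchronization and communication between concurrent processes" concrete, here is a minimal sketch of the kind of material that tends to fade. It is my own illustration, not an exercise from the course. It uses Python threads as a stand-in for processes, and the names `producer`, `consumer`, and `run_pipeline` are hypothetical. A bounded queue carries the data (communication) and blocks whichever side gets ahead (synchronization).

```python
import queue
import threading


def producer(q, n):
    """Put the integers 0..n-1 on the queue, then a None sentinel."""
    for i in range(n):
        q.put(i)  # blocks while the bounded queue is full
    q.put(None)  # sentinel: tells the consumer there is no more data


def consumer(q, results):
    """Sum items from the queue until the sentinel arrives."""
    total = 0
    while True:
        item = q.get()  # blocks while the queue is empty
        if item is None:
            break
        total += item
    results.append(total)


def run_pipeline(n=5):
    """Run one producer and one consumer thread and return the sum."""
    q = queue.Queue(maxsize=2)  # small bound so the producer must wait
    results = []
    threads = [
        threading.Thread(target=producer, args=(q, n)),
        threading.Thread(target=consumer, args=(q, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results[0]


if __name__ == "__main__":
    print(run_pipeline(5))
```

A student who can explain why neither thread needs an explicit lock here, and what would happen if the sentinel were omitted, probably retains enough of this material to skip the review.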

On the other hand, other students will enter the course with a strong understanding of these concepts and an eagerness to learn them more deeply and apply them to data engineering. These students will be bored by repeating material they already know and will see time spent on it as a lost opportunity to learn material new to them. We don’t want to bore these students, who are often the most competent and motivated in the cohort.

Which raises the questions:

  1. What is the OS and networking knowledge that I should consider prerequisite to the course and unnecessary to review? If students find this material unfamiliar, I simply point them to outside resources as remedial learning.

  2. What is the minimal OS and networking knowledge that I should review in the course, and in how much depth should I cover it?

  3. What is the best way to review this material, such that students who have forgotten it can pick it up quickly while students who already know it will remain interested?

I’ll begin by simply enumerating the required concepts from each domain.

OS concepts

Fortunately, only a subset of the concepts from classical operating systems is essential to data engineering:

Bear in mind that the students will be familiar with using tools such as file systems and even a distributed file system such as HDFS. I am listing here OS internals that they need to understand.

New concepts in operating systems

These concepts may have been touched upon in their undergraduate course, but they will not have been covered in depth, and some may never have been mentioned at all:

Networking concepts

The basic concepts from an undergraduate networking course that they need to know to do data engineering:

New concepts in networking

These concepts specifically relate to the current cloud architectures of data centres, availability zones, and regions:

Topics intrinsic to the data engineering course

The above lists comprise prerequisite knowledge of other domains. The bulk of the data engineering course will consist of topics specific to this domain, such as those listed in the design space post and the learning outcomes post.