Applying resilient design to my own systems

13 Feb 2020 Tags: Distributed systems

When I teach a course in modern distributed systems design, I spend a lot of time on topics such as engineering for resilience and automating production processes. It turns out that applying those big topics in the small matters of my life is a big challenge.

Some big, good ideas

I teach the usual things for such courses, including:

Test your system resilience and recovery procedures by inducing the failures you most fear occurring (though Barry O’Reilly advocates considering failures that seem impossible)
Establish build pipelines to automate routine procedures
Monitor system health and performance

Nothing especially surprising or novel, really.

Some local, small procedures

Within hours of finishing a class, I find myself in a situation in which one or more of the above principles apply:

Transferring lecture recordings from my phone to the course server, setting the necessary tags, and publishing the page with the links
Maintaining earthquake preparedness in my residence
Resorting to one-time backup passwords when my regular authentication methods are unavailable for some important Web site
Configuring regular backups for my laptop
Automating the pipeline for taking a diagram from source form to publication

and many other small-scale maintenance tasks.

Big ideas are ill-fitted to small slots

While uploading the backup, signing on to the important site, and all those other small things, I hear myself advising my students in class earlier that day. Really, shouldn’t I do in my own life the things I counselled them to do in their profession?

Most times, though, the fit is awkward or altogether impractical. The reasons vary:

Possible but a lot of work: In the case of uploading lecture audio files, I am slowly increasing the automation. By the end of the semester, I might have it down to the absolute minimum of three clicks and a single command.
Consumer tools lack the customization of production tools: The software I use to record lectures on my phone and the software I use to draw diagrams restrict my control over the names of the files they produce. Production tools such as logging libraries allow filenames to be customized to a format that supports further automated processing, whereas the consumer tools require me to adjust filenames manually to match my conventions.
Testing a full disaster response is impractical: It’s one thing to regularly test that I can restore individual files from backup (and I do), but it’s impractical to test a full machine restore. Writing this post did suggest, though, that I ought to try living an entire day solely on my earthquake preparedness supplies. That would probably reveal some gaps.
Why seek out inconvenience? The sheer number of items I would have to test adds up to considerable effort. Do I want to pursue even more disruption in my daily activities?

So I muddle through

The principles of stress-testing and resilience engineering are powerful and worth considering, but actually applying them to the smaller-scale processes of personal life often requires more resources (time, money, space, or computing power) than the benefits justify. I adopt such principles as seem practical and reconsider my choices every so often. Not pursuing these principles in every applicable case, however small, is not hypocrisy. Rather, it acknowledges that the principles apply at larger scales than most individual activity.

All my marbles in one place
