There's no such thing as a stable system

Posted on May 17, 2023

This post is the short story of an incident that I experienced while operating services at a previous job.

At the time, we were running a microservices architecture, and one small service was responsible, without going too much into the details, for employee authentication. It was a real microservice: it did exactly one thing and did it well. By the time my team inherited it, the service was stable and no longer under active development. It “just worked”.

It’s also important to mention that, because the service handled employee authentication and the number of employees was relatively small and growing slowly but steadily, it was never subject to unpredictable load or massive growth. In that sense, it was easy to operate. And it even had a dashboard!

And still… one day it started failing

Surprise, surprise: one day the stable service started throwing 500s. We started getting reports from employees that they couldn’t do their work, and we could clearly see that some instances of the service were returning 500s, but we struggled to understand what was wrong. The process was up, the error was cryptic, and we hadn’t changed or deployed the software in months.

After quite some digging on the few machines that were running the service, we learned something: among other things, the service was creating temporary files inside temporary nested directories in a specific location on the filesystem. While it did implement logic to delete each temporary file after use, it wasn’t deleting the associated nested directories. That meant that, on the machines we were looking at, we had many “leaked” directories.
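To make the leak concrete, here is a minimal Python sketch of the pattern. It is not the original service's code (I'm not showing that here, nor the language it was written in): the function names, the `a/b` nesting, and the `payload.tmp` file name are all made up for illustration.

```python
import os
import shutil
import tempfile


def leaky_write(base_dir: str, payload: bytes) -> None:
    """Hypothetical reproduction of the bug: the file is deleted, its directories are not."""
    nested = os.path.join(tempfile.mkdtemp(dir=base_dir), "a", "b")
    os.makedirs(nested)
    path = os.path.join(nested, "payload.tmp")
    with open(path, "wb") as f:
        f.write(payload)
    os.remove(path)  # only the file goes away; three directories stay behind


def clean_write(base_dir: str, payload: bytes) -> None:
    """Same work, but the whole temporary tree is removed, even if something fails."""
    root = tempfile.mkdtemp(dir=base_dir)
    try:
        nested = os.path.join(root, "a", "b")
        os.makedirs(nested)
        with open(os.path.join(nested, "payload.tmp"), "wb") as f:
            f.write(payload)
    finally:
        shutil.rmtree(root, ignore_errors=True)


def count_dirs(base_dir: str) -> int:
    """Count the directories below base_dir (base_dir itself is not counted)."""
    return sum(len(dirnames) for _, dirnames, _ in os.walk(base_dir))
```

Every call to `leaky_write` leaves three empty directories behind, and each of them costs one inode even though it holds no data. Calling `count_dirs` on the base directory after a few requests is enough to see the difference between the two versions.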

Over the months in which we hadn’t changed or deployed the service, those leaked directories kept accumulating until they used up all the available inodes on the filesystem, which made it impossible to create new temporary files.
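Inode exhaustion is easy to miss because disk space can look perfectly healthy while it happens. As a rough sketch of the kind of check that would have caught it, assuming a Unix host and Python, something like the following reads the same numbers that `df -i` reports. The function names and the 90% threshold are my own choices for illustration, not something we had in place.

```python
import os
from typing import Optional


def inode_usage(path: str) -> Optional[float]:
    """Return the fraction of inodes in use on the filesystem that holds path.

    Returns None when the filesystem does not report a fixed inode count.
    """
    stats = os.statvfs(path)  # Unix only
    if stats.f_files == 0:
        return None
    return (stats.f_files - stats.f_ffree) / stats.f_files


def inodes_nearly_exhausted(path: str, threshold: float = 0.9) -> bool:
    """True when inode usage on path's filesystem is at or above threshold."""
    usage = inode_usage(path)
    return usage is not None and usage >= threshold
```

Run periodically against the directory that holds the temporary files, a check like this turns a slow leak into an alert long before the first 500.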

But the service is stable!

Now that I’ve spoiled the ending, what I remember most vividly is the reaction of some of my colleagues: management was very surprised to see the service fail because “it has been stable for so long” and “it was written by one of our most experienced engineers”.

The point is that none of this matters: there is no such thing as a forever-stable service. We keep thinking of services as “just software” or “pieces of code”, but the reality is that they are static code living in an extremely dynamic world: they are subject to user input, the machines they run on change over time, the time of day changes, and a million other things can and will change.

Thinking of things as done or stable is simply meaningless. It is just as useless to think that we can freeze a system in its current state, or to be afraid of change, because change is a constant in the system that we will never control.

And complex systems will fail sooner or later.