I lost my Monday to Kubernetes

So I made a few changes to an app and redeployed it to Kubernetes once more. It stopped working in the cluster I was using, and I didn’t know why. It worked locally, after all. “Works on my machine”, the most classic statement a developer can make, was true once more.

I knew I had touched a bunch of things: code and config (yeah I know, mistakes were made).

I tried to replicate the issue locally, with no success, and I couldn’t really figure out why things broke. All my changes looked legit and, worst of all, I had a few changes from the previous week (I told you, mistakes were made) that I didn’t fully remember but that seemed legit as well. I then decided to dig deeper and figure out what was going on, and it looked like my app was stuck doing nothing, waiting on some synchronization primitive.

Extremely puzzled by this very unexpected behaviour, I took the following steps:

I understood that the app was stuck in a particular function in the Kubernetes source code called WaitForCacheSync. Nothing obvious seemed to have caused this new failure, and when I ran my app against my local Kubernetes cluster, it worked perfectly. What was different? It looked like nothing was… Well, the OS was different, wasn’t it (linux/darwin)?

From there I started a race to find an issue inside the vendored libraries, also trying a local Linux environment running in Docker with the same code and environment. I was able to install Delve there and debug even more deeply, confirming what I had found on the remote cluster: there was an issue with the initial sync of the caches. After some debugging with the help of a colleague (thank you D, I appreciate it more than I can say), thinking about why those caches were never populated, I asked myself: why can’t we get that data?

That made me immediately think of the change that I had made in config: an RBAC rule! And I was able to jump to the conclusion that the RBAC rule wasn’t working anymore because the namespace name had been changed in a previous step! A few hours after the initial change, I finally unraveled the mystery: what seemed to be black magic, something wrong with libraries or even dark OS internals, was just a wrong RBAC rule that was implicitly denying access to my application.
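The "implicit deny" part is worth spelling out. Kubernetes RBAC is purely additive: there are no deny rules, so a request is rejected simply because no binding matches it. The sketch below is a toy model in plain Go, not the real RBAC API; the Subject and Binding types, the allowed function, and the names my-app, old-namespace and new-namespace are all hypothetical, chosen to mirror the situation described above.

```go
package main

import "fmt"

// Subject is a simplified stand-in for the subject of an RBAC binding.
type Subject struct {
	Kind, Name, Namespace string
}

// Binding is a simplified stand-in for a RoleBinding plus its Role:
// it grants some verbs on one resource to a list of subjects.
type Binding struct {
	Subjects []Subject
	Verbs    map[string]bool
	Resource string
}

// allowed reports whether any binding grants verb on resource to who.
// Nothing ever denies explicitly; no match simply means no access.
func allowed(bindings []Binding, who Subject, verb, resource string) bool {
	for _, b := range bindings {
		if b.Resource != resource || !b.Verbs[verb] {
			continue
		}
		for _, s := range b.Subjects {
			if s == who {
				return true
			}
		}
	}
	return false
}

func main() {
	// Hypothetical names: the binding still points at the old namespace.
	bindings := []Binding{{
		Subjects: []Subject{{Kind: "ServiceAccount", Name: "my-app", Namespace: "old-namespace"}},
		Verbs:    map[string]bool{"list": true, "watch": true},
		Resource: "pods",
	}}

	oldIdentity := Subject{Kind: "ServiceAccount", Name: "my-app", Namespace: "old-namespace"}
	newIdentity := Subject{Kind: "ServiceAccount", Name: "my-app", Namespace: "new-namespace"}

	fmt.Println(allowed(bindings, oldIdentity, "list", "pods")) // true
	fmt.Println(allowed(bindings, newIdentity, "list", "pods")) // false
}
```

Nothing in the rule looks broken on its own; it just no longer matches the identity the app runs as. On a real cluster, a quick way to check this is kubectl auth can-i with the --as flag set to the service account in question.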

As much as that wasn’t easy to debug, there are a bunch of lessons learned:

Lots of learnings from a simple mistake, which is what I keep finding over and over again in my career. I need to periodically relearn my lessons, lose time and restart.

I am definitely not a 10x engineer :-P