I lost my Monday to Kubernetes

Posted on Jul 16, 2019

So I made a few changes to an app and redeployed it once more to Kubernetes. It stopped working in the cluster that I am using and I didn’t know why. It worked locally, after all. “Works on my machine”, the most classic statement a developer can make, was true once more.

I know that there were a bunch of things I touched: code and config (yeah I know, mistakes were made).

I tried to replicate the issue locally, with no success, and I couldn’t really figure out why things broke. All my changes looked legit and, worst of all, I had a few changes from the previous week (I told you, mistakes were made) that I didn’t fully remember but that seemed legit as well. I then decided to go deeper and figure out what was going on, and it looked like my app was stuck doing nothing, waiting on some synchronization primitive.

Being extremely puzzled by this very unexpected behaviour, I did the following steps:

  • get the debugging symbols back in the binary (my Makefile was stripping them).
  • add a pprof handler to expose profiling information.
  • redeploy.
  • install gdb on the container to go step by step in the code.
  • install links (gotta thank Linux Nvidia drivers in the early 2000s for knowing that) to access the remote pprof data easily from the container.
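
The pprof step above can be sketched roughly as follows. This is a minimal example using only the Go standard library, not the code from my app: the helper names (newDebugMux, startDebugServer, fetchGoroutineDump) and the listen address are made up for illustration. The goroutine dump it fetches is the kind of output that shows where an app is stuck.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/pprof"
)

// newDebugMux returns a mux exposing the standard pprof endpoints.
// pprof.Index also serves the named profiles such as goroutine and heap.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// startDebugServer serves the pprof mux on addr in the background. It returns
// the address it actually bound to and a function that stops the server.
func startDebugServer(addr string) (string, func(), error) {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return "", nil, err
	}
	srv := &http.Server{Handler: newDebugMux()}
	go srv.Serve(ln) // returns once the stop function closes the server
	return ln.Addr().String(), func() { srv.Close() }, nil
}

// fetchGoroutineDump downloads the stack traces of all goroutines, which is
// what a text browser pointed at the same URL would display.
func fetchGoroutineDump(baseURL string) (string, error) {
	resp, err := http.Get(baseURL + "/debug/pprof/goroutine?debug=2")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// Port 0 picks any free port for this demo; a real deployment would
	// probably use a fixed port that is only reachable inside the pod.
	addr, stop, err := startDebugServer("127.0.0.1:0")
	if err != nil {
		fmt.Println("cannot start debug server:", err)
		return
	}
	defer stop()

	dump, err := fetchGoroutineDump("http://" + addr)
	if err != nil {
		fmt.Println("cannot fetch goroutine dump:", err)
		return
	}
	fmt.Printf("fetched %d bytes of goroutine stacks\n", len(dump))
}
```

In a real service the debug server would run next to the app for its whole lifetime, and the dump would be read from inside the container instead of from main.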

I understood that the app was stuck in a particular part of the Kubernetes source code called WaitForCacheSync. Nothing in my changes seemed to explain this new failure, and when I tried my app locally it worked perfectly with my local Kubernetes cluster. What was different? It looked like nothing was… Well, the OS was different, wasn’t it (linux/darwin)?

From there I started a race to find an issue inside the vendored libraries, also trying a local Linux environment running in Docker with the same code/environment. I was able to install delve there and debug even more deeply, confirming what I had found on the remote cluster: there was an issue with the initial sync of the caches. After some debugging with the help of a colleague (thank you D, I appreciate it more than I can say), thinking about why those caches were never populated, I asked myself: why can’t we get that data?

That made me immediately think of the change that I had made in the config: an RBAC rule! And I was able to jump to the conclusion that the RBAC rule wasn’t working anymore because the namespace name had been changed in a previous step! A few hours after the initial change, I finally unraveled the mystery: what seemed to be black magic, something wrong with libraries or even dark OS internals stuff, was just a wrong RBAC rule that was implicitly denying access to my application.

As much as that wasn’t easy to debug, there are a bunch of lessons learned:

  • don’t touch code and config at the same time. If you are messing with YAMLs, do it in a separate step. The smaller the steps, the easier it is.
  • always assume that something simple is broken, rather than something complex and obscure, unless really proven otherwise. And most important, don’t try to prove the complicated theory (this is where I got stuck) until much later. Discard complicated assumptions by default until you have checked all the possible obvious things.
  • check for the obvious things: what are the dependencies of your app? How does the communication work? Is there a firewall rule or anything similar in between (be it a policy of any kind or anything else)?
  • don’t assume “this is Kubernetes so I’m replicating the behavior”: Kubernetes has a huge set of configurations and this means there is a lot that can be wrong other than just the deployment.
  • build tools for humans, even if what you are building is a library/SDK: Kubernetes had problems fetching data for the cache but there was no error returned of any kind. No RBAC access, no data, no error. Good luck debugging that.
  • use languages and frameworks that you know how to debug: no matter how much time I lost, it was incredibly easy to add debugging symbols and use gdb, delve and pprof to figure out what was happening. I didn’t have all the automation to do that with one command (but maybe I’m building a tool for that, time will tell) but it wasn’t too out of reach, which was a relief.
  • don’t do things that require thinking with jet lag.

Lots of learnings from a simple mistake, which is what I keep finding over and over again in my career. I need to periodically relearn my lessons, lose time and restart.

I am definitely not a 10x engineer :-P