Kubernetes sidecars

We often hear about sidecars in the context of Kubernetes pods. Kubernetes pods can contain multiple containers that will be guaranteed to run on the same machine, sharing the local network.

A popular pattern is the “sidecar pattern”. A main container is the application that we intend to run, but more containers are run together with it as part of the same pod. Those other containers are called “sidecars” because they provide additional functionalities that complement the main application. Some examples I’ve seen in the wild are:

  • TLS termination
  • Embedded logging or monitor agent

Those use cases seem like a great fit: sidecars allow us to add functionality without modifying the application itself and to scale it together with the application, which is exactly what the agent use case needs.
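As an illustration, the TLS termination case from the list above might look like this hedged sketch (pod and image names are invented):

```shell
# Hypothetical multi-container pod: the main app plus an nginx sidecar that
# terminates TLS. Both containers share the pod's network namespace, so the
# proxy can reach the app on localhost:8080.
cat <<'EOF' > sidecar-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-tls
spec:
  containers:
    - name: app
      image: example/app:1.0        # the main application (made-up image)
      ports:
        - containerPort: 8080
    - name: tls-proxy
      image: nginx:1.17             # sidecar terminating TLS for the app
      ports:
        - containerPort: 443
EOF
```

Both containers are guaranteed to land on the same node and live and die together with the pod.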

But there are a number of downsides that must be considered.

There is no such thing as a sidecar

Sidecar containers were already mentioned in a blogpost on the Kubernetes blog back in 2015. They have been a widespread pattern for years, but Kubernetes knows nothing about them. In fact, for Kubernetes, all containers in a pod are equal: there is no concept of a main container and a sidecar.

The Kubernetes community is working on a feature to formalize sidecar containers. It is still being discussed and will be available no earlier than Kubernetes 1.20, scheduled for a late 2020 release.

The proposal, as it stands today, will make it possible to:

  • have containers in a pod that start before the main container
  • handle sidecars for Kubernetes Jobs

Those are great additions that will solve real problems with Jobs and with startup race conditions. When those features ship, Kubernetes will finally have real sidecars, but that won’t solve all the problems with them.

The problems with sidecars

If you care about reliability (and I bet you do), you should consider how having more containers in a single pod can affect the reliability of the whole pod. In fact, all containers in a pod share the same failure domain: given the way the readiness is computed, if a single container in a pod is not ready then the whole pod is not ready.

Container readiness is a combination of the container running and, if defined, a successful readiness probe. If, for any reason, a sidecar container can’t run, the whole pod is not ready. The same applies to readiness probes: if the readiness probe of any container in the pod fails, the whole pod is not ready.
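A minimal sketch of how that composition works (the container states here are invented): pod readiness is the logical AND of all container readiness values.

```shell
# Pod readiness as an AND over container readiness. Suppose a pod with a main
# app, a sidecar and a logging agent, where the sidecar is failing its probe.
pod_ready=true
for container_ready in true false true; do   # app, sidecar, log-agent
  if [ "$container_ready" != "true" ]; then
    pod_ready=false                          # one failing container is enough
  fi
done
echo "pod ready: $pod_ready"                 # prints "pod ready: false"
```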

What does this mean for us? It means that the sidecars that are “just adding functionality” could directly impact the readiness (and ultimately the availability, from a user perspective) of the application.

Also, when containers are injected dynamically, sidecars tend to duplicate the same container in every pod, increasing overhead and making it hard to optimize resource consumption centrally (without a redeploy).

The points above are not everything. My friend Sandor has a one-tweet opinion about sidecars that you should read.

With those considerations in mind, sidecars are, in my opinion, less attractive than they look at first glance.

So… are sidecars considered harmful?

If you read “considered harmful” in an article title, it’s probably clickbait. And no, sidecars are not harmful or evil. You should weigh the tradeoffs of a multi-container pod setup and decide whether it is worth it or whether you can run without it. And remember that nothing is free: more containers, more problems.

Kpt, packages, YAMLs

A few days ago, Google announced Kpt, a tool “for Kubernetes packaging that uses a standard format to bundle, publish, customize, update, and apply configuration manifests”. I felt the urge to write a few words about the problem space, with no goal of being exhaustive… so here I am.

Kubernetes packaging

The whole Kubernetes ecosystem seems obsessed with the “packaging” problem. First, Helm came out, providing “Homebrew-like” functionality. CNAB is a spec for packaging distributed applications on Kubernetes. And there are probably more. What matters is that there have been multiple attempts at defining how to package one or more applications. While it is important to have a single way to deploy an application, and while reuse across different repositories is definitely useful, an application is often a fluid concept. It grows. People want to reuse parts of its configuration in other apps, but change a million things at the same time.

Well, I think a “package” is often a cage. The Homebrew analogy is especially wrong, in my opinion: installing an application on a desktop is one story, running something on a production system is another. I have no SLA on how vim runs on my machine, and Homebrew formulas normally expose no customization flags.

On the other hand, Helm, CNAB and the others work in a totally different space. I have to confess that I was heavily biased against Helm and the others exactly for this reason: they make it look like a helm install is enough to have core components running on your production system. The reality is much more complicated and depends mostly on where you are deploying, what your availability requirements are, and what you are using charts/packages for.

The issue I have with Helm charts is that they hide the generated Kubernetes YAMLs, but not completely. helm install talks directly to the cluster, but as soon as you have problems with the cluster you will have to kubectl your way through it. In other words, Helm doesn’t build a real abstraction: it exposes the entire complexity of Kubernetes while giving the false impression that “it’s easy to get X up and running”. No, it’s not, and for good reason.

Of course there are more shortcomings to using something like Helm: we complicate the ecosystem. Every time a tutorial starts with helm install something, it requires every user to install, learn and understand Helm. I see this as a problem: instead of simplifying the procedures to get something up and running, we introduce additional tools, which are complexity in themselves. If we believe we need those tools because Kubernetes doesn’t do certain things, we should probably try to understand why Kubernetes doesn’t support those features and whether there is anything we can do to contribute them to the core project. Or build something completely different on top of it. After all, Kubernetes is a platform for building platforms, isn’t it?

Manifests and Kustomize

I’m obsessed with making things as simple as they can be. Or maybe I’m just obsessed with exactly the contrary: complexity. Any additional tool introduces something that users need to learn and consider when operating a system. That is cognitive load on the path to understanding what will happen when they do X. Complexity, by definition.

In that regard, I have often praised Kustomize. It lets us start from non-templated resources that are valid and usable on their own, customize them, and render them back as modified resources. While the tool has a lot of features and is by no means the definition of simplicity, it has clear inputs and outputs: Kubernetes resources go in, Kubernetes resources come out. No weird templated things, nothing new. Moreover, it keeps the contract users have with Kubernetes intact: the Kubernetes API (and its resources), nothing more.
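As a hedged sketch of that workflow (all resource names invented): plain manifests plus a kustomization file go in, and customized manifests come out via `kubectl kustomize .` or `kustomize build .`.

```shell
# A valid, standalone Deployment: usable on its own, no templating.
cat <<'EOF' > deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: example/myapp:1.0   # made-up image
EOF

# The kustomization describes how to customize the resources above.
cat <<'EOF' > kustomization.yaml
resources:
  - deployment.yaml
namePrefix: staging-      # rendered Deployment becomes "staging-myapp"
commonLabels:
  env: staging
EOF
```

Note that deployment.yaml stays applyable by itself; the kustomization is purely additive.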

Back to Kpt

It’s unclear if we are going to benefit from having yet another tool. The space is crowded, very crowded: there are at least 122 other tools for application management on Kubernetes, and this does not even count all the internal, closed-source tools that companies have developed.

As Brian Grant says, at this point it couldn’t hurt, and I agree. It could teach us things and inspire others. But I still believe that these kinds of tools are tightly tied to the internal structure of the organization adopting them and to how far that organization has gone in developing home-grown workflows to operate applications on Kubernetes.

I can’t help but see this as a missed opportunity to improve kubectl apply and kubectl wait. Those two basic commands that everyone uses lack basic functionality, so we keep rebuilding it somewhere else, over and over.

What’s helpful then?

I’m happy to see that with kpt we are talking about building blocks and workflows. There is no one-size-fits-all tool and there probably never will be; even committing to one tool or another makes very little sense. What’s important, IMO, is that:

  • tools should be composable: you don’t want to bet everything on a single tool, as tools come and go.
  • the steps that make up your workflow matter: you want clear building blocks, for example “render manifests”, “rollout”, “rollback” and so on. How those are implemented is left to your creativity.
  • identifying what you mean by an application and its dependencies, and how they map to your system, is very important: the “application” is a concept that does not exist in Kubernetes.
  • raw resources are still the way to go: there is no one-size-fits-all abstraction out there, and hiding the complexity of Kubernetes’ resources can complicate things rather than simplify them.
  • building sugar/helpers/whatever for you and/or your organization is a good idea. Even if it’s the third time you write the same code.

That’s it, I have many other things to write, but it’s quarantine times and my tech energies are low :-)

I lost my Monday to Kubernetes

So I made a few changes to an app and redeployed it once more to Kubernetes. It stopped working in the cluster I am using, and I didn’t know why. It worked locally, after all. “Works on my machine”: the most classic statement a developer can make, true once more.

I know that there were a bunch of things I touched: code and config (yeah I know, mistakes were made).

I tried to replicate the issue locally, without success, and I couldn’t figure out why things broke. All my changes looked legit and, worst of all, I had a few changes from the previous week (I told you, mistakes were made) that I didn’t fully remember but that seemed legit as well. I then decided to dig deeper and figure out what was going on, and it looked like my app was stuck doing nothing on some synchronization primitive.

Being extremely puzzled by this very unexpected behaviour, I did the following steps:

  • get the debugging symbols back in the binary (my Makefile was stripping them).
  • add a pprof handler to expose profiling information.
  • redeploy.
  • install gdb in the container to step through the code.
  • install links (gotta thank Linux Nvidia drivers in the early 2000s for knowing that tool) to access the remote pprof data easily from the container.

I understood that the app was stuck in a particular part of the Kubernetes source code called WaitForCacheSync. Nothing obvious seemed to have caused this new failure and, after trying my app locally, I found that it worked perfectly with my local Kubernetes cluster. What was different? It looked like nothing was… well, the OS was different, wasn’t it (linux vs. darwin)?

From there I started a race to find an issue inside the vendored libraries, also trying a local Linux environment running in Docker with the same code and configuration. I was able to install delve there and debug even more deeply, confirming what I had found on the remote cluster: there was an issue with the initial sync of the caches. After some debugging with the help of a colleague (thank you D, I appreciate it more than I can say), wondering why those caches were never populated, I asked myself: why can’t we get that data?

That immediately made me think of the change I made in config: an RBAC rule! The rule wasn’t working anymore because the namespace name had been changed in a previous step. A few hours after the initial change, I finally unraveled the mystery: what seemed to be black magic, broken libraries or even dark OS internals was just a wrong RBAC rule that was implicitly denying access to my application.
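A hedged reconstruction of the kind of mismatch involved (all names invented): the Role and RoleBinding still live in the old namespace while the application moved, so every read silently returns nothing.

```shell
cat <<'EOF' > rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-reader
  namespace: old-namespace          # the app moved away, the rule did not
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-reader
  namespace: old-namespace          # grants access only in old-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: app-reader
subjects:
  - kind: ServiceAccount
    name: app
    namespace: new-namespace        # ...but the app now runs here
EOF
```

In hindsight, a quick `kubectl auth can-i list pods --as=system:serviceaccount:new-namespace:app -n new-namespace` would have surfaced this in seconds.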

As much as that wasn’t easy to debug, there are a bunch of lessons learned:

  • don’t touch code and config at the same time. If you are messing with YAMLs, do it in a separate step. The smaller the steps, the easier the debugging.
  • always assume that something simple is broken rather than something complex and obscure, unless really proven otherwise. Discard the simple, obvious explanations first, before investing in complicated hypotheses (which is where I got stuck).
  • check for the obvious things: what are the dependencies of your app? How does the communication work? Is there a firewall rule or anything similar in between (be it a policy of any kind or anything else)?
  • don’t assume “this is Kubernetes, so I’m replicating the behavior”: Kubernetes has a huge set of configurations, which means a lot can be wrong beyond just the deployment.
  • build tools for humans, even if what you are building is a library/SDK: Kubernetes had problems fetching data for the caches, but no error of any kind was returned. No RBAC access, no data, no error. Good luck debugging that.
  • use languages and frameworks that you know how to debug: no matter how much time I lost, it was incredibly easy to add debugging symbols and use gdb, delve and pprof to figure out what was happening. I didn’t have the automation to do all that with one command (but maybe I’m building a tool for that, time will tell), yet it wasn’t out of reach, which was a relief.
  • don’t do things that require thinking while jet-lagged.

Lots of learnings from a simple mistake, which is what I keep finding over and over again in my career. I need to periodically relearn my lessons, lose time and restart.

I am definitely not a 10x engineer :-P

Two years of Kubernetes on AWS

This post is not the usual account of experiences and discoveries from two years spent bringing Kubernetes to production on AWS. Instead, I wrote it to offer a look back at what it meant to run Kubernetes on AWS two years ago, first describing some key facts from 2016 and then looking at how things have evolved, hoping this helps give an idea of how things changed and how we can change them for the better in the future. Last but not least, I will mention some interesting topics the community is focusing on in 2018 and what the community needs, all from my personal point of view. This is a write-up of a Meetup talk; you can find the slides here.

October 2016

A premise

In October 2016, Kubernetes has just celebrated its first birthday (July 2016) and is getting more and more popular thanks to a number of factors, including the rising popularity of containers and the incredible evangelization effort of Google, mostly led by Kelsey Hightower.

In 2016, I have a good enough familiarity with Kubernetes: I’ve installed it on GCE and used GKE extensively, but never really used it on AWS. Like many others in the same situation, the company I’m working for mostly uses AWS as its cloud provider, and migrating to Google is simply unreasonable. For this reason, I start my journey by setting up my very first Kubernetes cluster on AWS.

The missing deployment architecture

A lot of the discussions around deploying Kubernetes on AWS revolved around the usual topics: are we going to deploy a cluster per availability zone, or should the cluster span multiple availability zones? Are we going to create multi-region clusters? Are we going to run etcd on the masters or outside of them? And are we gonna use a single master containing the entire control plane, or adopt a highly available (HA) multi-master setup, which Kubernetes already supported?

The answers to those questions were nowhere to be found and varied from person to person, company to company. Let’s go step by step through those topics.

To multi-AZ or not to multi-AZ?

As already introduced, one of the common questions was whether the cluster should span multiple availability zones or not. Back then, most Kubernetes users (or rather, operators) reported deploying Kubernetes using autoscaling groups (at least for the worker nodes), a design that naturally fits AWS.

In this case, the effort of going from a single availability zone to a multi availability zone was minimal and the increased availability attracted many people. Going with a multi-AZ setup introduced some challenges though.

For example, EBS volumes in AWS are per availability zone. This implies that pods requiring a volume could stay in Pending state forever if the cluster has no node with capacity in the volume’s zone. This is a common mistake I’ve seen happen many times with people new to Kubernetes: someone creates a very small cluster with 2 nodes spanning 3 AZs and enables autoscaling (and thus scaling up and down over time). Then they run a stateful workload that never reaches Running state because there is no node in the right availability zone to satisfy the volume constraint… and that isn’t easy to debug if you are not familiar with the subject and with how Kubernetes scheduling works.
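For the record, Kubernetes later addressed part of this with topology-aware volume binding; here is a sketch (the StorageClass name is invented, and note this feature was not available back in 2016):

```shell
# With volumeBindingMode: WaitForFirstConsumer, the EBS volume is provisioned
# only after the pod is scheduled, in the AZ of the chosen node, instead of
# being created upfront in a zone that might have no node capacity.
cat <<'EOF' > storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-wait                      # made-up name
provisioner: kubernetes.io/aws-ebs    # in-tree EBS provisioner
volumeBindingMode: WaitForFirstConsumer
EOF
```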

Another issue came with the cluster autoscaler: the implementation available in 2016 was not aware of zones and was thus inefficient at autoscaling nodes spanning multiple AZs. Still, most people decided to stick with multiple AZs in one autoscaling group for the worker nodes.

What about multi region?

One additional option would be to deploy clusters across regions. To my knowledge, this was not attempted at all with Kubernetes in 2016; at least, I don’t know of anyone who tried it back then.

For a multi-region setup, we should take into account that the Kubelet needs to talk to the API server, and the latencies involved in multi-region deployments would probably be a problem, or at least something to handle and test carefully. Even more important, the benefit would probably not be worth the pain of setting it up, and the provisioning tools (which I will cover later) did not support it. The Kubernetes community made a bet on Federation as a way to orchestrate workloads across multiple federated clusters, in a way solving multi-region deployments.

Kubernetes cluster Federation

Cluster federation is based on a control plane that allows operators to orchestrate workloads across different clusters. It was promoted as closer to the way Borg works inside Google: a Kubernetes cluster would map to a Borg cell, which can be deployed, for example, on a single rack in a datacenter, while the central control plane handles multi-cell deployments and so on.

The idea behind federation encouraged having small clusters and federating them with a control plane to achieve higher availability and make it easier to orchestrate the Kubernetes clusters themselves. At this stage, for a lot of people federation looked like “something to look forward to next year”; read on to find out what happened then ;-)

Single or multiple master nodes?

Like pretty much everything in software, and in technology in general, the discussion around having a single-master or a multi-master setup was highly opinionated: some people said multi-master was a no-brainer and not something worth giving up, while others were just fine with a single master.

Kube-AWS (a provisioning tool we will cover later) already supported multi-master deployments at this stage, and Kops (another provisioning tool) supported it too, but defaulted to a single master.

That topic, which seems trivial, requires a lot of thinking and is anything but a detail. There are in fact several tradeoffs to discuss:

  • a single master offers lower availability than a multi-master setup.
  • a multi-master setup allows for continuous cluster updates, as we can upgrade one master at a time while keeping the cluster operational.
  • a multi-master setup is more expensive than a single-master one (especially if we want to run etcd outside of the cluster).

It’s worth noting that GKE, which at this point was the most used (and probably the only) managed Kubernetes solution, used a single master only. We can imagine that this decision helped avoid setting expectations (or SLOs) too high while learning to operate a new system, something definitely worth taking into account when starting with Kubernetes.

etcd

This wouldn’t be a blogpost on Kubernetes without talking about etcd. Etcd is the main storage behind Kubernetes, and it was definitely not ready for prime time in 2016. It had bugs and performance issues, and sometimes required manual compactions and other operations just to survive user workloads, especially with a high number of pods and nodes.

Clearly, on super small clusters the issues were not visible, but they were still relevant things to take into account, and they made the experience harder for users who wanted to run real production workloads at scale.

And of course, docker

In 2016, there were still a bunch of Docker bugs that were easy to hit with Kubernetes. Running a single EC2 instance with Docker on top (where Docker is essentially just used as a packaging mechanism) would not trigger such behaviours, but with a container orchestration system it was pretty easy to reach a situation where Docker would just hang (i.e. docker ps would never return and block forever).

That was reported a bit all over the place, and I used to have a “golden version” of Docker that I compiled manually on my machine from an rc release and that would run perfectly… while other versions wouldn’t.

Even GKE, which at this time was the reference for all people running Kubernetes on any cloud, used to run a script on every node called docker_monitoring, whose content was the following:

# We simply kill the process when there is a failure. Another systemd service will
# automatically restart the process.
function docker_monitoring {
  while [ 1 ]; do
    if ! timeout 10 docker ps > /dev/null; then
      echo "Docker daemon failed!"
      pkill docker
      # Wait for a while, as we don't want to kill it again before it is really up.
      sleep 30
    else
      sleep "${SLEEP_SECONDS}"
    fi
  done
}

That’s not a joke.

Provisioning tools

There are many provisioning tools out there, but in 2016 the space was not (yet) super crowded. To deploy a cluster on AWS there were mostly just Kops (v1.4) and Kube-AWS (v0.8). At the same time, plenty of people, especially given that the maturity of the two projects was not spectacular, were starting their own projects.

At the very same time, the Kubernetes community was starting an effort to unify the way a node is configured to run Kubernetes, an idea somewhat taken from Kops, which resulted in kubeadm.

Kops v1.4

In 2016, Kops already worked pretty well. It was indeed the easiest way to set up a cluster on AWS. The work of people like Justin Santa Barbara and Kris Nova was really good and already contained lots of ideas that inspired both Kubernetes itself and other cluster provisioners.

The downside of Kops back then was that it was a bit difficult to understand. Kops contained a lot of code, tried to work across different clouds, and aimed to be a somewhat batteries-included solution. While that can be a good thing, it is also the reason why several people started their own projects thinking they could make something simpler. Sometimes that was true, but it somewhat hurt the community: it brought a bit of confusion for newcomers about what to use, a lot of duplication (which is bad), and even more opinions (which is good).

Kube-AWS

Kube-AWS took a different approach: it targeted only AWS (hence the name), supported only CoreOS as the operating system, and tried to keep the amount of code small as a general principle.

The community behind the project was a bit smaller compared to Kops (neither of the two was huge in 2016) and the project was maintained by a smaller set of people. While it gained some good traction, it stayed behind Kops in terms of popularity.

More questions

While the topics above are the main ones around deploying Kubernetes on AWS, they are not everything that needs to be solved for a production-ready cluster. To host real production workloads, teams need to figure out topics such as:

  • Monitoring
  • Logging
  • Autoscaling (nodes and pods)
  • Security best practices
  • Authn, Authz
  • Overlay network configuration
  • Load balancing / Ingress traffic (ELB, ELBv2)
  • How to do cluster upgrades

That’s really a lot, and on AWS there was no easy answer for some of those topics. CloudWatch was no replacement for something like Stackdriver on Google Cloud, there was no native integration with AWS IAM for authentication and authorization, overlay networks were all pretty young and buggy, and there was no support for the ELBv2, a.k.a. Application Load Balancer (ALB).

This meant that any team that wanted to start using Kubernetes on AWS had a lot of topics to figure out and only the minimal basic things were working out of the box.

Back to the future!

Let’s jump to today. It’s October 2018 and the situation is pretty different. It’s fair to say that the core of Kubernetes is now quite stable in terms of basic functionality, while new features are still being added at an incredible pace.

As of today there are even more provisioning tools out there, and a (partially) managed solution from AWS (EKS) that simplifies a lot of the operations around Kubernetes.

The architecture itself is (partially) settled, in the sense that most people converge on similar approaches without having to re-litigate all the questions and, finally, the Kubernetes community is moving up the stack, trying to solve more problems than just getting a cluster up and running.

Core (kind of) stable

Deployments, ConfigMaps, DaemonSets and so on are here to stay. Those objects are not seeing a lot of change, and the functionality around them is solid enough. The code that deals with them feels strong, and most of the bugs have been fixed. Even so, there are still lots of quirks and weird bugs in the system.

Some of them are not Kubernetes-specific but are more evident in Kubernetes, like the intermittent 5s DNS delays, the effect of CPU limits on application latency, or issues still found in CNI implementations like Weave that don’t handle “simple” cases (at least on AWS) where nodes come and go frequently, ultimately leaving the cluster networking broken.

In terms of new features, Kubernetes sees lots of them at every release; it is enough to look at the release notes of the latest version (1.12) to understand that.

This clearly also means a bit of instability and the feeling that we are dealing with a system that is never done.

Staying up to date with Kubernetes feels today more difficult than it used to be.

The project itself is moving really fast, and plenty of companies around the world come up with new ideas every day that complement the offering. But what if you just want to stay up to date with the software you run in your cluster? Well, the best approach seems to be continuous updates: the project moves so fast that it is super easy to run into an API that changed, being unable to use some software, or hitting bugs that are already fixed in later versions (even if critical fixes are usually back-ported).

Another good idea is to use fully managed solutions as much as we can, as those can ease cluster upgrades. An alternative is to build a bit of custom automation around open source tools, but this of course requires additional effort, as almost no solution out there perfectly fits all use cases.

On a similar note, it’s worth paying attention to the APIs we use: alpha resources are meant to change, and we should treat them with a lot of care. Being on the bleeding edge carries the same risks today as in 2016, an aspect often forgotten in a community that talks a lot about innovation.

Provisioning tools

EKS

Clearly the big change of 2018 is that AWS finally released EKS in June after announcing the preview last year at re:Invent in Las Vegas (shameless plug: if you are looking for a nice video on Kubernetes on AWS, here is my presentation at last year’s re:Invent).

EKS comes with an HA Kubernetes control plane for a reasonably cheap price ($0.20 per hour at the time of writing). The offering provides a vanilla version of Kubernetes and not a fork, which is really a great thing, as it allows using lots of other open source tools that are supposed to work with Kubernetes. EKS has some things that are not ideal though: it is currently still stuck at version 1.10 while Kubernetes is already at 1.12 (to be fair, GKE has also not completed the rollout of 1.11, so it’s partially still on 1.10).

The update policy is also a bit odd: the control plane can get updated without notice, potentially causing unplanned incompatibilities with the nodes of the cluster. It’s worth remembering that EKS is only a partially managed solution: the worker nodes of the cluster have to be self-managed and are not managed by AWS, similarly to what ECS offers and differently from GKE on Google Cloud.

Possibly, the “Fargate” version of EKS will allow running containers through the AWS API without having to deal with nodes, but this is not even in preview yet.

Additionally, EKS still exposes an API server endpoint with self-signed SSL certificates, which is unfortunate given that AWS has long provided a certificate management service and that even Kops now supports real certificates.

Kops (v1.10)

Kops has matured a lot over the past years and has been adopted by many companies. The project currently has 461 contributors on GitHub, which is quite a massive number of contributors for a provisioning project.

The project still has the old downsides from 2016, namely lots of code, and it still ships with a single master by default. The project is lagging behind a bit in terms of Kubernetes releases (it currently ships 1.10 while 1.12 is available), although its goal is to always be no more than one release behind Kubernetes. It also comes with a very opinionated view of the world: it installs its own Docker version on the nodes, etcd mostly runs on the masters even though there is an option to run it outside, and so on.

What is interesting to note, though, is that Kops quickly became a good playground for experiments: it contains work on etcd-manager, a tool for managing and backing up etcd; the ClusterBundle, a project from Google to package cluster components; and an experimental upgrade functionality aware of stateful workloads that unfortunately was never merged.

More provisioners

In 2018 there are, as I said, more and more provisioners. While I don’t want to spend too much time on them, most of the new ones are based on kubeadm. The Heptio Quickstart is probably the easiest way to get a cluster up and running on AWS, interestingly even easier than creating an EKS cluster, even today.

Among the other provisioners, it’s worth noting Kubicorn, which also used to serve as a playground for experiments for the Cluster API.

Cluster what?

The Cluster API is a community-driven effort to have an API fully describing all the resources of a cluster, including the cluster itself, machines, etc. It derives from ideas that can already be found in Kops: kubeadm as a replacement for the node agent, an API server component (which was also in development inside Kops), and an object definition that fully describes the cluster (node groups, etc.). It’s a nice approach towards a more Kubernetes-native way of managing some cloud resources, but it is indeed a big rewrite that is still at an early stage.
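To give an idea of the approach, here is a rough sketch of what a Machine object looks like in the early Cluster API drafts. The API group, version, and field names below are from memory and may not match the current alpha exactly, so treat this purely as an illustration of the idea:

```yaml
# Illustrative Machine object in the spirit of the early Cluster API drafts.
# A controller watches Machine objects and reconciles them against actual
# cloud instances, much like Deployments are reconciled against Pods.
apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  name: worker-0
spec:
  versions:
    kubelet: 1.12.3
  providerConfig:
    # Provider-specific settings (instance type, AMI, and so on) go here.
    value:
      instanceType: m5.large
```

The interesting design choice is that machines become declarative Kubernetes resources themselves, rather than the output of an external provisioning tool.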

As we know, rewriting code is not always a good idea, and while I don’t want to judge the project (again, it is at a very early stage), we should expect mistakes and bugs. It is therefore nothing worth considering for production or even testing setups for the moment, but it is definitely worth keeping an eye on for the future.

Federation

Long story short: federation never delivered on its big promise. As of October 2018, there is “no clear path to evolve the API to GA”, and there is an effort to implement a dedicated federation API as part of the Kubernetes API that is still in its early stages and, as far as I can tell, still mostly unused.

The work on Federation v2 is ongoing though (more info here), and while it introduces some additional complexity (at least as far as I can understand), it will develop further in 2018 and 2019 and will show us where the idea is going and whether it will be useful for the community.

Where are we going

Having a look at the Kubernetes community, it’s interesting to see what its members are focusing on in 2018. Most of the attention is no longer on how to provision clusters but on other topics, and one of the hottest is definitely the service mesh.

Without going into much detail, it’s interesting to see how the topic is dominating conferences like KubeCon and how different vendors are fighting to propose their solutions. Currently the most talked-about ones seem to be Istio and Linkerd, but given how fast this area is changing, this could be outdated quite soon. As cloud aficionados, we should keep an eye on how they evolve in the near future, but they don’t seem to me to be fully ready for production usage yet.

Both Istio and Linkerd have a similar architecture: an additional control plane next to the Kubernetes one and injected proxies that provide smart routing, better security, and better observability out of the box. That of course comes with additional complexity: another control plane, additional proxies in front of your applications, and many configuration files and Custom Resource Definitions. These tools will have to prove over time that they add enough value to justify the complexity they bring.

Another interesting topic is the focus on making Kubernetes a lot simpler for developers. Kubernetes is in fact often criticized for being too complex for developers (and I totally agree with that), and some efforts are trying to build higher-level abstractions that don’t necessarily require dealing with tons of low-level YAML.

One of them is Knative from Google, which was recently announced and currently requires Istio. The idea is to create an abstraction layer that lets developers focus on services rather than on all the details of how they run on a container orchestration system. In this first iteration the solution seems a bit complex, with a lot of moving parts; it will probably make sense when operated by a cloud provider, but it could prove too complex for other users.

There are a few other examples of higher-level abstractions out there, like the StackSet controller from Zalando, which simplifies the UX quite a bit while providing traffic-switching functionality. It still exposes the same PodSpec and Ingress, though, and for that reason it is not a full attempt at building a very high-level abstraction, but rather at simplifying the interaction and composability of the well-known Services, Deployments, and Ingresses.

What is needed

As we’ve seen, Kubernetes has improved a lot of things, but it’s still neither done nor perfect. There are still plenty of bugs to fix and things to improve. Contributing upstream is not a perfect experience: sometimes it takes ages to get things merged, even when everything is ready.

That said, as members of the community, we should be contributing to the project as much as we can (and want) to make the project even better than it is today.

There is of course something else that is important: we need to share our horror stories. Kubernetes is often criticized as “hipster technology”, and figuring out all the details of running it in production is a fundamental step to increase its adoption while learning from the process. It’s fundamental to understand how to operate Kubernetes and build systems on top of it, and to share those learnings with the rest of the community. Here are a few links to presentations where this was done:

Conclusion

That’s it for this view of the last two years of Kubernetes on AWS. I hope you enjoyed this post, and I hope that the next year will bring a lot of new topics and a lot of stability to our systems!

Things I've been doing recently

This blogpost is not the usual writeup about how I got into a new job, about the things that have been awesome or that suck. It’s not about love for my employer or previous ones, and there is definitely no hate at all. So what’s left? This one is really about a bunch of things that I’ve been doing recently, in a somewhat different fashion:

  • I switched my default working mode to “pair by default”. I needed this to get used to working with new colleagues, get to know them better, and build a connection with the remote members of the team. It turned out to be much more than that: it increases my concentration, improves the quality of what I do, and reduces risk. Everybody says that, but very few people use pairing as their default way of working, and I definitely wasn’t doing it in my previous job: not because of any blocker, just because that was the way we were working (and it was fine). Well, I am enjoying this so much that I surprise even myself.

  • I take time in isolation every day. Seems like a contradiction, right? You cannot pair all the time; it is tiring, and I need some time for myself as well. To solve that, every single day I seek out some time to work in complete isolation, not long, maybe an hour. We have some nice little rooms in our office that are definitely not good if you suffer from claustrophobia, but they work beautifully for regaining full focus to crush a single specific task alone. It feels good in there.

  • I learn new things. We have mandatory (!) personal development days, which means time dedicated to learning new things during working hours. During those days, I only do things I enjoy. If I haven’t written much code during the week (it happens, YAML is my life), I code. I might watch a long video or learn a new technology. What’s important is that it has to be something I enjoy, because I feel I learn much more when things are fun.

  • I draw a lot of technical stuff. I do that because I realized that if I don’t draw things, I struggle to visualize problems and understand how to solve them. Sometimes I even draw YAML! Where I work right now I’m not surrounded by whiteboards as I was for a long time, and I really miss that… but guess what? You can just take some paper and make nice drawings on it. I’ve been using this technique for slides lately: instead of spending hours with digital diagramming tools, I just grab a bunch of sharpies and some paper, start drawing, then snap pictures… and the slides are made!

  • I don’t allow myself to keep working hard when tired. Would you drink alcohol and drive at the same time? I bet you wouldn’t, because it is irresponsible. The same applies to doing most work tasks while tired. Now, if you read my tweets you know I’ve been through a serious emotional storm lately due to the tragic loss of a dear friend. Well, sometimes my mood is not perfect, or I feel I get tired easily because of the stress that is sticking around… and guess what I do when this happens? I go home and relax! Of course this doesn’t mean that I work only a few hours, but I try to take a reasonable approach and not push myself too far over my limits.

  • I trust (and I feel trusted). I mostly never need to know what my colleagues are doing, and they don’t ask too much either. This doesn’t mean lacking care; it means letting go of control. We can work async. I can write this blogpost. They can take their time to do things. None of us needs to know; we only make sure we enjoy working together and that we are more or less on track with what matters: doing our best for the company while getting better at it.

  • I take time to enjoy life. Taking a long lunch break to meet a friend from time to time isn’t a bad idea; it’s actually really fun! I’ve been meeting friends and building better relationships, and if I can do this during the day, well, that’s great.

That’s the end of my random selection of things. It wasn’t an exhaustive list, and I don’t know if there is any benefit in it, but I hope nonetheless that you found it useful!