Two years of Kubernetes on AWS

This is not the usual post about experiences and discoveries from two years spent bringing Kubernetes to production on AWS. Instead, I wrote it to offer a look back at what it meant to run Kubernetes on AWS two years ago: I will first describe some key facts from 2016 and then look at how things have evolved today, hoping this gives an idea of how things changed and how we can change them for the better in the future. Last but not least, I will mention some interesting topics the community is focusing on in 2018 and what the community needs, all from my personal point of view. This is a write-up of a Meetup talk; you can find the slides here.

October 2016

A premise

In October 2016, Kubernetes had recently celebrated its first birthday (July 2016) and was getting more and more popular thanks to a number of factors, including the increasing popularity of containers and the incredible evangelization effort of Google, mostly led by Kelsey Hightower.

In 2016, I was already quite familiar with Kubernetes: I had installed it on GCE and used GKE extensively, but I had never really used it on AWS. Like many other people, the company I was working for was mostly using AWS as its cloud provider, and migrating to Google was simply unreasonable. For this reason, I started my journey by setting up my very first Kubernetes cluster on AWS.

The missing deployment architecture

A lot of the discussions around deploying Kubernetes on AWS were around the usual topics: are we going to deploy a cluster per availability zone, or should the cluster span multiple availability zones? Are we going to create multi-region clusters? Are we going to run etcd on the masters or outside of them? And are we going to use a single master containing the entire control plane, or adopt a highly available (HA) multi-master setup, which was already supported by Kubernetes?

The answers to those questions were nowhere to be found: they varied from person to person and from company to company. Let's go step by step and discover more details on those topics.

To multi-AZ or not to multi-AZ?

As already introduced, one of the common questions was whether the cluster should span multiple availability zones or not. Back then, most of the Kubernetes users (or better, operators) reported deploying Kubernetes using autoscaling groups (at least for the worker nodes), a design that naturally fits AWS.

In this case, the effort of going from a single availability zone to multiple availability zones was minimal, and the increased availability attracted many people. Going with a multi-AZ setup introduced some challenges, though.

For example, EBS volumes in AWS are bound to a single availability zone. This implies that a pod requiring a volume can stay in Pending state forever if no node with capacity exists in the volume's zone. This is a common mistake I've seen happen many times with people new to Kubernetes: someone creates a very small cluster with 2 nodes but spanning 3 AZs and enables autoscaling (and thus scaling up and down over time). Then they run a stateful workload that never goes into Running state because there is no node in the right availability zone to satisfy the volume constraint… and that isn't easy to debug unless you are quite familiar with the subject and with how Kubernetes scheduling works.
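
To make this concrete, here is a minimal debugging sketch (the pod name is illustrative; the zone label shown is the one in use in that era):

# A pod stuck this way stays Pending; its events usually mention a volume/zone conflict.
kubectl describe pod my-stateful-app-0

# EBS-backed persistent volumes carry a zone label...
kubectl get pv --show-labels

# ...which must match the zone of at least one node with free capacity.
kubectl get nodes -L failure-domain.beta.kubernetes.io/zone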

Another issue came with the cluster autoscaler: the implementation available in 2016 was not zone-aware and thus inefficient at scaling nodes when they spanned multiple AZs. Still, most people decided to stick with multiple AZs in one autoscaling group for the worker nodes.
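
For reference, that common setup meant a single autoscaling group for workers spanning all the zones, something you can verify with the AWS CLI (the group name is illustrative):

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names nodes.k8s.example.com \
  --query 'AutoScalingGroups[0].AvailabilityZones'
# e.g. ["eu-west-1a", "eu-west-1b", "eu-west-1c"]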

What about multi region?

One additional option would be to deploy clusters across regions. To my knowledge, this was not attempted at all with Kubernetes in 2016, or at least I don't know of anyone actually trying it back then.

To have a multi-region setup, we should take into account that the Kubelet needs to talk to the API server, and the latencies involved in multi-region deployments would probably be a problem, or at least something to deal with and test carefully. Even more importantly, the benefit would probably just not be worth the pain of setting it up, and the provisioning tools (which I will cover later) did not support it. The Kubernetes community instead made a bet on Federation as a way to orchestrate workloads across multiple federated clusters, in a way solving multi-region deployments.

Kubernetes cluster Federation

Cluster federation is based on a control plane that allows operators to orchestrate workloads across different clusters. It promised to resemble more closely the way Borg works inside Google: Kubernetes clusters would map to Borg cells, which can be deployed, for example, to a single rack in a datacenter, while the central control plane handles multi-cell deployments and so on.

The idea behind federation encouraged having small clusters and federating them with a control plane, to achieve higher availability and make it easier to orchestrate the Kubernetes clusters themselves. At this stage, federation looked to a lot of people like "something to look forward to next year"; look later in this post to find out what happened then ;-) .
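
For the curious, this is roughly what bootstrapping a federation looked like with the kubefed CLI of that era (the context names and DNS zone are illustrative):

# Deploy the federation control plane into an existing "host" cluster...
kubefed init myfederation \
  --host-cluster-context=us-east-1.k8s.example.com \
  --dns-provider=aws-route53 \
  --dns-zone-name="example.com."

# ...then join the member clusters to it.
kubefed join eu-cluster --host-cluster-context=us-east-1.k8s.example.com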

Single or multiple master nodes?

Like pretty much everything in software and in technology in general, the discussion around having a single-master or a multi-master setup was highly opinionated: some people said that multi-master was a no-brainer and not something worth giving up, while others were just fine with a single master.

Kube-AWS (a provisioning tool we will cover later) already supported multi-master deployments at this stage, and Kops (another provisioning tool) supported it too, but defaulted to a single master.

That topic, which seems trivial, actually requires a lot of thinking: it is anything but a detail. There are in fact several tradeoffs to weigh, around cost, operational complexity and control-plane availability.

It's worth noting that GKE, which at this point was the most used (and probably the only) managed Kubernetes solution, was using a single master only. We can imagine that this decision was made partly to avoid setting expectations (or SLOs) that were too high while learning to operate a new system, something to definitely take into account when starting with Kubernetes.

etcd

This wouldn't be a blog post on Kubernetes without talking about etcd. Etcd is the main storage behind Kubernetes, and it was definitely not ready for prime time in 2016. It had bugs and performance issues, and it sometimes required running manual compactions and other operations just to survive normal usage, especially when running a high number of pods and nodes.

Clearly, on very small clusters the issues were not visible, but they were still relevant things to take into account, and they made the experience more difficult for users who wanted to run real production workloads at scale.
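
As an illustration, these are the kinds of manual operations operators ran against etcd (v3 API; the endpoint and revision number are illustrative):

# Check the state of the member and note the current revision.
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 endpoint status --write-out=table

# Compact away old revisions, then defragment to reclaim disk space.
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 compact 123456
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 defrag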

And of course, docker

In 2016, there were still a bunch of Docker bugs that were easy to trigger with Kubernetes. Running a single EC2 instance with Docker on top of it (which means Docker is essentially just used as a packaging mechanism) would not trigger such behaviours, but with a container orchestration system it was pretty easy to reach a situation where Docker would just hang (i.e. docker ps would never return anything and block forever).

That was reported a bit all over the place, and I used to have a "golden version" of Docker, compiled manually on my machine from an rc release, that would run perfectly… while other versions wouldn't.

Even GKE, which at this time was the reference for all the people running Kubernetes on any cloud, used to run a script on every node called docker_monitoring whose content is the following:

# We simply kill the process when there is a failure. Another systemd service will
# automatically restart the process.
function docker_monitoring {
  while [ 1 ]; do
    if ! timeout 10 docker ps > /dev/null; then
      echo "Docker daemon failed!"
      pkill docker
      # Wait for a while, as we don't want to kill it again before it is really up.
      sleep 30
    else
      sleep "${SLEEP_SECONDS}"
    fi
  done
}

That’s not a joke.

Provisioning tools

There are many provisioning tools out there, but in 2016 the space was not (yet) really crowded. To deploy a cluster on AWS, there were mostly only Kops (v1.4) and Kube-AWS (v0.8). At the same time, plenty of people were starting their own projects, especially given that the maturity of the two existing ones was not spectacular.

At the very same time, the Kubernetes community was starting an effort to unify the way a node is configured to run Kubernetes, an idea somehow taken from Kops, which resulted in the work behind kubeadm.
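
In a nutshell, the kubeadm flow looks like this (the address, token and hash are placeholders for the values printed by kubeadm init):

# On the first master:
kubeadm init --pod-network-cidr=10.244.0.0/16

# On every worker node, using the values printed by the init step:
kubeadm join 10.0.0.10:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>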

Kops v1.4

In 2016, Kops already worked pretty well. It was indeed the easiest way to set up a cluster on AWS. The work of people like Justin Santa Barbara and Kris Nova was really good and already contained lots of ideas that inspired both Kubernetes itself and other cluster provisioners.
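
Even back then, getting a cluster up was a handful of commands (the bucket and domain names are illustrative):

export KOPS_STATE_STORE=s3://my-kops-state-store
kops create cluster --name k8s.example.com --zones eu-west-1a --node-count 3
kops update cluster k8s.example.com --yes   # actually creates the AWS resources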

The downside of Kops back then was that it was a bit difficult to understand. It contained a lot of code, tried to work across different clouds, and aimed to be a somewhat batteries-included solution. While that can be a good thing, it is also the reason why several people started their own projects, thinking they could make something simpler. Sometimes that was true, but it somewhat benefited the community less: it brought a bit of confusion for newcomers about what to use, a lot of duplication (which is bad) and even more opinions (which is good).

Kube-AWS

Kube-AWS took a different approach: it aimed only at AWS (hence the name), supported only CoreOS as the operating system, and tried to keep the amount of code small as a general principle.
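
From memory, its workflow revolved around rendering assets and applying a CloudFormation stack, roughly like this (flags and names as I recall them from the v0.8 era, so treat this as a sketch):

kube-aws init --cluster-name=demo \
  --external-dns-name=demo.example.com \
  --region=eu-west-1 --availability-zone=eu-west-1a \
  --key-name=my-ssh-key --kms-key-arn=<kms-arn>
kube-aws render   # generates credentials and the CloudFormation template
kube-aws up       # creates the stack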

The community behind the project was a bit smaller compared to Kops (neither of the two was huge as of 2016) and the project seemed to be maintained by a smaller set of people. While it gained some good traction, it stayed behind Kops in terms of popularity.

More questions

While the topics above were the main ones around deploying Kubernetes on AWS, they are not everything that needs to be solved to have a production-ready cluster. To have something that can really host production workloads, teams need to figure out topics such as:

- monitoring and logging
- authentication and authorization
- cluster networking
- ingress and load balancing

That's really a lot, and for some of those topics there was no easy answer on AWS: CloudWatch was no replacement for something like Stackdriver on Google Cloud, there was no native integration with AWS IAM for authentication and authorization, overlay networks were all pretty young and buggy, and there was no support for ELBv2, a.k.a. the Application Load Balancer (ALB).

This meant that any team that wanted to start using Kubernetes on AWS had a lot of topics to figure out, and only the minimal basics worked out of the box.

Back to the future!

Let's jump to today. It's October 2018 and the situation is pretty different. It is fair to say that the core of Kubernetes is now quite stable in terms of basic functionality, while new features are still being added at an incredible pace.

As of today there are even more provisioning tools out there, plus a (partially) managed solution from AWS (EKS), which simplifies a lot of the operations around Kubernetes.

The architecture itself is (partially) settled, in the sense that most people are going with similar approaches without having to discuss all the questions all over again. Finally, the Kubernetes community is moving up the stack, trying to solve more problems than just getting a cluster up and running.

Core (kind of) stable

Deployments, ConfigMaps, DaemonSets and so on are here to stay. Those objects are not seeing lots of change and the functionality around them is solid enough. The code that deals with them feels strong, and most of the bugs have been fixed. Even so, there are still lots of quirks and weird bugs in the system.

Some of them are not Kubernetes-specific but are more evident in Kubernetes, like the intermittent 5-second DNS delays or the effect of CPU limits on application latency. Others are issues still found in CNI implementations like Weave, which does not handle "simple" cases (at least on AWS) where nodes come and go frequently, ultimately resulting in broken cluster networking.

In terms of new features, Kubernetes sees lots of them at every release; it is enough to have a look at the release notes of the latest version (1.12) to see that.

This clearly also means a bit of instability and the feeling that we are dealing with a system that is never done.

Staying up to date with Kubernetes feels more difficult today than it used to be.

The project itself is moving really fast, and plenty of companies around the world come up with new ideas every day that complement the offering. But what if you want to at least stay up to date with the software you run in your cluster? The best approach seems to be to update continuously: the project moves so fast that it is super easy to run into an API that changed (and thus be unable to use some software), or to hit bugs that are already fixed in later versions (even if critical fixes are usually back-ported).

Another good idea is to use fully managed solutions as much as we can, as those can ease cluster upgrades. An alternative is to build a bit of custom automation around open source tools, but this of course requires additional effort, as almost no solution out there perfectly fits all use cases.

On a similar note, it's worth paying attention to the APIs we use: alpha resources are meant to change, and we should treat them with a lot of care. Being on the bleeding edge carries the same risks today as in 2016, and this aspect is often forgotten in a community that talks a lot about innovation.
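
A trivial but useful habit is to check which alpha API groups the cluster serves and whether any of our manifests depend on them:

# List the API group/versions served by the cluster and spot the alpha ones.
kubectl api-versions | grep alpha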

Provisioning tools

EKS

Clearly the big change of 2018 is that AWS finally released EKS in June, after announcing the preview last year at re:Invent in Las Vegas (shameless plug: if you are looking for a nice video on Kubernetes on AWS, here is my presentation at last year's re:Invent).

EKS comes with an HA Kubernetes control plane for a reasonably cheap price ($0.20 per hour at the time of writing). The offering provides a vanilla version of Kubernetes and not a fork, which is really a great thing, as it allows using the many tools available in the open source world that are built to work with Kubernetes. EKS has some things that are not ideal, though: it is currently still stuck at version 1.10, while Kubernetes is already at 1.12 (to be fair, GKE has also not completed the rollout of 1.11, so it is partially still on 1.10).

The update policy is also a bit weird: the control plane can get updated without notice, potentially causing unplanned incompatibilities with the nodes of the cluster. It's worth remembering that EKS is only a partially managed solution: the worker nodes have to be self-managed and are not managed by AWS, similarly to what is offered with ECS and differently from GKE on Google Cloud.
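
As a sketch of how that looks in practice, eksctl (a community CLI from Weaveworks) creates both the managed control plane and the self-managed worker nodes for you (name, region and size are illustrative):

eksctl create cluster --name=demo --region=eu-west-1 --nodes=3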

Possibly, the “Fargate” version of EKS will allow running containers using the AWS API without having to deal with nodes at all, but this is still not even in preview.

Additionally, EKS still provides an API server endpoint with self-signed SSL certificates, which is unfortunate, given that AWS has been providing a certificate management service for a long time and that even Kops now supports real certs.

Kops (v1.10)

Kops has matured a lot over the past two years and has been adopted by many companies. The project currently has 461 contributors on GitHub, which is quite a massive number for a provisioning project.

The project still has the old downsides of 2016: lots of code, and it still ships with a single master by default (multi-master is available though, as sketched below). It also seems to lag behind a bit in terms of Kubernetes releases (it currently ships 1.10 while 1.12 is available), even though the project's goal is to always be no more than one release behind Kubernetes. It also comes with a very opinionated view of the world: it installs its own Docker version on the nodes, etcd mostly runs on the masters even though there is an option to run it outside, and so on.
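
For completeness, multi-master with Kops is just a matter of flags: passing multiple --master-zones makes it provision one master per zone (names and zones are illustrative):

kops create cluster --name k8s.example.com \
  --master-zones eu-west-1a,eu-west-1b,eu-west-1c \
  --zones eu-west-1a,eu-west-1b,eu-west-1c \
  --node-count 3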

What is interesting to note, though, is that Kops quickly became a good playground for experiments: it contains the work on etcd-manager, a tool for managing etcd and backing it up; the Cluster Bundle, a project from Google to package cluster components; and an experimental upgrade functionality aware of stateful workloads that unfortunately was never merged.

More provisioners

In 2018 there are, as I said, more and more provisioners. While I don't want to spend too much time talking about them, most of the new ones are based on kubeadm. The Heptio Quickstart is probably the easiest way to get a cluster up and running on AWS; interestingly, even today it is easier than creating an EKS cluster.

Among the other provisioners, it's worth noting Kubicorn, which also used to serve as a playground for experiments for the Cluster API.

Cluster what?

The Cluster API is a community-driven effort to have an API fully describing all the resources of a cluster, including the cluster itself, machines, etc. It derives from ideas that can already be found in Kops: kubeadm as a replacement for the node agent, and an API server component (which was also in development inside Kops) with an object definition that fully describes the cluster (node groups, etc.). It's a nice approach towards a more Kubernetes-native way of managing cloud resources, but it is indeed a big rewrite that is still at an early stage.

As we know, rewriting code is not always a good idea, and while I don't want to judge the project (again, it is at a very early stage), we have to expect mistakes and bugs. So it's not worth considering for production, or even for testing setups, for the moment, but it is definitely worth keeping an eye on for the future.

Federation

Long story short: federation never delivered on its big promise. As of October 2018, there is “no clear path to evolve the API to GA”, and there is an effort to implement a dedicated federation API as part of the Kubernetes API that is still in its early stages and, as far as I can tell, still mostly unused.

The work on Federation v2 is ongoing though (more info here) and, while it introduces some additional complexity (at least as far as I can understand it), it will develop further in 2018 and 2019 and will show where the idea is going and whether it will be useful for the community.

Where are we going

Having a look at the Kubernetes community, it's interesting to see what the members are focusing on in 2018. Most of the attention is no longer on how to provision clusters but on other topics, and definitely one of the hot ones is the service mesh.

Without going into much detail, it's interesting to see how the topic is dominating conferences like KubeCon and how different vendors are fighting to propose their solutions. Currently the most talked-about ones seem to be Istio and Linkerd, but given how fast this area is changing, this could be outdated quite soon. As cloud aficionados, we should keep an eye on how they evolve in the near future, but they don't seem to me to be fully ready for production usage yet.

Both Istio and Linkerd have a similar architecture, with an additional control plane next to the Kubernetes one and injected proxies to achieve smart routing, better security and better out-of-the-box observability. That comes, of course, with additional complexity: another control plane, additional proxies in front of your applications, and many configuration files and Custom Resource Definitions. Those tools will have to prove over time that they add enough value to justify the complexity they bring.
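
To give an idea of the injected-proxy model, this is how manual sidecar injection looks with Istio's CLI (the manifest name is illustrative):

# Augment the deployment's pod template with the Envoy sidecar and apply the result.
istioctl kube-inject -f deployment.yaml | kubectl apply -f -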

Another interesting topic is the focus on making Kubernetes a lot simpler for developers. Kubernetes is in fact often criticized for being too complex for developers (and I totally agree with that), and some efforts are trying to build higher-level abstractions that don't necessarily require dealing with tons of low-level YAML.

One of them is Knative from Google, which was recently announced and currently requires Istio. The idea is to create an abstraction layer that lets developers focus on services rather than on all the details of how they run on a container orchestration system. In this first iteration, the solution seems a bit complex and has a lot of moving parts, which will probably make sense if operated by a cloud provider, but could prove too complex for other users.

There are a few other examples of higher-level abstractions out there, like the StackSet controller from Zalando, which simplifies the UX quite a bit while providing traffic-switching functionality. It still exposes the same PodSpec and Ingress, though, so it is not a full attempt at building a very high-level abstraction, but rather at simplifying the interaction and composability of the well-known Services, Deployments and Ingresses.

What is needed

As we've seen, Kubernetes has improved on a lot of things, but it's still not done, nor perfect. There are still plenty of bugs to fix and things to improve. Contributing upstream is not a perfect experience either: sometimes it takes ages to get things merged, even when everything is ready.

That said, as members of the community, we should be contributing to the project as much as we can (and want) to make the project even better than it is today.

There is of course something else that is important: we need to share our horror stories. Kubernetes is often criticized as "hipster technology", and figuring out all the details of running it in production is a fundamental step to increase its adoption while learning from the process. It's fundamental to understand how to operate Kubernetes and build systems on top of it, and to share those learnings with the rest of the community. Here are a few links to presentations I know of where this was done:

Conclusion

That's it for this view of the last two years of Kubernetes on AWS. I hope you enjoyed this post, and I hope that the next year will bring a lot of new topics and a lot of stability to our systems!