Why Kubernetes?

I sometimes hear quite polarised opinions about Kubernetes.

Understandably, perhaps - it can be complicated, but it can also be quite simple... and because it's well known, there are a lot of opinions out there.

Declarative deployment

I wanted a fabric into which we could deploy a mix of apps declaratively, one that would not require much management. I like to be able to "phoenix" an app and its infrastructure - tear it down and rebuild it declaratively without manual intervention. Being able to do so implies that you actually know how to build and deploy the app. That's easy for version 1, but many systems I've worked with over the years have left me with the impression that if we somehow lost version 798 we'd never get it back, because of the accumulation of little manual fixes in the deployed infrastructure that are impossible to surface and reproduce.

Everything in source control

Closely related to declarative deployment. I can check in everything that matters. Rebuilding everything from scratch requires (almost) no manual intervention. Every change to the infrastructure can be found in version control and I can relate the infrastructure changes to a comment, a date, the person who committed them and any other code changes that happened at that time.

Vendor independence

I wanted to be able to move to another vendor without too much trouble and I didn't really want to own the lowest layers of the infrastructure.

Legacy workloads

We had a new app to develop and deploy, but it had to run alongside and work with a lot of legacy technology while we performed a complex migration. I wanted it all on one platform, so supporting legacy apps was important. This was relatively straightforward once we'd containerised our workloads. Most of the effort there was learning how to do Docker builds, push images, and give Kubernetes the credentials to pull from our private Docker repository - see the sketch below. 95% of the conversion effort is for the first app; after that it's straightforward to repeat.
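To make that concrete, here's a minimal sketch of the pull-credentials step. All names and the registry URL are hypothetical; the secret is typically created once with `kubectl create secret docker-registry`.

```yaml
# Hypothetical names and registry throughout - a minimal sketch only.
# The credentials secret is usually created with something like:
#   kubectl create secret docker-registry registry-credentials \
#     --docker-server=registry.example.com \
#     --docker-username=deployer --docker-password=...
apiVersion: apps/v1
kind: Deployment
metadata:
  name: legacy-app
spec:
  replicas: 1
  selector:
    matchLabels: { app: legacy-app }
  template:
    metadata:
      labels: { app: legacy-app }
    spec:
      imagePullSecrets:
        - name: registry-credentials   # credentials for the private registry
      containers:
        - name: legacy-app
          image: registry.example.com/legacy-app:1.0   # pushed with `docker push`
```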

Bin packing

We had started with some complicated legacy workloads that we've simplified over time. I'd previously worked in an environment where bin-packing became a major problem. When you have 700 AWS servers because you have many deployed microservices but no bin-packing solution, it gets really expensive. The mistake there was to give each service its own server cluster even though they were generally lightly loaded services that could have shared infrastructure.

Kubernetes is reasonably good at packing containers onto servers. It's possible to spread workloads across servers for resilience.
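As a sketch of how that works (names are hypothetical): resource requests tell the scheduler how much of a server each container consumes, which is what makes bin-packing possible, and a topology spread constraint keeps replicas on different servers for resilience.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname   # spread replicas across servers
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels: { app: api }
      containers:
        - name: api
          image: registry.example.com/api:1.0
          resources:
            requests: { cpu: 100m, memory: 128Mi }   # what the scheduler packs by
            limits: { cpu: 500m, memory: 256Mi }     # hard ceiling per container
```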

We have about 50 containers running, usually on 3 servers, and we use Kubernetes namespaces to represent our different environments (test, staging and production). The servers are fairly small so it's a cost-effective setup.
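Namespaces are lightweight resources, so an environment-per-namespace setup is just a few declarations along these lines:

```yaml
# One namespace per environment - a minimal sketch.
apiVersion: v1
kind: Namespace
metadata:
  name: test
---
apiVersion: v1
kind: Namespace
metadata:
  name: staging
---
apiVersion: v1
kind: Namespace
metadata:
  name: production
```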

Self healing

Most features of Kubernetes work via reconciliation loops: processes that monitor the state of the cluster and act to resolve discrepancies between what's declared and what is found in reality. For example, if you delete a load balancer but retain the Kubernetes ingress resource that declares it, Kubernetes will notice and re-create the load balancer. If you remove one of the backend servers from the load balancer's backend group, Kubernetes will re-add it. This philosophy makes Kubernetes clusters quite self-healing. This was important to me because we run a small team with many demands on our time, so we need to level our workload as much as possible.
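For illustration, an ingress like this sketch (hostname hypothetical) is just a declaration; the ingress controller's reconciliation loop keeps the underlying cloud load balancer matching it, re-creating anything that's removed out-of-band.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app        # the Service this load balancer fronts
                port:
                  number: 80
```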

Apps don't know they're in Kubernetes

Or, more accurately, apps don't need to know about Kubernetes.

An app can access the Kubernetes API, but that's only really necessary for tools that orchestrate Kubernetes itself. Configuration and secrets can be injected into apps via environment variables and files, and other services can be discovered through DNS. None of the 7 apps we have deployed know they're running in Kubernetes. So for us it's a fabric in which apps can run rather than a deeply embedded architectural element.
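A minimal sketch of what that looks like, with hypothetical names throughout: configuration and secrets arrive as ordinary environment variables, and other services are reached via plain DNS names.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0
      envFrom:
        - configMapRef: { name: app-config }   # every key becomes an env var
      env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef: { name: app-secrets, key: database-url }
      # Other services are just DNS names, e.g. http://billing:8080 in the
      # same namespace - nothing Kubernetes-specific in the app code.
```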

The alternatives

Heroku

Expensive, locked in, and with limited options. We actually moved away from Heroku. We'd accumulated some of those little off-system changes that stop you re-creating your infrastructure from scratch.

Terraform

A bit too low-level: I didn't really care about the servers, just about deploying the apps. But it's the right kind of declarative model.

Containers in Google/AWS

At the time, Google's App Engine Flex was pretty close, but we had long-running legacy batch jobs and Google would only let them run for 10 minutes max. Aside from this issue, I feel App Engine deserves more attention than it gets. It's well thought through.

AWS Fargate seemed like it was just running a server per container under the covers. It was slow to provision and I couldn't really see the point of it at the time.

Is it difficult to use?

Kubernetes is neither very easy nor particularly difficult to use. One way to view it is as a vendor-independent "cloud API in a box". Most apps are simply a mix of standard Kubernetes resource types: deployments, services, ingresses, config maps, secrets, jobs and stateful sets.

We used all the above resource types except stateful set. We didn't need that one because database-as-a-service from Google and AWS S3 covered our needs for holding state. There are other resource types and Kubernetes can be fairly easily extended with custom resource types. We use cert-manager to automate provisioning of Let's Encrypt certificates and it creates and uses several custom resource definitions including Certificate.
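For illustration, a Certificate resource looks something like this sketch (issuer and hostname are hypothetical); cert-manager watches it and keeps a valid Let's Encrypt certificate in the named secret.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-tls
spec:
  secretName: app-tls          # where cert-manager stores the issued cert
  dnsNames:
    - app.example.com
  issuerRef:
    name: letsencrypt-prod     # a ClusterIssuer configured separately
    kind: ClusterIssuer
```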

Ultimately, all workloads are run as a "pod". A pod is a group of Docker containers running together on a single virtual IP address. Kubernetes includes an overlay network for pods that means you don't need to worry about networking or IP address allocation. It also has good service discovery features. It's quite easy to create movable, location-independent workloads using labels and service discovery.
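A sketch of that, with hypothetical names: a service selects pods by label and gives them a stable DNS name and virtual IP, so it doesn't matter which server the pods land on.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api          # any pod carrying this label becomes a backend
  ports:
    - port: 80
      targetPort: 8080
# Other pods reach the workload at http://api (same namespace) or
# api.<namespace>.svc.cluster.local, wherever its pods are scheduled.
```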

How reliable is it?

In practice, much will depend on how good the provider of the managed Kubernetes infrastructure is. In my experience, Google's managed service is very robust.

We've replaced servers during routine maintenance hundreds of times. Google don't upgrade in place. Instead, they provision a new server, already upgraded, add it to the Kubernetes cluster, then drain the workloads from one of the old servers.

Draining a node cordons it, so no new workloads will be scheduled there, and evicts the workloads it's running, which the scheduler then assigns to other servers. Once the old server has been drained, Google decommission it.

Once they've cycled over all the servers, the entire cluster has been replaced. I like this approach.

Server independence

Your workloads need to be server-independent because they will be moved automatically by Kubernetes. This is fairly easy to achieve for most workloads but it's simpler if your workloads are stateless or their state is held externally.

Stateful workloads are possible (and fairly easily implemented in Kubernetes using a StatefulSet) but you need to think about the time gap between a stateful workload stopping on one server and restarting on another. We wanted a low maintenance infrastructure, so we'd already chosen to use a managed external database, which meant our state lived outside the Kubernetes cluster.
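For completeness, a stateful set looks something like this minimal sketch (names hypothetical): each replica gets a stable identity (db-0, db-1, ...) and its own persistent disk, which follows the pod if it's rescheduled onto another server.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db
  replicas: 2
  selector:
    matchLabels: { app: db }
  template:
    metadata:
      labels: { app: db }
    spec:
      containers:
        - name: db
          image: postgres:16
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:          # one persistent disk per replica
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests: { storage: 10Gi }
```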

How much effort is required?

It's fairly easy to get started deploying to Kubernetes if you've got a Docker-containerised app, and containerising an app is not too difficult.

It's declarative

Kubernetes has a very declarative model: you write (usually in YAML) what you want, and Kubernetes creates it for you and keeps it in sync with your declaration. The entire structure of our apps is a set of YAML files in Git. If you change the number of replicas of a container, Kubernetes will start/stop containers to match. If containers die, go missing or end up on a server that fails, it will run new ones until reality again matches your declaration. Load balancers will be re-wired to match cluster changes and persistent disks will be attached to servers as needed by their workloads.
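Concretely, a deployment declaration looks something like this sketch (names hypothetical). Check it into Git, apply it with `kubectl apply -f`, then edit the replica count and re-apply: Kubernetes starts or stops containers until reality matches.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                  # edit and re-apply to scale up or down
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.0
          ports:
            - containerPort: 8080
```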

Declarative isn't always best

Usually this declarative model works for you, but occasionally it works against you. One example: we run database schema migrations as a job in Kubernetes, and a job can only run once, which means each migration job must be uniquely named or you must delete and recreate it. Either option is a nuisance when you're used to a more procedural deployment model. But I think in all other situations the declarative model has worked better.
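A migration job of that kind looks roughly like this sketch (image and command are hypothetical). A completed job is immutable, so before each deployment you either `kubectl delete job db-migrate` and re-apply, or generate a unique name per release (e.g. db-migrate-v798).

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate             # or db-migrate-<release> for unique names
spec:
  backoffLimit: 2              # retry a failed migration at most twice
  template:
    spec:
      restartPolicy: Never     # don't restart the container in place
      containers:
        - name: migrate
          image: registry.example.com/app:1.0
          command: ["./run-migrations.sh"]   # hypothetical entry point
```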

It can be verbose

Kubernetes resources typically offer a lot of options and Kubernetes is relatively unopinionated about how to use them. It generally favours configuration over convention. As a result, I found I did a lot more reading than was really necessary to get started. The downside of the many options is that you can get some pretty bloated YAML files describing all the bits of your app. But at least you can check them into Git and they are not difficult to read - just verbose. There isn't any practical limit to the complexity of the architecture you can build within the Kubernetes fabric (or at least I haven't hit one), so you aren't likely to box yourself into a corner.

Stateless apps are easier

This goes with the declarative nature of Kubernetes and the movable/expendable nature of workloads. But stateful apps aren't particularly difficult to run: Kubernetes will manage disks, recover from failures, move workloads, and so on, and add-ons (e.g. a Postgres operator) exist to provide managed databases.

It's kind of serverless... kind of...

I often see "serverless" equated with AWS Lambda (and similar offerings).

I think we should take a broader view. We know stuff is always running on servers somewhere - the question is: how much do I have to care about those servers?

Lambda does a pretty good job of relieving you of the burden of caring about servers, but it's not the only solution. I think other kinds of hosted platforms offer similar benefits: hosted low/no-code application platforms, "mash-ups" of cloud services, Salesforce's Lightning platform or an app deployed on Heroku are all good examples.

Serverless to use

Kubernetes is "kind of" serverless but there's a caveat: only if you use a managed Kubernetes service provided by someone else.

We use Google's, which is very mature. As described above, they replace each server with a new one rather than upgrading in place. In the last 3 years our servers have been replaced hundreds of times via upgrades and we have never seen any service interruption: Kubernetes makes it easy to shift workloads around and rewire load balancers when backend servers are replaced, so the process is transparent.

We haven't experienced any downtime. We've never provisioned a server, never logged into one and never run updates. They're simply places to run our workloads. I care about the servers only to the extent that they need to be big enough and numerous enough to run our workloads (although you can rely on autoscaling to relieve you of the burden of deciding how many servers you need).

Not serverless to manage yourself

Kubernetes feels a lot less serverless if you're managing your own Kubernetes clusters. I'm the author of a Kubernetes cluster management tool that tries to make this easier, but even with newer setup tools like kubeadm and K3s, it's not easy to manage a cluster over the long term.

You have to worry about server failures, software updates, availability zones, updating Kubernetes itself - all the usual stuff that comes with managing servers yourself.

Setting up the Kubernetes overlay network requires a bit of reading, particularly if you are exposing your servers to the Internet. Some Kubernetes distributions can help with all this to an extent, e.g. CoreOS Tectonic, Rancher K3s.

I wouldn't advise setting up and managing your own Kubernetes cluster unless you have a lot of resources to dedicate to it. Managed Kubernetes services are available from dozens of first- and second-tier cloud providers.

Would I recommend it?

Yes, definitely - if it suits your infrastructure strategy.

If you like to assemble applications from a collection of managed AWS or Azure services, Kubernetes might not provide much value, particularly if you're using some other declarative infrastructure system such as Terraform.

If you have a web app with a simple "flat" deployment model, Google App Engine or Heroku might be better - more convention over configuration, less to read, less to manage.

If you just want someone else to manage everything and you want to work well above functions as a level of abstraction, something like Salesforce's Lightning platform might be a better solution.

If you're looking for a portable, declarative, "cloud in a box", Kubernetes is worth considering.