Lessons from launching billions of Docker containers

Docker brings consistency and repeatability to large and fast-moving environments, but not without challenges

The Iron.io Platform is an enterprise job processing system for building powerful, job-based, asynchronous software. Simply put, developers write jobs in any language using familiar tools like Docker, then trigger the code to run using Iron.io’s REST API, webhooks, or the built-in scheduler. Whether the job runs once or millions of times per minute, the work is distributed across clusters of “workers” that can be easily deployed to any public or private cloud, with each worker deployed in a Docker container.

At Iron.io we use Docker both to serve our internal infrastructure needs and to execute customers’ workloads on our platform. For example, our IronWorker product maintains more than 15 stacks of Docker images in block storage that provide language and library environments for running code. IronWorker customers draw on only the libraries they need, write their code, and upload it to Iron.io’s S3 file storage. Our message queuing service then merges the base Docker image with the user’s code in a new container, runs the process, and destroys the container.
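
While the exact mechanics are internal to the platform, the lifecycle is roughly what you would get from plain docker commands like these (the iron/ruby image name, the host path, and worker.rb are illustrative placeholders, not our actual names):

# Pull the pre-built language stack onto the worker machine
docker pull iron/ruby

# Mount the customer's uploaded code into a fresh container, run it,
# and let --rm remove the container as soon as the process exits
docker run --rm -v /tmp/worker-code:/worker -w /worker iron/ruby ruby worker.rb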

In short, we at Iron.io have launched several billion Docker containers to date, and we continue to run Docker containers by the thousands. It’s safe to say we have more than a little experience with Docker. Our experience launching billions of containers for our customers’ workloads has enabled us to discover (very quickly) both the excellent benefits and the frustrating aspects of Docker.

The good parts

We’ve been fortunate to interact regularly with new technologies, and although they bring their own sets of problems, they have helped us achieve otherwise impossible goals. We needed to quickly execute customer code in a predictable and reproducible fashion. Docker was the answer: It gives us the ability to deploy user code and environments in a consistent and repeatable way, and it is easy to work with when running operational infrastructure.

Our most common use for Docker is serving our customers’ environments and code through IronWorker. Container technology lets us run byte-for-byte the same configuration from the point where a customer builds and runs an image locally to where it runs in our production environment. This makes it easy to rule out errors in user code, since there is no discrepancy between customer and provider environments.
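
As a simplified illustration (the image name here is made up), the same tagged image a customer builds and tests locally is exactly what later runs on our workers, and the image digest can confirm it:

# Customer side: build, test, and publish the image
docker build -t acme/billing-worker:1.0 .
docker run --rm acme/billing-worker:1.0
docker push acme/billing-worker:1.0

# On either side: the content digest identifies the image bytes, so matching
# digests mean matching environments, down to the last byte
docker images --digests acme/billing-worker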

In the process of improving our container deployments, we discovered the benefits of microcontainers. Microcontainers are run from pared-down Docker images carrying only the essential software packages needed to build and run your code. We blogged about microcontainers soon after we started using them, and we’re still reaping their benefits. Smaller images have less impact on bandwidth when initially downloading to the Docker daemon (network I/O) and when launching containers (disk I/O). Because we frequently launch containers on new machines, we were able to see a vast improvement in the overall execution speed of the IronWorker platform due to their lighter footprint.
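
For a sense of scale, a microcontainer image for a small, statically compiled binary needs only a few lines of Dockerfile; the base image and binary name below are placeholders, not the contents of our published images:

# Start from a minimal base instead of a full OS userland
FROM alpine:3.4

# Copy in only the binary the job actually needs
COPY ./worker /app/worker

ENTRYPOINT ["/app/worker"]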

Microcontainers are also theoretically more secure, in the sense that they have fewer included system packages subject to exploits. However, these benefits can be negated if you create a very large application, if you install unnecessary system packages, or if your own code has security flaws.

We have noticed the benefits of Docker not only for running customer code, but for our own code as well. We run a number of infrastructure support packages on top of Docker. Not only are our services bundled in containers, but many of our operations tools are also packaged in reusable microcontainers. Whenever we need to take some action, we docker run opstool and science happens.
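
A typical invocation looks something like the following; the image name, mounted config path, and subcommand are hypothetical stand-ins for our internal tooling:

# Run a packaged ops tool with read-only access to its config,
# in a throwaway container that disappears when it finishes
docker run --rm -v /etc/opstool:/etc/opstool:ro ourorg/opstool cleanup --older-than 24h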

The not-so-good parts

Of course, Docker has its shortcomings -- quite a few, actually. Rather than belaboring all of them, we will simply take note of the three that impact us the most: Docker Hub availability, systemd compatibility, and occasional friction between Linux kernel and Docker daemon versions. Let’s take these in turn.

Docker Hub downtime. It is normal and expected for web services to experience downtime. However, note that Docker Hub does not have any kind of SLA specifying the amount of acceptable downtime. This is not unusual for SaaS offerings -- GitHub, for example, also lacks an SLA for uptime -- but you may find it to be an issue if you are using Docker Hub as your main Docker Registry. It has occasionally been an operational issue for us, as we frequently launch containers on new instances, requiring a pull from Docker Hub. You might not find that a couple of minutes of downtime is an issue if you are using Docker Hub for a smaller-scale operation or if you are launching images less frequently. Your mileage may vary.

There are essentially two common solutions for this: You could create a Docker Hub registry proxy or set up a custom Docker Registry of your own. However, because we serve a mostly static set of images, we came up with a third alternative: All of our language images, both regular and micro, are pulled onto an EBS volume ahead of time. From that volume we create an EBS snapshot, and whenever a new instance boots, it uses a volume built from the snapshot instead of pulling images down from Docker Hub, which may be suffering a service outage or degradation at that moment.
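
Sketched with the AWS CLI, and assuming the Docker data directory lives on the EBS volume being snapshotted, the pipeline looks roughly like this (image names, IDs, and the availability zone are placeholders):

# On a build host whose /var/lib/docker sits on the EBS volume:
docker pull iron/ruby
docker pull iron/node

# Snapshot the volume once all images are in place
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
    --description "pre-pulled Docker images"

# New instances then attach a volume created from that snapshot
# instead of pulling from Docker Hub at boot time
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 \
    --availability-zone us-east-1a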

Systemd conflicts. Docker and systemd are not a match made in heaven. There is a lot of overlap between what they offer, and you can easily see this when they interact. For instance, both can handle service restarts, but do so a little differently.

In systemd you might have a Docker service like the following:

[Unit]
Description=Redis container

Requires=docker.service
After=docker.service

[Service]
Restart=always

ExecStart=/usr/bin/docker start -a redis
ExecStop=/usr/bin/docker stop redis

But using Docker alone, you might run the service as follows:

docker run -d --restart=always redis

Both are completely valid! Which daemon should shoulder the responsibility to keep track of the service? Docker or systemd? We gave a lot of thought to their relationship and drew a line: We use Docker as an execution platform and systemd as a service management tool. In practice this means we use systemd service files to manipulate Docker containers. These services either stop and remove containers, or they pull and start them. Thus, we leverage the advantages of systemd while keeping the reproducible environments of Docker.

For example, a systemd unit file for one of our services might look like this:

[Service]
Restart=on-failure
ExecStartPre=-/usr/bin/docker kill <service>
ExecStartPre=-/usr/bin/docker rm <service>
ExecStartPre=/usr/bin/bash -c '/usr/bin/docker pull <org>/<image>:<tag>'
ExecStart=/usr/bin/bash -c '/usr/bin/docker run --name <service> <org>/<image>:<tag>'
ExecStop=/usr/bin/docker stop <service>

You can see that all of our Docker configuration lives inside our systemd file. This makes starting, stopping, and restarting services much easier, because the operator does not have to remember the exact configuration at launch time.
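
Day to day, operators interact only with systemd; for instance (using the same <service> placeholder as above):

systemctl start <service>     # re-pulls the image and starts a fresh container
systemctl restart <service>   # kill, remove, pull, and run again
systemctl stop <service>      # issues docker stop under the hood
journalctl -u <service>       # container output lands in the journal,
                              # since docker run stays in the foreground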

Kernel-daemon friction. The Iron.io Platform involves a lot of short-lived containers that stress some interface points between Docker and the Linux kernel. They are independent projects with separate development and release cycles, so some grit is inevitable, and matching their versions to create a fast and stable environment is not trivial. This is a problem for which we do not have a solution yet. Every once in a while, the relationship degrades and the kernel can no longer allocate resources, or Docker can no longer interface with the kernel properly. In these cases, we restart the Docker daemon or replace the instance entirely.
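
As a simplified sketch of what restarting the Docker daemon amounts to on a systemd-managed host (this is an illustration, not our production tooling; the marker file is a hypothetical mechanism):

#!/usr/bin/env bash
# If the daemon stops answering within a reasonable window, bounce it;
# if that does not help, mark the instance for replacement
if ! timeout 30 docker info > /dev/null 2>&1; then
    systemctl restart docker || touch /var/run/replace-this-instance
fi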

On a similar note, although Docker aims to help you reproduce a given environment everywhere, different Linux distributions can introduce environment mismatch problems. For example, we once encountered a Linux file system bug that caused servers to reboot for no apparent reason within 15 to 90 minutes of the initial launch; it affected CoreOS but not Ubuntu. Therefore, whenever we need to upgrade our infrastructure, we not only run an extensive suite of functional tests, but also use canary deployments to ensure we’ll have no issues running our application on the host OS.

Docker is a great way to package and run software in a reproducible and predictable manner. It has become a key tool both in our infrastructure and in our products. But it’s not all roses. Docker Hub suffers occasional outages. We had to find a way to make Docker and systemd play nicely together. And matching a kernel version with a Docker version for the best results is a tricky process. Hopefully, our experience and the fixes we’ve found will help make your Docker use smoother.

Ben Visser is an infrastructure engineer and Carlos Cirello is a back-end engineer at Iron.io.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Copyright © 2016 IDG Communications, Inc.