Revisiting Docker and Jenkins
It’s been over two years since I first wrote an article discussing how we combined Docker containers and Jenkins to create ephemeral build environments for a lot of our backend software at Riot Games. Today the series is seven articles strong and you’ve rewarded us with feedback, conversation, technical insights, tips, and stories about how you too use containers to do all kinds of interesting things. In the world of technology, two years is a long time. The series, while still useful, is out of date. Many of the latest Docker doodads and gizmos are absent. Rather than write an all-new blog series, I’ve chosen to go back and refresh the original articles. You can find the entire updated series here:
Today I’d like to cover lessons we’ve learned at Riot over two years of using Docker and Jenkins. For those already familiar with the series, I’ll also go over the updates and changes I’ve made to it, and cover what’s changed in the landscape of Docker containers and Jenkins as it pertains to the use cases presented.
The Developer Ecosystem
Back in 2016 when I first put pen to paper, getting Docker installed and usable on your desktop wasn’t exactly convenient. At the time Docker Toolbox was the bee’s knees in desktop setup. The tool set up VirtualBox on your desktop and ran Docker in a Linux virtual machine. Docker provided a creative client app called Docker Machine to create and manage that VM (or several others if you wished). This setup worked well, but was particularly heavyweight.
These days Docker offers native installers for Windows and OSX. Both still rely on a hypervisor-style solution, though the hypervisor is now more native and does not require VirtualBox. As a bonus, Docker for Windows can be run in “native windows” mode, enabling Windows Docker containers thanks to Microsoft’s partnered development efforts. Both solutions feel like integrated desktop clients and make working with Docker significantly easier with convenient GUI-driven setup menus and options. I’ve updated the blog series to reflect these additions, which in turn made getting set up and running much easier to explain!
Changes to Docker
In 2016 Docker was on version 1.10 or so and Docker Compose had little to no support for Windows. Our deployment also relied on Docker Swarm version 0.3.0. Docker 1.12 was announced at DockerCon later that year and marked a major evolution. The changes to Docker in the intervening two years are too numerous to go over in detail but there are several that affected the tutorials.
Perhaps the most significant change to Docker has been the addition of Docker Volumes. In version 1.10 of Docker you still needed to create what I called a “Docker Data Container” to encapsulate storage persistence if you didn’t want to mount data from the host. This created a semi-permanent volume in Docker that other containers could mount and share data from. While convenient, it was also somewhat risky as at least one container had to point to the volume inside the data container if you wanted to maintain your storage. This made using such a setup on a production Docker Host less than desirable, and indeed we only used that setup on our local development environments for Jenkins.
With the introduction of true volume support, you can create, name, and manage volumes independently from containers or images with a host of integrated commands. Volumes persist until you willfully delete them from the Docker Host and they even integrate with storage plugins to enable shared data volumes across clusters if that’s your wish. I’ve updated the blog series to eliminate the clunky Data Volume container and define and create a Docker volume for data storage. I think you’ll find this approach more intuitive and straightforward. Indeed many of you have written in and provided feedback that such a switch was long desired.
You can read more about Docker Volumes and their various storage features here.
In the 2016 era of Docker, before Docker 1.12, getting two containers to talk to each other required manually identifying what ports they were exposing and knowing the IP address of the Docker Host, or using Docker Links, which were only effective on the same host. In the original tutorial, I advised retrieving the IP address of your Docker Machine-enabled VirtualBox installation and feeding that to several scripts so that your Jenkins installation could modify things like NGINX and internal configs. It made using things like “localhost” as a simple DNS reference all but impossible. It also meant that we had to use container linking to bridge two containers so they could easily find each other.
Fast forward to today and Docker has introduced Docker Networks. With the advent of Docker for Mac and Windows, your Docker Host now just uses your machine’s local IP address; container links are considered passé and have been deprecated. Like Volumes, networks can be independently created, named, and maintained without any containers or images. Containers can be attached to a network, and even the simplest form of Docker offers service discovery and exposes all containers inside the network via DNS.
This enabled a complete rethink of many aspects of the tutorial and local Jenkins setup. I eliminated the use of Container Links to enable Jenkins and NGINX to talk to each other and instead placed them on a Docker Network. Likewise I reconfigured the Docker Settings for the build slaves inside the Jenkins configuration so that they attach to the same network and can easily find the Jenkins Master server.
The entire setup is now self-contained and no longer requires you to provide an IP address. We can also eliminate the cumbersome startup scripts we used to rename fields in the Jenkins configuration.
The new features are extensive and work across an entire Docker Swarm cluster allowing you to bridge containers between hosts with ease. You can read more about Docker Networking features here.
Docker Compose also saw a host of changes. Beside full support for Volumes and Networks, Compose obtained full native support for Windows. This means that I now pretty much exclusively use Docker Compose when creating multi-container apps to function on my desktop. Compose makes it super simple to name and create multiple networks, volumes and “Services” (multi-container applications). The specification file has also seen a few revisions and is now on Version 3. You’ll find that I’ve fully embraced the latest version of Docker Compose and the associated configuration files are all updated.
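To make the shape of such a setup concrete, here is a minimal sketch in Python that builds a Version 3 Compose document with one named volume, one named network, and two services. It is an illustration only, not the configuration from the tutorial: the service names, build paths, volume and network names, and the /var/jenkins_home mount point are all assumptions chosen for the example.

```python
import json


def build_compose(network="jenkins-net", volume="jenkins-data"):
    """Return a Compose v3 document as a plain dict.

    Every name here (services, build paths, network, volume) is
    hypothetical and only illustrates the overall structure.
    """
    return {
        "version": "3",
        "services": {
            "master": {
                "build": "./jenkins-master",
                # Named volume instead of a data container (mount path assumed).
                "volumes": [f"{volume}:/var/jenkins_home"],
                "networks": [network],
            },
            "nginx": {
                "build": "./jenkins-nginx",
                "ports": ["80:80"],
                # Same network, so NGINX can reach the master by service name.
                "networks": [network],
                "depends_on": ["master"],
            },
        },
        # Top-level declarations let Compose create these independently
        # of any single container.
        "volumes": {volume: {}},
        "networks": {network: {}},
    }


if __name__ == "__main__":
    print(json.dumps(build_compose(), indent=2))
```

Because both services sit on the same named network, neither needs an IP address or a container link to find the other. The sketch prints JSON, which is generally readable as YAML, though you would normally write the Compose file by hand.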
You can read more about the latest Compose changes here.
With so many changes to Docker and the switch to native Docker installations, a problem arose. In order for the Jenkins installation to dynamically create build slaves as Docker containers, it needed to talk to Docker on your desktop as if it were a Docker Host. Ironically this was a tad bit easier when the Docker Host was running in a virtual machine inside VirtualBox, as it had its own native IP address and could be configured to listen on port 2375 insecurely or 2376 with TLS certs. The old tutorial walked you through getting these certs and talking to that endpoint securely.
Now the Docker Host, for smart security reasons, doesn’t expose itself to the public internet, and this change complicated the Jenkins development setup and created unique challenges for Windows and Mac. On OSX I chose to solve this problem by using socat to expose the Docker socket file via port 2375 on a Docker Network (this still doesn’t expose the file publicly). That made configuring communication with Jenkins much easier. This trick doesn’t work on Windows because there’s no Docker socket file. As a result I chose to enable the Windows-only option of exposing port 2375 publicly. This technically means the Windows configuration of this tutorial is slightly less secure than the OSX one.
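For readers who haven’t used socat, the sketch below shows roughly what such a proxy invocation looks like, assembled in Python. It assumes the conventional socket location /var/run/docker.sock and is not the exact command from the tutorial; the helper name socat_proxy_command is made up for this example.

```python
import shlex


def socat_proxy_command(port=2375, socket_path="/var/run/docker.sock"):
    """Build a socat argv that relays TCP connections to a Unix socket.

    The default socket path is an assumption (the usual Docker location).
    """
    return [
        "socat",
        # Listen on the TCP port and fork a child per incoming connection.
        f"TCP-LISTEN:{port},reuseaddr,fork",
        # Forward each connection to the Docker socket file.
        f"UNIX-CONNECT:{socket_path}",
    ]


if __name__ == "__main__":
    # Print the command rather than running it; where it runs is up to you.
    print(shlex.join(socat_proxy_command()))
```

Run inside a container that is attached only to a private Docker Network, a proxy like this is reachable by Jenkins on that network without being published on the host.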
You’ll find detailed walkthroughs of these changes in the last tutorial where the Jenkins install is completely tuned for ephemeral use cases.
Other Docker Changes
As you can see, there’ve been a lot of Docker changes in the intervening two years. You’ll find many little details like Docker build time arguments, labels, and other small tweaks reflected in the Dockerfiles throughout the tutorial series. Many of these are powerful new features, but they didn’t drastically alter the nature of the tutorial so I’ll leave them for you to discover on your own.
Changes to Jenkins
As with Docker, time marches ever onward for Jenkins. Back when I wrote the tutorial, Jenkins 2 was just coming online. The Docker Plugin was going through very early iterations and the codebase was forked into the Yet Another Docker Plugin. Ultimately Jenkins still works the same as it did two years ago, but there have been many changes to the Dockerfile setups for Jenkins as well as configuration and settings changes to allow for communication with the Docker Host and build slaves.
The tutorial now uses Jenkins v2.112 (latest at the time of writing). Most of the significant changes were in the Dockerfile walkthroughs for how CloudBees sets up their Dockerfiles - things like the underlying OS container, variables exposed inside the Jenkins Dockerfiles, startup and Java options, and scripts.
The biggest changes in this space come from the plugins used to spin up Docker containers dynamically as build slaves. In 2016 when I was finishing the blog series, the Yet Another Docker Plugin had just come online as a fork of the Docker Plugin. The plugin saw continuous improvements and new versions while, for about a year, the original Docker Plugin went untouched. The Docker Plugin has only recently been resuscitated by CloudBees developers, and now both plugins exist with very different configurations. I’ve tested both extensively and both work well. Arguably the Docker Plugin is a bit easier to set up for the purposes of this tutorial and is thus highlighted throughout.
No matter which plugin you use, I advocate for ditching the SSH connection setup I originally wrote about in 2016. Instead, both plugins offer a different way to use reverse JNLP connections to Jenkins, which is significantly faster. It also makes the ephemeral build slaves easier to configure and set up. As a result the tutorial has been updated to eliminate the SSH connection approach and stick with the easier methods.
So which plugin should you ultimately choose? In 2016 I was beginning to favor the Yet Another Docker Plugin because when it comes to Jenkins I always favor the plugin under active development. These days, with CloudBees supporting the Docker Plugin, I suspect that’s probably the best call long term, and our next upgrade will focus on moving to the latest version.
Yet More Lessons Learned
Two years is a long time to deal with the operational realities of a large and complex system like Riot’s Jenkins deployments. Odds are I’ll have forgotten more than I remember about all the little things we’ve learned along the way. That said there are some big lessons I think are worth mentioning here.
Our Jenkins clusters have only grown in the time since we published the first article. Back during the original writing I believe our Docker Jenkins setup as described had about 1,000-2,000 jobs defined and peaked around 120-200 jobs an hour. The server now sees an average of 200-400 jobs an hour with a peak of 600 or so and has over 7,000 jobs defined.
This has of course put a lot of strain on the system. We still use the 0.15 version of the Docker Plugin on that server (we’re upgrading soon) but we did have to do several rounds of placing instance caps on containers. Our users started to get very proficient with Jenkins and would create multiple processes that spun up 5, 10, or even 20 containers at once to do things in parallel. The older version of the plugin wasn’t performant enough to handle all these cases, especially using SSH. We found capping build environment instances between 5 and 10 at once to be an optimal setting for our use case.
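The effect of an instance cap can be illustrated with a small simulation. The sketch below is not how either Docker plugin implements its cap; it is a Python analogy using a semaphore, with a hypothetical InstanceCap class, where a job that asks for 20 parallel environments never has more than 5 running at the same moment.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class InstanceCap:
    """Allow at most `cap` build environments of one type to run at once."""

    def __init__(self, cap):
        self._slots = threading.BoundedSemaphore(cap)
        self._lock = threading.Lock()
        self.running = 0
        self.peak = 0  # highest concurrency actually observed

    def run(self, job):
        with self._slots:  # blocks while `cap` jobs are already running
            with self._lock:
                self.running += 1
                self.peak = max(self.peak, self.running)
            try:
                return job()
            finally:
                with self._lock:
                    self.running -= 1


def simulate(requested=20, cap=5):
    """Request `requested` parallel jobs and report the observed peak."""
    limiter = InstanceCap(cap)
    with ThreadPoolExecutor(max_workers=requested) as pool:
        futures = [
            pool.submit(limiter.run, lambda: time.sleep(0.01))
            for _ in range(requested)
        ]
        for future in futures:
            future.result()
    return limiter.peak


if __name__ == "__main__":
    print("peak concurrent environments:", simulate())
```

The requests beyond the cap simply wait their turn, which trades some queueing time for a server that stays responsive.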
Over the two years we’ve been running this setup, our users created around 240 unique build environments at peak across over 7,000 jobs. After training, education, conversation, and feedback, that number has been scaled back to about 150-180 environments. We found many people were creating unique environments when in fact they could just use a generic setup for, say, Java or Go software development. We continue to advocate for consolidation. We’ve found that most divergent setups across teams at Riot still exist within the Python and NPM development communities and mostly consist of package management and environment setups.
As a peace offering we provide base build slaves for various operating systems (Alpine, Ubuntu, and CentOS) so the teams that use our tools can work with their package managers of choice. On top of this we offer Python, Java, and Go slaves for general use cases as well as a set of basic utility scripts. We learned a potent lesson: upgrading these is actually very painful. When Jenkins switched to using Java 1.8 as a mandatory requirement for slaves, we had to get everyone’s build environments updated over the course of several days, during which time many teams’ builds didn’t work. That’s never a good time for a team owning a core service.
Since the update, we advocate for minimizing what goes into the base build environment. We’ve begun to take many of our pre-installed scripts and tools and mount or install them into the container at runtime with simple downloads from artifact servers. In many cases we’ve moved utilities into shared Jenkins Pipeline Groovy libraries (by far the easiest/speediest choice to get code into an environment).
Of the 7,000 jobs on our Docker Jenkins setup, nearly 4,400 are Jenkins Pipeline jobs. We’ve created an extensive suite of common Pipeline functions using the Jenkins Global Libraries features of Pipeline. Most of these custom libraries make it easier to interact with various Riot artifact stores and systems. Some just make it easier to use Pipeline by minimizing syntax and using common sense defaults. I’ll be writing a blog about our general use functions in the future.
The main reason Pipeline jobs have taken off is that we’ve started to build what I would call “Continuous Delivery as a Service.” Much of Riot’s backend software stack has been consolidated to Java or Go services and most of these use the same core software libraries. As a result we ask teams to place a simple Jenkinsfile containing only configuration data for these builds in their code repo. We use Jenkins shared libraries to consume these files and build our software with common pipelines. As a result, no teams that build using these frameworks need a build engineer. In fact, teams don’t have to write a single build script to ship their software as even their deploy jobs are automated. More on this in a future blog!
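The idea of a configuration-only build file feeding a common pipeline can be sketched outside of Jenkins. The example below is a Python analogy, not Riot’s shared library (which would be written in Groovy): the keys, defaults, and stage names are all hypothetical. A team supplies a few settings and a shared function decides the stages.

```python
# Hypothetical defaults a shared library might apply to every service.
DEFAULTS = {"language": "java", "run_tests": True, "deploy": True}

# Hypothetical build step per supported language.
BUILD_STEP = {"java": "java build", "go": "go build"}


def plan_pipeline(config):
    """Turn a team's configuration data into an ordered list of stages."""
    settings = {**DEFAULTS, **config}  # team values override defaults
    language = settings["language"]
    if language not in BUILD_STEP:
        raise ValueError(f"unsupported language: {language}")

    stages = ["checkout", BUILD_STEP[language]]
    if settings["run_tests"]:
        stages.append("test")
    stages.append("publish artifact")
    if settings["deploy"]:
        stages.append("deploy")
    return stages


if __name__ == "__main__":
    # A team only declares what differs from the defaults.
    print(plan_pipeline({"language": "go", "deploy": False}))
```

The team’s file carries no build logic at all, so a change to the common stages reaches every service without anyone editing a build script.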
Jenkins Restarts and Job Resumption
While the situation is getting better, one trade-off of using ephemeral build environments is their lack of a permanent workspace. This really throws many Jenkins plugins for a loop, as they expect a common workspace to exist on a permanent build slave. While this impacted us in several places (especially source control plugins), the biggest impact was on the Jenkins Pipeline library itself and its resume capabilities.
Jenkins has a handy restart feature, but because we used so many pipelines in ephemeral environments, jobs wouldn’t restart, and the attempt to restart them would often render Jenkins unusable unless we scrubbed all the resume files from disk. We created a custom Jenkins restart process to disable all Jenkins jobs that attempted to resume after a restart.
Since then CloudBees has introduced Pipeline Durability, or the ability to disable pipeline resumes globally. They’ve also made several bug fixes to improve Jenkins stability. That said, Pipelines are still unlikely to resume if their ephemeral build environment has gone away between restarts. So caveat emptor: using this approach means that the build resume feature may not be as useful.
Rogue Containers and Images
I talked about this back when I first wrote the original blog and it’s still something to watch out for. The Docker Plugins do a reasonably good job of trying to clean up after creating containers but containers and jobs can and do leak additional containers. We had to tune our setup to eliminate long-running containers.
As a result, our Docker Jenkins environment has a maximum run time of about 24 hours before our background culling systems eliminate running containers. This makes the setup less useful to anyone who has long-running jobs such as performance tests unless they make a point to spin up and down Jenkins nodes in less than 24 hours.
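The selection logic behind a culling job like this is simple enough to sketch. The Python below is an illustration under assumptions, not our actual culling system: it takes a list of (name, start time) pairs, however you obtain them from your Docker Host, and returns the containers that have been running longer than the 24-hour limit. The function and container names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)


def containers_to_cull(containers, now, max_age=MAX_AGE):
    """Return names of containers that have run longer than `max_age`.

    `containers` is an iterable of (name, started_at) pairs, where
    `started_at` is a timezone-aware datetime.
    """
    return [
        name
        for name, started_at in containers
        if now - started_at > max_age
    ]


if __name__ == "__main__":
    now = datetime(2018, 4, 2, 12, 0, tzinfo=timezone.utc)
    running = [
        ("build-java-1", now - timedelta(hours=2)),
        ("perf-test-7", now - timedelta(hours=30)),
    ]
    print(containers_to_cull(running, now))
```

Passing the current time in as a parameter keeps the function easy to test; the actual stop and remove step would be handled separately by whatever tooling talks to the Docker Host.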
Generic Maintenance and Automation Jobs
While we built this server to be an extensible software build platform, we didn’t anticipate just how popular it was going to be with general automation teams that wanted to automate various maintenance processes or other small jobs.
Because we don’t require teams to provision a traditional build slave, our Jenkins platform became an attractive choice. Jenkins is a powerful task running system due to its integration with source control and its cron-like scheduling capabilities. Automation jobs are some of the heaviest uses of the Jenkins server; automating anything from small data ETL loads to health checks across dozens of systems is easy with Jenkins.
The load profile of automation jobs is very different from that of build jobs. Where a build job is likely triggered upon source code commit during working hours, automation jobs tend to run on a schedule, anywhere from every 15 minutes to every 24 hours. These jobs run around the clock and create maintenance window challenges for the team. We’re actively considering creating two Jenkins environments, one for these automation-style jobs and one for software builds. However, we haven’t devised a clean way to enforce this policy yet, and without easy enforcement it could create some confusion for our Jenkins users.
It’s been incredibly rewarding to watch this system grow since 2016. Over time the basic development setup has only gotten easier to maintain and play with thanks to advancements in Docker features. Jenkins too has grown more robust and is now better integrated. The maturity of Jenkins Pipeline has made this one of the most potent automation and build setups we currently have at Riot.
Our future is full of questions about just how far our Jenkins setup will scale and when we should shard so that it’s easier for us to create maintenance windows and for our users to navigate. The question is no longer “will it work?” but “just how far can we push it?”
What started as an attempt to make build environments totally self-service and eliminate the need for custom build slaves has evolved into a system that has begun to eliminate the need for custom build scripts and jobs while saving engineering time. As always I eagerly look forward to your feedback, lessons learned, and ideas in the comments below. I’ll be popping in to check things out, answer questions, and learn from your experiences with Docker and Jenkins.
Additionally, if you don’t want to spend time re-reading all the tutorials, you can always hunt down the latest “get up and running in just a few steps” setup in my public GitHub here: https://github.com/maxfields2000/dockerjenkins_tutorial