Containers Don’t Really Boot

Docker has been a great advancement for the mass consumption of Linux-based containers. The maturation of the virtual machine boom that has been underway since the early 2000s led to mass acceptance and deployment in public and private clouds. To be sure, asking for bare metal today can be seen as a faux pas without some well-defined use case (like very high I/O). So, now that folks have accepted that slices of CPU, memory, and disk served up by well-known hypervisors (KVM, ESXi, Xen) are good enough for most workloads, taking the next step to containers should not be that big of a leap. Except that now it is more common to run containers on VMs than on bare metal. So now we get a slice of a slice of a slice of resources!
Virtual machines are just what their name implies: full machines that are virtualized. This means they have virtual hardware that virtually boots an OS kernel and mounts a filesystem. The OS doesn't really know that the hardware it is running on is not real. Containers, on the other hand, are not virtual machines. Containers are fully sandboxed processes using the host machine's OS kernel. So, when running on a VM, they are slices of VM vCPU, memory, and disk for fully sandboxed processes. This is the part that had me perplexed for a while, until I ventured to understand exactly what happens when an LXC container starts versus when a virtual machine boots.
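You can see the "sandboxed process" nature for yourself. Here is a small sketch (assuming Docker is installed; the image and container names are just examples): the container shows up in the host's process table as an ordinary process, not as a separate booted machine.

    docker run -d --name sandbox centos:7 sleep 300   # start a container that just sleeps
    ps -ef | grep '[s]leep 300'                        # the same process, visible from the host (bracket trick avoids matching grep itself)
    docker rm -f sandbox                               # clean up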

Boot or Start?

Let's compare how CentOS Linux boots on a virtual machine versus how it starts in a container:
Virtual Machine/Bare Metal:
  • Power on; the system BIOS loads and looks for sector 0, cylinder 0 of the boot drive (typically /dev/hda or /dev/sda)
  • The boot drive contains the MBR, which hands off to a boot loader such as GRUB (typically installed in /boot); the boot loader locates the kernel and loads it (based on the GRUB config)
  • The kernel (vmlinuz) then uncompresses itself into memory
  • The kernel loads the temporary read-only root filesystem via initramfs (configured in GRUB)
  • The kernel locates and launches the /init program from within the initramfs (which in turn hands off to /sbin/init on the real root)
  • Init determines the run level via /etc/inittab and executes the startup scripts
  • Per its fstab entry, the root filesystem completes an integrity check and is then re-mounted read-write
  • You get a shell via /bin/sh (a quick sketch for poking at these pieces follows this list)
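On a running CentOS VM you can inspect most of these pieces directly. A quick sketch (paths are typical but vary by version, and on systemd-based CentOS 7 /etc/inittab is informational only):

    ls /boot/vmlinuz-* /boot/initramfs-*          # compressed kernel and initramfs images
    sudo grep -m1 vmlinuz /boot/grub2/grub.cfg    # the kernel line GRUB will load
    cat /etc/fstab                                # how the root filesystem gets mounted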
Container:
  • Docker tells LXC (now libcontainer) to start a new container from the image you specify (typically built from your Dockerfile)
    sudo docker run -i -t centos7 /bin/bash 
    • Runs lxc-create (or the libcontainer equivalent) with params, for example:
      lxc-create -t mytemplate -f lxc.conf -n mycontainer
  • Docker on RHEL/CentOS/Fedora systems uses the device mapper storage driver, which holds container images and layers in a sparse file here:
    /var/lib/docker/devicemapper/devicemapper/data
  • Docker starts the base image (a directory structure) as read-only, and creates a new read-write (copy-on-write) layer on top of it
  • The Docker-configured union/CoW filesystem (AUFS, devicemapper, or overlayfs) is used as the mounted root filesystem
  • Docker gives you a shell (e.g., /bin/bash, if you asked for it in the docker run command or a Dockerfile); see the sketch after this list
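A few commands for seeing this in practice (a sketch only; it assumes Docker is installed, and the exact output and paths vary by Docker version and storage driver):

    docker info | grep -i 'storage driver'                         # devicemapper, aufs, or overlay
    sudo ls -lsh /var/lib/docker/devicemapper/devicemapper/data    # the sparse data file (devicemapper hosts)
    docker history centos:7                                        # the read-only layers a new container sits on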

“It is perhaps more precise to say that a Linux environment is started rather than booted with containers.”

The entire Linux “boot” process that a normal virtual machine goes through is essentially skipped; only the last steps, where the root filesystem is mounted and a shell is launched, happen. It is perhaps more precise to say that a Linux environment is “started rather than booted”. I was also further confused by the Docker terminology, which uses the word “image” to describe something different from cloud images. When I hear “image” I think of AMI-style full virtual machine images as used in clouds. Those images are different from the container images used by Docker. Docker uses the term “image” to describe what is really a minimal, layered root filesystem (union mounted). This is all a bit confusing at first until you remember that everything in Linux is a file. If you dig around and take a look at some of the utilities used to create these “base images”, such as febootstrap/supermin or debootstrap, you will see that they simply create clean, consistent directories and files for the Linux distro, output in various formats such as .img or .tar. So, Docker “images” are really nothing more than directories and files pre-populated with the minimum viable set of Linux components and utilities you need for a functional Linux system.

“This is all a bit confusing at first until you remember that everything in Linux is a file.”
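You can prove this to yourself by building a base image from nothing but a directory tree. A minimal sketch (assuming a Debian/Ubuntu host with debootstrap and Docker installed; the target directory and image tag are just examples):

    sudo debootstrap --variant=minbase stable /tmp/rootfs          # populate a minimal distro directory tree
    sudo tar -C /tmp/rootfs -c . | docker import - mybase:latest   # tar it up and import it as an "image"
    docker run --rm -it mybase:latest /bin/bash                    # start a container from plain directories and files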

When Docker LXC/libcontainer-based containers boot, they are really just starting a kind of super “chroot” of processes with totally sandboxed storage and networking. They don’t need to do a full boot since they are just leveraging the OS kernel of the host system. All they seem to need is the minimum viable Linux system: the directory structure, tools, and utilities. Additionally, because Docker caches content, things run even faster since there is less to download. These are the reasons why containers “boot”, or more precisely “start”, incredibly fast. Because you don’t have to go through a full system boot process the way a virtual machine or bare metal machine does, you get productive “process-wise” rapidly in your own super sandboxed Linux environment.
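A quick way to see the shared kernel for yourself (again just a sketch; it assumes Docker is installed, and centos:7 is only an example image):

    uname -r                              # kernel version on the host
    docker run --rm centos:7 uname -r     # same version inside the container: no kernel of its own was booted
    docker run --rm centos:7 ls /         # just a minimal root filesystem of directories and files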

Union File Systems and the Neo Image Zeitgeist

One cool thing Docker introduced is the use of union-mounted, layered file systems to control the craziness of working with images. When I say image “craziness” I should clarify with a refresher for those youngsters who didn’t go through the whole “anti-image” zeitgeist of the past five years. Let’s rewind to the early 2000s, when people discovered the ability to create sector-by-sector disk copies and save all the time of installing apps over and over again (Ghost, anyone?). Everyone was in love with images then, and naturally started using them in VMs when VMware was new and hot. It was only after years of dealing with homegrown image versions and patching problems that folks became more and more anti-image. To be sure, many people made many horrible images (especially for Windows machines) that didn’t properly get sanitized (with sysprep or similar tools) before use as a VM template, which only served to exacerbate things.

Fast forward to 2005-ish, when CM tools like Puppet (and later Chef, in 2008) were born out of the “anti-image” rebellion. What people wanted in these modern CM tools was the ability to repeatedly build pristine machines literally from bootstrap. What this meant to many was no images ever: PXE the world and then Chef it. As the public cloud took off, so did people’s need to build servers at an increasingly rapid pace. PXE bootstrapping everything was just too slow and often not even possible in multi-tenant cloud environments like AWS. The compromise answer was to create super minimal “base images” (also called JEOS or stem cell images) which were pristine and small. These base images for full virtual machines booted much faster, and the fact that they had very little on them didn’t matter anymore since we could reliably and cleanly install everything declaratively in code using Puppet or Chef.

Fast forward to today, and folks often find that full VMs booted and installed with cookbooks are again not fast enough for them. Also, the complexity of some CM tools meant that it was a fair amount of work to stand up your Puppet or Chef environment before it paid off in speed and as a single source of truth. Enter containers. However, just getting a container only gives you a pristine base image if you start out with one. Beyond any pristine base container image, any customizations you might need (like installing your app software components) would take you back to the days of image sprawl, unless you used modern CM like Puppet or Chef to provision on top of your container base images. Docker decided to fix this old problem with a new twist by using union mounts for layered, copy-on-write filesystems. What they did was take the concept of the pristine base image (which we’ve learned is nothing more than minimum viable Linux distro directories and files, with components and utilities) and allow you to leave it in pristine shape. They then allow you to layer the components you need on top of each other, leaving each layer as a read-only thin delta of changes. The automation (infra as code) offered by Docker is via the Dockerfile, where the machine DNA is located. What is still yet to be determined is whether the Dockerfile is enough to get your container into the end state you desire. For example, will layered “super thin delta images” be a replacement for modern CM tools? Or, more likely, will CM tools like Chef be used to build each thin layer delta image (TLDI)?
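To make the layering concrete, here is a minimal sketch (the base image, package, and tag are just examples): each Dockerfile instruction becomes its own thin, read-only delta layer on top of the pristine base.

    printf 'FROM centos:7\nRUN yum -y install httpd\nCMD ["httpd", "-DFOREGROUND"]\n' > Dockerfile
    docker build -t myapp:1 .     # each instruction in the Dockerfile becomes its own layer
    docker history myapp:1        # the layers show up as thin deltas on top of centos:7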


