Containers Don’t Really Boot

Docker has been a great advancement for the mass consumption of Linux-based containers.  The maturation of the virtual machine boom that has been building since the early 2000s led to mass acceptance and deployment in public and private clouds.  To be sure, asking for bare metal today can be seen as a faux pas without a well-defined use case (like very high IO).  So, now that folks have accepted that slices of CPU, memory, and disk delivered through well-known hypervisors (KVM, ESXi, Xen) are good enough for most workloads, taking the next step to containers is not that big of a leap.  Except that now it is more common to run containers on VMs than on bare metal.  So now we get a slice of a slice of a slice of resources!
Virtual machines are just what their name implies: full machines that are virtualized.  They have virtual hardware that virtually boots an OS kernel and mounts a filesystem; the OS doesn't really know that the hardware it is running on is not real.  Containers, on the other hand, are not virtual machines.  Containers are fully sandboxed processes sharing the host machine's OS kernel.  So, when running on a VM, they are slices of VM vCPU, memory, and disk for fully sandboxed processes.  This is the part that had me perplexed for a while, until I ventured to understand exactly what happens when an LXC container starts versus when a virtual machine boots.

Boot or Start?

Let’s compare boots of CentOS Linux on virtual machines versus containers:
Virtual Machine/Bare Metal:
  • Power on: the system BIOS loads and looks for sector 0, cylinder 0 of the boot drive (typically /dev/hda or /dev/sda)
  • The boot drive's MBR hands off to a boot loader such as GRUB (typically in /boot), which locates the kernel and loads it (based on the GRUB config)
  • The kernel (vmlinuz) then uncompresses itself into memory
  • The kernel mounts the temporary read-only root filesystem via initramfs (configured in GRUB)
  • The kernel locates and launches the /init program from within the initramfs, which hands off to /sbin/init
  • Init determines the run level via /etc/inittab and executes the startup scripts
  • Per its fstab entry, the real root filesystem completes an integrity check and is then re-mounted read-write
  • You get a shell via /bin/sh
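
If you want to poke at those pieces yourself, they are all just ordinary files on a running CentOS 7 VM. A rough sketch (the paths below are typical CentOS 7 defaults, not taken from any specific machine):

    # Kernel image(s) and temporary initramfs root filesystem(s) live in /boot
    ls /boot/vmlinuz-* /boot/initramfs-*.img
    # The GRUB menu entry ties a specific kernel to a specific initramfs
    sudo grep -m1 "^menuentry" /boot/grub2/grub.cfg
    # The real root filesystem that gets checked and re-mounted read-write is listed in fstab
    cat /etc/fstab
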
Container:
  • Docker tells LXC (now libcontainer) to start a new container using the image and config you give it (on the docker run command line or in your Dockerfile)
    sudo docker run -i -t centos7 /bin/bash
    • Runs lxc-create or the libcontainer equivalent with params (example)
      lxc-create -t mytemplate -f lxc.conf -n mycontainer
  • Docker on RHEL/CentOS/Fedora systems uses the devicemapper storage driver, which holds container images in a sparse file here:
    /var/lib/docker/devicemapper/devicemapper/data
  • Docker starts from the base image (a directory structure) mounted read-only and creates a new RW (copy-on-write) layer on top of it
  • The Docker-configured union filesystem (AUFS, devicemapper, OverlayFS) is used as the mounted root filesystem
  • Docker gives you a shell via /bin/sh (if you asked for one in the docker run command or a Dockerfile config)
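
If you want to see that "start, don't boot" behavior for yourself, here is a rough sketch on a stock CentOS 7 Docker host (it assumes the devicemapper storage driver and the public centos:7 image):

    # Which union/CoW storage backend Docker is actually using
    sudo docker info | grep -i "storage driver"
    # The sparse file backing images under devicemapper
    sudo ls -lsh /var/lib/docker/devicemapper/devicemapper/data
    # Start a container; no BIOS, GRUB, kernel, or init sequence runs
    sudo docker run -d --name demo centos:7 sleep 1000
    # From the host, it shows up as just another process
    sudo docker top demo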

“It is perhaps more precise to say that a Linux environment is started rather than booted with containers.”

The entire Linux "boot" process that a normal virtual machine goes through is essentially skipped; only the last steps, where the root filesystem is mounted and a shell is launched, actually happen.  It is perhaps more precise to say that a Linux environment is "started rather than booted".  I was further confused by the Docker terminology, which uses the word "image" to describe something different from cloud images.  When I hear "image" I think of AMI-style full virtual machine images as used in clouds.  Those images are different from the container images used by Docker.  Docker uses the term "image" to describe what is really a minimal, layered root filesystem (union mounted).  This is all a bit confusing at first until you remember that everything in Linux is a file.  If you dig around and take a look at some of the utilities used to create these "base images", such as febootstrap/supermin or debootstrap, you will see that they create clean, consistent directories and files for the Linux distro, output in various formats such as .img or .tar.  So, Docker "images" are really nothing more than directories and files pre-populated with the minimum viable set of Linux components and utilities you need for a functional Linux system.

“This is all a bit confusing at first until you remember that everything in Linux is a file.”
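
As a rough sketch of how one of these "base images" comes to be, here is what the debootstrap route might look like (debootstrap is the Debian-family tool; the suite, mirror, and image name below are illustrative, and the CentOS-family equivalent would use febootstrap/supermin or similar):

    # Build a minimal root filesystem into a plain directory
    sudo debootstrap stable ./rootfs http://deb.debian.org/debian
    # It really is just directories and files (bin, etc, lib, usr, var, ...)
    ls ./rootfs
    # Tar up the directory tree and hand it to Docker as a new "image"
    sudo tar -C ./rootfs -c . | sudo docker import - mybase:latest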

When Docker LXC/libcontainer-based containers "boot", they are really just starting a kind of super "chroot" of processes with totally sandboxed storage and networking.  They don't need to do a full boot because they are leveraging the OS kernel of the host system.  All they need is the minimum viable Linux system directory structure, tools, and utilities.  Additionally, because Docker caches content, things run even faster since there is less to download.  These are the reasons why containers "boot", or more precisely "start", incredibly fast.  Because you don't have to go through a fully virtualized system boot process like a virtual machine or bare metal machine, you rapidly get a productive process running in your own super-sandboxed Linux environment.
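
A quick way to convince yourself of this (a sketch assuming the public centos:7 image is available on your Docker host):

    # Same kernel release reported on the host and inside the container
    uname -r
    sudo docker run --rm centos:7 uname -r
    # Inside the container, PID 1 is just the command you asked for, not an init system
    sudo docker run --rm centos:7 cat /proc/1/comm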

Union File Systems and the Neo Image Zeitgeist

One cool thing Docker introduced is the use of union-mount layered filesystems to control the craziness of working with images.  When I say image "craziness" I should probably clarify with a refresher for those youngsters who didn't go through the whole "anti-image" zeitgeist of the past five years.  Let's rewind to the early 2000s, when people discovered the ability to create sector-by-sector disk copies and save all the time of installing apps over and over again (Ghost, anyone?).  Everyone was in love with images then, and naturally started using them in VMs when VMware was new and hot.  It was only after years of dealing with homegrown image versions and patching problems that folks started becoming more and more anti-image.  To be sure, many people made many horrible images (especially for Windows machines) that didn't properly get sanitized (with sysprep or similar tools) before use as a VM template, which only served to exacerbate things.

Fast forward to 2005-ish, when CM tools like Puppet (and later Chef, in 2008) were born out of the "anti-image" rebellion.  What people wanted from these modern CM tools was the ability to repeatedly build pristine machines literally from bootstrap.  What this meant to many was no images ever: PXE the world and then Chef it.  As the public cloud took off, so did people's need to build servers at an increasingly rapid pace.  PXE bootstrapping everything was just too slow and often not possible in multi-tenant cloud environments like AWS.  The compromise answer was to create super minimal "base images" (also called JEOS or stem cell images) which were pristine and small.  These base images for full virtual machines booted much faster, and the fact that they had very little on them no longer mattered since we could reliably and cleanly install everything declaratively in code using Puppet or Chef.

Fast forward to today, and folks often find that full VMs booted and installed with cookbooks are again not fast enough for them.  Also, the complexity of some CM tools means it is a fair amount of work to stand up your Puppet or Chef environments before they pay off in speed and as a single source of truth.  Enter containers.  However, just getting a container only gives you a pristine base image if you start out with one.  Beyond any pristine base container image, any customizations you might need (like installing your app software components) would take you back to the days of image sprawl unless you used modern CM like Puppet or Chef to provision on top of your container base images.  Docker decided to fix this old problem with a new twist by using a union mount for layered, copy-on-write filesystems.  What they did was take the concept of the pristine base image (which we've learned is nothing more than minimum viable Linux distro directories and files, with components and utilities) and allow you to leave it in pristine shape.  They then allow you to layer the components you need on top of each other, leaving each layer as a read-only thin delta of changes.  The automation (infra as code) offered by Docker is via the Dockerfile, where the machine DNA is located.  What is still yet to be determined is whether the Dockerfile is enough to get your container into the end state you desire.  For example, will layered "super thin delta images" be a replacement for modern CM tools?  Or, more likely, will CM tools like Chef be used to build each thin layer delta image (TLDI)?
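
To make the layering concrete, here is a toy sketch (the base image, package, and file names are illustrative, not from any real build): each Dockerfile instruction becomes its own read-only thin delta layer stacked on the untouched base, and docker history shows the resulting stack.

    # Write a toy Dockerfile; each instruction produces one thin, read-only layer
    printf '%s\n' \
      'FROM centos:7' \
      'RUN yum install -y httpd' \
      'COPY index.html /var/www/html/' \
      'CMD ["httpd", "-DFOREGROUND"]' > Dockerfile
    echo "hello" > index.html
    sudo docker build -t myapp .
    # The layer stack: your thin deltas on top, the pristine base image at the bottom
    sudo docker history myapp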


HAV vs. Cloud

I'm tired of the word cloud.  It is so hackneyed at this point that uttering it feels like being a "poser" skateboarder in the late 80s wearing Vision Street Wear and holding a pristine Gator pro model with rails, copers, and a noseguard.  Sigh.  However, for over a year now I have had to ensure that all my peers at Gap, and those upstream of me, understood the differences between "true" cloud and virtualization.  Infrastructure architectural designs and discussions had to be framed in such a way as to keep people (even smart people) from being sucked into broad misinterpretations of technology implementations.

Back in July of 2011, my buddy Chris Buben and I sat in the Gap cafeteria, frustrated about how to get everyone on our teams aligned on how to properly consume "cloud".  We wasted so much time in team meetings correcting people's widespread misuse of the term that we were determined to fix the vernacular.  One of the biggest impediments to successful technology projects is a lack of clear communication, specifically a failure to align everyone on the terms.  You need everyone from technology VP down to analyst saying the same things the same way.  We needed a clearer way to mass-communicate virtualization versus cloud.  So, we came up with a simple way to quickly and easily differentiate a lot of big infrastructure architectural differences between cloud and pure virtualization.  We started describing two different architectural zones: one zone we called High Availability Virtualization, or HAV, and the other we just called cloud.

We wanted to draw a distinction between the two to keep people from thinking that we could simply shove every application we have at Gap into what we considered real cloud-architected infrastructure.  The reality is that you cannot.  Not every app is architected, or has code, amenable to real cloud computing.  The sad truth is that many commercially sold applications are not optimized for real cloud.  Case in point: many Oracle e-business apps (in fact, Oracle seems to go out of its way to make applications difficult to run on anything but Oracle platforms and software... but I digress).  So, we needed a way to give ourselves breathing room to host some apps in a virtualized environment and reap some modern architectural benefits without getting caught up in the "cloud" term abuse game.  HAV became our mantra for everything non-cloud.  You might ask: why draw a distinction at all?  The answer is that we had our own clear viewpoints about what cloud really was versus just virtualization.  For purposes of this analysis we are using OpenStack as our cloud (IaaS) platform.  To explain our position, I put together a breakdown of HAV versus cloud.  Understand that some features are continuously being developed and improved in the cloud arena (especially in networking), so this chart is changing as I type.

HAV Versus Cloud

  • Private Cloud = cheap, fault tolerant by design, disposable, big scale
  • HAV = more expensive, fault tolerance through licensed features, less scalability

PROFILES

Each attribute below is compared for HAV (High Availability Virtualization) versus cloud (private IaaS):

Hardware:
  • HAV: Blade servers or rack servers; commodity hardware (premium or cheap)
  • Cloud: Rack servers primarily; commodity hardware (cheap)

Storage:
  • HAV:
    • Shared SAN or NAS disk for VMs
    • T1 SAN or T2 NAS tier on iSCSI-based block
    • High IOPS performance expectations
  • Cloud:
    • Local disk for VMs
    • iSCSI-based block cloud-managed volumes
    • Generally lower expected IOPS performance

Lifetime:
  • HAV: Persistent VMs (horizontal scale needs are less)
  • Cloud: Disposable/ephemeral VMs that scale out

Hypervisors:
  • HAV: VMware ESXi, RHEV, Citrix XenServer, Hyper-V
  • Cloud: KVM, Xen (other hypervisors may be supported, but we didn't want to use them, e.g. ESXi)

Patching:
  • HAV: Patching needed due to persistent, long-lived VMs
    • WSUS for Windows
    • CM-pushed RPM or YUM updates for Linux
  • Cloud: No "traditional" patching (always dispose and rebuild instantly)
    • High-risk vulnerabilities require new AMIs to be built
    • CM-pushed RPM or YUM updates for Linux

HA:
  • HAV:
    • Persistent VMs that are re-deployed on failure
    • Clustering used for HA
    • Live migration
    • Apps behind load balancer VIPs
  • Cloud:
    • On-demand new VM instances deployed as needed
    • No reliance on live migration
    • Apps behind load balancer VIPs

Networking:
  • HAV (custom networking):
    • Can have active/passive NICs on HAV hosts
    • Can have 2 active NICs on hypervisor hosts
    • Dot1q VLAN trunks to HAV hypervisor hosts
    • Bridged networking between HAV VMs and the network
    • Subnets controlled by VLANs and network L3 switches
    • Default gateways are external L3 VLAN interfaces
  • Cloud (homogenized networking; note: this is changing):
    • 2 or more active NICs required on hypervisor hosts
    • Dot1q VLAN trunks to cloud hypervisor hosts
    • NAT between cloud VMs and the network is typical
    • VLANs/subnets pre-allocated per cloud hypervisor host or by tenant/project
    • Default gateways are virtual gateways on the cloud hypervisor host
    • Default gateways can also be external L3 VLAN interfaces

DHCP:
  • HAV: Enterprise DHCP servers
  • Cloud: Local cloud host DHCP servers or external enterprise DHCP servers

DNS:
  • HAV: Enterprise DNS direct to VMs
  • Cloud: Cloud host DNS proxy (dnsmasq) to VMs, or enterprise DNS

Database Tier

  • HAV:
    • Physical and virtual (determined by performance requirements)
    • Linux servers can utilize LXC containers for physical pseudo-virtualization
    • Hardware: blade servers or rack servers
    • Storage: local disk; T1 SAN or T2 NAS; database tier on iSCSI-based block
  • Cloud (Private IaaS):
    • Physical and virtual, predominantly virtual (determined by performance requirements)
    • Linux servers can utilize LXC containers for physical pseudo-virtualization
    • Hardware: rack servers predominantly
    • Storage: local disk; iSCSI-based block cloud-managed volumes

So, why go to all this trouble to break these out?  Because, when trying to place applications in the most appropriate environment for deployment, you need to take all these factors into consideration.  You may have an application that does not tolerate network address translation, does not work well behind a load balancer, or needs ultra-high IOPS.  In such cases, you may be better served deploying those apps in HAV-architected infrastructure zones.  However, if you have a modern, truly service-oriented architecture (SOA) app that doesn't require ultra-high IOPS, is designed from the ground up to be stateless, and has built-in failure detection, then cloud-architected zones may be perfectly appropriate.

The point is: don't just try to shove every app (commercial or homegrown) into cloud-architected zones without understanding the implications of doing so.  You may end up with undesired performance or reliability headaches.  Our goal is to move as much of our application workload to OpenStack-based cloud infrastructure zones as possible.  However, on the journey to that nirvana, we have a lot of legacy app crud that just isn't optimized for true cloud infrastructure.  For these apps, we have opted for a one-two hop, using an HAV infrastructure zone on the way to cloud.  This will buy us time to either re-architect those apps or simply replace them.  This brings up a larger discussion about proper cloud application design.  In the next post, I will cover months of work by our internal architects and app dev teams at Gap on what comprises "cloud-architected" application best practices.