OpenStack Quantum Network Design

I can never seem to get enough OpenStack networking information.  I have been collecting notes in Evernote on OpenStack networking for 2 years now.  Having built and destroyed OpenStack Diablo, Essex, Folsom, and now Grizzly at Gap with much pain, I decided to share my notes in a consolidated format.  My hope is that this will make OpenStack Quantum network design more tenable for folks trying to be deliberate about their network design.  You can see the full presentation embedded from Scribd below, or you can access it here.

View this document on Scribd

Agile and DevOps at Scale

More and more we are starting to hear about how real transformation can happen for businesses using IT.  For businesses like Netflix or Twitter, where IT is at the core of the entire business model, that has been clear for a while.  What is changing is that businesses that have heretofore leveraged IT as just part of the “back office” are seeing incredible transformative possibilities by leveraging the same IT principles as technology startups.  Those with deep IT knowledge have always intrinsically known this potential.  However, convincing a non-tech business of this possibility has historically been difficult.  This is because non-tech businesses (and, to be sure, even traditional large tech businesses) saw IT as a “cost center” and not a true “partner” with the business for achieving goals.

This perception has been changing over the past 5 years.  It has been accelerated by advancements in infrastructure as a service, platform as a service, and software as a service offerings that have made business leaders aware of ways to get infrastructure both faster and cheaper.  However, a funny thing has happened.  When business leaders asked IT leadership to “take a look” at these new “n-as-a-service” offerings, they got feedback that there were still 3 main hurdles to unlocking real business value fast:  infra and developer collaboration, technology freedom, and core IT and business processes.  So, while you might be able to get servers fast from cloud providers, you still had the existing internal hurdles to overcome.  These hurdles are not insignificant.

The real unlock for all these “n-as-a-service” infrastructure offerings is actually transforming the traditional way IT does business.  Recently, authors Gene Kim, Kevin Behr, and George Spafford wrote a novel called “The Phoenix Project” which outlines what a hypothetical Agile and DevOps transformation looks like.  While the novel is a great way for people to understand how agile/devops change can happen, it also helps to hear about a real success story.  Let me offer up Gap Inc as an example of how you can take massive legacy enterprise IT and make a real transformation happen.  The actual implementation of Agile processes by a business with a heavily traditional structure and workflow is not without pain.  Being successful requires a strong spine and executive leadership that is willing to put everything on the line.  For Gap, the CIO (Tom Keiser) and the VP of Infrastructure (Naveen Zutshi) were key to pushing the transformation through successfully.

I have taken the liberty of summarizing the 2 main tracks of effort that made the massive Agile and DevOps transformation at Gap possible in 2 documents.  The first is a paper written by Kamal Manglani which focuses on core Agile principles as applied to infrastructure.  The second is a Meetup presentation I put together that focuses on the methodologies we designed at Gap to change mindsets about technologies.  Hopefully, these documents will be useful for others with the gumption to take this journey on.

Gap Infra Agile Transformation Journey

View this document on Scribd

Enterprise-DevOps-at-Scale-Gap

View this document on Scribd

Enjoy,

Jeff


The End of Hostname Standards? Hostnames in the Dynamic Infrastructure World

The fastest way to start a fight in IT is to get engineers together with the goal of creating a hostname standard.  If there is one thing that seems to send blood pressure to new altitudes, it is the topic of hostname standards.  This was the case at Gap, and boy did it go on for months.  At one point, people who were friends actually stopped talking to each other because the arguments got so heated.  In retrospect, the entire exercise was one in futility.

Individual server hostnames are increasingly becoming irrelevant today.  Hostnames have historically provided IT people with a sense of control over the chaos of IT operations.  However, this “control” is just one of many legacy thought patterns that equate to an old way of doing business.  In truth, the hostname should essentially be viewed as a disposable variable.  How, you might ask?  Because the servers themselves should be viewed as disposable.  In fact, with modern configuration management software like OpsCode Chef or PuppetLabs Puppet, plus dynamic cloud computing host instantiation, hostnames truly do become irrelevant.  What is still relevant is the virtual IP, or “VIP” in networking parlance.  The VIP is a (usually) friendly DNS name that end users use to locate an application or service behind a load balancer.  The servers behind it are the dynamically named piece of this design discussion.

Think about it:  if a server VM can be spawned, destroyed, and rebuilt in minutes, can you really rely on a static hostname?  The answer is probably no.  Increasingly, the server instance name needs to be as dynamic and flexible as the infrastructure.  Finding a server usually happens when something is wrong, and with static hostnames you would have had to either consult a monitoring system that gave you the name because something was wrong, wait for someone to come to you and say “I need help with the yoda123 server”, or think for a minute and try to derive it from a hostname standard.  With modern configuration management, if you need to locate any server, or more likely a group of servers, you can do so easily and programmatically.  You don’t really need a hostname standard.  Let’s examine some examples of why.

With OpsCode Chef, you can search your entire infrastructure for any combination of server roles, attributes, or functions.  It is super easy and straightforward.  Here is an example of how to search for all servers you have that are web servers:

knife search node "role:webserver"

You can also do boolean searches for multiple roles.  Here we search for all nodes with apache and memcached roles:

knife search node "role:apache AND role:memcached"

You can also do fun things like searching all your nodes for a variation of a hostname!

knife search node "id:yodaweb*"

Finding hosts by role or function like this is as easy as, or easier than, relying on a static traditional hostname standard.  It is also dynamic.  When a host is terminated, it is removed from the CM node list, so the next time you do a search, that server will no longer appear in the result set.  This type of model means that the server hostname can be something like aabbccddeeffgg1234, and it is easier to find machines using intelligent search than to map aabbcc onto some kind of master hostname schema where aabb means it is a web server.  Plus, the search data is always accurate as of the way the server was built.  Meaning, you don’t have situations where Mr. Joe DevOps built a memcached server but did not name it yodamemd, and instead named it r2d2mem.  In that scenario you would not be able to locate the server by name, but with a modern CM system you could still find it.
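When an instance goes away, that cleanup is just the removal of the node and client objects from the Chef server.  Here is roughly what it looks like if you do it by hand with knife (the node name below is made up; in practice your termination tooling would do this for you):

knife node delete aabbccddeeffgg1234 -y
knife client delete aabbccddeeffgg1234 -y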

Still, after many arguments, folks will say there are use cases for hostname standards.  I think that argument can have some relevance when you break down the type of machine and the specific use case in question.  For example, hypervisor hosts often don’t need an end-user-friendly hostname because end users never access these servers directly.  However, it probably would be nice if the VIP for a well-known app were friendly and iterative.  Perhaps the case can be made that it is convenient for servers that end users access directly (with no VIP) to have “easy” names to remember.  However, this is most likely a small set of use cases.  Anything that needs to scale most likely won’t benefit from a predictable hostname standard.

I did a quick breakdown of how hostnames can be viewed by use case:

CLOUD

Dynamic Host Names that do not adhere to any hostname standard.

Ex) This GUID-like name can be formed from a well-known unique source like the host MAC address.  Here is a quick and dirty example.

MAC address of NIC1, e.g. 04:0c:ce:d0:7a:b6

The following steps are performed to arrive at the dynamic hypervisor host name (a shell sketch follows the list):
1) Query the system for the MAC address of NIC1; this is the RAW name.
2) Remove any dashes (-), underscores (_), colons (:), or dots (.) from the RAW name.
3) Prefix the CLEANED MAC name with a lowercase string of your choosing so the name starts with alphabetical characters.
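Here is a quick shell sketch of those three steps.  It assumes NIC1 is eth0 and uses a “c” prefix; both of those choices, and the variable names, are illustrative only:

raw=$(cat /sys/class/net/eth0/address)           # step 1: the RAW MAC, e.g. 04:0c:ce:d0:7a:b6
cleaned=$(echo "$raw" | tr -d ':._-')            # step 2: strip colons, dots, underscores, dashes
newname="c$(echo "$cleaned" | tr 'A-F' 'a-f')"   # step 3: lowercase it and add a letter prefix
echo "$newname"                                  # -> c040cced07ab6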

HAV

See my previous post on HAV versus Cloud for an explanation of what HAV means.  Persistent hosts that end users may access directly, without VIPs, can benefit from a sane hostname standard.

Here is an example of a hostname standard that might work for some common infrastructure use cases.

  • Hostnames are accessed using a friendly 13-character CNAME.
  • 15 characters is the full length allotted by the hostname standard.
    • The 14th and 15th characters are only used to identify mgmt and ethernet interfaces.
    • A CNAME should automatically be created for any of the e0 and e7 interfaces, pointing to the hostname that ends at the Instance ID.

Env ID (1) | Location ID (3) | App ID (3) | App Function (3) | Instance ID (3) | Mgmt ID (2)

  • The 15-character format is the full name (see the worked example below).
  • Most physical hosts will have at least 3 interfaces:  2 ethernet interfaces and 1 mgmt interface.
  • Most virtual guest hosts will have at least 1 or 2 virtual ethernet interfaces.
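To make the format concrete, here is a purely illustrative name; every field value below is made up (environment “p”, location “sfo”, application “yod”, function “web”, instance “001”, interface suffix “e0”):

p + sfo + yod + web + 001        ->  psfoyodweb001    (13-character friendly CNAME)
p + sfo + yod + web + 001 + e0   ->  psfoyodweb001e0  (15-character full name)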

Note:  You can go higher than 15 characters for the name, but Windows machines will not like them and unattended installs will in fact fail.  Yes, this still happens in 2012.  Stupid, but there it is.

While doing this work, I compiled what I think is the relevant standards information to guide hostname decisions.  I have added my notes below for your reading pleasure.

DNS Naming Rules:

FORMAT:  HOST-LABEL.SUBDOMAIN-LABEL.TLD

http://tools.ietf.org/html/rfc1035

  • There cannot be more than 127 levels of LABEL
  • Each LABEL has a max of 63 ASCII subset characters.
  • The total number of characters in a FQDN cannot exceed 253 ASCII subset characters including decimal dots.
  • The hierarchy of domains descends from right to left.
  • DNS names may use only a subset of ASCII characters:  A – Z, a – z, 0 – 9, and the hyphen. (The LDH rule:  letters, digits, hyphen)
  • DNS names may NOT start with the hyphen.
  • DNS names are NOT case sensitive and not interpreted as such either (case insensitive)
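Here is a quick shell sketch of how you might check a single label against the LDH rules above.  The regex is my own illustration, not text from the RFC, and it only enforces the per-label rules (1 to 63 characters, letters/digits/hyphen, no leading or trailing hyphen):

label="yodaweb01"
if [[ "$label" =~ ^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$ ]]; then
  echo "valid DNS label"
else
  echo "invalid DNS label"
fi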

LINUX LIMITS

The Single UNIX Specification version 2 (SUSv2) guarantees that ‘Host names are limited to 255 bytes’.  POSIX 1003.1-2001 guarantees that ‘Host names (not including the terminating NULL) are limited to HOST_NAME_MAX bytes’.  On Linux, HOST_NAME_MAX is set to 64 bytes.

(libc defaults to 64.)
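On a Linux host you can confirm that limit for yourself; the 64 shown is simply what my systems report:

getconf HOST_NAME_MAX
# 64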

DNS

  • Minimum name length:  2 characters
  • Maximum name length:  63 characters

MICROSOFT LIMITS

http://support.microsoft.com/kb/909264

DNS

  • Minimum name length:  2 characters
  • Maximum name length:  24 characters

NETBIOS (LEGACY)

  • Minimum name length:  1 character
  • Maximum name length:  15 characters

The maximum length of the host name and of the fully qualified domain name (FQDN) is 63 octets per label and 255 bytes per FQDN. This maximum includes 254 bytes for the FQDN and one byte for the ending dot.


HAV vs. Cloud

I’m tired of the word cloud.  It is so hackneyed at this point that uttering it feels like being a “poser” skateboarder in the late 80’s wearing Vision Street Wear and holding a pristine Gator pro model with rails, copers, and a noseguard.  Sigh.  However, for over a year now I have had to ensure that all my peers at Gap, and those upstream of me, understood the differences between “true” cloud and virtualization.  Infrastructure architectural designs and discussions had to be framed in a way that kept people (even smart people) from being sucked into broad misinterpretations of technology implementations.

Back in July of 2011, my buddy Chris Buben and I sat in the Gap cafeteria, frustrated about how to get everyone on our teams aligned on how to properly consume “cloud”.  We wasted so much time in team meetings correcting people’s widespread misuse of the term that we were determined to fix the vernacular.  One of the biggest impediments to successful technology projects is unclear communication.  Specifically, everyone needs to be aligned on the terms.  This means you need everyone from technology VP down to analyst saying the same things the same way.  We needed a clearer way to mass communicate virtualization versus cloud.  So, we came up with a simple way to quickly and easily differentiate a lot of big infrastructure architectural differences between cloud and pure virtualization.  We started describing two different architectural zones:  one we called High Availability Virtualization, or HAV, and another we simply called cloud.

We wanted to draw a distinction between the two to keep people from thinking that we could simply shove every application we have at Gap into what we considered real cloud architected infrastructure.  The reality is that you cannot.  Not every app is architected, or has code, amenable to real cloud computing.  The sad truth is that many commercially sold applications are not optimized for real cloud.  Case in point:  many Oracle e-business apps (in fact, Oracle seems to go out of their way to make applications difficult to run on anything but Oracle platforms and software… but I digress).  So, we needed a way to give ourselves breathing room to host some apps in a virtualized environment and reap some modern architectural benefits without getting caught up in the “cloud” term abuse game.  HAV became our mantra for everything non-cloud.  You might ask:  why draw a distinction at all?  The answer is that we had our own clear viewpoints about what cloud really was versus just virtualization.  For purposes of this analysis, we are using OpenStack as our cloud (IaaS) platform.  To explain our position, I put together a breakdown of HAV versus Cloud.  Understand that some features are continuously being developed and improved in the cloud arena (especially in networking), so this chart is changing as I type.

HAV Versus Cloud

  • Private Cloud = cheap, fault tolerant by design, disposable, big scale
  • HAV = more expensive, fault tolerance through licensed features, less scalability

PROFILES

HAV (High Availability Virtualization) vs. Cloud (Private IaaS)

Hardware:
  • HAV:  Blade servers or rack servers; commodity hardware (premium or cheap)
  • Cloud:  Rack servers primarily; commodity hardware (cheap)

Storage:
  • HAV:
    • Shared SAN or NAS disk for VMs
    • T1 SAN or T2 NAS tier on iSCSI based block
    • High IOPS performance expectations
  • Cloud:
    • Local disk for VMs
    • iSCSI based block cloud managed volumes
    • Generally expected “lower” IOPS performance

Lifetime:
  • HAV:  Persistent VMs (horizontal scale needs are less)
  • Cloud:  Disposable/ephemeral VMs that scale out

Hypervisors:
  • HAV:  VMware ESXi, RHEV, Citrix XenServer, Hyper-V
  • Cloud:  KVM, Xen (other hypervisors may be supported, but we didn’t want to use them, e.g. ESXi)

Patching:
  • HAV:  Patching needed due to persistent, long-lived VMs
    • WSUS for Windows
    • CM pushed RPM or YUM updates for Linux
  • Cloud:  No “traditional” patching; always dispose and rebuild instantly (see the sketch after this breakdown)
    • High risk vulnerabilities require new images (AMIs) to be built
    • CM pushed RPM or YUM updates for Linux

HA:
  • HAV:
    • Persistent VMs that are re-deployed on failure
    • Clustering used for HA
    • Live Migration
    • Apps behind load balancer VIPs
  • Cloud:
    • On demand new VM instances deployed as needed
    • No reliance on Live Migration
    • Apps behind load balancer VIPs

Networking:
  • HAV (custom networking):
    • Can have active/passive NICs on HAV hosts
    • Can have 2 active NICs on hypervisor hosts
    • Dot1q VLAN trunks to HAV hypervisor hosts
    • Bridged networking between HAV VMs and the network
    • Subnets controlled by VLANs and network L3 switches
    • Default gateways are external L3 VLAN interfaces
  • Cloud (homogenized networking; note: this is changing):
    • 2 or more active NICs required on hypervisor hosts
    • Dot1q VLAN trunks to cloud hypervisor hosts
    • NAT between cloud VMs and the network is typical
    • VLANs/subnets pre-allocated per cloud hypervisor host or by tenant/project
    • Default gateways are virtual gateways on the cloud hypervisor host
    • Default gateways can also be external L3 VLAN interfaces

DHCP:
  • HAV:  Enterprise DHCP servers
  • Cloud:  Local cloud host DHCP servers or enterprise external DHCP servers

DNS:
  • HAV:  Enterprise DNS direct to VMs
  • Cloud:  Cloud host DNS proxy (DNSMasq) to VMs, or enterprise DNS
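As a small illustration of the “dispose and rebuild” patching model in the cloud column above (a sketch only, not our actual tooling; the instance, image, and flavor names are made up), replacing a vulnerable instance with the nova CLI of that era looked roughly like this:

# throw the vulnerable instance away...
nova delete yodaweb-old
# ...and boot a replacement from a freshly built, patched image
nova boot --image yodaweb-image-v2 --flavor m1.small yodaweb-new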

Database Tier

Type:
  • HAV:  Physical and virtual (determined by performance requirements); Linux servers can utilize LXC containers for physical pseudo-virtualization
  • Cloud:  Physical and virtual, predominantly virtual (determined by performance requirements); Linux servers can utilize LXC containers for physical pseudo-virtualization

Hardware:
  • HAV:  Blade servers or rack servers
  • Cloud:  Rack servers predominantly

Storage:
  • HAV:
    • Local disk
    • T1 SAN or T2 NAS
    • Database tier on iSCSI based block
  • Cloud:
    • Local disk
    • iSCSI based block cloud managed volumes

So, why go to all this trouble to break these out?  Because, when trying to place applications in the most appropriate environment for deployment, you need to take all of these factors into consideration.  You may have an application that does not tolerate network address translation, does not work well behind a load balancer, or needs ultra-high IOPS.  In such a case, you may be better served deploying it in an HAV architected infrastructure zone.  However, if you have a modern, truly service oriented architecture (SOA) app that doesn’t require ultra-high IOPS, is designed from the ground up to be stateless, and has built-in failure detection, then cloud architected zones may be perfectly appropriate.

The point is:  don’t just shove every app (commercial or homegrown) into cloud architected zones without understanding the implications of doing so.  You may end up with undesired performance or reliability headaches.  Our goal is to move as much of our application workload as possible to OpenStack based cloud infrastructure zones.  However, on the journey to that nirvana, we have a lot of legacy app crud that just isn’t optimized for true cloud infrastructure.  For these apps, we have opted for an intermediate hop, using an HAV infrastructure zone on the way to cloud.  This will buy us time to either re-architect those apps or simply replace them.  This brings up a larger discussion about proper cloud application design.  In the next post, I will cover months of work by our internal architects and app dev teams at Gap on what comprises “cloud architected” application best practices.