Monday, November 30, 2015

Virtualized Containers: jeVM

Over the last few weeks, I've been talking about container virtualization and the work we're doing here at VMware in this area. Specifically, I have written about both the performance and security aspects, and about why you would even want to virtualize your containers in the first place.

In this blog post, I will talk about an implementation detail of VMware's solution for virtualized containers. I'm writing this because there seems to be some confusion in the market about how and why you would want to virtualize containers, and most of that confusion seems to be based on false assumptions about the implementation details.

I've already talked about the false performance assumptions that people make.  In this case, I'd like to talk about the false deployment assumptions.

When we deploy containers using vSphere Integrated Containers (vIC), we deploy them on a one-to-one basis onto VMs. That is to say, for every container that vIC creates, there is an underlying VM supporting that container. This one-to-one mapping is what gives us the security and isolation benefits we discussed earlier.

However, it is not correct to assume that the VM we create is a regular VM. vIC does not use the normal VM creation process that we use in vSphere. Instead, it sits on top of a technology we internally refer to as "Bonneville." The name is relevant here because the entire purpose of Bonneville is to make VM creation as fast as possible (like the time trials at the salt flats). Because of the way Bonneville works, most of your assumptions about the VM creation process are no longer true.

For example, a VM created by vIC doesn't boot. It forks (using an ESXi feature called "instant clone"). That is to say, it is a child of an already running parent VM, much as a forked process is a child of its parent process. Thus, there is no boot sequence for the child. It simply starts from where the parent left off. The process is very similar in concept to vMotion. We take a fully booted, running copy of Linux (Photon OS in our case) and then we stun it, just as a VM is stunned during vMotion. However, instead of moving the VM to another host, we leave it in this stunned state. We then fork new child VMs off of this parent. Thus, the child starts very quickly (measured in milliseconds). There is no boot time because the guest never boots.
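To make the boot-once, fork-many lifecycle concrete, here is a minimal toy model in Python. It is only a sketch of the idea described above, not how ESXi or instant clone is actually implemented; the names `ParentVM`, `ChildVM` and `fork_child` are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ParentVM:
    """Hypothetical model of the parent VM that children are forked from."""
    boot_count: int = 0
    stunned: bool = False
    state: dict = field(default_factory=dict)

    def boot(self) -> None:
        # The expensive step: it happens once, and only for the parent.
        self.boot_count += 1
        self.state = {"kernel": "loaded", "services": "running"}

    def stun(self) -> None:
        # Freeze the parent in place; it stays on this host.
        if not self.state:
            raise RuntimeError("parent must be booted before it is stunned")
        self.stunned = True


@dataclass
class ChildVM:
    name: str
    state: dict  # starts out equal to the parent's state at the stun point


_ids = count(1)


def fork_child(parent: ParentVM) -> ChildVM:
    """Create a child that resumes from the parent's frozen state (no boot)."""
    if not parent.stunned:
        raise RuntimeError("parent must be stunned before forking")
    # Copying a small dict stands in for inheriting the running guest state.
    return ChildVM(name=f"child-{next(_ids)}", state=dict(parent.state))


parent = ParentVM()
parent.boot()
parent.stun()
children = [fork_child(parent) for _ in range(3)]

print(parent.boot_count)   # the parent booted once; no child ever booted
print(children[0].state)   # each child starts where the parent left off
```

The point of the model is simply that `boot()` is called once no matter how many children are created, which is why child start time does not include a guest boot.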

In addition, this child VM starts with no memory of its own. This is because the child has read-only access to the parent's memory, which means that the parent's RAM is re-used by all the children. When a child modifies a memory page, a copy-on-write operation copies that page to a new page that belongs just to that child. Thus, you get the benefit of shared memory (very high consolidation ratios) but the isolation of a VM. This doesn't really help if you only want one container, but if you want to run hundreds or thousands, there is a significant saving of time and resources compared to regular VMs.
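The copy-on-write sharing described above can be sketched in a few lines of Python. This is a simplified illustration under assumed names (`ParentMemory`, `ChildMemory`) and made-up page contents, not the hypervisor's real page-sharing code: reads fall through to the parent's pages, and a write gives the child a private page without touching the parent.

```python
class ParentMemory:
    """Hypothetical read-only page store standing in for the parent's RAM."""

    def __init__(self, pages):
        self._pages = tuple(pages)  # a tuple, so children cannot modify it

    def read(self, n):
        return self._pages[n]

    def __len__(self):
        return len(self._pages)


class ChildMemory:
    """A child's view of memory: shared parent pages plus private copies."""

    def __init__(self, parent):
        self._parent = parent
        self._private = {}  # page number -> private page, created on write

    def read(self, n):
        if n in self._private:
            return self._private[n]
        return self._parent.read(n)  # untouched pages come from the parent

    def write(self, n, data):
        if not 0 <= n < len(self._parent):
            raise IndexError("page number out of range")
        # Simplification: the whole page is replaced by the new data.
        self._private[n] = data

    def private_pages(self):
        return len(self._private)


parent = ParentMemory([b"kernel", b"libc", b"docker", b"free"])
children = [ChildMemory(parent) for _ in range(100)]
for i, child in enumerate(children):
    child.write(3, f"app-{i}".encode())  # each child dirties one page

shared_total = len(parent) + sum(c.private_pages() for c in children)
unshared_total = len(parent) * len(children)
print(shared_total, unshared_total)
```

In this toy setup, 100 children that each dirty one page need 104 pages in total, against 400 pages if every child carried a full private copy. The saving grows with the number of children and shrinks as each child dirties more pages.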

We refer to this child VM as a "jeVM" or "Just Enough VM" (think the three bears here). Since the jeVM has the properties of a container (nearly instant start and dynamic memory), it's a great way for vIC to build VMs to run Docker containers.

A great side effect of this approach is that, from vSphere's point of view, the jeVMs are REAL VMs and thus can be managed like VMs. This means that your existing tooling, automation and management platforms will continue to work unmodified. The only difference is that now you'll be seeing each new container as a new VM with a name derived from the UUID of the embedded container. So, they're VMs from vSphere's point of view, but they're created in a new and much more efficient way. Thus, the term "jeVM."








