Monday, November 30, 2015

Virtualized Containers: jeVM

Over the last few weeks, I've been talking about container virtualization and the work we're doing here at VMware in this area.  Specifically, I have written about both the performance and security aspects, and about why you would even want to virtualize your containers in the first place.

In this blog post, I will talk about an implementation detail of VMware's solution for virtualized containers.  I'm writing this because there seems to be some confusion in the market about how and why you would want to virtualize containers, and much of that confusion seems to be based on false assumptions about the implementation details.

I've already talked about the false performance assumptions that people make.  In this case, I'd like to talk about the false deployment assumptions.

When we deploy containers using vSphere Integrated Containers (vIC), we deploy them on a one-to-one basis onto VMs.  That is to say, for every container that vIC creates, there is an underlying VM supporting it.  This one-to-one mapping is what gives us the security and isolation benefits we discussed earlier.

However, it is not correct to assume that the VM we create is a regular VM.  Rather than using the normal VM creation process in vSphere, vIC sits on top of a technology we internally refer to as "Bonneville."  The name Bonneville is relevant here because the entire purpose of Bonneville is to make VM creation as fast as possible (like the time trials at the salt flats).  Because of the way Bonneville works, most of your assumptions about the VM creation process no longer hold.

For example, a VM created by vIC doesn't boot.  It forks (using an ESXi feature called "instant clone").  That is to say, it is a child of another parent VM.  Thus, there is no boot sequence for the child.  It simply starts from where the parent left off.  The process is very similar in concept to vMotion.  We take a fully booted, running copy of Linux (Photon OS in our case) and then we stun it.  This is similar to the stun operation that happens during vMotion.  However, instead of moving the VM to another host, we leave it in this stunned state.  We then fork new child VMs off of this parent.  As a result, a child starts very quickly (measured in milliseconds).  There is no boot time because the guest never boots.
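
To make the fork a bit more concrete, here is a rough sketch of what asking vSphere for an instant clone can look like through its public API.  This is purely an illustration using pyVmomi; the host, credentials, and the parent VM name "photon-parent" are placeholders of mine, and vIC drives the equivalent of this internally so you never script it yourself:

    # Illustrative sketch only: fork a child VM from a stunned parent.
    # Host, credentials, and VM names below are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()  # lab use only; verify certs in production
    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local",
                      pwd="secret", sslContext=ctx)
    content = si.RetrieveContent()

    # Find the running, stunned parent VM (Photon OS in Bonneville's case).
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    parent = next(vm for vm in view.view if vm.name == "photon-parent")

    # The child inherits the parent's memory and execution state; no guest boot occurs.
    spec = vim.vm.InstantCloneSpec(name="container-child-01",
                                   location=vim.vm.RelocateSpec())
    task = parent.InstantClone_Task(spec=spec)

    Disconnect(si)

Treat this as a sketch of the concept rather than a recipe; the exact interface vIC uses under the covers is its own business.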

In addition, this child VM has no memory of its own.  This is because the child has read-only access to the parent's memory.  This means that the parent's RAM is re-used by all the children.  When a child modifies a memory page, a copy-on-write operation is performed to a new page that belongs just to that child.  Thus, you get the benefit of shared memory (very high consolidation ratios) with the isolation of a VM.  This doesn't really help if you only want one container, but if you want to run hundreds or thousands, there is a significant saving of time and resources compared to regular VMs.
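
If the copy-on-write behavior sounds familiar, that's because it's the same trick a Unix kernel plays when a process forks: the child shares the parent's pages read-only until one side writes, and only the written page gets copied.  Here is a tiny process-level analogy in Python (an analogy for the concept only, not what ESXi does internally):

    # Analogy only: process fork() shares pages copy-on-write, much as a child VM
    # shares its stunned parent's memory until a page is modified.
    import os

    data = bytearray(64 * 1024 * 1024)   # 64 MB the parent has already touched

    pid = os.fork()
    if pid == 0:
        # Child: starts instantly, sharing the parent's pages.
        # Writing triggers copy-on-write; only the touched page is duplicated.
        data[0] = 1
        os._exit(0)
    else:
        os.waitpid(pid, 0)               # parent waits for the child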

We refer to this child VM as a "jeVM," or "Just Enough VM" (think of the three bears here).  Since the jeVM has the properties of a container (nearly instant start and dynamic memory), it's a great way for vIC to build VMs to run Docker containers.

A great side effect of this approach is that, from vSphere's point of view, the jeVMs are real VMs and thus can be managed like VMs.  This means that your existing tooling, automation, and management platforms will continue to work unmodified.  The only difference is that you'll now see each new container as a new VM with a name derived from the UUID of the embedded container.  So, they're VMs from vSphere's point of view, but they're created in a new and much more efficient way.  Thus, the term "jeVM."
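
For example, because each container surfaces as an ordinary VM object, anything that can enumerate VMs through the vSphere API will see it.  Here is a small pyVmomi sketch that lists them; it assumes, purely for illustration, that the container-backed VM names contain the container UUID, so the regex and placeholders below are mine, not a vIC contract:

    # Illustrative sketch: container-backed VMs show up like any other VM.
    import re
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()
    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local",
                      pwd="secret", sslContext=ctx)
    content = si.RetrieveContent()

    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)

    uuid_like = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}", re.I)  # assumed naming pattern
    for vm in view.view:
        if uuid_like.search(vm.name):
            # Standard VM properties apply: power state, memory, host, and so on.
            print(vm.name, vm.runtime.powerState)

    Disconnect(si)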

Friday, November 20, 2015

Virtualized Containers: Security

Like my last post, this post is about virtualized containers.  If you're not sure what I'm talking about, it's the notion we have at VMware that we can run containers on ESXi just as well as, if not better than, on a "bare metal" Linux host.

While this notion is controversial to some in the community, for us it is just a natural extension of what we already do.  Because we virtualize a huge range of workloads, we already have experience with a wide range of performance, security, and operational requirements.  Thus, adding a new workload is usually not a big deal.

We've already talked about performance, so let's talk about security.

One of the very nice things about virtualization is that it allows you to do things like micro-segmentation.  That is to say, it allows you to put very fine-grained controls on your infrastructure so that a compromise in one area is less likely to spread to other areas of your infrastructure.  Sometimes this is also referred to as "defense in depth."  However you describe it, it's clear that it's better to have multiple barriers against a bad actor instead of just a "crunchy shell with a gooey center" approach.

Whenever I think about defense in depth, I think about the medieval town of Entrevaux in France.  This town has a series of walls, gates, redoubts and a final bastion at the top of a hill.  Anybody trying to storm that town with crossbows, trebuchets and swords would have a tough time of it.

Short of stone walls, how do you make yourself safe in the modern world?  Well, we still have walls.  These days, they're virtual firewalls constructed to keep out the virtual bad guys, but the concept is still the same: keep them out of your town, and if they do get in, make their life a living hell and make them pay a price in blood for every foot they advance.  In the virtual world, there are tons of ways to do this, including our own NSX product.

Regardless of how you choose to implement your containerized application, you will need some sort of strategy for containing bad actors.  If you have already solved this problem for your virtualized infrastructure, you can simply re-use that solution when you virtualize your containers as well.  If not, you're going to need a new set of tools.

Another consideration is attack surface.  One interesting side effect of a single-purpose operating system is a very small attack surface.  Logically, you would assume that a single-purpose operating system like ESXi would have a much smaller attack surface than a general-purpose operating system like Linux.  In fact, the data backs up this assumption:


Based on data from the website CVE Details, ESXi has a much lower number of reported vulnerabilities than operating systems like Linux or Windows.  In fact, the attack surface is anywhere from 10x to 100x smaller depending on how you measure it and which operating system you pick.  I'm not trying to throw stones here.  I used to work on Windows, and I can assure you that security is very important to Microsoft.  You can also see that distros like RHEL are doing a great job of making sure they're fully patched before they ship.  However, a smaller OS like ESXi has some architectural advantages, and this is one of them.

When you think about running workloads in large production environments, you have to assume that a bad actor is going to find their way into your environment sooner or later.  No matter how good you are, you will get hacked.  Thus, running with a least-privilege model and improving your odds of containing the damage with things like micro-segmentation and a small attack surface seems like a logical precaution.

Naturally, if the underlying platform won't run the workload you need or has crappy performance, none of this matters.  Fortunately, with things like vIC and Photon Platform, that isn't the case for virtualized container workloads.

Tuesday, November 17, 2015

Virtualized Containers: Performance

It is always fun to watch pundits and industry analysts proclaim the death of this technology or that technology.  This is something that happens regularly enough that it's become a bit of a sport.

In my case, because I work at VMware, I often get told that containerization is the end of virtualization.  This is a very interesting contention that doesn't seem to be based on any sort of factual evidence.  The conversation usually starts with me showing our new Photon Platform or vSphere Integrated Containers (vIC) product.  Fundamentally, both of these products allow you to run containers on ESXi.  vIC allows you to run Docker natively in vSphere, and Photon Platform is a new platform for running containers and cloud native workloads.  Or to put it another way, they allow you to virtualize your containers.  After I show the demo, there's a polite pause.  Then comes "the question" that I always expect: "Why would I want to virtualize my containers?  Everyone knows that containers run best on bare metal."

The really fun part is that when you press them, there isn't any sort of basis for the assertion.  Mostly, the objection seems to come down to performance.  They want to run containers on bare metal Linux because "it's faster."  Again, we ask what that assertion is based on.  The clever folks will Google it for you and usually come up with this paper from IBM.  While this paper is fascinating and very well researched and documented, there is one fatal flaw: it's based on KVM.  What IBM is really saying is that bare metal containers are faster than KVM.  Well, OK.  I'll let the KVM folks answer that one.  I don't work on KVM; I work on ESXi.  The thing is, we know that ESXi is significantly faster than KVM for some workloads.  See this or this.  So, what does that mean?  It means that we don't actually learn anything about ESXi from this report.

Thankfully, VMware has done its own research, which found that for smaller workloads there is very little virtualization overhead on ESXi.  Interestingly, it also found that some workloads, like Redis, actually run FASTER when you virtualize them.


So, where does the truth lie?  As is usually the case, your mileage will vary.  If performance is a key issue for you, test your workload and draw your own conclusions.

Just don't tell me that "everyone knows" that containers run slower when virtualized.  That just ain't so.