Monday, November 30, 2015

Virtualized Containers: jeVM

Over the last few weeks, I've been writing about container virtualization and the work we're doing here at VMware in this area.  Specifically, I have written about both the performance and security aspects and why you would even want to virtualize your containers in the first place.

In this blog post, I will talk about an implementation detail of VMware's solution for virtualized containers.  I'm writing this because there seems to be some confusion in the market about how and why you would virtualize containers, and much of that confusion rests on false assumptions about the implementation details.

I've already talked about the false performance assumptions that people make.  In this case, I'd like to talk about the false deployment assumptions.

When we deploy containers using vSphere Integrated Containers (vIC), we deploy them on a one-to-one basis onto VMs.  That is to say, for every container that vIC creates, there is an underlying VM supporting that container.  This one-to-one mapping is what gives us the security and isolation benefits we discussed earlier.

However, it is not correct to assume that the VM we create is a regular VM.  Rather than going through the normal vSphere VM creation process, vIC sits on top of a technology we internally refer to as "Bonneville."  The name Bonneville is relevant here because the entire purpose of Bonneville is to make VM creation as fast as possible (think of the time trials at the salt flats).  Because of the way Bonneville works, most of your assumptions about the VM creation process no longer hold.

For example, a VM created by vIC doesn't boot.  It forks (using an ESXi feature called "instant clone").  That is to say, it is a child of another parent VM.  Thus, there is no boot sequence for the child.  It simply starts from where the parent left off.  The process is very similar in concept to vMotion.  We take a fully booted, running copy of Linux (Photon OS in our case) and then we stun it.  This is similar to the stun operation that happens during vMotion.  However, instead of moving the VM to another host, we leave it in this stunned state.  We then fork new child VMs off of this parent.  Thus, the child starts very quickly (measured in milliseconds).  There is no boot time because the guest never boots.
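If you want an intuition for the fork model, a loose analogy (and only an analogy; this is a Linux process fork, not the ESXi instant clone API) is how a forked process starts from exactly where its parent left off, with no initialization of its own:

    import os
    import time

    # The parent pays the expensive initialization cost exactly once
    # (think of this as the one-time "boot" of the parent template).
    print("parent: initializing...")
    time.sleep(2)
    state = {"booted_at": time.time()}

    start = time.time()
    pid = os.fork()          # the child appears already "running"
    if pid == 0:
        # Child: no re-initialization; it inherits the parent's state
        # and continues from the point of the fork.
        print(f"child: started in {time.time() - start:.4f}s, "
              f"inherited booted_at={state['booted_at']:.0f}")
        os._exit(0)
    else:
        os.waitpid(pid, 0)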

In addition, this child VM has no memory of its own.  This is because the child has read-only access to the parent's memory, which means the parent's RAM is re-used by all the children.  When a child modifies a memory page, a copy-on-write operation copies that page to a new one that belongs only to that child.  Thus, you get the benefit of shared memory (very high consolidation ratios) with the isolation of a VM.  This doesn't really help if you only want one container, but if you want to run hundreds or thousands, there is a significant saving of time and resources compared to regular VMs.
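To see why this matters at scale, here is a back-of-the-envelope sketch (the numbers are purely hypothetical, not measured vIC figures) of how copy-on-write sharing changes the memory bill for a large number of children:

    # Hypothetical sizes, for illustration only.
    PARENT_RAM_MB = 512      # fully booted parent template
    PRIVATE_MB = 64          # pages a typical child actually dirties
    N_CHILDREN = 1000

    full_clones = N_CHILDREN * PARENT_RAM_MB                 # every VM carries its own copy
    cow_children = PARENT_RAM_MB + N_CHILDREN * PRIVATE_MB   # children share the parent's pages

    print(f"{N_CHILDREN} regular VMs: {full_clones / 1024:.0f} GB of RAM")
    print(f"{N_CHILDREN} jeVMs:       {cow_children / 1024:.0f} GB of RAM")
    # With these made-up numbers: 500 GB vs. roughly 63 GB.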

We refer to this child VM as a "jeVM" or "Just Enough VM" (think the three bears here).  Since the jeVM has the properties of a container (nearly instant start and dynamic memory), it's a great way for vIC to build VMs to run Docker containers.

A great side effect of this approach is that from vSphere's point of view, the jeVMs are real VMs and thus can be managed like VMs.  This means that your existing tooling, automation and management platforms will continue to work unmodified.  The only difference is that now you'll be seeing each new container as a new VM with a name derived from the UUID of the embedded container.  So, they're VMs from vSphere's point of view, but they're created in a new and much more efficient way.  Thus, the term "jeVM."
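For example, because jeVMs are ordinary VM objects to vSphere, a plain inventory script written against pyVmomi, like the sketch below, will list them right alongside your other VMs (the hostname and credentials are placeholders; there is nothing vIC-specific in this code):

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()            # lab use only
    si = SmartConnect(host="vcenter.example.com",     # placeholder vCenter
                      user="administrator@vsphere.local",
                      pwd="********",
                      sslContext=ctx)
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)

    for vm in view.view:
        # Container VMs simply show up here with their container-derived names.
        print(vm.name, vm.runtime.powerState)

    Disconnect(si)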









Friday, November 20, 2015

Virtualized Containers: Security

Like my last post, this post is about virtualized containers.  If you're not sure what I'm talking about, it's the notion we have at VMware that we can run containers on ESXi just as well as, if not better than, on a "bare metal" Linux host.

While this notion is controversial to some in the community, for us this is just a natural extension of what we already do.  Because we already virtualize a huge range of workloads, we have experience with a wide range of performance, security and operational requirements.  Thus, adding a new workload is usually not a big deal.

We've already talked about performance, so let's talk about security.

One of the very nice things about virtualization is that it allows you to do things like micro-segmentation.  That is to say, it allows you to provide very fine-grained controls on your infrastructure so that a compromise in one area is less likely to spread to other areas of your infrastructure.  Sometimes this is also referred to as "defense in depth."  However you describe it, it's clear that it's better to have multiple barriers to a bad actor instead of just a "crunchy shell with a gooey center" approach.

Whenever I think about defense in depth, I think about the medieval town of Entrevaux in France.  This town has a series of walls, gates, redoubts and a final bastion at the top of a hill.  Anybody trying to storm that town with crossbows, trebuchets and swords would have a tough time of it.

Short of stone walls, how do you make yourself safe in the modern world?  Well, we still have walls.  These days, they're virtual firewalls constructed to keep out the virtual bad guys, but the concept is still the same.  Keep them out of your town, and if they get in, make their life a living hell and make them pay a price in blood for every foot they advance.  In the virtual world, there are tons of ways to do this, including our own NSX product.

Regardless of how you choose to implement your containerized application, you will need some sort of strategy for containing bad actors.  If you have already solved this problem for your virtualized infrastructure, you can simply re-use that solution when you virtualize your containers as well.  If not, you're going to need a new set of tools.

Another consideration is attack surface.  One interesting side effect of a single-purpose operating system is a very small attack surface.  Logically, you would assume that a single-purpose operating system like ESXi would have a much smaller attack surface than a general-purpose operating system like Linux.  In fact, the data backs up this assumption:


Based on data from the website CVE Details, ESXi has far fewer reported vulnerabilities than operating systems like Linux or Windows.  In fact, the attack surface is anywhere from 10x to 100x smaller depending on how you measure it and which operating system you pick.  I'm not trying to throw stones here.  I used to work on Windows, and I can assure you that security is very important to Microsoft.  You can also see that distros like RHEL are doing a great job of making sure they're fully patched before they ship.  However, a smaller OS like ESXi has some architectural advantages, and this is one of them.

When you think about running workloads in large production environments, you have to assume that a bad actor is going to find their way into your environment sooner or later.  No matter how good you are, you will get hacked.  Thus, running with a least-privilege model and minimizing the potential damage with things like micro-segmentation and small attack surfaces seems like a logical precaution.

Naturally, if the underlying platform won't run the workload you need or has crappy performance, none of this matters.  Fortunately, with things like vIC and Photon Platform, this isn't the case for virtualizing containerized workloads.




Tuesday, November 17, 2015

Virtualized Containers: Performance

It is always fun to watch pundits and industry analysts proclaim the death of this technology or that technology.  This is something that happens regularly enough that it's become a bit of a sport.

In my case, because I work at VMware, I often get told that containerization is the end of virtualization.  This is a very interesting contention that doesn't seem to be based on any sort of factual evidence.  The conversation usually starts out with me showing our new Photon Platform or vSphere Integrated Containers (vIC) product.  Fundamentally, both of these products allow you to run containers on ESXi.  vIC allows you to run Docker natively in vSphere, and Photon Platform is a new platform for running containers and cloud native workloads.  Or to put it another way, they allow you to virtualize your containers.  After I show the demo, there's a polite pause.  Then comes "the question" that I always expect.  They ask, "Why would I want to virtualize my containers?  Everyone knows that containers run best on bare metal."

The really fun part is that when you press, there isn't any sort of basis for the assertion.  Mostly, the objection seems to come down to performance.  They want to run containers on bare metal Linux because "it's faster."  Again, we ask what that assertion is based on.  The really clever folks will Google it for you and usually come up with this paper from IBM.  While this paper is fascinating and very well researched and documented, there is one fatal flaw:  it's based on KVM.  What IBM is really saying is that bare metal containers are faster than KVM.  Well, OK.  I'll let the KVM folks answer that one.  I don't work on KVM, I work on ESXi.  The thing is, we know that ESXi is significantly faster than KVM for some workloads.  See this or this.  So, what does that mean?  It means that we don't actually know anything about ESXi based on this report.

Thankfully, VMware has done its own research.  The team found that for smaller workloads, there is very little virtualization overhead on ESXi.  In an interesting development, they also found that some workloads, like Redis, actually run FASTER when you virtualize them.


So, where does the truth lie?  As is usually the case, your mileage will vary.  If performance is a key issue for you, test your workload and draw your own conclusions.
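If you do decide to test it yourself, the harness doesn't need to be fancy.  Something like the sketch below, run unchanged on both setups, gives you comparable numbers (the workload function is a placeholder you'd swap for your own Redis run, build job, or whatever you actually care about):

    import statistics
    import time

    def workload():
        # Placeholder: replace with the thing you actually run in production.
        sum(i * i for i in range(1_000_000))

    def measure(runs=10):
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            workload()
            samples.append(time.perf_counter() - start)
        return statistics.median(samples), statistics.stdev(samples)

    median, stdev = measure()
    print(f"median {median:.3f}s, stdev {stdev:.3f}s")
    # Run the same script on bare metal and in a virtualized container,
    # then compare medians rather than trusting a single noisy run.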

Just don't tell me that "everyone knows" that containers run slower when virtualized.  That just ain't so.



Tuesday, October 27, 2015

CNA in the Enterprise?

Following up from my post about joining the Cloud Native team at VMware, I'd like to transition to the challenge that we've decided to take on:  Moving Cloud Native into the Enterprise.

Like IaaS and other cloud technologies, the idea of enterprises running cloud native applications is all the rage right now.  Every CIO and every IT architect I talk to has some plan or research project going on around Docker, Kubernetes, Mesos or some other aspect of the cloud native stack.  This is awesome for a product manager like me because it means everyone is interested in what I'm working on.  It's very easy to draw a crowd by showing off vSphere Integrated Containers or one of the other cloud native projects we're working on.

Unfortunately, the current state of the market is not very mature.  In fact, it very much reminds me of the awesome Dan Ariely quote about big data:

"Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it..."


You could basically insert "Docker" or "Cloud Native" in place of "big data" in the quote above and you wouldn't be far wrong.  Similar to big data, there are some folks out there doing amazing work.  However, for most mere mortals, this is a completely theoretical discussion at this time.  Pretty much like every locker room conversation you had in middle school.

The reality is that making Cloud Native work in the enterprise is going to be VERY VERY hard.  There are some very significant barriers to entry here:

  1. There are millions of lines of code running key business functions that will need to be completely replaced.  You cannot simply port existing code to cloud native; you need to re-write the application and re-architect it to get the benefits.
  2. Your existing development process will have to be abandoned.  The traditional dev/test/stage/prod cycle you are using now will not cut the mustard.  One of the key values you get from Cloud Native is speed and agility.  That won't happen without changing out your processes.
  3. Cloud native assumes a robust PaaS layer.  If you look closely, there is a massive "hand wave" in most architectures when it comes to persistence.  Nobody runs databases in containers.  This is because platforms like AWS already have a great PaaS layer.  Why build your own object store when you have S3 (see the short sketch after this list)?  Do you have a full PaaS stack inside your firewall?
  4. Your current security and IT governance policies will have to be re-written.  It is extremely unlikely that your current policy and governance structure was designed for the type of rapid iteration and agility implied by cloud native.  Do you have an Architectural Review Board?  Forget about that.  Centralized governance model?  Nope.
  5. Deploying within your firewall assumes you already have a robust PaaS and IaaS platform.  Do you?  Our research shows fewer than 10% of enterprise IT shops have a fully deployed IaaS platform.
  6. Deploying outside your firewall (AWS, GCE, Azure, etc.) assumes that you have policies and procedures for managing things like data protection, security, etc. in a public cloud context.  Do you?
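To make point 3 concrete, here is roughly what persistence looks like when the PaaS already exists; a hedged sketch using boto3 against S3 (the bucket and key names are made up).  The question for the enterprise is what these few lines would call inside your firewall:

    import boto3

    s3 = boto3.client("s3")    # credentials come from the usual AWS config

    # Persisting state outside the container is trivial when the service exists.
    s3.put_object(Bucket="my-app-state",                # hypothetical bucket
                  Key="orders/2015-10-27.json",
                  Body=b'{"orders": []}')

    obj = s3.get_object(Bucket="my-app-state", Key="orders/2015-10-27.json")
    print(obj["Body"].read())
    # Inside the firewall, what do these two calls talk to?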

Given the issues above, is it even practical to deploy or build cloud native applications in my enterprise?  Most definitely yes.  The reality is that cloud native solves so many problems with traditional application development that it's worth all the hassle of changing out your existing infrastructure, processes and policies.  However, you should begin this journey with open eyes.  Like every other technical project you've ever done, this one will have issues, complications and unforeseen consequences.

Assuming that you're not going to re-write your existing line-of-business apps for fun, my assumption is that your first cloud native application will be a net new application.  It's going to be much easier to apply this toolkit to a new development effort than to retrofit an existing application.  Therefore, you have the opportunity to create a new team with a new set of dev tools and processes.  Thus, you are really talking about a net new team writing a new application.  Setting aside the policy and procedure issues from above (not because they're trivial, but because I can't generalize them), your biggest problem comes down to PaaS.

It turns out that running cloud native assumes you already have a cloud and all of the fun things that this implies.  If you already have a robust IaaS and PaaS environment running, you're good to go.

Unfortunately, we know that for most customers this is simply not true.  We know that you don't have this infrastructure running because we've spent the last year talking to literally hundreds of enterprise customers about this very topic.

The next thing that usually happens is that engineering teams go off and get an account on AWS, GCE, Azure or another cloud provider service.  This is awesome because it allows them to begin writing code almost immediately.  All the services that we've already talked about exist up there and they work great.  They code away like mad and everyone is happy.  Right?

Well, not really.  If you are a software company and product velocity is the only thing that matters, this may be a very acceptable state of affairs.  However, if other things matter to you, like data sovereignty, multi-sourcing, ROI, etc., then you might not be OK with being locked into a single provider.  This means that you don't want to consume PaaS services that are limited to a single cloud.

Thus, we get back to creating your own PaaS layer.

I believe that this will be the single greatest stumbling block for enterprises who seek to adopt cloud native.

This is one reason why our team is working on tools and services to make this simpler.  We need a way to deploy "just enough PaaS" quickly and easily inside the firewall and outside the firewall.  We need a consistent services layer that will run on any cloud.  We need a single management interface for all of those instances running across any cloud.

Hmmm.

Sounds like a good idea for a product, right?

Stay tuned.......











Monday, October 26, 2015

My Move to Cloud Native

In late July, I decided to move back to VMware.  There were many reasons for this change, but one of them was the opportunity to work in a group within VMware called "Cloud Native Applications" or CNA.  While I've been a "cloud" guy for years, I've mostly worked on IaaS.  I was on the public IaaS team at Oracle Public Cloud, but I worked primarily in the private cloud space at VMware and NetApp.  Thus, CNA is adjacent to what I was doing but very different in practice.  Naturally, I have been asked many times to explain exactly what CNA is and what our mission in life is.  In order to do that, I had to define exactly what we mean when we say cloud native, and I thought that this would be a good topic for a blog post.

The very short answer is that when we say Cloud Native Applications, we mean a class of applications that were born in the cloud and natively use the infrastructure services of the cloud.  However, that answer doesn't really give you a good picture of how fundamentally different CNA is from the rest of VMware.

As a Product Manager who owned features within vSphere, I can tell you that vSphere is a great business for VMware and probably will be for the foreseeable future.  It is one of those rare products that was so good, it created an industry.  Like Windows or Linux, there is an entire ecosystem around vSphere and its component technologies.  As you can imagine, there is a great deal of pressure as a member of the vSphere team not to mess that up.  When you don't take your time and things slip, people notice.  Like front-page Wall Street Journal notice.  Thus, there are all kinds of systems and processes in place within vSphere to make sure that mistakes are not made and to address any issues that do come up.  This is a good thing, and I think that we're doing a pretty good job keeping that product stable while introducing new features at a measured, steady pace.

However, there are other markets that are moving much more quickly.  In those cases, the type of process-driven organization that produces a product like vSphere cannot keep pace.  You cannot have one team working on a very mature product like vSphere and also going after a brand new, fast-moving market.  Hence the creation of the CNA team at VMware.  Our mission is to serve the cloud native market and to introduce new products, platforms and services for this market.

Naturally, because this market is so new, there is some debate about what this means and how to go after it.  Our first challenge as a team was to identify why this market was different and how customers in this market make decisions about what technologies to use.  It has become abundantly clear to us that, unlike vSphere, CNA is fundamentally developer driven.  The move to cloud is largely driven by developers seeking to solve their development, deployment and operations problems by getting away from traditional architectures and infrastructures.

Thus, the fundamental truth about cloud native is that it revolves around an architectural paradigm.  The notion that we need to change the way we design, build and operate is what lies at the heart of what we mean when we say cloud native.  This paradigm shift then drives technologies to meet the new requirements of these applications.  Note the cause-effect relationship here.  Different REQUIREMENTS are driving different TECHNOLOGIES, not the other way round.  This important distinction is lost on some in the industry and will ultimately lead them into bad decisions and products.  Thus, things like Docker are very interesting because of the problems they solve, not as technology per se.

So, how do we define this new architectural paradigm?  One popular way is the "12 factor app" as defined at 12factor.net.  While this site is super detailed and provides everything you'd want to know about creating your own cloud native application, it may be TOO detailed for some.  There are a couple of things we can summarize out of this and generalize for our use.

Fundamentally, Cloud Native is about four requirements or assumptions for building applications:

  1. Cloud Native implies that applications are built with the assumption that infrastructure is inherently unreliable.  Availability, data replication and other resilience features are moved out of the infrastructure and into the application (see the sketch after this list).
  2. Cloud Native assumes that the infrastructure is "Software Defined" and thus fungible.  The assumption is that I can re-configure my infrastructure simply via an API call and thus control the underlying platform from within the application.  
  3. Cloud Native assumes that applications need to scale to internet scale.  That is to say, Cloud Native applications should inherently "scale out" instead of "scale up" because you eventually run out of "up".
  4. Cloud Native assumes that technology velocity is more important than ROI.  Yes, cost is always a factor in the real world, but there are always trade-offs.  If your goal is to be the lowest cost provider, you are unlikely to be on the leading edge.  Thus, when you optimize for speed, you may incur additional cost and that's OK.
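As a tiny illustration of requirement 1 (a sketch, not a framework recommendation), moving availability into the application often starts with something as mundane as retrying idempotent calls with backoff, instead of assuming the infrastructure never fails:

    import random
    import time

    def call_with_retries(fn, attempts=5, base_delay=0.1):
        # Retry an idempotent call; the application, not the infrastructure,
        # owns availability.
        for attempt in range(attempts):
            try:
                return fn()
            except ConnectionError:
                if attempt == attempts - 1:
                    raise
                # Exponential backoff with jitter so retries don't stampede.
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.05))

    def flaky_service():
        # Stand-in for a dependency running on unreliable infrastructure.
        if random.random() < 0.5:
            raise ConnectionError("node went away")
        return "ok"

    print(call_with_retries(flaky_service))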

Here is the really interesting thing:  ALL my customers say they want cloud native, but MOST of them are not willing to accept the assumptions and requirements above.

Similar to what happened a few years ago when IaaS started to penetrate the enterprise, there is a huge amount of "buzz" around cloud native and all things cloud within enterprise today.  And just like I did five years ago, I am talking to customers about why this ISN'T a good idea for them.  At least right now.

I cannot tell you how many companies I have worked with around IaaS only to see them fail utterly.  This is not to say that their cloud didn't work.  The technology works just fine, thank you very much.  The problem is that they did not change the way IT worked.  Thus, the "impedance mismatch" between what IT saw as the proper way to do things and the way their internal customers wanted to consume cloud services was an impossible barrier to overcome.

The first question you must ask yourself as an IT leader is:  "What business am I in today and what business do I want to be in tomorrow?"  I think the last item on the list above is probably the killer for most companies.  If your company views IT as a cost center which needs to be controlled, you can never adopt the same methodologies and architectures as companies like Twitter and Google, where technology is a profit center.  Never.

On the other hand, if those requirements match your requirements, there are some really interesting things that can be done with the technology stack that has been developed by those companies.  There are lessons to be learned about the way to scale out huge web farms and run large businesses on untrusted platforms.  When those lessons are applied, there are some architectural patterns that stand out:

  1. Containerized workloads.  This is not the only way to build cloud native applications, but it is the dominant trend and we think it’s one of the better ways.
  2. Stateless workloads.  Not to say that state is never persisted anywhere; what we mean is that no individual host or VM should hold state that can’t be recovered easily.  To put it another way, I don’t “fix” the container or the VM, I kill it and start another (see the sketch after this list).
  3. Language and framework agnostic.  The rise of REST and things like Zookeeper and Etcd have made it possible to integrate components that were not developed together and have no common technology platform.  Thus, it becomes "run what you brung" and language becomes a developer choice.
  4. Continuous Integration.  If speed is your #1 driver, then traditional dev/test/stage/prod cycles are not going to work for you.  Naturally, that means you need to have a programmable infrastructure, but we stated that in our requirements section.
  5. REST.  The reality is that everyone is moving to REST.  Things like SOAP and RPC are not fast enough and are too "leaky" to maintain the level of abstraction and the scale required for cloud native.
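Putting patterns 2 and 5 together, here is a deliberately minimal sketch of a stateless REST endpoint using only Python's standard library (in practice you'd use whatever framework your team prefers).  Because the instance holds nothing it can't afford to lose, you can kill the process and start another without losing anything:

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # "State" would live in an external store (database, object store, etc.);
    # this dict is only a stand-in so the sketch runs on its own.
    external_store = {"greeting": "hello"}

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Nothing instance-local is consulted, so any replica can answer
            # and killing this process loses nothing.
            body = json.dumps(external_store).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()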

Thus, we see that these technologies are interesting because they make it simpler to achieve our stated objectives and meet the requirements above.  This also means that while these are currently the way we are doing things, there is no guarantee that this will not change.  Thus, there is no point in creating a "Docker" product group within VMware.  We also urge our customers to focus on requirements and let technology decisions flow from there.  Yes, we love and support Docker here at VMware's CNA group; no, we don't think this is the ONLY way to go about things.  Our focus here is to help our customers develop, build and deploy cloud native applications, regardless of technology or platform.



 




Sunday, August 23, 2015

What is disruption, anyway?

Our business is a funny business.  Although we all claim to be logical, data-driven engineers, the reality is that we are an industry driven by emotion, not logic.  Sometimes I feel like I'm on the set of "The Apprentice" instead of working for a real company.

There is a constant drive for attention, to stand out in a very crowded field.  This is understandable: quietly building an awesome product is a great way to starve to death.  The side effect of this is that everyone tries to adopt the latest trend and make crazy claims that just aren't substantiated in the real world.

Two things that you hear all the time are that a product is "disruptive" or "innovative."  In fact, there has recently been a spat between some good friends of mine about how "innovative" they are.  It's funny to watch, but probably not doing anyone any good.  To an architect, "innovative" is a throwaway word.  An architect is not really worried about innovation.  Either the technology is a fit for the problem at hand or it is not.  Old tech is awesome and safe.  If it works, that's a great thing; move on to something that's broken.

On the other hand, disruption is something you really have to watch as an architect.  If you are participating in a market that's being disrupted or about to be disrupted, this must play into your planning in the long term if not in the near term.  Thus, when someone says that their product is "disruptive" this is a claim you really need to test.  True disruption is very rare but extremely dangerous.  Strangely, many companies wear this term as a badge of honor.  In reality, it is a risky move that I try to avoid if at all possible. 

If you look up the definition of the word "disruption" at dictionary.com, you get:
  • Business. a radical change in an industry, business strategy, etc., especially involving the introduction of a new product or service that creates a new market.
Thus, to be truly disruptive, you must create a new market.  Of course, defining a market is a tricky thing and I'm sure this is what economists argue about over beers.  I don't really know.  From our non-technical perspective, a market is simply a group of buyers and sellers who are competing.  Thus, you could view cars as a single market but in reality it is actually many smaller markets.  Nobody cross-shops a BMW 740iL with a Chevy Spark.  Well, I hope they don't.

Thus, we have things in our business that are truly disruptive, like SDS, and things that are just innovative, like flash.  That last statement is what Product Managers argue about over beers.  On this one, I'm certain because I've had that argument over beers more than once and I am a Product Manager.

Let's look at SDS and flash and talk about why they are or are not disruptive.  Starting with flash (and by flash, I mean solid state storage as opposed to rotating media):  within the enterprise storage space, there are a ton of people who will tell you that flash is a very disruptive technology.  As proof, they point to a host of new startups who are using flash-based devices to carve out a new market.  Since there is a new market for all-flash arrays, the argument goes, flash is by definition disruptive.  I think that there is something deeper that must be examined before we can declare a technology disruptive.  There needs to be a massive change associated with the technology first.  To use the definition of the word, a "radical" change.  Is flash in the form of an all-flash array truly radical?  In my view, it is not.  All-flash arrays are better in many (if not most) ways than traditional magnetic-media-based arrays, but they're not really that different from the outside.  They are still presenting disks, objects or files.  Faster to be sure, but not functionally different from what we've had before.

On the other hand, Software Defined Storage (or SDS) is a very fundamental change in the way systems are architected.  In effect, SDS moves the point of control away from the storage admin and towards the infrastructure consumer.  This accelerates the trend of collapsed operations teams and allows things like dynamic services composition for storage, which simply isn't possible with traditional storage architectures.  The repercussion of this is that we can now offer our internal customers a highly differentiated service offering.  That is to say, we move from a legacy "one size fits all" architecture to a modern cloud-style services economy where customers have choice in the services they consume.  This is a very "radical" change by anyone's standard.

So, why does this distinction matter to the architect?  It matters because this highly disruptive change will affect the way we design our systems and processes.  A fundamental shift in the way we manage and operate our systems, like cloud, will need to be factored in.  We have seen many enterprises attempt to simply port existing systems and processes to cloud architectures.  These efforts have been less than successful in most cases.  Yes, you CAN run the system, but you will not achieve the benefits that cloud brings without a significant amount of re-architecting. 

Luckily for us, really disruptive changes don't happen very often in the enterprise IT world.  Enterprise IT is slow to change, and many of the fads that rip through the industry simply pass us by and never get adopted.  Looking back at my 20-plus years, I can only think of a few truly disruptive technologies to hit the enterprise IT world.  The shift from centralized computing to distributed computing in the early nineties was extremely disruptive, and many of the big enterprise companies of the day (Wang, Tandem, NCR, Amdahl, etc.) have either ceased to exist or faded to a pale shadow of their former selves.  Now, the cloud era is forcing a change back to centralized compute, and the enterprise market is changing the game again.  Salesforce and Amazon, the #1 and #2 cloud companies by revenue, did not exist before the cloud, and I'm sure further disruption will occur before this round is done.

Disruptions like cloud force us to re-consider our options.  When my old company bought a minicomputer from DEC in the late '80s, this seemed a very safe choice.  They were one of the most trusted names in enterprise computing at the time.  Today, they don't exist.  While the changes to our industry can be unexpected, a large company like DEC doesn't fail overnight.  When you take a look at your suppliers, their overall financial health is an important part of the picture.  If they are making you a deal that seems too good to be true, this may tell you something about their long-term viability as a company.

It is the Architect's responsibility to review not only the technical viability of the proposed solution but also the business viability.  If you are building a major new system that depends on vendor support, you need to have a plan B to deal with a failed vendor.  People talk about avoiding "lock in" all the time, but very rarely do they achieve it.  Truly dual-sourced systems are very expensive because they require everything to interoperate.  It's usually cheaper to build a single-source system, but there is some business risk that must be mitigated on the back end.

If you are working with a "disruptive" company or technology, ask yourself this:  who are their competitors?  Is this technology likely to put them out of business?  How big is this company?  Is it likely they will be bought in the next few years?  Is this technology truly disruptive?  Or is it merely innovative?