Sunday, February 25, 2024

Failure is an Orphan

 "Success has many fathers, failure is an orphan."

-English Proverb


As a seven-year VMW veteran, I have been reading Chris Romano's VMW history posts with great interest. Most of what he describes happened before I joined, but I have met many of the people featured in those posts.


I thought I would post here my personal recollection of a seminal event that shaped my VMW experience: the creation of VMware Cloud on AWS (or just VMC, as we called it).


At the time, I had been working for Dan Wendlandt, who was running product for what we called "Cloud Native Applications," or CNA. CNA was the precursor to what is now called Tanzu; this was before the Heptio acquisition. The project I had been working on was cancelled (that's a different blog post) and I was looking for a new gig. I was pointed to Narayan Bharadwaj, who was forming a team to build something called "Skyscraper." I had no idea what that was, but I was told it was cloud related.


I set up a meeting with Narayan and he pulled me into a small conference room. This was on the Promontory campus in Palo Alto (Prom D, I think? We later moved to Prom E). As an aside, I have had the pleasure of working in all five of the Prom buildings that have offices (Prom F is the gym). In the Prom buildings, the conference rooms all have glass walls; it's a nice, airy feeling to use them. This one, however, had large flip-chart pages stuck up all over the glass so that the room was completely hidden from outside view. We went into the room and Narayan closed the door. I was asked to sign an NDA. Keep in mind I was already a PM working on VMW products, so this was VERY strange. I had been read into many pre-release products over the years but had never signed an additional NDA. I signed the document and joked that this was a nice-looking murder room, but they forgot to put the plastic on the floor. Narayan didn't laugh.


Narayan proceeded to brief me on a project to run vSphere natively on public clouds and sell it as a service (i.e., a SaaS product). While selling vSphere as a service wasn't a new idea (we had recently spun out vCloud Air), the idea of running on public clouds WAS new. The plan was to run Skyscraper on multiple public clouds, and each cloud had a code name. I was shown a prototype running on "Sears Tower," and there was another one called "Empire State." The problem was the tradeoff: we could get the rapid provisioning velocity of clouds that offered automated provisioning, but only for VMs, or we could get bare metal that was essentially provisioned by hand. At the time, the bare metal "API" on the clouds that supported it just created a ticket, and a human provisioned your server. It could take hours. To make Skyscraper work, we needed cloud velocity and the ability to provision servers in minutes, not hours. Thus, we needed to run on VMs.


As a former vSphere PM, I had worked on the ESXi kernel (I shipped a kernel feature called IO Filters) and I knew the storage side (I worked on vVols). I thought I could help the team and was very excited to join. I pretty much got the job on the spot and was working for Narayan the next day. When I joined, it was a VERY small team; I think I was the third or fourth hire.


It soon became clear that Skyscraper wasn't going to work. ESXi was built to run on bare metal. Yes, you could run "V on V," and ESXi and vSphere DID run on the VMs we could get from the public cloud providers (did you know that ESXi and vSphere are totally different things?), but the storage system was slow as hell and the overall system just didn't deliver the performance we needed. It didn't look good.


Then along came something called "Petronas." Petronas was different. Petronas ran on something called a "Metal" instance that was provisioned natively, just like a VM. That meant we used the public cloud API to provision a bare metal server and then ran ESXi on that server natively. This took a few minutes (our target was under 20 minutes), but it was WAY WAY WAY faster than the alternatives. To put it another way, we simply did a remote boot of a server running in the cloud into an ESXi image that we provided.


BOOM! Skyscraper suddenly made sense.




Yes, "Petronas" was AWS EC2.Metal. Did you ever wonder why early VMC hosts were called "i3p" instead of "i3.Metal" which was their official name? Ya, the "p" stood for Petronas. Petronas Tower was the largest office building in the world at the time, so we gave AWS that code name to hide the fact that we were building a new product on AWS. VMC was the launch customer for AWS Metal. Today, anyone can get a .Metal instance, but at the time, you had to have special permissions and we were the only ones who could provision them. Hilariously, everyone thought we were saying "Patronus" like from Harry Potter, but it was actually Petronas because Skyscraper instances were all named after buildings.


In some ways, VMC was a very simple project. We took vSphere as it was and ran it in a VM, booted a bunch of i3p instances to run ESXi, used the local SSD storage to host vSAN, and away we went. We had to write drivers for the AWS gear, but that was something we knew how to do. IIRC, the AWS NICs were pretty special and took some joint engineering between AWS and VMW, but in the end it was just a driver, and it worked. In some ways, AWS just became another OEM like Dell or HP: they were the hardware "manufacturer" and we did the driver work. An i3p instance running ESXi was a full ESXi experience, and when we added the rest of vSphere, you had a "Software-Defined Datacenter" (SDDC) that could run any workload a regular on-prem deployment could run. Kaboom!
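

On the vSAN point: claiming a host's local SSDs into a vSAN disk group is ordinary esxcli work, conceptually the same on an i3p as on any on-prem box. The device names below are made up and the actual VMC bring-up automation was internal, so this sketch just shows the shape of the operation.

```python
import subprocess

# Hypothetical device IDs for an i3p host's local NVMe SSDs.
CACHE_DEVICE = "naa.0000000000000001"   # cache tier
CAPACITY_DEVICES = [                     # capacity tier
    "naa.0000000000000002",
    "naa.0000000000000003",
]

def create_disk_group() -> None:
    """Build one vSAN disk group from local SSDs using esxcli."""
    cmd = ["esxcli", "vsan", "storage", "add", "--ssd", CACHE_DEVICE]
    for dev in CAPACITY_DEVICES:
        cmd.extend(["--disks", dev])
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    create_disk_group()
```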


In other ways, VMC was a complete overhaul of everything VMware was and did. VMC meant that we were running a live service that we sold by the hour. VMC meant that we could deploy new features to customers in days, not months. VMC meant that we could fix critical security bugs in hours. It soon became clear that VMC was a SERVICE, not a PRODUCT. We didn't "SHIP SOFTWARE" at VMC; we "PUSHED TO PROD" and then used feature flags to expose new features. We eventually got to the point where a push became a non-event. We pushed to prod all the time; it got boring. Feature doesn't work right? Turn off the feature flag. This meant we could run much faster than the rest of VMware and take greater risks. Why bother running a three-month beta? Just ship the feature and test in prod.
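

To make the "turn off the feature flag" move concrete, here is a minimal sketch of the pattern. The names are hypothetical; in a real service like VMC, the flag values live in a config service so they can be flipped at runtime, not in a hardcoded dict.

```python
# Hypothetical flag store. In production this would be backed by a
# config service so a flag can be flipped without redeploying anything.
FLAGS = {
    "new-storage-ui": False,  # dark-launched: code is in prod, feature is off
}

def is_enabled(flag: str) -> bool:
    """Default to off, so an unknown flag can never expose unfinished work."""
    return FLAGS.get(flag, False)

def render_storage_page() -> str:
    if is_enabled("new-storage-ui"):
        return "new storage UI"
    return "old storage UI"  # instant rollback path: flip the flag, no push
```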


This caused conflicts. We were different; we were special. We were a customer of vSphere but not a direct part of the vSphere team. We were in the same business unit as vSphere, but in many ways we were its little brother, with all that entails. I was personally briefing PatG on our progress, which was nuts for a PM Director. Attempts were made to kill us or shut us up. It was a grind. We all worked crazy hours. It really felt like a startup, despite the fact that we lived inside a very large, multi-billion-dollar software company. We had a mission and we had something to prove. I personally led the deployment of the first ten "lighthouse" customers. PM was extremely hands-on, and we were directly involved in every aspect of the business. Within six months it was obvious that VMC was a winner, and we went into overdrive. New features, new regions, new team members. The team exploded, and the product was generating millions of dollars in revenue.


For me, this was peak VMW. We were allowed to take a risk, to do something that was a little crazy and certainly out of our comfort zone. To be honest, we had no business running an enterprise SaaS business, and we had no idea what we were doing at first. We had to figure out how to accept credit cards, we had to figure out what a feature flag was, and we had no way to pre-release features without a beta. But we learned. We worked hard, we worked as a team, and we solved the problem. In the end, VMC is a great service, and at this point it is probably the crown jewel of the vSphere portfolio. I hope Broadcom continues to invest in it.