Tuesday, November 13, 2012

Be a CIO in Three Easy Steps

Read a blog post today that made me laugh:

http://www.zdnet.com/why-cio-success-comes-down-to-just-three-things-7000006850/

Basically, the thrust of the post was that a CIO only has to do three things:

  1. Eliminate risk
  2. Improve cycle times
  3. Reduce cost
And if you read the post, then those three things really come down to one thing:  Converged Architecture.  After reading the article, I'm pretty sure the author either works for a converged architecture vendor or just spent a huge sum of money on a new converged architecture system.

Being an ex-vendor of converged architectures myself, I do like the idea that converged architectures solve everything.  Unfortunately, this just ain't true.  Sigh.

I also have a funny feeling that being a CIO comes down to more than just three things.  To be fair, it's more than just a feeling.  I used to run a CIO consulting program for Microsoft, so I've spent quite a bit of time working with, supporting and selling to CIOs.  I've spent a good portion of my life thinking about this position and what it takes to be good at it.

However, I do like this list as a conversation starter.  If I were a CIO, I would definitely be thinking about these three things.  They're probably even in the right order.  If you don't have your risks addressed, then reducing cost and improving cycle times don't matter much.

On the other hand, the blog post really misses a great opportunity when discussing risk.  Risk is one of those things that seems to be widely misunderstood in the technical community.  To me, risk is not good or bad.  Eliminating all risk is not really possible, nor is it desirable in most cases.  Risk is something that needs to be managed, like the weather.  As they say in the Pacific Northwest, there is no such thing as bad weather, just insufficient gear.  That's probably not strictly true; look at what our friends in the Northeast are dealing with due to Sandy.  The fact remains that weather must be dealt with and managed.  We cannot control it, just like we cannot truly control other sources of risk.

To change the saying to our purposes, let's just say that there is no such thing as a bad risk, just insufficient planning for that risk.

When thinking about risk, we must first think about our sources of risk.  These may be business (the company is losing money and fires half of IT) or technical (the main router blows a power supply), but they may also be political (the city of New York declares your business model illegal; yes, I'm talking to you, Uber) or natural (Sandy puts your DC under ten feet of water).  Under no circumstances can you do any one thing to manage all these sources of risk.

The next thing to consider is risk mitigation.  What can I do to make this source of risk less likely to actually occur?  In many cases, a strong mitigation plan can completely eliminate a source of risk.  Having no single point of failure in a DC is a common example of this type of risk planning.  However, this is not the best choice in all cases.  For example, what happens if the object being protected is basically a throwaway object?

This leads us to contingency planning.  In some cases, it's simpler and easier to deal with the consequences once the risk occurs than to make the risk less likely to occur.  For example, I used to back up my laptop once a week.  Honestly, it was a pain in the butt and I didn't remember every week.  Now I use a cloud-based storage platform (SkyDrive in my case), so there's nothing on my laptop that's not replicated to the cloud.  No need to back up anymore.  If the laptop dies, then it dies.  No data is lost.  Should I bother spending huge amounts of time on keeping my laptop healthy?  No.  Should I bother buying a second hard drive to mirror my local drive?  No.  There is no point.

When deciding which sources of risk to focus on, think about the probability that the risk will occur multiplied by the business impact (measured in dollars).  The risks with the highest probability and the highest impact are naturally more important than those with relatively small impact or extremely low probability.  Contrary to what the blog post above suggests, there is no way you can afford to completely mitigate all risk.  You can reduce your risk profile, but you cannot take it down to zero.
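To make the probability-times-impact idea concrete, here is a minimal Python sketch.  The risk names, probabilities and dollar figures are made up purely for illustration, and the Risk class and rank_risks function are hypothetical names of my own, not part of any real tool.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    probability: float     # estimated chance of occurring per year, 0.0 to 1.0
    impact_dollars: float  # estimated business impact if it does occur

    @property
    def exposure(self) -> float:
        # Expected annual loss: probability multiplied by impact.
        return self.probability * self.impact_dollars


def rank_risks(risks):
    """Return the risks ordered from highest to lowest exposure."""
    return sorted(risks, key=lambda r: r.exposure, reverse=True)


# Illustrative numbers only; real estimates come from the business.
risks = [
    Risk("Main router power supply fails", 0.20, 50_000),
    Risk("Data center floods", 0.01, 5_000_000),
    Risk("Laptop dies", 0.10, 500),
]

for risk in rank_risks(risks):
    print(f"{risk.name}: ${risk.exposure:,.0f} per year")
```

Note how the low-probability flood still lands at the top of the list because its impact is so large, while the laptop barely registers.  That is the whole point of multiplying the two numbers instead of looking at either one alone.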



Saturday, November 10, 2012

Let People Manage People

I had a great conversation this week about SLA enforcement that I thought I'd share here.

The conversation was about private cloud provisioning.  When you use VMware's vCD product, it will select the correct datastore based on the assigned storage profile.  All well and good.  The question we got was:  "What if I have an SLA that says the storage must be located in the USA?  For example, if this is a federal government customer."

Ah, simple we say.  Just create a specific storage profile that is called "US Only Data" or some such and you're all set.

"Yes, but."  OK, here comes the but.  "I have a contract with the federal government that says that I have to ensure that my data does not leave the USA.  How can your product ensure that doesn't happen?" 

That's the rub, isn't it?  How can we model even a very simple SLA like geo location of data?  The reality is that you cannot.

As I have said many times before, the movement to cloud is a fundamental business shift that is enabled by technology but is not fundamentally a technology shift.  Or to put it another way, technology does not solve every problem.

In this case, we are really talking about business rules that govern the placement of data.  This business rule is created by humans and must be implemented by humans.  Yes, we can create scripts or other tools to help make this simpler.  No, we cannot ensure that the software will always do the right thing.  The humans are, in the end, responsible for ensuring that the business rules are met.

Oh man, I hear the engineers out there protesting.  You are fuming on the other side of your screens!  There must be a technology answer here!   No, there isn't.

Let's think this one through.  Let's say that we did create a system that allowed us to model SLAs.  For example, we could do something simple like a tag on every VM, created at provision time, that lists the countries it's allowed to run in.  Then, when we provision, this tag gets set to "USA".  Then we tag all assets in the data center with a geo tag stating their location.  Then we write a script that compares these values.  Simple, right?

Wrong.  We are still dependent on the humans to enter the right tag values.  If that doesn't happen, the whole system breaks down.  And this is a relatively simple business rule.  If you start looking at your SLAs closely, you will find that it is almost impossible to model them completely.  And that's only the "formal" SLAs.  What about the informal ones?  The reality is that human interaction is so complex that it's almost impossible to model.  I'm sure that there are some really smart dudes out there who are trying and they may prove me wrong one day, but I doubt it somehow.
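Here is roughly what that tag comparison script could look like, as a minimal Python sketch.  The host names, VM names, dictionary layout and the find_placement_violations function are all hypothetical; this is not vCD's API, just plain data structures standing in for whatever inventory you actually have.

```python
def find_placement_violations(vms, hosts):
    """Return (vm_name, host_name, host_location) for every VM whose
    host's geo tag is not in the VM's list of allowed countries."""
    violations = []
    for vm in vms:
        location = hosts[vm["host"]]["location"]
        if location not in vm["allowed_countries"]:
            violations.append((vm["name"], vm["host"], location))
    return violations


# Geo tags entered by humans when the hardware was racked.
hosts = {
    "esx-01": {"location": "USA"},
    "esx-02": {"location": "Canada"},
    # Suppose this host is physically in Canada but was tagged "USA".
    "esx-03": {"location": "USA"},
}

# Allowed-country tags set at provision time.
vms = [
    {"name": "fed-db", "host": "esx-01", "allowed_countries": {"USA"}},
    {"name": "fed-web", "host": "esx-02", "allowed_countries": {"USA"}},
    {"name": "fed-app", "host": "esx-03", "allowed_countries": {"USA"}},
]

print(find_placement_violations(vms, hosts))
```

The script catches fed-web sitting on a Canadian host, but it happily passes fed-app, because all it can compare is one human-entered tag against another.  It has no way of knowing the tag on esx-03 is wrong.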

So, what's the message here?  The message is to let the humans manage the humans.  Create scripts and processes to check on things that are easy for computers to understand.  Workload X goes on datastore Y.  Why?  Because we say so.  The computer doesn't really need to know.  Latency on datastore X cannot be more than 12 ms.  Whatever.  These are simple things to model and relatively simple to build into monitoring software.  Focus on these things.
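As a rough illustration of how little the computer needs to know, here is a minimal Python sketch of those two checks.  The workload and datastore names, the function names and the input dictionaries are hypothetical; in practice the inputs would come from whatever your monitoring system reports.  Only the 12 ms limit comes from the example above.

```python
MAX_LATENCY_MS = 12.0  # the latency limit from the example above

# Placement rules: workload name -> datastore it must live on.
# The "why" lives with the humans; the check only needs the "what".
REQUIRED_DATASTORE = {"workload-x": "datastore-y"}


def check_placement(actual_placement):
    """Return the workloads that are not on their required datastore."""
    return [
        workload
        for workload, datastore in REQUIRED_DATASTORE.items()
        if actual_placement.get(workload) != datastore
    ]


def check_latency(latency_ms_by_datastore):
    """Return the datastores whose measured latency exceeds the limit."""
    return [
        datastore
        for datastore, latency_ms in latency_ms_by_datastore.items()
        if latency_ms > MAX_LATENCY_MS
    ]


print(check_placement({"workload-x": "datastore-z"}))
print(check_latency({"datastore-x": 15.0, "datastore-y": 8.0}))
```

Both checks are a few lines each because the rules are stated in terms a machine can compare directly: a name against a name, a number against a number.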

The converse is also true.  When setting up formal and informal SLAs with your customers, keep in mind the limits of what the system can actually measure.  Saying that performance must be "acceptable" is really tough to gauge.  An SLA around average response time of xx milliseconds can actually be measured and tracked.  Time to first byte, latency, throughput...  These things are all good fodder for SLAs.  Words like "Good," "Acceptable" or "Excellent" should never be included in a formal SLA document because they require human judgement.