
Sock stuffing


For a while now the metric most infrastructures, including Nutanix, are benchmarked against has been IOPS – effectively how quickly the storage layer can take a write or read request from an application or VM and reply back.  Dating back to the (re)birth of SANs, when they began running virtual machines and tier-1 applications, this has been the standard for filling out the shit-vs-excellent spreadsheet that dictates where to spend all your money.

Recently, thanks to some education and a bit of online pressure from peers in the industry, synthetic testing with tools like IOmeter has generally been displaced in favour of real-world testing platforms and methodology.  Even smarter tools such as Jetstress don’t give real-world results because they focus on storage and not the entire solution.  Recording and replaying operations to generate genuine load and behaviour is far better.  Seeing the impact on the application and platform means our plucky hero admin can produce a recommendation based on fact rather than fantasy.
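For the curious, a synthetic test really is this naive at heart.  Here’s a minimal sketch (illustrative only – not IOmeter’s actual workload engine, and the file size, block size and op count are my own made-up numbers): hammer one file with fixed-size random reads and call the result IOPS.

```python
import os
import random
import tempfile
import time


def synthetic_read_iops(path, file_size=8 * 1024 * 1024, block=4096, ops=2000):
    """Naive IOmeter-style test: fixed-size random reads against one file.

    Illustrative only. Without O_DIRECT most of these reads are served from
    the OS page cache, which is exactly why synthetic numbers flatter the
    storage layer: no application think-time, no mixed workload, no reality.
    """
    with open(path, "wb") as f:
        f.write(os.urandom(file_size))

    blocks = file_size // block
    start = time.perf_counter()
    with open(path, "rb") as f:
        for _ in range(ops):
            f.seek(random.randrange(blocks) * block)
            f.read(block)
    elapsed = time.perf_counter() - start
    return ops / elapsed


with tempfile.TemporaryDirectory() as d:
    iops = synthetic_read_iops(os.path.join(d, "test.dat"))
    print(f"{iops:,.0f} synthetic random-read IOPS")
```

Run it on any laptop and you’ll get a flattering number that says precisely nothing about how your actual applications will behave.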

Synthetic testing is basically like stuffing a pair of socks down your pants; it gets a lot of attention from superficial types but it’s only a precursor to disappointment later down the line when things get serious.

In this entry I want to drop into your conscious mind the idea that very soon performance stats will be irrelevant to everyone in the infrastructure business.  Everyone.  You, me, them, him, her – all of us will look like foolish dinosaurs if we sell our solutions based on thousands of IOPS, bandwidth capacity or low latency figures.

“My God tell me more,” I hear (one of) you (mumble with a shrug).  Well, consider what’s happened in hardware in the last 5ish years just in storage.  We’ve gone from caring about how fast disks spin, to what the caching tier runs on, to tiering hot data in SSD, and now the wonders of all-flash.  All in 5 or so years.  Spot a trend?  Bit of Moore’s Law happening?  You bet, and it’s only going to get quicker, bigger and cheaper.  Up next, new storage media and interfaces like NVMe and Intel’s 3D XPoint will move the raw performance game on even further, well beyond what 99% of VMs will need.  Nutanix’s resident performance secret agent Michael Webster (NPX007) wrote a wonderful blog about the upcoming performance impacts this new hardware will have on networking, so I’d encourage you to read it.  The grammar is infinitely better for starters.

So when we get to a point, sooner than you think, where a single node can rip through >100,000 IOPS with existing generations of Intel CPUs and RAM, where does that leave us when evaluating platforms?  Not with synthetic statistics, that’s for sure.

Oooow IO!

By taking away the uncertainty of application performance almost overnight, we can start to reframe the entire conversation around a handful of areas:

Simplicity

Scalability

Predictability

Insightfulness

Openness

Delight

Over the next few weeks (maybe longer, as I’m on annual leave soon) I’m going to try to tackle each of these in turn, because for me the way systems are evaluated is changing, and it will only benefit the consumer and the end customer when the industry players take note.

Without outlandish numbers those vendors who prefer their Speedos with extra padding will quickly be exposed.

See you for part 1 in a while.

Quality assured, not assumed.


Wow, bet that runs just like Steve intended!

There are two trains of thought in the world of hyperconvergence.  One is to own the platform and provide an appliance model with a variety of choices for the customer based on varying levels of compute and storage.  Each box goes through thousands of hours of testing, both aligned to and independent of the software that it powers.  All components are beaten to a pulp in various scenarios, run to death, and performance calibrated and improved at every step.  Apple has done this from its inception and has developed a vastly more reliable and innovative platform than any PC since.  Yes, I’m a fanboy…

The other train is one that can (and has) been quickly derailed.

You create a nice bit of software, one that you also spend thousands of hours building and testing, but when it comes to the platform you allow all manner of hardware as its base.  Processor, memory, manufacturer – all are just names at this stage.  vSAN started its HCL last year as a massive Excel spreadsheet filled with a huge variety of tin, most of which was guesswork, and it showed in how that spreadsheet was received by the community.  Atlantis USX takes a similar approach.  A choice of a thousand flavours is great if you’re buying yoghurt, but not so good when your business relies on consistency and predictability – oh, and a fast support mechanism from your vendor.  You can imagine the finger pointing when something goes wrong…

It’s the software that matters, of course, and while this statement is correct it’s only a half-truth.

Unless you can accurately test and assure every possible server platform from every manufacturer your customers use, the supportability of the platform (that’s the hardware plus the software) is flawed.  If you can somehow cover the majority, you’re still in for a world of pain.  Controllers on the servers may differ.  Some SSDs may provide different performance in YOUR software regardless of their claimed speeds.  Suddenly the same software performs differently across hardware that is apparently the same.


At Nutanix we’ve provided cutting-edge hardware from small footprint nodes to all-flash but never once have we not known the performance and reliability of our platform before it leaves the door and is powered up by a customer.  You can read about all six hardware platforms here.  When we OEM’d our software to Dell we gave the same level of QA to the HC appliances too.

We know our hardware platform and ensure that it works with the hypervisors we support.  We then know our software works with those hypervisors.  We own and assure each step to provide 100% compatibility.  If you’re just the software on top, you have thousands of possible permutations to assure.  Sorry, I mean assume.

We own it all from top to bottom, and the boxes, regardless of their origin or components, are 100% Nutanix.  This is how we can take and resolve support questions and innovate within the platform without external interference.  Customers love the simplicity of the product, as you probably know, but there is an elegance in also displaying a structured yet flexible hardware platform.  Ownership is everything.

I’ve lost count of the flak I’ve taken for “not being software only”, as that’s “the only way to be truly software defined.”

What bollocks.

It is the software that matters, but if as a company you cannot fully understand the impact your software has on the hardware it must run on, then the only person you’re kidding is yourself and, more worryingly, the first person it hurts is your customer.

Let’s see who else follows the leader once again.

There’s a storm coming. Big deal…

Remember the ominous end scene in The Terminator (yes, there’s a ‘The’ in it) where Sarah Connor shows how badly prepared she is, travelling to Mexico without even a simple Spanish phrase book?  “There’s a storm coming,” says the little boy.  Sarah, looking more than a little nervous, drives off into the future to meet the rain clouds head on.  Smart move?

Bad preparation

To summarise she was badly prepared for where she was going and what was to come.  I didn’t even see a rain coat in the Jeep she was driving and don’t get me started on the blatant lack of a roof.  “There’s a storm coming and you’re wearing the wrong clothes, your car will rust from the floor panel out and frankly you should have learned basic conversational Spanish before you left,” is what the boy should have said.  His dad probably warned him away from people like that soon after.

Anyway, storms are a ballache at the best of times, and generally the ones you and I know about in our working life are boot storms.  Just like the type that’ll soak you to the skin and prove that a fancy bandana is no replacement for an umbrella, we need to adopt the right technology to overcome this inevitable problem, and that’s what I’m going to address today.

This morning I was with a customer who’s looking to start a VDI deployment with 500 desktops and grow to around 3,000 depending on take-up in the business.  The major headache they’d read about was IOPS, and in particular the rather nasty side effect that booting all their VMs at once has on the system as a whole.  SANs are not very good at serving IOPS.  They’re not that great at anything other than storage, really, and that’s why there are lots of bandage technologies out there to cover up the holes and disguise how awful the performance can be if you tried to run VMs from them.  Now, I’m all for keeping massive investments going, so if you want to throw some further expense in front of a SAN you’re locked into for four more years, go right ahead.  Come and talk to me when the steak dinner invites and massive renewal bills come in.

Thankfully, today the customer was ready for real change, which is why I was discussing Nutanix’s approach to VDI and all the other wonderful challenges desktop virtualisation brings with it.

Boot storms to us at Nutanix are nothing more than a light shower with a raincoat on.  Preparation to mitigate the IO spikes for any number of desktops is built in to our product and removes the worry for the customer.  Let me explain…

In a typical compute+SAN architecture you have a bunch of servers running VMs.  They talk down through a bunch of storage fabric to a couple of storage controller heads, and then down to the disk shelves.  The shelves can only serve up a finite amount of IO.  If you boot 10 machines, that’ll be fine.  100 could probably work OK too.  Go to 200 and above and you’re looking at major stress being put on everything below those servers.  The fabric might not be saturated, but you can bet your last weather-proof North Face coat that the disk shelves or controllers will be.  The more VMs that boot, the slower the whole system becomes as each VM has to wait for IO to be served.  Now, of course, you could stagger boot times and do them all at 4am before people come into work, but what about the time when you need to fire them all up immediately after invoking DR, or when applying a critical patch during the working day?  Big trouble, Sarah.  Big trouble.
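The maths behind that cliff is simple enough to sketch.  The per-VM boot demand and the controller ceiling below are illustrative assumptions of mine, not measurements from any particular SAN, but the shape of the curve is the point:

```python
# Back-of-the-envelope boot-storm maths under stated assumptions:
# each booting desktop demands ~300 IOPS for ~30 seconds (illustrative
# figures; real images vary wildly), and the shared controller pair can
# serve ~20,000 IOPS in total.
BOOT_IOPS_PER_VM = 300
BOOT_SECONDS = 30
CONTROLLER_IOPS = 20_000


def boot_time(vms):
    """Approximate per-VM boot time when all VMs boot at once."""
    demand = vms * BOOT_IOPS_PER_VM
    # Once aggregate demand exceeds the controller ceiling, every VM's
    # boot stretches in proportion to the oversubscription.
    stretch = max(1.0, demand / CONTROLLER_IOPS)
    return BOOT_SECONDS * stretch


for vms in (10, 100, 200, 500):
    demand = vms * BOOT_IOPS_PER_VM
    print(f"{vms:>4} VMs -> demand {demand:>7,} IOPS, ~{boot_time(vms):.0f}s per boot")
```

With these numbers 10 VMs boot happily, 100 already oversubscribe the controllers, and 200 triple every desktop’s boot time.  Exactly the “fine, OK-ish, major stress” progression above.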

Because Nutanix is a distributed platform, our approach to boot storms is to tackle them per node.  If we assume 80 to 120 VDI VMs could run on a single node, that’s the only thing we need to calculate for.  Once we know how many VMs each node can handle in terms of boot and general density (I’ve had IOmeter tests show 25,000 random reads and 18,000 random writes on regular 3000-series nodes a couple of weeks back), all we have to do is add more of the same node type to get to the desired total VMs.  That’s how we scale and design clusters.  It’s that easy.
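The sizing exercise really is that linear.  A hedged sketch using the density figure above (the helper function and the N+1 spare-node choice are my own illustration, not an official sizing tool):

```python
import math


def nodes_needed(target_vms, vms_per_node=100, n_plus_one=True):
    """Per-node scaling: validate a density once, then just add nodes.

    vms_per_node=100 sits in the middle of the 80-120 VDI VMs per node
    range mentioned above; n_plus_one adds a spare node so the cluster
    can absorb a node failure (a common, but assumed, design choice).
    """
    nodes = math.ceil(target_vms / vms_per_node)
    return nodes + 1 if n_plus_one else nodes


print(nodes_needed(500))   # initial 500-desktop deployment
print(nodes_needed(3000))  # full take-up
```

No re-architecting, no forklift storage upgrade between 500 and 3,000 desktops; the only variable is the node count.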

Here’s a diagram from The Nutanix Bible on how Shadow Clones work.  Click it to go to the article.

Shadow Clones

Because we read and write locally, or in some cases read over the 10GbE network, the majority of all IO is done local to the VMs and via the SSD tier.  Even better, if we see blocks of data that are required by lots of VMs on a node, we’ll kick in Shadow Clones to ensure that all VMs get that data localised right away.  Data locality is the key here, but it’s only one of the technologies we use to make sure the cluster as a whole is predictable and efficient.  The best part is that all of this is done on the fly without any administration.  We take care of it invisibly and without disruption.
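As a toy illustration only – this is not Nutanix’s actual implementation, and the threshold is a number I made up – the idea can be modelled as a per-node counter that turns a hot shared vDisk into a local replica:

```python
# Toy model of the idea behind Shadow Clones: once enough reads on a node
# hit the same read-heavy vDisk (e.g. a VDI base image), keep a local
# replica so every subsequent read is served from the node itself.
class Node:
    LOCALISE_AFTER = 2  # assumed threshold: remote reads before localising

    def __init__(self):
        self.local_replicas = set()
        self.read_counts = {}

    def read(self, vdisk):
        if vdisk in self.local_replicas:
            return "local"  # served from this node's own SSD tier
        self.read_counts[vdisk] = self.read_counts.get(vdisk, 0) + 1
        if self.read_counts[vdisk] >= self.LOCALISE_AFTER:
            self.local_replicas.add(vdisk)  # "shadow-clone" the hot vDisk
        return "remote"


node = Node()
reads = [node.read("base-image") for _ in range(5)]
print(reads)  # after a couple of remote reads, everything goes local
```

The real mechanism is far more sophisticated, but the payoff is the same: the network stops being part of the hot read path during a boot storm.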

So next time you start to worry about storms, just select the right clothing before you go playing in the rain and you’ll be just fine.

 

Gracias por leer.

© 2017 Nutanix Noob
