Remember the ominous end scene in The Terminator (yes there’s a ‘the’ in it) where Sarah Connor is showing how badly prepared she is traveling to Mexico without even a simple Spanish phrase book? “There’s a storm coming,” says the little boy. Sarah, looking more than a little nervous, drives off into the future to meet the rain clouds head on. Smart move?
To summarise, she was badly prepared for where she was going and what was to come. I didn’t even see a raincoat in the Jeep she was driving, and don’t get me started on the blatant lack of a roof. “There’s a storm coming and you’re wearing the wrong clothes, your car will rust from the floor panel out and frankly you should have learned basic conversational Spanish before you left,” is what the boy should have said. His dad probably warned him away from people like that soon after.
Anyway, storms are a ballache at the best of times, and generally the ones you and I know about in our working life are boot storms. Just as a fancy bandana is no replacement for an umbrella when the rain is soaking you to the skin, we need to adopt the right technology to overcome this inevitable problem, and that’s what I’m going to address today.
This morning I was with a customer who’s looking to start a VDI deployment with 500 desktops and grow to around 3000 depending on the take-up in the business. The major headache they’d read about was IOPS, and in particular the rather nasty side effect that booting all their VMs at once has on the systems as a whole. SANs are not very good at serving IOPS. They’re not that great at anything other than storage really, and that’s why there are lots of bandage technologies out there to cover up the holes and disguise how awful the performance can be if you try to run VMs from them. Now, I’m all for keeping massive investments going, so if you want to throw some further expense in front of a SAN you’re locked into for four more years, go right ahead. Come and talk to me when the steak dinner invites and massive renewals come in.
Thankfully today the customer was ready for real change which is why I was discussing Nutanix’s approach to VDI and all the other wonderful challenges desktop virtualisation brings with it.
Boot storms to us at Nutanix are nothing more than a light shower with a raincoat on. Preparation to mitigate the IO spikes for any number of desktops is built into our product and removes the worry for the customer. Let me explain…
In a typical compute+SAN architecture you have a bunch of servers running VMs. They talk down through a bunch of storage fabric to a couple of storage controller heads and then down to the disk shelves. The shelves can only serve up a finite amount of IO. If you boot 10 machines that’ll be fine. 100 could probably work OK too. Go to 200 and above and you’re looking at major stress being put onto everything below those servers. The fabric might not be saturated but you can bet your last weather-proof North Face coat that the disk shelves or controllers will be. The more VMs that boot, the slower the whole system will become as each VM has to wait for IO to be served. Now of course you could stagger boot times, or do them all at 4am before people come into work, but what about the time when you need to fire them all up immediately after invoking DR or applying a critical patch during the working day? Big trouble, Sarah. Big trouble.
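To put some rough shape on that, here’s a back-of-the-envelope sketch in Python. It assumes the shared back end has one fixed IOPS ceiling and that every VM needs about the same number of IOs to boot. The function name and both numbers are made-up placeholders for illustration, not measurements from any real SAN.

```python
def boot_storm_seconds(vms: int, ios_per_boot: int, backend_iops: int) -> float:
    """Rough time for `vms` machines to finish booting at once when they
    all share a single back end capped at `backend_iops`.

    Toy model: total IO demand divided by the shared ceiling.
    """
    if backend_iops <= 0:
        raise ValueError("backend_iops must be positive")
    return vms * ios_per_boot / backend_iops


# Hypothetical placeholder figures, purely for illustration.
IOS_PER_BOOT = 10_000   # assumed IOs one desktop issues while booting
SHARED_IOPS = 20_000    # assumed ceiling of the shared disk shelves

for vms in (10, 100, 200, 3000):
    seconds = boot_storm_seconds(vms, IOS_PER_BOOT, SHARED_IOPS)
    print(f"{vms:>5} VMs booting together: ~{seconds:,.0f} s")
```

The point isn’t the exact figures, it’s that the ceiling stays put while the demand grows with every VM you add, so the wait grows right along with it.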
Because Nutanix is a distributed platform, our approach to boot storms is to tackle them per node. If we assume 80 to 120 VDI VMs could run on a single node, that’s the only thing we need to calculate for. Once we know how many VMs each node can handle in terms of boot and general density (I’ve had IOmeter tests show 25,000 random reads and 18,000 random writes on regular 3000 series nodes a couple of weeks back) then all we have to do is add more of the same node type to get to the desired total VMs. That’s how we scale and design clusters. It’s that easy.
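The sizing sum really is that short. Here’s a minimal sketch in Python using the 80 to 120 VMs per node range and this customer’s 500 and 3000 desktop targets. The function name is just something I made up for illustration, and the sketch deliberately leaves out any spare nodes you’d want for failover, so treat the result as a floor rather than a design.

```python
import math


def nodes_needed(total_desktops: int, vms_per_node: int) -> int:
    """Smallest number of identical nodes that can host `total_desktops`
    at a given per-node density. No headroom or failover nodes included.
    """
    if vms_per_node <= 0:
        raise ValueError("vms_per_node must be positive")
    return math.ceil(total_desktops / vms_per_node)


for desktops in (500, 3000):
    conservative = nodes_needed(desktops, 80)   # low end of the density range
    optimistic = nodes_needed(desktops, 120)    # high end of the density range
    print(f"{desktops} desktops: between {optimistic} and {conservative} nodes")
```

Because every node brings its own IO with it, the boot load per node stays the same whether the cluster holds five nodes or thirty-eight.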
Here’s a diagram from The Nutanix Bible on how Shadow Clones work.
Because we read and write locally, and only in some cases read over the 10GbE switch, the majority of all IO is served on the same node as the VMs and via the SSD tier. Even better, if we see blocks of data that are required by lots of VMs on a node we’ll kick in Shadow Clones to ensure that all VMs get that data localised right away. Data locality is the key here but it’s only one of the technologies we use to make sure the cluster as a whole is predictable and efficient. The best part is all of this is done on the fly without any administration. We take care of it invisibly and without disruption.
So next time you start to worry about storms, just select the right clothing before you go playing in the rain and you’ll be just fine.
Gracias por leer.