ManitobaVMUG: vSphere 5.0 update 1 and stretched clusters

Original Post:

http://virtualgeek.typepad.com/virtual_geek/2012/03/vsphere-50update-1-and-stretched-clusters.html

Folks – vSphere 5.0 update 1 is out! Download it here…

Read the ESXi 5.0 update 1 release notes here.

Read the vCenter 5.0 update 1 release notes here.

There are tons of things in there to note. I'll highlight a couple, but I'd really recommend scanning the release notes.

"New coolness" – too much to list (it's in the release notes). Here's one example. Space reclaim continues to evolve, and to resolve issues seen with UNMAP in some cases and on some array targets, update 1 automatically disables UNMAP (I've noted the reason and the pre-update 1 workarounds here). The only case I know of personally where there have been issues is during svmotion, as I've said in the past. But vmkfstools has been updated so you can run an unmap across a datastore, even specifying the percentage of free space to reclaim. Check that out here. BTW – this isn't stopping here. You can expect more work from the array vendors and in future vSphere releases to keep improving this efficiency use case.
"Errata" – the release notes do a great job of highlighting not only fixes, but also open errata. For example (I've run into this one), when you try to unmount an NFS datastore while Storage IO Control (or Storage DRS) is in use, you can't unmount it, and you get an error message that the resource is in use. I'm guessing that this is because there is an open file handle. The workaround: disable SIOC/Storage DRS, then unmount.

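The datastore-wide reclaim via vmkfstools mentioned above can be sketched as a small helper that only builds the command line (it does not run anything). This is a sketch under assumptions: it assumes the `vmkfstools -y <percentage>` form, run from inside the datastore's directory under `/vmfs/volumes`; the datastore name `datastore1`, the helper name, and the 1-99 range check are made up for illustration. Check the release notes for your build before running it on a real host.

```python
import shlex


def build_reclaim_command(datastore: str, percent: int) -> str:
    """Build a shell command line that asks vmkfstools to reclaim space.

    Assumption: `vmkfstools -y <percent>` is run with the datastore's
    directory under /vmfs/volumes as the working directory.
    """
    if not datastore or "/" in datastore:
        raise ValueError("datastore must be a plain datastore name")
    # The 1-99 range is this sketch's own guard rail, not a documented limit.
    if not 1 <= percent <= 99:
        raise ValueError("percent must be between 1 and 99")
    path = "/vmfs/volumes/" + datastore
    return "cd {} && vmkfstools -y {}".format(shlex.quote(path), percent)


if __name__ == "__main__":
    # Hypothetical datastore name, for illustration only.
    print(build_reclaim_command("datastore1", 60))
```

Building the string separately from executing it makes it easy to review the exact command before pasting it into an ESXi shell.
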
One thing in particular that I've been waiting for in update 1 is the change in how storage device failures interact with VM HA.

In vSphere 5, the "All Paths Down" (APD) condition (ESX can't reach a device via any path) got a new "friend": the "Permanent Device Loss" (PDL) state (when a target communicates, "hey, this device is gone, and don't expect it back anytime soon" – e.g. when it has been removed intentionally from the host, or the target is in a partitioned state). I've discussed this here.

For a bit of context – the use case for "Stretched vSphere Clusters" is turning out to be much more popular than I think many folks (certainly me) expected. I know that my respected colleague Scott Lowe is asked about this almost daily. Lee Dilworth and I did a very popular session at VMworld 2011 on this topic, check it out here.

For the last 1-2 years, VMware, EMC and others in the industry have been seriously planning, thinking through and engineering the solution stack around this use case. It's not the same as a regular cluster: there are additional failure conditions that need to be planned for. This has resulted in the "vSphere Metro Stretched Cluster" HCL category, which incorporates testing for these failure conditions (read more on that here). Beyond just testing, we continue to enhance each part of the solution to make stretched clustering work better and better – more simply, in a more integrated, and frankly invisible fashion (this is what Lee and I were talking about at the close of that VMworld session when we discussed some of the "futures").

In vSphere 5.0 update 1, one other "shoe drops". PDL codes are used by EMC VPLEX when "partitioned" (where all connectivity between the 2 sites in a VPLEX cluster fails). This means that the VPLEX cluster nodes in the non-preferred site for a device (this is a "per device" setting that declares in advance which site "stops IO to that device" to avoid split brain at the storage level) say "hey, the IO to this device on this target is stopping, and you shouldn't expect it to come back momentarily".
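
To make the per-device preferred-site rule concrete, here is a toy model of what a host at each site would see during a partition. It is a sketch only: the class, function, device and site names are all hypothetical, and it models just the rule described above (the non-preferred site stops IO and reports PDL), not actual VPLEX behavior in every failure case.

```python
from dataclasses import dataclass


@dataclass
class DistributedDevice:
    name: str
    preferred_site: str  # site that keeps serving IO during a partition


def device_state(device: DistributedDevice, site: str, partitioned: bool) -> str:
    """Return what a host at `site` sees for `device` (toy model)."""
    if not partitioned:
        return "online"
    if site == device.preferred_site:
        # The preferred site wins and keeps serving IO.
        return "online"
    # The non-preferred site stops IO to avoid split brain and signals PDL.
    return "PDL"


if __name__ == "__main__":
    lun = DistributedDevice(name="dist-lun-1", preferred_site="site-a")
    for site in ("site-a", "site-b"):
        print(site, device_state(lun, site, partitioned=True))
```

Because the setting is per device, two devices in the same stretched cluster can have opposite preferred sites, so a partition can leave each site with a different set of devices in PDL.
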

What's changed? Up until now, the loss of a storage device didn't by definition trigger a VM HA response. This is an example of what Lee and I were talking about in our session. People over-simplify when thinking about stretched clusters, and just assume that VM HA will work "like SRM" (often because storage vendors tell them it will). VM HA wasn't originally designed for this use case.

In vSphere 5.0 update 1 – a PDL condition can trigger a VM HA response – if you set an additional VM HA parameter. Sweet! Duncan Epping also noted this change on Yellow Bricks (always awesome) here.
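
The before/after behavior can be summarized as a tiny decision function. This is a minimal sketch, not VMware's implementation: the post doesn't name the HA parameter, so the boolean `pdl_response_enabled` is a hypothetical stand-in for it, and the sketch assumes APD handling is unchanged since only the PDL change is described here.

```python
from enum import Enum


class DeviceState(Enum):
    ONLINE = "online"
    APD = "all paths down"
    PDL = "permanent device loss"


def ha_restarts_vm(state: DeviceState, pdl_response_enabled: bool) -> bool:
    """Toy model: does VM HA respond to this storage device state?

    `pdl_response_enabled` is a hypothetical flag standing in for the
    additional VM HA parameter, which the post does not name.
    """
    # Only PDL, and only with the extra parameter set, triggers a response.
    return state is DeviceState.PDL and pdl_response_enabled


if __name__ == "__main__":
    for state in DeviceState:
        for enabled in (False, True):
            print(state.name, enabled, ha_restarts_vm(state, enabled))
```

With the flag off, the model behaves like pre-update 1: no storage device state produces an HA response.
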

BUT – the plan is to continue, with each minor/major VMware release, to increasingly treat these geographically dispersed clusters, and the new category of geographically dispersed active-active storage models, as a design center.

VMware – thanks for the continued coolness – and I know it's going to keep on coming!


Wednesday, 21 March 2012

vSphere 5.0update 1 and stretched clusters - Virtual Geek
