A critical issue was reported with upgrading VSAN from ESXi 5.5 with Patch 4 or Express Patch 6 to vSphere 6 GA.
The problem was introduced in the above patches, which attempted to fix incorrect SSD capacity reporting when using the RVC command vsan.disks_stats.
EP07 backs out the above fix.
Here is a possible sequence of events that leads to the problem:
- Upgrade vCenter to 6.0
- Enter maintenance mode on one VSAN cluster node (Ensure accessibility option)
- Upgrade to ESXi 6.0 GA
- Exit maintenance mode
- vMotion some VMs to the upgraded node
The ESXi 6.0 node is unable to read the SSD UUID correctly on the remaining VSAN 5.5 nodes. As a result, it cannot access the VSAN objects on the 5.5 nodes. Eventually, VMs become inaccessible and may disappear from the vCenter inventory.
A look through the vmkernel.log file would show events like this:
This problem is more evident with 3-node clusters. With more nodes, the chances of hitting this issue are lower, depending on the FTT policy and on which nodes the replicas are stored.
Examples:
Cluster node count: 3
FTT=1
Component placement:
5.5 node 1: Replica
5.5 node 2: Replica
6.0 node: Witness
or
5.5 node 1: Replica
5.5 node 2: Witness
6.0 node: Replica
In either case, more than 50% of the components are on 5.5 nodes. As long as the associated VMs run on 5.5 nodes, they would be accessible. However, availability is not guaranteed if one of the remaining 5.5 nodes experiences a hardware component failure.
If you have a cluster with 4 or more nodes, all 3 components (2 replicas and a witness) may be on the 5.5 nodes. As long as the majority of the components are on the 5.5 nodes, the object should still be accessible.
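The majority rule described above can be sketched in a few lines of Python. This is a simplified model, not how VSAN itself evaluates quorum: it assumes every component counts equally, that only components on healthy 5.5 hosts are reachable, and the tuple layout and host names (`node1`, `node2`, `node3`) are hypothetical.

```python
def has_quorum(components, failed_hosts=()):
    """Return True if more than half of an object's components are reachable.

    components   -- list of (role, host, esxi_version) tuples (hypothetical layout)
    failed_hosts -- hosts assumed to be down
    Only components on healthy 5.5 hosts count as reachable, mirroring the
    situation described above for VMs running on 5.5 nodes.
    """
    reachable = [
        role
        for role, host, version in components
        if version == "5.5" and host not in failed_hosts
    ]
    # Strict majority: reachable components must exceed 50% of the total.
    return 2 * len(reachable) > len(components)


# The two 3-node, FTT=1 placements from the examples above.
placement_a = [
    ("replica", "node1", "5.5"),
    ("replica", "node2", "5.5"),
    ("witness", "node3", "6.0"),
]
placement_b = [
    ("replica", "node1", "5.5"),
    ("witness", "node2", "5.5"),
    ("replica", "node3", "6.0"),
]

print(has_quorum(placement_a))                          # True
print(has_quorum(placement_b))                          # True
print(has_quorum(placement_a, failed_hosts={"node1"}))  # False
```

The last call illustrates why availability is not guaranteed: with one 5.5 node down, only one of three components remains reachable.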
The best approach to upgrading, with the least disruption to the running VMs, is:
1. Upgrade vCenter to 6.0
2. Place one node in maintenance mode (Ensure accessibility option)
3. Install EP07
4. Reboot and exit maintenance mode
5. Wait for resync to complete
6. Repeat steps 2 through 5 for each node in the cluster
7. Place one node in maintenance mode (Ensure accessibility option)
8. Upgrade to ESXi 6.0 GA
9. Exit maintenance mode
10. Wait for resync to complete
11. Repeat steps 7 through 10 for each node in the cluster
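To make the ordering explicit, the procedure above can be expanded into a per-node checklist. The following is only a sketch that generates text; it does not call any VMware API, and the function name and host names (`esx01` and so on) are hypothetical.

```python
def rolling_upgrade_plan(nodes):
    """Expand the two-pass rolling upgrade into an ordered checklist.

    Pass 1 installs EP07 on every node; pass 2 upgrades every node to 6.0.
    No node is upgraded to 6.0 until all nodes are running EP07.
    """
    steps = ["Upgrade vCenter to 6.0"]
    passes = [
        ("install EP07", "reboot and exit maintenance mode"),
        ("upgrade to ESXi 6.0 GA", "exit maintenance mode"),
    ]
    for action, exit_step in passes:
        for node in nodes:
            steps.append(f"{node}: enter maintenance mode (Ensure accessibility)")
            steps.append(f"{node}: {action}")
            steps.append(f"{node}: {exit_step}")
            steps.append(f"{node}: wait for resync to complete")
    return steps


plan = rolling_upgrade_plan(["esx01", "esx02", "esx03"])
for number, step in enumerate(plan, start=1):
    print(f"{number}. {step}")
```

Running the EP07 pass across the whole cluster before the first 6.0 upgrade is the point of the procedure: it avoids ever having a 6.0 node alongside unpatched 5.5 nodes.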
What if I already upgraded one host prior to installing EP07?
You have 2 choices:
- Shut down all VMs on the cluster and upgrade the remaining nodes to 6.0. Make sure to choose the “No data migration” option when entering maintenance mode on the 5.5 nodes.
- Roll back the 6.0 node to the previous version of 5.5
The first approach needs the least administrative effort but will disrupt all running VMs.
The second approach is the least disruptive but involves rolling back, installing EP07, and then upgrading to 6.0.
To roll back an upgrade, as long as you have not made any other configuration changes, you may follow VMware KB 1033604.