A while back, we purchased some vSAN Ready Nodes for a new cluster. The machines came with ESXi installed in an all-NVMe configuration, but when setting up vSAN, Skyline Health kept complaining that the driver used for the write-intensive cache drives wasn’t certified for this purpose.
I opened support cases with both VMware and Dell, as I was in a hurry to get the machines running but didn’t know where the problem lay: we had an identically specced cluster, manually installed with vSphere 7 earlier, where this issue did not occur. Unfortunately, neither support case ended with a viable resolution. I seem to have gotten stuck with first-line support in both cases and didn’t have time to nag my way to higher levels of support; the shibboleet code word never seems to work in real life.
I finally compared which drivers were actually in use on the new servers versus the old ones and realized that the cache disks on the new servers erroneously used the intel-nvme-vmd driver, while on the older hosts all disks used VMware’s own nvme-pcie driver. The solution, then, was very simple:
For each host, I first put the machine into Maintenance Mode, enabled the SSH service, and logged in.
I then verified my suspicion:
esxcli software vib list | grep nvme
(...)
intel-nvme-vmd 126.96.36.1996-1OEM.700.1.0.15843807 INT VMwareCertified 2021-04-19
nvme-pcie 188.8.131.52-1vmw.702.0.0.17630552 VMW VMwareCertified 2021-05-29
(...)
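With several hosts to check, the same test can be turned into a yes/no answer instead of eyeballing the listing. The following is a minimal sketch, assuming the VIB name is the first whitespace-separated column of the listing (as in the output above); `has_vmd_vib` is a hypothetical helper name, not part of esxcli.

```shell
# Sketch: report whether the intel-nvme-vmd VIB appears in a VIB listing.
# Reads the text of `esxcli software vib list` on stdin.
# has_vmd_vib is a hypothetical helper name chosen for this example.
has_vmd_vib() {
    # Match on the first column only, so a similarly named VIB or a
    # description containing the string does not count as a hit.
    awk '$1 == "intel-nvme-vmd" { found = 1 } END { exit (found ? 0 : 1) }'
}

# Usage on a host (not executed here):
#   esxcli software vib list | has_vmd_vib && echo "intel-nvme-vmd is installed"
```

The function exits with status 0 when the VIB is listed and 1 otherwise, so it can be used directly in an `if` or with `&&`.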
I removed the erroneously used driver:
esxcli software vib remove -n intel-nvme-vmd
And finally I rebooted the server. Rinse and repeat for each machine in the cluster.
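When repeating this on many hosts, it is easy to forget the Maintenance Mode step. Below is a rough sketch of a guard around the removal, not what I actually ran. It assumes that `esxcli system maintenanceMode get` prints `Enabled` or `Disabled`; the `ESXCLI` variable and the `remove_vmd_driver` function are hypothetical names introduced here so the logic can be tried out away from a real host.

```shell
# Sketch of a guarded removal, meant for the ESXi shell over SSH.
# ESXCLI is a hypothetical override so the logic can be exercised off-host.
ESXCLI="${ESXCLI:-esxcli}"

# remove_vmd_driver is a hypothetical helper name.
remove_vmd_driver() {
    # Assumption: this command prints "Enabled" or "Disabled".
    state="$($ESXCLI system maintenanceMode get)"
    if [ "$state" != "Enabled" ]; then
        echo "host is not in Maintenance Mode; aborting" >&2
        return 1
    fi
    # Same removal command as above; the host still needs a reboot afterwards.
    $ESXCLI software vib remove -n intel-nvme-vmd
}
```

The function deliberately does not reboot the host; that stays a manual step so the result of the removal can be checked first.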
After I was done, I re-checked Skyline Health for the cluster and was greeted with the expected green tick marks.