VMware

Set up TPM support in vCenter on Dell R7515

Quick HowTo/reminder to myself on how to activate TPM on ESXi hosts connected to vCenter.

The smoothest way is to configure the servers before they are connected to vCenter; otherwise they must be removed from the inventory and re-added.

The BIOS security settings must be correctly configured:

Dell R7515 BIOS menu with System Security highlighted

Select System Security.

Dell R7515 BIOS System Security submenu, TPM Security section

TPM Security must be turned On.

Dell R7515 BIOS TPM Advanced Settings submenu

Under the TPM Advanced Settings menu, TPM2 Algorithm Selection must be set to SHA256.

Dell R7515 System Security submenu, Secure Boot section

Back in the System Security menu, Secure Boot must be Enabled.

Boot the server and add it to vCenter.

Enable the SSH service and log on to the server. Check the TPM status:

# esxcli system settings encryption get | grep Mode
   Mode: NONE
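While logged in, it’s also worth confirming that Secure Boot actually took effect. On the ESXi builds I’ve worked with, a helper script ships with the hypervisor for exactly this purpose (verify the path on your version):

# /usr/lib/vmware/secureboot/bin/secureBoot.py -s
Enabled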

Set the mode to TPM:

# esxcli system settings encryption set --mode TPM
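Running the same get command again should now report the new mode:

# esxcli system settings encryption get | grep Mode
   Mode: TPM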

Get the encryption keys and store them somewhere safe, like a password manager:

# esxcli system settings encryption recovery list
Recovery ID                             Key
--------------------------------------  ---
{....}                                  ....
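If you have more than a handful of hosts, a quick loop from a management machine can collect all the keys in one go. This is just a sketch: the hostnames are placeholders, and it assumes the SSH service is enabled on every host.

for host in esx01 esx02 esx03; do   # placeholder hostnames; adjust to your environment
  echo "== $host =="
  ssh root@$host esxcli system settings encryption recovery list
done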

In vCenter, you’ll see a warning on each host about the encryption key backup status; this last step is what that warning refers to. Once you’re confident the Recovery ID and Key for each host are securely stored, reset the warning to green. The hosts are now using their TPM capability.

Fixing vSAN driver compatibility on Dell R7515

A while back, we purchased some vSAN ReadyNodes for a new cluster. The machines came with ESXi installed in an all-NVMe configuration, but when setting up vSAN, Skyline Health kept complaining that the driver used for the write-intensive cache drives wasn’t certified for this purpose.

I opened support cases with both VMware and Dell as I was in a hurry to get the machines running but didn’t know where the problem lay – we had an identically specced cluster that had been manually installed with vSphere 7 earlier where this issue did not occur. Unfortunately none of the support cases ended with a viable resolution: I seem to have gotten stuck with first-line support in both cases and didn’t have time to nag my way to higher levels of support – the shibboleet code word never seems to work in real life.

I finally compared which drivers were actually in use on the new servers versus the old ones, and realized the cache disks on the new servers erroneously used the intel-nvme-vmd driver, while on the older hosts all disks used VMware’s own nvme-pcie driver. The solution, then, was very simple:

For each host, I first put the machine into Maintenance Mode, enabled the SSH service, and logged in.

I then verified my suspicion:

esxcli software vib list | grep nvme
(...)
intel-nvme-vmd                 2.5.0.1066-1OEM.700.1.0.15843807     INT      VMwareCertified   2021-04-19
nvme-pcie                      1.2.3.11-1vmw.702.0.0.17630552       VMW      VMwareCertified   2021-05-29
(...)
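The VIB list only tells you what is installed; to see which driver each adapter is actually bound to, the storage core adapter listing is useful as well (the Driver column shows the module owning each vmhba):

esxcli storage core adapter list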

I removed the erroneously used driver:

esxcli software vib remove -n intel-nvme-vmd

And finally I rebooted the server. Rinse and repeat for each machine in the cluster.
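After the reboot, re-running the earlier check should show only VMware’s own driver remaining (expected output based on the listing above; the version string will vary with your build):

esxcli software vib list | grep nvme
(...)
nvme-pcie                      1.2.3.11-1vmw.702.0.0.17630552       VMW      VMwareCertified   2021-05-29
(...)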

After I was done, I re-checked Skyline Health for the cluster, and was greeted with the expected green tickmarks:

Image showing green tickmarks for all tested items.

Troubleshooting vSphere update woes

It’s 2020 and I still occasionally stumble on products that can’t handle international characters.

I’ve been running my update rounds on our vSphere environment, but one host simply refused to perform its update compliance check.

To troubleshoot, I enabled the SSH service and remoted into the host, looking for errors in /var/log/vua.log. Sure enough, I found an interesting error message:

--> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 33643: ordinal not in range(128)

The byte 0xc3 is typically the first half of a UTF-8 encoded Swedish or Norwegian character (å, ä, ö, ø and friends), so I grep’d the output of esxcfg-info until I found the culprit:

esxcfg-info | grep å
               |----Name............................................Virtual Lab Tången vSAN
                     |----Portset Name..............................Virtual Lab Tången vSAN
                     |----Virtual Switch............................Virtual Lab Tången vSAN
                     |----Virtual Switch............................Virtual Lab Tången vSAN
                     |----Virtual Switch............................Virtual Lab Tången vSAN
                     |----Virtual Switch............................Virtual Lab Tången vSAN
                     |----Virtual Switch............................Virtual Lab Tången vSAN
                     |----Virtual Switch............................Virtual Lab Tången vSAN
                     |----Virtual Switch............................Virtual Lab Tången vSAN
                  |----Name.........................................Virtual Lab Tången vSAN
                        |----Portset Name...........................Virtual Lab Tången vSAN
                        |----Virtual Switch.........................Virtual Lab Tången vSAN
                        |----Virtual Switch.........................Virtual Lab Tången vSAN
                        |----Virtual Switch.........................Virtual Lab Tången vSAN
                        |----Virtual Switch.........................Virtual Lab Tången vSAN
                        |----Virtual Switch.........................Virtual Lab Tången vSAN
                        |----Virtual Switch.........................Virtual Lab Tången vSAN
                        |----Virtual Switch.........................Virtual Lab Tången vSAN
            |----World Command Line.................................grep å
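In this case I knew roughly which character to look for. If you don’t, a blunter sweep for anything outside the printable ASCII range works too; this is just a sketch, and depending on the grep flavour on your build it may also flag lines containing tab characters:

esxcfg-info | grep '[^ -~]'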

A vLab I had created for a couple of my Veeam SureBackup jobs had a Nordic character in its name, and that was enough to block updates. After removing all traces of the virtual lab and the Standard Switch it had created on the host, the same command showed no characters outside the limited ASCII set, and updating the host went as smoothly as it usually does.

Lesson learned: Client-side issues with localization may have mostly been solved for a decade or two, but server-side there are still reasons – not good ones, but reasons – to stick to plain English descriptors for everything.

When the French attack…

A consultant working with our Alcatel phone system encountered a weird issue that caused us some problems the other day. When attempting to install an Open Touch Media Server (used for receiving fax, for example), the entire vCenter client environment froze, and a reload of the page resulted in the following error message:

503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http20NamedPipeServiceSpecE:0x0000…] _serverNamespace = / action = Allow _pipeName =/var/run/vmware/vpxd-webserver-pipe)

A lot of searching the web led me nowhere: there were plenty of suggested solutions, but none whose symptoms matched what I was experiencing. I had not changed the IP address of the vCenter Appliance, nor had I changed its name, and I did not have an issue with logs reporting conflicting USB device instances.

What I did have, though, was a new OpenTouch server on one of my ESXi hosts with no network assigned to its network interface, and this, apparently, is not a configuration that vCenter was written to take into consideration.

Logging on to the local web client of the specific ESXi host where the machine was running (after first identifying which host that was…) and selecting the machine in question, I got a warning message describing the network problem, along with a link to the Action menu. Simply selecting a valid network and saving the machine configuration was enough to let me ssh to the vCenter Appliance and start the vmware-vpxd service:

# service-control --start vmware-vpxd
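To confirm that the service actually came back up, service-control can also report status:

# service-control --status vmware-vpxd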

We’ll just have to see how we proceed from here…

Manually removing ghost vVols from IBM SVC-based storage

As part of my evaluation of presenting vVols to vCenter from an IBM FlashSystem V9000, I decided to start from scratch after learning a bit about the benefits and limitations of the system. That is: I like vVols a lot, but I learned some things in my tests that I wanted to do differently in actual production.

Unfortunately, once I had migrated my VMs off the vVol datastores, I still couldn’t detach the relevant storage resources from the storage service in Spectrum Control Base. The error message was clear enough: I’m not allowed to remove a storage resource that still has vVols on it. My frustration stemmed from the fact that vCenter showed no VMs or files on any of the vVol datastores, yet I could clearly see them (labeled as “volume copies”) in the “Volumes by Pool” section of the SVC webUI on the V9000.

At least as of version 7.6.x of the SVC firmware, there is no way of manually removing vVols from the GUI, and as usual in such cases, we turn to the CLI:

I connected to the V9000 using ssh, taking care to log on as my VASA user. All virtual disks on the V9000 can be listed using the lsvdisk command. The first two columns are their ID and name, and either of these can be fed to the rmvdisk command to manually remove a volume.
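In practice it boils down to something like the following, where the placeholder must be replaced with the ID or name of the orphaned volume; double-check the lsvdisk output before removing anything:

lsvdisk
rmvdisk <id_or_name>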

Just to be clear: the rmvdisk command DELETES stuff. Do not use it unless you really mean it! With that warning out of the way: once I had removed the volumes and waited a couple of minutes for the change to propagate to Spectrum Control Base, detaching storage resources from storage services was a breeze.

VMware Storage Providers and Certificate issues

While trying to test out vVols in our vSphere 6.5 environment, presented via IBM Spectrum Control Base 3.2 from an IBM FlashSystem V9000 SAN, I ran into a small issue that took me a while to figure out:

I installed Spectrum Control Base 3.2 and presented its web services via an FQDN. To avoid the nagging of modern browsers, I used a regular wildcard certificate valid for the domain I chose to use.

After the initial setup, when I tried to add SCB as a storage provider in VMware, I got the following error message: “A problem was encountered while provisioning a VMware Certificate Authority (VMCA) signed certificate for the provider.”

A web search showed me that this was a pretty common problem with several VASA providers, but none of the suggested solutions applied to our environment. After half an hour of skimming forums and documentation, I found the following quote in an ancient support document from VMware:

Note: VMware does not support the use of wildcard certificates.

So: I generated a self-signed certificate in the Spectrum Control Base server webUI, and the problem disappeared.
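If you want to verify which certificate the provider endpoint actually serves, a quick openssl probe works. Note that scb.example.com is a placeholder and 8440 is what I believe to be the default Spectrum Control Base port, so adjust both to your environment:

openssl s_client -connect scb.example.com:8440 -servername scb.example.com </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer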

Today’s lesson: we don’t use wildcard certificates in a VMware service context.

The paravirtual SCSI controller and the blue screen of death

For driver reasons, the default disk controller in VMware guests is an emulated LSI card. However, once you install VMware Tools in Windows (and immediately after installing the OS in most modern Linux distributions), it’s possible to slightly lower the overhead for disk operations by switching to the paravirtual SCSI controller (“pvscsi”).

I’m all for lower overhead, so my server templates have already been converted to use the more efficient controller, but I still have quite a lot of older Windows servers running the LSI controller, so I’ve made it a habit to switch controllers when I have them down for manual maintenance. There is a perfectly good way of switching Windows system drives to a pvscsi controller in VMware, and it’s well documented, so up until a couple of days ago I had never encountered any issues.
