Sunday, November 5, 2017

Disabling vSAN and reclaiming disks when vCenter is not available

Enabling vSAN is very easy through the vSphere GUI. As long as you have unused disk devices and have set up vSAN traffic on a vmk interface, you're basically good to go (although, for enterprise deployments, it's important to pay attention to the HCL and the design documentation).

vSAN in fact exists perfectly well without a vCenter; vCenter is just the easy-to-use UI. However, if vCenter is lost, you will need to resort to esxcli commands to change vSAN settings, since the standalone HTML5 ESXi host interface does not let you perform vSAN operations.

This situation has happened to me when messing around in the homelab (where my vCenters are short-lived, i.e., I deleted one before I disabled vSAN) and in a case where I bought homelab hardware from a friend and the vCenter didn't make the trip. Since he did provide the root passwords to the hosts, I was able to SSH in and run esxcli commands.

If disks were used by vSAN and that configuration was not properly undone, you will find that those disks cannot be reused in a new vSAN configuration, or even to create plain datastores. I believe that information is recorded on each disk, meaning even reinstalling ESXi would not clear said config.

The "delete partition" commands available through the standalone html5 interface will fail because vSAN is still running and is protecting the disks. So, before I re-install my hosts or try to re-use the storage devices, I run the below commands.

I saw other posts that achieve the same thing using fdisk and partedUtil instead of esxcli vsan commands. That is the brute-force method; below is a much simpler and safer way that I know works in ESXi 6.5.
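
For reference, the brute-force route looks roughly like this (a sketch only; the device name is hypothetical, and these commands will still fail while vSAN is protecting the disk):

# show the partition table of the device (device name is an example)
partedUtil getptbl /vmfs/devices/disks/naa.50026b724712194a

# delete partition 1 - destructive, double-check the device first!
partedUtil delete /vmfs/devices/disks/naa.50026b724712194a 1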


There are two phases.

1) The first phase is removing the host from a vSAN cluster. Check if the host believes it's in a vSAN cluster with

esxcli vsan cluster get

Remove the host from said cluster with

esxcli vsan cluster leave

This command can take a while to take effect, so be patient. The host stops participating in its cluster, and running the get command again should show that the host is no longer a member of a vSAN cluster:

esxcli vsan cluster get

Virtual SAN Clustering is not enabled on this host

At this point the vsanDatastore datastore will no longer show up in the host's storage, but we aren't finished!

2) The second phase is clearing the vSAN config from the disks so they can be reused. Check if vSAN "owns" the disks with

esxcli vsan storage list

From the list, you want to identify the cache disk, typically the best-performing SSD, and copy its device name.

naa.50026b724712194a
   Device: naa.50026b724712194a
   Display Name: naa.50026b724712194a
   Is SSD: true
   ...
   Is Capacity Tier: false

The gotcha: you can't perform manual operations on vSAN disks if they were claimed automatically when vSAN was configured. You must run this command to disable that auto-claiming before proceeding (use esxcli vsan storage automode get to check whether you need to do this).

esxcli vsan storage automode set --enabled false

Now enter this command with the cache device name:

esxcli vsan storage remove -s naa.50026b724712194a

The -s flag means SSD; the command also accepts individual capacity disks with -d. The thing to know is that removing the cache disk removes every disk in the disk group (i.e., it takes care of all the capacity disks), so it's faster to just target the cache disk for the whole disk group.
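
If you ever need to evict just one capacity disk instead of the whole group, the -d form looks like this (device name is hypothetical):

esxcli vsan storage remove -d naa.5000c500a1b2c3d4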

This will take a while as well, but after it completes, issuing esxcli vsan storage list again should return no output (provided you removed all the disk groups).


I hope this helps anyone learning and playing with vSAN. Any corrections or suggestions, please reach out to me on Twitter.

References used to create this post (I do recommend them; they explain more):

  • William Lam has a great post detailing how, even after disabling vSAN through vCenter, you may need to perform extra steps to reuse the disks.
  • jmalpadw gave a great answer in the VMTN community forums which contains essentially this post's commands; I just added more explanation and example outputs.

Wednesday, November 23, 2016

#vDM30in30 11-13-2016 Host fingerprints in PuTTY don't match what ESXi shows

This was an interesting "gotcha", courtesy of my colleague @edmsanchez13, which I know happens on fully patched 5.5 and 6.5 hosts.

PuTTY by default only shows SSH key fingerprints in the md5 format, such as

[screenshot: PuTTY security alert showing the host key's md5 fingerprint]

(By the way, you can see all the SSH host keys PuTTY has learned by going in regedit to HKEY_CURRENT_USER\Software\SimonTatham\PuTTY\SshHostKeys. If you want PuTTY to prompt you with the fingerprint of a known host again, as if you were connecting for the first time, just delete its entry from here.)
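
You can also list those cached host keys from a command prompt instead of opening regedit; reg query is built into Windows:

reg query HKCU\Software\SimonTatham\PuTTY\SshHostKeys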

It used to be that you could confirm this md5 hash from the host console. Nowadays you see this on the host console, in the Troubleshooting / View Support Information section:

[screenshot: ESXi console support information showing the SSH key fingerprint in SHA256 format]

Notice how the SSH key in md5 format (aa:bb:cc:dd...) from PuTTY does not match the SSH key in SHA256 format shown on the host? That's because, since OpenSSH 6.8, "the fingerprint is now displayed as base64 SHA256 (by default)" and "the default changes from MD5 to SHA256 and format from hex to base64." This means ESXi now uses the SHA256 format as well when presenting the SSH key fingerprint to you.

ESXi uses OpenSSH (as does the rest of the world, thanks to OpenBSD) and is correct in keeping this default. All the SSH binaries are in the directory below, and you can check the OpenSSH version with -V:

/usr/lib/vmware/openssh/bin] ssh -V
OpenSSH_7.3p1, OpenSSL 1.0.2j-fips  26 Sep 2016

I can't find an option for PuTTY to show me the new SHA256 fingerprint; so - how is anyone in Windows proving the SSH pub key hash is correct, before connecting to a host?

Off the bat, I can think of two ways:

1) Confirm from the host's console

You can verify the SSH fingerprint that PuTTY shows you by asking for the md5 fingerprint - this is done with this command, using the stored host keys:

/usr/lib/vmware/openssh/bin] ./ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key.pub -E md5
2048 MD5:41:dd:b9:ec:ba:c0:ae:c7:9a:2a:21:f7:fd:23:96:91 no comment (RSA)
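
Conversely, if you just omit -E md5, ssh-keygen on OpenSSH 6.8 and later defaults to the SHA256 format, which matches what the host console displays (output shown with the hash elided):

/usr/lib/vmware/openssh/bin] ./ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key.pub
2048 SHA256:... no comment (RSA)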

2) Find a client that uses the new format. I had no problem from Ubuntu, for example:

[screenshot: OpenSSH on Ubuntu prompting with the SHA256 fingerprint on first connection]

However, if you are on Windows, you are most probably using PuTTY (unless you are paying for SecureCRT). If PuTTY won't show the new SHA256 fingerprints, what clients will? I found one that is free to use, even inside organizations, called Bitvise. This client shows both MD5 and SHA256 fingerprints; additionally, it seems quite handy, as it immediately brings up a WinSCP-like window for file transfers, so I'll be testing it as my SSH client going forward.

[screenshot: Bitvise host key verification showing both MD5 and SHA256 fingerprints]

Good links that helped me write this post:
http://superuser.com/questions/929566/sha256-ssh-fingerprint-given-by-the-client-but-only-md5-fingerprint-known-for-se
http://www.phcomp.co.uk/Tutorials/Unix-And-Linux/ssh-check-server-fingerprint.html

Saturday, August 20, 2016

storage vMotion fail with "failed to look up vmotion destination resource pool object"

This is a fun one because it's a storage vMotion over Fibre Channel, so the error message doesn't appear to make much sense. Normally this error indicates a network problem, but on FC, if you can clearly see both datastores and open and browse them, you would assume everything is OK.

The host this was happening on was version 5.0 and had several days of uptime. I suspected the host. I moved the VM to a 5.5 host (which we were migrating to anyway) and the storage vMotion completed without problems.

Googling around, I found that the related KB is 1003734. Reading it, you find a lot of troubleshooting information for compute (host-to-host) vMotion, but not as much for storage vMotions. I did find this, though:


  • This issue may be caused by SAN configuration. Specifically, this issue may occur if zoning is set up differently on different servers in the same cluster.

I believe this may be the problem. I am not sure why I can browse the datastores fine, but since moving the VM from one host to another solves it, I assume either some service has failed on this long-running host on a soon-to-be-unsupported version, or there is indeed a SAN zoning problem.
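
If you suspect zoning, one quick check is to dump the storage paths each host sees and compare the target identifiers between hosts (a sketch; run it on both hosts and diff the output):

# list every storage path this host sees; compare across hosts
esxcli storage core path list | grep "Target Identifier"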

Hope it helps someone!

Tuesday, May 3, 2016

Differences between the normal HCL and the VSAN HCL

TL;DR: I do recommend you use a VSAN Ready Node if you can.

After seeing firsthand what @virtualhobbit went through, I was just amazed by how many "gotchas" are involved with the VSAN HCL, especially if you are trying to deploy this in your homelab. This blog is about VMware gotchas, so a post was in order. I hope this can help others avoid some of these mistakes.

Here's a (non-definitive) list:

Gotcha #1

In case you have been living under a rock, you should know that a device on the standard VMware HCL does not automatically qualify for use with VMware VSAN. There is, in fact, a separate HCL, and failure to know this before deploying VSAN will cause you lots of misery. The VSAN HCL has details for I/O controllers, HDDs and SSDs; for everything else, assume the normal HCL is valid, at least until those devices show up in the VSAN HCL...

Gotcha #2

Once you access the VSAN-specific HCL option from the drop-down, the interface does not resemble the normal HCL, where you can search for components (especially via the four handy PCI device registers: VID, DID, SVID and SSID). You are greeted with a page that lets you select a "VSAN Ready Node", which is a pre-configured full server configuration. In this example, I checked all Dell servers compatible with VSAN 6.0U2:

[screenshot: VSAN Ready Node selector showing Dell servers compatible with VSAN 6.0U2]

Right. So the idea is you then go tell your server vendor "sell me exactly this". But what if you are looking for a specific hardware device, such as an SSD or RAID controller? Where would you check the HCL to choose what to buy?

Gotcha #3

Somewhat hidden below the VSAN Ready Node selector interface is a disclaimer that tells you that, if you are willing to "brave the path of building your own server" (I paraphrase), you can get to see the actual HCL:

[screenshot: the "build your own" disclaimer below the Ready Node selector]

Gotcha #4

The URL to the real VSAN HCL? It's exactly the same as the URL for the Ready Node HCL. I know. I got a tip in the vExpert Slack - to access it directly, go to http://vmwa.re/vsanhclc

Gotcha #5

Ehm... where are the four hardware identifiers we've been relying on for so long to unequivocally verify the HCL? They're not used here - at least not for the initial search for devices. You will have to browse first using the available options. You will be able to verify the PCI registers from the results, but sometimes you get little jewels like this:

"The device PID string (model) is truncated in ESXi. Please use both model number and firmware version when trying to identify the device. When in doubt, please consult with the hardware provider."

I'd like to offer a screenshot of this but I don't have one (I found this on the Intel P3700 NVMe drive, which is definitely a PCI device). If you do, please send one, as I'm really curious why an lspci -v would truncate the PCI registers.
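
Incidentally, if you want to read those four registers straight from an ESXi host, esxcli can dump them for every PCI device (a sketch; exact field names vary slightly between ESXi versions):

# list all PCI devices along with their VID/DID/SVID/SSID registers
esxcli hardware pci list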

Gotcha #6

Each type of device has different columns. Please be mindful, as these details can be extremely important. For example:

  • certain I/O (RAID) controllers are only supported in a particular mode and have specific VID, DID, SVID and SSID values.
  • certain SSDs can only be used in a particular tier (All-Flash, Hybrid Caching, etc.).
  • certain capacity drives are only supported in a certain disk series. Go find that out.

That's on top of the driver/firmware requirements you have come to know and love.

Gotcha #7

Everybody knows NVMe is wicked fast and it's the future - but only a handful of drives are available today on the VSAN HCL. As far as I can tell, they are all Intel drives - HP just happens to sell them too and provide their own firmware release. I'm told Samsung and others are coming.


I know the VSAN team is hard at work trying to make this process easier. It's not easy to squeeze all the performance out of the wide variety of devices out there. Add to this the inherent human inefficiencies and costs associated with certifying and supporting all vendor hardware combinations, and you can imagine how difficult things can be.

My hope is that the very cool things that have stemmed from trying to help VSAN administrators will make it into regular vSphere. I think the VSAN HCL check in particular, part of the included VSAN Health Check, should be easy to port, and a welcome addition for all of us who manage VMware HCLs (which is everybody...).

In the meantime, this particular PowerCLI script looks promising, as long as we can find the locations of the "regular" HCL and the corresponding HCL JSON files. I wonder if someone has already thought of that and been able to get it to work? It would sure make a nice addition to my documentation templates effort!
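
For what it's worth, I believe the VSAN HCL database itself is fetchable as JSON - the same file the VSAN Health Check consumes (the URL is the one in use at the time of writing and may move):

# download the VSAN HCL database in JSON form
wget https://partnerweb.vmware.com/service/vsan/all.json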

Friday, April 29, 2016

Storage vMotion of a converted template VM that has VDS fails with "apply storage drs recommendations invalid configuration for device X"

Now this was a weird one.

I was making some space in one of my datastore clusters and decided I would move some templates to another datastore cluster. These were old templates that I hadn't touched in easily a year. I converted them to virtual machines and kicked off a storage vMotion. Right as it was finishing, I got this error:

[screenshot: "apply storage drs recommendations - invalid configuration for device X" error]

What the? I had never seen that error. Everything looked normal. After it failed a second time, I decided to Google and found this thread helpful:

https://communities.vmware.com/thread/393779?start=0&tstart=0

Particularly because these templates had a NIC connected to a VDS and they hadn't been used in a long time. However, "refreshing" the port did not resolve the problem.

The recommended solutions are to move the NIC to a vSS or to re-clone the machine, but I don't have any vSS in this environment, and ain't nobody got time for re-cloning a template. I offer a third option: I removed the NIC and re-added it. Voila, the storage vMotion now succeeded:

[screenshot: successful storage vMotion task after removing and re-adding the NIC]

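If you'd rather script the remove/re-add than click through the GUI, here's a rough PowerCLI sketch (the VM and portgroup names are hypothetical; do this while the template is converted to a VM):

#remove the existing NIC from the converted VM
Get-VM my_template_vm | Get-NetworkAdapter | Remove-NetworkAdapter -Confirm:$false

#re-add a NIC on the same VDS portgroup
Get-VM my_template_vm | New-NetworkAdapter -NetworkName "my_vds_portgroup" -Type Vmxnet3 -StartConnected
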
Who knew there was such a thing as expiring VDS port reservations? While it's not explained like this in KB 2006809, and this is ESXi 5.5, that is the only way I can explain why an operation that had never failed before suddenly fails.

There's definitely a gotcha in there, and that's what this blog is about :)

Monday, February 15, 2016

vSphere template "convert to virtual machine" option grayed out

This is apparently an old bug. It seems to be triggered after vCenter and ESXi updates. You can read about it here:

https://communities.vmware.com/thread/394287?start=0&tstart=0 (started in 2012)
http://www.vbrain.info/2014/04/26/cannot-convert-template-to-virtual-machine/

The KB 2037005 you will find is utterly useless - this is not a permissions issue, and removing/re-adding the template is often ineffective. The manual workaround in the two links above is to deploy a VM, after which the option magically reappears.

I did, however, find a better way in a post from 2009 (!) by Arne Fokkema:

http://ict-freak.nl/2009/08/06/vsphere-deploy-template-grayed-out/

I can confirm this works. If you don't want to paste blindly, here are the related PowerCLI commands. Italics is a variable; inside brackets is a keystroke:


#Connect to a vCenter, it will prompt for your credentials or use AD integration
Connect-VIServer hostname [enter]

#Get a list of templates (more options, such as filtering to a single cluster, at the link here)
Get-Template [enter]

#Convert template to VM (again, remember it's not a permissions problem, this is just a bug)
Set-Template template_name -ToVM [enter]


You will now see a successful task in vCenter that converted the template to a VM. To move it back with PowerShell, you will need a variable so you can use .MarkAsTemplate() instead of defining a new template with New-Template:

$vm_to_tpl = Get-VM template_name | Get-View [enter]
$vm_to_tpl.MarkAsTemplate() [enter]

#remove the variable you created, notice no $ sign
remove-variable vm_to_tpl [enter]


However, his script probably works perfectly fine and is a very automated way of doing it, so if you have a lot of templates, customize it, test it, save it to a .ps1 and watch the magic.

I did remove Arne's -RunAsync from mine, since I didn't want the script trying to perform a task while vCenter was still performing the conversion.

Tuesday, February 9, 2016

vCenter 6 multi-site installation error

The first vCenter was an embedded VCSA, and I was careful with DNS, so everything went all right.

The second vCenter was an embedded Windows install that had to join the SSO domain and set up a new site.


Got this error:

[screenshot: installer error dialog]

The top part of the error read "install.vmafd.vmdir_vdcpromo_error" - it failed to run vdcpromo.


Sean Whitney's post is the only help I found from a Google search on this error (and yes, my installation also showed the generic error 1603 he mentions):


http://www.virtually-limitless.com/troubleshooting/single-sign-on-or-platform-services-controller-psc-fails-to-install-or-upgrade-with-error-code-1603/


In my case, though, the path where I saw logs was C:\ProgramData\VMware\vCenterServer\logs, and they only persisted until the installer finished bundling them into a zip file.


I ruled out:


1) That my SSO password had a ! sign: I changed it to a @ and the error still happened, so that wasn't it. It sounded weird anyway, since we've used VMware1! in VMware labs forever...


2) On my second try, I got a message that said this Windows VM and the VCSA were off by 22 seconds.

Thanks to Google, I found the commands to check NTP from the VCSA console:

Command> ntp.test --server time.windows.com
Status:
   Status: red
   Messages:
     Configuration:
         Message: Failed to reach 'time.windows.com'.
         Result: failure

Well, well - let's change that:

Command> ntp.server.delete --server time.windows.com
Command> ntp.server.add --server 0.pool.ntp.org
Command> ntp.test --server 0.pool.ntp.org
Status:
   Status: green
   Messages:
     Configuration:
         Message: '0.pool.ntp.org' is reachable.
         Result: success
Command> ntp.get
Config:
   Status: Up
   Servers: 0.pool.ntp.org

I went to my PDC and updated its time server to match that NTP server:

w32tm /config /manualpeerlist:0.pool.ntp.org /syncfromflags:manual /reliable:yes /update

and rebooted the PDC - the System event logs confirm it's now syncing to this server.
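
You can also confirm the time source without digging through the event logs; w32tm reports it directly:

w32tm /query /source
w32tm /query /status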

I cancelled the installation, rebooted the vCenter and tried again.

Now it complains I'm off by 16 seconds. Ayayayay. Then it said the clocks differ by 5 seconds.

I'm pissed by now.

Let's try to get this VCSA to sync directly from the lone PDC - I can't make it any faster than my freaking LAN.

On the PDC, if you hadn't already, make sure it's a reliable time server:

w32tm /config /reliable:yes

On the VCSA - enough with the niceties, give me the shell: set the server in /etc/ntp.conf, stop and start the NTP service, then check with ntpq -p (like Ganesh does here) to make sure everything is working as intended.
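
A minimal sketch of that sequence from the appliance shell (the PDC address is hypothetical, and the NTP service name can vary between VCSA builds):

Command> shell.set --enabled True
Command> shell

# point ntpd at the PDC (IP is an example)
echo "server 192.168.1.10" >> /etc/ntp.conf

# restart the NTP daemon (service may be named ntp or ntpd depending on the build)
service ntpd stop
service ntpd start

# verify peers and sync status
ntpq -p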

And it didn't complain this time. It went smoothly until completion.

So that big error was really an NTP error.

If that's not a gotcha, I don't know what is!