Sunday, February 17, 2019

Reclaiming disks for vSAN when they used to host a datastore, especially one with a coredump or scratch location that prevents deletion

Typically, if you want to reuse a disk that was hosting a datastore for vSAN, you delete the datastore and for good measure use the UI to erase partitions, and life is good.

In some cases, you may get an error when deleting the datastore. I find this happens mostly in homelabs, but I've also seen posts about it in the VMTN communities when googling the topic, so it's not uncommon. To be fair, this is an old phenomenon; it can happen whether or not your intention is to reuse the disk for vSAN :) This particular blog post was written with ESXi 6.7u1.

Especially when ESXi is installed to "remote" media such as USB and SD cards, the first datastore will also automatically be configured as the location for the core dump and the scratch space. This can even happen post-installation, right after (or on the next reboot after) the datastore is created, because ESXi wants a local disk for these locations. A more in-depth explanation can be found in the ESXi installation guide, in KBs such as https://kb.vmware.com/s/article/1020668, and elsewhere on the web.

You will not be able to delete the datastore or erase all partitions until those two settings are changed. To do this, I prefer opening an SSH session to the host and running the following commands:

esxcli storage filesystem list

This lists your datastores, and provides the Datastore UUID; we will focus on Datastore1, the one I couldn't delete:

Mount Point                                        Volume Name                  UUID                                 Mounted  Type            Size          Free
-------------------------------------------------  ---------------------------  -----------------------------------  -------  ------  ------------  ------------
/vmfs/volumes/900eb6ff-a901e725                    LenovoEMC_PX4-300D_NFS_ISOs  900eb6ff-a901e725                       true  NFS     211244736512  161516716032
/vmfs/volumes/5be9ba64-49b90678-5ec4-f44d3065284a  Datastore1                   5be9ba64-49b90678-5ec4-f44d3065284a     true  VMFS-6  255818989568  254282825728
/vmfs/volumes/0a71fde6-7fce32f8-8357-9857d9c81feb                               0a71fde6-7fce32f8-8357-9857d9c81feb     true  vfat       261853184     113819648
/vmfs/volumes/d973d9e5-0b4c944c-4341-5608ca2f3424                               d973d9e5-0b4c944c-4341-5608ca2f3424     true  vfat       261853184     107634688
/vmfs/volumes/5c476983-01be6fdc-53a3-f44d3065284a                               5c476983-01be6fdc-53a3-f44d3065284a     true  vfat       299712512     116998144

Changing the Scratch location

You can run this simple command to confirm which datastore is hosting the Scratch:

cd /scratch
[root@esxihost:/vmfs/volumes/5be9ba64-49b90678-5ec4-f44d3065284a/.locker] 

Or, for a slightly more detailed answer, vim-cmd hostsvc/advopt/view ScratchConfig.ConfiguredScratchLocation:

[root@esxihost:~] vim-cmd hostsvc/advopt/view ScratchConfig.ConfiguredScratchLocation
(vim.option.OptionValue) [
   (vim.option.OptionValue) {
      key = "ScratchConfig.ConfiguredScratchLocation",
      value = "/vmfs/volumes/5be9ba64-49b90678-5ec4-f44d3065284a/.locker"
   }
]

We can change the scratch location in the UI (under advanced options) or from the command line. If this is for production, you really want to read https://kb.vmware.com/s/article/1033696 and set a network location, with each host having a dedicated folder. This is critical when you are installing ESXi to USB/SD media, which is considered remote and has terrible write endurance!
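For example, if you already have a shared persistent datastore mounted, a minimal sketch (the datastore and folder names below are placeholders) is to create a dedicated folder per host and point the advanced option at it:

# create a per-host folder on a persistent, shared datastore
mkdir /vmfs/volumes/SharedDatastore/.locker-esxihost01
# point the scratch location at that folder (takes effect after a reboot)
vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /vmfs/volumes/SharedDatastore/.locker-esxihost01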

But in my case this is for my homelab, so I'm going to use a "bogus" location, /tmp. The following command comes from the KB, which does a great job of listing several options, including PowerCLI. You would change the part after "string" to an actual datastore location. Again, don't use /tmp in production: /tmp is wiped on every reboot, and you could lose all scratch files exactly when you need them most!

vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /tmp

The setting requires a reboot to take effect. You will get a "System logs on host esxihost.ariel.lab are stored on non-persistent storage." alert until you set a proper scratch location (check the KB again on how to set one up).
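If you later do have persistent storage available, a rough sketch of pointing the system logs at it directly (the datastore path is a placeholder; the KB covers the full procedure) would be:

# send syslog to a persistent location and confirm the change
esxcli system syslog config set --logdir=/vmfs/volumes/SharedDatastore/logs/esxihost01
esxcli system syslog reload
esxcli system syslog config get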

Changing the CoreDump location

To check where your core dump is configured, you can run these commands:

esxcli system coredump file get
esxcli system coredump network get
esxcli system coredump partition get
   Active: t10.SanDisk_Cruzer_Fit______4C530012450221105421:9
   Configured: t10.SanDisk_Cruzer_Fit______4C530012450221105421:9

In this particular case, ESXi is meant to run from a USB disk, and the last command confirms that the coredump is configured on the USB disk. If it were mapped to the datastore, you would need to change it and then reboot the host for the change to take effect.

You can use the file, network or partition option and a variety of list and set commands; you will need to reboot after setting a new location for the change to take effect. There's a good blog post with screenshots covering this, and another that is a bit more advanced. Once you have a network location set up, you can use this command to "unconfigure" the dump partition:

esxcli system coredump partition set -u
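As a rough sketch, assuming a netdump collector (such as the one bundled with vCenter) listening on the default port 6500, and with placeholder values for the vmk interface and IP, configuring and verifying a network coredump looks like this:

# point the host at the network dump collector, enable it, then test connectivity
esxcli system coredump network set --interface-name vmk0 --server-ipv4 192.168.1.10 --server-port 6500
esxcli system coredump network set --enable true
esxcli system coredump network check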



We should be able to delete the datastore now, since nothing special is using it anymore; it's just a datastore. The disk can then be reused for vSAN. But what happens if you still can't delete it?

Check partitions

You can take a different approach by listing the disks and their partitions, and then figuring out what they are:

ls /vmfs/devices/disks/
t10.ATA_____INTEL_SSDSC2BX400G4R______________________BTHC5215055Y400VGN  vml.0100000000202042544843353231353035355934303056474e494e54454c20
t10.NVMe____PLEXTOR_PX2D256M8PeGN____________________EB88003056032300     vml.010000000034433533303031323435303232313130353432314372757a6572
t10.NVMe____PLEXTOR_PX2D256M8PeGN____________________EB88003056032300:1   vml.010000000034433533303031323435303232313130353432314372757a6572:1
t10.SanDisk_Cruzer_Fit______4C530012450221105421                          vml.010000000034433533303031323435303232313130353432314372757a6572:5
t10.SanDisk_Cruzer_Fit______4C530012450221105421:1                        vml.010000000034433533303031323435303232313130353432314372757a6572:6
t10.SanDisk_Cruzer_Fit______4C530012450221105421:5                        vml.010000000034433533303031323435303232313130353432314372757a6572:7
t10.SanDisk_Cruzer_Fit______4C530012450221105421:6                        vml.010000000034433533303031323435303232313130353432314372757a6572:8
t10.SanDisk_Cruzer_Fit______4C530012450221105421:7                        vml.010000000034433533303031323435303232313130353432314372757a6572:9
t10.SanDisk_Cruzer_Fit______4C530012450221105421:8                        vml.0100000000454238385f303033305f353630335f3233303000504c4558544f
t10.SanDisk_Cruzer_Fit______4C530012450221105421:9                        vml.0100000000454238385f303033305f353630335f3233303000504c4558544f:1

By identifying the disk, we can explore its partition table. Very important: note that the device path must be wrapped in quotes!

partedUtil getptbl "/vmfs/devices/disks/t10.NVMe____PLEXTOR_PX2D256M8PeGN____________________EB88003056032300"
gpt
31130 255 63 500118192
1 2048 500115456 AA31E02A400F11DB9590000C2911D1B8 vmfs 0

This output tells us there's a VMFS partition. In this case, since I've already moved the scratch and coredump, their partitions no longer show; before, you would have seen additional partitions of type vmkDiagnostic alongside the VMFS datastore.

So if you've already moved the scratch and coredump partitions and you still can't delete the datastore, it may have been used for other things, such as HA heartbeating. You will have to work through the partitions to figure out what they are. A good KB to read on this is https://kb.vmware.com/s/article/2147177.
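If you do need to remove leftover partitions by hand (the KB walks through this), partedUtil is the tool. Here is a sketch using the Plextor device from above; the partition number is just an example, so double-check it against the getptbl output before deleting anything:

# list the partitions and note their numbers
partedUtil getptbl "/vmfs/devices/disks/t10.NVMe____PLEXTOR_PX2D256M8PeGN____________________EB88003056032300"
# delete a specific partition by number (7 is a hypothetical example)
partedUtil delete "/vmfs/devices/disks/t10.NVMe____PLEXTOR_PX2D256M8PeGN____________________EB88003056032300" 7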

Once you only have the VMFS partition you should be able to delete it, since nothing special is using it anymore; it's just a plain datastore. The disk can now be re-used for vSAN. 

Sunday, November 5, 2017

Disabling vSAN and reclaiming disks when vCenter is not available

Enabling vSAN is very easy through the vSphere GUI. As long as you have unused disk devices and have set up vSAN traffic on a vmk interface, you're basically good to go (although, for enterprise deployments, it's important to pay attention to the HCL and design documentation).

vSAN in fact exists perfectly well without a vCenter; vCenter is just the easy-to-use UI. However, if vCenter is lost you will need to resort to esxcli commands to change vSAN settings, since the standalone HTML5 ESXi host interface does not expose vSAN operations.

This situation has happened to me when messing around in the homelab (where my vCenters are short-lived, i.e., I deleted one before I disabled vSAN) and in a case where I bought homelab hardware from a friend and the vCenter didn't make the trip. Since he did provide the root password to the hosts, I was able to SSH into each host and run esxcli commands.

If disks were used by vSAN and that configuration was not properly undone, you will find that those disks cannot be reused in a new vSAN configuration, or even to create plain datastores. I believe that information is recorded on each disk, meaning even reinstalling ESXi would not clear said config.

The "delete partition" commands available through the standalone html5 interface will fail because vSAN is still running and is protecting the disks. So, before I re-install my hosts or try to re-use the storage devices, I run the below commands.

I saw other posts that achieve the same thing using fdisk and partedUtil instead of esxcli vsan commands. That would be the brute-force method; below is a much simpler and safer way that I know works in ESXi 6.5.


There are two phases.

1) The first phase is removing the host from a vSAN cluster. Check if the host believes it's in a vSAN cluster with

esxcli vsan cluster get

Remove the host from said cluster with

esxcli vsan cluster leave

This command can take a while to take effect, so be patient. The host ceases collaborating with its cluster, and running the get command again should show that the host is no longer a member of a vSAN cluster.

esxcli vsan cluster get

Virtual SAN Clustering is not enabled on this host

At this point the vsanDatastore will no longer show in the host's storage, but we aren't finished!

2) The second phase is clearing vSAN config from the disks so they can be re-used. Check if vSAN "owns" the disks with 

esxcli vsan storage list

From the list, you want to identify the cache disk, typically the best performing SSD, and copy the device name. 

naa.50026b724712194a
   Device: naa.50026b724712194a
   Display Name: naa.50026b724712194a

   Is SSD: true
...

  Is Capacity Tier: false

The gotcha - you can't perform manual operations on vSAN disks if they were claimed automatically when vSAN was configured. You must run this command to disable that auto-claiming before proceeding (use esxcli vsan storage automode get to check if you need to do this).

esxcli vsan storage automode set --enabled false

Now enter this command with the cache device name

 esxcli vsan storage remove -s naa.50026b724712194a

The -s flag means SSD; the command also accepts regular disks with -d. The thing to know is that removing the cache disk removes the whole disk group (i.e., it takes care of all the capacity disks too), so it's faster to just remove the cache disk for each disk group.

This will take a while as well, but after it completes, issuing esxcli vsan storage list again should return no output (assuming you removed all disk groups).
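As a quick sanity check (a sketch; vdq is a handy but unsupported diagnostic tool that ships with ESXi), you can confirm nothing was left behind and that the disks are now seen as usable:

# the host should report no vSAN cluster membership and no claimed storage
esxcli vsan cluster get
esxcli vsan storage list
# vdq -q shows per-disk eligibility; reclaimed disks should no longer show as in use by vSAN
vdq -q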


I hope this helps anyone learning and playing with vSAN. Any corrections or suggestions, please reach out to me on twitter.

References to create this post (I do recommend them, they explain more):



  • William Lam has a great post detailing how even after disabling vSAN through vCenter, you may need to perform steps to reuse the disks.
  • jmalpadw gave a great answer in the VMTN community forums which is basically this post's commands, I just added more explanation and example outputs.

Wednesday, November 23, 2016

#vDM30in30 11-13-2016 Host fingerprints in PuTTY don't match what ESXi shows

This was an interesting "gotcha", courtesy of my colleague @edmsanchez13, which I know happens on fully patched 5.5 and 6.5 hosts.

PuTTY by default only shows SSH key fingerprints in the MD5 format, such as:



(By the way, you can see all SSH keys PuTTY has learned by going in regedit to HKEY_CURRENT_USER\Software\SimonTatham\PuTTY\SshHostKeys. If you want to see the fingerprint prompt for a known host again, as if you were connecting for the first time, just delete its entry from there.)

It used to be that you could confirm this MD5 hash in your host console. You should see this on the host console, in the Troubleshooting / View Support Information section:



Notice how the SSH key in MD5 format (aa:bb:cc:dd...) from PuTTY does not match the SSH key in SHA256 format shown on the host? That's because since OpenSSH 6.8 "the fingerprint is now displayed as base64 SHA256 (by default)"; "The default changes from MD5 to SHA256 and format from hex to base64." This means ESXi now uses the SHA256 format as well when presenting the SSH key fingerprint to you.

ESXi uses OpenSSH (as does the rest of the world, thanks to OpenBSD) and is correct in leaving this default on. All ssh binaries are in this directory, and you can check the OpenSSH version with -V:

/usr/lib/vmware/openssh/bin] ssh -V
OpenSSH_7.3p1, OpenSSL 1.0.2j-fips  26 Sep 2016

I can't find an option for PuTTY to show me the new SHA256 fingerprint, so how is anyone on Windows proving the SSH public key hash is correct before connecting to a host?

Off the bat, I can think of two ways:

1) Confirm from the host's console

You can verify the SSH fingerprint that PuTTY shows you by asking the host for the MD5 fingerprint; this is done with the following command, using the stored host keys:

/usr/lib/vmware/openssh/bin] ./ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key.pub -E md5
2048 MD5:41:dd:b9:ec:ba:c0:ae:c7:9a:2a:21:f7:fd:23:96:91 no comment (RSA)
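And if you want the SHA256 form instead, to compare against what the DCUI shows, the same tool should give it to you with -E sha256:

/usr/lib/vmware/openssh/bin] ./ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key.pub -E sha256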

2) Find a client that uses the new format. I had no problem from Ubuntu, for example:
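If you want a Linux client to show the old MD5 form so you can compare against PuTTY, OpenSSH 6.8+ has a FingerprintHash option. A quick sketch, with the hostname being a placeholder:

# default since OpenSSH 6.8: fingerprints are shown as SHA256
ssh root@esxihost
# ask for the legacy MD5 fingerprint instead, to match what PuTTY displays
ssh -o FingerprintHash=md5 root@esxihost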




However, if you are on Windows, you are most probably using PuTTY (unless you are paying for SecureCRT). If PuTTY won't show the new SHA256 fingerprints, what clients will? I found one that is free to use, even inside organizations, called Bitvise. This client shows both MD5 and SHA256 fingerprints; additionally, it seems to be quite handy as it immediately brings up a WinSCP-like window for file transfers, so I'll be testing it as my SSH client going forward.




Good links that helped me write this post:
http://superuser.com/questions/929566/sha256-ssh-fingerprint-given-by-the-client-but-only-md5-fingerprint-known-for-se
http://www.phcomp.co.uk/Tutorials/Unix-And-Linux/ssh-check-server-fingerprint.html

Saturday, August 20, 2016

storage vMotion fails with "failed to look up vmotion destination resource pool object"

This is a fun one because it's a storage vMotion over Fibre Channel, so the error message doesn't appear to make much sense. Normally this error indicates a network problem, but on FC, if you can clearly see both datastores and open and browse them, you would assume everything is OK.

The host this was happening on was version 5.0 and had several days of uptime. I suspected the host. I moved the VM to a 5.5 host (which we were migrating to anyway) and the storage vMotion completed without problems.

Googling, I found that the related KB is 1003734. Reading it, you find a lot of troubleshooting information for compute (host-to-host) vMotion, but not as much for storage vMotion. I did find this, though:


  • This issue may be caused by SAN configuration. Specifically, this issue may occur if zoning is set up differently on different servers in the same cluster.

I believe this may be the problem. I am not sure why I can browse the datastores fine, but since moving the VM from one host to another solves it, I assume either some service has failed on this long-running host on a soon-to-be-unsupported version, or there is indeed a SAN zoning problem.
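One way to test the zoning theory (a sketch; the device identifier below is a placeholder) is to compare the paths each host sees for the LUN backing the destination datastore. A host zoned differently will show a different path count or different targets:

# run on each host in the cluster and compare the outputs
esxcli storage core path list -d naa.xxxxxxxxxxxxxxxx
esxcli storage nmp device list -d naa.xxxxxxxxxxxxxxxx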

Hope it helps someone!

Tuesday, May 3, 2016

Differences between the normal HCL and the VSAN HCL

TL;DR: I do recommend you use a VSAN Ready Node if you can.

After seeing firsthand what @virtualhobbit went through, I was amazed by how many "gotchas" are involved with the VSAN HCL, especially if you are trying to deploy this in your homelab. This blog is about VMware gotchas, so a post was in order. I hope this can help others avoid some of these mistakes.

Here's a (non-definitive) list:

Gotcha #1

In case you have been living under a rock, you should know that a device on the standard VMware HCL does not automatically qualify for use with VMware VSAN. There is, in fact, a separate HCL. Failure to know this before deploying VSAN will cause you lots of misery. The VSAN HCL has details for I/O controllers, HDDs and SSDs; for everything else, assume the normal HCL is valid, at least until those devices show up in the VSAN HCL...

Gotcha #2

Once you access the VSAN-specific HCL option from the drop-down, the interface does not resemble the normal HCL, where you can search for components (especially by the four handy PCI device registers: VID, DID, SVID and SSID). You are greeted with a page that allows you to select a "VSAN Ready Node", which is a pre-configured full server configuration. In this example, I checked all Dell servers compatible with VSAN 6.0 U2:



Right. So the idea is you then go tell your server vendor "sell me exactly this". But what if you are looking for a specific hardware device, such as an SSD or RAID controller? Where would you check the HCL to choose what to buy?

Gotcha #3

Somewhat hidden below the VSAN Ready Node selector interface is a disclaimer that tells you that if you are willing to "brave the path of building your own server" (I paraphrase), you get to see the actual HCL:




Gotcha #4

The URL to the real VSAN HCL? It's exactly the same as the URL for the Ready Node HCL. I know. I got a tip in the vExpert Slack: to access it directly, go to http://vmwa.re/vsanhclc

Gotcha #5

Ehm... where are the four hardware identifiers we've been relying on for so long to unequivocally verify the HCL? They aren't used here, at least not for the initial search for devices. You will have to browse first using the available options. You will be able to verify the PCI registers from the results, but sometimes you get little jewels like this:

"The device PID string (model) is truncated in ESXi. Please use both model number and firmware version when trying to identify the device. When in doubt, please consult with the hardware provider."

I'd like to offer a screenshot of this but I don't have one (I found this note on the Intel P3700 NVMe drive, which is definitely a PCI device). If you do, please send one, as I'm really curious why an lspci -v would truncate the PCI registers.
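For the record, pulling the four identifiers from an ESXi host itself is straightforward. A sketch (vmkchdev is an unsupported but commonly available helper):

# each device entry lists the Vendor ID, Device ID, SubVendor ID and SubDevice ID
esxcli hardware pci list
# a more compact view of the same IDs per device
vmkchdev -l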

Gotcha #6

Each type of device has different columns. Please be mindful, as these details can be extremely important. For example:

  • certain I/O (RAID) controllers are only supported in a particular mode and have specific VID, DID, SVID and SSID values.
  • certain SSDs can only be used in a particular Tier (All Flash, Hybrid Caching, etc)
  • certain capacity drives are only supported in a certain disk series. Go find that out.

This is on top of the driver/firmware requirements you have come to love.

Gotcha #7

Everybody knows NVMe is wicked fast and it's the future, but only a handful of drives are available today on the VSAN HCL. As far as I can tell, they are the Intel drives; HP just happens to sell them too and provides its own firmware release. I'm told Samsung and others are coming.


I know the VSAN team is hard at work trying to make this process easier. It's not easy to squeeze all the performance out of the wide variety of devices out there. Add to this the inherent human inefficiencies and costs associated with certifying and supporting all vendor hardware combinations, and you can imagine how difficult things can be.

My hope is that the very cool things that have stemmed from trying to help VSAN administrators will make it into regular vSphere. In particular, I think the VSAN HCL check, part of the included VSAN Health Check, should be easy to port and would be a welcome addition for all of us who manage VMware HCLs (which is everybody...).

In the meantime, this particular PowerCLI script looks promising, as long as we can find the "regular" and corresponding HCL JSON file locations. I wonder if someone has already thought of that and gotten it to work? It would sure make a nice addition to my documentation templates effort!

Friday, April 29, 2016

Storage vMotion of a converted template VM that has VDS fails with "apply storage drs recommendations invalid configuration for device X"

Now this was a weird one.

I was making some space in one of my datastore clusters and decided I would move some templates to another datastore cluster. These were old templates that I hadn't touched in easily a year. I converted them to virtual machines and kicked off a storage vMotion. Right as it was finishing, I got this error:



What the? I had never seen that error. All looked normal. After it failed a second time, I decided to Google and found this thread helpful

https://communities.vmware.com/thread/393779?start=0&tstart=0

It seemed particularly relevant because these templates had a NIC connected to a VDS and they hadn't been used in a long time. However, "refreshing" the port did not resolve the problem.

The recommended solutions are to move the NIC to a vSS or re-clone the machine, but I don't have any vSS in this environment, and ain't nobody got time for re-cloning a template. I offer a third option: I removed the NIC and re-added it. Voila, the storage vMotion now succeeded:



Who knew there was such a thing as expiring VDS port reservations? While it's not explained like this in KB 2006809, and this is ESXi 5.5, that is the only way I can explain why an operation that had never failed before suddenly fails.

There's definitely a gotcha in there, and that's what this blog is about :)

Monday, February 15, 2016

vSphere template "convert to virtual machine" option grayed out

This is apparently an old bug. It seems to be triggered after vCenter and ESXi updates. You can read about it here:

https://communities.vmware.com/thread/394287?start=0&tstart=0 started in 2012
http://www.vbrain.info/2014/04/26/cannot-convert-template-to-virtual-machine/

The KB you will find, 2037005, is utterly useless: this is not a permissions issue, and removing/re-adding the template can be very ineffective. The manual solution in the two links above is to deploy a VM, and magically the option will reappear.

I did, however, find a better way, in a post from 2009 (!) by Arne Fokkema:

http://ict-freak.nl/2009/08/06/vsphere-deploy-template-grayed-out/

I can confirm this works. If you don't want to paste blindly, here are the related PowerCLI commands. Replace hostname and template_name with your own values; text inside brackets is a keystroke:


#Connect to a vCenter, it will prompt for your credentials or use AD integration
Connect-VIServer hostname [enter]

#Get a list of templates (there are more filtering options, such as limiting to a single cluster)
Get-Template [enter]

#Convert template to VM (again, remember it's not a permissions problem, this is just a bug)
Set-Template template_name -ToVM [enter]


You will now see a successful task in vCenter that converted the template to a VM. To move it back with PowerShell, you will need a variable so you can use .MarkAsTemplate() instead of defining a template with New-Template:

$vm_to_tpl = Get-VM template_name | Get-View [enter]
$vm_to_tpl.MarkAsTemplate() [enter]

#remove the variable you created, notice no $ sign
remove-variable vm_to_tpl [enter]


However, his script probably works perfectly fine and is a very automated way of doing it, so if you have a lot of them, customize it, test it, save it to a .ps1 and watch the magic.

I did remove Arne's -RunAsync from mine, since I didn't want the script to try to perform a task while vCenter was still performing the conversion.