Wednesday, November 23, 2016

#vDM30in30 11-13-2016 Host fingerprints in PuTTY don't match what ESXi shows

This was an interesting "gotcha" (thanks to my colleague @edmsanchez13), and I know it happens on fully patched ESXi 5.5 and 6.5.

PuTTY by default only shows SSH key fingerprints in the MD5 format, which is colon-separated hex along the lines of aa:bb:cc:dd.



(By the way, you can see all the SSH host keys PuTTY has learned by opening regedit and going to HKEY_CURRENT_USER\Software\SimonTatham\PuTTY\SshHostKeys. If you want to see the fingerprint for a known host again, as if you were connecting for the first time, just delete its entry from there.)

It used to be that you could confirm this MD5 hash on your host console. You should see the fingerprint on the host console, in the Troubleshooting, View Support Information, section:



Notice how the SSH key fingerprint in MD5 format (aa:bb:cc:dd) from PuTTY does not match the fingerprint in SHA256 format shown on the host? That's because since OpenSSH 6.8 "the fingerprint is now displayed as base64 SHA256 (by default)", "The default changes from MD5 to SHA256 and format from hex to base64." This means ESXi now uses the SHA256 format as well to present the SSH key fingerprint to you.
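To make the difference concrete, here is a minimal Python sketch of how the two fingerprint formats are derived from the same public key. It assumes the usual OpenSSH public key line layout (key type, base64 blob, optional comment); the function names are mine and are not part of PuTTY or ESXi.

```python
import base64
import hashlib


def key_blob(pubkey_line: str) -> bytes:
    """Return the raw key bytes from a line like 'ssh-rsa AAAA... comment'."""
    fields = pubkey_line.split()
    # Field 0 is the key type, field 1 is the base64-encoded key blob.
    return base64.b64decode(fields[1])


def md5_fingerprint(blob: bytes) -> str:
    """Old-style fingerprint: MD5 digest as colon-separated hex pairs."""
    digest = hashlib.md5(blob).hexdigest()
    pairs = [digest[i:i + 2] for i in range(0, len(digest), 2)]
    return "MD5:" + ":".join(pairs)


def sha256_fingerprint(blob: bytes) -> str:
    """New-style fingerprint: SHA256 digest as base64 without '=' padding."""
    digest = hashlib.sha256(blob).digest()
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")
```

Feeding the contents of a host's .pub file through key_blob() and then through both fingerprint functions should give you the two strings to compare against PuTTY and against the host console. Both are hashes of the same bytes, so neither can be converted into the other without the key itself.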

ESXi uses OpenSSH (as does the rest of the world, thanks to OpenBSD) and is correct in leaving this default on. All the ssh binaries are in /usr/lib/vmware/openssh/bin and you can check the OpenSSH version with -V

/usr/lib/vmware/openssh/bin] ssh -V
OpenSSH_7.3p1, OpenSSL 1.0.2j-fips  26 Sep 2016

I can't find an option for PuTTY to show me the new SHA256 fingerprint; so - how is anyone in Windows proving the SSH pub key hash is correct, before connecting to a host?

Off the bat, I can think of two ways:

1) Confirm from the host's console

You can verify the SSH fingerprint that PuTTY shows you by asking for the md5 fingerprint - this is done with this command, using the stored host keys:

/usr/lib/vmware/openssh/bin] ./ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key.pub -E md5
2048 MD5:41:dd:b9:ec:ba:c0:ae:c7:9a:2a:21:f7:fd:23:96:91 no comment (RSA)

2) Find a client that uses the new format. I had no problem from Ubuntu, for example




However, if you are in Windows, you are most probably using PuTTY (unless you are paying for SecureCRT). If PuTTY won't show the new SHA256 fingerprints, what clients will? I found one that is free to use, even inside organizations, called Bitvise. This client shows both MD5 and SHA256 fingerprints; additionally, it seems to be quite handy as it immediately brings up a WinSCP-like window for file transfers, so I'll be testing this as my SSH client going forward.




Good links that helped me do this post
http://superuser.com/questions/929566/sha256-ssh-fingerprint-given-by-the-client-but-only-md5-fingerprint-known-for-se
http://www.phcomp.co.uk/Tutorials/Unix-And-Linux/ssh-check-server-fingerprint.html

Saturday, August 20, 2016

storage vMotion fail with "failed to look up vmotion destination resource pool object"

This is a fun one because it's a storage vMotion on Fibre Channel, and the error message doesn't appear to make much sense. Normally this error indicates a network problem, but on FC, if you can clearly see both datastores and can open and browse them, you would assume everything is OK.

The host that this was happening on was version 5.0 and had several days of uptime. I suspected the host. I moved the VM to a 5.5 host (that we were migrating to anyways) and the storage vMotion was able to be carried out without problems.

Googling I found that the related KB is 1003734 . Reading it, you find a lot of troubleshooting information for compute, or host-to-host vMotion, but not as much for storage vMotions. I did find this though:


  • This issue may be caused by SAN configuration. Specifically, this issue may occur if zoning is set up differently on different servers in the same cluster.

I believe that this may be the problem. I am not sure why I can browse the datastores ok, but since moving it from one host to another solves it, I assume either some service has failed on this long running host on a soon to be unsupported version, or there is indeed a SAN zoning problem.

Hope it helps someone!

Tuesday, May 3, 2016

Differences between the normal HCL and the VSAN HCL

TL;DR : I do recommend you use a VSAN Ready Node if you can.

After seeing first hand what @virtualhobbit went through, I was just amazed by how many "gotchas" are involved with the VSAN HCL, especially if you are trying to deploy this in your homelab. This blog is about VMware gotchas, so a post was in order. I hope this can help others avoid some of these mistakes.

Here's a (non-definitive) list:

Gotcha #1

In case you have been living under a rock, you should know a device on the standard VMware HCL does not automatically qualify for use with VMware VSAN. There is, in fact, a separate HCL. Failure to know this before deploying VSAN will cause you lots of misery. The VSAN HCL has details for I/O Controllers, HDDs and SSDs. For everything else, assume the normal HCL is valid, at least until it shows up in the VSAN HCL...

Gotcha #2

Once you access the VSAN specific HCL option from the drop down, the interface does not resemble the normal HCL, where you can search for components (especially the four handy PCI device registers: VID, DID, SVID and SSID). You are greeted with a page that allows you to select a "VSAN Ready Node" which is a pre-configured full server configuration. In this example, I checked all Dell servers compatible with VSAN 6.0U2



Right. So the idea is you then go tell your server vendor "sell me exactly this". But what if you are looking for a specific hardware device, such as an SSD or RAID controller? Where would you check the HCL to choose what to buy?

Gotcha #3

Somewhat hidden below the VSAN Ready Node selector interface is a disclaimer that tells you that if you are willing to "brave the path of building your own server" (I paraphrase) you can get to see the actual HCL




Gotcha #4

The URL to the real VSAN HCL? It's exactly the same as the URL for the Ready Node HCL. I know. I got a tip in the vExpert slack - to access it directly, go to http://vmwa.re/vsanhclc

Gotcha #5

Ehm... where are the four hardware identifiers we've been relying on, for so long, to unequivocally verify the HCL? They're not used here, at least not for the initial search for devices. You will have to browse first using the available options. You will be able to verify the PCI registers from the results, but sometimes you get little jewels like this

"The device PID string (model) is truncated in ESXi. Please use both model number and firmware version when trying to identify the device. When in doubt, please consult with the hardware provider."

I'd like to offer a screenshot of this but I don't have one (I found this in the entry for the Intel P3700 NVMe drive, which is definitely a PCI device). If you do, please send one, as I'm really curious as to why an lspci -v would truncate the PCI registers.

Gotcha #6

Each type of device has different columns. Please be mindful as these details can be extremely important.  For example:

  • certain I/O (RAID) controllers are only supported in a particular mode and have specific VID, DID, SVID and SSID values.
  • certain SSDs can only be used in a particular Tier (All Flash, Hybrid Caching, etc)
  • certain capacity drives are only supported in a certain disk series. Go find that out.

This is apart from the driver/firmware requirements you have come to love.
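Since the four identifiers only show up in the results, the final check ends up being a manual comparison. Here is a rough Python sketch of that comparison. It assumes you collected the IDs on the host with something like vmkchdev -l and that a line looks like '0000:02:00.0 1000:005b 1028:1f38 vmkernel vmhba0' (that layout is my assumption, so check it against your own output). The HCL entries are a hypothetical list you would type in from the results page, not a real VMware data format.

```python
from typing import NamedTuple, Optional


class PciIds(NamedTuple):
    vid: str
    did: str
    svid: str
    ssid: str


def parse_vmkchdev_line(line: str) -> Optional[PciIds]:
    """Parse one line assumed to look like
    '0000:02:00.0 1000:005b 1028:1f38 vmkernel vmhba0'."""
    fields = line.split()
    if len(fields) < 3 or ":" not in fields[1] or ":" not in fields[2]:
        return None
    # Second field is VID:DID, third field is SVID:SSID (assumed layout).
    vid, did = fields[1].lower().split(":", 1)
    svid, ssid = fields[2].lower().split(":", 1)
    return PciIds(vid, did, svid, ssid)


def find_hcl_entries(device: PciIds, hcl_entries: list) -> list:
    """Return the hypothetical HCL entries whose four IDs all match."""
    return [entry for entry in hcl_entries if entry["ids"] == device]
```

An empty result from find_hcl_entries() only means none of the entries you typed in match all four values; the mode, tier and firmware columns still need checking by hand.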

Gotcha #7

Everybody knows NVMe is wicked fast and it's the future - but only a handful of drives are available today on the VSAN HCL. As far as I can tell, they are the Intel drives - HP just happens to sell them too and provide their own firmware release. I'm told Samsung and others are coming.


I know the VSAN team is hard at work trying to make this process easier. It's not easy to squeeze all the performance out from the wide variety of devices out there. Add to this the inherent human inefficiencies and costs associated with certifying and supporting all vendor hardware combinations and you can imagine how difficult things can be.

My hope is that very cool things that have stemmed from trying to help the VSAN administrators will make it into regular vSphere. I think particularly the VSAN HCL check, part of the included VSAN Health Check, should be easy to port, and a welcome addition for all of us that manage VMware HCLs (which is everybody...).

In the meantime, this particular PowerCLI script looks promising as long as we can find the "regular" and corresponding HCL JSON file locations. I wonder if someone has already thought of that and been able to get it to work? It would sure make a nice addition to my documentation templates effort!

Friday, April 29, 2016

Storage vMotion of a converted template VM that has VDS fails with "apply storage drs recommendations invalid configuration for device X"

Now this was a weird one.

I was making some space in one of my datastore clusters and decided I would move some templates to another datastore cluster. These were old templates that I hadn't touched in easily a year. I converted them to virtual machines and kicked off a storage vMotion. Right as it was finishing, I got this error



What the? I had never seen that error. All looked normal. After it failed a second time, I decided to Google and found this thread helpful

https://communities.vmware.com/thread/393779?start=0&tstart=0

Particularly because these templates had a NIC with a connection to a VDS and they hadn't been used in a long time. However, "refreshing" the port did not resolve the problem. 

The recommended solutions are to move the nic to a vSS or reclone the machine, but I don't have any vSS on this environment and ain't nobody got time for re-cloning a template. I offer a third option: I removed the NIC and re-added it. Voila, the storage vMotion now succeeded:



Who knew there was such a thing as expiring VDS port reservations? While it is not explained like this in KB 2006809, and this is ESXi 5.5, that is the only way I can explain why an operation that had never failed before suddenly fails.

There's definitely a gotcha in there, and that's what this blog is about :)

Monday, February 15, 2016

vSphere template "convert to virtual machine" option grayed out

This is apparently an old bug. Seems to be triggered after vCenter and ESXi updates. You can read about it here:

https://communities.vmware.com/thread/394287?start=0&tstart=0 started in 2012
http://www.vbrain.info/2014/04/26/cannot-convert-template-to-virtual-machine/

The KB 2037005 you will find is utterly useless - this is not a permissions issue and removing/re-adding the template can be very ineffective. The manual solution in the two links above is to deploy a VM and magically the option will reappear.

I did find however a better way, in a post from 2009! by Arne Fokkema

http://ict-freak.nl/2009/08/06/vsphere-deploy-template-grayed-out/

I can confirm this works. If you don't want to paste blindly, here are the related PowerCLI commands. Italics is a variable, inside brackets is a keystroke:


#Connect to a vCenter, it will prompt for your credentials or use AD integration
Connect-VIServer hostname [enter]

#Get a list of templates (more options, such as from a unique cluster, here)
Get-Template [enter]

#Convert template to VM (again, remember it's not a permissions problem, this is just a bug)
Set-Template template_name -ToVM [enter]


You will now see a successful task in vCenter that converted the template to a VM. To move it back with Powershell, you will need a variable so you can use .MarkAsTemplate() instead of defining a template with New-Template

$vm_to_tpl = Get-VM template_name | Get-View [enter]
$vm_to_tpl.MarkAsTemplate() [enter]

#remove the variable you created, notice no $ sign
remove-variable vm_to_tpl [enter]


However, his script probably works perfectly fine and is a very automated way of doing it, so if you have a lot of them, customize it, test it, save it to a ps1 and watch the magic. 

I did remove Arne's -RunAsync from mine since I didn't want the script to possibly try to perform a task while vCenter was still performing the conversion.

Tuesday, February 9, 2016

vCenter 6 multi site installation error

First vCenter was a VCSA embedded and I was careful with DNS so everything went all right.

Second vCenter was a Windows embedded that had to join the SSO domain and setup a new site.


Got the error:




"failed to run vdcpromo", and on the top part of the error: "install.vmafd.vmdir_vdcpromo_error"


Sean Whitney's post is the only help I found from a Google search on such an error (and yes, the installation failed with the generic error 1603 he mentions)


http://www.virtually-limitless.com/troubleshooting/single-sign-on-or-platform-services-controller-psc-fails-to-install-or-upgrade-with-error-code-1603/


The path where I saw logs was C:\ProgramData\VMware\vCenterServer\logs though, and they only persisted until the zip file was done bundling up.


I ruled out:


1) that my SSO password had a ! sign, changed it to a @ - it still happened, so that wasn't it... it sounded weird since we've used VMware1! in VMware labs forever...


2) In my second try I got a message that said this Windows VM and the VCSA were off by 22 seconds

Thanks to Google I found the commands to check NTP from the VCSA console

Command> ntp.test --server time.windows.com
Status:
   Status: red
   Messages:
     Configuration:
         Message: Failed to reach 'time.windows.com'.

         Result: failure

well well - let's change that

Command> ntp.server.delete --server time.windows.com
Command> ntp.server.add --server 0.pool.ntp.org
Command> ntp.test --server 0.pool.ntp.org
Status:
   Status: green
   Messages:
     Configuration:
         Message: '0.pool.ntp.org' is reachable.
         Result: success
Command> ntp.get
Config:
   Status: Up
   Servers: 0.pool.ntp.org

I went to my PDC and updated the time server to match the NTP server

w32tm /config /manualpeerlist:0.pool.ntp.org /syncfromflags:manual /reliable:yes /update

and rebooted the PDC - system event logs confirm it's now syncing to this server.

I cancelled the installation, rebooted the vCenter and tried again.

Now it complains I'm off by 16 seconds. Ayayayayay. Then I got that it differs by 5 seconds.

I'm pissed by now.

Let's try to get this VCSA to sync directly from the lone PDC - I can't make it faster than my freaking LAN.

On the PDC, if you hadn't already, make sure it's a reliable time server:

w32tm /config /reliable:yes

On the VCSA - enough with the niceties, give me the shell, set the server in /etc/ntp.conf and stop and start the ntp service, then check with ntpq -p like Ganesh does here to make sure everything is working as intended.

And it didn't complain this time. It went smoothly until completion.

So that big error was really an NTP error.
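If you'd rather measure the skew yourself before retrying the installer, here is a small Python sketch of a bare-bones SNTP query. It is only a sketch under assumptions: it sends a plain version 3 client request over UDP port 123, does no validation of the reply, and query_offset() and friends are names I made up for illustration.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def ntp_to_unix(seconds: int, fraction: int) -> float:
    """Convert a 32.32 fixed-point NTP timestamp to Unix time."""
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def server_times(packet: bytes) -> tuple:
    """Return (receive, transmit) timestamps from a 48-byte NTP reply."""
    words = struct.unpack("!12I", packet[:48])
    return ntp_to_unix(words[8], words[9]), ntp_to_unix(words[10], words[11])


def clock_offset(t0: float, t1: float, t2: float, t3: float) -> float:
    """Standard NTP offset: positive means the server clock is ahead."""
    return ((t1 - t0) + (t2 - t3)) / 2


def query_offset(server: str, timeout: float = 5.0) -> float:
    """Ask one NTP server and return its offset from the local clock."""
    request = b"\x1b" + 47 * b"\0"  # LI=0, version 3, mode 3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t0 = time.time()
        sock.sendto(request, (server, 123))
        packet, _ = sock.recvfrom(512)
        t3 = time.time()
    t1, t2 = server_times(packet)
    return clock_offset(t0, t1, t2, t3)
```

Running query_offset() against the same time source from the Windows vCenter and from any other machine gives you a number in seconds for each; the difference between the two is roughly the skew the installer complains about.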

If that's not a gotcha, I don't know what is!


Wednesday, February 3, 2016

Stopping a Workstation VM set to autologon from logging in automatically once moved to ESXi

Workstation has a feature (which it offers to you when you build a Windows OS virtual machine) called Autologin. You can access it manually via the VM, Settings, Options dialog


It basically remembers your login and password for you. Great if you restart the machine a lot, I guess :)

I made a machine, allowed autologin, and later moved it to an ESXi host in my lab. For some reason I thought the autologin was some feature of VMware Workstation. It's not - the VM continued doing the autologin. I then thought it may be in the VM advanced options but I couldn't find anything.

I searched Google but didn't find anything related to VMware Workstation or ESXi auto logon. The trick is that this is a Windows feature, not a VMware feature o.O

Check this KB - https://support.microsoft.com/en-us/kb/324737 - it explains how Autologon works. 

The easy way of disabling it is to open regedit, navigate to

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon

And change the AutoAdminLogon value from 1 (true) to 0 (false)

However, you should also clear any information in the other values mentioned in the article, such as the domain, the username and, if you find it, the password (DefaultDomainName, DefaultUserName and DefaultPassword).

It's not a good idea to enable it other than for testing purposes, but at least now I learned that VMware Workstation just sets a registry value and this is a Windows feature. In case you run into the same thing in your lab, now you know how to fix it easily :D

There was a gotcha in there (because I just didn't know and assumed) and that's what this blog is about.

Avoid losing Windows activation when moving a VM from one environment to another

You moved it. You didn't copy it.

Don't go breaking any laws now - but that's the way for the computer's OS not to detect hardware changes and trigger re-activation.

Ok, but what if you already messed up?

The online activation will not work. There's a lot of cases and it depends on your license and ... - but if you're doing perfectly legal things you have all the right to run your machine and not have to sit on the phone with Microsoft.

What you want to do is shut down the machine, move it back to where it was originally activated. Turn it on and this time it will activate properly online. Now shut it down and move it back to where you had problems. When you get that dialog: You moved it. You didn't copy it.

There's a gotcha there, and that's what this blog is all about.

Tuesday, January 26, 2016

Finding IP on a problematic vCenter session (10 second version)

A user complains vCenter is locking him out. He "turns off everything" but still the domain controller reports the vCenter IP for the lockout source (this is a 5.0 vCenter, I don't know if this changed later).

Checking vCenter the user doesn't have any processes (hey, it could be the case) but he does show up in the vCenter logs. Alas, I don't see an IP in the logs. I google why and I find these links:

https://communities.vmware.com/thread/296871?start=0&tstart=0

http://www.virtuallyghetto.com/2010/12/how-to-identify-origin-of-vsphere-login.html

The great William Lam offers awesome explanations (he is really awesome) on how to enable verbose logging and finding out everything about each session. In the first link, however, a simpler/much faster/no change required answer appears by user aorady (which wasn't labeled as the answer).

The vCenter event view always shows IP for failed logins in form of

"Cannot login domain\username@XXX.XXX.XXX.XXX"

So, if you just needed the IP, you are good to go. There's lots of ways to do things, but finding a fast and simple way can be a big help.
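If you export the events to text, pulling the IPs out is a quick regex job. Below is a minimal Python sketch, assuming the messages follow the "Cannot login domain\username@IP" shape quoted above; the function name and the idea of feeding it a list of message strings are mine.

```python
import re

# Assumed message shape, based on the format quoted above.
FAILED_LOGIN = re.compile(
    r"Cannot login (?P<user>\S+)@(?P<ip>\d{1,3}(?:\.\d{1,3}){3})"
)


def failed_login_sources(messages, username):
    """Count failed logins per source IP for one user (case-insensitive)."""
    counts = {}
    for message in messages:
        match = FAILED_LOGIN.search(message)
        if match and match.group("user").lower() == username.lower():
            ip = match.group("ip")
            counts[ip] = counts.get(ip, 0) + 1
    return counts
```

The IP with the highest count is usually the machine with the forgotten session or stale saved password.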

No disrespect to William - I bet his explanation will come in handy for a much wider variety of cases, especially if the user has several sessions and you just need to track one.

Thursday, January 21, 2016

Things to note when upgrading to 5.5u3b and 6.0u1 (SSLv3 now disabled)

I'll focus on 5.5u3b since it's the most popular ESXi version out today.

This is verbatim from the 5.5U3b release notes


What's New

  • Updated Support for SSLv3 protocol is disabled by default
    Note: In your vSphere environment, you need to update vCenter Server to vCenter Server 5.5 Update 3b before updating ESXi to ESXi 5.5 Update 3b. vCenter Server will not be able to manage ESXi 5.5 Update 3b, if  you update ESXi before updating vCenter Server to version 5.5 Update 3b. For more information about the sequence in which vSphere environments need to be updated, refer KB 2057795.

  • VMware highly recommends you to update ESXi hosts to ESXi 5.5 Update 3b while managing them from vCenter Server 5.5 Update 3b.

    VMware does not recommend re-enabling SSLv3 due to POODLE vulnerability. If at all you need to enable SSLv3, you need to enable the SSLv3 protocol for all components. For more information, refer KB 2139396

This of course causes issues which you need to be aware of:

1) You HAVE to patch vCenter first! Yes, this is a best practice, but I know a lot of people just patch their hosts. Revisit this and plan on patching your vCenter first, then your hosts.
2) This has consequences with other software! For example, Veeam has a KB out (KB2063) that explains you have to upgrade to Veeam 8 update 3 for TLS to be supported.
3) If you don't already do this today: always read the release notes. When 5.5u3b first came out, there weren't big warning signs like the above. VMware has done a good job of putting up alerts now when you download this version and of updating KBs, but no one does the job of preparing for this but yourself, the system administrator.

That's one of the biggest gotchas I've seen in a good while - keep up to date :D