Saturday, February 28, 2015

Re-joining hosts to a vCenter with distributed switches for data and storage, and a gotcha for iSCSI while VMs are running

Recently I had an outage where we thought we had lost the vCenter database. I joined the hosts to another vCenter while we worked around the original problem, and I was soon reminded that my hosts were using distributed switching and that I had probably lost the distributed switch information. I would probably need to rebuild that vCenter and move them back.

Someone will say: what about your dvSwitch backups? Well, sadly this was running 5.0. According to this KB, there is no option to back up dvSwitch configs unless you're running 5.1 or later. Yet another reason to upgrade, right?

In the end, the DB access was recovered and I had my old vCenter back. We didn't even have to restore; the access was simply there again, but the same applies after a restore. However, since I had moved my hosts to another vCenter, when I joined them back they did not immediately fall back into place.

The hosts knew they were no longer connected to the old vCenter, so they did not reconnect when it became available again. I had to add them manually. However, the vCenter showed them grayed out, so I first had to remove the hosts and then re-add them.

Once I added them, I checked the network settings and the physical interfaces were not where they were supposed to be. I went to the networking tab and confirmed the distributed switches reported no hosts as members.
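Before touching anything, it's worth confirming from the ESXi shell how the host itself sees its VMkernel interfaces and physical uplinks. A quick sanity-check sketch (the output you get will of course depend on your own host):

```shell
# List VMkernel interfaces and the port groups / dvPorts they sit on
esxcli network ip interface list

# Show each vmk's IP configuration, to match interfaces to their networks
esxcli network ip interface ipv4 get

# List physical NICs with their link state and speed
esxcli network nic list
```

Comparing this against what the vCenter networking tab shows makes it obvious which uplinks and VMkernel interfaces ended up out of place.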

Adding the hosts back was not difficult for my data networks - the wizard asked whether I wanted to migrate any VMkernel interfaces and did a good job of pointing out which ones had to be migrated.

The one gotcha was when moving back the iSCSI distributed switching. I got the following alert:
This one did not appear when doing the data switching. The alert is valid - you are moving a VMkernel interface with active iSCSI traffic, which is potentially catastrophic. My first thought was "wow - so I'll need to get some downtime on the VMs, and possibly create another host and migrate VMs over, before I can put this back the way it was". After meddling around and thinking, I convinced myself there should be a way of doing this without downtime, since the host by itself was already carrying the active iSCSI traffic with no downtime.

The more I thought about it, the more I convinced myself there should be a way. After trying it again, I noticed the little checkbox that allows you to ignore these errors and continue :)
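The reason this works is that each host keeps its own local copy of the distributed switch configuration (the host proxy switch), which is why iSCSI kept flowing the whole time vCenter was unavailable. You can inspect that local copy from the ESXi shell - a sketch, assuming nothing about your switch names:

```shell
# Show the distributed switches as the host sees them locally,
# including uplink assignments and client (vmk / VM) ports
esxcli network vswitch dvs vmware list

# For comparison, the standard vSwitches on the same host
esxcli network vswitch standard list
```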

Now, I only advise you to do this if you are running a similar scenario - in my case, this was:
1) the same vCenter
2) the same hosts it had before
3) the same iSCSI config as before
4) and I picked a host with a few non-critical VMs as a testbed.
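Before and after forcing the migration on the testbed host, I'd also suggest checking that the iSCSI sessions and storage paths stay up. A hedged sketch (the adapter name vmhba33 is just an example - use whatever your software iSCSI adapter is called):

```shell
# Active iSCSI sessions on the software iSCSI adapter
# (vmhba33 is an example name)
esxcli iscsi session list --adapter=vmhba33

# Confirm the storage paths are still active after the move
esxcli storage core path list | grep State
```

If the session count and path states look the same before and after, the VMkernel migration went through without interrupting storage.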

The test was successful, and I was able to re-incorporate my hosts into the distributed switch with no downtime. Looking back, should I have moved the hosts to another vCenter before making sure the DB was truly unrecoverable? Maybe not. But I sure like knowing now what the worst that can happen is.

VMware is a great technology in that, most of the time, the particular scenario that has you in a bind has happened before and there is a workaround already in place. I'm sure I could have googled and found a post like this as well, but on this lucky day I was able to see it by myself. Anyways, hope it helps you. Remember the motto:

I'm sharing this because "there's a gotcha there" and guess what this blog is about :)


A design tip - I've asked peers what they think of using distributed switching for storage. Most have said "if you have the licensing, and it helps standardization, do it!". This was in a VCAP-DCD study group anyone can join, and I do recommend it, as it's very active and full of information.

In my particular case, where I don't have many hosts, I have opted to stay with standard vSwitches for my new environments, for one particular reason: we are trying to consolidate many vCenters into a few, and standard vSwitches help with host portability and with moving VMs from one vCenter to another without downtime.

However, as the post above concludes, distributed switching would be useful if we had several hosts and had to distribute the work among many, since it would save time and prevent misconfigurations. Of course, if it's for data, use it! What else gives you load-based teaming (routing based on physical NIC load)? :)
