Failed Over to a Standby VM Today. Minor Problems

So today we had to fail over to one of our standby VMs because its physical counterpart suddenly died. It went well for the most part, but I did find a few things that hung me up.

The protected machine is (was) a Windows Server 2008 R2 DC, and our standby server runs Windows Server 2012 R2 with Hyper-V.

Plan for the initial boot time: booting, installing the Integration Services disk, configuring the network, and rebooting took at least 30 minutes.
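
For reference, the network reconfiguration step can be scripted ahead of time. Below is a minimal sketch; the interface name, addresses, and DNS servers are placeholders, and on a 2008 R2 guest the netsh commands are the safe choice since the newer Net* cmdlets are not available there.

    # Placeholder values -- adjust the interface name, IP, mask, gateway, and DNS to your environment
    netsh interface ipv4 set address name="Local Area Connection" static 192.168.1.25 255.255.255.0 192.168.1.1
    netsh interface ipv4 set dnsservers name="Local Area Connection" source=static address=192.168.1.10
    netsh interface ipv4 add dnsservers name="Local Area Connection" address=192.168.1.11 index=2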

Installing the Integration Services disk assigned the virtual optical drive the same letter as one of our mapped drives, so I had to go into Disk Management and reassign the letters.
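
In case it helps anyone else, the same reassignment can be scripted instead of done through Disk Management. A minimal sketch, assuming the virtual DVD drive grabbed E: and you want it moved to Z: (both letters are placeholders); Win32_Volume is used because the Storage-module cmdlets do not exist on 2008 R2.

    # DriveType 5 = optical drive; move it out of the way so the data volume can have its old letter back
    $cd = Get-WmiObject Win32_Volume -Filter "DriveType = 5 AND DriveLetter = 'E:'"
    $cd.DriveLetter = 'Z:'
    $cd.Put()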

Right away I noticed our mapped drives did not work, so I had to re-share the data volume and remap it. Users then reported that all their files were "read only," so I had to go back and assign rights to the shared volume.
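
The re-share and the "read only" fix can both be scripted as well. A minimal sketch with a placeholder share name, path, and group; the SmbShare cmdlets do not exist on 2008 R2, so this uses plain net share and icacls.

    # DataShare, D:\Data, and DOMAIN\Data-Users are assumptions -- substitute your own
    net share DataShare=D:\Data "/GRANT:DOMAIN\Data-Users,FULL"
    # Fix the NTFS ACL that makes files appear read only; /T recurses into existing files
    icacls D:\Data /grant "DOMAIN\Data-Users:(OI)(CI)M" /T
    # Remap on a client (X: and SERVER are placeholders)
    net use X: \\SERVER\DataShare /persistent:yes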

Failback should be fun. I am thinking the server probably has a bad motherboard and the RAID array is OK. If so, I will simply do a live restore of the data volume from the newest backup as opposed to a BMR. Fingers crossed that the next scheduled backup runs without anything funky happening. I paused the export for the time being, for obvious reasons.

I would love to hear any advice or feedback about the permissions and drive mapping issues.

  • Don't do a live recovery to a domain controller; rebuild the system drives from scratch, which is where all your problems were anyway. Bringing up an old domain controller will cause a lot of pain and a lot more work than the 30 minutes the process above took.

    With regard to the shares not working, you need the drives to have the correct letters and then restart the server. Integration Services wouldn't have taken the drive letter for the CD if it was in use; perhaps you were impacted by known issue #102390 as listed on support.quest.com/.../3
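
    A quick way to confirm the letters line up before that restart is sketched below (it uses the built-in Win32_Share class, which works on 2008 R2):

        # List the server's shares and flag any whose path no longer exists,
        # i.e. the volume came back up under a different letter
        Get-WmiObject Win32_Share | Where-Object { $_.Path } |
            Select-Object Name, Path, @{ Name = 'PathExists'; Expression = { Test-Path $_.Path } }
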
  • The live recovery on the data volume actually worked pretty well. The files written to the VM during failover were immediately available to users in their most current form, which was pretty cool. I just took a snapshot, shut the VM off, turned on the repaired physical server, and live restored the data drive (the host-side commands are sketched below). It is worth noting that neither of the volumes on the physical server was harmed during the failure; it was a bad motherboard.

    The physical DC had only been off for about 24 hours, so replication with the other DCs should be fine. It is so far, and I'll keep an eye on it.

    The only thing that sucked was exporting a new base image after it was all over. I assumed the reason was that the exported VM is one volume and the physical machine is two. Oddly, though, two days later it is taking a new base image again. I'm not sure what that is about.
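
    For anyone repeating the snapshot-and-shutdown step above, the host-side part is just two commands on the 2012 R2 Hyper-V host (the VM name and checkpoint name below are placeholders):

        # Take a checkpoint of the standby VM, then shut it down gracefully before
        # powering the repaired physical server back on
        Checkpoint-VM -Name "Standby-DC" -SnapshotName "Pre-failback"
        Stop-VM -Name "Standby-DC"
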
  • DC replication is dictated by the update sequence number (USN), which is increased after each replication. If the domain controller that was restored has a lower USN than the others, replication may not occur, because the other domain controllers believe they are already up to date. This behavior is called USN rollback.
    If my memory serves me well (I have not done system administration for the last five years or so), repadmin is the utility that lets you figure out which USN belongs to which DC. From there you can make an informed decision about how to deal with the situation. I have used the unsupported method of increasing the USN on the desired DC, and it worked in a few cases (I don't remember how I did it, though). A relatively simple method (forcing an authoritative/non-authoritative sync) is described here: support.microsoft.com/.../how-to-force-an-authoritative-and-non-authoritative-synchronization-fo . If that does not work, there may be a lot of work to do...
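
    For reference, the repadmin checks look roughly like this (the DC name and domain DN are placeholders):

        # Overall replication health, then per-partner status for one DC
        repadmin /replsummary
        repadmin /showrepl DC1
        # Up-to-dateness vector: the highest committed USN each DC has seen from every partner
        repadmin /showutdvec DC1 "DC=corp,DC=example,DC=com"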

  • If it does fail to replicate, you can just demote it, clean up the metadata from another DC, let the change propagate out, and then promote it back. Hands-on time is about 10 minutes, not counting the wait for the next replication cycle to push the changes out.
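
    Roughly what that looks like on 2008 R2, for reference (the server name and DN below are placeholders, and ntdsutil prompts for confirmation before deleting anything):

        # On the DC being removed: demote it (add /forceremoval if it cannot replicate)
        dcpromo
        # On a surviving DC, once the old one is offline: clean up its metadata
        ntdsutil "metadata cleanup" "remove selected server CN=OLD-DC,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=corp,DC=example,DC=com" quit quit
        # Then promote the server back with dcpromo once it is healthy again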