VMs randomly experience connectivity errors.

Every so often I get backup transfer errors that turn out to be connectivity errors in Rapid Recovery. For example, I have a VM that backed up with no issues every hour today except at 2:00, when I received an error saying it couldn't complete a backup because "The Virtual Machine 'machine name' paired to another core." The stack trace reveals this:

Server side:

System.Security.Authentication.AuthenticationException: The virtual machine 'OTTEIQDATACOL' paired to another Core
at Replay.Core.Implementation.VSphere.EsxVirtualMachineClient.GetVirtualMachine(Boolean ignorePairing)
at Replay.Core.Implementation.VSphere.EsxVirtualMachineAgentClient.GetCurrentMetadata(MetadataCredentials metadataCredentials)
at Replay.Core.Implementation.Agents.AgentsMetadataHelper.GetAgentMetadataInternalClient(AgentDescriptor agentDescriptor, IAgentClient agentClient)
at Replay.Core.Implementation.Agents.ProtectedAgent.b__9()
at Replay.Core.Implementation.Agents.ProtectedAgent.AgentClientSend[TResult](Func`1 func)
at Replay.Core.Implementation.VSphere.EsxVirtualMachineAgent.GetMetadata()
at Replay.Core.Implementation.Metadata.Cache.MetadataCacheService.UpdateAgentMetadataCacheEntry(IAgent agent, Boolean isForced, Boolean tryAgentServiceHostRestart)


UI side:

at Replay.Core.Implementation.VSphere.EsxVirtualMachineClient.GetVirtualMachine(Boolean ignorePairing)
at Replay.Core.Implementation.VSphere.EsxVirtualMachineAgentClient.GetCurrentMetadata(MetadataCredentials metadataCredentials)
at Replay.Core.Implementation.Agents.AgentsMetadataHelper.GetAgentMetadataInternalClient(AgentDescriptor agentDescriptor, IAgentClient agentClient)
at Replay.Core.Implementation.Agents.ProtectedAgent.b__9()
at Replay.Core.Implementation.Agents.ProtectedAgent.AgentClientSend[TResult](Func`1 func)
at Replay.Core.Implementation.VSphere.EsxVirtualMachineAgent.GetMetadata()
at Replay.Core.Implementation.Metadata.Cache.MetadataCacheService.UpdateAgentMetadataCacheEntry(IAgent agent, Boolean isForced, Boolean tryAgentServiceHostRestart)


We aren't performing any maintenance or making any changes to our VMware / Rapid Recovery infrastructure. Upon closer inspection of the VM in the RR console, at the top it says "Some actions and metadata are unavailable because machine is unreachable."
I can connect to the VM using RDP just fine. Additionally, the RR console says the disks are missing (which is also untrue).

What's causing these errors and how can I prevent them?

  • Do you have a virtual center in your environment? If so, is it running on a VM, and are you using RR to back up that VM agent-lessly?

    Is your Rapid Recovery core running on a VM? Are you backing up the core via agent-less protection?

    Did you look inside vSphere to see if there was a corresponding error at that point in time?
  • Yes, we are running vCenter and it is a VM (a virtual appliance, to be exact). It is being backed up agentlessly. Its backup time is offset by 30 minutes from the other VMs.

    RR core is on a physical machine. It is not being backed up.

    There are no errors in vSphere corresponding with the Rapid Recovery errors.
  • Gotcha, thank you. The reason I ask is that this symptom is consistent with behaviors/abnormalities that appear when you back up your VC, or the VM hosting your backup solution, agent-lessly.

    May I direct you to this KB article that we have on the topic of backing up a VC:

    support.quest.com/.../229098

    This is a common discussion point, especially when your VC is the VC VA, since there is no agent available for it. Luckily the VC is mostly static and does not incur much change, and if you're not running distributed switches or vvols/vsan, most of the data (with the exception of the cluster and SSO) is stored on the hosts and the VC is just your single pane of glass, so frequent agent-less backups of it are not recommended. Even if you have the backups offset, all it takes is one day where the other backups run long, or the backup of the VC runs long, and you'll find yourself in a situation where the VC has to either snapshot itself, or close a snapshot on itself, while it is closing or opening snapshots for other VMs, which is where problems start to arise.

    The KB describes pretty much the same scenario in a little more detail. That is why I asked about the VC/backup server, as this behavior tends to follow that type of configuration.
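
The overlap scenario described above (a VC backup window colliding with another VM's backup that ran long) can be checked against a schedule with a few lines of Python. This is only a rough sketch under stated assumptions: the job names, start times, and durations are hypothetical, and nothing here calls Rapid Recovery or vSphere; you would have to fill in the times from your own job history.

```python
from datetime import datetime, timedelta

def windows_overlap(start_a, dur_a, start_b, dur_b):
    """Return True if the half-open windows [start, start + duration) intersect."""
    return start_a < start_b + dur_b and start_b < start_a + dur_a

def find_vc_conflicts(vc_job, other_jobs):
    """List the names of jobs whose run window intersects the VC job's window."""
    vc_start, vc_dur = vc_job
    return [name for name, (start, dur) in other_jobs.items()
            if windows_overlap(vc_start, vc_dur, start, dur)]

# Hypothetical schedule: the VC backup is offset by 30 minutes from the others.
day = datetime(2017, 7, 24)
vc_job = (day.replace(hour=14, minute=30), timedelta(minutes=20))
other_jobs = {
    # Finishes at 14:15, well before the VC backup starts.
    "vm-a": (day.replace(hour=14, minute=0), timedelta(minutes=15)),
    # Runs long, until 14:45, so it is still active when the VC backup starts.
    "vm-b": (day.replace(hour=14, minute=0), timedelta(minutes=45)),
}

conflicts = find_vc_conflicts(vc_job, other_jobs)
print(conflicts)
```

With these made-up numbers only the job that ran long is reported, which is the point of the example: a 30-minute offset protects you only as long as every other job finishes inside it.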
  • Thanks for the information and KB link. I understand that not much changes, and no, we are not currently using vvols/vsan or distributed switches; however, things may change in the future.
    I do like the suggestion at the end of the KB to create a 2-hour window at the end of the day for the VCSA backup only. After reading it, I agree and don't see the point of hourly snapshots for the VCSA either. I'll make the configuration changes and monitor for a bit to see if this clears up the issues we have been seeing.
  • Yup. Not 'ideal', I'll admit, but these are realistic challenges that one faces when moving to an agent-less solution. Not that you shouldn't; it is an excellent DP method from a management standpoint. It just presents its own set of things to consider when configuring.

  • So I had to reject the supposed answer above as I just started getting this error again.

    Quest Core on RRCore has reported the Error event "Transfer has failed":

    Date/Time: 07/24/2017 15:03:54 -04:00

    The transfer of the backup of '\\Hard disk 1\Volume 1; \\Hard disk 1\Volume 2' on 'VMname' failed

    The virtual machine 'VMname' paired to another Core

    System.Security.Authentication.AuthenticationException: The virtual machine 'VMname' paired to another Core
    at Replay.Core.Implementation.VSphere.EsxVirtualMachineClient.GetVirtualMachine(Boolean ignorePairing)
    at Replay.Core.Implementation.VSphere.EsxVirtualMachineAgentClient.GetCurrentMetadata(MetadataCredentials metadataCredentials)
    at Replay.Core.Implementation.VSphere.EsxVirtualMachineAgentClient.GetCurrentSummaryMetadata(MetadataCredentials metadataCredentials)
    at Replay.Core.Implementation.Agents.AgentsMetadataHelper.GetAgentSummaryMetadataInternalClient(AgentDescriptor agentDescriptor, IAgentClient agentClient)
    at Replay.Core.Implementation.Agents.ProtectedAgent.<GetSummaryMetadata>b__c()
    at Replay.Core.Implementation.Agents.ProtectedAgent.AgentClientSend[TResult](Func`1 func)
    at Replay.Core.Implementation.Transfer.Validation.Implementation.ProtectedAgentTransferValidator.Validate()
    at Replay.Core.Contracts.Validation.ValidatorBase.AggregateValidator.Validate()
    at Replay.Core.Implementation.Transfer.Queuing.Implementation.TransferQueueService.StartTransfer(TransferQueueEntry entry)
    ---

    About this event: The transfer of a new recovery point from the protected machine has failed
  • I see your reply. I'd honestly suggest that you contact support and let them look at the config and pull a log to validate what the core service is seeing within its own registry when compared to the IDs that VMware is passing out. That might be your best course of action rather than continuing to chase this down via the forum. Support can be reached at 1.800.306.9329.