Started getting System.TimeoutException: 'b__0' on one machine, can't tell why.

One of the VMs being backed up started throwing this error every few backups since last night. Out of the past 8 backups, 5 have thrown the error. No other VMs are having this issue. These are backed up agentlessly (vCenter), direct from SAN.

Server side:

System.TimeoutException: 'b__0' for 'LS2RESDATA' has timed out
   at Replay.Common.Implementation.Utilities.SingletonTask.Execute(Action action, CancellationToken cancellationToken)
   at Replay.Core.Implementation.VSphere.EsxVirtualMachineAgentClient.PrepareForBackupOrRestoreInternal(TimeSpan takeSnapshotTimeout, TimeSpan cleanSnapshotTimeout, ShadowCopyType shadowCopyType, CancellationToken cancellationToken, Boolean isBackup)
   at Replay.Core.Implementation.VSphere.EsxVirtualMachineAgentClient.PrepareForBackup(TimeSpan takeSnapshotTimeout, TimeSpan cleanSnapshotTimeout, ShadowCopyType shadowCopyType, CancellationToken cancellationToken)
   at Replay.Core.Implementation.Transfer.TransferJobHandler.EsxTransferHandler.PrepareForBackup()
   at Replay.Core.Implementation.Transfer.TransferJobHandler.TransferHandler.TransferTaskInternal()
   at Replay.Core.Implementation.Transfer.TransferJobHandler.TransferHandler.TransferAgentTask()
   at Replay.Core.Implementation.Transfer.TransferJobHandler.TransferHandler.TransferTask()
   at System.Threading.Tasks.Task.Execute()

  • You mention this is agent-less, but the error mentions 'shadow copy', so this is more than likely a problem with the quiesced snapshot when it tried to take a local guest OS snapshot. As a test, turn off the quiesced snapshot for that VM and see if the problem persists. Keep in mind that RR defaults to having quiescing turned on, so you get one VMware snap, then a guest OS snap, and then another VMware snap. Before you test, it is also a good idea to make sure the VM is free of all other snapshots:

    support.quest.com/.../rapid-recovery-defaults-to-guest-quiescing-when-taking-a-vmware-snapshot

  • The first thing I checked was to see if there were any lingering snapshots and there aren't. I created a VMware snapshot and deleted it successfully. I will try disabling quiesced snapshot for this VM but why would only this one machine out of every VM being backed up be having this issue? Also, wouldn't having quiesced enabled help ensure the data being backed up is consistent when you can't confirm that system use is minimal during the backup?

  • In a nutshell, a quiesced snapshot is specific to that one VM because it reaches in and takes a guest OS snapshot too, so a failure there really is specific to that machine. The other thing you can do is take the time the job failed in RR and look at the event logs on the guest OS (System and Application) around that time; you probably have a corresponding error there too.

    As far as the 'consistent' part, this is the fun gray area of quiescing. Every other backup vendor (except Commvault) defaults to a non-quiesced snapshot, and even then you only turn quiescing on when you want to truncate Exchange or SQL logs. That shows a non-quiesced snapshot is perfectly fine/adequate/safe. At the same time, there is no reason NOT to take a quiesced snap (other than more moving parts, system resources, and affecting the guest OS), so when agent-less protection was put into RR they defaulted to it. If you are using RR to truncate SQL/Exchange logs, you MUST have it. If you aren't, it's a choice.

    A non-quiesced snap is perfectly safe. The difference is this: if you do a backup at 8:43, do you want to back up only the data that has already been committed to the hard disk, or all the data on disk and in RAM? A quiesced snapshot will 'flush out' RAM before the backup is taken (that is the guest-level snap). Both are perfectly safe, which is why one is called crash consistent and the other application consistent. A non-quiesced snap is like a power outage where the hardware/OS was still recoverable: you'd have everything that was already written to disk, but not what was in RAM. In theory everything in RAM will be committed to disk within seconds/minutes and will be included in the next backup, but there we are.

    You also said it works some of the time, so there is probably a competing task or a VSS conflict at the time it is failing, which would also appear in the Event Viewer under Application or System.

  • Ah so since these are primarily file servers or non-database services, it's fine to disable quiesce. I think I follow now.

    I checked the logs as suggested and yeah, there are a ton of errors that coincide with the RR errors. Also, coincidentally, these all seem to have started after I updated VMware Tools, so I'm wondering if there are some compatibility issues happening?

    Yes it also still works some of the time but I will definitely run without quiesce to see if that helps. Also I notice there is now a synthetic snapshot option but in reading the documentation, I'm still not sure how that would help. It seems to take longer and possibly use up more space?

    I will update the thread at the end of the day after a few more backup runs go.
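
A minimal sketch of the log correlation suggested above: take each failed job time from RR and list the guest event log entries recorded within a few minutes of it. Everything here is an assumption for illustration. The timestamps are made-up sample data, the five-minute window is arbitrary, and the function names are hypothetical. Only event ID 8197 from source SRMSVC comes from this thread. In practice the failure times would be copied from the RR job history and the events exported from the guest's Event Viewer.

```python
from datetime import datetime, timedelta

# Hypothetical failure times, copied by hand from the backup job history.
failed_jobs = [
    datetime(2024, 1, 10, 2, 0, 12),
    datetime(2024, 1, 10, 8, 0, 9),
]

# Hypothetical guest event log entries as (timestamp, source, event_id).
# The values are sample data, not real log output.
guest_events = [
    (datetime(2024, 1, 10, 1, 59, 58), "SRMSVC", 8197),
    (datetime(2024, 1, 10, 5, 30, 0), "Service Control Manager", 7036),
    (datetime(2024, 1, 10, 8, 1, 30), "SRMSVC", 8197),
]


def events_near(failure, events, window=timedelta(minutes=5)):
    """Return the events logged within `window` before or after `failure`."""
    return [e for e in events if abs(e[0] - failure) <= window]


def correlate(failures, events, window=timedelta(minutes=5)):
    """Map each failure time to the guest events logged around it."""
    return {f: events_near(f, events, window) for f in failures}


if __name__ == "__main__":
    for failure, nearby in correlate(failed_jobs, guest_events).items():
        print(failure.isoformat())
        for ts, source, event_id in nearby:
            print(f"  {ts.isoformat()}  {source}  {event_id}")
```

If the same source and event ID keep showing up next to the failures and not next to the successful runs, that is a reasonable place to start digging.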

  • Yup yup. VMware Tools is indeed what creates the highway between the hypervisor and the guest OS. If you have to remove/re-install the tools (or perhaps just run that VM non-quiesced), there are ways. If you are getting the errors in the Event Viewer, though, then yes, hands down that's going to lead you to your root cause (tools, VSS, competing jobs, something).

    You mention synthetics: those are there to help with redundant base images for protected nodes. I'll try to see if there's a Quest document about that. Other vendors call them synthetic fulls; here in RR it'd be a synthetic base.

  • Yeah, I'm seeing a bunch of these ever since upgrading tools, and this is when the RR errors started. Will definitely be digging into this on the VMware side (or perhaps it's the Microsoft side?)

    EVID 8197 - Source SRMSVC

    File Server Resource Manager Service error: Unexpected error.

    Error-specific details:
    Error: FlushFileBuffers(\\?\Volume{e7021c7a-4862-11ec-80e6-0050569237b4}\System Volume Information\SRM\FciNrt.usn), 0x80070001, Incorrect function.

    While I was typing this, a backup was taken and it worked flawlessly. If the rest of the day's backups work this well, I think I've at least patched the problem, thanks!

    Regarding synthetics, I did read the Quest doc on it but I didn't really comprehend what it was trying to tell me. I don't think I have an issue with multiple base images, as we take incrementals.

  • Yup - definitely OS related. If it works some of the time and not others, it could be a timing issue too; multiple VSS-related tasks running at the same time always throw VSS into a tizzy.

    The synthetics are there to help you AFTER an unexpected base image is taken. RR is still an incremental-for-life model; however, there are still those 'triggers' that get you a new base/full, which the synthetic technology is there to help combat.

  • If it's meant to combat an unwanted scenario, what is the rationale behind not enabling synthetic by default then? Or is it one of those things that shouldn't happen, but if you notice it happening regularly to a particular backed-up system you can turn it on to help remedy it, and therefore it isn't really required to be enabled across the board?

  • Could be a few reasons. It does take more time, and it is a 'new' feature (within the last year), so maybe they chose not to scoot it in as a default after a decade of it not being there; who knows. Its only purpose is to try to alleviate the unexpected base image snafu that interrupts the incremental-for-life design. However, for it to work you do have to have it enabled before the new base is triggered. Dealer's choice, but at the moment it does not look to be a default setting.
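
To judge whether the synthetic option is worth enabling for a given machine, it may help to check how often an unexpected base image has actually shown up in its recovery point chain. The sketch below is only an illustration under stated assumptions. `RecoveryPoint`, its fields, and the sample data are hypothetical and are not part of any RR API; the list would have to be filled in by hand from the recovery points shown in the console. In an incremental-for-life chain, any base image after a machine's first one is flagged.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPoint:
    """Hypothetical record of one recovery point (not an RR data type)."""
    machine: str
    taken_at: str  # ISO 8601 timestamp; same format throughout so it sorts as text
    kind: str      # "base" or "incremental"


def unexpected_bases(points):
    """Return every base image taken after a machine's first base image."""
    seen = set()
    flagged = []
    for p in sorted(points, key=lambda p: p.taken_at):
        if p.kind != "base":
            continue
        if p.machine in seen:
            flagged.append(p)  # a second (or later) base for this machine
        else:
            seen.add(p.machine)  # the initial base is expected
    return flagged


if __name__ == "__main__":
    # Made-up sample chain for a hypothetical machine name.
    chain = [
        RecoveryPoint("fileserver01", "2024-01-01T00:00:00", "base"),
        RecoveryPoint("fileserver01", "2024-01-01T06:00:00", "incremental"),
        RecoveryPoint("fileserver01", "2024-01-02T00:00:00", "base"),
        RecoveryPoint("fileserver01", "2024-01-02T06:00:00", "incremental"),
    ]
    for p in unexpected_bases(chain):
        print(f"{p.machine}: unexpected base at {p.taken_at}")
```

If a machine never shows up in that list, leaving synthetics off costs nothing; if it shows up regularly, that is the case the feature is meant for.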