Job occasionally failing with the same error

Hello All,

I'm fairly new to Rapid Recovery. Whilst I have completed the free training provided on this site, I'm still new to real-world scenarios and troubleshooting. I have a protected server whose transfers occasionally fail with the error "Agent dropped the network connection during data transfer". Since that error was reported at 08:38 yesterday, other transfers have run successfully. However, I have noticed transfers failing in the past with the same error.

I have checked the KB article and have run a Chkdsk, but it reports no bad sectors on the drive. I have also checked the system event log of the server for any VSS errors, but there aren't any. I checked the VSS writers and found no errors there either. The server is a virtual machine but is not part of an agentless installation.

I have not yet run a disk defrag because I would rather this be a last resort.

Although it's not a serious issue, I would like to understand the error and its cause better, to try to prevent it in future. Has anyone come across this before? If there is any other information I can provide to help, let me know.

Thank you.

  • Backups (not just RR) are fairly sensitive to network drops, so it's possible the issue is exactly what is reported: a network drop.

    But to troubleshoot RR you have to go over the logs. A lot of the time something will fail and RR will just sit there until some random job timeout hits, then report a network error/timeout even though that had nothing to do with the initial failure.

    The logging in RR is not great in my opinion, but there is no use trying to troubleshoot from the errors returned by jobs; they are often too vague or misleading. Crack open the logs on BOTH the client and the Core and see what happened.

  • Hi Emte,

    Thank you for your advice. I have opened up the log for the agent and have found this interesting debug entry:

    DEBUG 2020-04-20T09:04:56 [24] - Replay.Agent.Implementation.Transfer.NetworkConfigurationReader (RequestUri=vm-hq-sophos:8006/.../27a9d461-61d7-48f0-9bd9-380ba2eac9ee ClientAddress=10.0.165.144:59798 Service=Replay.Agent.Management.Transfer.TransferManagement Method=DeleteSnapshot)

    File with network configuration 'C:\NetworkConfiguration.htm' has been deleted.

    I wonder if "File with network configuration 'C:\NetworkConfiguration.htm' has been deleted." could be the cause. I'm only guessing, as I am still very new to this and have a lot to learn.

    This debug event took place 2 minutes before the job failed with the error "Agent dropped the network connection during data transfer".

  • It starts with DEBUG, which typically marks just an informational message. That does not appear to be your problem.

    This is not 100% effective, but look for entries containing Error or Exception and match them up to the time the job actually failed.
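
For anyone who would rather script that search than scroll through the file, here is a rough Python sketch. It assumes the agent log uses the header layout quoted in this thread (LEVEL, timestamp, [thread], then the source, with the message on the following lines). The function names and the 10-minute window are made up for illustration, and the sample lines are just the entries quoted in this thread.

```python
import re
from datetime import datetime, timedelta

# Header layout assumed from the log lines quoted in this thread:
# LEVEL 2020-04-20T09:03:53 [18] - Some.Component ()
HEADER = re.compile(
    r"^(?P<level>[A-Z]+)\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+"
    r"\[(?P<thread>\d+)\]\s+-\s+(?P<source>.*)$"
)


def parse_entries(lines):
    """Group raw log lines into entries; non-header lines attach to the entry above."""
    entries = []
    for raw in lines:
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            entries.append({
                "level": match.group("level"),
                "time": datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S"),
                "source": match.group("source"),
                "detail": [],
            })
        elif line and entries:
            entries[-1]["detail"].append(line)
    return entries


def problems_before(entries, failed_at, window_minutes=10):
    """Return ERROR entries, or entries mentioning an Exception, shortly before failed_at."""
    start = failed_at - timedelta(minutes=window_minutes)
    hits = []
    for entry in entries:
        text = " ".join(entry["detail"])
        is_problem = entry["level"] == "ERROR" or "Exception" in text
        if is_problem and start <= entry["time"] <= failed_at:
            hits.append(entry)
    return hits


# Sample input: the two entries quoted in this thread (request details shortened).
SAMPLE = """\
DEBUG 2020-04-20T09:04:56 [24] - Replay.Agent.Implementation.Transfer.NetworkConfigurationReader (...)
File with network configuration 'C:\\NetworkConfiguration.htm' has been deleted.
ERROR 2020-04-20T09:03:53 [18] - Replay.Agent.Implementation.Transfer.TransferDataConnection ()
Caught exception handling requests from IP:60016; no further requests will be handled
System.TimeoutException depth 0: The operation has timed out. (0x80131505)
""".splitlines()

if __name__ == "__main__":
    failed_at = datetime(2020, 4, 20, 9, 6, 23)  # time the job reported failure
    for entry in problems_before(parse_entries(SAMPLE), failed_at):
        print(entry["time"].isoformat(), entry["level"], entry["source"])
        for detail in entry["detail"]:
            print("   ", detail)
```

To use it on a real log, read the file into a list of lines and pass that in place of SAMPLE. Run it against both the agent log and the Core log, since the first failure may only show up on one side.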

  • I see, so I was reading too much into that.  I have found some errors and exceptions.

    ERROR 2020-04-20T09:03:53 [18] - Replay.Agent.Implementation.Transfer.TransferDataConnection ()
    Caught exception handling requests from IP:60016; no further requests will be handled
    System.TimeoutException depth 0: The operation has timed out. (0x80131505)

    This was the last error seen before the job officially failed at 09:06:23am.  I'm wondering if it really was just a drop in network connection (as you previously mentioned) at the time, maybe a switch reboot or something that someone had scheduled in.

  • It would be hard to prove, but I doubt it's the network. Try to go back further in the logs and see what else was happening. A lot of the time some operation (like scanning a volume for changes or reading the change log) will take too long and hit a timeout. RR has a ton of timeouts, and the logging of each is very poor.

    Any chance you could make the log available to me? I think you can contact me via PM on the forum.
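
One way to go back further without reading every line is to look for long silent stretches in the log, on the assumption that an operation stuck waiting for a timeout logs nothing until it gives up. The sketch below is only that, an assumption-laden sketch: it relies on the header layout quoted in this thread, the 60-second threshold is arbitrary, and the INFO lines with Example.* names in the sample are invented so the code has something to run on.

```python
import re
from datetime import datetime

# Matches the start of a header line such as:
# ERROR 2020-04-20T09:03:53 [18] - ...
TS = re.compile(r"^[A-Z]+\s+(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+\[\d+\]")


def find_stalls(lines, min_gap_seconds=60):
    """Return (gap_seconds, line_before, line_after) for each long gap between entries."""
    stalls = []
    prev_time, prev_line = None, None
    for raw in lines:
        line = raw.strip()
        match = TS.match(line)
        if not match:
            continue  # message/continuation lines carry no timestamp
        now = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        if prev_time is not None:
            gap = (now - prev_time).total_seconds()
            if gap >= min_gap_seconds:
                stalls.append((gap, prev_line, line))
        prev_time, prev_line = now, line
    return stalls


# Invented sample: only the ERROR line comes from this thread.
SAMPLE = """\
INFO 2020-04-20T08:58:01 [12] - Example.StepOne ()
INFO 2020-04-20T08:58:03 [12] - Example.StepTwo ()
ERROR 2020-04-20T09:03:53 [18] - Replay.Agent.Implementation.Transfer.TransferDataConnection ()
System.TimeoutException depth 0: The operation has timed out. (0x80131505)
""".splitlines()

if __name__ == "__main__":
    for gap, before, after in find_stalls(SAMPLE):
        print(f"{gap:.0f}s of silence")
        print("  last entry before:", before)
        print("  first entry after:", after)
```

The entry just before a long gap is a reasonable place to start reading, since it may be the operation that never finished. A quiet log is not proof of a stall though, so treat the output as a pointer rather than a diagnosis.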

  • Another idea: open a support case and have the tech go over the logs with you. This may be difficult, as some of the RR support people don't read the logs like you or I do. They have tools that parse the logs and give them a quick overview, but they won't share these tools with their customers. So when you are trying to troubleshoot with them, you are both looking at completely different views. You can't see theirs, and they don't know how to read logs using the only view you have. If you get someone who won't go over the process with you and demands that you just follow these steps to fix it, see if you can transfer the case to a more senior tech.

  • Thanks Emte, I'll give support a try. Hopefully they will go through it with me and explain what they find, to help me better understand the fault for future reference. Thank you for your help and advice.