
Replication WAN Issue

We recently updated our Core at our DR facility to 6.2.1.100.  Our main facility is at 6.3.1.100.  We replicate roughly 100 servers at night.  Since we upgraded to 6.2.1.100 we have noticed that our replication is taking an extra hour to hour and a half to complete.  Network graphs show that we are not using 100% of the WAN connection when we replicate at night.  The connection will ramp up, and then it looks like before it starts the next server replication the connection stops and waits, then starts the next server.  If we replicate one server with a large amount of data, the software uses the entire WAN connection for that one server.  We never had an issue with this prior to upgrading DR.  Our nightly replication always used 100% of the WAN connection until all servers were replicated to DR.  We didn't make any configuration changes to the network or to the backup settings when we upgraded to this new version, but we have now encountered this issue.  Can anyone give me any thoughts on this?

  • No one is saying the items you mention should not be checked; they are great suggestions. But I think the chances of the scheduler hanging, or retention just happening to hit 1 year, at the same time as the upgrade are low (but not 0). And if the chances of this are fairly high, then maybe some logic should be put in to warn/restart the scheduler if it is not running, rather than letting it sit in a failed state for months or more.

    But even if the scheduler did hang, or retention did just hit a milestone, and they were causing slowdowns due to a large amount of work, that would only bolster my point. If the Core recorded any performance information for its jobs, it would be incredibly easy to see this or any other issue. Sure, you can look in the GUI for job run times and maybe a rate (but support has always told me the rate numbers are useless), and even if the rate is accurate, I'm not sure it helps you.

    And if those 2 specific scenarios are not the issue, then we move on to other guesses. But if the Core reported anything about replication, we could narrow in on the issue. For example, let's assume replication has 4 parts:

    1) Read from Source
    2) Package data for transmission (compression?)
    3) Transfer
    4) Write data to Target

    So a log entry might look like "Read = 10GB/ 2mbps, Package = 10GB/ 20mbps, Transfer = 8GB/ 60mbps, Write = 8GB/ 15mbps" (write would probably be in a different log on the target unless it reported back).

    It would be easy to see where the job's bottleneck is: during read in that example, so probably a disk issue. If the bottleneck were in the transfer section, then it's probably network related. The Core knows how much data it handles and how much time it spends during each operation; it just does not track or report it.
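
    To make that concrete, here is a rough PowerShell sketch of what I mean. The log line format is completely made up (the Core does not write anything like this today); the point is just how easy spotting the bottleneck would be if per-phase sizes and rates were recorded:

        # Hypothetical per-job log line -- this format does not exist in the Core today
        $line = "Read = 10GB/ 2mbps, Package = 10GB/ 20mbps, Transfer = 8GB/ 60mbps, Write = 8GB/ 15mbps"

        # Pull out each "Phase = <size>GB/ <rate>mbps" pair
        $phases = [regex]::Matches($line, '(?<phase>\w+)\s*=\s*(?<size>\d+)GB/\s*(?<rate>\d+)mbps') |
            ForEach-Object {
                [pscustomobject]@{
                    Phase    = $_.Groups['phase'].Value
                    SizeGB   = [int]$_.Groups['size'].Value
                    RateMbps = [int]$_.Groups['rate'].Value
                }
            }

        # The slowest phase is the job's bottleneck (read at 2 mbps in this example, so look at the disks)
        $bottleneck = $phases | Sort-Object RateMbps | Select-Object -First 1
        "Bottleneck: $($bottleneck.Phase) at $($bottleneck.RateMbps) mbps"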

    Now, since none of the above helps tim.sweet with his current issue, look in the Core logs for a message like:

    - performance: on volume 1, during the last 600s, 3 IO operations exceeded 30000ms threshold, their total duration is 361s, the longest operation took 138s.

    This may help you see if you have any disk IO issues.
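
    If you want a quick way to pull those entries out of the logs, something like this should do it. The path below is an assumption based on the default log location I have seen on a Core install, so point it at wherever your Core actually writes its logs:

        # Assumed default Core log folder -- adjust to your install
        $logDir = 'C:\ProgramData\AppRecovery\Logs'

        # Show every slow-IO performance warning, with the file it came from
        Get-ChildItem -Path $logDir -Filter *.log -Recurse |
            Select-String -Pattern 'exceeded \d+ms threshold' |
            ForEach-Object { "{0} : {1}" -f $_.Filename, $_.Line.Trim() }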

    I am also messing around with a PowerShell script Tudor sent me to see if it can help narrow down whether the jobs themselves are the issue.
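
    In the meantime, even without any of that, you can get a rough feel for the stop/wait behaviour from nothing more than the start time, end time and transferred size of each replication job shown in the GUI. A quick sketch with made-up numbers (fill in your own jobs):

        # Made-up example data -- replace with the start/end/size of your own replication jobs
        $jobs = @(
            [pscustomobject]@{ Server = 'SRV01'; Start = [datetime]'2018-06-01 01:00'; End = [datetime]'2018-06-01 01:20'; GB = 12 }
            [pscustomobject]@{ Server = 'SRV02'; Start = [datetime]'2018-06-01 01:27'; End = [datetime]'2018-06-01 01:45'; GB = 9 }
        ) | Sort-Object Start

        # Effective rate per job, plus the idle gap before it started -- big gaps mean the WAN sat unused
        $prevEnd = $null
        foreach ($job in $jobs) {
            $minutes = ($job.End - $job.Start).TotalMinutes
            $mbps    = [math]::Round(($job.GB * 8 * 1024) / ($job.End - $job.Start).TotalSeconds, 1)
            $gapMin  = if ($prevEnd) { [math]::Round(($job.Start - $prevEnd).TotalMinutes, 1) } else { 0 }
            "{0}: {1} GB in {2:N0} min ({3} mbps), idle gap before start: {4} min" -f $job.Server, $job.GB, $minutes, $mbps, $gapMin
            $prevEnd = $job.End
        }

    If the per-job rate comes out close to the WAN line speed but there are multi-minute gaps between one job ending and the next starting, then the time is being lost between jobs rather than in the transfers themselves.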

  • I know it's been a little while, but I wanted to update this thread with new info.  We updated our production site to 6.2.1.100, so now it's the same as our DR location.  No change in replication.  It's still not using all of the WAN connection, which is 100 Mbps.  As a matter of fact, it's using less of the connection now.  We have been on this new version for 2 nights now and replication is worse.  Last night we had 12 servers out of the 100 we replicate that never made it over due to errors.

    This is a software issue.  We have a ticket opened with Quest and we have proven to them that this all started when we upgraded our DR site.  We have network graphs and replication times from before and after the upgrade.  This all started with the upgrade and no network changes.

    I can replicate single servers during the day without issue, but as soon as we try to replicate 2 in the early morning hours as part of the 100 servers that need to be replicated, the software struggles to handle it.  What it seems to be doing is replicating 2 servers, then it stops/drops and waits for a timeout or something, then it replicates 2 more and drops again.  Prior to the upgrade it would peg the WAN connection at like 90% and it would stay that way until it was done, no dropping and waiting.  Our replication is taking an hour and a half to 2 hours longer now.