
Replication WAN Issue

We recently updated the core at our DR facility to 6.2.1.100.  Our main facility is at 6.1.3.100.  We replicate roughly 100 servers at night.  Since we upgraded to 6.2.1.100 we have noticed that our replication takes an extra hour to hour and a half to complete.  Network graphs show that we are not using 100% of the WAN connection when we replicate at night.  The connection will ramp up, and then it looks like before it starts the next server's replication the connection stops and waits, then starts the next server.  If we replicate one server with a large amount of data, the software uses the entire WAN connection for that one server.  We never had an issue with this prior to upgrading DR; our nightly replication always used 100% of the WAN connection until all servers were replicated to DR.  We didn't make any configuration changes to the network or to the backup settings when we upgraded to this new version, but we have now encountered this issue.  Can anyone give me any thoughts on this?

  • Hello tim.sweet, 

    Thank you for the information you provided. I am wondering if you are having the following issue:

    support.quest.com/.../slow-replication-of-base-images

    I would be curious to see whether applying this on the 6.2.1.100 Core improves the replication time.

    As mentioned, we would also recommend that both Cores be on the same version.

    Please let me know if you have any further questions.

  • Francis, this isn't our issue.  We don't replicate base backups every night.  Typically we replicate incrementals, and since we upgraded DR it takes longer and replication will not use all the bandwidth of the WAN link.  Do you think this patch will still help us in any way?  Let me know your thoughts; we really need some assistance with this.

  • Hi Tim, I think this patch will still improve your replication times. I would say install it, test it out, and let us know whether you notice improvements.

  • Could you expand a bit? Why would upgrading one core cause replication to slow down? And why would you then expect installing the patch on another core to speed it up? If there is a known issue between mismatched cores, it should be noted (and I can't find any; this appears to be a supported configuration).

    I am not saying it won't help, and in fact it looks like Tim has zero options. It's just that the whole process seems to be "try it and see", and that is not ideal for enterprise-level production backups.

    RR needs some way to monitor and measure performance counters for all of its jobs. 

  • The patch increases the write buffer on the target core. By default our write buffer is relatively small, so if your source core is capable of overwhelming the write buffer on the target core, you end up with slower-than-expected replication. This patch doesn't need to be installed on the source, and if it is, it won't have any effect on its performance. It needs to be installed on the target, and it will only improve replication speeds if the bottleneck is the write buffer on the target core. The write buffer is usually the bottleneck only during base image replications; however, I can theorize a couple of situations where it would be beneficial with incrementals (say, large incremental replication or highly unique incremental data).
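    As a toy illustration of why a too-small target-side write buffer can cap WAN usage (all numbers below are made up; this is not Rapid Recovery's actual pipeline): if the target's disk stalls periodically, a small buffer forces the sender to stop and wait through every stall, while a larger buffer lets the transfer keep running.

```python
# Toy model: a sender pushes 10 units/tick over the WAN; the target's disk
# drains 12 units/tick on average but stalls completely every 5th tick
# (flushes, seek storms).  The sender can only push what fits in the buffer.

def replicate(buffer_cap, ticks=1000, net_rate=10, disk_rate=12, stall_every=5):
    buffered = 0   # data sitting in the target's write buffer
    received = 0   # total data accepted from the network
    for t in range(ticks):
        pushed = min(net_rate, buffer_cap - buffered)  # blocked if buffer full
        buffered += pushed
        received += pushed
        drain = 0 if t % stall_every == 0 else disk_rate  # periodic disk stall
        buffered -= min(buffered, drain)
    return received / ticks  # average WAN throughput per tick

small = replicate(buffer_cap=10)    # sender stalls whenever the disk does
large = replicate(buffer_cap=100)   # buffer absorbs the disk stalls
```

    With these invented numbers the small buffer averages 8 units/tick while the large one runs near the disk's sustained 9.6, which matches the symptom of a WAN link that never fills even though both ends could go faster.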

    - What does the disk queue look like on the source core? What does it look like on the target core? Do you ever see it go above 1? You can check using Resource Monitor.
    - Are there any background jobs running at the same time as replication? Jobs like deleting index RPFS files or other maintenance tasks?
    - When are your nightly jobs set to run on your two cores? Are they at the same time as you are allowing replication?
    - Do you have separate retention policies between the two cores? How many concurrent replication jobs are allowed?
    - What is your bandwidth between the two sites? What is the latency? Have you seen an increase in latency on your network line? Is that something you monitor at all? Latency can really slow down RR replication, especially when you are replicating incremental data that is pretty well deduped.

    I know I just asked a lot of questions, but these are all things I would look at in investigating this kind of issue.

    From a software perspective, we recommend keeping the source and target core on the same build. If there are changes in repository structure between builds, we can end up with overhead as replication sends data between the two cores and has to convert it on the target core to match the new repo structure. I'm not saying that is the case here, but we've seen it in the past, and there were some big code changes made between 6.1.3 and 6.2.x. I've also seen defects related to replication speeds between builds in the past. Any time a new build modifies the replication code (for instance, we added better resiliency against network drops), there is a chance it could introduce performance issues between builds. In your situation, the performance hit isn't preventing you from completing your replications each night on a large number of servers, so that is a good thing. It's definitely frustrating to see worse performance, but it's not crippling.

  • Tim, as always I appreciate your replies and the details you give. I like the fact that write buffers were increased in this patch.

    Write buffer - It seems like it would only help, and it is applied on the correct core, so this should not have caused a slowdown. Is there any log or GUI indication that a core is "overwhelming the write buffer"?

    General perfmon/scheduling stuff - While these may be impacting performance, they would impact performance always, not just after one core was upgraded, and the previous recommendation to match patch levels would not affect this. Unless there was some schedule change from the patch that causes another job to run at a different/new time that conflicts with replication.

    Match version - We follow this general rule, but I was just thinking of reviewing this process, as Tim.Sweet pointed out I was not reading the docs correctly. I swear it used to say to upgrade the target first and then the source, but it appears to just say target now. We have been bitten before by patch issues, and it is worse when both the source and target have been upgraded, so I was interested to see that it's supported to upgrade just the target core.

    I know they can't, but the cores should be able to give some insight into their performance and job statistics in order to narrow down what is happening. The Core's performance appears to have slowed down, yet there is not a single recommendation that involves checking anything in the GUI or logs that would help narrow in on the root cause. It is so frustrating when these performance issues happen and there is almost no data available from the Core.

  • Is there anywhere in the logs where you can see the write buffer being overwhelmed? No. Why? Because the buffer size was never supposed to be manipulated; the write buffer being too small was a coding mistake.

    General perfmon/scheduling - I've seen quite a few instances where a core has been online for a long time (think a few months or more) without a restart. Something hangs inside the scheduler and no deletes happen for a long time. Then there is an upgrade, and all of a sudden the core has a HUGE backlog of deletes to run through. Sure, the customer didn't change anything in the schedule, but that doesn't mean the core doesn't have a lot of work to do now, and lots of deletes can definitely impact performance. So that's why I ask. Another reason to ask about performance counters like disk queue is that they can indicate issues with your hardware. If you know that your system always runs with a disk queue of 1 and you get specific performance, then great. But if you haven't been monitoring that kind of performance and you look and see a high disk queue, it could indicate other problems: failed drives, PERC controller firmware issues, or maybe a controller with a dead battery so you have no more write caching on the controller. There are lots of things that just happen that could coincide with an upgrade and make it seem like a software problem when it's not. So I like to cover all my bases.

    Another reason to ask about retention policy is that it makes a BIG difference when it comes to rollup and how long rollup takes. For instance, let's say you set up a core 1 year ago with a retention policy that keeps 4 weeks of daily backups and then 11 monthly backups. When you reach the one-year mark, rollup now has to process a base image and roll together a base and 1 monthly incremental. That change is likely to be large and require much more time than rolling together two daily backups. So where Tim is in his retention policy may make a difference. He could have just reached the end of his policy around the same time that he upgraded. Again, that's not a problem with the software performance, and it's not a problem caused by changing his configuration (since he didn't); it's a problem of having to process a LOT more data, and it may just happen to coincide with other factors. This is one we actually see a lot. The core runs incredibly fast when you are first starting out. That makes perfect sense, since it's not processing any rollup data, it's not having to delete anything, and all it's doing is writing data. But once you reach the point where maintenance jobs have to run, we see performance issues, because reads are MUCH slower than writes since you can't use caching.

    Match version - yes, you can upgrade just the target core and not the source. Replication is supposed to work as long as the target core build >= the source core build. However, they need to be relatively close in version. So you should be able to replicate 6.0.2 to 6.2.1 and 6.1.3 to 6.2.1, but we don't expect 5.4.3 to be able to replicate to 6.2.1. There are limits to what we test and are willing to support.
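    A sketch of that rule in code. The exact supported window isn't spelled out above, so treating "target >= source and same major version" as the supported zone is my assumption for illustration:

```python
# Hypothetical check: replication assumed supported when the target core's
# build is >= the source core's build and both share the same major version.

def parse_build(build):
    """Turn a build string like '6.2.1.100' into a comparable tuple."""
    return tuple(int(part) for part in build.split("."))

def replication_supported(source, target):
    s, t = parse_build(source), parse_build(target)
    return t >= s and t[0] == s[0]   # target newer-or-equal, same major family
```

    Under this assumed rule, 6.0.2 -> 6.2.1 and 6.1.3 -> 6.2.1 pass, while 5.4.3 -> 6.2.1 (major version jump) and 6.2.1 -> 6.1.3 (older target) fail, matching the examples in the reply above.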

    We need more info to diagnose what is going on; that's why I asked all those questions. A lot of that information does come from the GUI. Are maintenance jobs running longer than they used to? Are deletes running longer than they used to? What's the retention policy, and has it been reached? I could go on, but I think I've typed enough for now.

  • No one is saying the items you mention should not be checked; they are great suggestions. But I think the chances of the scheduler hanging, or retention just happening to hit the 1-year mark at the same time as the upgrade, may be low (but not zero). And if the chances of this are fairly high, then maybe some logic should be put in to warn about, or restart, a scheduler that is not running, versus letting it sit in a failed state for months or more.

    But even if the scheduler did hang, or retention did just hit a milestone, and they were causing slowdowns due to a large amount of work, that would only bolster my point. If the Core recorded any performance information for its jobs, it would be incredibly easy to see this or any other issue. Sure, you can look in the GUI for job run times and maybe a rate (but support has always told me the rate numbers are useless), and even if the rate is accurate, I am not sure it helps you.

    And if those 2 specific scenarios are not the issue, then we move on to other guesses. But if the Core reported anything about replication, we could narrow in on the issue. For example, let's assume replication has 4 parts:

    1) Read from Source
    2) Package data for transmission (compression?)
    3) Transfer
    4) Write data to Target

    So a log entry might look like "Read = 10GB / 2mbps, Package = 10GB / 20mbps, Transfer = 8GB / 60mbps, Write = 8GB / 15mbps" (write would probably be in a different log on the target, unless it is reported back).

    It would be easy to see where the job's bottleneck is: in this example it is during read, so probably a disk issue. If the bottleneck were in the transfer section, then it's probably network related. The Core knows how much data it handles and how much time it spends during each operation; it just does not track it or report it.
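    A minimal sketch of that idea (the stage names, byte counts, and timings below are all invented; the Core exposes nothing like this today): given data handled and time spent per stage, compute each stage's throughput and call out the slowest one.

```python
# Hypothetical per-stage stats for one replication job: data handled (GB)
# and time spent (seconds) in each of the four stages described above.
stages = {
    "read":     {"gb": 10, "seconds": 40000},  # reading from the source repo
    "package":  {"gb": 10, "seconds": 4000},   # compressing/packaging
    "transfer": {"gb": 8,  "seconds": 1067},   # pushing over the WAN
    "write":    {"gb": 8,  "seconds": 4267},   # writing on the target
}

def mbps(gb, seconds):
    return gb * 8000 / seconds   # decimal GB -> megabits, per second

rates = {name: mbps(s["gb"], s["seconds"]) for name, s in stages.items()}
bottleneck = min(rates, key=rates.get)   # the slowest stage limits the job

report = ", ".join(f"{name.capitalize()} = {s['gb']}GB / {rates[name]:.0f}mbps"
                   for name, s in stages.items())
```

    With these made-up numbers the report reads just like the log line suggested above, and the bottleneck comes out as the read stage, pointing at the source disk rather than the network.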

    Now, since none of the above helps tim.sweet with his current issue, look in the Core logs for a message like:

    - performance: on volume 1, during the last 600s, 3 IO operations exceeded 30000ms threshold, their total duration is 361s, the longest operation took 138s.

    This may help you see whether you have any disk IO issues.
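    For anyone who wants to scan a saved Core log for those warnings, a quick parse (the message format is copied from the example above; the half-the-window heuristic at the end is my own, not anything the product does):

```python
import re

# Matches the performance warning shown above and pulls out the numbers.
PATTERN = re.compile(
    r"on volume (?P<volume>\d+), during the last (?P<window>\d+)s, "
    r"(?P<ops>\d+) IO operations exceeded (?P<threshold>\d+)ms threshold, "
    r"their total duration is (?P<total>\d+)s, "
    r"the longest operation took (?P<longest>\d+)s"
)

line = ("performance: on volume 1, during the last 600s, 3 IO operations "
        "exceeded 30000ms threshold, their total duration is 361s, "
        "the longest operation took 138s.")

match = PATTERN.search(line)
stats = {key: int(value) for key, value in match.groupdict().items()}

# Heuristic: if slow IO ate more than half the sample window, suspect the disk.
disk_suspect = stats["total"] > stats["window"] / 2
```

    Run over a whole log file, this would show how often the IO threshold is tripped and on which volumes, which is about as close to disk-level job statistics as the current logs allow.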

    I am also messing around with a PowerShell script Tudor sent me, to see if it can help narrow down jobs as an issue.

  • I know it's been a little while, but I wanted to update this thread with new info.  We updated our production site to 6.2.1.100, so now it's the same as our DR location.  No change in replication; it's still not using all of the WAN connection, which is 100 Mbps.  As a matter of fact, it's using less of the connection now.  We have been on this new version for 2 nights now and replication is worse.  Last night, 12 of the 100 servers we replicate never made it over due to errors.  This is a software issue.  We have a ticket open with Quest, and we have proven to them that this all started when we upgraded our DR site.  We have network graphs and replication times from before and after the upgrade.  This all started with the upgrade and no network changes.  I can replicate single servers during the day without issue, but as soon as we try to replicate 2 in the early morning hours as part of the 100 servers that need to be replicated, the software struggles to handle it.  What it seems to be doing is replicating 2 servers, then it stops/drops and waits for a timeout or something, then it replicates 2 more and drops again.  Prior to the upgrade it would peg the WAN connection at around 90% and stay that way until it was done; no dropping and waiting.  Our replication is taking an hour and a half to 2 hours longer now.
