
Replication WAN Issue

We recently updated the core at our DR facility to 6.2.1.100.  Our main facility is at 6.1.3.100.  We replicate roughly 100 servers at night.  Since we upgraded DR to 6.2.1.100 we have noticed that our replication is taking an extra hour to an hour and a half to complete.  Network graphs show that we are not using 100% of the WAN connection when we replicate at night.  The connection ramps up, and then it looks like it stops and waits before the next server's replication starts.  If we replicate one server with a large amount of data, the software uses the entire WAN connection for that one server.  We never had an issue with this prior to upgrading DR; our nightly replication always used 100% of the WAN connection until all servers were replicated to DR.  We didn't make any configuration changes to the network or to the backup settings when we upgraded to this new version, but we have now encountered this issue.  Can anyone give me any thoughts on this?

  • I believe Cores need to be on the same version for replication; I am surprised it is working at all.

    support.quest.com/.../how-to-upgrade-to-the-most-recent-version-of-rapid-recovery

    1. Target Cores: If upgrading a multi-Core environment, upgrade Target Cores first.
      NOTE: Replication cannot be resumed unless the Target Core is of an equal or greater build than the Source Core.

     

  • Thanks for your response, but if I am reading this correctly, we need to update the target core first, and as long as we do that everything should work properly.  That's what we did: we upgraded the target core first, and that's when we picked up this issue.  We are reluctant to upgrade our source core for fear that it will cause additional slowness to the process.

  • You are probably right. I have no idea whether the performance issue is related directly to the new build or to the version difference between the cores. Of course, there is no way to find out either, since RR does not log any useful performance detail.

  • Hello tim.sweet, 

    Thank you for the information you provided. I am wondering if you are having the following issue:

    support.quest.com/.../slow-replication-of-base-images

    I would be curious to see whether applying this on the 6.2.1.100 Core improves the replication time.

    We would also recommend that both Cores be on the same version, as mentioned.

    Please let me know if you have any further questions.

  • Francis, this isn't our issue.  We don't replicate base backups every night.  Typically we replicate incrementals, and since we upgraded DR it takes longer and replication will not use all the bandwidth of the WAN link.  Do you think this patch will still help us in any way?  Let me know your thoughts; we really need some assistance with this.

  • Hi Tim, I think this patch will still improve your replication times. I would say install it, test it out, and let us know whether you notice improvements or not.

  • Could you expand a bit? Why would installing the patch on one core cause replication to slow down? And why would you then expect installing the patch on another core to speed it up? If there is a known issue with this patch between cores, it should be noted (and I can't find any; this appears to be a supported configuration).

    I am not saying it won't help, and in fact it looks like Tim has zero options. It's just that the whole process seems to be "try it and see", and that is not ideal for enterprise-level production backups.

    RR needs some way to monitor and measure performance counters for all of its jobs. 

  • The patch increases the write buffer on the target core. By default our write buffer is relatively small, so if your source core is capable of overwhelming the write buffer on the target core, you end up with slower than expected replication. The patch doesn't need to be installed on the source, and if it is, it won't have any effect on its performance. It needs to be installed on the target, and it will only improve replication speeds if the bottleneck is the write buffer on the target core. The write buffer is usually the bottleneck only during base image replications; however, I can theorize a couple of situations where it would be beneficial with incrementals (say a large incremental replication, or highly unique incremental data). There is a rough simulation of the buffer effect at the end of this post.

    A few things I would look at in investigating this kind of issue (the monitoring sketch at the end of this post shows one way to capture some of these counters during the replication window):

    - What does the disk queue look like on the source core? What does it look like on the target core? Do you ever see it go above 1? You can check using Resource Monitor.
    - Are there any background jobs running at the same time as replication, such as deleting index RPFS files or other maintenance tasks?
    - When are your nightly jobs set to run on your two cores? Are they at the same time as you are allowing replication?
    - Do you have separate retention policies between the two cores?
    - How many concurrent replication jobs are allowed?
    - What is your bandwidth between the two sites? What is the latency? Have you seen an increase in latency on your network line? Is that something you monitor at all? Latency can really slow down RR replication, especially when you are replicating incremental data that is already well deduped.

    I know I just asked a lot of questions, but these are all things I would look at.

    From a software perspective, we recommend keeping the source and target core on the same build. If there are changes in repository structure between builds, we can end up with overhead as replication sends data between the two cores and has to convert it on the target core to match the new repository structure. I'm not saying that is the case here, but we've seen it in the past, and there were some big code changes made between 6.1.3 and 6.2.x. I've also seen defects related to replication speeds between builds in the past. Any time a new build makes modifications to the replication code (for instance, we added better resiliency against network drops), there is a chance it could introduce performance issues between builds. In your situation, the performance hit isn't preventing you from completing your replications each night on a large number of servers, so that is a good thing. It's definitely frustrating to see worse performance, but it's not crippling.
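
    To illustrate the write-buffer point, here is a minimal, purely illustrative Python sketch (this is not RR code; the buffer sizes and timings are invented) of a sender/receiver pipeline with a bounded buffer. When both sides have variable delays, a one-slot buffer forces the sender to stall every time the receiver is busy flushing, while a larger buffer lets the two sides overlap and the transfer finishes sooner:

        import queue, random, threading, time

        def run(buffer_slots, chunks=300):
            """Simulate replication into a target with a bounded write buffer."""
            buf = queue.Queue(maxsize=buffer_slots)   # stand-in for the target's write buffer

            def sender():
                rng = random.Random(1)
                for i in range(chunks):
                    time.sleep(rng.uniform(0.0, 0.004))   # variable per-chunk WAN/dedup delay
                    buf.put(i)                            # blocks when the target buffer is full
                buf.put(None)                             # end-of-stream marker

            def receiver():
                rng = random.Random(2)
                while True:
                    if buf.get() is None:                 # blocks when the buffer is empty
                        break
                    # variable disk write time, with an occasional slow flush
                    time.sleep(0.010 if rng.random() < 0.05 else rng.uniform(0.0, 0.003))

            start = time.perf_counter()
            threads = [threading.Thread(target=sender), threading.Thread(target=receiver)]
            for t in threads: t.start()
            for t in threads: t.join()
            return time.perf_counter() - start

        print(f"1-slot buffer : {run(1):.2f} s")
        print(f"64-slot buffer: {run(64):.2f} s")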
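
    Separately, since several of those questions come down to "what do the counters look like during the replication window", here is a rough Python sketch of one way to watch them from the source core. It assumes the third-party psutil package, and the target hostname and port below are placeholders, not values from this environment. psutil does not expose the Windows Avg. Disk Queue Length counter directly (use Resource Monitor or PerfMon for that, as suggested above); this only approximates disk busyness from read/write time deltas and estimates latency from the TCP handshake time to the target core:

        import socket, time
        import psutil   # third-party: pip install psutil

        TARGET = ("dr-core.example.local", 8006)   # placeholder hostname/port for the target core

        def tcp_latency_ms(addr, timeout=2.0):
            """Rough round-trip estimate: time to complete a TCP handshake to the target."""
            start = time.perf_counter()
            try:
                with socket.create_connection(addr, timeout=timeout):
                    return (time.perf_counter() - start) * 1000.0
            except OSError:
                return float("nan")

        prev_net, prev_disk = psutil.net_io_counters(), psutil.disk_io_counters()
        while True:                                    # Ctrl+C to stop
            time.sleep(1.0)
            net, disk = psutil.net_io_counters(), psutil.disk_io_counters()
            sent_mbps = (net.bytes_sent - prev_net.bytes_sent) * 8 / 1e6
            # milliseconds spent reading/writing during the last second (rough busyness indicator)
            busy_ms = (disk.read_time - prev_disk.read_time) + (disk.write_time - prev_disk.write_time)
            print(f"{time.strftime('%H:%M:%S')}  WAN out {sent_mbps:7.1f} Mb/s   "
                  f"disk busy {busy_ms:5d} ms   tcp rtt {tcp_latency_ms(TARGET):6.1f} ms")
            prev_net, prev_disk = net, disk

    Gaps where "WAN out" drops to near zero between servers, or handshake times that jump compared to a quiet baseline, would point at scheduling or latency rather than the write buffer.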

  • Tim, as always I appreciate your replies and the details you give. I like the fact that write buffers were increased in this patch.

    Write Buffer - It seems like it would only help, and it applies to the correct core, so this should not have caused a slowdown. Is there any log/GUI indication that a core is "overwhelming the write buffer"?

    General Perfmon/scheduling stuff - While these may be impacting performance, they would impact performance all the time, not just after one core was upgraded, and the previous recommendation to match the patch levels would not affect this. Unless there was some schedule change from the patch that causes another job to run at a different/new time that conflicts with replication.

    Match Version - We follow this general rule, but I was just thinking of reviewing this process since, as Tim.Sweet pointed out, I was not reading the docs correctly. I swear it used to say to upgrade the Target first and then the Source, but it appears to just say Target now. We have been bitten before by patch issues, and it makes it worse when both the source and target have been upgraded, so I was interested to see that it's supported to upgrade just the target core.

    I know they can't, but the cores should be able to give some insight into their performance and job statistics in order to narrow down what is happening. The Core's performance appears to have slowed down, yet there is not a single recommendation that involves checking anything in the GUI or logs that would help narrow in on the root cause. It is so frustrating when these performance issues happen and there is almost no data available from the Core.

  • Is there anywhere in the logs where you can see the write buffer being overwhelmed? No. Why? Because it was never supposed to be manipulated, and the write buffer size being too small was a coding mistake.

    General perfmon/scheduling - I've seen quite a few instances where a core has been online for a long time (think a few months or more) without a restart. Something hangs inside the scheduler and no deletes happen for a long time. Then there is an upgrade, and all of a sudden the core has this HUGE backlog of deletes to run through. Sure, the customer didn't change anything in the schedule, but that doesn't mean the core doesn't have a lot of work to do now. Lots of deletes can definitely impact performance. So that's why I ask. Another reason to ask about performance counters like disk queue is that they can indicate issues with your hardware. If you know that your system always runs with a disk queue of 1 and you get specific performance, then great. But if you haven't been monitoring that kind of performance and you look and see a high disk queue, it could indicate other problems: failed drives, PERC controller firmware issues, or a controller with a dead battery so you have no more write caching on the controller. There are lots of things that just happen that could coincide with an upgrade and make it seem like a software problem when it's not. So I like to cover all my bases.

    Another reason to ask about the retention policy is that it makes a BIG difference when it comes to rollup and how long rollup takes. For instance, let's say you set up a core one year ago and you have a retention policy that says you are going to keep 4 weeks of daily backups and then 11 monthly backups. When you reach the one-year mark, rollup now has to process a base image and roll together a base and one monthly incremental. That change is likely to be large and require much more time than rolling together two daily backups. So where Tim is in his retention policy may make a difference; he could have just reached the end of his policy around the same time that he upgraded. Again, that's not a problem with the software performance, and it's not a problem caused by changing his configuration (since he didn't); it's a problem of having to process a LOT more data, and it may just happen to coincide with other factors. This is one we actually see a lot. The core runs incredibly fast when you are first starting out. That makes perfect sense, since it isn't processing any rollup data, it isn't having to delete anything, and all it's doing is writing data. But once you reach a point where maintenance jobs have to run, we see performance issues, because reads are MUCH slower than writes since you can't use caching. There is a rough worked example of this at the end of this post.

    Match version - yes, you can upgrade just the target core and not the source. Replication is supposed to work as long as the target core build >= the source core build. However, they need to be relatively close in version. So you should be able to replicate 6.0.2 to 6.2.1 and 6.1.3 to 6.2.1, but we don't expect 5.4.3 to be able to replicate to 6.2.1. There are limits to what we test and are willing to support. There is a small sketch of this rule at the end of this post as well.

    We need more info to diagnose what is going on. That's why I asked all those questions. A lot of that information does come from the GUI. Are maintenance jobs running longer than they used to? Are deletes running longer than they used to? What's the retention policy, and has it been reached? I could go on, but I think I've typed enough for now.
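
    To make the retention point concrete, here is a back-of-the-envelope Python sketch. The sizes are invented and the model is deliberately crude (real rollup behaviour depends on the actual policy and data), but it follows the "4 weeks of dailies plus 11 monthlies" example above and shows how the amount of data a nightly rollup has to merge stays small until the policy has fully aged, then jumps the first time a monthly has to be folded into the base image:

        # Crude model of the "4 weeks daily + 11 monthly" retention example above.
        # All sizes are assumptions; the point is the shape of the workload, not the numbers.
        DAILY_CHANGE_GB = 5                                  # assumed change per nightly incremental
        DAILY_KEEP      = 28                                 # 4 weeks of daily backups
        MONTHLY_KEEP    = 11                                 # 11 monthly backups
        POLICY_SPAN     = DAILY_KEEP + MONTHLY_KEEP * 30     # ~1 year until the policy is "full"

        def nightly_rollup_gb(day):
            if day <= DAILY_KEEP:
                return 0                                     # nothing has aged out yet
            merged = DAILY_CHANGE_GB                         # one aged daily folds into a monthly
            if day > POLICY_SPAN and (day - POLICY_SPAN) % 30 == 1:
                merged += DAILY_CHANGE_GB * 30               # oldest monthly folds into the base image
            return merged

        for day in (27, 29, 200, POLICY_SPAN, POLICY_SPAN + 1, POLICY_SPAN + 2):
            print(f"day {day:3d}: rollup merges roughly {nightly_rollup_gb(day)} GB")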
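
    And a small sketch of the "target build must be equal or newer" rule; the "within a couple of minor releases" part of this check is my own guess at what "relatively close" means, not a documented limit:

        def parse_build(s):
            return tuple(int(p) for p in s.split("."))        # "6.2.1.100" -> (6, 2, 1, 100)

        def replication_supported(source, target):
            src, tgt = parse_build(source), parse_build(target)
            if tgt < src:
                return False                                  # target must be an equal or greater build
            # Assumption: cores should also be reasonably close in version.
            return (tgt[0], tgt[1]) <= (src[0], src[1] + 2)

        print(replication_supported("6.1.3.100", "6.2.1.100"))   # True  - supported pairing
        print(replication_supported("6.0.2.100", "6.2.1.100"))   # True  - supported pairing
        print(replication_supported("5.4.3", "6.2.1.100"))       # False - too far apart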