Rapid Recovery

Replication bandwidth

We've just moved our replica core offsite but, TYPICALLY, our source core has decided to take a base image of a very large volume on one of our FS, grrrr. Not sure why it did, but it has, and it now wants to copy 2.2 TB up to the offsite replica. Nice! Luckily we have a 1 Gbps line to our offsite location, but the replication job is only running at about 10 MB/s. I've got the replication set as follows, so it shouldn't be throttled, and there isn't anything else restricting traffic. Any ideas on how to increase the wire speed? Is there a setting somewhere else?

 

  • I feel for you. I can't offer much to help your current situation. We used to see this exact issue all the time and never really got any solid advice, just vague recommendations that it's a core resource issue. I believe that much of it is built into the product; these operations (like archives) are just painfully slow. Of course this is a very resource-intensive process, so check your memory and disk IO counters, etc.

    But I would set up your dedup cache so that it is large enough to cover your entire repo. Now that your replication core is offsite, this will be critical and will help you avoid this issue going forward.

    support.quest.com/.../134726
  • It's defo not a core resource issue; it's a brand new DL4300 and the only job running most of the time is the replication job off to the replica core. I expect it'll get there in the end; it should do a bit under 1 TB a day at 10 MB/s, if of course it runs without any timeouts (which it never seems to do!). Just REALLY annoying that we've had the replica sat onsite for the last two months, then the day before it ships offsite it does a 2.2 TB base! The FS is about 12 TB in total. I could, if pushed, do a seed using a couple of 8 TB disks and ship them offsite, but that's a pain in itself. Oh the fun when we migrate to a new FS next year!
  • There isn't a limitation built into the software that would throttle the replication speed beyond the settings you provided. Replication will push data out as fast as it can. You would notice this on a LAN-based replication setup that truly runs over a 1 Gb link.

    My recommendation is to run a few speed tests utilizing that link between the two Cores.
    When I troubleshoot replication performance, I normally begin with a few basic Internet speed tests to get an idea of whether I'm receiving the throughput the ISP promised. After that I'll do an iPerf test (support.quest.com/.../131614). While the iPerf test is running, open Resource Monitor and check the disk IO utilization on both Cores, and verify that Timing, Disk Queue, and Response Times are not showing signs of poor performance.
  • Hi dtghelp,
    There are a few things you can try to increase the replication speed, although they may not apply to the job you are currently running. However, there are a few items to consider first.
    Basically replication has a few separate steps, each of which affects the overall performance.

    The first step is the rate at which data is retrieved on the source core. This depends mainly on repository performance, which in turn is determined by the hardware capabilities, drivers/firmware and overall load on the core. Assuming that the hardware is performant and that the drivers and firmware (storage AND hard drives) are current, you need to monitor the load on the core. Besides the regular backups, you have a host of other jobs that consume storage IOPS. In my experience, mountability, attachability and RP checks, rollups (sometimes with huge amounts of deferred deletes) and other replications may considerably slow down data retrieval for replication. If possible, limiting these operations during large replication jobs may increase performance several times over. Please note that, by default, the repository takes 64 concurrent operations and most jobs are composed of multiple operations. For instance, backups use 8 streams each by default and replication uses a total of 8 streams.

    The second step is the data ingestion rate, which is influenced by all of the above (except backup jobs) plus the inner workings of the dedupe cache on the target core. Please note the "read-match-write" process, which is intended to send over the wire only the blocks that cannot be found in the cache.

    The third step is the actual WAN pipe performance. 1 Gbps is obviously a very good speed (125 MB/s theoretical maximum), but the replication process was not optimized for it, so it is unlikely to reach the link's full potential. However, if replicating locally worked better than it does now remotely, you may need to troubleshoot the connection. For instance, depending on your provider and your Service Agreement, the upload speed may be considerably slower than the download speed. Since replication depends on uploading data to the target core, it makes sense to check whether that is the case. Another possibility is that the WAN speed is limited on a per-client basis at your organization level. For instance, if one considers various performance factors AND the way some providers calculate the WAN speed they offer, 10 MB/s may correspond to a 100 Mbps link. This is a rather common issue that needs to be addressed with the SysAdmin guys in your organization.

    All this being said, and assuming that none of the performance-degrading factors apply, you can increase the number of parallel streams allocated to replication. I would start with 16 and go up or down based on what the pipe can take (if you get errors, reduce the number of streams; otherwise, you may try increasing them until you begin seeing errors).

    Hope that this helps.
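
For reference, the figures quoted in this thread (1 Gbps, 100 Mbps, 10 MB/s, 2.2 TB) can be sanity-checked with a little arithmetic. What follows is only a rough sketch: it assumes decimal units (1 TB = 1,000,000 MB), a perfectly sustained rate, and no timeouts or retries. The function names are made up for illustration and are not part of Rapid Recovery.

```python
# Back-of-the-envelope replication estimates.
# Assumptions (not from the product): decimal units, constant throughput,
# no protocol overhead, no timeouts or retries.

SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000


def mbps_to_mb_per_s(megabits_per_second: float) -> float:
    """Convert a line rate in megabits/s to megabytes/s (8 bits per byte)."""
    return megabits_per_second / 8


def tb_per_day(mb_per_s: float) -> float:
    """Data moved in 24 hours at a sustained rate given in MB/s."""
    return mb_per_s * SECONDS_PER_DAY / MB_PER_TB


def days_to_transfer(size_tb: float, mb_per_s: float) -> float:
    """Days needed to move size_tb terabytes at a sustained rate in MB/s."""
    return size_tb / tb_per_day(mb_per_s)


if __name__ == "__main__":
    print("1 Gbps line, theoretical MB/s:", mbps_to_mb_per_s(1000))
    print("100 Mbps line, theoretical MB/s:", mbps_to_mb_per_s(100))
    print("TB per day at 10 MB/s:", round(tb_per_day(10), 3))
    print("Days for 2.2 TB at 10 MB/s:", round(days_to_transfer(2.2, 10), 2))
    print("Days for 2.2 TB at 125 MB/s:", round(days_to_transfer(2.2, 125), 2))
```

Under these assumptions, a 1 Gbps line tops out at 125 MB/s and a 100 Mbps line at 12.5 MB/s, so an observed 10 MB/s is about what a 100 Mbps cap would deliver once overhead is taken into account. At 10 MB/s the job moves roughly 0.86 TB a day, which puts the 2.2 TB base image at around two and a half days if it runs without interruption.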