Rapid Recovery

Why are exports so slow?

So this is something we have had issues with ever since getting our DL4300 appliance two years ago, back when the product was still AppAssure; we are now running the latest version of Rapid Recovery. Our exports typically cannot get any faster than around 10 MB/s (megabytes per second), and over the years Dell/Quest have given us numerous reasons as to why, ranging from:

  • Slow network connectivity
  • Bad firmware on both the DL4300 and our R730XD ESXi hosts
  • Issues with the ESXi kernel, which are outside of Dell/Quest's control
  • Underspec DL4300 Appliance
  • Datastore fragmentation
  • Firewalls & IPS
  • Encryption, De-Dup and/or Compression.
  • The list goes on....

So to give some info on our setup:

  • Appliance: DL4300 running Rapid Recovery 6.2.0.17839
  • ESXi hosts: R730XD - 256GB RAM - dual Xeon E5-2630 CPUs - 24x 15K RPM 600GB Dell SAS hard drives - H730 Mini RAID controller - ESXi 6.0
  • Network: 10Gb/s fibre between all servers - no firewalls or IPS, all direct Layer 2
  • Firmware: all firmware on all servers is up to date with the Dell FTP site

We have addressed every reason (*almost) that has been given so far, and we are now completely lost for an answer, other than that the product is essentially useless for restoring backup data.

Today I took one of our standby R730XD hosts (only 64GB RAM, all other spec the same), did a bare metal install of Server 2016 Datacentre with the Hyper-V role, and started an export so we could rule out ESXi as a cause. To no surprise, we see exactly the same behaviour, with slow transfer speeds of sub-10 MB/s. One thing I have noticed while monitoring the network traffic is small, regular "bursts" of a few megabytes and then nothing, repeating throughout the entire export. It's as if Rapid Recovery is having to wait for the data to be read from disk and is then sending what it has (a rough sketch of how to measure this is included below).

*The one remaining step we have yet to try is disabling compression and de-dup, simply because support have warned us "it may cause some issues, but we can't say specifically". We do have the room in our repository to comfortably disable these features, and would be willing to do so if this has shown results for others.
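In case it helps anyone compare against their own setup, here is a minimal sketch (in Python, assuming the psutil package is available; it is not the exact tool I used) of sampling NIC throughput once a second on the Core or the Hyper-V host while an export is running, to make the bursty pattern visible:

    # Minimal sketch: sample NIC throughput once a second while an export runs.
    # Assumes Python with the 'psutil' package installed; not the exact tool I used.
    import time
    import psutil

    INTERVAL = 1.0  # seconds between samples

    prev = psutil.net_io_counters()
    while True:
        time.sleep(INTERVAL)
        cur = psutil.net_io_counters()
        sent = (cur.bytes_sent - prev.bytes_sent) / (1024 * 1024) / INTERVAL
        recv = (cur.bytes_recv - prev.bytes_recv) / (1024 * 1024) / INTERVAL
        print(f"{time.strftime('%H:%M:%S')}  sent {sent:6.1f} MB/s  recv {recv:6.1f} MB/s")
        prev = cur

A steady export should show a fairly flat line here; what we see is a short burst of a few megabytes and then nothing, over and over for the whole export.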

I would be very interested to hear of any other suggestions, experiences or general feedback from other community members who are (or are not!) experiencing this issue and what steps you have tried to resolve it, successfully or not. Happy to provide more detailed information if needed.

Reply
  • Exports are slow, just like many other processes in RR (archives, for example). I assume it's due to the architecture of the repo, but there is no way to confirm that.

    But 10 MB/s is too slow. Some thoughts.

    First, the easy ones:

    1) You mention you have a Dell appliance; if you have this controller, check the cache settings described here:

    https://support.quest.com/rapid-recovery/kb/180680/dell-perc-controller-cache-settings-for-improved-disk-performance

    2) This one has helped on some of our Cores and is probably your best bet:

    https://support.quest.com/rapid-recovery/kb/119686/high-paged-pool-ram-utilization-on-systems-running-appassure-or-rapid-recover

    - RR has no logging to help, which is embarrassing, so they won't/can't help you troubleshoot their own product. Instead they will point you to Wireshark and perfmon, and then, if those come back clean, they will blame random but impossible-to-check functions. I don't want to give the impression that you should not be running perfmon counters; you should. It's just that the perfmon counters alone only tell you half of the story.

    In the test below, where I exported data from the RAID to SSD, none of the disk counters for the RAID were bad, so I am not sure why it was so slow. But I did not spend a ton of time on this. Even if a single running job crushed disk IO on my Core, there is not much I can do about the repo architecture.

    https://blogs.technet.microsoft.com/askcore/2012/02/07/measuring-disk-latency-with-windows-performance-monitor-perfmon/

    https://blogs.technet.microsoft.com/askcore/2012/03/16/windows-performance-monitor-disk-counters-explained/
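    As a minimal sketch of capturing the relevant counters for a whole export run (the disk latency/queue counters from the two links above, plus the paged pool counter from the KB in point 2), here is a small Python wrapper around typeperf; the counter paths and switches are standard Windows ones, but verify them on your build:

        # Minimal sketch: log the disk latency/queue counters plus the paged pool
        # counter to CSV with typeperf while an export runs. Counter paths and
        # typeperf switches are standard Windows ones, but verify on your build.
        import subprocess

        counters = [
            r"\LogicalDisk(*)\Avg. Disk sec/Read",
            r"\LogicalDisk(*)\Avg. Disk sec/Write",
            r"\LogicalDisk(*)\Current Disk Queue Length",
            r"\Memory\Pool Paged Bytes",
        ]

        # One sample per second for 30 minutes, written to a CSV you can open afterwards.
        subprocess.run(
            ["typeperf", *counters, "-si", "1", "-sc", "1800", "-f", "CSV", "-o", "export_run.csv"],
            check=True,
        )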

    - One thing we have done to try to get baselines is to add an SSD card directly to the Core. Then create a new repo on the SSD, archive some data, and import it into the new repo. Use this as your test bed, because unless your bottleneck is fairly obvious (massive page faults in memory), the issue will probably be with disk IO. It also removes many other factors from the start, like the network. If you want to test the network, put the SSD card in another server after getting a baseline from the Core.

    The problem is, even if the SSD card shows massive numbers (it probably won't), you can't scale it. You can't create a 60TB SSD RAID. You can't tune the repo or RR to handle disk IO better. And you are probably testing this on a production Core, so what if an RPFS delete, rollup, or dedup cache dump ran during one test but not the other? But it does let you test certain functions to see how fast they can run on SSD.

    Here are some numbers from the last test I ran; note this is a test Core and not in production. The export was local to the Core (2012 R2 running the Hyper-V role).

    Export from RAID disk repo to SSD:

    Total Work: 32.62 GB
    Rate: 36.95 MB/s
    Elapsed Time: 15:26

    Export from SSD repo back to itself (SSD):

    Total Work: 32.61 GB
    Rate: 102.05 MB/s
    Elapsed Time: 5:03

    Both exports were of the exact same RP, and it was a newly created base. I am not sure how the fact that this was a base would affect performance; I assume an export of an incremental would take even longer, as it would have to build the data before export, but that's just a guess.
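    As a rough sanity check, the rates RR reports line up reasonably well with total work divided by elapsed time (taking "Total Work" to be GB); the RAID figure is close, the SSD one a bit off, perhaps down to how RR averages the rate:

        # Back-of-the-envelope check of the export rates above,
        # assuming "Total Work" is reported in GB.
        def rate_mb_per_s(total_gb, minutes, seconds):
            return (total_gb * 1024) / (minutes * 60 + seconds)

        print(round(rate_mb_per_s(32.62, 15, 26), 1))  # ~36.1 MB/s vs 36.95 reported
        print(round(rate_mb_per_s(32.61, 5, 3), 1))    # ~110.2 MB/s vs 102.05 reported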

    You can't compare straight file copies to what RR does with an export, but I like to run a copy of the exact same data just to see. I use Robocopy since it gives statistics.

    Robocopy from RAID to SSD of the exact same export data = 5 minutes and 04 seconds

    Robocopy from SSD to itself (SSD) of the exact same export data = 2 minutes and 04 seconds
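    For anyone who wants to repeat the straight-copy baseline, here is a sketch of the kind of Robocopy run involved (paths are placeholders, not my actual layout); the summary Robocopy prints at the end gives the elapsed time, bytes, and speed:

        # Sketch of the straight-copy baseline; source/destination are placeholders.
        # /E copies subfolders, /NP suppresses per-file progress, /LOG writes the
        # summary (elapsed time, bytes, speed) to a file.
        import subprocess

        SRC = r"D:\RepoOnRaid\ExportData"   # placeholder: data exported from the RAID repo
        DST = r"S:\SsdTest\ExportData"      # placeholder: folder on the SSD

        subprocess.run(
            ["robocopy", SRC, DST, "/E", "/NP", "/LOG:robocopy_baseline.log"],
            check=False,  # Robocopy uses non-zero exit codes even on success
        )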

    So now you've spent all this time and effort troubleshooting the performance of an application; what can you do? Honestly, I have no idea. This is a small test Core that was not running a single other job and was using a small set of RAID disks. Could I add six more spindles? Maybe. Would it help? No idea. How could I apply this to a production Core that may be having performance issues? No idea.

    Why did I waste my time doing all these tests and writing this up? No idea, but Happy 4th!

    If anyone has any feedback or sees a flaw in my approach, I would appreciate being set straight.
