So this is something we have had issues with ever since we got our DL4300 appliance two years ago, back when it was still AppAssure; we are now running the latest version of Rapid Recovery. Our exports typically cannot get any faster than around 10 MB/s (megabytes per second), and over the years Dell/Quest have given us numerous explanations for this, ranging from:
So to give some info on our setup:
We have addressed every reason (*almost) that has been given so far, and we are now completely lost for an answer other than that the product is essentially useless for restoring backup data. Today I took one of our standby R730XD hosts (only 64 GB RAM, all other spec the same) and did a bare-metal install of Server 2016 Datacenter with the Hyper-V role, then started an export so we could rule out ESXi as a cause. To no surprise, we see exactly the same behaviour: slow transfer speeds of sub-10 MB/s.

One thing I have noticed: monitoring the network traffic shows small, regular "bursts" of a few megabytes, then nothing, then the same again throughout the entire export. It's as if Rapid Recovery has to wait for the data to be read from disk and then sends what it has.

*The one remaining step we have yet to try is disabling compression and dedupe, simply because support have warned us "it may cause some issues, but we can't say specifically". We do have the room in our repository to comfortably disable these features, and we would be willing to do so if it has shown results for others.
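That burst-then-stall pattern is consistent with the disk read and the network send not overlapping. A minimal Python sketch of the idea (the 12 MB/s repo-read and 110 MB/s link rates are made-up numbers for illustration, not measurements from our appliance):

```python
def effective_rate_serialized(read_mbps, send_mbps):
    # Each megabyte must be fully read before any of it is sent,
    # so per-MB time is read time plus send time.
    return 1.0 / (1.0 / read_mbps + 1.0 / send_mbps)

def effective_rate_pipelined(read_mbps, send_mbps):
    # If read and send overlap, the slower stage sets the pace.
    return min(read_mbps, send_mbps)

# Made-up rates: a repo that reads at 12 MB/s feeding a ~110 MB/s link.
print(round(effective_rate_serialized(12, 110), 1))  # 10.8
print(effective_rate_pipelined(12, 110))             # 12
```

Either way a repo that can only feed data at ~12 MB/s dominates, which would explain why the link speed barely matters.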
I would be very interested to hear of any other suggestions, experiences or general feedback from other community members who are (or are not!) experiencing this issue and what steps you have tried to resolve it, successfully or not. Happy to provide more detailed information if needed.
Exports are slow, just like many other processes in RR (like archives). I assume it's due to the architecture of the repo, but there is no way to confirm that.
But 10 MB/s is too slow. Some thoughts.
First, the easy ones:
1) You mention you have a Dell appliance; if you have this controller
2) This one has helped on some of our Cores and is probably your best bet
- RR has no logging to help; it is embarrassing. So they won't/can't help you troubleshoot their product. Instead they will point you to Wireshark and perfmon, and then, if those come back clean, they will blame random but impossible-to-check functions. I don't want to give the impression that you should not be running perfmon counters; you should. It's just that the perfmon counters alone tell you half of the story.
In the test below where I exported data from the raid to SSD, none of the disk counters for the raid were bad, so I am not sure why it was so slow. But I did not spend a ton of time on this. Even if a single running job crushed disk IO on my Core, there is not much I can do about repo architecture.
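For what it's worth, if you do capture counters with perfmon/typeperf during an export, it's easy to average a counter column out of the CSV afterwards. A small sketch (the sample rows and machine name are made up for illustration; typeperf's real CSV layout is a timestamp column followed by one column per counter):

```python
import csv, io

def avg_counter(csv_text, column):
    """Average one counter column from a perfmon/typeperf CSV export.
    Column 0 is the timestamp; counters start at column 1."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    values = [float(r[column]) for r in rows[1:] if r and r[column].strip()]
    return sum(values) / len(values)

# Made-up sample capture of "\PhysicalDisk(_Total)\Avg. Disk Queue Length"
# from a hypothetical Core named CORE.
sample = '''"(PDH-CSV 4.0)","\\\\CORE\\PhysicalDisk(_Total)\\Avg. Disk Queue Length"
"07/04/2017 10:00:01","1.25"
"07/04/2017 10:00:02","4.75"
"07/04/2017 10:00:03","3.00"'''

print(avg_counter(sample, 1))  # 3.0
```

That at least lets you compare averages between a slow export window and a quiet window, rather than eyeballing the live graph.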
- One thing we have done to try and get baselines is add an SSD card directly to the Core, then create a new repo on the SSD, archive some data, and import it into the new repo. Use this as your test bed, because unless your bottleneck is fairly obvious (massive page faults in memory), the issue will probably be with disk IO. It also removes many other factors to start with, like the network. If you want to test the network, throw the SSD card in another server after getting a baseline from the Core.
The problem is, even if the SSD card shows massive numbers (it probably won't), you can't scale it. You can't create a 60 TB SSD RAID. You can't tune the repo or RR to handle disk IO better. And you are probably testing this on a production Core, so what if an RPFS delete, rollup or dedupe cache dump ran during one test but not the other? But it lets you test certain functions to see how fast they can run on SSD.
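If you don't have a purpose-built tool like diskspd to hand, even a crude script gives a ballpark sequential figure for comparing the repo disks against the SSD. A rough sketch (illustrative only; OS caching still flatters the number, hence the fsync before stopping the clock):

```python
import os, tempfile, time

def sequential_write_mbps(size_mb=64, chunk_mb=4):
    """Crude sequential-write test: a very rough stand-in for a
    proper benchmark tool such as diskspd."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(size_mb // chunk_mb):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # force the data out of the OS cache
        elapsed = time.perf_counter() - start
    finally:
        os.unlink(path)
    return size_mb / elapsed

print(f"{sequential_write_mbps():.1f} MB/s")
```

Run it once with the temp directory on the repo volume and once on the SSD; the absolute numbers are rough, but the ratio is informative.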
Here are some numbers from the last test I ran; note this is a test Core and not in production. The export was local to the Core (2012 R2 running the Hyper-V role).
Export from RAID disk Repo to SSD:
Total Work: 32.62 GB, Rate: 36.95 MB/s, Elapsed Time: 15:26
Export from SSD Repo back to itself (SSD):
Total Work: 32.61 GB, Rate: 102.05 MB/s, Elapsed Time: 5:03
Both exports were of the exact same RP, which was a newly created base. I am not sure how the fact that this was a base would affect performance; I assume an export of an incremental would take even longer, as it would have to build the data before export, but that's just a guess.
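As a sanity check on those numbers, the reported Rate and Elapsed Time agree with Total Work (taking it as GB) to within roughly 10%, so the displayed rate looks like a genuine average rather than a peak:

```python
def expected_elapsed_s(total_gb, rate_mbps):
    # Seconds implied by Total Work (GB) at the reported rate (MB/s).
    return total_gb * 1024 / rate_mbps

def mmss_to_s(mmss):
    m, s = mmss.split(":")
    return int(m) * 60 + int(s)

# RAID repo -> SSD: 32.62 GB at 36.95 MB/s, reported elapsed 15:26
print(round(expected_elapsed_s(32.62, 36.95)), mmss_to_s("15:26"))  # 904 926
# SSD repo -> SSD: 32.61 GB at 102.05 MB/s, reported elapsed 5:03
print(round(expected_elapsed_s(32.61, 102.05)), mmss_to_s("5:03"))  # 327 303
```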
You can't compare straight file copies to what RR does with an export, but I like to run a copy of the exact same data just to see. I use Robocopy since it gives statistics.
Robocopy from RAID to SSD of the exact same export data = 5 minutes and 04 seconds
Robocopy from SSD to itself (SSD) of the exact same export data = 2 minutes and 04 seconds
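Assuming Robocopy moved the same ~32.6 GB data set as the exports, the raw copies ran roughly three times faster than RR's exports on the same disks:

```python
def rate_mbps(gb, seconds):
    # MB/s implied by copying `gb` gigabytes in `seconds`.
    return gb * 1024 / seconds

# Assumption: the copied data set matches the ~32.6 GB export size above.
raid_to_ssd = rate_mbps(32.6, 5 * 60 + 4)  # vs 36.95 MB/s for the export
ssd_to_ssd = rate_mbps(32.6, 2 * 60 + 4)   # vs 102.05 MB/s for the export
print(round(raid_to_ssd), round(ssd_to_ssd))  # 110 269
```

So the raw disks are not the whole story; a similar-sized gap shows up on both spindles and SSD, pointing at overhead in the export path itself.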
So now you have spent all this time and effort troubleshooting the performance of an application; what can you do? Honestly, I have no idea. This is a small test Core that was not running a single other job and was using a small set of RAID disks. Could I add 6 more spindles? Maybe. Would it help? No idea. How could I apply this to a Core in production that may be having performance issues? No idea.
Why did I waste my time doing all these tests and writing this up? No idea, but Happy 4th!
If anyone has any feedback or sees a flaw in my approach, I would appreciate being set straight.
Great reply, thanks very much for taking the time out of your day for it.
We have sunk dozens of hours into troubleshooting our Core; unfortunately, our single Core is in production, so we don't have anything to test freely against. We have another ticket open for this with support, who now agree that there is an issue, so it will be interesting to see what they find/say this time.
It's such a shame this issue has persisted for so long. It seemed to get worse when the product changed from AppAssure to Rapid Recovery; I believe they made very significant changes to the repo architecture, and that was blamed in the past for the slow exports, due to fragmentation in the new repo. We were told a defrag tool would be made available last year, but we have yet to see any sign of it. We really are at the point now of considering using the appliance as a door wedge and moving to another product. Waiting an hour for a 6 GB incremental backup to push to a virtual standby is ridiculous at best, not to mention the thought of having to recover a multi-terabyte server.
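To put numbers on that: 6 GB in an hour works out to under 2 MB/s, and even at the 10 MB/s ceiling a restore of a couple of terabytes is a multi-day job (the 2 TB figure is illustrative, not one of our servers):

```python
def hours_to_restore(tb, rate_mbps):
    # Hours to move `tb` terabytes at `rate_mbps` MB/s.
    return tb * 1024 * 1024 / rate_mbps / 3600

# 6 GB pushed to a virtual standby in ~1 hour:
print(round(6 * 1024 / 3600, 1))          # 1.7 MB/s
# Even at the 10 MB/s ceiling, a hypothetical 2 TB server:
print(round(hours_to_restore(2, 10), 1))  # 58.3 hours
```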