Rapid Recovery

Why are exports so slow?

So this is something we have had issues with ever since getting our DL4300 appliance two years ago, back when the product was still AppAssure; we are now running the latest version of Rapid Recovery. Our exports typically cannot get any faster than around 10 MB/s (megabytes per second), and over the years Dell/Quest have given us numerous reasons as to why, ranging from:

  • Slow network connectivity
  • Bad firmware on both the DL4300 and our R730XD ESXi hosts
  • Issues with the ESXi kernel, which are outside of Dell/Quest's control
  • Underspec DL4300 Appliance
  • Datastore fragmentation
  • Firewalls & IPS
  • Encryption, De-Dup and/or Compression.
  • The list goes on....

So to give some info on our setup:

  • Appliance: DL4300 running Rapid Recovery 6.2.0.17839
  • ESXi hosts: R730XD - 256GB RAM - dual Xeon E5-2630 CPUs - 24x 15K RPM 600GB Dell SAS hard drives - H730 Mini RAID controller - ESXi 6.0
  • Network: 10Gb/s fibre between all servers - no firewalls or IPS, all direct Layer 2
  • Firmware: all firmware on all servers is up to date with the Dell FTP site

We have addressed every reason (*almost) that has been given so far, and we are now completely lost for an answer, other than that the product is essentially useless for restoring backup data.

Today I took one of our standby R730XD hosts (only 64GB RAM, all other spec the same), did a bare metal install of Server 2016 Datacentre with the Hyper-V role, and started an export so we could rule out ESXi as a cause. To no surprise, we see exactly the same behaviour, with slow transfer speeds of sub-10 MB/s. One thing I have noticed while monitoring the network traffic is small, regular "bursts" of a few megabytes and then nothing, repeating throughout the entire export. It's as if Rapid Recovery is having to wait for the data to be read from disk and is then sending what it has (a rough sketch of how to measure this is included below).

*The one remaining step we have yet to try is disabling compression and de-dup, simply because support have warned us "it may cause some issues, but we can't say specifically". We do have the room in our repository to comfortably disable these features, and would be willing to do so if this has shown results for others.
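In case it helps anyone compare against their own setup, here is a minimal sketch (in Python, assuming the psutil package is available; it is not the exact tool I used) of sampling NIC throughput once a second on the Core or the Hyper-V host while an export is running, to make the bursty pattern visible:

    # Minimal sketch: sample NIC throughput once a second while an export runs.
    # Assumes Python with the 'psutil' package installed; not the exact tool I used.
    import time
    import psutil

    INTERVAL = 1.0  # seconds between samples

    prev = psutil.net_io_counters()
    while True:
        time.sleep(INTERVAL)
        cur = psutil.net_io_counters()
        sent = (cur.bytes_sent - prev.bytes_sent) / (1024 * 1024) / INTERVAL
        recv = (cur.bytes_recv - prev.bytes_recv) / (1024 * 1024) / INTERVAL
        print(f"{time.strftime('%H:%M:%S')}  sent {sent:6.1f} MB/s  recv {recv:6.1f} MB/s")
        prev = cur

A steady export should show a fairly flat line here; what we see is a short burst of a few megabytes and then nothing, over and over for the whole export.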

I would be very interested to hear of any other suggestions, experiences or general feedback from other community members who are (or are not!) experiencing this issue and what steps you have tried to resolve it, successfully or not. Happy to provide more detailed information if needed.

Reply
  • Exports are slow, just like many other processes in RR (archives, for example). I assume it's due to the architecture of the repo, but there is no way to confirm that.

    But 10 MB/s is too slow. Some thoughts.

    First, the easy ones:

    1) You mention you have a Dell appliance; if you have this controller, check the cache settings described here:

    https://support.quest.com/rapid-recovery/kb/180680/dell-perc-controller-cache-settings-for-improved-disk-performance

    2) This one has helped on some of our Cores and is probably your best bet:

    https://support.quest.com/rapid-recovery/kb/119686/high-paged-pool-ram-utilization-on-systems-running-appassure-or-rapid-recover

    - RR has no logging to help, which is embarrassing, so they won't/can't help you troubleshoot their own product. Instead they will point you to Wireshark and perfmon, and then, if those come back clean, they will blame random but impossible-to-check functions. I don't want to give the impression that you should not be running perfmon counters; you should. It's just that the perfmon counters alone only tell you half of the story.

    In the test below, where I exported data from the RAID to SSD, none of the disk counters for the RAID were bad, so I am not sure why it was so slow. But I did not spend a ton of time on this. Even if a single running job crushed disk IO on my Core, there is not much I can do about the repo architecture.

    https://blogs.technet.microsoft.com/askcore/2012/02/07/measuring-disk-latency-with-windows-performance-monitor-perfmon/

    https://blogs.technet.microsoft.com/askcore/2012/03/16/windows-performance-monitor-disk-counters-explained/
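    As a minimal sketch of capturing the relevant counters for a whole export run (the disk latency/queue counters from the two links above, plus the paged pool counter from the KB in point 2), here is a small Python wrapper around typeperf; the counter paths and switches are standard Windows ones, but verify them on your build:

        # Minimal sketch: log the disk latency/queue counters plus the paged pool
        # counter to CSV with typeperf while an export runs. Counter paths and
        # typeperf switches are standard Windows ones, but verify on your build.
        import subprocess

        counters = [
            r"\LogicalDisk(*)\Avg. Disk sec/Read",
            r"\LogicalDisk(*)\Avg. Disk sec/Write",
            r"\LogicalDisk(*)\Current Disk Queue Length",
            r"\Memory\Pool Paged Bytes",
        ]

        # One sample per second for 30 minutes, written to a CSV you can open afterwards.
        subprocess.run(
            ["typeperf", *counters, "-si", "1", "-sc", "1800", "-f", "CSV", "-o", "export_run.csv"],
            check=True,
        )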

    - One thing we have done to try to get baselines is to add an SSD card directly to the Core. Then create a new repo on the SSD, archive some data, and import it into the new repo. Use this as your test bed, because unless your bottleneck is fairly obvious (massive page faults in memory), the issue will probably be with disk IO. It also removes many other factors from the start, like the network. If you want to test the network, put the SSD card in another server after getting a baseline from the Core.

    The problem is, even if the SSD card shows massive numbers (it probably won't), you can't scale it. You can't create a 60TB SSD RAID. You can't tune the repo or RR to handle disk IO better. And you are probably testing this on a production Core, so what if an RPFS delete, rollup, or dedup cache dump ran during one test but not the other? But it does let you test certain functions to see how fast they can run on SSD.

    Here are some numbers from the last test I ran; note this is a test Core and not in production. The export was local to the Core (2012 R2 running the Hyper-V role).

    Export from RAID disk repo to SSD:

    Total Work: 32.62 GB
    Rate: 36.95 MB/s
    Elapsed Time: 15:26

    Export from SSD repo back to itself (SSD):

    Total Work: 32.61 GB
    Rate: 102.05 MB/s
    Elapsed Time: 5:03

    Both exports were of the exact same RP, and it was a newly created base. I am not sure how the fact that this was a base would affect performance; I assume an export of an incremental would take even longer, as it would have to build the data before export, but that's just a guess.
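    As a rough sanity check, the rates RR reports line up reasonably well with total work divided by elapsed time (taking "Total Work" to be GB); the RAID figure is close, the SSD one a bit off, perhaps down to how RR averages the rate:

        # Back-of-the-envelope check of the export rates above,
        # assuming "Total Work" is reported in GB.
        def rate_mb_per_s(total_gb, minutes, seconds):
            return (total_gb * 1024) / (minutes * 60 + seconds)

        print(round(rate_mb_per_s(32.62, 15, 26), 1))  # ~36.1 MB/s vs 36.95 reported
        print(round(rate_mb_per_s(32.61, 5, 3), 1))    # ~110.2 MB/s vs 102.05 reported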

    You can't compare straight file copies to what RR does with an export, but I like to run a copy of the exact same data just to see. I use Robocopy since it gives statistics.

    Robocopy from RAID to SSD of the exact same export data = 5 minutes and 04 seconds

    Robocopy from SSD to itself (SSD) of the exact same export data = 2 minutes and 04 seconds
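    For anyone who wants to repeat the straight-copy baseline, here is a sketch of the kind of Robocopy run involved (paths are placeholders, not my actual layout); the summary Robocopy prints at the end gives the elapsed time, bytes, and speed:

        # Sketch of the straight-copy baseline; source/destination are placeholders.
        # /E copies subfolders, /NP suppresses per-file progress, /LOG writes the
        # summary (elapsed time, bytes, speed) to a file.
        import subprocess

        SRC = r"D:\RepoOnRaid\ExportData"   # placeholder: data exported from the RAID repo
        DST = r"S:\SsdTest\ExportData"      # placeholder: folder on the SSD

        subprocess.run(
            ["robocopy", SRC, DST, "/E", "/NP", "/LOG:robocopy_baseline.log"],
            check=False,  # Robocopy uses non-zero exit codes even on success
        )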

    So now you've spent all this time and effort troubleshooting the performance of an application; what can you do? Honestly, I have no idea. This is a small test Core that was not running a single other job and was using a small set of RAID disks. Could I add six more spindles? Maybe. Would it help? No idea. How could I apply this to a production Core that may be having performance issues? No idea.

    Why did I waste my time doing all these tests and writing this up? No idea, but Happy 4th!

    If anyone has any feedback or sees a flaw in my approach, I would appreciate being set straight.
