
Virtual Standby exports to ESXi are slow

Appliance: DL4300

Software Version: RapidRecovery 6.1.2.115

ESXi Version: 6.0

ESXi Server Spec: Dell R730x, 24 x 15k SAS drives in RAID 6 (Adaptive Read Ahead, Write Back), 64GB RAM, Dual Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz

Problem: Exports from RR to ESXi run below 10MB/s on a 10Gb/s network. Many of our standbys are in the TB-per-day range, meaning most take approximately 20-30 hours to complete. During this time, backups of that agent do not take place, leaving the server vulnerable.
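
For context, here is a rough back-of-the-envelope calculation of why these exports land in the 20-30 hour range (a minimal sketch in Python; the 10MB/s and 100MB/s rates are the figures from this post, while decimal units and a constant sustained rate are assumptions):

```python
# Rough estimate of export duration from data size and sustained throughput.
# Assumes decimal units (1 TB = 1,000,000 MB) and a constant rate, ignoring
# any per-job overhead.

def export_hours(size_tb: float, rate_mb_s: float) -> float:
    """Hours needed to move size_tb terabytes at rate_mb_s megabytes/second."""
    return size_tb * 1_000_000 / rate_mb_s / 3600

print(f"1 TB at 10 MB/s:  ~{export_hours(1, 10):.1f} hours")   # ~27.8 hours
print(f"1 TB at 100 MB/s: ~{export_hours(1, 100):.1f} hours")  # ~2.8 hours
```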

 

Troubleshooting steps taken with Support:

 - All software and hardware updates completed

 - Direct 10Gb/s fibre between Core and ESXi

 - Paused all other tasks on the Core and restarted the server

 - Changed RAID policy to Adaptive Read Ahead and Write Back

 

An export of an agentless backup last night seemed to run in the 100MB/s+ range; however, after checking the logs this morning, I found it had dropped back down to 18MB/s after an hour.

 

Any help is appreciated; even a comment to say you are experiencing the same issue, along with info about your setup, would assist.

 

James

  • Hello James,

    The situation that you are describing does tend to come up fairly often in data protection, as that is one of the few times that a third-party entity (in this case the RR Core) has to write a large amount of data into VMware for an extended period of time and/or often. My reply will use the words 'generally' and 'usually', as many assumptions are made since, in the end, we are referring to a Windows OS (the Core) writing into VMware. By all means though, I'm happy to have a discussion.

    One thing that I will throw out, which will be a recurring theme: writing from the outside into VMware (in this case our Windows Core writing out to ESXi and then to a datastore) is not nearly as fast as anyone would like it to be. It is generally a fraction of the speed you would get if you were to take the same hardware and do a file copy. Regardless of the products I have used and/or supported over the years, this sentiment comes up time and time again: writing into VMware is not as fast as one would think. That is especially true when you factor in variables (as with Rapid Recovery) where you have to decompress and rehydrate your data before it can be written; those extra steps take an already slow process and make it even slower.
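
    To put a rough number on that last point, here is a minimal sketch (the stage rates below are made-up examples, not Rapid Recovery measurements) of how serial decompress/rehydrate/write stages drag the effective rate below even the slowest single stage:

```python
# Effective throughput of a serial (non-overlapped) pipeline of stages.
# Each MB must pass through every stage, so the per-MB times add up.
# The stage rates below are illustrative assumptions, not measured values.

def serial_throughput(stage_rates_mb_s):
    """Combined MB/s when stages run one after another per chunk of data."""
    return 1.0 / sum(1.0 / r for r in stage_rates_mb_s)

# Hypothetical example: decompress at 60 MB/s, rehydrate at 80 MB/s,
# write into the datastore at 40 MB/s.
stages = [60.0, 80.0, 40.0]
print(f"Slowest stage alone: {min(stages):.1f} MB/s")
print(f"Serial pipeline:     {serial_throughput(stages):.1f} MB/s")  # ~18.5 MB/s
```

    If the stages overlap (i.e. are pipelined), the ceiling becomes the slowest stage instead, but either way the result sits well below a straight file-copy figure on the same hardware.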

    Having said that, that is specifically why they provide the HotAdd and LAN-Free SAN features: to try to 'boost' performance a bit for their customers. Looking at your post, it suggests that you might have tried to set up LAN-Free SAN, or did; is that correct? Normally this doesn't single-handedly improve single-job performance; it tends to assist with 'towing capacity', or the ability to run multiple jobs without incurring as much of a drop in performance. LAN-Free SAN is not intended to make one task 'faster', but to allow more jobs to run simultaneously. HotAdd, on the other hand, is known for giving a bigger boost to a single job. HotAdd is utilized if your Core is running on a VMware VM and has access to the same datastores as the VMware VMs that it is backing up. Both technologies, however, are known/notorious for slowing down over time; it tends to be more evident with HotAdd, but both do.
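
    To make the distinction concrete, here is a tiny sketch of the kind of decision logic involved (purely illustrative: the mode names follow VMware's VDDK transport-mode terminology, while the function itself is hypothetical and not Rapid Recovery code):

```python
# Illustrative only: which VDDK-style transport mode a backup/export tool
# might prefer, given how the Core is deployed. Not Quest's implementation.

def preferred_transport(core_is_vm: bool, shares_datastores: bool,
                        san_zoned_to_core: bool) -> str:
    if core_is_vm and shares_datastores:
        return "hotadd"   # virtual Core attaches the target disks directly
    if san_zoned_to_core:
        return "san"      # LAN-Free: physical Core moves data over the SAN fabric
    return "nbd"          # fallback: network block device through the ESXi host

# Example: a physical Core appliance with LAN-Free SAN configured.
print(preferred_transport(core_is_vm=False, shares_datastores=False,
                          san_zoned_to_core=True))   # -> "san"
```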

    What we commonly see, and what we relay to customers, is that on average we see 10-20 MB/s when writing into VMware. Some customers are lower, some are higher, but the vast majority fall into this range. When I mention these rates I'm referring to the average speed throughout; burst speeds do come and go, however this is where the vast majority sit. What I can tell you is that even with a 1 Gb NIC, a transport into VMware is more than likely not going to max out that single NIC, let alone a 10 Gb one. For example, with LAN-Free SAN and 10k disks both in VMware and in my repo, right now I am humming along at ~17 MB/s. When I watch it, it goes up into the 20s and then back below, usually into the teens. I kicked off a HotAdd restore, again with 10k disks, and it started in the 30s and within minutes it is back down to the low 20s. This is exactly the reason virtual standbys are configured: so that you can avoid this latency at restore time, as long as you have the disk space available. As for the speed, when asked, those are the metrics that I give customers, as they are realistic and tend to be the average of what we see. Some are higher, some lower, but the higher ones tend to be burst speeds, or environments where vSAN (or another form of flash media) is involved (for the repo or the datastores).

    You mentioned that your virtual standbys are TBs in size; is that the initial export, or do you really have TBs' worth of change on a daily basis? That would be a higher change rate than we normally see day to day.

    Also, you mentioned that while the virtual standby is running you can't perform a backup of the machine; that should not be the case. You should be able to back up a protected machine while doing an export of it. I can't recall off-hand whether that used to be a problem in AppAssure, however in Rapid Recovery an export of a protected machine does not prevent a transfer for that protected machine.
  • Hi Phuffers,

    Many thanks for your response, there is a lot to take away from it.

    We finally got an official response from Quest yesterday afternoon after three weeks of troubleshooting: the issue is caused by fragmentation of the file system used in the repository. Unfortunately, there is currently no way to resolve this other than to delete the repository and start again, which is obviously not a viable option. If space is available, the archive feature can be used to export the recovery points and then import them again after the rebuild. However, given how slow exports are, archiving 17+ TB of data from the repository would likely take weeks.
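
    For anyone checking the maths behind "weeks", here is a quick sketch assuming the archive job sustains the 10-20 MB/s averages quoted earlier in this thread (an assumption, since we have not measured it):

```python
# Back-of-the-envelope: time to archive 17 TB at a sustained 10-20 MB/s.
# The rates are assumed from figures quoted in this thread, not measured.
for rate_mb_s in (10, 20):
    days = 17 * 1_000_000 / rate_mb_s / 86400   # TB -> MB -> seconds -> days
    print(f"{rate_mb_s} MB/s: ~{days:.0f} days")
# 10 MB/s: ~20 days; 20 MB/s: ~10 days of continuous copying
```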

    It's surprising that this issue has not been made public by Quest despite their knowing about it, as we can't be the only people with large daily exports that cannot complete in a satisfactory amount of time. I understand from the support request that a tool to defragment the repository will be available in the next release some time next year; however, this has left us having to invest further capital into another solution for our DR, be that an additional appliance to handle solely virtual standbys or another solution (VEEAM).

    It's a shame really, as aside from this issue AppAssure, and now RapidRecovery, has been an amazing appliance that works very well.
  • No problem. However, your reply does raise a few questions that I am curious about, since there seems to be a gap between the information that I have and how I use the product, and what you have been told. If at all possible, can I get the support request # that you opened so that I can look through the case notes and follow up on our end?

    Thank you for your input, James.
  • I believe we are in the same boat. Quest state that the known issue of exports being slow on a fragmented repository has been fixed in version 6.1.2; if it has, I'm glad I never tried earlier versions. This is in fact listed in the known issues section of the 6.1.2 release notes, when apparently it should be in the fixed issues list.

    What I do know is that when we were on 5.4.3 we were able to export/update 50 machines by 6AM every day, whereas now we generally update about 10 over the entire 24 hours and are just getting further and further behind.

    Like you, we will likely be going elsewhere, as we can't survive like this for long.
  • Hi Phuffers,

    SR #4077427.

    James
  • Hi Freddbloggs,

    We are currently running Core Version 6.1.2.115; Quest have stated in our support request that this is a known issue in this version. However, as you say, this is now in the fixed issues list for 6.1.2: support.quest.com/.../3

    I wonder if this issue is only resolved for new repositories running 6.1.2, and not for already fragmented repositories that were upgraded?
  • Thank you James. I'll take a look at the case, appreciate it.
  • Interesting; Quest Dev have told us that it has been fixed in that build. I'd hate to see what it's like when it is broken.

    It has been moved into the fixed issues list; it was in the known issues list last week.
  • Hey Fredbloggs,

    Quest support have come back with a potential fix for this issue which initially looks to be working; I will report back tomorrow with our findings and the fix. A 1.77TB export which would previously run at 3-8MB/s is now running at 81-130MB/s. However, I have seen similar behaviour in my testing of agentless backup exports, which start high and then drop, so I'm not holding my breath right now.

    James
    Let me give some more details related to D-34758. This defect was created for a specific customer environment in which the Quest Dev team was able to identify that the data in the repository was highly fragmented and that, during the export job, many small requests were being sent to the repository. These small requests generated a bottleneck on the Core, slowing down the export speed. To optimize this, the Quest Dev team improved the logic and grouped those small requests together so that they did not create a bottleneck in the same way. Hence the issue was resolved in that one specific customer environment, and the defect was closed after further testing to ensure that the fix did not have any negative consequences on performance in other environments.
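
    For anyone curious what "grouping those small requests together" looks like in practice, here is a generic sketch of request coalescing (an illustration of the general technique only, with made-up sizes and thresholds; it is not the actual repository code):

```python
# Generic request coalescing: merge small reads that are adjacent (or nearly
# adjacent) into fewer, larger reads so the repository sees fewer round trips.
# Purely illustrative; sizes and thresholds are arbitrary assumptions.

def coalesce(requests, max_gap=0, max_size=4 * 1024 * 1024):
    """requests: iterable of (offset, length) tuples, in bytes."""
    merged = []
    for off, length in sorted(requests):
        if merged:
            prev_off, prev_len = merged[-1]
            prev_end = prev_off + prev_len
            # Merge if this request starts within max_gap bytes of the previous
            # extent and the combined extent stays under max_size bytes.
            if off <= prev_end + max_gap and (off + length) - prev_off <= max_size:
                merged[-1] = (prev_off, max(prev_end, off + length) - prev_off)
                continue
        merged.append((off, length))
    return merged

# Hypothetical workload: 1,024 back-to-back 64 KB reads.
small = [(i * 65536, 65536) for i in range(1024)]
print(len(small), "->", len(coalesce(small)))   # 1024 -> 16
```

    Fewer, larger requests mean less per-request overhead on the Core, which is the kind of bottleneck that D-34758 addressed.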

    The issue here is that the defect title is very generic and makes it seem like it covers every situation with a fragmented repository. It does not.

    To isolate your specific problem, what we need to do is:
    1) Confirm that the repository is actually highly fragmented. This is not easy to do and requires the Quest Dev team to review. I'll follow up with the team on my side and confirm we have done this properly.
    2) Confirm the bottleneck that is causing the export slowness. If it's the same as D-34758, then the fix we implemented was not as complete as it needed to be. If it is not the same as D-34758, then the Quest Dev team will need to work through other options for optimizing the job. As with any complex issue, this requires patience and time as we find the root cause of the issue and then figure out how to work around it.