
Rapid Recovery 6 is slow - Update: it's not as slow anymore, hopefully still improving

Just posting this as an update.

 

I've been quite vocal on here recently about how poor, and virtually unusable, RR6 has been.  We had tried multiple things: write cache settings, firmware and driver updates, and moving metadata around between storage systems (we found a definite improvement from keeping metadata and data on different disks).

Last weekend, having decided that nothing was really updating anyway, we paused replications for an extended window and are now excluding them for ~8-10 hours a day.  We also left the virtual standbys disabled.  This has finally allowed the system to tidy up the repository: the Delete Index RPFS jobs are now running at hundreds of MB/s, whereas during replication jobs they were closer to single-digit MB/s.

It is still deleting Index RPFS recovery points, but it now manages this at a reasonable speed even whilst replicating (I still have the issue with updating agent metadata, where I apparently have to upgrade the source cores, although things do appear to have improved for one of the cores we replicate from).

I have even managed to get exports running at a much more reasonable 30 MB/s (they had been down in kB/s) whilst replication is running, without impacting things as much as before.  That's almost back to pre-upgrade levels.

 

Basically, it appears to my untrained eye that it was so bad because the repository was heavily fragmented, and this reared its head after the repository update.  It has now gained some free space to place new data and is running better.  It is still going through this process; hopefully I can start some rollups again soon.

 

If you have a similar issue after upgrading, check whether any repository Delete Index RPFS jobs are running.  If they are, stop everything else and give the core time to finish them, or at least to get to the stage where it has deleted a lot of them.

Going forwards, try to have some windows where no replication is happening.  I want to reduce my exclusions, but I'm waiting until everything has caught back up; I suspect I'm a few weeks away from that at this stage.

 

Anyway, just thought I'd update this.  Even though disk latencies were generally low (<5 ms) and disk activity was <5 MB/s, it must have been issuing so many small writes all over the place that they didn't register on those tools, yet they were impacting performance to a large degree.
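
As a rough illustration of what I mean, and definitely not anything built into Rapid Recovery, a small script along these lines (assuming Python 3 with the psutil package on the core server; the 60-second interval is just an example) can report average write size alongside throughput.  A low MB/s figure paired with a huge write count and a tiny average write size is exactly the pattern the normal latency/throughput graphs hide:

import time
import psutil

INTERVAL_SECONDS = 60  # illustrative sample window

# Snapshot per-disk I/O counters, wait, then snapshot again.
before = psutil.disk_io_counters(perdisk=True)
time.sleep(INTERVAL_SECONDS)
after = psutil.disk_io_counters(perdisk=True)

for disk, end in after.items():
    start = before.get(disk)
    if start is None:
        continue
    writes = end.write_count - start.write_count
    bytes_written = end.write_bytes - start.write_bytes
    if writes == 0:
        continue
    avg_write_kb = bytes_written / writes / 1024
    throughput_mb_s = bytes_written / INTERVAL_SECONDS / (1024 * 1024)
    print(f"{disk}: {writes} writes, avg {avg_write_kb:.1f} KB/write, {throughput_mb_s:.2f} MB/s")

On a repository that is busy with deferred deletes I would expect to see a very large write count with an average size of only a few KB, even while the MB/s column stays low.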

I still feel there are potential improvements to the underlying engine that this has exposed, and that certain 'don't try this at home' workarounds could make the required disk reads/writes more sequential, but I'm feeling that I may get through this.

 

But just to add, I have had to find and do all of this myself.  Support have been pointless; today was the first time they even looked at the system (after it had improved) and stated that it appears to be OK (which, to be honest, it now is, it's much more respectable).

  • Purely to support your findings, I purposely purged roughly 1 TB worth of RPs from my repo, and I was able to reproduce what you describe. While the core was chugging through the DDs, the performance of the Core as a whole was poor (which I agree is a known issue), but the replication and export jobs were without a doubt the most significantly impacted. Perhaps the backups masked this a bit, as they came in 3-4 at a time and finished within the hour, but the exports and replications, which run one at a time, were most definitely impacted. After pausing the jobs and letting the DDs catch up (in my case about 3 hours), everything was back to humming right along and finishing within the hour. Again, just providing a similar example to what you described.
  • So, as a query, is there a way to schedule DDs so that they only happen during a specified window?
    I think I'd find it quicker to simply have them run for a few hours a day instead of conflicting with other jobs such as replication/exports. Four hours at 200 MB/s would clear far more than 24 hours at 2 MB/s (roughly 2.8 TB versus under 200 GB; see the rough sizing sketch at the end of this thread), and both sets of jobs would run quicker.
  • Actually, I have just seen that you can configure deferred delete as a nightly job with a maximum time to run.
    I may leave things as they are for this weekend, but that should sort this out going forwards and help keep the repository optimal.
  • That is correct, you beat me to it. You can schedule the DDs as a nightly job to try to alleviate some of this job contention; that was added a build or two ago for exactly the situation you have experienced. If you can schedule your environment in such a way that the DDs have a clear path to hammer through everything on a nightly basis, then in theory you shouldn't see the performance hit when the jobs conflict with one another. The trick is to find the happy medium where the DDs finish on a daily basis and you don't fall behind, while keeping your backups/replications/exports up to date as well.
  • Not sure how it applies to your situation or how far-reaching this fix is, but the 6.1.2 release notes say:

    Export rate was slow for recovery points from repositories that have high fragmentation. (Issue ID: 34758; functional area: Virtual export, Repository)

    Since we have talked about this issue a fair bit, it would be great if someone from Quest could give us a bit more detail on what was fixed, what should now work as expected, and what is still an issue when it comes to the fragmentation/performance problem.
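
Following on from the scheduling question above, here is a back-of-the-envelope way to size a nightly deferred-delete window.  The throughput and backlog figures are illustrative assumptions rather than measurements from either system; plug in the DD rate you see when everything else is paused and the amount you need to clear.

# Rough sizing sketch for a nightly deferred-delete window.
QUIET_THROUGHPUT_MB_S = 200.0    # assumed DD rate with replication/exports paused
CONTENDED_THROUGHPUT_MB_S = 2.0  # assumed DD rate while other jobs are running
BACKLOG_GB = 1000.0              # assumed backlog of index/recovery-point data to delete

def hours_to_clear(backlog_gb: float, throughput_mb_s: float) -> float:
    """Hours needed to work through the backlog at a sustained rate."""
    return backlog_gb * 1024 / throughput_mb_s / 3600

print(f"Quiet window needed : {hours_to_clear(BACKLOG_GB, QUIET_THROUGHPUT_MB_S):.1f} h")
print(f"Contended           : {hours_to_clear(BACKLOG_GB, CONTENDED_THROUGHPUT_MB_S):.1f} h")

With these assumed figures, roughly 1.4 hours of uncontended deletes clears what would take around 142 hours at the contended rate, which is the whole point of giving the DDs their own nightly window.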