
Rapid Recovery 6 is slow - Update: it's not as slow anymore, and hopefully still improving

Just posting this as an update.

 

I've been quite vocal on here recently about how poor, and virtually unusable, RR6 has been. We had tried multiple things: write cache settings, firmware & driver updates, and moving metadata locations around between storage systems (we have found a definite improvement from having metadata and data on different disks).

Last weekend, having decided that nothing was really updating anyway, we decided to pause replications for an extended window and to exclude them for roughly 8-10 hours a day. We also left the virtual standbys disabled. This has finally allowed the system to tidy up the repository: the Delete Index RPFS jobs have gone from closer to single-digit MB/s during replication to several hundred MB/s.

It is still deleting Index RPFS recovery points; however, whilst replicating, it is now managing this at a reasonable speed. (I still have the issue with updating agent metadata, where I apparently have to upgrade the source cores, though it does appear to have improved for one of the cores we replicate from.)

I have even managed to get exports running at a much more reasonable 30 MB/s (it was kB/s) whilst replication is running, without it impacting things as much as before; almost back to pre-upgrade levels.

 

Basically, it appears to my untrained eye that it was so bad because the repository was heavily fragmented, and this reared its head after the repository update. It has now gained some spots to place new data and is running better. It is still going through this process; hopefully I can start some rollups again soon.

 

If you have a similar issue after upgrading, check whether any repository Delete Index RPFS jobs are running. If they are, stop everything else and give the core time to finish them, or at least to get to the stage where it has deleted a lot of them.

Going forwards, try to have some windows where no replication is happening. I want to reduce my exclusions, but I'm waiting until everything has caught back up; I suspect I'm a few weeks away from that at this stage.
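For anyone who would rather script that quiet window than click through the console, something along these lines should do it with the Rapid Recovery PowerShell module on the target core. I set mine up through the UI, so treat the cmdlet and parameter names below (Suspend-Replication, Resume-Replication, Suspend-VirtualStandby and the -all switch) as a rough sketch to check against your version's command reference rather than something I have run verbatim:

```powershell
# Rough sketch of an overnight "no replication" window on the target core.
# Run it from the Rapid Recovery PowerShell module session on the core;
# cmdlet and parameter names may differ between versions, so verify them
# with Get-Command *-Replication before relying on this.

# Pause replication so the Delete Index RPFS jobs get the disks to themselves
Suspend-Replication -all

# Optionally pause virtual standby exports as well (we left ours disabled)
Suspend-VirtualStandby -all

# Leave a long quiet window (8 hours here) for the deferred deletes to catch up
Start-Sleep -Seconds (8 * 60 * 60)

# Bring everything back once the window is over
Resume-VirtualStandby -all
Resume-Replication -all
```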

 

Anyway, just thought I'd update this. Even though disk latencies were generally low (<5 ms) and disk activity was <5 MB/s, it must have been creating so many small writes all over the place that they didn't register on those tools but were still impacting performance to a large degree.
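If anyone wants to check whether the same thing is hiding on their own core, the standard Windows PhysicalDisk counters show what the plain MB/s and latency graphs miss: Disk Transfers/sec (IOPS) and the average bytes per transfer will give away a workload made of lots of tiny scattered writes. A quick sample from PowerShell (swap '_Total' for the disk instance that holds your repository volumes):

```powershell
# Sample the disks for a minute to see IOPS and I/O size, not just throughput.
# A high Transfers/sec with a small Avg. Disk Bytes/Transfer is the
# "small writes all over the place" pattern described above.
$counters = @(
    '\PhysicalDisk(_Total)\Disk Transfers/sec',        # IOPS
    '\PhysicalDisk(_Total)\Avg. Disk Bytes/Transfer',  # average I/O size
    '\PhysicalDisk(_Total)\Avg. Disk sec/Transfer',    # latency per I/O
    '\PhysicalDisk(_Total)\Current Disk Queue Length'  # outstanding I/Os
)

Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 12 |
    ForEach-Object {
        $_.CounterSamples | Select-Object `
            @{ n = 'Counter'; e = { ($_.Path -split '\\')[-1] } },
            @{ n = 'Value';   e = { [math]::Round($_.CookedValue, 2) } }
    } | Format-Table -AutoSize
```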

I still feel there are potential improvements to be made to the underlying engine, which this has highlighted, and that certain 'don't try this at home' workarounds could make the required disk reads/writes more sequential, but I'm feeling that I may get through this.

 

But just to add, I have had to find and do all of this myself; support have been pointless. Today was the first time they even looked at the system (when it was already better), and they stated it appears to be OK (which, to be honest, is now much more respectable).

Reply
  • Purely to support your findings: I purposely purged roughly 1 TB worth of RPs from my repo, and I was able to reproduce your findings. While the core was chugging through the DDs, the performance of the Core as a whole was poor (which, I agree, is a known issue), but the replication and export jobs were without a doubt the most significantly impacted. Perhaps the backups masked this a bit, as they came in 3 or 4 at a time and finished within the hour, but the exports and replications, which run at a clip of 1 at a time, were most definitely impacted. After pausing the jobs and letting the DDs catch up (in my case about 3 hours), everything was back to humming right along and finishing within the hour. Again, just providing a similar example to what you described.