Question - What determines how quickly an archive is taken?

Does anyone know what determines how quickly an archive is taken? Through the week, we run 5 archives that are written to a SAN and then copied to tape using Netvault. Recently, the archives have slowed right down, to the point where some of them, generally around 8TB in size, can take 6 days to complete, which means that the 5 archives a week can't be done. The archive speed is sporadic: anywhere from 5MB/s up to 30MB/s, and we've even seen some at 100MB/s, but it will generally chug along at around 8-10MB/s.

It's a DL4300 box with 2x 10-core CPUs and 128GB of RAM. The SAN is on a 10Gb link that the DL4300 never comes close to saturating. We've tried reboots to clear down memory usage, and we've tried clearing out the antivirus to see whether it was upsetting anything, but the archive still speeds up, slows down and chugs along. The slowdown seems to have happened around the update from 6.2.1 to 6.3. We also have repeated Schannel errors in the System log in Event Viewer, "A fatal alert was received from the remote endpoint. The TLS protocol defined fatal alert code is 46.", which also correspond with the 6.3 update. I haven't found anyone else with this issue on a Rapid Recovery machine, and all the Google links I've looked at for this error point to Exchange servers and certificates. We don't run Exchange, and the DL4300 server is off domain and isn't changed at all apart from Windows and driver updates, which are all up to date. The only other thing it does is run the Hyper-V server for the exports, but only a basic Netvault server runs on that.
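
To put the reported speeds in perspective, here is a minimal sketch of the arithmetic, assuming decimal units (1 TB = 1,000,000 MB) and a perfectly steady rate, which the real archives clearly do not have. The function names are made up for illustration. At a steady 10MB/s an 8TB archive needs a little over 9 days, and finishing in 6 days implies an average of roughly 15MB/s.

```python
def archive_days(size_tb: float, rate_mb_s: float) -> float:
    """Days needed to move size_tb terabytes at a steady rate_mb_s megabytes per second.

    Assumes decimal units (1 TB = 1,000,000 MB); binary units shift the
    result by a few percent.
    """
    seconds = size_tb * 1_000_000 / rate_mb_s
    return seconds / 86_400  # seconds per day


def required_rate_mb_s(size_tb: float, days: float) -> float:
    """Steady rate in MB/s needed to move size_tb terabytes in the given number of days."""
    return size_tb * 1_000_000 / (days * 86_400)


if __name__ == "__main__":
    # Speeds taken from the post: 5, 8-10, 30 and 100 MB/s.
    for rate in (5, 8, 10, 30, 100):
        print(f"8 TB at {rate:>3} MB/s: {archive_days(8, rate):5.1f} days")
    print(f"8 TB in 6 days needs {required_rate_mb_s(8, 6):.1f} MB/s on average")
```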

0 Tim Seymour over 6 years ago

What does the disk queue on the repository disks look like during the archive? It's rare to see a DL max out its CPU or RAM during an archive. What maxes out is the disks. Rapid Recovery writes everything in an 8 KB fixed block size (that's what the dedupe engine uses), so when you are doing an archive it's doing reads with an 8 KB IO block size. That generally means slower performance, since that's a small IO size. Speed fluctuations then become a function of whether the reads are sequential or random. The larger the chunk of data that can be read sequentially for the archive, the faster it will be. Remember also that since the data is deduped, each backup is not stored sequentially. More than likely some portion of it is stored randomly, since there were blocks that were deduped with other blocks. So even if you were to archive just 1 agent, it's still going to have to do a lot of random reads to get the data for the archive.

So, I recommend checking the disk queue to see what it's doing. Any disk queue greater than 1 on the repository disk means that it's maxing out the array (since Windows sees the RAID virtual disk created by the hardware as 1 disk).

If there are other tasks running at the same time, especially background tasks like deleting index RPFS files, then you get increased IO competition on the disk and that can slow things down too.
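
As a rough illustration of why an 8 KB IO size limits throughput, the sketch below converts between MB/s and read operations per second. It assumes every read is a separate 8 KB operation and that 1 MB = 1,000,000 bytes; a real storage stack may merge adjacent sequential reads, so treat the numbers as an upper bound on the operation count. The function names are hypothetical and not part of any Rapid Recovery tooling.

```python
BLOCK_BYTES = 8 * 1024  # the 8 KB fixed block size described above


def iops_for_throughput(mb_per_s: float, block_bytes: int = BLOCK_BYTES) -> float:
    """Reads per second needed to sustain mb_per_s (1 MB = 1,000,000 bytes)."""
    return mb_per_s * 1_000_000 / block_bytes


def throughput_for_iops(iops: float, block_bytes: int = BLOCK_BYTES) -> float:
    """Throughput in MB/s delivered by a given number of reads per second."""
    return iops * block_bytes / 1_000_000


if __name__ == "__main__":
    for rate in (10, 30, 100):
        print(f"{rate:>3} MB/s needs about {iops_for_throughput(rate):,.0f} reads/s at 8 KB")
```

Under these assumptions, 10MB/s already takes about 1,200 reads per second, which is why random 8 KB reads hit the array's limit long before the network link does.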

0 simon peart over 6 years ago in reply to Tim Seymour

The Avg. Disk Queue Length for the repository drives averages around 1.7, with a max of 8.633. This is with an export and an archive running, and nothing else at all. We have 6 virtual standbys, but the only one that is exported hourly is the file server. Exports tend to run at around 10MB/s and the archive is running at around 22MB/s, which is pretty quick.
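
For anyone wanting to reproduce average and maximum figures like these from a logged counter, here is a minimal sketch. It assumes the samples were exported to CSV with a timestamp in the first column and one counter per remaining column, which is the layout typeperf-style logs normally use; the sample data and column name below are made up.

```python
import csv
import io
from statistics import mean


def summarize_queue(csv_text: str) -> dict[str, tuple[float, float]]:
    """Return {counter_name: (average, maximum)} for each counter column.

    Assumes row 0 is a header and column 0 is a timestamp.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return {}
    header, data = rows[0], rows[1:]
    summary = {}
    for col in range(1, len(header)):
        values = []
        for row in data:
            try:
                values.append(float(row[col]))
            except (IndexError, ValueError):
                continue  # skip blank or malformed samples
        if values:
            summary[header[col]] = (mean(values), max(values))
    return summary


# Hypothetical samples, not taken from the server in this thread.
SAMPLE = '''"timestamp","repo_disk_queue"
"01/01/2020 10:00:00.000","0.4"
"01/01/2020 10:00:01.000","1.7"
"01/01/2020 10:00:02.000","3.0"
'''

if __name__ == "__main__":
    for name, (avg, peak) in summarize_queue(SAMPLE).items():
        print(f"{name}: avg {avg:.2f}, max {peak:.2f}")
```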

+1 Tim Seymour over 6 years ago in reply to simon peart

So the disks are maxed out. Speed fluctuations then depend on the location of the data being read, whether the read is random or sequential, and the size of the data that can be read from that location. When the bottleneck is the disks, there isn't anything we can do to speed them up.
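
A simple way to picture the fluctuation is to treat an archive as a mix of bytes read sequentially and bytes read randomly, each at its own speed. The sketch below is only a toy model: the function name is made up, and the 100MB/s and 5MB/s figures are illustrative values borrowed from the range of speeds reported earlier in the thread, not measured sequential and random rates for this array.

```python
def blended_rate(random_fraction: float, seq_mb_s: float, rand_mb_s: float) -> float:
    """Average MB/s when random_fraction of the bytes are read at rand_mb_s
    and the remainder at seq_mb_s.

    Time per MB is the weighted sum of the two per-MB times, so the slow
    random portion dominates the result.
    """
    if not 0.0 <= random_fraction <= 1.0:
        raise ValueError("random_fraction must be between 0 and 1")
    time_per_mb = random_fraction / rand_mb_s + (1.0 - random_fraction) / seq_mb_s
    return 1.0 / time_per_mb


if __name__ == "__main__":
    for frac in (0.0, 0.1, 0.25, 0.5, 1.0):
        print(f"{frac:4.0%} random: {blended_rate(frac, 100, 5):6.1f} MB/s")
```

With these illustrative numbers, reading half of the bytes randomly drops the average to roughly 9.5MB/s, so small changes in how scattered the data is can swing the overall speed a lot.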

0 simon peart over 6 years ago in reply to Tim Seymour

Thank you

0 simon peart over 6 years ago in reply to Tim Seymour

Over the past couple of weeks, we've made a concerted effort to clear out as much as possible and managed to free 4TB of disk space in the repository. This seems to have sped up the archives, but they still seem awfully slow compared to what they were pre-6.3. Going by your description of how it works, where a huge amount of fragmented data in the repository has to be hunted down for the archive, this essentially sounds like a fragmented hard drive. Is it possible to defragment the repository so that each server's data is stored together in chunks rather than fragmented across the repository?

0 Emte over 6 years ago in reply to simon peart

It does not look like any repo job would perform a defrag. The repository structure in RR is so bad, I wish they would rewrite the entire thing. And whatever you do, don't kick off a repository optimization job unless you either have a very small repo or want the Core to be unavailable for a week or so.

0 Tim Seymour over 6 years ago in reply to simon peart

There is no defrag tool for the repository. I wish there was, but there isn't. The only way to defrag a repository currently is to archive all the recovery points, delete the repository, create a new one, and import the archive.

0 simon peart over 5 years ago in reply to Emte

The annoying thing is that every one of the issues we're having seems to have coincidentally started when we upgraded to Core and Agents 6.3, and it seems impossible to go back to 6.2.1, where everything just ran fine. Memory usage is high: Core tends to use a minimum of 70% of the 128GB of RAM and sometimes up to 96%, meaning a full reboot has to be done before we can bring a virtual standby online, as there is no memory left for it. And the archives are excruciatingly slow, taking days and days to complete instead of a single day or 2. Wish I'd never upgraded now.

0 Jose.A.Gonzalez over 5 years ago in reply to simon peart

From what I can tell on this thread, it is very possible that your server has been close to a performance threshold, and that the demands on it have increased (whether perceived or not) to a point where it "shows" as general task contention, especially evident with "heavy" tasks. In most cases, adjustments and updates to the server's own infrastructure and OS are beneficial. I strongly recommend that you open a support case so we can identify areas that need attention, mitigate performance-affecting factors, and help you establish steady pacing for all the tasks that you want your server to perform. In other words, a tune-up is in order.