Question - What determines how quickly an archive is taken?

Does anyone know what determines how quickly an archive is taken? Through the week we run 5 archives that are backed up to a SAN and then copied over to tape using NetVault. Recently the archives have slowed right down, to the point where some of them, generally around 8 TB in size, can take 6 days to complete, which means the 5 archives a week can't be done. The archive speed seems sporadic, between 5 MB/s and 30 MB/s, and we've even seen some at 100 MB/s, but it will generally chug along at around 8-10 MB/s.

It's a DL4300 box with 2x 10-core CPUs and 128 GB of RAM; the SAN is on a 10 Gb link that the DL4300 never comes close to saturating. We've tried reboots to clear down memory usage, and we've tried removing the antivirus in case it was upsetting anything, but the archives still speed up, slow down, and chug along.

The slowdown seems to have started around the update from 6.2.1 to 6.3. We also have repeated Schannel errors in the System event log - "A fatal alert was received from the remote endpoint. The TLS protocol defined fatal alert code is 46." - that correspond with the 6.3 update. I haven't found anyone else with this issue on a Rapid Recovery machine, and all the Google results I've looked at for this error point to Exchange servers and certificates. We don't run Exchange, and the DL4300 server is off-domain and isn't changed at all apart from Windows and driver updates, which are all up to date. The only other thing it does is run the Hyper-V server for the exports, but only a basic NetVault server runs on that.
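For scale, the reported sizes and speeds can be sanity-checked with a little arithmetic (a sketch assuming decimal TB and MB; the exact units Rapid Recovery reports are an assumption):

```python
# How long does an 8 TB archive take at various sustained read speeds?
# (Illustrative arithmetic only; assumes decimal TB/MB.)

def archive_days(size_tb: float, mb_per_s: float) -> float:
    """Days needed to read size_tb terabytes at a sustained mb_per_s."""
    seconds = size_tb * 1e12 / (mb_per_s * 1e6)
    return seconds / 86400

for speed in (8, 16, 30):
    print(f"{speed:>2} MB/s -> {archive_days(8, speed):.1f} days for 8 TB")
# At the reported 8-10 MB/s an 8 TB archive needs well over a week;
# the observed 6-day completions imply roughly 15-16 MB/s sustained.
```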

  • What does the disk queue on the repository disks look like during the archive? It's rare to see a DL max out its CPU or RAM during an archive; what maxes out is the disks. Rapid Recovery writes everything in a fixed 8 KB block size (that's what the dedupe engine uses), so an archive does its reads with an 8 KB IO size. That generally means slower performance, since that's a small IO size. Speed fluctuations then become a function of whether the reads are sequential or random: the larger the chunk of data that can be read sequentially for the archive, the faster it will be. Remember also that because the data is deduped, each backup is not stored sequentially - more than likely some portion of it is stored randomly, since some of its blocks were deduped against other blocks. So even if you were to archive just one agent, it would still have to do a lot of random reads to gather the data for the archive.

    So, I recommend checking the disk queue to see what it's doing. Any sustained disk queue greater than 1 on the repository disk means the array is maxed out (since Windows sees the RAID virtual disk created by the hardware as a single disk).

    If there are other tasks running at the same time, especially background tasks like deleting index RPFS files, then you get increased IO competition on the disk and that can slow things down too.
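One way to capture that queue data on Windows is the built-in `typeperf` counter tool; below is a minimal Python sketch for parsing its CSV output. The disk instance name `0 E:` is an assumption - substitute the instance for your repository volume:

```python
import csv
import io

# On Windows, sample the repository disk queue with the built-in tool, e.g.:
#   typeperf "\PhysicalDisk(0 E:)\Avg. Disk Queue Length" -si 1 -sc 60 -o queue.csv
# The instance name "0 E:" is an assumption; list instances with: typeperf -q PhysicalDisk

def parse_typeperf_csv(text: str) -> list[float]:
    """Extract the sampled counter values from typeperf CSV output."""
    rows = list(csv.reader(io.StringIO(text)))
    # Row 0 is the header (timestamp column + counter path); the rest are samples.
    return [float(row[1]) for row in rows[1:] if len(row) > 1 and row[1]]

def array_saturated(samples: list[float], threshold: float = 1.0) -> bool:
    """A sustained average queue depth above ~1 on the RAID virtual disk
    suggests the array itself is the bottleneck."""
    return sum(samples) / len(samples) > threshold

# Example data in typeperf's CSV format:
sample = (
    '"(PDH-CSV 4.0)","\\\\HOST\\PhysicalDisk(0 E:)\\Avg. Disk Queue Length"\n'
    '"04/01/2024 10:00:01","1.732"\n'
    '"04/01/2024 10:00:02","2.105"\n'
)
print(parse_typeperf_csv(sample), array_saturated(parse_typeperf_csv(sample)))
```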

  • The Avg. Disk Queue Length for the repository drives averages around 1.7, with a max of 8.633. This is with an export running and an archive running, nothing else at all. We have 6 virtual standbys, but the only one exported hourly is the file server. Exports tend to run at around 10 MB/s, and the archive is running at around 22 MB/s, which is pretty quick.

  • So the disks are maxed out. Speed fluctuations then come down to where the data being read is located, whether the reads are random or sequential, and how much data can be read from each location. When the bottleneck is the disks, there isn't anything we can do to speed them up.
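The link between IO size and throughput can be made concrete: with a fixed 8 KB read size, throughput is just IOPS times 8 KB, so the random-versus-sequential IOPS swing maps directly onto the observed speed swing. The IOPS figures below are illustrative assumptions for a spinning-disk array, not measurements:

```python
IO_SIZE = 8 * 1024  # bytes -- Rapid Recovery's fixed dedupe block size

def throughput_mib_s(iops: int) -> float:
    """Sustained MiB/s achievable at a given IOPS with 8 KiB reads."""
    return iops * IO_SIZE / (1024 * 1024)

# Illustrative figures (assumptions): mostly-random reads on spinning disks
# might yield ~1,000 IOPS; partly sequential ~3,000; long sequential runs
# far more.
for iops in (1000, 3000, 13000):
    print(f"{iops:>6} IOPS -> {throughput_mib_s(iops):.1f} MiB/s")
# ~1,000 IOPS is about 8 MiB/s, ~3,000 about 23 MiB/s, ~13,000 about
# 102 MiB/s -- the same 5-100 MB/s spread the archives show.
```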
