Failed Integrity Check: Several Questions

Question

I started in this position at my new job a few months ago, and as time has gone on we are getting more and more machines with the error "Integrity checks failed" in the qore portal. After looking through the documentation and opening support tickets about the issue, I still have almost no direction on what to do or even what causes it. I let my guys know every day which machines are having errors, and the list is now down to a handful of machines with failed integrity checks that never seem to fall off it. So, on to the questions... What is actually happening when an integrity check fails? What are some steps to troubleshoot these machines or get their checks passing again? Is this something that I may be causing, or that the client's engineers could be accidentally causing? Short of wiping out their entire backups, what can I do to fix them?

phuff · Answer

I believe it does a quick mount of the last recovery point (RP) to validate that it mounts, and then there's a piece of logic that is supposed to check whether the data is accessible. Since this runs as a nightly job, it is typically touchy and needs a higher amount of RAM/resources while it is running. So although real failures aren't common, false failure alerts are not unusual: the job might fail while the data is fine. If I found that server 'x' was failing the integrity check, I would mount server 'x' and validate that the files are there and accessible, do a test FLR, and perhaps export it to a VM to confirm it is fine. If you have the ability to throw more CPU/RAM at the core, that might help as well. Also make sure that only 1 rollup, or 1 of the checks, is running at a time. Keep in mind that the repo doesn't have unlimited I/O capacity either (nor does the OS disk, for that matter). For example, say you have 2 (or 3) backup jobs going, 1 replication, and an export or 2, and then the nightly jobs kick in and start checking and rolling up too: that's a lot of 'things' at once.
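To make the manual "are the files there and accessible" step a bit less tedious, something along these lines could be run against the mounted recovery point. It is only a rough sketch, not anything built into the product: `check_readable` and the example mount path are made-up names, and reading the first few KB of each file only shows that the file opens, not that every block is intact.

```python
import os


def check_readable(root, sample_bytes=4096):
    """Walk `root` and try to read the first bytes of every file.

    Returns (files_checked, failures), where failures is a list of
    (path, error message) tuples for anything that could not be read.
    """
    checked = 0
    failures = []

    def on_walk_error(err):
        # os.walk skips unreadable directories silently unless given a handler.
        failures.append((err.filename, str(err)))

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            checked += 1
            try:
                with open(path, "rb") as fh:
                    fh.read(sample_bytes)
            except OSError as err:
                failures.append((path, str(err)))
    return checked, failures


# Example use (hypothetical mount point, replace with the real one):
# checked, failures = check_readable(r"C:\MountedRP\server-x")
# print(f"checked {checked} files, {len(failures)} unreadable")
```

An empty `failures` list is a decent first sign that the alert was a false one; a test restore or VM export is still the stronger proof.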
Assuming the manual data check works, you can again reduce the number of concurrent jobs or add resources, or perhaps even skip the check for that one server if it becomes a chronic problem. You could also take a new base image (if you have the space/options) as a way to clear it up. Keep in mind, too, that the check is only meant to work on MS OSs.
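If you want a quick feel for how much is piling up at once, here is a small sketch that takes job start/end times (however you collect them, e.g. copied by hand from the job history) and reports the peak number of overlapping jobs. The function name and the sample schedule are invented for illustration; the times are just minutes after midnight.

```python
def peak_concurrency(jobs):
    """jobs: list of (name, start, end) tuples with start < end.

    Returns (peak_count, time_at_peak): the largest number of jobs
    running at once and the first moment that peak is reached.
    """
    events = []
    for _name, start, end in jobs:
        events.append((start, 1))   # job begins
        events.append((end, -1))    # job finishes
    # At equal times, process finishes (-1) before starts (+1) so that
    # back-to-back jobs are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))

    running = 0
    peak = 0
    peak_time = None
    for t, delta in events:
        running += delta
        if running > peak:
            peak, peak_time = running, t
    return peak, peak_time


# Invented schedule, in minutes after midnight.
jobs = [
    ("backup-1", 0, 60),
    ("backup-2", 30, 90),
    ("replication", 45, 120),
    ("export", 50, 70),
    ("nightly-check", 60, 100),
    ("rollup", 60, 110),
]
print(peak_concurrency(jobs))  # (5, 60)
```

If the peak lines up with when the nightly jobs start, that points toward contention rather than bad data.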
Always check, though. That is what I use the checks for: 'hey, there might be a problem', so go look and validate. However, if you do have some that fail over and over, then yes, it might be job contention, it might be RP related, it might be resource related, or you very well might have a real problem. Those are my initial thoughts on the topic, without knowing any more details. Cheers to you Kyle.
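Since you are already sending out a daily list of failing machines, a tiny script like this can separate the one-off alerts from the ones that fail "over and over". Again just a sketch: `failure_streaks` and the machine names are made up, and the input is assumed to be one set of failing machine names per day, oldest day first.

```python
def failure_streaks(daily_failures):
    """daily_failures: list of sets of machine names, oldest day first.

    Returns {machine: number of consecutive days it has failed,
    counting back from the most recent day}.
    """
    streaks = {}
    if not daily_failures:
        return streaks
    for machine in daily_failures[-1]:
        count = 0
        # Walk backwards until the first day the machine was not failing.
        for day in reversed(daily_failures):
            if machine not in day:
                break
            count += 1
        streaks[machine] = count
    return streaks


# Invented three-day history.
history = [
    {"srv-a", "srv-b"},
    {"srv-a"},
    {"srv-a", "srv-c"},
]
print(failure_streaks(history))
```

Here `srv-a` has a streak of 3 and `srv-c` a streak of 1, so `srv-a` is the one worth mounting and validating by hand first.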