KB301910 PS Warning & Results

Hi All,

I am running the PS script to check my recovery points and saw the following warning (in yellow):

WARN  2019-08-14T08:33:27 Performance.Replay.Core.Client.LocalMountMana Execute Action DELETE https://MY SERVER HERE:8006/apprecovery/api/core/mounts
e9-80f2-40a8f01fc3b3%257d~|~ took 00:01:02.8944452, which is too long

Also, they are not very clear about reading results and checking logs for the script. I think I found and fixed this issue on my own, with help from here, just after I upgraded, but I'm checking anyway. They could be a little clearer about how to read the results.

Thanks.

Parents
  • The script contains usage information at the top. From that usage information I pulled this:

    Checks recovery points of the specified agents <agentDisplayNames> for errors that could be caused by the issue described in KB301910. The script tries to identify 3 recovery points of interest:
    - newest recovery point backed up from an RR Agent version 6.1.3 or lower
    - oldest recovery point backed up from an RR Agent version 6.2.0 or later
    - newest recovery point of an agent on the Core

    After any of those recovery points are identified, the script mounts them, runs Windows CHKDSK on them, and stores the results in the specified location.
    Check results include a file called DisksHealthReport.txt, which shows at a glance whether CHKDSK returned a failure error code for any of the recovery points checked.
    Note that the script writes its logs into a directory named "log" next to the script.

    Did you get a "DisksHealthReport.txt" file to review after the check completed?
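
If it helps to triage a large DisksHealthReport.txt, here is a minimal sketch that groups volumes by status. The "<volume> - <STATUS>" line format is an assumption based on the report pasted later in this thread, not official documentation:

```python
import re

def summarize_health_report(text):
    """Group volume entries by status from a DisksHealthReport.txt-style report.

    Assumes status lines shaped like "    C:\\ - ERROR", which is inferred
    from the report output shown in this thread.
    """
    statuses = {}
    for line in text.splitlines():
        # Match "<volume> - <STATUS>"; header lines have no " - " separator.
        m = re.match(r"\s*(\S.*?)\s+-\s+(\S.*?)\s*$", line)
        if m:
            volume, status = m.groups()
            statuses.setdefault(status, []).append(volume)
    return statuses

sample = """'SERVER2' (b6963881-3716-402b-ab34-f7e8efa4d8e1); RP version: 6.0.2.144; created: 05/19/2019 04:00:00
    D:\\ - ERROR
    C:\\ - ERROR
"""
print(summarize_health_report(sample))  # {'ERROR': ['D:\\', 'C:\\']}
```

This only condenses the report; the per-volume chkdsk logs are still the place to look for the actual failure details.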

  • So I have a few of those reports for each volume and server checked. I guess I'm looking for "0 bad file" stuff?

    Why is it mounting data volumes if it's looking for corrupt OS volumes?

    I notice it is mounting these "temp points" in the directory the PS script lives in, which is suddenly making space an issue since that's on my OS volume. Also, it appears to be deleting the mount points afterward as opposed to dismounting them?

    I just saw this in red:

            14.08.2019 10:11:58     exception       System.Management.Automation.RuntimeException: You cannot call a method on a null-valued expression.
    at System.Management.Automation.ExceptionHandlingOps.CheckActionPreference(FunctionContext funcContext, Exception exception)
    at lambda_method(Closure , Object[] , StrongBox`1[] , InterpretedFrame )

    Thanks

  • So if you have a data volume with lots of change on it, resulting in a change log greater than 1 MB when you performed the agent upgrade, then some of the change on that volume was lost. This means the next recovery point contains incomplete data, which means part of the file system within the recovery point is corrupted. If you then mount that recovery point and run chkdsk on it, it will show up as corrupted.

    I don't run this script (we try to avoid running unsupported scripts on our Cores), but this caught my attention, so excuse me if I'm way off. Could you give me more information about this? For example:

    - Does this only happen during client RR upgrades?

    - Why can't the upgrade handle change logs greater than 1 MB (an RR limitation or something else)?

    - Assuming this happens, how does RR fix it? Is the first backup "corrupt" but backups after are fine?

    - What if the backup that is corrupt is a base?

    - Is this check only done with an unsupported script or does the recovery point check from the nightly job find this?

  • So it just finished and found "errors" on every volume on every server it checked? That doesn't seem right, since two out of the three export as standby VMs that boot.

    =======================
    RP Check Report 8/14/2019 10:46:22 AM
    =======================

    'SERVER2' (b6963881-3716-402b-ab34-f7e8efa4d8e1); RP version: 6.0.2.144; created: 05/19/2019 04:00:00

        D:\ - ERROR

        C:\ - ERROR



    'SERVER2' (b6963881-3716-402b-ab34-f7e8efa4d8e1); RP version: 6.3.0.5309; created: 06/23/2019 04:00:01

        C:\ - ERROR

        D:\ - ERROR



    'SERVER2' (b6963881-3716-402b-ab34-f7e8efa4d8e1); RP version: 6.3.0.5309; created: 08/14/2019 04:00:01

        D:\ - ERROR

        C:\ - ERROR



    'SERVER1' (dfd4b6b4-77c2-4372-aab7-8ee35a82cde8); RP version: 6.0.2.144; created: 06/03/2019 10:00:00

        C:\ - ERROR

        D:\ - ERROR



    'SERVER1' (dfd4b6b4-77c2-4372-aab7-8ee35a82cde8); RP version: 6.3.0.5309; created: 07/08/2019 10:00:00

        C:\ - ERROR

        D:\ - ERROR



    'SERVER1' (dfd4b6b4-77c2-4372-aab7-8ee35a82cde8); RP version: 6.3.0.5309; created: 08/14/2019 10:00:00

        C:\ - ERROR

        D:\ - ERROR



    'SERVER3' (59271250-2fc2-45ff-b045-0dcb15a90ba9); RP version: 6.0.2.144; created: 06/03/2019 10:00:04

        (Volume '\\?\Volume{27e1bfc2-883c-11e3-8980-806e6f6e6963}\') - EXECUTION ERROR

        D:\ - ERROR

        C:\ - ERROR



    'SERVER3' (59271250-2fc2-45ff-b045-0dcb15a90ba9); RP version: 6.3.0.5309; created: 07/08/2019 10:00:17



    'SERVER3' (59271250-2fc2-45ff-b045-0dcb15a90ba9); RP version: 6.3.0.5309; created: 08/14/2019 10:00:23


    I think one failed to run and two found errors on everything? This is kind of a confusing mess.

  • This is a very specific issue related to the conversion of our change log from a log to a bitmap. It is detailed in this KB article - https://support.quest.com/kb/301910/incomplete-recovery-points-are-taken-after-windows-agent-is-upgraded-from-6-1-x-to-6-2-0-or-newer - and in the critical notification that was sent to all customers today.

    - It only occurs when you upgrade from agent version 6.1.3 or older to agent version 6.2.0 or newer. If your agents are still 6.1.3 or older then you haven't experienced this issue. If your agents were first installed using version 6.2.0 or newer, then it's not possible to experience this issue.

    - It is caused by a defect in the change log conversion process. So that's an RR code problem.

    - If this occurs, there is no way for RR to detect or fix it automatically. The nightly recovery point check does not do a full chkdsk, which is necessary to detect this problem. Hence our critical notification and our providing a script to detect it.

    - If this occurs, all incremental backups after the conversion are corrupt. The only way to fix this is to take a base image and get a complete copy of the volume.

    - A base image cannot be corrupt because of this defect. A base image does not use a change log. A base image is a complete backup of all used blocks on the volume. So the change log state doesn't matter. Taking a base image is listed as the fix for this issue in the KB article.

    - Who said this script was unsupported? I believe we will fully support this script to help you identify this issue. We want to make sure we identify problems well before you need this data for a critical restore. If you need help, feel free to open a support case or keep asking questions here.

    As the KB says, the easiest way to prevent experiencing this defect is to take a backup of the machine you are going to upgrade, then upgrade the agent, then reboot immediately. Since most of our customers have regularly occurring backups and reboot immediately after upgrade, we have seen very few incidences of this issue.
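
The affected-upgrade-path conditions above can be sketched as a simple version check. This is purely illustrative, assuming dotted numeric version strings like those shown in the report; it is not anything shipped with RR:

```python
def parse_version(v):
    """Turn a dotted version string like '6.1.3' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def upgrade_exposed_to_kb301910(old_agent, new_agent):
    """True if the upgrade path matches the affected case described above:
    from agent 6.1.3 or older to agent 6.2.0 or newer."""
    return parse_version(old_agent) <= (6, 1, 3) and parse_version(new_agent) >= (6, 2, 0)

print(upgrade_exposed_to_kb301910("6.0.2.144", "6.3.0.5309"))  # True: the path in this thread
print(upgrade_exposed_to_kb301910("6.2.1", "6.3.0"))           # False: first installed at 6.2+
```

Per the conditions above, a True result only means the upgrade path was exposed to the defect, not that corruption actually occurred.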

  • I don't see any critical notification, so I was not aware of this and assumed it was one of the unsupported scripts.

    But thank you for the information, Tim.

  • That doesn't seem right since two out of the three export as standby VMs that boot.

    The check disk is a deeper validation than booting. 

    There are lots of reasons why check disk might fail - the issue in KB301910 is just one of them. The issue may even be on the protected machine. You can run check disk on the protected machine to see if the issue is there.

  • If you did not receive an email this morning about the critical notification, I highly recommend you go into the Quest Support portal and make sure your notification settings are updated and that you have not opted out of email notifications. If you are having trouble finding that, please call our support line and our support admins should be able to help you with those settings. It might also be worth checking your spam filter to make sure the email didn't get caught there.

  • What is the output of the actual chkdsk log for each system? The script creates a subfolder, and in that subfolder are the individual logs for each volume's chkdsk. Open one of those for a machine that shows an error status and see what it says.

  • Corrigun, I just tested this same process in my lab too and you are correct: I am getting errors on all agents as you did, even though the points are valid. We are looking into this now.

  • Stages one and two report 0 bad file records. Stage three also 0 bad, but I also see this on every check run, for every volume:

    Usn Journal verification completed.
    Errors detected in the Boot File.
    The Volume Bitmap is incorrect.
    Windows has checked the file system and found problems.
    Please run chkdsk /scan to find the problems and queue them for repair.

  • Yes same here, and we are looking into this now.

Reply Children
  • The PS script itself throws a slew of weird warnings and errors as well. It didn't even run against one of my servers; I think it essentially ran the Core out of memory, and the points would not load.

    This whole KB is a disaster. I feel like I understand RR pretty well and I simply don't believe the results, since observation tells me standby VMs from these points work, but I can't imagine some of my peers getting through this.

  • Yes, I immediately tested mine, which showed ERROR on all volumes; I tested an export and it booted fine. This seems to be generating false positives. We have made the appropriate people aware, and hopefully they correct this ASAP. I am truly sorry for the inconvenience, but it's actively being looked into now.

  • I've pulled down the script for now due to the number of false positives we are seeing. We're evaluating options for reducing them.

    I'll post an update when we know more. 

  • Here's the current status:

    1. The false positives in the script are being caused by a separate defect in RR that causes the bitmap on certain recovery points to be seen as invalid by chkdsk. This defect does not cause the recovery point to be damaged or non-functional in any way, so it truly is a false positive: it makes the backup look invalid even when it is not. To work around this manually, you can mount a recovery point as writable and then run chkdsk /f on the mounted RP. When doing that on a healthy recovery point, you should see the bitmap corrected by chkdsk and no other errors fixed or reported.
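
For triage, here is a rough sketch that checks whether a chkdsk log contains only the bitmap/boot-file complaints described above. The message strings are assumptions taken from the output pasted earlier in this thread, not an official list, so treat any "likely false positive" result as a hint, not a verdict:

```python
# Complaint lines seen in this thread that, per the explanation above, can
# come from the benign bitmap defect rather than real corruption.
FALSE_POSITIVE_LINES = {
    "Errors detected in the Boot File.",
    "The Volume Bitmap is incorrect.",
}

def likely_false_positive(chkdsk_output):
    """Heuristic: True if every complaint in the chkdsk output is one of the
    known benign bitmap/boot-file lines; False if anything else is flagged."""
    problems = [
        line.strip()
        for line in chkdsk_output.splitlines()
        if "incorrect" in line or "Errors detected" in line or "corrupt" in line.lower()
    ]
    return bool(problems) and all(p in FALSE_POSITIVE_LINES for p in problems)

sample = """Usn Journal verification completed.
Errors detected in the Boot File.
The Volume Bitmap is incorrect.
Windows has checked the file system and found problems.
"""
print(likely_false_positive(sample))  # True: only bitmap/boot-file complaints
```

Anything flagged beyond those two lines (bad file records, corrupt segments, and so on) would still warrant a closer look or a support case.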

    2. We are working on a new version of the script that will bypass this specific scenario so that the number of false positives is significantly reduced. As soon as we have that ready, I'll update the KB article and post an update here letting everyone know. Thank you for your patience.

    To try to help calm everyone's nerves, let me give a little data around this issue. We only recently discovered this problem because of how rare it is. It exists in the agent upgrade process, so agentless backups are not affected in any way. It only affects people upgrading from 6.1.3 or older to build 6.2.0 or newer.

    We released 6.2.0 over a year and a half ago. In that time we've seen over 75,000 agents upgraded. Of those we've seen fewer than 10 customers experience this issue. The likelihood of you having it in your environment if you have been doing regular testing of your backups is very small.

    I apologize for any unnecessary concern we caused with this script. We clearly didn't get it vetted well enough and are trying to balance the desire to ensure you know about a possible issue with a functional way of diagnosing the defect. We're trying not to just sweep this under the rug. So thank you for your patience with us and your willingness to provide quick and frank feedback here. 

  • I actually tried to run chkdsk on a mounted point and could not figure out how to make it work. I eventually found the "tools" menu in Windows Explorer, but it refused to start.

    I for one would rather know early and hash it out with you guys than wait. No apologies needed from me.

    I definitely did have a problem like this post-upgrade that you guys in the forum helped me fix. I found it when standby VMs failed.

    We're good. Thanks and keep us posted.

  • Mount the recovery point as writeable to a folder. Doesn't matter what folder, I usually do something like C:\m to keep it simple. Then run chkdsk like this:

    chkdsk /f C:\m\C__

    Notice the two underscores. Each mount point follows that same format with underscores. Pointing chkdsk at that path should make it run.
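
Based on that naming pattern, a tiny hypothetical helper to build the chkdsk target path might look like this; the "<letter>__" folder convention is assumed from the example above, not from RR documentation:

```python
def mounted_volume_path(mount_root, drive_letter):
    """Build the chkdsk target for a volume mounted under mount_root,
    assuming the '<letter>__' folder naming pattern (e.g. C:\\ -> C__)."""
    letter = drive_letter.rstrip(":\\")  # accept "C", "C:", or "C:\"
    return f"{mount_root}\\{letter}__"

print(mounted_volume_path(r"C:\m", "C:"))  # C:\m\C__ -> run: chkdsk /f C:\m\C__
```

So for a point mounted at C:\m, the D: volume would be checked with chkdsk /f C:\m\D__, and so on for each volume folder under the mount root.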