KB301910 PS Warning & Results

Hi All,

I am running the PS script to check my recovery points and saw the following warning (in yellow)-

WARN  2019-08-14T08:33:27 Performance.Replay.Core.Client.LocalMountMana Execute Action DELETE https://MY SERVER HERE:8006/apprecovery/api/core/mounts
e9-80f2-40a8f01fc3b3%257d~|~ took 00:01:02.8944452, which is too long

Also, the instructions are not very clear about checking the script's results and logs. I think I found and fixed this issue on my own, with help from here, just after I upgraded, but I'm checking anyway. They could be a little clearer about how to read the results.

Thanks.

Parents
  • The script contains usage information at the top. From that usage information I pulled this:

Checks recovery points of the specified agents <agentDisplayNames> for errors that could be caused by the issue described in KB301910. The script tries to identify 3 recovery points of interest:
    - newest recovery point backed up from an RR Agent version 6.1.3 or lower
    - oldest recovery point backed up from an RR Agent version 6.2.0 or later
    - newest recovery point of an agent on the Core

    After any of those recovery points are identified, the script mounts them, runs Windows CHKDSK on them, and stores the results in the specified location.
    The check results include a file called DisksHealthReport.txt, which shows at a glance whether CHKDSK returned a failure error code for any of the recovery points checked.
    Note that the script writes its logs to a directory named "log" next to the script.
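
    As a rough sketch (the script file name below is a placeholder; only the -agentDisplayNames parameter comes from the usage text above), a run might look like this:

        # Hypothetical invocation; substitute the real script name and your agent names.
        .\Check-RecoveryPoints.ps1 -agentDisplayNames 'SERVER1','SERVER2','SERVER3'
        # When it finishes, look for DisksHealthReport.txt in the results location.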

    Did you get a "DisksHealthReport.txt" file to review after the check completed?

  • So I have a few of those reports for each volume and server checked. I guess I'm looking for the "0 bad file" stuff?

    Why is it mounting data volumes if it's looking for corrupt OS volumes?

    I notice it is mounting these "temp points" in the directory the PS lives in, which is suddenly making space an issue since that's on my OS volume. Also, it appears to be deleting the mount points afterward as opposed to dismounting them?

    I just saw this in red-

            14.08.2019 10:11:58     exception       System.Management.Automation.RuntimeException: You cannot call a method on a null-valued expression.
    at System.Management.Automation.ExceptionHandlingOps.CheckActionPreference(FunctionContext funcContext, Exception exception)
    at lambda_method(Closure , Object[] , StrongBox`1[] , InterpretedFrame )

    Thanks

Children
  • The issue is not just with OS volumes. The issue can occur on any volume on the system. It is most common on the OS volume, but not exclusive to it.

    The issue is that if the change log file is greater than 1 MB in size when the upgrade occurs, anything over 1 MB is dropped. So we basically drop some of the tracked changes in the conversion to the new format. So if you have a data volume with lots of change on it resulting in a log file greater than 1 MB when you performed the agent upgrade, then some of the change on that volume was lost. This means the next recovery point contains incomplete data. That means part of the file system within the recovery point is corrupted. If you then mount that recovery point and run chkdsk on it, it will show up as corrupted.
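
    Just to make that concrete, here is a purely illustrative sketch (this is not RR's actual conversion code, and the file path is made up) of what "anything over 1 MB is dropped" means for a change log:

        # Illustrative only: a converter that keeps at most the first 1 MB of the log.
        $log  = [System.IO.File]::ReadAllBytes('C:\example\changelog.bin')   # hypothetical path
        $kept = $log[0..([Math]::Min($log.Length, 1MB) - 1)]                 # only this much survives
        if ($log.Length -gt 1MB) {
            Write-Warning ("{0} bytes of tracked changes were dropped" -f ($log.Length - 1MB))
        }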

    When dealing with Rapid Recovery backups, there are two file systems. One is the proprietary format that RR uses to store the data and create mount points. The other is the actual NTFS (or any other file system) that exists within the mounted recovery point. It is possible for the recovery point file system to be complete and functional (meaning we can mount a recovery point) but the NTFS file system within the mount point to be damaged. This issue is an example of just that. Hence running a chkdsk of the file system within a recovery point is a very simple and complete method for identifying issues with the data. If chkdsk passes on the mounted recovery point, then the file system checks out as complete and the backup is complete.
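
    If you want to spot-check a single recovery point by hand, the same idea works with a plain read-only chkdsk against the mount (the mount path below is just an example of wherever you mounted the point):

        # Read-only pass (no repair switches) over the NTFS inside a mounted recovery point.
        chkdsk C:\RPMount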

    The temp mounts are being mounted in the directory where the script lives. That's the default behavior. If you look at the available parameters within the header of the script, you can see there is a -mountPointsBasePath option where you can point the mounts to a different location. So if you have an OS drive on your core server with very little space, you could point the mounts to a secondary location that has more space.
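
    For example (the script file name is again a placeholder; -agentDisplayNames and -mountPointsBasePath are the parameters from the script header):

        # Send the temporary mounts to a drive with free space instead of the OS volume.
        .\Check-RecoveryPoints.ps1 -agentDisplayNames 'SERVER1','SERVER2' `
            -mountPointsBasePath 'D:\RPCheckMounts'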

    Deleting mount points means exactly the same thing as dismounting them; there is no difference in function, it's just semantics.

  • So if you have a data volume with lots of change on it resulting in a log file greater than 1 MB when you performed the agent upgrade, then some of the change on that volume was lost. This means the next recovery point contains incomplete data. That means part of the file system within the recovery point is corrupted. If you then mount that recovery point and run chkdsk on it, it will show up as corrupted.

    I don't run this script (we try to avoid running unsupported scripts on our cores), but this caught my attention, so excuse me if I'm way off. Could you give me more information about this? For example:

    - Does this only happen during client RR upgrades?

    - Why can't the upgrade handle change logs greater than 1 MB (an RR limitation or something else)?

    - Assuming this happens, how does RR fix it? Is the first backup "corrupt" but backups after are fine?

    - What if the backup that is corrupt is a base?

    - Is this check only done with an unsupported script or does the recovery point check from the nightly job find this?

  • So it just finished and found "errors" on every volume in every server it checked? That doesn't seem right, since two out of the three export as standby VMs that boot.

    =======================
    RP Check Report 8/14/2019 10:46:22 AM
    =======================

    'SERVER2' (b6963881-3716-402b-ab34-f7e8efa4d8e1); RP version: 6.0.2.144; created: 05/19/2019 04:00:00
        D:\ - ERROR
        C:\ - ERROR

    'SERVER2' (b6963881-3716-402b-ab34-f7e8efa4d8e1); RP version: 6.3.0.5309; created: 06/23/2019 04:00:01
        C:\ - ERROR
        D:\ - ERROR

    'SERVER2' (b6963881-3716-402b-ab34-f7e8efa4d8e1); RP version: 6.3.0.5309; created: 08/14/2019 04:00:01
        D:\ - ERROR
        C:\ - ERROR

    'SERVER1' (dfd4b6b4-77c2-4372-aab7-8ee35a82cde8); RP version: 6.0.2.144; created: 06/03/2019 10:00:00
        C:\ - ERROR
        D:\ - ERROR

    'SERVER1' (dfd4b6b4-77c2-4372-aab7-8ee35a82cde8); RP version: 6.3.0.5309; created: 07/08/2019 10:00:00
        C:\ - ERROR
        D:\ - ERROR

    'SERVER1' (dfd4b6b4-77c2-4372-aab7-8ee35a82cde8); RP version: 6.3.0.5309; created: 08/14/2019 10:00:00
        C:\ - ERROR
        D:\ - ERROR

    'SERVER3' (59271250-2fc2-45ff-b045-0dcb15a90ba9); RP version: 6.0.2.144; created: 06/03/2019 10:00:04
        (Volume '\\?\Volume{27e1bfc2-883c-11e3-8980-806e6f6e6963}\') - EXECUTION ERROR
        D:\ - ERROR
        C:\ - ERROR

    'SERVER3' (59271250-2fc2-45ff-b045-0dcb15a90ba9); RP version: 6.3.0.5309; created: 07/08/2019 10:00:17

    'SERVER3' (59271250-2fc2-45ff-b045-0dcb15a90ba9); RP version: 6.3.0.5309; created: 08/14/2019 10:00:23

    I think one failed to run and two found errors on everything? This is kind of a confusing mess.

  • This is a very specific issue related to the conversion of our change log from a log to a bitmap. It is detailed in this KB article - https://support.quest.com/kb/301910/incomplete-recovery-points-are-taken-after-windows-agent-is-upgraded-from-6-1-x-to-6-2-0-or-newer - and in the critical notification that was sent to all customers today.

    - It only occurs when you upgrade from agent version 6.1.3 or older to agent version 6.2.0 or newer. If your agents are still 6.1.3 or older then you haven't experienced this issue. If your agents were first installed using version 6.2.0 or newer, then it's not possible to experience this issue.

    - It is caused by a defect in the change log conversion process. So that's an RR code problem.

    - If this occurs, there is no way for RR to detect or fix it automatically. The nightly job check does not do a full chkdsk, which is necessary to detect this problem. Hence our critical notification and our providing a script to detect it.

    - If this occurs, all incremental backups after the conversion are corrupt. The only way to fix this is to take a base image and get a complete copy of the volume.

    - A base image cannot be corrupt because of this defect. A base image does not use a change log. A base image is a complete backup of all used blocks on the volume. So the change log state doesn't matter. Taking a base image is listed as the fix for this issue in the KB article.

    - Who said this script was unsupported? I believe we will fully support this script to help you identify this issue. We want to make sure we identify problems well before you need this data for a critical restore. If you need help, feel free to open a support case or keep asking questions here.

    As the KB says, the easiest way to avoid this defect is to take a backup of the machine you are going to upgrade, then upgrade the agent, then reboot immediately. Since most of our customers have regularly occurring backups and reboot immediately after an upgrade, we have seen very few instances of this issue.

  • I don't see any critical notification, so I was not aware of this; that's why I assumed it was one of the unsupported scripts.

    But thank you for the information, Tim

  • That doesn't seem right, since two out of the three export as standby VMs that boot.

    Check disk is a deeper validation than booting.

    There are lots of reasons why check disk might fail; the issue in KB301910 is just one of them. The issue may even exist on the protected machine itself. You can run check disk on the protected machine to see if the issue is there.
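
    For example, NTFS supports an online scan, so you can check the live volume on the protected machine without taking it offline:

        # Online, read-only scan of the running system's C: volume.
        chkdsk C: /scan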

  • If you did not receive an email this morning about the critical notification, I highly recommend you go into the Quest Support portal and make sure your notification settings are up to date and that you have not opted out of email notifications. If you are having trouble finding that, please call our support line and our support admins should be able to help you with those settings. It might also be worth checking your spam filter to make sure the email didn't get caught there.

  • What is the output of the actual chkdsk log for each system? The script creates a subfolder, and in that subfolder are the individual logs for each volume's chkdsk. Open one of those for a machine that shows an error status and see what it says.
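
    If there are a lot of them, something like this can skim every log at once (the folder name and file pattern are guesses; point it at whatever the script actually created):

        # Pull the chkdsk verdict lines out of every per-volume log file.
        Get-ChildItem .\RPCheckResults -Recurse -Filter '*.txt' |
            Select-String -Pattern 'found problems|found no problems|Errors detected'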

  • Corrigun, I too just tested this same process in my lab and you are correct. I am getting errors on all agents just as you did, but they are valid. We are looking into this now.

  • Stages one and two show 0 bad file records. Stage three is also 0 bad, but I also see this on every checkpoint run, for every volume-

    Usn Journal verification completed.
    Errors detected in the Boot File.
    The Volume Bitmap is incorrect.
    Windows has checked the file system and found problems.
    Please run chkdsk /scan to find the problems and queue them for repair.