
Core Memory

Looking to start a discussion on Core memory usage and memory troubleshooting in general.

1) The first problem we are seeing is that the Core will often be at 0 free memory with a HUGE amount in standby. Yes, I know low free memory is "normal" in current server OSes and that standby memory is available to be reclaimed by another application if it needs it. But that is not how it behaves in my experience, and definitely not when RR is the process holding all the memory in standby.

I have a Core with 125 GB of memory: 25 GB is in use across every process, 100 GB is in standby, and 0 is free.

I see the technote below about write caching, but I have a few issues with it:

https://support.quest.com/rapid-recovery/kb/119686/high-paged-pool-ram-utilization-on-systems-running-appassure-or-rapid-recovery

a) The Core is on 2012 R2, so it should not be having this issue.

b) The technote gives no indication of how to confirm whether you are having this issue.

c) Without a way to confirm whether I am having this issue, the technote may not even help.

https://www.quest.com/community/products/rapid-recovery/f/forum/21016/core-memory-usage-in-hyper-v-vms-making-vms-unresponsive#

 

2) RAMMap's File Summary will often show a HUGE amount (100+ GB) of memory in standby.

Is this normal, or does it indicate a problem with write caching (or anything else)?

Why does this memory only show up in RAMMap's File Summary, and why does it point to the dfs.records file of our repo?

Why does it not show up as standby memory allocated to Core.Service.exe (or any process) in Task Manager / Resource Monitor? This is not how processes are supposed to act.
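
As a side note, a rough way to keep an eye on the standby list without leaving RAMMap open is to sample the Memory performance counters from an admin command prompt. The counter names below assume an English-language Windows install (they are present on 2012 R2):

    typeperf "\Memory\Available MBytes" "\Memory\Standby Cache Normal Priority Bytes" "\Memory\Standby Cache Reserve Bytes" "\Memory\Standby Cache Core Bytes" -si 5 -sc 12

Available MBytes includes the standby lists, so a Core showing 0 "Free" but a large standby figure still has memory Windows can hand back quickly; the question is whether it actually does so under load.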

  • Awesome. Thanks again. On a related note, how do you track performance on a busy Core (or any core)? This is one thing I always struggle with; checking the rate of a single job seems time-consuming and unreliable.

    Are there any numbers you can get from the logs (or anywhere) that can give me more insight?
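
    In the meantime, the crudest thing I've found is to watch raw disk and NIC throughput on the Core while jobs are running. It won't tell you which job is slow, but it does show whether the box as a whole is moving data. Counter names assume an English Windows install:

        typeperf "\PhysicalDisk(_Total)\Disk Write Bytes/sec" "\Network Interface(*)\Bytes Received/sec" -si 5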
  • I just wanted to let you all know that I had the same experience Emte did. Multiple cores with 256 GB of RAM, and RR was eating all of it. Except for most of my repositories, I was not able to change the write cache policy from 2 to 3 because the rep would no longer mount. Eventually I was able to create new reps with write caching set to off (aka 3), and it made a huge difference. My memory usage went from 95%+ to 60% max, and my I/O speeds doubled, almost tripled.

    My experience does not match what you guys are saying about 2012. Maybe it is better at working with RR, but it still sucks. I would maybe go as far as recommending this change for anyone having this issue with 2012 as well; at the very least, stop saying it's 2008 only, because that's not true! I'm only adamant about this because I would have made this change a long while ago, but I overlooked it because of the 2008 statement.

    Anyway, it doesn't seem to be necessarily related to the number of protected machines on the core, but that could still be a factor. Three 2012 cores had this issue: one with 100+ machines, one with around 50, and one with about 15. The core with 15 was always able to achieve better performance than the others, but it still constantly has 0 free memory. They are all 60+ TB cores, so I think that could be a factor, but I don't know. Also, limiting the number of jobs that can run seems to help; I have mine set to 5 right now, which is lower than I would like, but it keeps my free memory fluctuating between 0 and 25 MB, which I believe to be ideal.

    I've changed 2 of the 3 cores to 3 with success, but I'm stuck on the last one. It's the smallest machine-wise but the largest size-wise; at 130 TB, I do not have the space to recreate this repository and retain our retention policy. The BytesPerSector in the registry shows as 512 for every storage location, but when I change the write cache policy to 3, the rep fails to mount. I would like to turn off Windows write caching for this core as well and let RR do it. Any ideas on how I can deal with this?

  • Here are a few quick thoughts:

    1. What is the bytes per sector of the disk your repository is stored on? An easy way to get that is to run "fsutil fsinfo ntfsinfo d:" at an admin command prompt, where d: is the drive letter you have assigned to the disk the repository is on. If you have more than one extent, check all the different disks. Is the Bytes Per Sector also 512? (There is a sample of the relevant output after this list.)

    2. I'm trying to remember off the top of my head, but I believe that unless the bytes per sector of the repository and the bytes per sector of the disk match, we can't disable write caching.

    3. Matching the Bytes Per Sector of the repository and the disk greatly improves performance in the testing I have seen done. So in the rebuild of your cores, it may not have been write caching that improved performance nearly as much as matching those values. Did you run the two cores you rebuilt with write caching enabled for any period of time?
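
    To be clear about what to look for in that fsutil output, the relevant lines look roughly like the excerpt below. The numbers are only illustrative; a 512e/advanced-format disk will typically report 512 logical and 4096 physical:

        C:\> fsutil fsinfo ntfsinfo d:
        ...
        Bytes Per Sector  :               512
        Bytes Per Physical Sector :       4096
        Bytes Per Cluster :               4096
        ...

    If those two sector values disagree, it is worth comparing both against what the repository reports, since a mismatch there is exactly the kind of thing points 2 and 3 are about.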

    I look forward to hearing back from you, and I appreciate the feedback. We're always looking for input like yours to help us identify more ways to improve performance.

  • Hi Tim,

    I now believe you were correct. Matching up block sizes definitely makes a large difference in transfer/backup speeds. 

    I am still having issues with this particular core on Server 2012. Backup speeds are slower than I would expect and slower than what I'm seeing on my other cores, 2 of which are using the same storage with similar and reliable speeds.

    The issues I have now:
    1. Windows is reporting a large amount of memory "in use": approximately 222,499 MB of 262,144 MB as I look at it right now, with 39,859 MB in standby and 0 MB free.

    2. For reference, the healthy core is at 127,461 MB of 262,144 MB in use, 2,955 MB in standby, and 131,094 MB free.

    I've drastically reduced the number of machines on this problem core, to no avail. The healthy core now has 95 protected machines vs. 48 on the problem core. I think the number of machines has little impact on memory usage, but I could be wrong.

    I've changed the dedupe cache (unfortunately) from 128 GB to 64 GB with little to no effect. The only thing left I can think of is the size of the machines and jobs that are running on the bad core. I have a 30 TB file server trying to take a base image, and six 6 TB Exchange servers that come with huge 7+ TB rollups for each machine. (Managing rollups for servers of this size is a whole other discussion I need to have with you guys, lol.)

    I'm down to 3 maximum concurrent transfers and 1 rollup. I can get decent speeds with this setup, but there is still HUGE in-use memory, which I feel is slowing the transfer speeds. Plus, no other machines can back up because the 7 most important servers are always backing up and rolling up.

    We've considered adding RAM to this server, but it's expensive as ***, so today I decided to just build a new 2016 VM with 128 GB or so of RAM and throw the 6 Exchange servers on it alone.

    Just wanted to share my experiences with you guys again; thanks for all the feedback and information on this post. If there's anything that could help, or that I'm missing, that would be great too. Cya!

  • Thanks for the response and continuing the conversation. It's always helpful when someone shares their experiences.

    Since a copy of the dedupe cache lives in active RAM, changing the dedupe cache from 128 to 64 GB should have immediately freed up 64 GB of RAM. Obviously that RAM could then be consumed by other functions and disappear over time, but it should have been noticeable.
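
    As a rough sanity check that the smaller cache actually shrank the service's own footprint, you can compare the Core service's working set before and after the change (or before and after a service restart), for example:

        tasklist /fi "imagename eq Core.Service.exe"

    The Mem Usage column there is the process working set. Pages sitting on the standby list for the repository file are not part of any process's working set, so they will never be charged to Core.Service.exe in Task Manager, which lines up with what RAMMap showed earlier in this thread.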

    Number of agents is important if you are doing lots of post-backup processing jobs. For instance, if you are running mountability and checksum checks on Exchange and attachability checks on SQL, you are going to use more resources than if you are just backing up machines with no checks. You are definitely right that large agents (multiple TBs) have far more impact than lots of smaller agents. A base image of a 30 TB server is going to take a LONG time; there's no way around that. You're trying to move 30 TB of data. Six 6 TB Exchange servers are also going to be significantly impacting. Are you doing mountability jobs on those servers? I'd bet that this is also part of that high memory usage.

    I'm curious to see how things change now that you've moved the 6 Exchange servers to a single core. That would have been my recommendation also. Pull the heavier load machines out of the current core and put them on their own core so that it can focus solely on them and not interfere with other machines.
