[lustre-discuss] 2.16.1 ptlrpcd infinite loop when machine runs out of RAM

Lewis Hyatt lhyatt at gmail.com
Wed Feb 5 14:21:40 PST 2025


On Wed, Feb 5, 2025 at 10:21 AM Laura Hild <lsh at jlab.org> wrote:
>
> I wanna say 2.15 added those messages (the obd_memory ones, not the spinning ptlrpcd) to every OoM. I remember seeing them when we first had 2.15 clients and looking them up.  I take it you're not getting a corresponding OoM for each, though?

Thanks. Yes, what we see is a single OoM event, which is resolved by
the oom-killer but then triggers ptlrpcd to loop forever, spinning a
CPU and spamming the log with those messages. I guess the OOM callback
it runs is just being called over and over?

> It is typical for a host to struggle if OoM conditions are happening regularly.  Is there a workload manager where you could contain individual jobs' memory usage, and limit the total to something with a bigger margin for the system?

Right, we certainly don't expect it to happen often, and we can
arrange to make sure it does not happen. However, the fact that a
single OOM instance makes the server unusable and causes processes to
hang indefinitely is still an issue that would be great to resolve.

-Lewis

