[lustre-discuss] Kernel panic when reading Lustre osc stats on GPU nodes

Abdeslam Tahari abeslam at gmail.com
Wed May 7 16:10:57 PDT 2025


Hello,

I think this is an NVIDIA bug (the GSP task timeout).

It would be better to contact NVIDIA support or the NVIDIA community forums.

Tahari.Abdeslam

On Wed, May 7, 2025 at 18:20, Oleg Drokin via lustre-discuss <
lustre-discuss at lists.lustre.org> wrote:

> Hello!
>
> "An uncorrectable ECC error detected" does sound like a hardware
> problem, though it is strange that you only see it on GPU nodes.
> (Extra power load raising the chance of memory corruption, plus more
> frequent kernel memory scanning raising the chance of hitting that
> corruption?) I'd expect you'd be seeing other crashes on those GPU
> nodes too.
>
> Could you generate some other CPU load (one that involves system
> calls) on those nodes and see whether crashes suddenly go up as well,
> just in some other area?
>
> On Wed, 2025-05-07 at 17:23 +0200, Anna Fuchs via lustre-discuss wrote:
> >
> > Dear all,
> >
> > We're facing an issue that is hopefully not directly related to
> > Lustre itself (we're not using community Lustre), but maybe someone
> > here has seen something similar or knows someone who has.
> >
> > On our GPU partition with A100-SXM4-80GB GPUs (VBIOS version:
> > 92.00.36.00.02), we’re trying to read IOPS statistics (osc_stats) via
> > the files under /sys/kernel/debug/lustre/osc/ (we’re running 160
> > OSTs, Lustre version 2.14.0_ddn184). Our goal is to sample the data
> > at 5-second intervals, then aggregate and postprocess it into
> > readable metrics.
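The sampling/aggregation step Anna describes can be sketched roughly as below: take two osc `stats` snapshots 5 s apart and turn the per-counter sample deltas into ops/sec. The line layout assumed here (`<name> <count> samples [unit] ...`) is the usual Lustre stats-file format, but the snapshot strings are made-up examples, not data from the affected nodes.

```python
def parse_stats(text: str) -> dict:
    """Map counter name -> cumulative sample count from one stats snapshot."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        # Counter lines look like: "<name> <count> samples [unit] min max sum"
        if len(parts) >= 3 and parts[2] == "samples":
            counts[parts[0]] = int(parts[1])
    return counts

def iops(before: str, after: str, interval: float) -> dict:
    """Per-counter operations/sec between two snapshots `interval` seconds apart."""
    b, a = parse_stats(before), parse_stats(after)
    return {name: (a[name] - b.get(name, 0)) / interval for name in a}

# Hypothetical snapshots 5 s apart:
snap0 = "snapshot_time 100.0 secs.nsecs\nread_bytes 100 samples [bytes] 4096 1048576 409600\n"
snap1 = "snapshot_time 105.0 secs.nsecs\nread_bytes 150 samples [bytes] 4096 1048576 614400\n"
rates = iops(snap0, snap1, 5.0)   # read_bytes: (150 - 100) / 5 = 10.0 ops/s
```

In production the snapshots would of course come from reading the files under /sys/kernel/debug/lustre/osc/ rather than from string literals.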
> >
> > We have a collectd daemon running, which had been stable for a long
> > time. After integrating the IOPS metric, however, we occasionally hit
> > a kernel panic (see crash dump excerpts below). The issue appears to
> > originate somewhere in the GPU firmware stack, but we're unsure why
> > this happens and how it's related to reading Lustre metrics.
> >
> > The problem occurs often, but is hard to reproduce and happens at
> > random. We’re hesitant to run the scripts frequently since a crash
> > could interrupt critical GPU workloads. That said, limited test runs
> > over several hours often work fine, especially after a fresh reboot.
> > The CPU-only nodes run the same scripts without issues all the time.
> >
> >
> > Could this be a sign that /sys/kernel/debug is being overwhelmed
> > somehow? Even so, that shouldn't normally cause a kernel panic.
> >
> > We’d appreciate any insights, experiences, or pointers, even indirect
> > ones.
> >
> > Thanks in advance!
> >
> > Anna
> >
> >
> >
> >
> > 2024-12-17 17:11:28 [2453606.802826] NVRM: Xid (PCI:0000:03:00): 120, pid='<unknown>', name=<unknown>, GSP task timeout @ pc:0x4bd36c4, task:1
> > 2024-12-17 17:11:28 [2453606.802835] NVRM:     Reported by libos task:0 v2.0 [0] @ ts:1734451888
> > 2024-12-17 17:11:28 [2453606.802837] NVRM:     RISC-V CSR State:
> > 2024-12-17 17:11:28 [2453606.802840] NVRM:     mstatus:0x000000001e000000  mscratch:0x0000000000000000  mie:0x0000000000000880  mip:0x0000000000000000
> > 2024-12-17 17:11:28 [2453606.802842] NVRM:     mepc:0x0000000004bd36c4  mbadaddr:0x00000100badca700  mcause:0x8000000000000007
> > 2024-12-17 17:11:28 [2453606.802844] NVRM:     RISC-V GPR State:
> > [...]
> > 2024-12-17 17:11:29 [2453606.803121] NVRM: Xid (PCI:0000:03:00): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691462, LTC:0, MMU:0, PCIE:0
> > [...]
> > 2024-12-17 17:30:03 [2454721.362906] Kernel panic - not syncing: Fatal exception
> > 2024-12-17 17:30:03 [2454721.611822] Kernel Offset: 0x5200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> > 2024-12-17 17:30:03 [2454721.770927] ---[ end Kernel panic - not syncing: Fatal exception ]---
> >
> >
> >
> > _______________________________________________
> > lustre-discuss mailing list
> > lustre-discuss at lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>

