[lustre-discuss] Kernel panic when reading Lustre osc stats on GPU nodes

Anna Fuchs anna.fuchs at uni-hamburg.de
Wed May 7 08:23:16 PDT 2025


Dear all,

We're facing an issue that is hopefully not directly related to Lustre 
itself (we're not using community Lustre), but maybe someone here has 
seen something similar or knows someone who has.

On our GPU partition with A100-SXM4-80GB GPUs (VBIOS version: 
|92.00.36.00.02|), we’re trying to read IOPS statistics (osc_stats) via 
the files under |/sys/kernel/debug/lustre/osc/| (we’re running 160 OSTs, 
Lustre version |2.14.0_ddn184|). Our goal is to sample the data at 
5-second intervals, then aggregate and postprocess it into readable metrics.
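
A minimal sketch of the sampling and aggregation step, not our production collectd plugin. It assumes each OSC directory contains a |stats| file whose counter lines look like `<name> <count> samples [<unit>] ...`; that layout, the `*/stats` glob, and the helper names `parse_counts`, `rates` and `read_all` are illustrative assumptions, not something confirmed for |2.14.0_ddn184|.

```python
import re
from pathlib import Path
from typing import Dict

# Assumed counter line layout (hypothetical, check against your Lustre build):
#   <name> <count> samples [<unit>] <min> <max> <sum> ...
STAT_LINE = re.compile(r"^(\S+)\s+(\d+)\s+samples\s+\[([^\]]*)\]")


def parse_counts(text: str) -> Dict[str, int]:
    """Return {counter name: sample count} for one stats snapshot."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:  # lines such as snapshot_time do not match and are skipped
            counts[match.group(1)] = int(match.group(2))
    return counts


def rates(prev: Dict[str, int], curr: Dict[str, int],
          interval_s: float) -> Dict[str, float]:
    """Convert two cumulative snapshots into operations per second."""
    result: Dict[str, float] = {}
    for name, count in curr.items():
        delta = count - prev.get(name, 0)
        if delta < 0:  # counter was reset between the two samples
            delta = count
        result[name] = delta / interval_s
    return result


def read_all(root: str = "/sys/kernel/debug/lustre/osc") -> Dict[str, Dict[str, int]]:
    """Read every <root>/<target>/stats file once (the path layout is an assumption)."""
    return {
        path.parent.name: parse_counts(path.read_text())
        for path in Path(root).glob("*/stats")
    }
```

Calling `read_all()` every 5 seconds and passing consecutive results for each OSC through `rates(prev, curr, 5.0)` yields per-OSC operation rates, which can then be summed across the 160 OSTs.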

We have a collectd daemon running, which had been stable for a long 
time. After integrating the IOPS metric, however, we occasionally hit a 
kernel panic (see crash dump excerpts below). The issue appears to 
originate somewhere in the GPU firmware stack, but we're unsure why this 
happens and how it's related to reading Lustre metrics.

The panic recurs often, but at random, and we cannot reproduce it on 
demand. We’re hesitant to run the scripts frequently since a crash could 
interrupt critical GPU workloads. That said, limited test runs over 
several hours often work fine, especially after a fresh reboot. The 
CPU-only nodes run the same scripts without issues all the time.

Could this be a sign that |/sys/kernel/debug| is being overwhelmed 
somehow? Even then, that shouldn’t normally cause a kernel panic.

We’d appreciate any insights, experiences, or pointers, even indirect ones.

Thanks in advance!

Anna


|2024-12-17 17:11:28 [2453606.802826] NVRM: Xid (PCI:0000:03:00): 120, pid='<unknown>', name=<unknown>, GSP task timeout @ pc:0x4bd36c4, task: 1
2024-12-17 17:11:28 [2453606.802835] NVRM: Reported by libos task:0 v2.0 [0] @ ts:1734451888
2024-12-17 17:11:28 [2453606.802837] NVRM: RISC-V CSR State:
2024-12-17 17:11:28 [2453606.802840] NVRM: mstatus:0x000000001e000000 mscratch:0x0000000000000000 mie:0x0000000000000880 mip:0x0000000000000000
2024-12-17 17:11:28 [2453606.802842] NVRM: mepc:0x0000000004bd36c4 mbadaddr:0x00000100badca700 mcause:0x8000000000000007
2024-12-17 17:11:28 [2453606.802844] NVRM: RISC-V GPR State:
[...]
2024-12-17 17:11:29 [2453606.803121] NVRM: Xid (PCI:0000:03:00): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691462, LTC:0, MMU:0, PCIE:0
[...]
2024-12-17 17:30:03 [2454721.362906] Kernel panic - not syncing: Fatal exception
2024-12-17 17:30:03 [2454721.611822] Kernel Offset: 0x5200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
2024-12-17 17:30:03 [2454721.770927] ---[ end Kernel panic - not syncing: Fatal exception ]---|

-- 
Anna Fuchs
Universität Hamburg / Deutsches Klimarechenzentrum GmbH (DKRZ)