[lustre-discuss] Kernel panic when reading Lustre osc stats on GPU nodes
Oleg Drokin
green at whamcloud.com
Wed May 7 09:15:52 PDT 2025
Hello!
"An uncorrectable ECC error detected" does sound like there's some
hardware problem, while it is strange you only get this on GPU nodes
(Extra power load leading to higher chances of memory corruption + more
frequent kernel memory scannong increasing the chance to hit such
curruption?) I'd expect you'd be seeing other crashes on such GPU nodes
.
Can you perhaps generate some other CPU load (one that involves lots of
system calls) on those nodes and see if crashes suddenly go up as well,
just in some other area?
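Something along these lines would exercise the syscall path (an
untested sketch; the file path and duration are arbitrary choices, and
it deliberately avoids the Lustre debugfs files):

#!/usr/bin/env python3
# Untested sketch: generate syscall-heavy load without touching
# /sys/kernel/debug/lustre, to see whether crashes track syscall
# load in general rather than the osc stats reads specifically.
import time

PATH = "/proc/self/status"   # cheap, always-present pseudo file
DURATION = 3600              # seconds; pick whatever is safe

deadline = time.time() + DURATION
calls = 0
while time.time() < deadline:
    with open(PATH, "rb") as f:   # open + read + close per loop
        f.read()
    calls += 3

print("issued roughly %d syscalls" % calls)

Running a few copies in parallel (one per core) would get you closer
to a collectd-style background load.
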
On Wed, 2025-05-07 at 17:23 +0200, Anna Fuchs via lustre-discuss wrote:
>
> Dear all,
>
> We're facing an issue that is hopefully not directly related to
> Lustre itself (we're not using community Lustre), but maybe someone
> here has seen something similar or knows someone who has.
>
> On our GPU partition with A100-SXM4-80GB GPUs (VBIOS version:
> 92.00.36.00.02), we’re trying to read IOPS statistics (osc_stats) via
> the files under /sys/kernel/debug/lustre/osc/ (we’re running 160
> OSTs, Lustre version 2.14.0_ddn184). Our goal is to sample the data
> at 5-second intervals, then aggregate and postprocess it into
> readable metrics.
>
> We have a collectd daemon running, which had been stable for a long
> time. After integrating the IOPS metric, however, we occasionally hit
> a kernel panic (see crash dump excerpts below). The issue appears to
> originate somewhere in the GPU firmware stack, but we're unsure why
> this happens and how it's related to reading Lustre metrics.
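>
> (For illustration, a minimal version of such a sampling loop might
> look like the sketch below; the per-target "stats" path layout and
> the counter line format are assumptions on the editor's part, and
> the real collectd plugin differs. Reading debugfs requires root.)
>
> #!/usr/bin/env python3
> # Simplified, illustrative 5-second sampler; not the actual plugin.
> import glob
> import time
>
> STATS_GLOB = "/sys/kernel/debug/lustre/osc/*/stats"  # assumed layout
> INTERVAL = 5  # seconds, as described above
>
> def read_counters(path):
>     # Assumed line format: "name count samples [unit] min max sum"
>     counters = {}
>     with open(path) as f:
>         for line in f:
>             fields = line.split()
>             if len(fields) >= 2 and fields[1].isdigit():
>                 counters[fields[0]] = int(fields[1])
>     return counters
>
> while True:
>     for path in glob.glob(STATS_GLOB):
>         try:
>             counters = read_counters(path)
>         except OSError:
>             continue  # target may vanish between glob and open
>         # aggregate and postprocess counters here
>     time.sleep(INTERVAL)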
>
> The problem occurs often, but it happens at random and is hard to
> reproduce on demand. We're hesitant to run the scripts frequently,
> since a crash could interrupt critical GPU workloads. That said,
> limited test runs over several hours often work fine, especially
> after a fresh reboot. The CPU-only nodes run the same scripts
> continuously without any issues.
>
>
> Could this be a sign that /sys/kernel/debug is being overwhelmed
> somehow? Even so, that shouldn't normally cause a kernel panic.
>
> We’d appreciate any insights, experiences, or pointers, even indirect
> ones.
>
> Thanks in advance!
>
> Anna
>
>
>
>
> 2024-12-17 17:11:28 [2453606.802826] NVRM: Xid (PCI:0000:03:00): 120, pid='<unknown>', name=<unknown>, GSP task timeout @ pc:0x4bd36c4, task:1
> 2024-12-17 17:11:28 [2453606.802835] NVRM: Reported by libos task:0 v2.0 [0] @ ts:1734451888
> 2024-12-17 17:11:28 [2453606.802837] NVRM: RISC-V CSR State:
> 2024-12-17 17:11:28 [2453606.802840] NVRM: mstatus:0x000000001e000000 mscratch:0x0000000000000000 mie:0x0000000000000880 mip:0x0000000000000000
> 2024-12-17 17:11:28 [2453606.802842] NVRM: mepc:0x0000000004bd36c4 mbadaddr:0x00000100badca700 mcause:0x8000000000000007
> 2024-12-17 17:11:28 [2453606.802844] NVRM: RISC-V GPR State:
> [...]
> 2024-12-17 17:11:29 [2453606.803121] NVRM: Xid (PCI:0000:03:00): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691462, LTC:0, MMU:0, PCIE:0
> [...]
> 2024-12-17 17:30:03 [2454721.362906] Kernel panic - not syncing: Fatal exception
> 2024-12-17 17:30:03 [2454721.611822] Kernel Offset: 0x5200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> 2024-12-17 17:30:03 [2454721.770927] ---[ end Kernel panic - not syncing: Fatal exception ]---
>
>
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org