<div dir="auto"><div>Hello,</div><div dir="auto"><br></div><div dir="auto">I think this is an NVIDIA bug (GSP task).</div><div dir="auto"><br></div><div dir="auto">It would be better to contact NVIDIA support or the NVIDIA community.</div><div><br></div><div data-smartmail="gmail_signature">Tahari.Abdeslam</div></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Wed, 7 May 2025, 18:20, Oleg Drokin via lustre-discuss &lt;<a href="mailto:lustre-discuss@lists.lustre.org">lustre-discuss@lists.lustre.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello!<br>

<br>

"An uncorrectable ECC error detected" does sound like a hardware<br>

problem, though it is strange that you only get this on GPU nodes<br>

(extra power load leading to a higher chance of memory corruption, plus<br>

more frequent kernel memory scanning increasing the chance of hitting<br>

such corruption?). If so, I'd expect you'd be seeing other crashes on<br>

such GPU nodes as well.<br>

<br>

Can you generate some other CPU load (one that involves system calls)<br>

on those nodes and see whether crashes suddenly go up as well, just<br>

in some other area?<br>
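<br>
A minimal sketch of such a load generator, assuming Python 3 is available on the nodes. The function name syscall_load and the choice of os.stat() as the system call are illustrative assumptions, not something taken from the collectd setup; any syscall-heavy loop that avoids Lustre and the GPU would serve the same purpose.<br>
<pre>
```python
import os
import time


def syscall_load(path=".", seconds=1.0):
    """Call os.stat(path) in a tight loop for about `seconds` seconds.

    Every os.stat() call enters the kernel, so the loop generates
    system-call load without touching Lustre or the GPU.
    Returns the number of calls made.
    """
    deadline = time.monotonic() + seconds
    calls = 0
    while deadline > time.monotonic():
        os.stat(path)
        calls += 1
    return calls
```
</pre>
Running one such loop per core for a few hours (for example via multiprocessing.Pool) and watching for new Xid messages or panics would show whether generic kernel activity is enough to trigger the problem.<br>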

<br>

On Wed, 2025-05-07 at 17:23 +0200, Anna Fuchs via lustre-discuss wrote:<br>

>  <br>

> Dear all,<br>

>  <br>

> We're facing an issue that is hopefully not directly related to<br>

> Lustre itself (we're not using community Lustre), but maybe someone<br>

> here has seen something similar or knows someone who has.<br>

>  <br>

> On our GPU partition with A100-SXM4-80GB GPUs (VBIOS version:<br>

> 92.00.36.00.02), we’re trying to read IOPS statistics (osc_stats) via<br>

> the files under /sys/kernel/debug/lustre/osc/ (we’re running 160<br>

> OSTs, Lustre version 2.14.0_ddn184). Our goal is to sample the data<br>

> at 5-second intervals, then aggregate and postprocess it into<br>

> readable metrics.<br>
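<br>
For readers following along, the sampling described above could look roughly like the sketch below. It is not the collectd plugin from this thread. The glob pattern, the function names, and the assumed line shape "name count samples [unit] ..." are assumptions for illustration; check them against the actual files on the system.<br>
<pre>
```python
import glob
import time

# Assumed location of the per-OSC stats files (hypothetical glob pattern).
STATS_GLOB = "/sys/kernel/debug/lustre/osc/*/stats"


def parse_stats(text):
    """Return {counter_name: sample_count} for lines shaped like
    'name count samples [unit] ...'; all other lines are skipped."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[2] == "samples" and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts


def rates(prev, curr, interval_s):
    """Per-second rate of each counter between two snapshots.
    Counters that went backwards (for example after a reset) are dropped."""
    return {
        name: (curr[name] - prev[name]) / interval_s
        for name in curr
        if name in prev and curr[name] >= prev[name]
    }


def read_all(pattern=STATS_GLOB):
    """Sum the counters over every file matching the pattern."""
    total = {}
    for path in glob.glob(pattern):
        with open(path) as handle:
            for name, count in parse_stats(handle.read()).items():
                total[name] = total.get(name, 0) + count
    return total


def sample_forever(interval_s=5.0):
    """Print aggregated per-second rates roughly every interval_s seconds."""
    prev, prev_t = read_all(), time.monotonic()
    while True:
        time.sleep(interval_s)
        curr, curr_t = read_all(), time.monotonic()
        print(rates(prev, curr, curr_t - prev_t))
        prev, prev_t = curr, curr_t
```
</pre>
Keeping the parsing in pure functions makes it possible to test the aggregation on saved snapshots without reading debugfs on a production GPU node.<br>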

>  <br>

> We have a collectd daemon running, which had been stable for a long<br>

> time. After integrating the IOPS metric, however, we occasionally hit<br>

> a kernel panic (see crash dump excerpts below). The issue appears to<br>

> originate somewhere in the GPU firmware stack, but we're unsure why<br>

> this happens and how it's related to reading Lustre metrics.<br>

>  <br>

> The problem occurs often, but is hard to reproduce and happens at<br>

> random. We’re hesitant to run the scripts frequently since a crash<br>

> could interrupt critical GPU workloads. That said, limited test runs<br>

> over several hours often work fine, especially after a fresh reboot.<br>

> The CPU-only nodes run the same scripts without issues all the time.<br>

>  <br>

>  <br>

> Could this be a sign that /sys/kernel/debug is being overwhelmed<br>

> somehow? Although that shouldn’t normally cause a kernel panic.<br>

>  <br>

> We’d appreciate any insights, experiences, or pointers, even indirect<br>

> ones.<br>

>  <br>

> Thanks in advance!<br>

>  <br>

> Anna<br>

>  <br>

> <br>

>  <br>

>  <br>

> 2024-12-17 17:11:28 [2453606.802826] NVRM: Xid (PCI:0000:03:00): 120, pid='&lt;unknown&gt;', name=&lt;unknown&gt;, GSP task timeout @ pc:0x4bd36c4, task: 1<br>

> 2024-12-17 17:11:28 [2453606.802835] NVRM:     Reported by libos task:0 v2.0 [0] @ ts:1734451888<br>

> 2024-12-17 17:11:28 [2453606.802837] NVRM:     RISC-V CSR State:<br>

> 2024-12-17 17:11:28 [2453606.802840] NVRM:        mstatus:0x000000001e000000  mscratch:0x0000000000000000  mie:0x0000000000000880  mip:0x0000000000000000<br>

> 2024-12-17 17:11:28 [2453606.802842] NVRM:           mepc:0x0000000004bd36c4  mbadaddr:0x00000100badca700  mcause:0x8000000000000007<br>

> 2024-12-17 17:11:28 [2453606.802844] NVRM:     RISC-V GPR State:<br>

> [...]<br>

> 2024-12-17 17:11:29 [2453606.803121] NVRM: Xid (PCI:0000:03:00): 140, pid='&lt;unknown&gt;', name=&lt;unknown&gt;, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691462, LTC:0, MMU:0, PCIE:0<br>

> [...]<br>

> 2024-12-17 17:30:03 [2454721.362906] Kernel panic - not syncing: Fatal exception<br>

> 2024-12-17 17:30:03 [2454721.611822] Kernel Offset: 0x5200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)<br>

> 2024-12-17 17:30:03 [2454721.770927] ---[ end Kernel panic - not syncing: Fatal exception ]---<br>

> <br>

> <br>

> <br>

> _______________________________________________<br>

> lustre-discuss mailing list<br>

> <a href="mailto:lustre-discuss@lists.lustre.org" target="_blank" rel="noreferrer">lustre-discuss@lists.lustre.org</a><br>

> <a href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org" rel="noreferrer noreferrer" target="_blank">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a><br>

<br>


</blockquote></div>