<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
<p data-start="141" data-end="150" class="">Dear all,</p>
<p data-start="152" data-end="343" class="">We're facing an issue
that is hopefully not directly related to Lustre itself (we're not
using community Lustre), but maybe someone here has seen something
similar or knows someone who has.</p>
<p data-start="345" data-end="695" class="">On our GPU partition
with <span data-start="371" data-end="394">A100-SXM4-80GB GPUs</span>
(VBIOS version: <code data-start="411" data-end="427">92.00.36.00.02</code>),
we’re trying to read <span data-start="451" data-end="470">IOPS
statistics</span> (osc_stats) via the files under <code
data-start="491" data-end="522">/sys/kernel/debug/lustre/osc/</code>
(we’re running 160 OSTs, Lustre version <code data-start="563"
data-end="578">2.14.0_ddn184</code>). Our goal is to sample the
data at <span data-start="615" data-end="637">5-second intervals</span>,
then aggregate and postprocess it into readable metrics.</p>
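<p>For illustration only, the sampling could look roughly like the following Python sketch. It is not the collectd integration itself. It assumes that each target directory under <code>/sys/kernel/debug/lustre/osc/</code> holds a file named <code>stats</code> with lines of the form <code>name count samples [unit] ...</code>. The file name, the line layout, and all function names are assumptions made for this example.</p>
<pre>
```python
import glob
import time

# Assumption: each OSC target directory exposes a file named "stats".
STATS_GLOB = "/sys/kernel/debug/lustre/osc/*/stats"
INTERVAL_S = 5  # sampling interval mentioned in the text


def parse_stats(text):
    """Return {counter_name: sample_count} parsed from one stats file."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumed layout: "name count samples [unit] ...". Lines that do
        # not match (for example snapshot_time) are skipped.
        if len(fields) >= 3 and fields[2] == "samples" and fields[1].isdigit():
            counts[fields[0]] = int(fields[1])
    return counts


def read_all(pattern=STATS_GLOB):
    """Sum the sample counts per counter over all files matching pattern."""
    totals = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as fh:
                parsed = parse_stats(fh.read())
        except OSError:
            continue  # a target may vanish between glob and open
        for name, count in parsed.items():
            totals[name] = totals.get(name, 0) + count
    return totals


def rates(prev, curr, elapsed_s):
    """Operations per second per counter between two snapshots."""
    out = {}
    for name, count in curr.items():
        delta = count - prev.get(name, 0)
        # A negative delta means the counters were reset; report 0.
        out[name] = max(delta, 0) / elapsed_s
    return out


def main():
    """Print aggregated rates every INTERVAL_S seconds until interrupted."""
    prev, t_prev = read_all(), time.monotonic()
    while True:
        time.sleep(INTERVAL_S)
        curr, t_curr = read_all(), time.monotonic()
        print(rates(prev, curr, t_curr - t_prev))
        prev, t_prev = curr, t_curr
```
</pre>
<p>Calling <code>main()</code> starts the loop. The counters are treated as cumulative, so rates come from the difference between two consecutive snapshots. Reading debugfs typically requires root.</p>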
<p data-start="697" data-end="1038" class="">We have a <span
data-start="707" data-end="726">collectd daemon</span> running,
which had been stable for a long time. After integrating the IOPS
metric, however, we occasionally hit a <span data-start="844"
data-end="860">kernel panic</span> (see crash dump excerpts
below). The issue appears to originate somewhere in the <span
data-start="942" data-end="955">GPU firmware stack</span>, but
we're unsure why this happens and how it's related to reading
Lustre metrics.</p>
<p data-start="1040" data-end="1374" class="">The problem occurs
often but at random, so we cannot reproduce it on demand. We’re
hesitant to run the scripts frequently since a crash could
interrupt critical GPU workloads. That said, <span
data-start="1206" data-end="1227">limited test runs</span> over
several hours often work fine, especially <span data-start="1275"
data-end="1299">after a fresh reboot</span>. The CPU-only nodes
run the same scripts continuously without any issues.</p>
<p data-start="1376" data-end="1506" class="">Could this be a sign
that <code data-start="1402" data-end="1421">/sys/kernel/debug</code>
is being overwhelmed somehow? Even if it is, that shouldn’t normally
cause a kernel panic.</p>
<p data-start="1508" data-end="1586" class="">We’d appreciate <span
data-start="1524" data-end="1566">any insights, experiences, or
pointers</span>, even indirect ones.</p>
<p data-start="1588" data-end="1606" class="">Thanks in advance!</p>
<p data-start="1588" data-end="1606" class="">Anna</p>
<pre role="region"><code class="code-colors hljs language-dns"><span
class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
class="hljs-number">11</span>:<span class="hljs-number">28</span> [<span
class="hljs-number">2453606</span>.<span class="hljs-number">802826</span>] NVRM: Xid (PCI:<span
class="hljs-number">0000:03:00</span>): <span class="hljs-number">120</span>, pid='&lt;unknown&gt;', name=&lt;unknown&gt;, GSP task timeout @ pc:<span
class="hljs-number">0</span>x4bd36c4, task: <span class="hljs-number">1</span>
<span class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
class="hljs-number">11</span>:<span class="hljs-number">28</span> [<span
class="hljs-number">2453606</span>.<span class="hljs-number">802835</span>] NVRM: Reported by libos task:<span
class="hljs-number">0</span> v2.<span class="hljs-number">0</span> [<span
class="hljs-number">0</span>] @ ts:<span class="hljs-number">1734451888</span>
<span class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
class="hljs-number">11</span>:<span class="hljs-number">28</span> [<span
class="hljs-number">2453606</span>.<span class="hljs-number">802837</span>] NVRM: RISC-V CSR State:
<span class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
class="hljs-number">11</span>:<span class="hljs-number">28</span> [<span
class="hljs-number">2453606</span>.<span class="hljs-number">802840</span>] NVRM: mstatus:<span
class="hljs-number">0</span>x0000000<span class="hljs-number">01e000000</span> mscratch:<span
class="hljs-number">0</span>x00000<span class="hljs-number">00000000000</span> mie:<span
class="hljs-number">0</span>x00000<span class="hljs-number">00000000880</span> mip:<span
class="hljs-number">0</span>x<span class="hljs-number">0000000000000000</span>
<span class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
class="hljs-number">11</span>:<span class="hljs-number">28</span> [<span
class="hljs-number">2453606</span>.<span class="hljs-number">802842</span>] NVRM: mepc:<span
class="hljs-number">0</span>x0000000004bd36c4 mbadaddr:<span
class="hljs-number">0</span>x00000100badca700 mcause:<span
class="hljs-number">0</span>x80000<span class="hljs-number">00000000007</span>
<span class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
class="hljs-number">11</span>:<span class="hljs-number">28</span> [<span
class="hljs-number">2453606</span>.<span class="hljs-number">802844</span>] NVRM: RISC-V GPR State:
[...]
<span class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
class="hljs-number">11</span>:<span class="hljs-number">29</span> [<span
class="hljs-number">2453606</span>.<span class="hljs-number">803121</span>] NVRM: Xid (PCI:<span
class="hljs-number">0000:03:00</span>): <span class="hljs-number">140</span>, pid='&lt;unknown&gt;', name=&lt;unknown&gt;, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-<span class="hljs-number">1840691462</span>, LTC:<span
class="hljs-number">0</span>, MMU:<span class="hljs-number">0</span>, PCIE:<span
class="hljs-number">0</span>
[...]
<span class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
class="hljs-number">30</span>:<span class="hljs-number">03</span> [<span
class="hljs-number">2454721</span>.<span class="hljs-number">362906</span>] Kernel panic - not syncing: Fatal exception
<span class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
class="hljs-number">30</span>:<span class="hljs-number">03</span> [<span
class="hljs-number">2454721</span>.<span class="hljs-number">611822</span>] Kernel Offset: <span
class="hljs-number">0x5200000</span> from <span class="hljs-number">0</span>xffffffff<span
class="hljs-number">81000000</span> (relocation range: <span
class="hljs-number">0</span>xffffffff<span class="hljs-number">80000000</span>-<span
class="hljs-number">0</span>xffffffffbfffffff)
<span class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
class="hljs-number">30</span>:<span class="hljs-number">03</span> [<span
class="hljs-number">2454721</span>.<span class="hljs-number">770927</span>] ---[ end Kernel panic - not syncing: Fatal exception ]---
</code></pre>
<p>--<br>
Anna Fuchs<br>
Universität Hamburg /<br>
Deutsches Klimarechenzentrum GmbH (DKRZ)</p>
</body>
</html>