<!DOCTYPE html>
<html>
  <head>

    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p data-start="141" data-end="150" class="">Dear all,</p>
    <p data-start="152" data-end="343" class="">We're facing an issue
      that is hopefully not directly related to Lustre itself (we're not
      using community Lustre), but maybe someone here has seen something
      similar or knows someone who has.</p>
    <p data-start="345" data-end="695" class="">On our GPU partition
      with <span data-start="371" data-end="394">A100-SXM4-80GB GPUs</span>
      (VBIOS version: <code data-start="411" data-end="427">92.00.36.00.02</code>),
      we’re trying to read <span data-start="451" data-end="470">IOPS
        statistics</span> (osc_stats) via the files under <code
        data-start="491" data-end="522">/sys/kernel/debug/lustre/osc/</code>
      (we’re running 160 OSTs, Lustre version <code data-start="563"
        data-end="578">2.14.0_ddn184</code>). Our goal is to sample the
      data at <span data-start="615" data-end="637">5-second intervals</span>,
      then aggregate and postprocess it into readable metrics.</p>
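    <p>For reference, the sampler is essentially the following loop (a
      simplified sketch, not our production script; the glob pattern, the
      stats-file format, and the operation names are assumptions about a
      typical client-side layout):</p>
    <pre><code>#!/usr/bin/env python3
# Simplified sketch of the sampler: read every osc stats file under
# debugfs, sum the per-operation sample counts across all OSCs, and
# print aggregate rates once per 5-second interval.
import glob
import time

STATS_GLOB = "/sys/kernel/debug/lustre/osc/*/stats"  # one dir per OSC/OST pair
INTERVAL = 5  # seconds

def read_counters():
    """Sum the 'samples' column for each operation across all OSCs."""
    totals = {}
    for path in glob.glob(STATS_GLOB):
        try:
            with open(path) as f:
                for line in f:
                    fields = line.split()
                    # stats lines look like: "read_bytes NNN samples [bytes] min max sum"
                    if len(fields) >= 3 and fields[2] == "samples":
                        totals[fields[0]] = totals.get(fields[0], 0) + int(fields[1])
        except OSError:
            continue  # an OSC dir may vanish between glob() and open()
    return totals

prev = read_counters()
while True:
    time.sleep(INTERVAL)
    cur = read_counters()
    for op in ("ost_read", "ost_write"):  # assumed counter names
        delta = cur.get(op, 0) - prev.get(op, 0)
        print(f"{op}: {delta / INTERVAL:.1f} ops/s")
    prev = cur
</code></pre>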
    <p data-start="697" data-end="1038" class="">We have a <span
        data-start="707" data-end="726">collectd daemon</span> running,
      which had been stable for a long time. After integrating the IOPS
      metric, however, we occasionally hit a <span data-start="844"
        data-end="860">kernel panic</span> (see crash dump excerpts
      below). The issue appears to originate somewhere in the <span
        data-start="942" data-end="955">GPU firmware stack</span>, but
      we're unsure why this happens and how it's related to reading
      Lustre metrics.</p>
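    <p>The collectd side is schematic as follows: the sampler runs under
      the exec plugin and reports rates as PUTVAL lines on stdout (again
      only a sketch; the identifier and type names are placeholders, not
      our actual configuration):</p>
    <pre><code>#!/usr/bin/env python3
# Schematic of the collectd integration via the exec plugin's
# plain-text protocol: one PUTVAL line per metric and interval.
# Identifier and type names below are placeholders.
import os
import sys
import time

HOST = os.uname().nodename
INTERVAL = 5

def put_rate(op, rate):
    # PUTVAL "host/plugin-instance/type-instance" interval=N N:value
    # ("N" as the timestamp lets collectd fill in the current time)
    sys.stdout.write(
        f'PUTVAL "{HOST}/exec-lustre_osc/gauge-{op}" interval={INTERVAL} N:{rate:.1f}\n'
    )
    sys.stdout.flush()

while True:
    # ... compute per-op deltas as in the sampler sketch above ...
    put_rate("ost_read", 0.0)   # placeholder values
    put_rate("ost_write", 0.0)
    time.sleep(INTERVAL)
</code></pre>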
    <p data-start="1040" data-end="1374" class="">The problem occurs
      often, but is hard to reproduce and happens at random. We’re
      hesitant to run the scripts frequently since a crash could
      interrupt critical GPU workloads. That said, <span
        data-start="1206" data-end="1227">limited test runs</span> over
      several hours often work fine, especially <span data-start="1275"
        data-end="1299">after a fresh reboot</span>. The CPU-only nodes
      run the same scripts without issues all the time.<br>
    </p>
    <p data-start="1376" data-end="1506" class="">Could this be a sign
      that <code data-start="1402" data-end="1421">/sys/kernel/debug</code>
      is being overwhelmed somehow? Although that shouldn’t normally
      cause a kernel panic.</p>
    <p data-start="1508" data-end="1586" class="">We’d appreciate <span
        data-start="1524" data-end="1566">any insights, experiences, or
        pointers</span>, even indirect ones.</p>
    <p data-start="1588" data-end="1606" class="">Thanks in advance!</p>
    <p data-start="1588" data-end="1606" class="">Anna</p>
    <p data-start="1588" data-end="1606" class=""><br>
    </p>
    <pre role="region"><code class="code-colors hljs language-dns"><span
    class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
    class="hljs-number">11</span>:<span class="hljs-number">28</span> [<span
    class="hljs-number">2453606</span>.<span class="hljs-number">802826</span>] NVRM: Xid (PCI:<span
    class="hljs-number">0000:03:00</span>): <span class="hljs-number">120</span>, pid='<unknown>', name=<unknown>, GSP task timeout @ pc:<span
    class="hljs-number">0</span>x4bd36c4, task:
<span class="hljs-number">1</span>
<span class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
    class="hljs-number">11</span>:<span class="hljs-number">28</span> [<span
    class="hljs-number">2453606</span>.<span class="hljs-number">802835</span>] NVRM:     Reported by libos task:<span
    class="hljs-number">0</span> v2.<span class="hljs-number">0</span> [<span
    class="hljs-number">0</span>] @ ts:<span class="hljs-number">1734451888</span>
<span class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
    class="hljs-number">11</span>:<span class="hljs-number">28</span> [<span
    class="hljs-number">2453606</span>.<span class="hljs-number">802837</span>] NVRM:     RISC-V CSR State:
<span class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
    class="hljs-number">11</span>:<span class="hljs-number">28</span> [<span
    class="hljs-number">2453606</span>.<span class="hljs-number">802840</span>] NVRM:         mstatus:<span
    class="hljs-number">0</span>x0000000<span class="hljs-number">01e000000</span>  mscratch:<span
    class="hljs-number">0</span>x00000<span class="hljs-number">00000000000</span>     mie:<span
    class="hljs-number">0</span>x00000<span class="hljs-number">00000000880</span>  mip:<span
    class="hljs-number">0</span>x
<span class="hljs-number">0000000000000000</span>
<span class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
    class="hljs-number">11</span>:<span class="hljs-number">28</span> [<span
    class="hljs-number">2453606</span>.<span class="hljs-number">802842</span>] NVRM:            mepc:<span
    class="hljs-number">0</span>x0000000004bd36c4  mbadaddr:<span
    class="hljs-number">0</span>x00000100badca700  mcause:<span
    class="hljs-number">0</span>x80000<span class="hljs-number">00000000007</span>
<span class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
    class="hljs-number">11</span>:<span class="hljs-number">28</span> [<span
    class="hljs-number">2453606</span>.<span class="hljs-number">802844</span>] NVRM:     RISC-V GPR State:
[...]
<span class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
    class="hljs-number">11</span>:<span class="hljs-number">29</span> [<span
    class="hljs-number">2453606</span>.<span class="hljs-number">803121</span>] NVRM: Xid (PCI:<span
    class="hljs-number">0000:03:00</span>): <span class="hljs-number">140</span>, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (p
ossible firmware handling failure) DRAM:-<span class="hljs-number">1840691462</span>, LTC:<span
    class="hljs-number">0</span>, MMU:<span class="hljs-number">0</span>, PCIE:<span
    class="hljs-number">0</span>
[...]
<span class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
    class="hljs-number">30</span>:<span class="hljs-number">03</span> [<span
    class="hljs-number">2454721</span>.<span class="hljs-number">362906</span>] Kernel panic - not syncing: Fatal exception
<span class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
    class="hljs-number">30</span>:<span class="hljs-number">03</span> [<span
    class="hljs-number">2454721</span>.<span class="hljs-number">611822</span>] Kernel Offset: <span
    class="hljs-number">0x5200000</span> from <span class="hljs-number">0</span>xffffffff<span
    class="hljs-number">81000000</span> (relocation range: <span
    class="hljs-number">0</span>xffffffff<span class="hljs-number">80000000</span>-<span
    class="hljs-number">0</span>xffffffffbfffffff)
<span class="hljs-number">2024-12-17</span> <span class="hljs-number">17</span>:<span
    class="hljs-number">30</span>:<span class="hljs-number">03</span> [<span
    class="hljs-number">2454721</span>.<span class="hljs-number">770927</span>] ---[ end Kernel panic - not syncing: Fatal exception ]---



-- 
Anna Fuchs
Universität Hamburg /
Deutsches Klimarechenzentrum GmbH (DKRZ)


</code></pre>
  </body>
</html>