<!DOCTYPE html>

<html>

  <head>


    <meta http-equiv="content-type" content="text/html; charset=UTF-8">

  </head>

  <body>

    Hi,<br>

    <br>

    Last week we upgraded from Lustre 2.12.9 to 2.15.5. It went almost

    without any issues. However, clients using TCP log the following

    messages when mounting one of the two filesystems:<br>

    <br>

    Issue #1:<br>

    -------------<br>

    <br>

    Aug  1 09:39:41 fend08 kernel: Lustre: Lustre: Build Version: 2.15.5<br>

    Aug  1 09:39:41 fend08 kernel: LustreError:

    31623:0:(mgc_request.c:1566:mgc_apply_recover_logs()) mgc: cannot

    find UUID by nid '10.21.10.122@o2ib': rc = -2<br>

    Aug  1 09:39:41 fend08 kernel: Lustre:

    31623:0:(mgc_request.c:1784:mgc_process_recover_nodemap_log())

    MGC172.20.10.101@tcp1: error processing recovery log hpc-cliir: rc =

    -2<br>

    Aug  1 09:39:41 fend08 kernel: Lustre:

    31623:0:(mgc_request.c:2150:mgc_process_log())

    MGC172.20.10.101@tcp1: IR log hpc-cliir failed, not fatal: rc = -2<br>

    Aug  1 09:39:41 fend08 root[31712]: ksocklnd-config: skip setting up

    route for bond0: don't overwrite existing route<br>

    Aug  1 09:39:42 fend08 kernel: Lustre: Mounted hpc-client<br>

    <br>

    This does not happen when using InfiniBand.<br>

    <br>

    How can we fix this?<br>

    <br>

    <br>

    Issue #2 (might or might not be related):<br>

    ---------------------------------------------------------<br>

    <br>

    The status of target connections after mounting is:<br>

    <br>

    # lfs check all<br>

    hpc-OST0003-osc-ffff90532327f000 active.<br>

    hpc-OST0004-osc-ffff90532327f000 active.<br>

    hpc-OST0005-osc-ffff90532327f000 active.<br>

    hpc-OST0006-osc-ffff90532327f000 active.<br>

    lfs check: error: check 'hpc-OST0007-osc-ffff90532327f000': Resource

    temporarily unavailable (11)<br>

    lfs check: error: check 'hpc-OST0008-osc-ffff90532327f000': Resource

    temporarily unavailable (11)<br>

    hpc-OST0009-osc-ffff90532327f000 active.<br>

    hpc-OST000a-osc-ffff90532327f000 active.<br>

    hpc-OST000b-osc-ffff90532327f000 active.<br>

    hpc-OST000c-osc-ffff90532327f000 active.<br>

    hpc-OST000d-osc-ffff90532327f000 active.<br>

    hpc-OST000e-osc-ffff90532327f000 active.<br>

    hpc-MDT0000-mdc-ffff90532327f000 active.<br>

    MGC172.20.10.101@tcp1 active.<br>

    <br>

    OST000[7-e] are on host 172.20.10.122@tcp1 (10.21.10.122@o2ib).<br>

    <br>

    As a result, access hangs when hitting OST000[7-8].<br>
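    <br>

    For anyone who wants to pick out the failing targets from output like

    the above, here is a minimal sketch. The sample text is copied from the

    listing above, and failed_targets is a hypothetical helper name, not

    part of Lustre. On a live client the sample would be replaced by the

    output of 'lfs check all' with stderr merged into stdout.<br>

    <pre>
```shell
# Sample input: four lines copied from the 'lfs check all' output above.
sample="hpc-OST0006-osc-ffff90532327f000 active.
lfs check: error: check 'hpc-OST0007-osc-ffff90532327f000': Resource temporarily unavailable (11)
lfs check: error: check 'hpc-OST0008-osc-ffff90532327f000': Resource temporarily unavailable (11)
hpc-OST0009-osc-ffff90532327f000 active."

# failed_targets (hypothetical helper): read 'lfs check' output on stdin and
# print only the names of targets that were reported with an error.
# The name is the text between the first pair of single quotes.
failed_targets() {
    awk -F"'" '/^lfs check: error:/ { print $2 }'
}

printf '%s\n' "$sample" | failed_targets
```
    </pre>

    With the sample above this prints the two osc device names for

    OST0007 and OST0008.<br>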

    <br>

    Unmounting and mounting again clears the error on OST000[7-8] and

    makes them usable (Issue #1 still showing). After a clean LNet start

    the issue comes back.<br>

    <br>

    Disabling 'discovery' in LNet makes this issue go away (Issue #1

    still showing).<br>
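    <br>

    For reference, a minimal sketch of one way to turn off LNet peer

    discovery at runtime with lnetctl. It is an illustration, not

    necessarily the exact procedure used here. It assumes root on the

    client with LNet already loaded and configured. disable_discovery and

    the LNETCTL override are hypothetical names added only so the

    commands can be previewed.<br>

    <pre>
```shell
# LNETCTL is a hypothetical override for previewing; it defaults to the real tool.
LNETCTL="${LNETCTL:-lnetctl}"

# disable_discovery (hypothetical helper): turn off dynamic peer discovery
# on this node, then show the global LNet settings so the change can be checked.
disable_discovery() {
    "$LNETCTL" set discovery 0
    "$LNETCTL" global show
}
```
    </pre>

    Calling disable_discovery as root applies the change, and the

    'discovery' field in the output should then read 0. Setting

    LNETCTL=echo first only prints the two commands without running

    them. The setting does not survive an LNet module reload unless it is

    also made persistent.<br>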

    <br>

    Reverting to Lustre 2.15.3 also makes it go away (Issue #1 still

    showing). Perhaps not all of the TCP issues in 2.15.4 were fixed by

    LU-17664.<br>

    <br>

    <br>

    A few notes about our system:<br>

    ------------------------------------------<br>

    <br>

    - It's ZFS based.<br>

    - It was created back in 2015. The MGS and MDTs have survived since

    then (zfs send/receive), while new OSTs have been added over time and

    old ones have been taken out.<br>

    - There are 2 filesystems on an MDS pair. One MDT on each MDS.<br>

    - Dual network stack with InfiniBand and TCP. For historical reasons

    we are using tcp1 and not the default tcp0. No routers.<br>

    <br>

    Cheers,<br>

    Hans Henrik<br>

  </body>

</html>