[lustre-discuss] Issue after 2.15.5 upgrade

Hans Henrik Happe happe at nbi.dk
Thu Aug 1 01:10:33 PDT 2024


Hi,

Last week we upgraded from Lustre 2.12.9 to 2.15.5. It went almost 
without any issues. However, clients using TCP log the following messages 
when mounting one of the two filesystems:

Issue #1:
-------------

Aug  1 09:39:41 fend08 kernel: Lustre: Lustre: Build Version: 2.15.5
Aug  1 09:39:41 fend08 kernel: LustreError: 31623:0:(mgc_request.c:1566:mgc_apply_recover_logs()) mgc: cannot find UUID by nid '10.21.10.122@o2ib': rc = -2
Aug  1 09:39:41 fend08 kernel: Lustre: 31623:0:(mgc_request.c:1784:mgc_process_recover_nodemap_log()) MGC172.20.10.101@tcp1: error processing recovery log hpc-cliir: rc = -2
Aug  1 09:39:41 fend08 kernel: Lustre: 31623:0:(mgc_request.c:2150:mgc_process_log()) MGC172.20.10.101@tcp1: IR log hpc-cliir failed, not fatal: rc = -2
Aug  1 09:39:41 fend08 root[31712]: ksocklnd-config: skip setting up route for bond0: don't overwrite existing route
Aug  1 09:39:42 fend08 kernel: Lustre: Mounted hpc-client

This does not happen when mounting over InfiniBand.

How can we fix this?


Issue #2 (might or might not be related):
---------------------------------------------------------

The status of target connections after mounting is:

# lfs check all
hpc-OST0003-osc-ffff90532327f000 active.
hpc-OST0004-osc-ffff90532327f000 active.
hpc-OST0005-osc-ffff90532327f000 active.
hpc-OST0006-osc-ffff90532327f000 active.
lfs check: error: check 'hpc-OST0007-osc-ffff90532327f000': Resource temporarily unavailable (11)
lfs check: error: check 'hpc-OST0008-osc-ffff90532327f000': Resource temporarily unavailable (11)
hpc-OST0009-osc-ffff90532327f000 active.
hpc-OST000a-osc-ffff90532327f000 active.
hpc-OST000b-osc-ffff90532327f000 active.
hpc-OST000c-osc-ffff90532327f000 active.
hpc-OST000d-osc-ffff90532327f000 active.
hpc-OST000e-osc-ffff90532327f000 active.
hpc-MDT0000-mdc-ffff90532327f000 active.
MGC172.20.10.101@tcp1 active.

OST000[7-e] are on host 172.20.10.122@tcp1 (10.21.10.122@o2ib).

As a result, any access that hits OST000[7-8] hangs.
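
For anyone wanting to check many clients quickly, here is a minimal sketch that pulls the failing target names out of `lfs check all` output. The function name `failed_targets` is made up for illustration, and the pattern assumes error lines have exactly the form shown above.

```shell
#!/bin/sh
# Print the name of every target that `lfs check` reports an error for.
# Reads `lfs check` output on stdin, one message per line.
failed_targets() {
    # Error lines look like:
    #   lfs check: error: check '<target>': <message> (<errno>)
    # Keep only the quoted target name; lines for active targets are dropped.
    sed -n "s/^lfs check: error: check '\([^']*\)'.*/\1/p"
}

# Usage on a client (stderr is merged in case errors are written there):
#   lfs check all 2>&1 | failed_targets
```

With the output above, this would print the OST0007 and OST0008 osc device names and nothing else.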

Unmounting and mounting it again clears the error on OST000[7-8] and makes 
them usable (Issue #1 still showing). With a clean LNet start the issue 
comes back.

Disabling 'discovery' in LNet makes this issue go away (Issue #1 still 
showing).
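
How discovery was disabled is not stated here. As a rough sketch, assuming a stock LNet setup, dynamic peer discovery can be turned off at module load time with the `lnet_peer_discovery_disabled` module parameter. The file name below is an assumption. On a running node, `lnetctl set discovery 0` toggles the same setting.

```
# /etc/modprobe.d/lnet.conf (file name is an assumption)
# Disable LNet dynamic peer discovery when the lnet module is loaded.
options lnet lnet_peer_discovery_disabled=1
```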

Reverting to Lustre 2.15.3 also makes it go away (Issue #1 still 
showing). Perhaps not all of the TCP issues in 2.15.4 were fixed by LU-17664.


A few notes about our system:
------------------------------------------

- It's ZFS based.
- It was created back in 2015. The MGS and MDTs have survived since then 
(zfs send/receive), while new OSTs have been added over time and old ones 
have been taken out.
- There are 2 filesystems on an MDS pair. One MDT on each MDS.
- Dual network stack with InfiniBand and TCP. For historical reasons we 
are using tcp1 and not the default tcp0. No routers.

Cheers,
Hans Henrik