[lustre-discuss] Issue after 2.15.5 upgrade
Hans Henrik Happe
happe at nbi.dk
Thu Aug 1 01:10:33 PDT 2024
Hi,
Last week we upgraded from Lustre 2.12.9 to 2.15.5. It went almost
without any issues. However, clients using TCP log this message when
mounting one of the two filesystems:
Issue #1:
-------------
Aug 1 09:39:41 fend08 kernel: Lustre: Lustre: Build Version: 2.15.5
Aug 1 09:39:41 fend08 kernel: LustreError: 31623:0:(mgc_request.c:1566:mgc_apply_recover_logs()) mgc: cannot find UUID by nid '10.21.10.122@o2ib': rc = -2
Aug 1 09:39:41 fend08 kernel: Lustre: 31623:0:(mgc_request.c:1784:mgc_process_recover_nodemap_log()) MGC172.20.10.101@tcp1: error processing recovery log hpc-cliir: rc = -2
Aug 1 09:39:41 fend08 kernel: Lustre: 31623:0:(mgc_request.c:2150:mgc_process_log()) MGC172.20.10.101@tcp1: IR log hpc-cliir failed, not fatal: rc = -2
Aug 1 09:39:41 fend08 root[31712]: ksocklnd-config: skip setting up route for bond0: don't overwrite existing route
Aug 1 09:39:42 fend08 kernel: Lustre: Mounted hpc-client
This does not happen when using InfiniBand.
How can we fix this?
Issue #2 (might or might not be related):
---------------------------------------------------------
The status of target connections after mounting is:
# lfs check all
hpc-OST0003-osc-ffff90532327f000 active.
hpc-OST0004-osc-ffff90532327f000 active.
hpc-OST0005-osc-ffff90532327f000 active.
hpc-OST0006-osc-ffff90532327f000 active.
lfs check: error: check 'hpc-OST0007-osc-ffff90532327f000': Resource temporarily unavailable (11)
lfs check: error: check 'hpc-OST0008-osc-ffff90532327f000': Resource temporarily unavailable (11)
hpc-OST0009-osc-ffff90532327f000 active.
hpc-OST000a-osc-ffff90532327f000 active.
hpc-OST000b-osc-ffff90532327f000 active.
hpc-OST000c-osc-ffff90532327f000 active.
hpc-OST000d-osc-ffff90532327f000 active.
hpc-OST000e-osc-ffff90532327f000 active.
hpc-MDT0000-mdc-ffff90532327f000 active.
MGC172.20.10.101@tcp1 active.
OST000[7-e] are on host 172.20.10.122@tcp1 (10.21.10.122@o2ib).
As a result, the client hangs when it hits OST000[7-8].
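
To pick the failing targets out of output like the above, a small filter can help. This is only a sketch: the function name unavailable_targets is hypothetical, and it assumes the error lines have the exact form shown in this post.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): print the names of targets that
# `lfs check` reports as failing. Reads `lfs check` output on stdin and
# extracts the quoted target name from each "lfs check: error: check '...'" line.
unavailable_targets() {
    sed -n "s/^lfs check: error: check '\([^']*\)'.*/\1/p"
}

# Usage on a client (2>&1 in case the error lines are written to stderr):
#   lfs check all 2>&1 | unavailable_targets
```

With the output shown above, this would list only the OST0007 and OST0008 OSC devices.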
Unmounting and mounting it again clears the error on OST000[7-8] and makes
them usable (Issue #1 still showing). With a clean LNet start the issue
comes back.
Disabling 'discovery' in LNet makes this issue go away (Issue #1 still
showing).
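
The post does not say how discovery was disabled. One way to try the same workaround at runtime, as a sketch only, is the `lnetctl set discovery` command; the wrapper name set_lnet_discovery below is hypothetical, and the persistent module option mentioned in the comment is an assumption to verify against your Lustre version.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of Lustre): turn LNet peer discovery
# off (0) or on (1) at runtime, rejecting any other value.
set_lnet_discovery() {
    value="$1"
    case "$value" in
        0|1) ;;
        *) echo "usage: set_lnet_discovery 0|1" >&2; return 2 ;;
    esac
    # Needs root and a loaded, configured LNet.
    lnetctl set discovery "$value"
}

# Example: set_lnet_discovery 0
# Assumption: to keep the setting across reboots, the lnet module option
# lnet_peer_discovery_disabled=1 can be set in a modprobe.d file instead.
```

The current value can be inspected with `lnetctl global show`.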
Reverting to Lustre 2.15.3 also makes it go away (Issue #1 still
showing). Perhaps not all of the TCP issues in 2.15.4 were fixed by LU-17664.
A few notes about our system:
------------------------------------------
- It's ZFS based.
- It was created back in 2015. The MGS and MDTs have survived since then
(zfs send/receive), while new OSTs have been added over time and old ones
have been taken out.
- There are 2 filesystems on an MDS pair. One MDT on each MDS.
- Dual network stack with InfiniBand and TCP. For historical reasons we
are using tcp1 and not the default tcp0. No routers.
Cheers,
Hans Henrik