<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
Hi,<br>
<br>
Last week we upgraded from Lustre 2.12.9 to 2.15.5. It went almost
without any issues. However, clients using TCP log this message
when mounting one of the two filesystems:<br>
<br>
Issue #1:<br>
-------------<br>
<br>
Aug 1 09:39:41 fend08 kernel: Lustre: Lustre: Build Version: 2.15.5<br>
Aug 1 09:39:41 fend08 kernel: LustreError:
31623:0:(mgc_request.c:1566:mgc_apply_recover_logs()) mgc: cannot
find UUID by nid '10.21.10.122@o2ib': rc = -2<br>
Aug 1 09:39:41 fend08 kernel: Lustre:
31623:0:(mgc_request.c:1784:mgc_process_recover_nodemap_log())
MGC172.20.10.101@tcp1: error processing recovery log hpc-cliir: rc =
-2<br>
Aug 1 09:39:41 fend08 kernel: Lustre:
31623:0:(mgc_request.c:2150:mgc_process_log())
MGC172.20.10.101@tcp1: IR log hpc-cliir failed, not fatal: rc = -2<br>
Aug 1 09:39:41 fend08 root[31712]: ksocklnd-config: skip setting up
route for bond0: don't overwrite existing route<br>
Aug 1 09:39:42 fend08 kernel: Lustre: Mounted hpc-client<br>
<br>
This does not happen when using InfiniBand.<br>
<br>
How can we fix this?<br>
<br>
<br>
Issue #2 (might or might not be related):<br>
---------------------------------------------------------<br>
<br>
The status of target connections after mounting is:<br>
<br>
# lfs check all<br>
hpc-OST0003-osc-ffff90532327f000 active.<br>
hpc-OST0004-osc-ffff90532327f000 active.<br>
hpc-OST0005-osc-ffff90532327f000 active.<br>
hpc-OST0006-osc-ffff90532327f000 active.<br>
lfs check: error: check 'hpc-OST0007-osc-ffff90532327f000': Resource
temporarily unavailable (11)<br>
lfs check: error: check 'hpc-OST0008-osc-ffff90532327f000': Resource
temporarily unavailable (11)<br>
hpc-OST0009-osc-ffff90532327f000 active.<br>
hpc-OST000a-osc-ffff90532327f000 active.<br>
hpc-OST000b-osc-ffff90532327f000 active.<br>
hpc-OST000c-osc-ffff90532327f000 active.<br>
hpc-OST000d-osc-ffff90532327f000 active.<br>
hpc-OST000e-osc-ffff90532327f000 active.<br>
hpc-MDT0000-mdc-ffff90532327f000 active.<br>
MGC172.20.10.101@tcp1 active.<br>
<br>
OST000[7-e] are on host 172.20.10.122@tcp1 (10.21.10.122@o2ib).<br>
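<br>
A minimal sketch for picking the failing targets out of output like the
above, assuming only the two line shapes shown here ("NAME active." and
"lfs check: error: check 'NAME': REASON"). The function name
failing_targets is hypothetical and not part of Lustre:<br>
<pre>
```python
import re

# Matches error lines of the shape seen in the output above, e.g.
#   lfs check: error: check 'hpc-OST0007-osc-...': Resource temporarily unavailable (11)
ERROR_RE = re.compile(r"lfs check: error: check '([^']+)': (.*)")


def failing_targets(output):
    """Return (target, reason) pairs for every target reported as failing."""
    failures = []
    for line in output.splitlines():
        match = ERROR_RE.search(line.strip())
        if match:
            failures.append((match.group(1), match.group(2)))
    return failures


# Sample input copied from the output above (unwrapped to one line per target).
sample = """\
hpc-OST0006-osc-ffff90532327f000 active.
lfs check: error: check 'hpc-OST0007-osc-ffff90532327f000': Resource temporarily unavailable (11)
lfs check: error: check 'hpc-OST0008-osc-ffff90532327f000': Resource temporarily unavailable (11)
hpc-OST0009-osc-ffff90532327f000 active.
"""

for target, reason in failing_targets(sample):
    print(target, reason)
```
</pre>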
<br>
As a result, anything that touches OST000[7-8] hangs.<br>
<br>
Unmounting and mounting again clears the error on OST000[7-8] and
makes them usable (Issue #1 still showing). With a clean LNet start the
issue comes back.<br>
<br>
Disabling 'discovery' in LNet makes this issue go away (Issue #1
still showing).<br>
<br>
Reverting to Lustre 2.15.3 also makes it go away (Issue #1 still
showing). Perhaps not all of the TCP issues in 2.15.4 were fixed by
LU-17664.<br>
<br>
<br>
A few notes about our system:<br>
------------------------------------------<br>
<br>
- It's ZFS based.<br>
- It was created back in 2015. The MGS and MDTs have survived since
then (zfs send/receive), while new OSTs have been added over time and
old ones have been taken out.<br>
- There are 2 filesystems on an MDS pair. One MDT on each MDS.<br>
- Dual network stack with InfiniBand and TCP. For historical reasons
we are using tcp1 and not the default tcp0. No routers.<br>
<br>
Cheers,<br>
Hans Henrik<br>
</body>
</html>