[lustre-discuss] ost_connect to node failed

Thu Nov 25 10:35:25 PST 2021

Hi Colin,

I’ve done some more digging and found that on the affected nodes the messages repeat at ~10 min intervals.
I can also see a lot of these errors in the MDS log:

Nov 25 10:56:02 mds01 kernel: LustreError: 10370:0:(osp_precreate.c:964:osp_precreate_cleanup_orphans()) lustre01-OST000c-osc-MDT0000: cannot cleanup orphans: rc = -11
Nov 25 10:56:02 mds01 kernel: LustreError: 10370:0:(osp_precreate.c:964:osp_precreate_cleanup_orphans()) Skipped 4 previous similar messages
Nov 25 11:08:39 mds01 kernel: LustreError: 10370:0:(osp_precreate.c:964:osp_precreate_cleanup_orphans()) lustre01-OST000c-osc-MDT0000: cannot cleanup orphans: rc = -11
Nov 25 11:08:39 mds01 kernel: LustreError: 10370:0:(osp_precreate.c:964:osp_precreate_cleanup_orphans()) Skipped 4 previous similar messages
Nov 25 11:21:16 mds01 kernel: LustreError: 10370:0:(osp_precreate.c:964:osp_precreate_cleanup_orphans()) lustre01-OST000c-osc-MDT0000: cannot cleanup orphans: rc = -11

As you can see, these refer to another OST (OST000c), and they are repeated every ~12.5 mins.
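The spacing between these messages can be checked by subtracting the timestamps directly. Below is a minimal Python sketch that uses only the three timestamps quoted above; the year 2021 is taken from the date of this post, and the function name is just for illustration.

```python
from datetime import datetime

# Timestamps of the three "cannot cleanup orphans" lines quoted above.
TIMESTAMPS = ["Nov 25 10:56:02", "Nov 25 11:08:39", "Nov 25 11:21:16"]


def message_intervals(stamps, year=2021):
    """Return the gaps, in seconds, between consecutive syslog timestamps.

    Syslog lines carry no year, so one is supplied (an assumption: 2021).
    """
    parsed = [
        datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        for stamp in stamps
    ]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(parsed, parsed[1:])]


if __name__ == "__main__":
    for gap in message_intervals(TIMESTAMPS):
        print(f"{gap:.0f} s = {gap / 60:.1f} min")
```

For the lines quoted here, both gaps come out to 757 seconds, a little over 12.5 minutes.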

On oss03 (serving ost000a – ost000e), no errors are logged after rebooting the clients, but I can see these messages:

Nov 25 19:08:02 oss03 kernel: Lustre: 19728:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply#012  req@ffff9c6ec5550850 x1713320906932288/t0(0) >lustre01-MDT0000-mdtlov_UUID@192.168.1.200@o2ib:662/0 lens 432/0 e 0 to 0 dl 1637863687 ref 2 fl New:/0/ffffffff rc 0/-1
Nov 25 19:08:02 oss03 kernel: Lustre: 19728:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 4 previous similar messages
Nov 25 19:11:23 oss03 kernel: Lustre: lustre01-OST000b: Export ffff9c42c996fc00 already connecting from 192.168.1.13@o2ib
Nov 25 19:11:23 oss03 kernel: Lustre: lustre01-OST000a: Export ffff9c4f43fb3c00 already connecting from 192.168.1.13@o2ib
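
To see which clients keep hitting "already connecting" and on which OSTs, the messages can be tallied per client NID. This is a rough Python sketch, not a Lustre tool: the regular expression is my own guess at the message layout based on the two lines above, and it assumes NIDs are written in the usual address@network form.

```python
import re
from collections import Counter

# Pattern inferred from the two oss03 lines quoted above (an assumption,
# not an official description of the message format).
ALREADY_CONNECTING = re.compile(
    r"Lustre: (?P<target>\S+): Export \S+ already connecting from (?P<nid>\S+)"
)

SAMPLE = [
    "Nov 25 19:11:23 oss03 kernel: Lustre: lustre01-OST000b: "
    "Export ffff9c42c996fc00 already connecting from 192.168.1.13@o2ib",
    "Nov 25 19:11:23 oss03 kernel: Lustre: lustre01-OST000a: "
    "Export ffff9c4f43fb3c00 already connecting from 192.168.1.13@o2ib",
]


def reconnects_by_client(lines):
    """Count 'already connecting' messages per (client NID, target) pair."""
    counts = Counter()
    for line in lines:
        match = ALREADY_CONNECTING.search(line)
        if match:
            counts[(match.group("nid"), match.group("target"))] += 1
    return counts


if __name__ == "__main__":
    for (nid, target), count in sorted(reconnects_by_client(SAMPLE).items()):
        print(f"{nid} -> {target}: {count}")
```

Feeding it the full /var/log/messages from oss03 instead of SAMPLE should show whether only the rebooted clients are involved.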

I also checked the InfiniBand network and found no errors.
Servers are running CentOS 7.9 with Lustre 2.12.6 / zfs 3.10.0
Clients are running CentOS 7.2 with Lustre 2.8.0

Does this look like a problem on oss03?

Regards, Hallstein

From: Colin Faber <cfaber at gmail.com>
Sent: Thursday, 25 November 2021 18:11
To: Hallstein Løhre <Hallstein.Lohre at alphasystem.no>
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [lustre-discuss] ost_connect to node failed

-114 == EALREADY (operation already in progress). What does the logging look like on both sides of the connection?

-cf
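
For reference, the rc values in these logs are negated Linux errno numbers. Here is a small Python sketch that decodes the ones appearing in this thread. The table is hard-coded with the Linux values rather than taken from the local errno module, because the numbering differs on other platforms; the function name is made up for this example.

```python
# Linux errno values relevant to this thread (hard-coded on purpose:
# other operating systems number these differently).
LINUX_ERRNO = {
    11: ("EAGAIN", "Resource temporarily unavailable"),
    114: ("EALREADY", "Operation already in progress"),
    115: ("EINPROGRESS", "Operation now in progress"),
}


def describe_rc(rc):
    """Turn a Lustre 'rc = -N' value into a readable errno description."""
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in this table"))
    return f"rc = {rc}: {name} ({text})"


if __name__ == "__main__":
    for rc in (-114, -11):
        print(describe_rc(rc))
```

So the clients' ost_connect is failing with EALREADY, while the MDS orphan cleanup on OST000c is returning EAGAIN.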

On Thu, Nov 25, 2021 at 5:18 AM Hallstein Løhre <Hallstein.Lohre at alphasystem.no> wrote:

Hi,

After some trouble with runaway processes yesterday, I had to reboot several Lustre clients. Now some of these show the following entries in /var/log/messages:

Nov 25 11:09:51 nodexx kernel: LustreError: 11-0: lustre01-OST000a-osc-ffff887ee3207800: operation ost_connect to node 192.168.1.xxx@o2ib failed: rc = -114

The filesystem seems OK, but the stuck processes might have accessed file(s) on OST000a. There does not appear to be any hardware problem; the OSTs are all ZFS volumes with status OK.
I suspended writing to OST000a, but after rebooting the clients and checking for hardware problems, I have re-enabled writing.
Is there any explanation for rc = -114?

Best Regards

Hallstein Løhre

ALPHA SYSTEM AS

_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org