[lustre-discuss] Lustre server still tries to recover the LNet reply to decommissioned clients

Huang, Qiulan qhuang at bnl.gov
Tue Dec 12 12:53:02 PST 2023


Hello Andreas,


Thanks for your reply and tips.

We found this was caused by removing the Lustre modules (uninstalling the Lustre RPMs) without unmounting the Lustre client first. As a result, the servers were never notified, and they kept trying to recover the connection over and over.

The good news is that the LNetError messages stopped after I ran the following command to remove the export. Is there a better way to clean up the removed clients, e.g. a disconnect at the LNet level?

[root@mds2 ~]# echo "10.67.178.25@tcp" > /proc/fs/lustre/mdt/data-MDT0000/exports/clear
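
In case it is useful to others, the rough (untested) loop below would apply the same cleanup for that NID on every target mounted on a server; it assumes the exports/clear interface behaves the same under obdfilter on the OSSes as it does under mdt here:

for f in /proc/fs/lustre/mdt/*/exports/clear \
         /proc/fs/lustre/obdfilter/*/exports/clear; do
    # skip target types that are not mounted on this node
    [ -e "$f" ] && echo "10.67.178.25@tcp" > "$f"
done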

Thank you.

Regards,
Qiulan


________________________________
From: Andreas Dilger <adilger at whamcloud.com>
Sent: Friday, December 8, 2023 6:49 PM
To: Huang, Qiulan <qhuang at bnl.gov>
Cc: lustre-discuss at lists.lustre.org <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] Lustre server still tries to recover the LNet reply to decommissioned clients

If you are evicting a client by NID, then use the "nid:" keyword:

    lctl set_param mdt.*.evict_client=nid:10.68.178.25@tcp

Otherwise it is expecting the input to be in the form of a client UUID (to allow
evicting a single export from a client mounting the filesystem multiple times).
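
For example, you could look up the UUID of the stale export under the exports directory shown in your listing and pass that instead (a rough sketch using the target name and NID from your mail, not something I have tested):

    cat /proc/fs/lustre/mdt/data-MDT0000/exports/10.67.178.25@tcp/uuid
    lctl set_param mdt.data-MDT0000.evict_client=<UUID printed above>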

That said, the client *should* be evicted by the server automatically, so it isn't
clear why this isn't happening.  Possibly this is something at the LNet level
(which unfortunately I don't know much about)?

Cheers, Andreas

> On Dec 6, 2023, at 13:23, Huang, Qiulan via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:
>
>
>
> Hello all,
>
>
> We removed some clients two weeks ago, but the Lustre server is still trying to handle the LNet recovery reply to those clients (the error log is posted below). They are also still listed in the exports directory.
>
>
> I tried to run the following command to evict the clients, but it failed with the error "no exports found":
>
> lctl set_param mdt.*.evict_client=10.68.178.25@tcp
>
>
> Do you know how to clean up the removed, decommissioned clients? Any suggestions would be greatly appreciated.
>
>
>
> For example:
>
> [root@mds2 ~]# ll /proc/fs/lustre/mdt/data-MDT0000/exports/10.67.178.25@tcp/
> total 0
> -r--r--r-- 1 root root 0 Dec  5 15:41 export
> -r--r--r-- 1 root root 0 Dec  5 15:41 fmd_count
> -r--r--r-- 1 root root 0 Dec  5 15:41 hash
> -rw-r--r-- 1 root root 0 Dec  5 15:41 ldlm_stats
> -r--r--r-- 1 root root 0 Dec  5 15:41 nodemap
> -r--r--r-- 1 root root 0 Dec  5 15:41 open_files
> -r--r--r-- 1 root root 0 Dec  5 15:41 reply_data
> -rw-r--r-- 1 root root 0 Aug 14 10:58 stats
> -r--r--r-- 1 root root 0 Dec  5 15:41 uuid
>
>
>
>
>
> /var/log/messages:Dec  6 12:50:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message
> /var/log/messages:Dec  6 13:05:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 13:05:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message
> /var/log/messages:Dec  6 13:20:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 13:20:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message
> /var/log/messages:Dec  6 13:35:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 13:35:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message
> /var/log/messages:Dec  6 13:50:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 13:50:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message
> /var/log/messages:Dec  6 14:05:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 14:05:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message
> /var/log/messages:Dec  6 14:20:16 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110
> /var/log/messages:Dec  6 14:20:16 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message
> /var/log/messages:Dec  6 14:30:17 mds2 kernel: LNetError: 3806712:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.176.25@tcp) recovery failed with -111
> /var/log/messages:Dec  6 14:30:17 mds2 kernel: LNetError: 3806712:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 3 previous similar messages
> /var/log/messages:Dec  6 14:47:14 mds2 kernel: LNetError: 3812070:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.176.25@tcp) recovery failed with -111
> /var/log/messages:Dec  6 14:47:14 mds2 kernel: LNetError: 3812070:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 8 previous similar messages
> /var/log/messages:Dec  6 15:02:14 mds2 kernel: LNetError: 3817248:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.176.25@tcp) recovery failed with -111
>
>
> Regards,
> Qiulan
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud






