[lustre-discuss] How to activate an OST on a client ?
Hans Henrik Happe
happe at nbi.dk
Thu Aug 29 07:15:20 PDT 2024
Hi,
We just had a similar issue on 2.15.5: InfiniBand clients not
reconnecting after a target outage.
Deleting the LNet net and importing the config again solved it without
a reboot or unmount:
# lnetctl net del --net o2ib
# lnetctl import < /etc/lnet.conf
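
The two steps above can be sketched as a small dry-run helper. This is only an illustration under stated assumptions: the function name lnet_reset_cmds is hypothetical, and the defaults assume the net is called o2ib (the usual name for the InfiniBand LND) and that the saved configuration lives in /etc/lnet.conf. It prints the commands instead of running them, so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: print the commands that delete one LNet net and re-import the
# saved LNet configuration. Nothing is executed here.
# lnet_reset_cmds is a hypothetical helper name, not part of Lustre.
lnet_reset_cmds() {
    net="${1:-o2ib}"              # LNet net to delete (assumed default)
    conf="${2:-/etc/lnet.conf}"   # config file to re-import (assumed default)
    printf 'lnetctl net del --net %s\n' "$net"
    printf 'lnetctl import < %s\n' "$conf"
}
```

Calling lnet_reset_cmds with no arguments prints the two commands for the default net and file; they would then be run as root on the affected client.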
Cheers,
Hans Henrik
On 28/08/2024 18.18, Lixin Liu via lustre-discuss wrote:
>
> We had the same problem after we upgraded our Lustre servers from 2.12.8
> to 2.15.3.
>
> Clients were running 2.15.3 on CentOS 7. Random OSTs dropped out
> frequently on busy login nodes (almost daily), but less so on compute
> nodes. The "lctl" command could not activate the OSTs, and a reboot was
> the only way to clear the problem.
>
> In June, we upgraded all client OSes to AlmaLinux 9.3 and Lustre to
> 2.15.4 on both servers and clients (missing the 2.15.5 release by about
> 2 weeks). After the upgrade, we no longer have this problem.
>
> In our case, I wonder whether this was OmniPath related. The servers on
> AlmaLinux 8 were using the in-kernel driver, but the CentOS 7 clients
> were using the driver from the Intel/Cornelis release. The Alma 9
> clients are now also using the in-kernel driver.
>
> Cheers,
>
> Lixin.
>
> *From: *lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on
> behalf of Cameron Harr via lustre-discuss
> <lustre-discuss at lists.lustre.org>
> *Reply-To: *Cameron Harr <harr1 at llnl.gov>
> *Date: *Wednesday, August 28, 2024 at 8:19 AM
> *To: *"lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
> *Subject: *Re: [lustre-discuss] How to activate an OST on a client ?
>
> There's also an "lctl --device <dev> activate" that I've used in the
> past, though I don't know what conditions need to be met for it to work.
>
> On 8/27/24 07:46, Andreas Dilger via lustre-discuss wrote:
>
> Hi Jan,
>
> There is "lctl --device XXXX recover" that will trigger a
> reconnect to the named OST device (per "lctl dl" output), but I'm not
> sure if that will help.
>
> Cheers, Andreas
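
As a rough sketch of the suggestion above, the device numbers can be picked out of "lctl dl" output and turned into recover commands. The helper name recover_cmds is hypothetical, and the filter assumes the usual "lctl dl" column layout (device index, status, type, name, UUID, refcount) with a status other than UP marking a device that needs attention. It only prints commands; whether "recover" actually helps in this situation is not established in the thread.

```shell
#!/bin/sh
# Sketch: read "lctl dl" output on stdin and print one
# "lctl --device N recover" command per OSC device whose status is not UP.
# Assumed columns: index, status, type, name, uuid, refcount.
# recover_cmds is a hypothetical helper name, not part of Lustre.
recover_cmds() {
    awk '$3 == "osc" && $2 != "UP" { printf "lctl --device %s recover\n", $1 }'
}
```

On a client this would be used as "lctl dl | recover_cmds", with the printed commands reviewed and then run as root. The same filter could print "activate" instead of "recover" to try the other command mentioned in the thread.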
>
>
>
> On Aug 22, 2024, at 06:36, Haarst, Jan van via lustre-discuss
> <lustre-discuss at lists.lustre.org> wrote:
>
> Hi,
>
> The wording of the subject probably doesn't quite cover the
> issue. What we see is this:
>
> We have a client behind a router (linking tcp to Omnipath)
> that shows an inactive OST (all on 2.15.5).
>
> Other clients that go through the router do not have this issue.
>
> One client had the same issue, although it showed a different
> OST as inactive.
>
> After a reboot, all was well again on that machine.
>
> The clients can lctl ping the OSSs.
>
> So although we have a workaround (reboot the client), it would
> be nice to:
>
> 1. Fix the issue without a reboot
> 2. Fix the underlying issue.
>
> It might be unrelated, but we also see another routing issue
> every now and then:
>
> The router stops routing requests toward a certain OSS, and
> this can be fixed by deleting the peer_nid of the OSS from the
> router.
>
> I am probably missing informative logs, but I’m more than
> happy to try to generate them, if somebody has a pointer to how.
>
> We are a bit stumped right now.
>
> With kind regards,
>
> --
>
> Jan van Haarst
>
> HPC Administrator
>
> For Anunna/HPC questions, please use https://support.wur.nl (with
> HPC as service)
>
> Present: Monday, Tuesday, Thursday & Friday
>
> Facilitair Bedrijf, part of Wageningen University &
> Research
>
> Information Technology Department
>
> Postbus 59, 6700 AB, Wageningen
>
> Gebouw 116, Akkermaalsbos 12, 6700 WB, Wageningen
>
> http://www.wur.nl/nl/Disclaimer.htm
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>
>