[lustre-discuss] Lustre /home lockup - how to check

Sid Young sid.young at gmail.com
Mon Oct 11 15:18:36 PDT 2021


My key issue is why /home locks up solid when you try to use it, while
/lustre is fine. The backend is ZFS, used to manage the disks presented
from the HP D8000 JBOD. After 6 months of 100% uptime, I'm at a loss as
to why this is suddenly occurring. If I run repeated "dd" tests on
/lustre it works fine; start one on /home and it locks solid.
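
Roughly the test I'm running on each filesystem (paths and sizes here
are just examples):

# dd if=/dev/zero of=/lustre/ddtest.bin bs=1M count=10000 oflag=direct   <- completes fine
# dd if=/dev/zero of=/home/ddtest.bin bs=1M count=10000 oflag=direct     <- hangs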

I have started a ZFS scrub on two of the ZFS pools; at 47T each it will
take most of today to complete, but that should rule out the actual
storage (which is showing "NORMAL/ONLINE" and no errors).
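
For reference, this is all I'm doing (pool names are placeholders for
the real ones):

# zpool scrub <home-pool>
# zpool status -v <home-pool>    <- state ONLINE, no read/write/cksum errors so far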

I'm seeing a lot of these in /var/log/messages:

kernel: LustreError: 6578:0:(events.c:200:client_bulk_callback()) event
type 1, status -5, desc ffff89cdf3b9dc00

Status -5 is -EIO, and client_bulk_callback suggests the failure is on
the bulk data transfer. A Google search returned this:
https://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency
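
To look for dropped or errored LNet messages on the client (my reading
of that wiki page; counter names may vary by version):

# lnetctl stats show              <- check drop_count and errors
# lctl get_param osc.home-*.state <- import state history per /home OST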

Could it be a network issue? The nodes are running the in-box CentOS 7.9
drivers... the Mellanox driver did not seem to make any difference when
I originally tried it 6 months ago.
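
To rule the NICs in or out, I'm going to watch the interface and LNet
counters, something like this (interface name is whatever the 100G port
is called on each node):

# ethtool -S <100g-iface> | egrep -i 'drop|err|pause'
# lnetctl net show -v             <- per-NI send/recv/drop counts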

Any help appreciated :)

Sid


>
> ---------- Forwarded message ----------
> From: Sid Young <sid.young at gmail.com>
> To: lustre-discuss <lustre-discuss at lists.lustre.org>
> Date: Mon, 11 Oct 2021 16:07:56 +1000
> Subject: [lustre-discuss] Tools to check a lustre
>
> I'm having trouble diagnosing where the problem lies in my Lustre
> installation. Clients are 2.12.6, and I have /home and /lustre
> filesystems using Lustre.
>
> /home has 4 OSTs and /lustre is made up of 6 OSTs. lfs df shows all OSTs
> as ACTIVE.
>
> The /lustre filesystem appears fine; I can *ls* into every directory.
>
> When people log into the login node, it appears to lock up. I have shut
> down everything and remounted the OSTs and MDTs etc. in order with no
> errors reported, but I'm getting the lockup issue soon after a few people
> log in.
> The backend network is 100G Ethernet using ConnectX-5 cards and the OS is
> CentOS 7.9; everything was installed as RPMs and updates are disabled in
> yum.conf.
>
> Two questions to start with:
> Is there a command-line tool to check each OST individually?
> Apart from /var/log/messages, is there a Lustre-specific log I can monitor
> on the login node to see errors when I hit /home?
>
>
> Sid Young
>
> ---------- Forwarded message ----------
> From: Dennis Nelson <dnelson at ddn.com>
> To: Sid Young <sid.young at gmail.com>
>
> Date: Mon, 11 Oct 2021 12:20:25 +0000
> Subject: Re: [lustre-discuss] Tools to check a lustre
> Have you tried 'lfs check servers' on the login node?
>

Yes - that was one of the first things I did, and this is what it always reports:

]# lfs check servers
home-OST0000-osc-ffff89adb7e5e000 active.
home-OST0001-osc-ffff89adb7e5e000 active.
home-OST0002-osc-ffff89adb7e5e000 active.
home-OST0003-osc-ffff89adb7e5e000 active.
lustre-OST0000-osc-ffff89cdd14a2000 active.
lustre-OST0001-osc-ffff89cdd14a2000 active.
lustre-OST0002-osc-ffff89cdd14a2000 active.
lustre-OST0003-osc-ffff89cdd14a2000 active.
lustre-OST0004-osc-ffff89cdd14a2000 active.
lustre-OST0005-osc-ffff89cdd14a2000 active.
home-MDT0000-mdc-ffff89adb7e5e000 active.
lustre-MDT0000-mdc-ffff89cdd14a2000 active.
[root at tri-minihub-01 ~]#
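
On the question of checking each OST individually, the closest I have is
pinning a test file to one OST at a time with lfs setstripe and writing
to it, e.g. for OST0002 on /home (the file name is just an example):

# lfs setstripe -i 2 -c 1 /home/ost2-probe
# dd if=/dev/zero of=/home/ost2-probe bs=1M count=1000 oflag=direct
# lfs getstripe /home/ost2-probe   <- confirm it landed on OST0002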