[lustre-discuss] Lustre /home lockup - more info

Sid Young sid.young at gmail.com
Mon Oct 11 20:16:31 PDT 2021


I tried remounting the /home Lustre file system to /mnt in read-only mode,
and when I try to ls the directory it locks up, although I can escape it.
However, when I do a df command I get a completely wrong size (it should be
around 192TB):

10.140.93.42@o2ib:/home    6.0P  4.8P  1.3P  80% /mnt
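
For comparison, the per-OST breakdown from the client should show where that
odd total is coming from. A minimal check against the read-only mount
(standard client-side commands, nothing site-specific):

  # per-OST / per-MDT usage rather than the aggregate total
  lfs df -h /mnt
  # confirm the client can still reach each OST
  lfs check osts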

The zfs scrub is still running and all disks physically report as OK in the
iLO of the two OSS servers...

When the scrub finishes later today I will unmount and remount the 4 OSTs
and see if the remount changes the status... updates in about 8 hours.
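
For the record, the remount I have in mind is roughly the following, run on
each OSS for each of the 4 /home OSTs (the pool/dataset and mount point names
below are placeholders, not the real ones):

  # confirm the backing pool is clean before touching the OST
  zpool status -x home-ost00pool          # placeholder pool name
  umount /lustre/home-ost00               # placeholder mount point
  mount -t lustre home-ost00pool/ost00 /lustre/home-ost00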

Sid Young

On Tue, Oct 12, 2021 at 8:18 AM Sid Young <sid.young at gmail.com> wrote:

>
>>    2. Tools to check a lustre (Sid Young)
>>    4. Re: Tools to check a lustre (Dennis Nelson)
>>
>>
> My key issue is why /home locks solid when you try to use it but /lustre
> is OK. The backend is ZFS, used to manage the disks presented from the HP
> D8000 JBOD.
> I'm at a loss as to why, after 6 months of 100% operation, this is suddenly
> occurring. If I do repeated "dd" tasks on /lustre it works fine; start one
> on /home and it locks solid.
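>
> The dd test itself is nothing special, roughly the following (the target
> paths are just scratch locations, nothing meaningful):
>
>   # completes fine on /lustre, locks up on /home
>   dd if=/dev/zero of=/lustre/scratch/ddtest bs=1M count=4096 oflag=direct
>   dd if=/dev/zero of=/home/scratch/ddtest bs=1M count=4096 oflag=direct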
>
> I have started a ZFS scrub on two of the zfs pools. At 47T each it will
> take most of today to complete, but that should rule out the actual storage
> (which is showing "NORMAL/ONLINE" and no errors).
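>
> To keep an eye on the scrub while it runs (the pool name below is a
> placeholder):
>
>   # shows scrub progress plus per-vdev read/write/checksum error counters
>   zpool status -v home-ost0pool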
>
> I'm seeing a lot of these in /var/log/messages:
> kernel: LustreError: 6578:0:(events.c:200:client_bulk_callback()) event
> type 1, status -5, desc ffff89cdf3b9dc00
> A Google search returned this:
> https://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency
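>
> If I read that page correctly, status -5 is -EIO on a bulk transfer, which
> points more at the network path than at the disks. To watch the /home
> imports from the client side I was planning on something like:
>
>   # connection state of each /home OSC import as seen by this client
>   lctl get_param "osc.home-*.import" | grep state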
>
> Could it be a network issue? The nodes are running the CentOS 7.9
> drivers... the Mellanox one did not seem to make any difference
> when I originally tried it 6 months ago.
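>
> Before swapping drivers again I was going to check LNet itself from the
> login node, along these lines (the NID is the one used for the /home mount):
>
>   # LNet-level ping of the MGS/MDS NID
>   lctl ping 10.140.93.42@o2ib
>   # local NI state and drop/error counters
>   lnetctl net show -v
>   lnetctl stats show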
>
> Any help appreciated :)
>
> Sid
>
>
>>
>> ---------- Forwarded message ----------
>> From: Sid Young <sid.young at gmail.com>
>> To: lustre-discuss <lustre-discuss at lists.lustre.org>
>> Date: Mon, 11 Oct 2021 16:07:56 +1000
>> Subject: [lustre-discuss] Tools to check a lustre
>>
>> I'm having trouble diagnosing where the problem lies in my Lustre
>> installation. The clients are 2.12.6, and I have /home and /lustre
>> filesystems using Lustre.
>>
>> /home has 4 OSTs and /lustre is made up of 6 OSTs. lfs df shows all OSTs
>> as ACTIVE.
>>
>> The /lustre file system appears fine; I can ls into every directory.
>>
>> When people log into the login node, it appears to lock up. I have shut
>> down everything and remounted the OSTs and MDTs etc. in order with no
>> errors reported, but I'm getting the lockup issue soon after a few people
>> log in.
>> The backend network is 100G Ethernet using ConnectX-5 cards and the OS is
>> CentOS 7.9; everything was installed as RPMs and updates are disabled in
>> yum.conf.
>>
>> Two questions to start with:
>> Is there a command-line tool to check each OST individually?
>> Apart from /var/log/messages, is there a Lustre-specific log I can
>> monitor on the login node to see errors when I hit /home?
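>>
>> (The only per-OST views I know of from the client side are the standard
>> ones, e.g.:
>>
>>   lfs df -h         # per-OST space usage and state
>>   lfs check osts    # per-OST connection check from this client
>>   lctl dl           # list the configured Lustre devices on this node
>>
>> so anything beyond these would be good to know about.)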
>>
>>
>>
>> Sid Young
>>
>> ---------- Forwarded message ----------
>> From: Dennis Nelson <dnelson at ddn.com>
>> To: Sid Young <sid.young at gmail.com>
>>
>> Date: Mon, 11 Oct 2021 12:20:25 +0000
>> Subject: Re: [lustre-discuss] Tools to check a lustre
>> Have you tried lfs check servers on the login node?
>>
>
> Yes - that was one of the first things I did, and this is what it always reports:
>
> ]# lfs check servers
> home-OST0000-osc-ffff89adb7e5e000 active.
> home-OST0001-osc-ffff89adb7e5e000 active.
> home-OST0002-osc-ffff89adb7e5e000 active.
> home-OST0003-osc-ffff89adb7e5e000 active.
> lustre-OST0000-osc-ffff89cdd14a2000 active.
> lustre-OST0001-osc-ffff89cdd14a2000 active.
> lustre-OST0002-osc-ffff89cdd14a2000 active.
> lustre-OST0003-osc-ffff89cdd14a2000 active.
> lustre-OST0004-osc-ffff89cdd14a2000 active.
> lustre-OST0005-osc-ffff89cdd14a2000 active.
> home-MDT0000-mdc-ffff89adb7e5e000 active.
> lustre-MDT0000-mdc-ffff89cdd14a2000 active.
> [root@tri-minihub-01 ~]#
>
>

