[lustre-discuss] OSS node crash/high CPU latency when deleting hundreds of empty test files

Colin Faber cfaber at gmail.com
Mon Mar 1 18:25:18 PST 2021


Hi Sid,

What version of lustre?
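
One way to check, on a server or a client:

    lctl get_param version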

-cf

On Mon, Mar 1, 2021, 6:37 PM Sid Young via lustre-discuss <
lustre-discuss at lists.lustre.org> wrote:

> G'Day all,
>
> I've been doing some file create/delete testing on our new Lustre storage
> which results in the OSS nodes crashing and rebooting due to high latency
> issues.
>
> I can reproduce it by running "dd" commands on the /lustre file system in
> a for loop and then running rm -f testfile-*.text at the end.
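>
> Something like this minimal sketch (file count, size and names here are
> illustrative):
>
>   # create a few hundred empty test files on the Lustre mount
>   for i in $(seq 1 500); do
>       dd if=/dev/zero of=/lustre/testfile-$i.text bs=1M count=0
>   done
>   # then remove them all in one go
>   rm -f /lustre/testfile-*.text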
> This results in console errors on our DL385 OSS nodes (running CentOS 7.9)
> which basically show a stack of mlx5_core and bnxt_en error messages (mlx5
> being the Mellanox driver for the 100G ConnectX5 cards), followed by a
> stack of "NMI watchdog: BUG: soft lockup - CPU#N stuck for XXs!" messages,
> where N cycles through around 4 different CPUs and XX is typically 20-24
> seconds... then the boxes reboot!
>
> Before I log a support ticket with HPE, I'm going to try disabling the
> 100G cards and see if it's repeatable via the 10G interfaces on the
> motherboards. But before I do that: does anyone use the Mellanox
> ConnectX5 cards on their Lustre storage nodes, Ethernet only? If so,
> which driver are you using, and on which OS?
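>
> For reference when comparing setups, the driver and firmware on a given
> port can be read with ethtool and modinfo (the interface name below is
> just a placeholder):
>
>   ethtool -i ens2f0        # reports driver (mlx5_core), version, firmware-version
>   modinfo mlx5_core | grep -iE '^(version|vermagic)'   # inbox vs OFED module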
>
> Thanks in advance!
>
> Sid Young
>