[lustre-discuss] OSS node crash/high CPU latency when deleting 100s of empty test files

Weiss, Karsten karsten.weiss at atos.net
Tue Mar 2 04:18:12 PST 2021


Hi Sid,

if you are using a CentOS 7.9 kernel newer than 3.10.0-1160.6.1.el7.x86_64, check out LU-14341, as these kernel versions cause a timer-related regression:

https://jira.whamcloud.com/browse/LU-14341

We learnt this the hard way over the last couple of days and downgraded to kernel-3.10.0-1160.2.1.el7.x86_64 (the officially supported kernel version for Lustre 2.12.6). We use ZFS. YMMV.
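
A quick way to check whether a node is on an affected kernel (this is only a sketch based on the version numbers above; verify against the ticket, and adjust the commands to your environment):

    # show the running kernel; versions newer than 3.10.0-1160.6.1.el7 are
    # affected by the LU-14341 timer regression
    uname -r

    # list the installed kernel packages to see what you can boot back into
    rpm -q kernel

    # downgrade path we used (the package must still be available in your repos):
    yum install kernel-3.10.0-1160.2.1.el7
    # then make the 1160.2.1 entry the default in GRUB, e.g. with
    # grub2-set-default (the entry index/title depends on your menu), and reboot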

--
Karsten Weiss


From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> On Behalf Of Sid Young via lustre-discuss
Sent: Tuesday, March 2, 2021 02:37
To: lustre-discuss <lustre-discuss at lists.lustre.org>
Subject: [lustre-discuss] OSS node crash/high CPU latency when deleting 100s of empty test files


G'Day all,

I've been doing some file create/delete testing on our new Lustre storage, which results in the OSS nodes crashing and rebooting due to high latency issues.

I can reproduce it by running "dd" commands on the /lustre file system in a for loop and then running rm -f testfile-*.text at the end.
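Roughly, the test looks like this (file names, counts and sizes are just illustrative):

    # create a few hundred empty test files on the Lustre mount
    for i in $(seq 1 500); do
        dd if=/dev/zero of=/lustre/testfile-$i.text bs=1M count=0
    done
    # then delete them all in one go
    rm -f /lustre/testfile-*.text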
This results in console errors on our DL385 OSS nodes (running CentOS 7.9) showing a stack of mlx5_core and bnxt_en error messages (mlx5 being the Mellanox driver for the 100G ConnectX-5 cards), followed by a stack of:
"NMI watchdog: BUG: soft lockup - CPU#N stuck for XXs"
where the CPU number is one of about four different CPUs and XX is typically 20-24 seconds... then the boxes reboot!

Before I log a support ticket with HPE, I'm going to try disabling the 100G cards to see whether it's reproducible via the 10G interfaces on the motherboards. But before I do that: does anyone use Mellanox ConnectX-5 cards (Ethernet only) on their Lustre storage nodes? If so, which driver are you using and on which OS?
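
(For comparison, something like the following should report the driver and firmware versions in play on a node; the interface name is just an example:)

    # report driver, driver version and firmware for the 100G interface
    ethtool -i ens2f0
    # and the loaded mlx5_core module details
    modinfo mlx5_core | grep -iE '^(filename|version|vermagic)'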

Thanks in advance!

Sid Young
