[lustre-discuss] OSS node crash/high CPU latency when deleting hundreds of empty test files

Sid Young sid.young at gmail.com
Mon Mar 1 17:37:04 PST 2021


G'Day all,

I've been doing some file create/delete testing on our new Lustre storage,
which results in the OSS nodes crashing and rebooting due to high-latency
issues.

I can reproduce it by running "dd" commands on the /lustre file system in a
for loop and then running rm -f testfile-*.text at the end (something like
the sketch below).
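
For reference, a minimal sketch of the reproduction loop; the file count
(500) and the /lustre mount point are assumptions, not the exact values I
used:

  #!/bin/bash
  # Create a batch of empty test files on the Lustre mount,
  # then delete them all in one go. count=0 produces empty
  # files, matching the subject line.
  for i in $(seq 1 500); do
      dd if=/dev/zero of=/lustre/testfile-$i.text bs=1M count=0
  done
  rm -f /lustre/testfile-*.text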
This results in console errors on our DL385 OSS nodes (running CentOS 7.9)
which basically show a stack of mlx5_core and bnxt_en error messages (mlx5
being the Mellanox driver for the 100G ConnectX-5 cards), followed by a
stack of:
  "NMI watchdog: BUG: soft lockup - CPU#N stuck for XXs"
where the CPU number varies across about four different CPUs and XX is
typically 20-24 seconds... then the boxes reboot!

Before I log a support ticket with HPE, I'm going to try disabling the 100G
cards to see if it's reproducible via the 10G interfaces on the
motherboards. But before I do that: does anyone use the Mellanox ConnectX-5
cards in their Lustre storage nodes over Ethernet only? If so, which driver
are you using, and on which OS?
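
If you reply, something like the following would show which driver and
firmware you're on (the interface name eth0 is just a placeholder for
whatever your 100G port is called):

  # Report driver name, version and firmware for the interface
  ethtool -i eth0
  # Report the version of the mlx5_core module itself
  modinfo mlx5_core | grep -i version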

Thanks in advance!

Sid Young

