<div dir="ltr">G'Day all,<div><br></div><div>I've been doing some file create/delete testing on our new Lustre storage, which results in the OSS nodes crashing and rebooting due to high latency issues.</div><div><div><br></div><div>I can reproduce it by running "dd" commands on the /lustre file system in a for loop and then doing an rm -f testfile-*.text at the end.</div><div>This results in console errors on our DL385 OSS nodes (running CentOS 7.9), which basically show a stack of:</div></div><div>mlx5_core and bnxt_en error messages (mlx5 being the Mellanox driver for the 100G ConnectX-5 cards), followed by a stack of:<br></div><div>"NMI watchdog: BUG: soft lockup - CPU#N stuck for XXs"</div><div>where N is one of around 4 different CPU numbers and XX is typically 20-24 seconds... then the boxes reboot!</div><div><br></div><div>Before I log a support ticket with HPE, I'm going to try disabling the 100G cards to see if it's repeatable via the 10G interfaces on the motherboards. But before I do that, does anyone use the Mellanox ConnectX-5 cards on their Lustre storage nodes over Ethernet only, and if so, which driver are you using and on which OS?</div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div><br></div><div>Thanks in advance!</div><div><br></div><div>Sid Young</div><div><br></div></div></div></div></div></div></div></div></div></div></div></div>