[lustre-discuss] Frequent, silent OSS hangs on multi-homed system

Wed Aug 14 12:20:13 PDT 2019

Hi, I'd love some ideas to debug what has become a frequent annoyance for us.  At a high level, we're observing fairly frequent OSS hangs, with absolutely no console or logging activity.  Our BMC watchdogs then reboot the OSS and ~6 minutes later everything is back in line.  This has been an infrequent occurrence on this system for a couple of years, but has become much more frequent in recent months.

I'd welcome any suggestions, whether lustre/lnet settings or general kernel tricks, for raising the logging level so we can get some more useful output. Right now we're blind.

More details below, and also what I'd characterize as uninformed speculation:

-) overall system is (2x) MDS, (12x) OSS, (2x) monitoring nodes, all with identical servers, network cards, etc.

-) the only difference is JBOD type: the OSSes are connected to Supermicro 90-bay SC946ED-R2KJBOD. All other server hardware is identical.

-) only the OSSes hang in this manner. Looking back, some seem more prone than others, but it's not obviously limited to just a few.

-) CentOS 7.6, lustre 2.10.8, ZFS 0.7.9

-) 2 active file systems: one is pure ZFS and the other is ZFS on the OSSes with an ldiskfs MDT

-) Mellanox ConnectX3 FDR IB & 40GbE

-) LSI 9300-8e HBA

-) Lustre servers are triple-homed, they live on (2x) IB and (1x) 40GbE networks

-) previously, when we first moved to 2.10, we were bitten hard and frequently by LU-10163 (which may or may not be relevant)

-) The hangs don't correlate to any discrete event as best I can tell.  Importantly, we get no LBUGs or anything, which is different from the previous signature.

-) We have definitely stepped up the traffic on the ethernet network this year.  Whereas the primary I/O was previously just on the two IB networks, we are now taxing the ethernet as well with some regularity.

Any thoughts are most welcome, and thanks!

-Ben