[lustre-discuss] BUG: soft lockup - CPU#13 stuck for 23s! [ll_ost_io01_027:24071]

Thu Apr 2 08:22:43 PDT 2020

Hi All,
after several months of stable operation, we have recently run into stability
problems with our Lustre installation. Each of our seven OSD servers crashes
after some hours with kernel messages like

NMI watchdog: BUG: soft lockup - CPU#13 stuck for 23s! [ll_ost_io01_027:24071]

The stuck duration reported in these messages varies between 22s and 23s,
and the number after the last colon (the PID of the stuck thread) is one
of 32281, 30488 and 24071.
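
To compare these messages across servers, the fields can be pulled out of the kernel log with a small script. This is only a sketch: the function names `parse_soft_lockups` and `count_by_thread` are made up for illustration, and it assumes the log lines have already been collected as text (for example from dmesg or /var/log/messages).

```python
import re
from collections import Counter

# Matches kernel lines of the form
#   BUG: soft lockup - CPU#13 stuck for 23s! [ll_ost_io01_027:24071]
# The thread name may itself contain colons, so the PID is taken as the
# number after the LAST colon inside the brackets.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.*):(?P<pid>\d+)\]"
)


def parse_soft_lockups(lines):
    """Return one dict per soft-lockup line; other lines are ignored."""
    events = []
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            events.append({
                "cpu": int(m["cpu"]),
                "secs": int(m["secs"]),
                "comm": m["comm"],
                "pid": int(m["pid"]),
            })
    return events


def count_by_thread(events):
    """Count how often each (thread name, PID) pair was reported as stuck."""
    return Counter((e["comm"], e["pid"]) for e in events)


# The only input used here is the message quoted in this post.
sample = [
    "NMI watchdog: BUG: soft lockup - CPU#13 stuck for 23s! "
    "[ll_ost_io01_027:24071]",
]
for event in parse_soft_lockups(sample):
    print(event)
```

Counting by thread name and PID shows whether the same ll_ost_io thread is reported repeatedly or whether the lockups move between threads.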

Our environment:
CentOS-7.7 (recent kernels 3.10.0-1062.12.1 or 3.10.0-1062.18.1)
lustre-2.12.4 on zfs-0.7.13
single-rail Omnipath network (mixed MPI and Lustre traffic)
same behaviour with the in-kernel Omnipath stack and the Intel stack (10.10.1.0.36)

At the time of these kernel ll_ost_io messages, the Omnipath interface
of the failing OSD is no longer able to ping (outgoing or incoming).

What I have already done is reduce the number of ost_io threads stepwise
from 132 down to 40 (the server has 32 CPU cores):
lctl set_param ost.OSS.ost_io.threads_max=40
Then I switched between kernels 3.10.0-1062.12.1 and 3.10.0-1062.18.1
and between the in-kernel and the Intel Omnipath driver.

It is not clear to me whether the failing Lustre takes down the Omnipath
interface of the server or the other way round. The Intel Omnipath utilities
(opatop, fmgui) do not show problems in the network (or we did not find them).

Other parameters for the OSDs are:
# cat /etc/modprobe.d/lustre.conf
options lnet networks="o2ib0(ib0)"
options ptlrpc at_min=40 at_max=400 ldlm_enqueue_min=260

# cat /etc/modprobe.d/hfi1.conf
options hfi1 krcvqs=8 piothreshold=0 sge_copy_mode=2 wss_threshold=70 rcvhdrcnt=4096 cap_mask=0x4c09a01cbba

# lctl get_param '*.*.*.threads_max' timeout '*.*.*.timeout'
ldlm.services.ldlm_canceld.threads_max=128
ldlm.services.ldlm_cbd.threads_max=128
ost.OSS.ost.threads_max=132
ost.OSS.ost_create.threads_max=24
ost.OSS.ost_io.threads_max=40
ost.OSS.ost_out.threads_max=24
ost.OSS.ost_seq.threads_max=24
timeout=100
osd-zfs.scratch-OST0000.quota_slave.timeout=50
osd-zfs.scratch-OST0000.quota_slave_dt.timeout=50
... (for all six OSTs)
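
To check that all servers really run with the same settings, the output above can be saved per server and compared. The following is a rough sketch under that assumption; `parse_params` and `diff_params` are hypothetical helper names, and the script only reads saved `key=value` text, it does not call lctl itself.

```python
def parse_params(text):
    """Turn 'name=value' lines (as printed by lctl get_param) into a dict.

    Lines without '=' (prompts, comments, '...' placeholders) are skipped.
    Purely numeric values are converted to int, everything else stays a string.
    """
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        params[key.strip()] = int(value) if value.isdigit() else value
    return params


def diff_params(a, b):
    """Return {name: (value_in_a, value_in_b)} for every differing setting."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Values taken from this post: ost_io threads_max was 132 before the change
# and is 40 now; timeout is unchanged.
before = parse_params("ost.OSS.ost_io.threads_max=132\ntimeout=100\n")
after = parse_params("ost.OSS.ost_io.threads_max=40\ntimeout=100\n")
print(diff_params(before, after))
```

The same comparison works between two servers instead of before and after a change, which makes a setting that was applied on only some of the machines easy to spot.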

Any hints?
Bernd Melchers

-- 
Archiv- und Backup-Service | fab-service at zedat.fu-berlin.de
Freie Universität Berlin   |