[lustre-discuss] OST server seems overloaded ?

Tung-Han Hsieh thhsieh at twcp1.phys.ntu.edu.tw
Sat Jul 4 20:32:36 PDT 2020


Dear All,

One of our Lustre OST servers continuously shows the following
error messages in dmesg:

==========================================================================
LNet: Service thread pid 51988 was inactive for 200.44s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
LNet: Service thread pid 63055 completed after 308.42s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
LNet: Service thread pid 55541 was inactive for 232.30s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 55541, comm: ll_ost_io01_100 3.12.72 #7 SMP Sun Feb 10 17:06:08 CST 2019
Call Trace:
 [<ffffffffa312f1b5>] cv_wait_common+0x95/0x110 [spl]
 [<ffffffffa312f263>] __cv_wait_io+0x13/0x20 [spl]
 [<ffffffffa32ce9b3>] zio_wait+0x113/0x1b0 [zfs]
 [<ffffffffa32210ac>] dmu_buf_hold_array_by_dnode+0x14c/0x4d0 [zfs]
 [<ffffffffa3221494>] dmu_buf_hold_array_by_bonus+0x64/0x80 [zfs]
 [<ffffffffa0377e71>] osd_bufs_get+0x3d1/0xc80 [osd_zfs]
 [<ffffffffa05687dd>] ofd_preprw+0x7dd/0x2000 [ofd]
 [<ffffffffa01c5659>] tgt_brw_read+0x5c9/0x1fb0 [ptlrpc]
 [<ffffffffa01c34e2>] tgt_request_handle+0x762/0x15f0 [ptlrpc]
 [<ffffffffa016de6e>] ptlrpc_main+0xfbe/0x2b30 [ptlrpc]
 [<ffffffff810614fe>] kthread+0xce/0xe0
 [<ffffffff814cced8>] ret_from_fork+0x58/0x90
 [<ffffffffffffffff>] 0xffffffffffffffff
==========================================================================
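
To see how often these watchdog messages occur and how long the stalls last, the dmesg output could be summarized with a short script. The following is only a minimal sketch in Python, assuming the log has been saved as plain text. The function names `parse_watchdog` and `summarize` are made up for illustration, the sample lines are abbreviated copies of the messages quoted above, and the regular expression covers only the two message forms shown there.

```python
import re

# Matches the two watchdog message forms quoted above:
#   "Service thread pid N was inactive for X.XXs"
#   "Service thread pid N completed after X.XXs"
PATTERN = re.compile(
    r"Service thread pid (?P<pid>\d+) "
    r"(?P<event>was inactive for|completed after) "
    r"(?P<seconds>\d+(?:\.\d+)?)s"
)


def parse_watchdog(text):
    """Return a list of (pid, event, seconds) tuples found in dmesg text."""
    events = []
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:
            kind = "inactive" if match.group("event").startswith("was") else "completed"
            events.append((int(match.group("pid")), kind, float(match.group("seconds"))))
    return events


def summarize(events):
    """Map each event kind to (count, longest duration in seconds)."""
    summary = {}
    for _pid, kind, seconds in events:
        count, longest = summary.get(kind, (0, 0.0))
        summary[kind] = (count + 1, max(longest, seconds))
    return summary


# Abbreviated copies of the log lines quoted above (tails cut off with "...").
SAMPLE = """\
LNet: Service thread pid 51988 was inactive for 200.44s. Watchdog ...
LNet: Service thread pid 63055 completed after 308.42s. This indicates ...
LNet: Service thread pid 55541 was inactive for 232.30s. The thread might be hung ...
"""

if __name__ == "__main__":
    for kind, (count, longest) in summarize(parse_watchdog(SAMPLE)).items():
        print(f"{kind}: {count} event(s), longest {longest:.2f}s")
```

On a real server the text would come from a saved dmesg dump instead of `SAMPLE`; the counts alone do not say whether the disks, the fiber link, or the thread count is the bottleneck.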

This OST server runs Lustre-2.10.7 with a ZFS backend. It is connected
to an external storage unit through one 8G/s fiber link. The external
storage is an Infortrend DS1016 with 24 bays, configured as RAID6 + 1 hot
spare. The storage holds a single 113TB partition formatted with ZFS.
The OST server serves 44 computing nodes; each node has 12 - 32 cores
and is usually fully loaded. The OST server has the following hardware spec:

- CPU: Intel Xeon Silver 4214, 2.2GHz, dual CPU, 24 cores in total.
- RAM: 128GB
- Infiniband FDR for internal cluster communication.

and every computing node and the MDT server has an InfiniBand connection.

We are wondering whether this OST server, together with the external
storage, is really overloaded. If so, what could we do to improve the
situation?
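
One rough way to frame the question is to divide the nominal storage link rate among the clients. The sketch below is only a back-of-the-envelope estimate under two assumptions that are not confirmed here: that "8G/s" means 8 Gbit/s Fibre Channel with roughly 800 MB/s of usable payload, and that all 44 computing nodes do I/O at the same time. It says nothing about what the RAID6 array or ZFS actually delivers; the names used are illustrative only.

```python
# Assumption: an 8 Gbit/s Fibre Channel link carries roughly 800 MB/s of payload.
FC_USABLE_MB_PER_S = 800.0

# From the description above: 44 computing nodes share this one OST server.
CLIENT_NODES = 44


def per_node_share(total_mb_per_s, nodes):
    """Evenly split a link's throughput across concurrently active nodes."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return total_mb_per_s / nodes


if __name__ == "__main__":
    share = per_node_share(FC_USABLE_MB_PER_S, CLIENT_NODES)
    print(f"Nominal share per node if all {CLIENT_NODES} are active: {share:.1f} MB/s")
```

If the measured throughput on the storage link is already close to the assumed limit while the watchdog messages appear, the single fiber link would be a plausible bottleneck; if it is far below, the disks or the ZFS configuration would be the more likely place to look.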

Thanks very much for your kind suggestions.

Best Regards,

T.H.Hsieh

