[lustre-discuss] High MDS load

Thu May 28 09:26:32 PDT 2020

>    I have 2 MDSs and periodically the load on one of them (either one,
>    at different times) peaks above 300, causing the file system to
>    basically stop.
>    This lasts for a few minutes and then goes away.  We can't identify any
>    one user running jobs at the times we see this, so it's hard to
>    pinpoint this on a user doing something to cause it.   Could anyone
>    point me in the direction of how to begin debugging this?  Any help is
>    greatly appreciated.

I am not able to solve this problem, but...
We saw this behaviour (Lustre 2.12.3 and 2.12.4) together with BUG
messages from Lustre kernel threads in the kernel log (dmesg output).
If I remember correctly, these were ll_ost_io threads on the OSS, with
different messages on the MDS. At that point the Omni-Path interface
was no longer pingable. We could not tell which failed first, Omni-Path
or the Lustre parts of the kernel. You could check whether your MDSs
are pingable from your clients (over the network interface that your
Lustre installation uses). If they are not, a high load is to be
expected, because the Lustre I/O threads cannot satisfy requests.
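
The two checks above (is the MDS reachable over LNet, and does the
kernel log contain BUG messages) could be scripted roughly as follows.
This is only a sketch, not something we ran in this form: the function
names and the "BUG:" pattern are my own assumptions, and the NID you
pass in has to be the real LNet NID of your MDS. The only Lustre
command used is `lctl ping`.

```python
import re
import subprocess
import sys

# Assumption: the kernel messages of interest contain the literal "BUG:"
# marker (as soft-lockup reports do). Adjust the pattern to your logs.
BUG_RE = re.compile(r"\bBUG:")


def find_bug_lines(log_lines):
    """Return the kernel log lines that contain a BUG: marker."""
    return [line.rstrip("\n") for line in log_lines if BUG_RE.search(line)]


def lnet_ping(nid, timeout=10):
    """Return True if `lctl ping <nid>` exits successfully within timeout."""
    try:
        result = subprocess.run(
            ["lctl", "ping", nid],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # lctl is not installed, or the ping hung: treat as unreachable.
        return False
    return result.returncode == 0


if __name__ == "__main__" and len(sys.argv) > 1:
    # Run on a client; the first argument is the MDS NID.
    nid = sys.argv[1]
    print(f"{nid} reachable over LNet: {lnet_ping(nid)}")
    try:
        dmesg = subprocess.run(["dmesg"], capture_output=True, text=True)
        for line in find_bug_lines(dmesg.stdout.splitlines()):
            print(line)
    except OSError:
        print("could not run dmesg")
```

Run it from a client during one of the load peaks, with the MDS NID as
the only argument, and compare the result with a run at a quiet time.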

Kind regards
Bernd Melchers

-- 
Archiv- und Backup-Service | fab-service at zedat.fu-berlin.de
Freie Universität Berlin   | Tel. +49-30-838-55905