[lustre-discuss] High MDS load

Carlson, Timothy S Timothy.Carlson at pnnl.gov
Thu May 28 09:31:44 PDT 2020


Since some mailers don't like attachments, I'll just paste in the script we use here.  

I call the script with

./parse.sh | sort -k3 -n

You just need to change out the name of your MDT in two places.
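
If you are not sure of the exact MDT device name on your MDS, listing the
proc directory (or asking lctl for its device list) should show it; the
lzfs-MDT0000 used below is just the name of our MDT:

ls /proc/fs/lustre/mdt/
lctl dl | grep -i mdt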

#!/bin/bash
# Snapshot per-client (per-export) RPC stats for the MDT and all OSTs on this
# server.  Change lzfs-MDT0000 below (two places) to match your MDT.
set -e
SLEEP=10

# Reset the per-export counters under the given exports directory.
stats_clear()
{
        cd "$1"
        echo clear >clear
}

# Print per-export stats, skipping idle clients and ping-only traffic.
stats_print()
{
        cd "$1"
        echo "===================== $1 ============================"
        for i in *; do
                [ -d "$i" ] || continue
                out=`grep -v "snapshot_time" "${i}/stats" | grep -v "ping" || true`
                [ -n "$out" ] || continue
                # $out is deliberately unquoted so each client's stats collapse
                # onto a single line, which is what the sort keys on.
                echo "$i" $out
        done
        echo "============================================================================================="
        echo
}

# Clear the counters, wait, then print what accumulated in that window.
for i in /proc/fs/lustre/mdt/lzfs-MDT0000 /proc/fs/lustre/obdfilter/*OST*; do
        dir="${i}/exports"
        [ -d "$dir" ] || continue
        stats_clear "$dir"
done
echo "Waiting ${SLEEP}s after clearing stats"
sleep $SLEEP

for i in /proc/fs/lustre/mdt/lzfs-MDT0000 /proc/fs/lustre/obdfilter/*OST*; do
        dir="${i}/exports"
        [ -d "$dir" ] || continue
        stats_print "$dir"
done
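
The trailing "sort -k3 -n" just orders the flattened per-client lines
numerically on their first counter, so the busiest clients end up at the
bottom of the output.

On systems where these files have moved out of /proc, roughly the same
per-export counters should be readable through lctl; the parameter patterns
below should match the standard per-export stats names, with the MDT name
adjusted to yours:

lctl get_param mdt.lzfs-MDT0000.exports.*.stats
lctl get_param obdfilter.*OST*.exports.*.stats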




On 5/28/20, 9:28 AM, "lustre-discuss on behalf of Bernd Melchers" <lustre-discuss-bounces at lists.lustre.org on behalf of melchers at zedat.fu-berlin.de> wrote:

    >    I have 2 MDSs and periodically the load on one of them (either at one
    >    time or another) peaks above 300, causing the file system to basically
    >    stop.  This lasts for a few minutes and then goes away.  We can't
    >    identify any one user running jobs at the times we see this, so it's
    >    hard to pinpoint this on a user doing something to cause it.  Could
    >    anyone point me in the direction of how to begin debugging this?  Any
    >    help is greatly appreciated.

    I am not able to solve this problem, but...
    We saw this behaviour (Lustre 2.12.3 and 2.12.4) together with Lustre
    kernel thread BUG messages in the kernel log (dmesg output); if I
    remember correctly they came from the ll_ost_io threads on the OSSs,
    with different messages on the MDSs.  At that time the Omni-Path
    interface was no longer pingable.  We were not able to say which
    crashed first, Omni-Path or the Lustre parts of the kernel.  Perhaps
    you can check whether your MDSs are pingable from your clients (over
    the network interface used by your Lustre installation).  Otherwise it
    is to be expected that you get a high load, because your Lustre I/O
    threads cannot satisfy requests.
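
    For what it's worth, a quick way to test reachability over the Lustre
    network itself (rather than plain ICMP) is an LNet ping from a client
    to the server's NID; the NID below is only a placeholder:

        lctl ping 10.0.0.1@o2ib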

    Kind regards
    Bernd Melchers

    -- 
    Archiv- und Backup-Service | fab-service at zedat.fu-berlin.de
    Freie Universität Berlin   | Tel. +49-30-838-55905
    _______________________________________________
    lustre-discuss mailing list
    lustre-discuss at lists.lustre.org
    http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


