[lustre-discuss] High MDS load
Carlson, Timothy S
Timothy.Carlson at pnnl.gov
Thu May 28 09:31:44 PDT 2020
Since some mailers don't like attachments, I'll just paste in the script we use here.
I call the script with
./parse.sh | sort -k3 -n
You just need to change out the name of your MDT in two places.
#!/bin/bash
# Clear the per-client export stats on the MDT and all OSTs, wait a few
# seconds, then print what accumulated, so the busiest clients stand out.
# Change lzfs-MDT0000 (in two places below) to match your MDT name.
set -e
SLEEP=10
stats_clear()
{
cd "$1"
# Writing "clear" into the exports "clear" file resets all client stats.
echo clear >clear
}
stats_print()
{
cd "$1"
echo "===================== $1 ============================"
for i in *; do
[ -d "$i" ] || continue
# Skip the snapshot timestamp and ping RPCs; the trailing "|| true"
# keeps set -e from aborting when a client has no other stats.
out=`cat "${i}/stats" | grep -v "snapshot_time" | grep -v "ping" || true`
[ -n "$out" ] || continue
# $out is deliberately unquoted here so each client's stats are
# flattened onto a single line, which makes the output sortable.
echo $i $out
done
echo "============================================================================================="
echo
}
for i in /proc/fs/lustre/mdt/lzfs-MDT0000 /proc/fs/lustre/obdfilter/*OST*; do
dir="${i}/exports"
[ -d "$dir" ] || continue
stats_clear "$dir"
done
echo "Waiting ${SLEEP}s after clearing stats"
sleep $SLEEP
for i in /proc/fs/lustre/mdt/lzfs-MDT0000 /proc/fs/lustre/obdfilter/*OST*; do
dir="${i}/exports"
[ -d "$dir" ] || continue
stats_print "$dir"
done
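For reference, the third column that the `sort -k3 -n` pipeline keys on is the per-client request count. A minimal sketch of summing those counts per client NID with awk; the sample stats lines below are made up, only the column layout mirrors the script's output:

```shell
# Sum the request counts (column 3) per client NID (column 1).
# The three input lines are a hypothetical sample of the flattened output.
awk '{ total[$1] += $3 } END { for (n in total) print n, total[n] }' <<'EOF'
10.1.0.5@o2ib open 120 samples [reqs]
10.1.0.5@o2ib close 118 samples [reqs]
10.1.0.7@o2ib getattr 4520 samples [reqs]
EOF
```

This prints one total per NID (order unspecified), e.g. 238 for 10.1.0.5@o2ib in the sample above, which is often enough to spot the one client hammering the MDS.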
On 5/28/20, 9:28 AM, "lustre-discuss on behalf of Bernd Melchers" <melchers at zedat.fu-berlin.de> wrote:
> I have 2 MDSs and periodically on one of them (either at one time or
> another) peak above 300, causing the file system to basically stop.
> This lasts for a few minutes and then goes away. We can't identify any
> one user running jobs at the times we see this, so it's hard to
> pinpoint this on a user doing something to cause it. Could anyone
> point me in the direction of how to begin debugging this? Any help is
> greatly appreciated.
I am not able to solve this problem, but...
We saw this behaviour (Lustre 2.12.3 and 2.12.4) together with BUG messages from Lustre kernel threads in the kernel log (dmesg output): if I remember correctly, from the ll_ost_io threads on the OSS nodes, with different messages on the MDS. At the same time, the Omni-Path interface was no longer pingable. We could not tell which crashed first, the Omni-Path interface or the Lustre parts of the kernel. Perhaps you can check whether your MDSs are pingable from your clients (over the network interface used by your Lustre installation). If they are not, high load is to be expected, because your Lustre I/O threads cannot satisfy requests.
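A sketch of the connectivity check suggested above, run from a client while the load spike is occurring. The NID and address shown are placeholders, not from this thread; you can list your server's actual NIDs with `lctl list_nids` on the MDS:

```shell
# Ping the MDS over LNet (the Lustre network layer), not plain ICMP.
# 10.1.0.1@o2ib is a made-up example NID; substitute your MDS's NID.
lctl ping 10.1.0.1@o2ib

# An ordinary ICMP ping of the MDS's Lustre-facing interface is also
# a quick sanity check (address is again a placeholder):
ping -c 3 10.1.0.1
```

If `lctl ping` hangs or fails during the spike while the node is otherwise up, that points at the fabric (Omni-Path in the case described above) rather than at the MDS itself.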
Kind regards,
Bernd Melchers
--
Archiv- und Backup-Service | fab-service at zedat.fu-berlin.de
Freie Universität Berlin | Tel. +49-30-838-55905
_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org