[lustre-discuss] High MDS load

Chad DeWitt ccdewitt at uncc.edu
Thu May 28 09:32:59 PDT 2020


Hi Heath,

Hope you're doing well!

Your mileage may vary (and quite frankly, there may be better approaches),
but this is a quick and dirty set of steps to find which client is issuing
a large number of metadata operations:


   - Log into the affected MDS.


   - Change into the exports directory.

cd /proc/fs/lustre/mdt/<Your affected MDT>/exports/


   - OPTIONAL: Set all your stats to zero and clear out stale clients. (If
   you don't want to do this step, you don't really have to, but it does make
   it easier to see the stats if you are starting with a clean slate. In fact,
   you may want to skip this the first time through and just look for high
   numbers. If a particular client is the source of the issue, the stats
   should clearly be higher for that client when compared to the others.)

echo "C" > clear


   - Wait for a few seconds and dump the stats.

for client in */ ; do echo && echo && echo "${client}" && cat "${client}stats" && echo ; done


You'll get a listing of stats for each mounted client like so:

open                      278676 samples [reqs]
close                     278629 samples [reqs]
mknod                     2320 samples [reqs]
unlink                    495 samples [reqs]
mkdir                     575 samples [reqs]
rename                    1534 samples [reqs]
getattr                   277552 samples [reqs]
setattr                   550 samples [reqs]
getxattr                  2742 samples [reqs]
statfs                    350058 samples [reqs]
samedir_rename            1534 samples [reqs]


(Don't worry if some of the clients return what appear to be empty stats.
That just means they have the file system mounted, but have not yet
performed any metadata operations.) From this data, you are looking for
unusually high sample counts. The client with the highest counts is usually
the culprit. For the example client stats above, I would look to see which
process(es) on this client are listing, opening, and then closing files in
Lustre... The advantage of this method is that you see exactly which
metadata operations are occurring. (I know there are also various utilities
included with Lustre that may give this information as well, but I just go
to the source.)
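
As a rough sketch, the comparison between clients could be automated by
summing the sample counts in each client's stats file and sorting the
totals. The helper name rank_clients is hypothetical (it is not a Lustre
utility), and the sketch assumes each stats line has the form
"<operation> <count> samples [reqs]" as in the listing above; lines in any
other form are ignored.

```shell
# Sketch: rank clients by total metadata requests.
# rank_clients is a hypothetical helper name, not part of Lustre.
# Argument: the exports directory (defaults to the current directory).
# The body runs in a subshell so no variables leak into the caller.
rank_clients() (
    exports_dir=${1:-.}
    for stats in "${exports_dir}"/*/stats; do
        [ -r "${stats}" ] || continue
        client=$(basename "$(dirname "${stats}")")
        # Sum only lines whose third field is "samples"; an empty
        # stats file yields a total of 0.
        total=$(awk '$3 == "samples" { sum += $2 } END { print sum + 0 }' "${stats}")
        printf '%s %s\n' "${total}" "${client}"
    done | sort -rn
)
```

Run from inside the exports directory as "rank_clients | head", the busiest
clients should appear at the top; the per-operation breakdown for a given
client is still in its stats file.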

Once you find the client, you can use various commands, such as mount and
lsof to get a better understanding of what may be hitting Lustre.
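
A minimal sketch of that follow-up on the suspect client: list the Lustre
mounts, then count open files per process from lsof output. The helper name
summarize_lsof and the mount point /mnt/lustre are assumptions made for
illustration; substitute whatever "mount -t lustre" reports on your client.

```shell
# Sketch: summarize open files per process from lsof output.
# summarize_lsof is a hypothetical helper; it reads default lsof output
# on stdin (header line first, COMMAND and PID in the first two columns).
summarize_lsof() {
    awk 'NR > 1 { count[$1 " (pid " $2 ")"]++ }
         END { for (p in count) print count[p], p }' | sort -rn
}

# Example use on the suspect client (the mount point is an assumption):
#   mount -t lustre                          # list mounted Lustre file systems
#   lsof /mnt/lustre | summarize_lsof | head # processes with the most open files
```

This only shows files that are open at the moment lsof runs, so a process
that opens and closes files very quickly may need several samples to show up.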

Some of the more common issues I've found that can cause a high MDS load:

   - Listing a directory containing a large number of files. (Instead, unalias
   ls or better yet, use lfs find.)
   - Removing many files.
   - Opening and closing many files. (It may be better to move the data over
   to another file system, such as XFS. We keep some of our deep learning
   data off Lustre because of the sheer number of small files.)
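
To illustrate the first point: ls is commonly aliased to something like
"ls --color=auto", which has to look up attributes for each entry in order
to pick a color, so a plain listing of names tends to be lighter on the MDS.
The sketch below uses a hypothetical helper, list_files, that calls
"lfs find" when the Lustre client tools are installed and falls back to
plain find otherwise.

```shell
# Sketch: list the regular files directly inside a directory without
# going through an aliased "ls".
# list_files is a hypothetical helper name, not part of Lustre.
list_files() {
    dir=$1
    if command -v lfs >/dev/null 2>&1; then
        # Lustre client tools are present: use lfs find.
        lfs find "${dir}" --maxdepth 1 --type f
    else
        # Fallback for hosts without the Lustre client tools.
        find "${dir}" -maxdepth 1 -type f
    fi
}

# Other ways to sidestep the alias in an interactive shell:
#   unalias ls       # remove the alias for this session
#   command ls DIR   # bypass the alias for a single invocation
```

For a quick count of entries, "list_files DIR | wc -l" avoids printing a
very long listing to the terminal.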

Of course the actual mitigation of the load depends on what the user is
attempting to do...

I hope this helps...

Cheers,
Chad

------------------------------------------------------------

Chad DeWitt, CISSP

UNC Charlotte | ITS – University Research Computing

ccdewitt at uncc.edu | www.uncc.edu

------------------------------------------------------------




On Thu, May 28, 2020 at 11:37 AM Peeples, Heath <heathp at hpc.msstate.edu>
wrote:

> I have 2 MDSs and periodically on one of them (either at one time or
> another) peak above 300, causing the file system to basically stop.  This
> lasts for a few minutes and then goes away.  We can’t identify any one user
> running jobs at the times we see this, so it’s hard to pinpoint this on a
> user doing something to cause it.   Could anyone point me in the direction
> of how to begin debugging this?  Any help is greatly appreciated.
>
>
>
> Heath