[lustre-discuss] High MDS load
BOUYER, QUENTIN
quentin at bouyer.me
Mon Jun 8 01:00:09 PDT 2020
Hi,
You could also try this:
https://github.com/quentinbouyer/topmdt
On 28/05/2020 at 18:32, Chad DeWitt wrote:
> Hi Heath,
>
> Hope you're doing well!
>
> Your mileage may vary (and quite frankly, there may be better
> approaches), but this is a quick and dirty set of steps to find which
> client is issuing a large number of metadata operations:
>
> * Log into the affected MDS.
>
> * Change into the exports directory.
>
> cd /proc/fs/lustre/mdt/<Your affected MDT>/exports/
>
> * OPTIONAL: Set all your stats to zero and clear out stale
> clients. (If you don't want to do this step, you don't really
> have to, but it does make it easier to see the stats if you
> are starting with a clean slate. In fact, you may want to skip
> this the first time through and just look for high numbers. If
> a particular client is the source of the issue, the stats
> should clearly be higher for that client when compared to the
> others.)
>
> echo "C" > clear
>
> * Wait for a few seconds and dump the stats.
>
> for client in */ ; do
>     echo; echo; echo "${client}"
>     cat "${client}stats"; echo
> done
>
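If you would rather not clear the counters, a non-destructive variant of the steps above is to take two snapshots a few seconds apart and print only what changed. This is a rough sketch, not a tested tool: the function names snapshot_stats and diff_snapshots are hypothetical, and it assumes every counter line has the form "open 278676 samples [reqs]".

```shell
# Sketch only: per-client metadata operation deltas without clearing stats.
# Assumes counter lines of the form "<op> <count> samples [reqs]".

# Print "client operation count" for every readable stats file under $1.
snapshot_stats() {
    for f in "$1"/*/stats; do
        [ -r "$f" ] || continue
        client=$(basename "$(dirname "$f")")
        # Lines such as snapshot_time do not have "samples" in field 3.
        awk -v c="$client" '$3 == "samples" { print c, $1, $2 }' "$f"
    done
}

# Compare two snapshot files; print "delta client operation" for every
# counter that grew, largest delta first.
diff_snapshots() {
    awk 'FILENAME == ARGV[1] { before[$1 " " $2] = $3; next }
         { d = $3 - before[$1 " " $2]
           if (d > 0) printf "%.0f %s %s\n", d, $1, $2 }' "$1" "$2" |
        sort -rn
}
```

Typical use, run from the exports directory: snapshot_stats . > /tmp/before; sleep 10; snapshot_stats . > /tmp/after; diff_snapshots /tmp/before /tmp/after | head
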
>
> You'll get a listing of stats for each mounted client like so:
>
> open 278676 samples [reqs]
> close 278629 samples [reqs]
> mknod 2320 samples [reqs]
> unlink 495 samples [reqs]
> mkdir 575 samples [reqs]
> rename 1534 samples [reqs]
> getattr 277552 samples [reqs]
> setattr 550 samples [reqs]
> getxattr 2742 samples [reqs]
> statfs 350058 samples [reqs]
> samedir_rename 1534 samples [reqs]
>
>
> (Don't worry if some of the clients give back what appears to be empty
> stats. That just means they are mounted, but have not yet performed
> any metadata operations.) From this data, you are looking for any
> "high" samples. The client with the high samples is usually the
> culprit. For the example client stats above, I would look to see what
> process(es) on this client are listing, opening, and then closing files
> in Lustre... The advantage with this method is you are seeing exactly
> which metadata operations are occurring. (I know there are also
> various utilities included with Lustre that may give this information
> as well, but I just go to the source.)
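With many clients mounted, scanning the dump by eye gets tedious. One possible shortcut is to total the samples per client and sort, so the busiest client comes first. This is a minimal sketch under the same assumption about the stats line format shown above; rank_clients is a hypothetical name, not a Lustre utility.

```shell
# Sketch only: rank clients by their total number of metadata operations.
# $1 is the exports directory; each client subdirectory holds a stats file.
rank_clients() {
    for f in "$1"/*/stats; do
        [ -r "$f" ] || continue
        client=$(basename "$(dirname "$f")")
        # Sum the sample counts; a client with empty stats totals 0.
        awk -v c="$client" '$3 == "samples" { total += $2 }
                            END { printf "%.0f %s\n", total, c }' "$f"
    done | sort -rn
}
```

For example, from the exports directory: rank_clients . | head -n 5. A total hides which operations dominate, so it is still worth reading the full stats of whichever client ends up on top.
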
>
> Once you find the client, you can use various commands, such as mount
> and lsof to get a better understanding of what may be hitting Lustre.
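On the client side, one way to use lsof for this is to count open files per process. The sketch below only summarizes lsof output read from standard input; summarize_lsof is a hypothetical helper, and /mnt/lustre in the usage line is an assumed mount point (mount -t lustre shows the real one).

```shell
# Sketch only: count open files per process from lsof output on stdin.
# Expects the default lsof layout, where column 1 is COMMAND and column 2
# is PID, preceded by a single header row.
summarize_lsof() {
    awk 'NR > 1 { print $1, $2 }' | sort | uniq -c | sort -rn
}
```

Usage on the suspect client: lsof /mnt/lustre | summarize_lsof | head. Keep in mind that lsof only reports files that are open at that instant, so a job that opens and closes files very quickly may barely show up.
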
>
> Some of the more common issues I've found that can cause a high MDS load:
>
> * Listing a directory containing a large number of files. (Instead,
> unalias ls or, better yet, use lfs find.)
> * Removing a large number of files.
> * Opening and closing many files. (It may be better to move the data to
> another file system, such as XFS. We keep some of our deep
> learning off Lustre because of the sheer number of small files.)
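As an illustration of the first item, here is a small sketch that counts the regular files in a directory with lfs find rather than an aliased ls. The count_files name is hypothetical, and the fallback to plain find when lfs is not installed is my own addition so the function also works off Lustre.

```shell
# Sketch only: count regular files directly inside a directory.
# Uses lfs find when the lfs command is available, plain find otherwise.
# File names containing newlines would be counted more than once.
count_files() {
    dir=$1
    if command -v lfs >/dev/null 2>&1; then
        lfs find "$dir" --maxdepth 1 --type f | wc -l | tr -d ' '
    else
        find "$dir" -maxdepth 1 -type f | wc -l | tr -d ' '
    fi
}
```

Usage: count_files /path/to/big/directory. If a plain listing is really needed, running command ls (which bypasses any alias) should at least avoid the extra per-file lookups that an alias such as ls --color=auto tends to trigger.
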
>
> Of course the actual mitigation of the load depends on what the user
> is attempting to do...
>
> I hope this helps...
>
> Cheers,
> Chad
>
> ------------------------------------------------------------
>
> Chad DeWitt, CISSP
>
> UNC Charlotte | ITS – University Research Computing
>
> ccdewitt at uncc.edu | www.uncc.edu
>
> ------------------------------------------------------------
>
>
> On Thu, May 28, 2020 at 11:37 AM Peeples, Heath
> <heathp at hpc.msstate.edu> wrote:
>
> I have 2 MDSs, and periodically the load on one of them (either one,
> at one time or another) peaks above 300, causing the file system to
> basically stop. This lasts for a few minutes and then goes away. We can’t
> identify any one user running jobs at the times we see this, so
> it’s hard to pinpoint this on a user doing something to cause it.
> Could anyone point me in the direction of how to begin debugging
> this? Any help is greatly appreciated.
>
> Heath
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> <mailto:lustre-discuss at lists.lustre.org>
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org