[lustre-discuss] High MDS load

BOUYER, QUENTIN quentin at bouyer.me
Mon Jun 8 01:00:09 PDT 2020


Hi,

Maybe you can also try this:

https://github.com/quentinbouyer/topmdt

On 28/05/2020 at 18:32, Chad DeWitt wrote:
> Hi Heath,
>
> Hope you're doing well!
>
> Your mileage may vary (and quite frankly, there may be better 
> approaches), but this is a quick and dirty set of steps to find which 
> client is issuing a large number of metadata operations:
>
>       * Log into the affected MDS.
>
>       * Change into the exports directory.
>
>         cd /proc/fs/lustre/mdt/<Your affected MDT>/exports/
>
>       * OPTIONAL: Set all your stats to zero and clear out stale
>         clients. (If you don't want to do this step, you don't really
>         have to, but it does make it easier to see the stats if you
>         are starting with a clean slate. In fact, you may want to skip
>         this the first time through and just look for high numbers. If
>         a particular client is the source of the issue, the stats
>         should clearly be higher for that client when compared to the
>         others.)
>
>         echo "C" > clear
>
>       * Wait for a few seconds and dump the stats.
>
>         for client in $( ls -d */ ) ; do
>             echo && echo && echo ${client} && cat ${client}/stats && echo
>         done
>
>
> You'll get a listing of stats for each mounted client like so:
>
>     open  278676 samples [reqs]
>     close 278629 samples [reqs]
>     mknod 2320 samples [reqs]
>     unlink  495 samples [reqs]
>     mkdir 575 samples [reqs]
>     rename  1534 samples [reqs]
>     getattr 277552 samples [reqs]
>     setattr 550 samples [reqs]
>     getxattr  2742 samples [reqs]
>     statfs  350058 samples [reqs]
>     samedir_rename  1534 samples [reqs]
>
>
> (Don't worry if some of the clients give back what appears to be empty 
> stats. That just means they are mounted, but have not yet performed 
> any metadata operations.) From this data, you are looking for any 
> "high" samples.  The client with the high samples is usually the 
> culprit.  For the example client stats above, I would look to see what 
> process(es) on this client are listing, opening, and then closing files 
> in Lustre... The advantage of this method is that you see exactly 
> which metadata operations are occurring. (I know there are also 
> various utilities included with Lustre that may give this information 
> as well, but I just go to the source.)
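>
> If it helps, here is a minimal sketch that automates that comparison.
> The MDT placeholder and the awk field positions are assumptions based
> on the stats layout shown above, so adjust as needed:
>
>     cd /proc/fs/lustre/mdt/<Your affected MDT>/exports/
>     # Sum the "samples" column of every client's stats file and print
>     # the totals, busiest client first.
>     for client in $( ls -d */ ) ; do
>         total=$( awk '$3 == "samples" { sum += $2 } END { print sum+0 }' ${client}/stats )
>         echo "${total} ${client}"
>     done | sort -rn | head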
>
> Once you find the client, you can use various commands, such as mount 
> and lsof to get a better understanding of what may be hitting Lustre.
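>
> For example, on the suspect client (the /mnt/lustre path below is just
> a placeholder for wherever that client mounts the file system):
>
>     # Confirm where Lustre is mounted on this client...
>     mount -t lustre
>     # ...then list the processes holding files open under that mount.
>     lsof /mnt/lustre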
>
> Some of the more common issues I've found that can cause a high MDS load:
>
>   * Listing a directory containing a large number of files. (Instead,
>     unalias ls or, better yet, use lfs find; see the sketch after this
>     list.)
>   * Removing many files.
>   * Opening and closing many files. (It may be better to move the data
>     over to another file system, such as XFS, etc.  We keep some of our
>     deep learning data off Lustre because of the sheer number of small
>     files.)
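>
> As a rough illustration of the lfs find suggestion above (the directory
> path is made up; --maxdepth keeps it to a single level, similar to a
> plain ls):
>
>     # "ls -l" stats every entry, hammering the MDS on huge directories;
>     # lfs find walks the namespace much more cheaply.
>     lfs find /mnt/lustre/bigdir --maxdepth 1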
>
> Of course the actual mitigation of the load depends on what the user 
> is attempting to do...
>
> I hope this helps...
>
> Cheers,
> Chad
>
> ------------------------------------------------------------
>
> Chad DeWitt, CISSP
>
> UNC Charlotte | ITS – University Research Computing
>
> ccdewitt at uncc.edu | www.uncc.edu
>
> ------------------------------------------------------------
>
>
> On Thu, May 28, 2020 at 11:37 AM Peeples, Heath 
> <heathp at hpc.msstate.edu> wrote:
>
>     I have 2 MDSs, and periodically the load on one of them (either at
>     one time or another) peaks above 300, causing the file system to
>     basically stop.  This lasts for a few minutes and then goes away.
>     We can’t identify any one user running jobs at the times we see
>     this, so it’s hard to pinpoint this on a user doing something to
>     cause it.  Could anyone point me in the direction of how to begin
>     debugging this?  Any help is greatly appreciated.
>
>     Heath
>