[lustre-discuss] flock vs localflock

Vicker, Darby (JSC-EG311) darby.vicker-1 at nasa.gov
Thu Jul 5 14:26:36 PDT 2018


Hi everyone,

We recently saw some extremely high stat loads on our Lustre file system.  Output from “llstat -i 1 mdt” on the MDS looked like:

[root@hpfs-fsl-mds0 lustre]#
/proc/fs/lustre/mds/MDS/mdt/stats @ 1530642446.366015124
Name                      Cur.Count  Cur.Rate   #Events   Unit           last        min          avg        max    stddev
req_waittime              182102     182102     22343734858[usec]      3951261          2        38.94    3027235    897.47
req_qdepth                182103     182103     22343734859[reqs]        12528          0         0.08        571      0.29
req_active                182103     182103     22343734859[reqs]       484211          1         2.88         99      1.50
req_timeout               182104     182104     22343734860[sec]        182104          1         9.32         36     13.44
reqbuf_avail              437980     437980     55509906321[bufs]     27996863         32        63.89         65      0.47

This was driving the load average on our MDS up into the 100 to 200 range.  Surprisingly, the MDS and the file system as seen from a client were still generally responsive.  The numbers in the “Cur.Count” column are normally in the 100s to 1000s for our file system (we have ~600 Lustre clients).  The kiblnd_sd_* and ldlm_* kernel threads were the ones driving up the load.
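
(In case it's useful to anyone else, this kind of thing shows up clearly in per-thread CPU usage on the MDS, e.g. something like

    top -H -b -n 1 | grep -E 'kiblnd_sd|ldlm'

to see which kernel threads are busiest; the exact thread names will vary with the Lustre version.)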

We’ve tracked down the users causing this and identified two different workloads that were causing the problems.  One of them is fairly common for us, the other fairly infrequent.  There are a couple of things I wanted input on from the wider community.
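
(If anyone else needs to narrow something like this down, the per-client export stats on the MDS are handy, e.g. something like

    lctl get_param mdt.*.exports.*.stats

which shows the RPC counters per client NID; the exact parameter path may differ a bit between Lustre versions.)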

First, since one of the workloads is common for our lab and we haven’t seen this issue before (at least not to this extent), we think this might be related specifically to 2.10.4, which we recently updated to.  We didn’t see anything in the changelog that was obviously related, but if there are any known issues, or other groups are seeing this, that would be good to know.  We are using ZFS on both the MDT and OSTs.

Also, the ldlm processes led us to look at flock vs localflock.  On previous generations of our LFSs we used localflock, but on the current LFS we decided to try flock instead.  This LFS has been in production for a couple of years with no obvious problems due to flock, but we decided to drop back to localflock as a precaution for now.  We need to do a more controlled test, but this does seem to help.  What are other sites using for locking parameters?
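
For reference, this is just a client mount option, so switching from flock to localflock was a matter of remounting the clients with something like

    mount -t lustre -o localflock <mgsnode>@<net>:/<fsname> /mnt/<fsname>

(placeholders above for the MGS NID, file system name, and mount point).  My understanding is that localflock only enforces flock/fcntl advisory locks within a single client, while flock makes them coherent across the whole cluster via the LDLM, which is presumably why it shows up as extra work for the ldlm_* threads on the MDS.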

Thanks,
Darby