[lustre-discuss] flock vs localflock
Vicker, Darby (JSC-EG311)
darby.vicker-1 at nasa.gov
Thu Jul 5 14:26:36 PDT 2018
Hi everyone,
We recently saw some extremely high stat loads on our lustre FS. Output from “llstat -i 1 mdt” looked like:
[root@hpfs-fsl-mds0 lustre]#
/proc/fs/lustre/mds/MDS/mdt/stats @ 1530642446.366015124
Name            Cur.Count  Cur.Rate      #Events  Unit        last  min    avg      max  stddev
req_waittime       182102    182102  22343734858  [usec]   3951261    2  38.94  3027235  897.47
req_qdepth         182103    182103  22343734859  [reqs]     12528    0   0.08      571    0.29
req_active         182103    182103  22343734859  [reqs]    484211    1   2.88       99    1.50
req_timeout        182104    182104  22343734860  [sec]     182104    1   9.32       36   13.44
reqbuf_avail       437980    437980  55509906321  [bufs]  27996863   32  63.89       65    0.47
This was driving the load on our MDS up into the 100 to 200 range. Surprisingly, the MDS and the LFS as seen from a client were still generally responsive. The numbers in the “Cur.Count” column are normally in the 100s to 1000s for our file system (we have ~600 Lustre clients). The kiblnd_sd_* and ldlm_* processes were driving up the load.
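In case it helps anyone watch for the same thing, here is a rough sketch of what “llstat -i 1” is doing, assuming the usual per-target stats format (counter name followed by a cumulative event count on each line). The path is just the one from the output above and will differ per site; this is illustrative, not the actual llstat implementation.

```python
import time

# Hypothetical path, taken from the output above; adjust for your MDS.
STATS = "/proc/fs/lustre/mds/MDS/mdt/stats"

def sample(path):
    """Read one snapshot: {counter name: cumulative event count}."""
    counts = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            # Skip snapshot_time and any lines without an integer count.
            if len(fields) >= 2 and fields[1].isdigit():
                counts[fields[0]] = int(fields[1])
    return counts

def interval_counts(path=STATS, interval=1.0):
    """Difference of two snapshots, i.e. llstat's Cur.Count column."""
    before = sample(path)
    time.sleep(interval)
    after = sample(path)
    return {name: after[name] - before.get(name, 0) for name in after}
```

Polling interval_counts() in a loop and alerting when req_waittime deltas jump a couple of orders of magnitude above baseline would have flagged the episode above early.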
We’ve tracked down the users causing this. We identified two different workloads that were causing the problems: one is fairly common, the other fairly infrequent. There are a couple of things I wanted input on from the wider community.
First, since one of the workloads is common for our lab and we haven’t seen this issue before (at least not to this extent), we think this might be related specifically to 2.10.4, which we recently updated to. We didn’t see anything in the changelog that was obviously related, but if there are other known issues, or other groups seeing this, that would be good to know. We are using ZFS on both the MDT and the OSTs.
Also, the ldlm processes led us to look at flock vs. localflock. On previous generations of our LFSs we used localflock, but on the current LFS we decided to try flock instead. This LFS has been in production for a couple of years with no obvious problems due to flock, but we have dropped back to localflock as a precaution for now. We still need to do a more controlled test, but this does seem to help. What are other sites using for their locking parameters?
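For anyone following along, the difference between the two mount options is the scope of application file locks: with -o flock, flock()/fcntl locks are made coherent across all clients (mediated by the ldlm, hence the extra ldlm traffic), while -o localflock enforces them only on the node that took them. A minimal sketch of the kind of call whose semantics change — generic illustration, not tied to our workloads:

```python
import fcntl

def try_flock(fd):
    """Try a non-blocking exclusive flock; return True if acquired.

    On a Lustre client mounted with -o flock, contention on this call is
    resolved cluster-wide by the ldlm; with -o localflock it is resolved
    only among processes on the local node (as on a plain local FS).
    """
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False

def release_flock(fd):
    fcntl.flock(fd, fcntl.LOCK_UN)
```

The failure mode to weigh: under localflock, two processes on different clients can both “succeed” in locking the same file, so localflock is only safe when no application actually depends on cross-node lock coherence.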
Thanks,
Darby