[Lustre-discuss] very slow directory access (lstat taking >20s to return)

Frederik Ferner frederik.ferner at diamond.ac.uk
Fri Oct 29 01:40:33 PDT 2010


Hi List,

we occasionally have reports from our users that it takes a very long 
time to access files on our Lustre file system. In particular they 
complain that they can't see the files when doing ls on the directory, 
the ls is taking very long time to return. If we wait long enough 
everything returns to normal until it happens again.

The last time this happened, we managed to do some limited debugging and 
would like to have some advice how to debug this further and hopefully 
fix the problem.

Note that this is still on Lustre 1.6.7.2-ddn3.5 but we are going to 
upgrade to 1.8 next week. Without knowing what the problem is, we are 
not convinced the upgrade will help in this instance.

Doing a 'strace -T -e file ls -n' on one directory with about 750 files, 
while users were seeing the hanging ls, showed lstat calls taking 
seconds, up to 23s.

<snip strace extract>
lstat("164435.dat", {st_mode=S_IFREG|0660, st_size=6711, ...}) = 0 
<4.899480>
getxattr("164435.dat", "system.posix_acl_access", 0x0, 0) = 100 <0.000291>
lstat("164436.dat", {st_mode=S_IFREG|0660, st_size=7939, ...}) = 0 
<13.679477>
getxattr("164436.dat", "system.posix_acl_access", 0x0, 0) = 100 <0.000045>
lstat("164443.dat", {st_mode=S_IFREG|0660, st_size=10811, ...}) = 0 
<14.126075>
getxattr("164443.dat", "system.posix_acl_access", 0x0, 0) = 100 <0.000047>
lstat("164448.dat", {st_mode=S_IFREG|0660, st_size=10804, ...}) = 0 
<12.062267>
getxattr("164448.dat", "system.posix_acl_access", 0x0, 0) = 100 <0.000045>
lstat("164450.dat", {st_mode=S_IFREG|0660, st_size=10815, ...}) = 0 
<23.535400>
getxattr("164450.dat", "system.posix_acl_access", 0x0, 0) = 100 <0.000047>
lstat("164453.dat", {st_mode=S_IFREG|0660, st_size=10809, ...}) = 0 
<8.323233>
getxattr("164453.dat", "system.posix_acl_access", 0x0, 0) = 100 <0.000043>
lstat("164462.dat", {st_mode=S_IFREG|0660, st_size=10818, ...}) = 0 
<15.436090>
getxattr("164462.dat", "system.posix_acl_access", 0x0, 0) = 100 <0.000028>
</snip strace>

During all this time there was nothing in syslog on any of the servers, 
  MDS and client both had a load of less than 0.2. The OSSes were 
busier but did not look overloaded.

llstat -i 1 /proc/fs/lustre/mds/lustre01-MDT0000/stats showed open calls 
at a rate of about 700-1200 per second:
<sample llstat output>
/proc/fs/lustre/mds/lustre01-MDT0000/stats @ 1288260173.745081
Name                      Cur.Count  Cur.Rate   #Events
open                      904        904        19632468
close                     428        428        6264157
mknod                     0          0          20090
link                      0          0          32
unlink                    9          9          162146
mkdir                     0          0          15425
rmdir                     0          0          2357
rename                    3          3          80868
getxattr                  0          0          19051954
setxattr                  0          0          22469
process_config            0          0          2
connect                   0          0          5
reconnect                 0          0          163
disconnect                0          0          4
statfs                    0          0          102566
create                    0          0          30
destroy                   0          0          30
setattr                   23         23         1864211
getattr                   487        487        58908803
llog_init                 0          0          60
notify                    0          0          121
</sample llstat output>


Kind regards,
Frederik
-- 
Frederik Ferner
Computer Systems Administrator		phone: +44 1235 77 8624
Diamond Light Source Ltd.		mob:   +44 7917 08 5110
(Apologies in advance for the lines below. Some bits are a legal
requirement and I have no control over them.)



More information about the lustre-discuss mailing list