[Lustre-discuss] ldlm_enqueue operation failures
Charles Taylor
taylor at hpc.ufl.edu
Mon Feb 18 13:29:17 PST 2008
FWIW, we got our MGS/MDS and OSSs upgraded to 1.6.4.2 and they seem
to be fine. The clients are still running 1.6.3.
Unfortunately, the upgrade did not resolve our issue. One of our
users has an MPI app where every thread opens the same input file
(actually several in succession). Although we have run this job
successfully before on up to 512 procs, it is not working now.
Lustre seems to lock up when all the threads go after the same
file (to open), and we see things such as ...
Feb 18 15:42:11 r3b-s16 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.24.40@o2ib. The ldlm_enqueue operation failed with -107
Feb 18 15:42:11 r3b-s16 kernel: LustreError: Skipped 21 previous similar messages
Feb 18 15:52:51 r3b-s16 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.24.40@o2ib. The ldlm_enqueue operation failed with -107
Feb 18 15:52:51 r3b-s16 kernel: LustreError: Skipped 19 previous similar messages
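For anyone reading along, the -107 in these messages is a negated Linux
errno (107 is ENOTCONN, "Transport endpoint is not connected"). Below is a
minimal sketch for pulling the peer, operation, and return code out of such
a line once it is joined back onto one line. The `decode` helper and the
small `LINUX_ERRNO` table are illustrative names of my own, not part of
Lustre, and the table assumes Linux errno numbering.

```python
import re

# Linux errno values (assumption: Linux clients; numbers differ on other OSes).
LINUX_ERRNO = {2: "ENOENT", 107: "ENOTCONN", 110: "ETIMEDOUT"}

# Matches the body of a "LustreError: 11-0" message on a single line.
# The peer NID is everything between "communicating with " and ". The".
PATTERN = re.compile(
    r"communicating with (?P<nid>.+?)\. "
    r"The (?P<op>\w+) operation failed with (?P<rc>-?\d+)"
)

def decode(line):
    """Return (nid, operation, return_code, errno_name), or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    rc = int(m.group("rc"))
    # Lustre reports failures as negative errno values, so negate for lookup.
    return m.group("nid"), m.group("op"), rc, LINUX_ERRNO.get(-rc, "unknown")

sample = (
    "Feb 18 15:42:11 r3b-s16 kernel: LustreError: 11-0: an error occurred "
    "while communicating with 10.13.24.40@o2ib. "
    "The ldlm_enqueue operation failed with -107"
)
print(decode(sample))
```

So the client is reporting that its connection to that peer was not
established (or had been dropped) when it tried to enqueue the lock.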
10.13.24.40@o2ib is our MDS. We have 512 ll_mdt threads (the max).
The error the application itself reports on some of the threads is
that the file was not found (even though it is clearly there), and
this happens only after a timeout of about 8 minutes.
Note that we have the file system mounted with the "-o flock"
option. Is this part of the problem or are we hitting yet another
bug?
Thanks,
Charlie Taylor
UF HPC Center