[Lustre-discuss] ldlm_enqueue operation failures

Mon Feb 18 13:29:17 PST 2008

FWIW, we upgraded our MGS/MDS and OSSs to 1.6.4.2 and they seem
to be fine. The clients are still running 1.6.3.

Unfortunately, the upgrade did not resolve our issue. One of our
users has an MPI app where every thread opens the same input file
(actually several in succession). Although we have run this job
successfully before on up to 512 procs, it is not working now.
Lustre seems to lock up when all the threads try to open the same
file, and we see messages such as ...

Feb 18 15:42:11 r3b-s16 kernel: LustreError: 11-0: an error occurred
while communicating with 10.13.24.40@o2ib. The ldlm_enqueue operation
failed with -107
Feb 18 15:42:11 r3b-s16 kernel: LustreError: Skipped 21 previous
similar messages
Feb 18 15:52:51 r3b-s16 kernel: LustreError: 11-0: an error occurred
while communicating with 10.13.24.40@o2ib. The ldlm_enqueue operation
failed with -107
Feb 18 15:52:51 r3b-s16 kernel: LustreError: Skipped 19 previous
similar messages

10.13.24.40@o2ib is our MDS. We have 512 ll_mdt threads (the max).
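
For reference, the -107 in the messages above is a negated Linux errno;
107 is ENOTCONN ("Transport endpoint is not connected"). Below is a
minimal sketch of pulling the peer, operation, and return code out of
such a line. It is not a Lustre tool: the function name
parse_lustre_error, the regular expression, and the small errno table
are assumptions made for illustration, and the pattern accepts both
the "@o2ib" form and the " at o2ib" form that mail archives produce.

```python
import re

# A few Linux errno values, hardcoded so the sketch behaves the same on
# any platform (illustrative subset, not a complete table).
LINUX_ERRNO = {2: "ENOENT", 5: "EIO", 107: "ENOTCONN", 110: "ETIMEDOUT"}

# Matches the body of an "11-0" LustreError message once the wrapped
# lines have been joined into one string. The peer may appear as
# "addr@net" or, after archive obfuscation, as "addr at net".
PATTERN = re.compile(
    r"communicating with (?P<addr>[\d.]+)(?: at |@)(?P<net>\w+)\. "
    r"The (?P<op>\w+) operation failed with (?P<rc>-?\d+)"
)


def parse_lustre_error(line):
    """Return peer, operation, rc and errno name, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    rc = int(m.group("rc"))
    return {
        "peer": f"{m.group('addr')}@{m.group('net')}",
        "op": m.group("op"),
        "rc": rc,
        # Kernel code reports errors as negative errno values.
        "errno_name": LINUX_ERRNO.get(abs(rc), "unknown"),
    }


if __name__ == "__main__":
    sample = (
        "Feb 18 15:42:11 r3b-s16 kernel: LustreError: 11-0: an error "
        "occurred while communicating with 10.13.24.40@o2ib. "
        "The ldlm_enqueue operation failed with -107"
    )
    print(parse_lustre_error(sample))
```

Counting how many of these show up per client and per peer may help
confirm that the failures are all against the MDS.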

The error the application actually reports on some of the threads is
that the file was not found (even though it is clearly there), and
this only happens after a timeout of about 8 minutes.

Note that we have the file system mounted with the "-o flock"
option. Is this part of the problem, or are we hitting yet another
bug?

Thanks,

Charlie Taylor
UF HPC Center