[Lustre-discuss] ldlm_enqueue operation failures

Charles Taylor taylor at hpc.ufl.edu
Tue Feb 19 05:45:39 PST 2008


Yes, I understand.  Right now we are just trying to isolate our
problems so that we don't provide information that is unrelated to
the issue.  Just to recap: we were running pretty well with our
patched 1.6.3 implementation.  However, we could not start a 512-way
job in which each thread tries to open a single copy of the same
file.  Inevitably, one or more threads would get a "cannot open
file" error and call mpi_abort(), even though the file is there and
many other threads open it successfully.  We thought we were hitting
Lustre bug 13197, which is supposed to be fixed in 1.6.4.2, so we
upgraded our MGS/MDS and OSSs to 1.6.4.2.  We have *not* upgraded
the clients (400+ of them) and were hoping to avoid that for the moment.
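
Roughly, the failing pattern looks like this (a minimal sketch; the
file name, error text, and abort call are placeholders I'm using for
illustration, not taken from the real application):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* all 512 ranks open the same input file */
    FILE *fp = fopen("file.in", "r");
    if (fp == NULL) {
        fprintf(stderr, "ERROR (proc. %05d) - cannot open file: file.in\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);   /* one failed open kills the whole job */
    }

    /* ... read the input and run ... */
    fclose(fp);
    MPI_Finalize();
    return 0;
}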

The upgrade seemed to go well and the file system is accessible on
all the clients.  However, our 512-way application still cannot
run.  We tried modifying the app so that each thread opens its own
copy of the input file (i.e. file.in.<rank>, sketched below),
duplicating the input file 512 times.  This allowed the job to
start, but it eventually failed anyway with another "cannot open
file" error:

ERROR (proc. 00410) - cannot open file: ./skews_ms2p0.mixt.cva_00411_5.30000E-04


This seems to clearly indicate a problem with Lustre and/or our  
implementation.
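
For reference, the per-rank open in the modified app looks roughly
like this (again a sketch; only the file.in.<rank> naming scheme
comes from what we actually did, the rest is illustrative):

#include <mpi.h>
#include <stdio.h>

/* open this rank's private copy of the input, e.g. file.in.410 */
static FILE *open_rank_input(int rank)
{
    char fname[64];
    snprintf(fname, sizeof(fname), "file.in.%d", rank);
    FILE *fp = fopen(fname, "r");
    if (fp == NULL) {
        fprintf(stderr, "ERROR (proc. %05d) - cannot open file: %s\n", rank, fname);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    return fp;
}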

On a perhaps separate note (perhaps not), since the upgrade
yesterday we have been seeing the messages below every ten minutes.
Perhaps we need to shut everything down and impose some sanity on
all of this, but in reality this is the only job that is having
trouble (out of hundreds, sometimes thousands), and the file system
seems to be operating just fine otherwise.

Any insight is appreciated at this point.  We've put a lot of effort
into Lustre and would like to stick with it, but right now it looks
like it can't scale to a 512-way job.

Thanks for the help,

Charlie



Feb 19 07:07:09 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 202 previous similar messages
Feb 19 07:12:41 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:12:41 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) Skipped 201 previous similar messages
Feb 19 07:17:12 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107)  req@ffff810107255850 x36818597/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:17:12 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 07:22:42 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:22:42 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) Skipped 209 previous similar messages
Feb 19 07:27:16 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107)  req@ffff81011e056c50 x679809/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:27:16 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 07:32:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:32:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) Skipped 205 previous similar messages
Feb 19 07:37:16 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107)  req@ffff810108157450 x140057135/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:37:16 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 201 previous similar messages
Feb 19 07:42:52 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:42:52 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) Skipped 205 previous similar messages
Feb 19 07:47:17 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107)  req@ffff81010824c850 x5243687/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:47:17 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 07:52:59 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:52:59 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) Skipped 209 previous similar messages
Feb 19 07:57:27 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107)  req@ffff81010869cc50 x4530492/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:57:27 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 08:03:03 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 08:03:03 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) Skipped 203 previous similar messages
Feb 19 08:07:30 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107)  req@ffff810107257450 x6548994/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 08:07:30 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 205 previous similar messages
Feb 19 08:13:05 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 08:13:05 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) Skipped 207 previous similar messages
Feb 19 08:17:33 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107)  req@ffff81011e056c50 x680167/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 08:17:33 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 209 previous similar messages
Feb 19 08:23:07 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 08:23:07 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) Skipped 205 previous similar messages
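
For what it's worth, my reading is that the rc values above are
negated Linux errnos, and that operation 101 is the ldlm_enqueue
request from the subject line (both are my assumptions).  A quick
way to decode the codes here and the -116 quoted further down:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* assuming the rc values are negated Linux errno codes */
    printf("rc -107: %s\n", strerror(107));  /* ENOTCONN: Transport endpoint is not connected */
    printf("rc -116: %s\n", strerror(116));  /* ESTALE: Stale file handle (the mds_close error) */
    return 0;
}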



On Feb 19, 2008, at 12:15 AM, Oleg Drokin wrote:

> Hello!
>
> On Feb 18, 2008, at 5:13 PM, Charles Taylor wrote:
>> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.24.40@o2ib. The mds_close operation failed with -116
>> Feb 18 15:32:47 r5b-s42 kernel: LustreError: Skipped 3 previous similar messages
>> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:97:ll_close_inode_openhandle()) inode 17243099 mdc close failed: rc = -116
>> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:97:ll_close_inode_openhandle()) Skipped 1 previous similar messages
>
> These mean the client was evicted (and later successfully
> reconnected) after opening the file successfully.
>
> We need all the failure/eviction info since the job started to make
> any meaningful progress, because as of now I have no idea why the
> clients were evicted.
>
> Bye,
>     Oleg



