[Lustre-discuss] ldlm_enqueue operation failures

Charles Taylor taylor at hpc.ufl.edu
Tue Feb 19 05:53:10 PST 2008


One more thing worth mentioning: we no longer see any callback or
watchdog timer expired messages, so 1.6.4.2 seems to have fixed that.
It now just looks as though, when 512 threads try to open the same
file at roughly the same time, we run out of some resource on the MDS
or OSSs that keeps Lustre from satisfying the request.
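
For illustration only (this is not the application's actual code), a
retry with a short backoff around the open, instead of calling
mpi_abort() on the first failure, would at least tell us whether the
failures are transient.  A minimal sketch; open_with_retry, MAX_TRIES,
the backoff values, and the "file.in" path are assumptions:

    /* Sketch of a retry-with-backoff open for the case where many MPI
     * ranks hit the same input file at once.  All names here are
     * illustrative, not taken from the actual application. */
    #include <stdio.h>
    #include <errno.h>
    #include <string.h>
    #include <unistd.h>
    #include <mpi.h>

    #define MAX_TRIES 5

    static FILE *open_with_retry(const char *path)
    {
        for (int i = 0; i < MAX_TRIES; i++) {
            FILE *fp = fopen(path, "r");
            if (fp != NULL)
                return fp;
            /* Log the transient failure and back off before retrying
             * instead of aborting the whole job immediately. */
            fprintf(stderr, "open of %s failed (%s), retry %d/%d\n",
                    path, strerror(errno), i + 1, MAX_TRIES);
            sleep(1u << i);             /* simple exponential backoff */
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        FILE *fp = open_with_retry("file.in");  /* shared input file */
        if (fp == NULL) {
            fprintf(stderr, "rank %d: cannot open file\n", rank);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        /* ... read the input ... */
        fclose(fp);

        MPI_Finalize();
        return 0;
    }

If the retries succeed, that would point at a transient resource limit
on the servers rather than a missing or corrupted file.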

Charlie

On Feb 19, 2008, at 8:45 AM, Charles Taylor wrote:

> Yes, I understand.  Right now we are just trying to isolate our
> problems so that we don't provide information unrelated to the
> issue.  To recap: we were running pretty well with our patched
> 1.6.3 implementation.  However, we could not start a 512-way job
> in which all 512 threads try to open a single copy of the same
> file.  Inevitably, one or more threads would get a "can not open
> file" error and call mpi_abort(), even though the file is there
> and many other threads open it successfully.  We thought we were
> hitting Lustre bug 13197, which is supposed to be fixed in 1.6.4.2,
> so we upgraded our MGS/MDS and OSSs to 1.6.4.2.  We have *not*
> upgraded the clients (400+ of them) and were hoping to avoid that
> for the moment.
>
> The upgrade seemed to go well and the file system is accessible on
> all the clients.  However, our 512-way application still cannot
> run.  We tried modifying the app so that each thread opens its own
> copy of the input file (i.e., file.in.<rank>, with the input file
> duplicated 512 times; see the sketch below).  This allowed the job
> to start, but it eventually failed anyway with another "can not
> open file" error:
>
> ERROR (proc. 00410) - cannot open file: ./
> skews_ms2p0.mixt.cva_00411_5.30000E-04
>
>
> This seems to clearly indicate a problem with Lustre and/or our
> implementation.
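>
> For illustration only (this is not the application's code), the
> per-rank naming described above amounts to roughly the following;
> the "file.in" base name, the buffer size, and the error handling
> are assumptions:
>
>     #include <stdio.h>
>     #include <mpi.h>
>
>     int main(int argc, char **argv)
>     {
>         int rank;
>         char path[256];                     /* assumed buffer size */
>
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>         /* Each rank opens its own duplicated copy of the input,
>          * e.g. file.in.<rank>; the name format is illustrative. */
>         snprintf(path, sizeof(path), "file.in.%d", rank);
>
>         FILE *fp = fopen(path, "r");
>         if (fp == NULL) {
>             fprintf(stderr, "rank %d: cannot open %s\n", rank, path);
>             MPI_Abort(MPI_COMM_WORLD, 1);
>         }
>         /* ... read the input ... */
>         fclose(fp);
>         MPI_Finalize();
>         return 0;
>     }
>
> Even with this per-rank layout the job still eventually failed, so
> the failure does not appear to depend on every rank sharing one
> file.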
>
> On a perhaps separate note (perhaps not), since the upgrade
> yesterday we have been seeing the messages below every ten minutes.
> Perhaps we need to shut down and impose some sanity on all this,
> but in reality this is the only job that is having trouble (out of
> hundreds, sometimes thousands) and the file system seems to be
> operating just fine otherwise.
>
> Any insight is appreciated at this point.  We've put a lot of
> effort into Lustre and would like to stick with it, but right now
> it looks like it can't scale to a 512-way job.
>
> Thanks for the help,
>
> Charlie
>
>
>
> Feb 19 07:07:09 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) Skipped 202 previous similar messages
> Feb 19 07:12:41 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:
> 515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> Feb 19 07:12:41 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:
> 515:mgs_handle()) Skipped 201 previous similar messages
> Feb 19 07:17:12 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) @@@ processing error (-107)
> req@ffff810107255850 x36818597/t0 o101-><?>@<?>:-1 lens 232/0 ref 0
> fl Interpret:/0/0 rc -107/0
> Feb 19 07:17:12 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) Skipped 207 previous similar messages
> Feb 19 07:22:42 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:
> 515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> Feb 19 07:22:42 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:
> 515:mgs_handle()) Skipped 209 previous similar messages
> Feb 19 07:27:16 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) @@@ processing error (-107)
> req@ffff81011e056c50 x679809/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl
> Interpret:/0/0 rc -107/0
> Feb 19 07:27:16 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) Skipped 207 previous similar messages
> Feb 19 07:32:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:
> 515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> Feb 19 07:32:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:
> 515:mgs_handle()) Skipped 205 previous similar messages
> Feb 19 07:37:16 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) @@@ processing error (-107)
> req@ffff810108157450 x140057135/t0 o101-><?>@<?>:-1 lens 232/0 ref 0
> fl Interpret:/0/0 rc -107/0
> Feb 19 07:37:16 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) Skipped 201 previous similar messages
> Feb 19 07:42:52 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:
> 515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> Feb 19 07:42:52 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:
> 515:mgs_handle()) Skipped 205 previous similar messages
> Feb 19 07:47:17 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) @@@ processing error (-107)
> req@ffff81010824c850 x5243687/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl
> Interpret:/0/0 rc -107/0
> Feb 19 07:47:17 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) Skipped 207 previous similar messages
> Feb 19 07:52:59 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:
> 515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> Feb 19 07:52:59 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:
> 515:mgs_handle()) Skipped 209 previous similar messages
> Feb 19 07:57:27 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) @@@ processing error (-107)
> req@ffff81010869cc50 x4530492/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl
> Interpret:/0/0 rc -107/0
> Feb 19 07:57:27 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) Skipped 207 previous similar messages
> Feb 19 08:03:03 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:
> 515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> Feb 19 08:03:03 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:
> 515:mgs_handle()) Skipped 203 previous similar messages
> Feb 19 08:07:30 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) @@@ processing error (-107)
> req@ffff810107257450 x6548994/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl
> Interpret:/0/0 rc -107/0
> Feb 19 08:07:30 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) Skipped 205 previous similar messages
> Feb 19 08:13:05 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:
> 515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> Feb 19 08:13:05 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:
> 515:mgs_handle()) Skipped 207 previous similar messages
> Feb 19 08:17:33 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) @@@ processing error (-107)
> req@ffff81011e056c50 x680167/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl
> Interpret:/0/0 rc -107/0
> Feb 19 08:17:33 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) Skipped 209 previous similar messages
> Feb 19 08:23:07 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:
> 515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> Feb 19 08:23:07 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:
> 515:mgs_handle()) Skipped 205 previous similar messages
>
>
>
> On Feb 19, 2008, at 12:15 AM, Oleg Drokin wrote:
>
>> Hello!
>>
>> On Feb 18, 2008, at 5:13 PM, Charles Taylor wrote:
>>> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 10.13.24.40@o2ib. The mds_close operation
>>> failed with -116
>>> Feb 18 15:32:47 r5b-s42 kernel: LustreError: Skipped 3 previous
>>> similar messages
>>> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:
>>> 97:ll_close_inode_openhandle()) inode 17243099 mdc close failed:  
>>> rc =
>>> -116
>>> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:
>>> 97:ll_close_inode_openhandle()) Skipped 1 previous similar messages
>>
>> These mean the client was evicted (and later successfully
>> reconnected) after opening the file successfully.
>>
>> We need all the failure/eviction info since the job started in
>> order to make any meaningful progress, because as of now I have no
>> idea why the clients were evicted.
>>
>> Bye,
>>     Oleg
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
