[Lustre-discuss] ldlm_enqueue operation failures
Charles Taylor
taylor at hpc.ufl.edu
Tue Feb 19 06:08:07 PST 2008
OK, on the host that recorded the "cannot open file" error, we have
this in the log at the time of the failure:
Feb 18 22:46:37 r2a-s33 kernel: LustreError: 21216:0:(file.c:
1040:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO
Feb 18 22:46:37 r2a-s33 kernel: LustreError: 21216:0:(file.c:
1040:ll_glimpse_size()) Skipped 1 previous similar message
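
For what it's worth, the return codes in these messages (and in the
-107s and -116s quoted further down) appear to be negated Linux errno
values. A minimal sketch to decode them (nothing Lustre-specific, just
strerror):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* 5 = EIO (the obd_enqueue rc above), 107 = ENOTCONN (the
         * MGS replies below), 116 = ESTALE (the mds_close rc below) */
        int codes[] = { 5, 107, 116 };
        for (int i = 0; i < 3; i++)
            printf("-%d: %s\n", codes[i], strerror(codes[i]));
        return 0;
    }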
Is this a known problem? Is there some resource we need to increase?
Thanks,
Charlie Taylor
UF HPC Center
On Feb 19, 2008, at 8:45 AM, Charles Taylor wrote:
> Yes, I understand. Right now we are just trying to isolate our
> problems so that we don't provide information that is unrelated to
> the issue. To recap: we were running pretty well with our patched
> 1.6.3 implementation. However, we could not start a 512-way job in
> which each thread tries to open the same input file. Inevitably,
> one or more threads would get a "cannot open file" error and call
> mpi_abort(), even though the file is there and many other threads
> open it successfully (the access pattern is sketched below). We
> thought we were hitting Lustre bug 13197, which is supposed to be
> fixed in 1.6.4.2, so we upgraded our MGS/MDS and OSSs to 1.6.4.2.
> We have *not* upgraded the clients (400+ of them) and were hoping
> to avoid that for the moment.
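>
> The access pattern is essentially this (a simplified sketch; the
> real code, file names, and error handling differ):
>
>     #include <mpi.h>
>     #include <stdio.h>
>
>     int main(int argc, char **argv)
>     {
>         int rank;
>
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>         /* All 512 ranks open the same input file read-only. */
>         FILE *fp = fopen("file.in", "r");
>         if (fp == NULL) {
>             /* One failed open takes down the entire job. */
>             fprintf(stderr, "ERROR (proc. %05d) - cannot open file\n",
>                     rank);
>             MPI_Abort(MPI_COMM_WORLD, 1);
>         }
>         /* ... read the input and do the real work ... */
>         fclose(fp);
>         MPI_Finalize();
>         return 0;
>     }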
>
> The upgrade seemed to go well and the file system is accessible on
> all the clients. However, our 512-way application still cannot
> run. We tried modifying the app so that each thread opens its own
> copy of the input file (i.e. file.in.<rank> rather than the shared
> file in the sketch above, with the input file duplicated 512
> times). This allowed the job to start, but it eventually failed
> anyway with another "cannot open file" error:
>
> ERROR (proc. 00410) - cannot open file: ./
> skews_ms2p0.mixt.cva_00411_5.30000E-04
>
>
> This seems to clearly indicate a problem with Lustre and/or our
> implementation.
>
> On a perhaps separate note (perhaps not), since the upgrade
> yesterday we have been seeing the messages below every ten minutes.
> If we read the codes right, operation 101 is an LDLM lock enqueue
> and rc -107 is ENOTCONN, i.e. something is talking to the MGS
> without a live connection. Perhaps we need to shut everything down
> and impose some sanity on all this, but in reality this is the only
> job having trouble (out of hundreds, sometimes thousands) and the
> file system otherwise seems to be operating just fine.
>
> Any insight is appreciated at this point. We've put a lot of effort
> into Lustre and would like to stick with it, but right now it looks
> like it can't scale to a 512-way job.
>
> Thanks for the help,
>
> Charlie
>
>
>
> Feb 19 07:07:09 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) Skipped 202 previous similar messages
> Feb 19 07:12:41 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:
> 515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> Feb 19 07:12:41 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:
> 515:mgs_handle()) Skipped 201 previous similar messages
> Feb 19 07:17:12 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) @@@ processing error (-107)
> req@ffff810107255850 x36818597/t0 o101-><?>@<?>:-1 lens 232/0 ref 0
> fl Interpret:/0/0 rc -107/0
> Feb 19 07:17:12 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) Skipped 207 previous similar messages
> Feb 19 07:22:42 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:
> 515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> Feb 19 07:22:42 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:
> 515:mgs_handle()) Skipped 209 previous similar messages
> Feb 19 07:27:16 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) @@@ processing error (-107)
> req@ffff81011e056c50 x679809/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl
> Interpret:/0/0 rc -107/0
> Feb 19 07:27:16 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) Skipped 207 previous similar messages
> Feb 19 07:32:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:
> 515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> Feb 19 07:32:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:
> 515:mgs_handle()) Skipped 205 previous similar messages
> Feb 19 07:37:16 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) @@@ processing error (-107)
> req@ffff810108157450 x140057135/t0 o101-><?>@<?>:-1 lens 232/0 ref 0
> fl Interpret:/0/0 rc -107/0
> Feb 19 07:37:16 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) Skipped 201 previous similar messages
> Feb 19 07:42:52 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:
> 515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> Feb 19 07:42:52 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:
> 515:mgs_handle()) Skipped 205 previous similar messages
> Feb 19 07:47:17 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) @@@ processing error (-107)
> req@ffff81010824c850 x5243687/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl
> Interpret:/0/0 rc -107/0
> Feb 19 07:47:17 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) Skipped 207 previous similar messages
> Feb 19 07:52:59 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:
> 515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> Feb 19 07:52:59 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:
> 515:mgs_handle()) Skipped 209 previous similar messages
> Feb 19 07:57:27 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) @@@ processing error (-107)
> req@ffff81010869cc50 x4530492/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl
> Interpret:/0/0 rc -107/0
> Feb 19 07:57:27 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) Skipped 207 previous similar messages
> Feb 19 08:03:03 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:
> 515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> Feb 19 08:03:03 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:
> 515:mgs_handle()) Skipped 203 previous similar messages
> Feb 19 08:07:30 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) @@@ processing error (-107)
> req@ffff810107257450 x6548994/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl
> Interpret:/0/0 rc -107/0
> Feb 19 08:07:30 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) Skipped 205 previous similar messages
> Feb 19 08:13:05 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:
> 515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> Feb 19 08:13:05 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:
> 515:mgs_handle()) Skipped 207 previous similar messages
> Feb 19 08:17:33 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) @@@ processing error (-107)
> req@ffff81011e056c50 x680167/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl
> Interpret:/0/0 rc -107/0
> Feb 19 08:17:33 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:
> 1442:target_send_reply_msg()) Skipped 209 previous similar messages
> Feb 19 08:23:07 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:
> 515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> Feb 19 08:23:07 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:
> 515:mgs_handle()) Skipped 205 previous similar messages
>
>
>
> On Feb 19, 2008, at 12:15 AM, Oleg Drokin wrote:
>
>> Hello!
>>
>> On Feb 18, 2008, at 5:13 PM, Charles Taylor wrote:
>>> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 10.13.24.40@o2ib. The mds_close operation
>>> failed with -116
>>> Feb 18 15:32:47 r5b-s42 kernel: LustreError: Skipped 3 previous
>>> similar messages
>>> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:
>>> 97:ll_close_inode_openhandle()) inode 17243099 mdc close failed:
>>> rc =
>>> -116
>>> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:
>>> 97:ll_close_inode_openhandle()) Skipped 1 previous similar messages
>>
>> These mean the client was evicted (and later successfully
>> reconnected) after opening the file successfully.
>>
>> We need all the failure/eviction info since the job started to
>> make any meaningful progress, because as of now I have no idea why
>> the clients were evicted.
>>
>> Bye,
>> Oleg
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss