[Lustre-discuss] ldlm_enqueue operation failures
Charles Taylor
taylor at hpc.ufl.edu
Tue Feb 19 05:45:39 PST 2008
Yes, I understand. Right now we are just trying to isolate our
problems so that we don't provide information that is not related to
the issue. Just to recap, we were running pretty well with our
patched 1.6.3 implementation. However, we could not start a 512-way
job in which each thread tries to open a single copy of the same
file. Inevitably, one or more threads would get a "can not open
file" error and call mpi_abort(), even though the file is there and
many other threads open it successfully. We thought we were
hitting Lustre bug 13197, which is supposed to be fixed in 1.6.4.2, so
we upgraded our MGS/MDS and OSSs to 1.6.4.2. We have *not* upgraded
the clients (400+ of them) and were hoping to avoid that for the moment.

The upgrade seemed to go well and the file system is accessible on
all the clients. However, our 512-way application still cannot
run. We tried modifying the app so that each thread opens its own
copy of the input file (i.e. file.in.<rank>) and duplicated the input
file 512 times. This allowed the job to start, but it eventually
failed anyway with another "cannot open file" error:

ERROR (proc. 00410) - cannot open file: ./skews_ms2p0.mixt.cva_00411_5.30000E-04

This seems to clearly indicate a problem with Lustre and/or our
implementation.
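
For reference, the per-rank workaround described above (one copy of the input per rank, named file.in.<rank>) can be sketched roughly as follows. This is only an illustration in Python, not the application's actual code: the helper names `rank_input_path` and `duplicate_input` are hypothetical, and the unpadded rank suffix is an assumption.

```python
import shutil
from pathlib import Path


def rank_input_path(base: Path, rank: int) -> Path:
    """Return the per-rank input name, e.g. file.in.7 for rank 7 (suffix format assumed)."""
    return base.with_name(f"{base.name}.{rank}")


def duplicate_input(base: Path, nranks: int) -> list:
    """Copy the shared input file once per rank so that no two ranks open the same file."""
    copies = []
    for rank in range(nranks):
        dst = rank_input_path(base, rank)
        shutil.copyfile(base, dst)  # plain byte-for-byte copy of the input
        copies.append(dst)
    return copies
```

Each rank would then open `rank_input_path(base, rank)` instead of the shared file. This only removes contention on a single input file; it does nothing about files opened later in the run, which is where the job failed after the change.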

On a perhaps separate note (perhaps not), since the upgrade
yesterday, we are seeing the messages below every ten minutes.
Perhaps we need to shut down and impose some sanity on all this, but
in reality this is the only job that is having trouble (out of
hundreds, sometimes thousands), and the file system seems to be
operating just fine otherwise.

Any insight is appreciated. We've put a lot of effort into Lustre
and would like to stick with it, but right now it looks like it
can't scale to a 512-way job.

Thanks for the help,
Charlie

Feb 19 07:07:09 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 202 previous similar messages
Feb 19 07:12:41 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:12:41 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) Skipped 201 previous similar messages
Feb 19 07:17:12 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff810107255850 x36818597/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:17:12 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 07:22:42 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:22:42 hpcmds kernel: LustreError: 6056:0:(mgs_handler.c:515:mgs_handle()) Skipped 209 previous similar messages
Feb 19 07:27:16 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81011e056c50 x679809/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:27:16 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 07:32:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:32:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) Skipped 205 previous similar messages
Feb 19 07:37:16 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff810108157450 x140057135/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:37:16 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 201 previous similar messages
Feb 19 07:42:52 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:42:52 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) Skipped 205 previous similar messages
Feb 19 07:47:17 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81010824c850 x5243687/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:47:17 hpcmds kernel: LustreError: 7162:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 07:52:59 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 07:52:59 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) Skipped 209 previous similar messages
Feb 19 07:57:27 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81010869cc50 x4530492/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 07:57:27 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 207 previous similar messages
Feb 19 08:03:03 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 08:03:03 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) Skipped 203 previous similar messages
Feb 19 08:07:30 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff810107257450 x6548994/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 08:07:30 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 205 previous similar messages
Feb 19 08:13:05 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 08:13:05 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) Skipped 207 previous similar messages
Feb 19 08:17:33 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81011e056c50 x680167/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 19 08:17:33 hpcmds kernel: LustreError: 6056:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 209 previous similar messages
Feb 19 08:23:07 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 19 08:23:07 hpcmds kernel: LustreError: 6057:0:(mgs_handler.c:515:mgs_handle()) Skipped 205 previous similar messages
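
As a rough aid for reading output like the above, the following Python sketch tallies LustreError entries per reporting function, adds up the "Skipped N previous similar messages" counts, and decodes the error code. It assumes each syslog entry occupies a single line, as it would in the original log file. The function name `summarize` and the regular expressions are illustrative assumptions, not part of any Lustre tool. The errno names are the standard Linux values (107 is ENOTCONN, 116 is ESTALE).

```python
import re
from collections import Counter

# Linux errno values that appear (negated) in this thread's kernel messages.
ERRNO_NAMES = {107: "ENOTCONN", 116: "ESTALE"}

# Matches "LustreError: <pid>:<n>:(<file>:<line>:<func>()) <text>".
LINE_RE = re.compile(
    r"LustreError: \d+:\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<text>.*)"
)
SKIPPED_RE = re.compile(r"Skipped (\d+) previous similar messages")
ERROR_RE = re.compile(r"processing error \((-\d+)\)")


def summarize(lines):
    """Return (reported, skipped, errors) counters for an iterable of syslog lines."""
    reported = Counter()  # messages actually printed, keyed by source function
    skipped = Counter()   # rate-limited messages, keyed by source function
    errors = Counter()    # decoded error codes from "processing error (-N)"
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not in the "(file:line:func())" form; ignore
        func, text = m.group("func"), m.group("text")
        s = SKIPPED_RE.search(text)
        if s:
            skipped[func] += int(s.group(1))
            continue
        reported[func] += 1
        e = ERROR_RE.search(text)
        if e:
            code = -int(e.group(1))  # "-107" becomes 107
            errors[ERRNO_NAMES.get(code, str(code))] += 1
    return reported, skipped, errors
```

Run over the MDS log, this shows that every printed entry stands in for roughly 200 suppressed ones, and that the -107 code decodes to ENOTCONN, which is consistent with the "operation 101 on unconnected MGS" text.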
On Feb 19, 2008, at 12:15 AM, Oleg Drokin wrote:
> Hello!
>
> On Feb 18, 2008, at 5:13 PM, Charles Taylor wrote:
>> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.24.40@o2ib. The mds_close operation failed with -116
>> Feb 18 15:32:47 r5b-s42 kernel: LustreError: Skipped 3 previous similar messages
>> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:97:ll_close_inode_openhandle()) inode 17243099 mdc close failed: rc = -116
>> Feb 18 15:32:47 r5b-s42 kernel: LustreError: 7828:0:(file.c:97:ll_close_inode_openhandle()) Skipped 1 previous similar messages
>
> These mean client was evicted (And later successfully reconnected) after
> opening file successfully.
>
> We need all the failure/evictions info since job started to make any
> meaningful progress, because as of now I have no idea why clients
> were evicted.
>
> Bye,
> Oleg