[Lustre-discuss] ldlm_enqueue operation failures
Charles Taylor
taylor@hpc.ufl.edu
Mon Feb 18 13:55:18 PST 2008
Well, the log on the MDS at the time of the failure looks like...
Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) Skipped 263 previous similar messages
Feb 18 15:29:25 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81011acf7c50 x1602651/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 18 15:29:25 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 427 previous similar messages
Feb 18 15:31:28 hpcmds kernel: LustreError: 7150:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 43116025: cookie 0x1938027bf9d67349 req@ffff8100ae3bfc00 x10000789/t0 o35->beb7df79-6127-c0ca-9d36-2a96817a77a9@:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Feb 18 15:31:28 hpcmds kernel: LustreError: 7150:0:(mds_open.c:1474:mds_close()) Skipped 161 previous similar messages
Feb 18 15:33:17 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock callback timer expired: evicting client 2bdea9d4-43c3-a0b0-2822-c49ecfe6e044@NET_0x500000a0d1935_UUID nid 10.13.25.53@o2ib ns: mds-ufhpc-MDT0000_UUID lock: ffff810053d3f100/0x688cfbc7df2ef487 lrc: 1/0,0 mode: CR/CR res: 21878337/3424633214 bits 0x3 rrc: 582 type: IBT flags: 4000030 remote: 0x95c1d2685c2c76d9 expref: 21 pid 6090
Feb 18 15:33:17 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) Skipped 3 previous similar messages
Feb 18 15:33:17 hpcmds kernel: LustreError: 6265:0:(ldlm_lockd.c:962:ldlm_handle_enqueue()) ### lock on destroyed export ffff8101096ec000 ns: mds-ufhpc-MDT0000_UUID lock: ffff810225fe12c0/0x688cfbc7df2ef505 lrc: 2/0,0 mode: CR/CR res: 21878337/3424633214 bits 0x3 rrc: 579 type: IBT flags: 4000030 remote: 0x95c1d2685c2c76e0 expref: 6 pid 6265
Feb 18 15:33:17 hpcmds kernel: LustreError: 6265:0:(ldlm_lockd.c:962:ldlm_handle_enqueue()) Skipped 3 previous similar messages
Feb 18 15:33:17 hpcmds kernel: Lustre: 6061:0:(mds_reint.c:127:mds_finish_transno()) commit transaction for disconnected client 2bdea9d4-43c3-a0b0-2822-c49ecfe6e044: rc 0
We don't have any watchdog timeouts associated with the event, so I
don't have any tracebacks from those. On one of the clients we
have...
Feb 18 15:33:17 r1b-s23 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.24.40@o2ib. The ldlm_enqueue operation failed with -107
Feb 18 15:33:17 r1b-s23 kernel: LustreError: Skipped 2 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: Lustre: ufhpc-MDT0000-mdc-ffff81012d370800: Connection to service ufhpc-MDT0000 via nid 10.13.24.40@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Feb 18 15:33:17 r1b-s23 kernel: Lustre: Skipped 2 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: LustreError: 167-0: This client was evicted by ufhpc-MDT0000; in progress operations using this service will fail.
Feb 18 15:33:17 r1b-s23 kernel: LustreError: Skipped 2 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: LustreError: 12004:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -5
Feb 18 15:33:17 r1b-s23 kernel: LustreError: 12004:0:(mdc_locks.c:423:mdc_finish_enqueue()) Skipped 3 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: Lustre: ufhpc-MDT0000-mdc-ffff81012d370800: Connection restored to service ufhpc-MDT0000 using nid 10.13.24.40@o2ib.
Feb 18 15:33:17 r1b-s23 kernel: Lustre: Skipped 2 previous similar messages
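For reference, the negative return codes in these messages are ordinary
Linux errno values: -107 is ENOTCONN ("Transport endpoint is not
connected", which is what the server returns once it has evicted the
client and destroyed its export), and -5 is EIO. A quick way to
translate them, as a hedged sketch in plain Python with nothing
Lustre-specific in it:

    # Map the rc values seen in the logs above to their errno names.
    # Assumes standard Linux errno numbering, which is what Lustre uses.
    import errno
    import os

    for rc in (-107, -5):
        num = -rc
        print(rc, errno.errorcode.get(num, "?"), "-", os.strerror(num))
    # -107 ENOTCONN - Transport endpoint is not connected
    # -5 EIO - Input/output error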
ct
On Feb 18, 2008, at 4:42 PM, Oleg Drokin wrote:
> Hello!
>
> On Feb 18, 2008, at 4:29 PM, Charles Taylor wrote:
>
>> Unfortunately, the upgrade did not resolve our issue. One of our
>> users has an MPI app where every thread opens the same input file
>> (actually several in succession). Although we have run this job
>> successfully before on up to 512 procs, it is not working now.
>> Lustre seems to lock up when all the threads go after the same
>> file (to open it), and we see things such as ...
>
> Can you upload the full log, from the start of the problematic job to
> the end, somewhere?
> Also, when the first watchdog timeouts hit, it would be nice if you
> could do sysrq-t on the MDS too to get traces of all threads (you
> need a big dmesg buffer for them to fit, or use a serial console).
> Does the job use flock/fcntl locks at all? If not, then don't worry
> about mounting with -o flock.
>
> Bye,
> Oleg
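On the sysrq-t suggestion above: writing 't' to /proc/sysrq-trigger on
the MDS dumps a stack trace of every kernel task into dmesg, which is
why a large log buffer (or a serial console) is needed to hold them
all. A minimal sketch, assuming it is run as root on the MDS at the
moment the watchdogs fire:

    # Equivalent to "echo t > /proc/sysrq-trigger"; must be run as root.
    # 't' asks the kernel to dump the stack of every task to the kernel
    # log, so capture dmesg immediately afterwards.
    with open("/proc/sysrq-trigger", "w") as f:
        f.write("t")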
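And on the flock question: only applications that take advisory file
locks, via flock() or fcntl() byte-range locks, need the client mounted
with -o flock; plain open()/read() from many ranks at once does not.
A hedged sketch of what such calls look like, using a made-up file name
"input.dat":

    # If your code never does anything like this, the -o flock mount
    # option is irrelevant to it.
    import fcntl

    with open("input.dat", "rb") as f:
        fcntl.flock(f, fcntl.LOCK_SH)   # BSD-style whole-file shared lock
        data = f.read()
        fcntl.flock(f, fcntl.LOCK_UN)

    with open("input.dat", "rb") as f:
        fcntl.lockf(f, fcntl.LOCK_SH)   # fcntl()-style byte-range lock
        data = f.read()
        fcntl.lockf(f, fcntl.LOCK_UN)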