[Lustre-discuss] ldlm_enqueue operation failures
Charles Taylor
taylor@hpc.ufl.edu
Mon Feb 18 13:55:18 PST 2008
Well, the log on the MDS at the time of the failure looks like...
Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Feb 18 15:25:50 hpcmds kernel: LustreError: 7162:0:(mgs_handler.c:515:mgs_handle()) Skipped 263 previous similar messages
Feb 18 15:29:25 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff81011acf7c50 x1602651/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Feb 18 15:29:25 hpcmds kernel: LustreError: 6057:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 427 previous similar messages
Feb 18 15:31:28 hpcmds kernel: LustreError: 7150:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 43116025: cookie 0x1938027bf9d67349 req@ffff8100ae3bfc00 x10000789/t0 o35->beb7df79-6127-c0ca-9d36-2a96817a77a9@:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Feb 18 15:31:28 hpcmds kernel: LustreError: 7150:0:(mds_open.c:1474:mds_close()) Skipped 161 previous similar messages
Feb 18 15:33:17 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock callback timer expired: evicting client 2bdea9d4-43c3-a0b0-2822-c49ecfe6e044@NET_0x500000a0d1935_UUID nid 10.13.25.53@o2ib ns: mds-ufhpc-MDT0000_UUID lock: ffff810053d3f100/0x688cfbc7df2ef487 lrc: 1/0,0 mode: CR/CR res: 21878337/3424633214 bits 0x3 rrc: 582 type: IBT flags: 4000030 remote: 0x95c1d2685c2c76d9 expref: 21 pid 6090
Feb 18 15:33:17 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) Skipped 3 previous similar messages
Feb 18 15:33:17 hpcmds kernel: LustreError: 6265:0:(ldlm_lockd.c:962:ldlm_handle_enqueue()) ### lock on destroyed export ffff8101096ec000 ns: mds-ufhpc-MDT0000_UUID lock: ffff810225fe12c0/0x688cfbc7df2ef505 lrc: 2/0,0 mode: CR/CR res: 21878337/3424633214 bits 0x3 rrc: 579 type: IBT flags: 4000030 remote: 0x95c1d2685c2c76e0 expref: 6 pid 6265
Feb 18 15:33:17 hpcmds kernel: LustreError: 6265:0:(ldlm_lockd.c:962:ldlm_handle_enqueue()) Skipped 3 previous similar messages
Feb 18 15:33:17 hpcmds kernel: Lustre: 6061:0:(mds_reint.c:127:mds_finish_transno()) commit transaction for disconnected client 2bdea9d4-43c3-a0b0-2822-c49ecfe6e044: rc 0
We don't have any watchdog timeouts associated with the event, so I
don't have any tracebacks from those. On one of the clients we
have...
Feb 18 15:33:17 r1b-s23 kernel: LustreError: 11-0: an error occurred while communicating with 10.13.24.40@o2ib. The ldlm_enqueue operation failed with -107
Feb 18 15:33:17 r1b-s23 kernel: LustreError: Skipped 2 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: Lustre: ufhpc-MDT0000-mdc-ffff81012d370800: Connection to service ufhpc-MDT0000 via nid 10.13.24.40@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Feb 18 15:33:17 r1b-s23 kernel: Lustre: Skipped 2 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: LustreError: 167-0: This client was evicted by ufhpc-MDT0000; in progress operations using this service will fail.
Feb 18 15:33:17 r1b-s23 kernel: LustreError: Skipped 2 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: LustreError: 12004:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -5
Feb 18 15:33:17 r1b-s23 kernel: LustreError: 12004:0:(mdc_locks.c:423:mdc_finish_enqueue()) Skipped 3 previous similar messages
Feb 18 15:33:17 r1b-s23 kernel: Lustre: ufhpc-MDT0000-mdc-ffff81012d370800: Connection restored to service ufhpc-MDT0000 using nid 10.13.24.40@o2ib.
Feb 18 15:33:17 r1b-s23 kernel: Lustre: Skipped 2 previous similar messages
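For reference, the negative return codes in these messages are ordinary
Linux errno values: -107 is ENOTCONN ("Transport endpoint is not
connected", which is what the server returns once it has evicted the
client and destroyed its export), and -5 is EIO. A quick way to
translate them, as a hedged sketch in plain Python with nothing
Lustre-specific in it:

    # Map the rc values seen in the logs above to their errno names.
    # Assumes standard Linux errno numbering, which is what Lustre uses.
    import errno
    import os

    for rc in (-107, -5):
        num = -rc
        print(rc, errno.errorcode.get(num, "?"), "-", os.strerror(num))
    # -107 ENOTCONN - Transport endpoint is not connected
    # -5 EIO - Input/output error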
ct
On Feb 18, 2008, at 4:42 PM, Oleg Drokin wrote:
> Hello!
>
> On Feb 18, 2008, at 4:29 PM, Charles Taylor wrote:
>
>> Unfortunately, the upgrade did not resolve our issue. One of our
>> users has an MPI app where every thread opens the same input file
>> (actually several in succession). Although we have run this job
>> successfully before on up to 512 procs, it is not working now.
>> Lustre seems to lock up when all the threads go after the same
>> file (to open it), and we see things such as ...
>
> Can you upload the full log, from the start of the problematic job to
> the end, somewhere?
> Also, when the first watchdog timeouts hit, it would be nice if you
> could do sysrq-t on the MDS too to get traces of all threads (you
> need a big dmesg buffer for them to fit, or use a serial console).
> Does the job use flock/fcntl locks at all? If not, then don't worry
> about mounting with -o flock.
>
> Bye,
> Oleg
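On the sysrq-t suggestion above: writing 't' to /proc/sysrq-trigger on
the MDS dumps a stack trace of every kernel task into dmesg, which is
why a large log buffer (or a serial console) is needed to hold them
all. A minimal sketch, assuming it is run as root on the MDS at the
moment the watchdogs fire:

    # Equivalent to "echo t > /proc/sysrq-trigger"; must be run as root.
    # 't' asks the kernel to dump the stack of every task to the kernel
    # log, so capture dmesg immediately afterwards.
    with open("/proc/sysrq-trigger", "w") as f:
        f.write("t")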
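And on the flock question: only applications that take advisory file
locks, via flock() or fcntl() byte-range locks, need the client mounted
with -o flock; plain open()/read() from many ranks at once does not.
A hedged sketch of what such calls look like, using a made-up file name
"input.dat":

    # If your code never does anything like this, the -o flock mount
    # option is irrelevant to it.
    import fcntl

    with open("input.dat", "rb") as f:
        fcntl.flock(f, fcntl.LOCK_SH)   # BSD-style whole-file shared lock
        data = f.read()
        fcntl.flock(f, fcntl.LOCK_UN)

    with open("input.dat", "rb") as f:
        fcntl.lockf(f, fcntl.LOCK_SH)   # fcntl()-style byte-range lock
        data = f.read()
        fcntl.lockf(f, fcntl.LOCK_UN)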