[Lustre-discuss] Lustre Falling over

Charles Taylor taylor at hpc.ufl.edu
Fri Jan 11 04:18:50 PST 2008


Were are running a patched (mballoc and others) 1.6.3 lustre on  
x86_64 platform under Centos4.5.   Kernel is ...

Linux submit.ufhpc 2.6.18-8.1.14.el5Lustre #1 SMP Fri Oct 12 15:51:56  
EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

and is the same on the clients and server side.   We have both tcp  
and o2ib NIDS.

It seems to work pretty well most of the time but yesterday a large  
(256-way) job started and it seems that the MDS pretty much just fell  
over.    This occurred at 18:36 (see log below).    The whole cluster  
went into recovery mode as though we had rebooted the MDS.    Prior  
to that we were getting some "target send reply msg" and "no handle  
for file close" errors.

We know that each thread of the 256-way job opens a writes to a file  
independently but roughly concurrently so it probably got hit pretty  
hard all at once.   Nonetheless, we hit it just as hard in our pre- 
production testing and it held up very well.   So the questions are...

1. Does anyone recognize the errors below as known bugs for which  
there are patches/fixes we missed?

2. Looking at the tracebacks, I can't help but wonder if there is  
some ldlm parameter we need to tweak for a cluster of our size.

Note that we have 320 o2ib clients and 80 tcp clients (400 nodes/1600  
cores).    Our MDS is a 4-way Opteron 2212 w/ 8GB RAM.     We have 8  
OSS's and 24 OST's (3 per OSS).   Each OSS is a 4-way Opteron 275  
with 4GB RAM and 3 2Gb/s FC HCAs connected to Xyratex 4900FC storage  
arrays.

Any ideas?   Suggestions?

Thanks,

Charlie Taylor
UF HPC Center

Jan 10 13:34:54 hpcmds kernel: LustreError: 24816:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) @@@ processing error (-116)   
req at ffff810122fe1e00 x183667281/t0 o35- 
 >775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1  
lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 13:34:54 hpcmds kernel: LustreError: 24816:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) Skipped 2 previous similar messages
Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(mds_open.c: 
1474:mds_close()) @@@ no handle for file close ino 30269183: cookie  
0x23710b5b3ed594eb  req at ffff810116300000 x183716808/t0 o35- 
 >775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1  
lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(mds_open.c: 
1474:mds_close()) Skipped 6 previous similar messages
Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) @@@ processing error (-116)   
req at ffff810116300000 x183716808/t0 o35- 
 >775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1  
lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) Skipped 6 previous similar messages
Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(mds_open.c: 
1474:mds_close()) @@@ no handle for file close ino 30298600: cookie  
0x23710b5b3c868998  req at ffff810119cf1400 x183809041/t0 o35- 
 >775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1  
lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(mds_open.c: 
1474:mds_close()) Skipped 1 previous similar message
Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) @@@ processing error (-116)   
req at ffff810119cf1400 x183809041/t0 o35- 
 >775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1  
lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) Skipped 1 previous similar message
Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(mds_open.c: 
1474:mds_close()) @@@ no handle for file close ino 30205356: cookie  
0x23710b5b3dfc4c91  req at ffff8101189a3600 x183817543/t0 o35- 
 >775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1  
lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(mds_open.c: 
1474:mds_close()) Skipped 5 previous similar messages
Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) @@@ processing error (-116)   
req at ffff8101189a3600 x183817543/t0 o35- 
 >775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1  
lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) Skipped 5 previous similar messages
Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(mds_open.c: 
1474:mds_close()) @@@ no handle for file close ino 24185989: cookie  
0x23710b5b3ed9874c  req at ffff81012fa67c00 x183830524/t0 o35- 
 >775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1  
lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(mds_open.c: 
1474:mds_close()) Skipped 1 previous similar message
Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) @@@ processing error (-116)   
req at ffff81012fa67c00 x183830524/t0 o35- 
 >775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1  
lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) Skipped 1 previous similar message
Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(mds_open.c: 
1474:mds_close()) @@@ no handle for file close ino 28959573: cookie  
0x23710b5b3eda7dd4  req at ffff8101161d0c00 x183871936/t0 o35- 
 >775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1  
lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(mds_open.c: 
1474:mds_close()) Skipped 1 previous similar message
Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) @@@ processing error (-116)   
req at ffff8101161d0c00 x183871936/t0 o35- 
 >775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1  
lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) Skipped 1 previous similar message
Jan 10 13:51:11 hpcmds kernel: LustreError: 3730:0:(mds_open.c: 
1474:mds_close()) @@@ no handle for file close ino 30137171: cookie  
0x23710b5b3c790ddb  req at ffff810119320400 x183904052/t0 o35- 
 >775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1  
lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Jan 10 13:51:11 hpcmds kernel: LustreError: 3730:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) @@@ processing error (-116)   
req at ffff810119320400 x183904052/t0 o35- 
 >775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1  
lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 13:53:00 hpcmds kernel: Lustre: ufhpc-MDT0000: haven't heard  
from client 8514a628-1f97-0759-0932-de3e470cbb4f (at  
10.13.26.48 at o2ib) in 2252 seconds. I think it's dead, and I am  
evicting it.
Jan 10 13:53:00 hpcmds kernel: Lustre: MGS: haven't heard from client  
3e041a15-91e3-5c94-78fa-3b511d86bcd9 (at 10.13.26.48 at o2ib) in 2252  
seconds. I think it's dead, and I am evicting it.
Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(mds_open.c: 
1474:mds_close()) @@@ no handle for file close ino 30296257: cookie  
0x23710b5b3c18eca5  req at ffff81011f3d3000 x184041701/t0 o35- 
 >775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1  
lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(mds_open.c: 
1474:mds_close()) Skipped 2 previous similar messages
Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) @@@ processing error (-116)   
req at ffff81011f3d3000 x184041701/t0 o35- 
 >775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1  
lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) Skipped 2 previous similar messages
Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(mds_open.c: 
1474:mds_close()) @@@ no handle for file close ino 29841036: cookie  
0x23710b5b3d470bf2  req at ffff81011860cc00 x184196858/t0 o35- 
 >775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1  
lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(mds_open.c: 
1474:mds_close()) Skipped 4 previous similar messages
Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) @@@ processing error (-116)   
req at ffff81011860cc00 x184196858/t0 o35- 
 >775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1  
lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) Skipped 4 previous similar messages
Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(mds_open.c: 
1474:mds_close()) @@@ no handle for file close ino 30101082: cookie  
0x23710b5b3e8e62f1  req at ffff810117d1e400 x184423975/t0 o35- 
 >775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1  
lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(mds_open.c: 
1474:mds_close()) Skipped 3 previous similar messages
Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) @@@ processing error (-116)   
req at ffff810117d1e400 x184423975/t0 o35- 
 >775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1  
lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) Skipped 3 previous similar messages
Jan 10 18:36:57 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c: 
210:waiting_locks_callback()) ### lock callback timer expired:  
evicting client e23549c4-cd61- 
b385-8f2d-1c125e41e7ef at NET_0x500000a0d1a78_UUID nid  
10.13.26.120 at o2ib  ns: mds-ufhpc-MDT0000_UUID lock:  
ffff8100bb893680/0x23710b5b59ab4680 lrc: 1/0,0 mode: CR/CR res:  
30066994/507692538 bits 0x3 rrc: 309 type: IBT flags: 4000030 remote:  
0x4a51d1af1d561084 expref: 13 pid 31402
Jan 10 18:36:57 hpcmds kernel: LustreError: 31365:0:(handler.c: 
1498:mds_handle()) operation 101 on unconnected MDS from  
12345-10.13.26.120 at o2ib
Jan 10 18:36:57 hpcmds kernel: LustreError: 31365:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) @@@ processing error (-107)   
req at ffff8101191b9400 x10882458/t0 o101-><?>@<?>:-1 lens 512/0 ref 0  
fl Interpret:/0/0 rc -107/0
Jan 10 18:36:57 hpcmds kernel: LustreError: 31365:0:(ldlm_lib.c: 
1437:target_send_reply_msg()) Skipped 12 previous similar messages
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff8856bd62>] :obdclass:class_handle2object+0xd2/0x160
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff885f04e0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x6b0
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff885ec6a0>] :ptlrpc:ldlm_handle_enqueue+0xda0/0x1420
Jan 10 18:45:17 hpcmds kernel:  [<ffffffff888430bd>] :mds:mds_handle 
+0x478d/0x59af
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff8860a030>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff8856bd62>] :obdclass:class_handle2object+0xd2/0x160
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff885f04e0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x6b0
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff88607be5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
Jan 10 18:45:17 hpcmds kernel:  [<ffffffff888430bd>] :mds:mds_handle 
+0x478d/0x59af
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff8860a030>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff8856bd62>] :obdclass:class_handle2object+0xd2/0x160
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff88611617>] :ptlrpc:ptlrpc_main+0x13c7/0x1d70
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff88607be5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80086a74>]  
default_wake_function+0x0/0xe
Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8001c26c>] __mod_timer 
+0xb0/0xbe
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff8860a030>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8005be25>] child_rip+0xa/0x11
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff88611617>] :ptlrpc:ptlrpc_main+0x13c7/0x1d70
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff88607be5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff88610250>] :ptlrpc:ptlrpc_main+0x0/0x1d70
Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80086a74>]  
default_wake_function+0x0/0xe
Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8001c26c>] __mod_timer 
+0xb0/0xbe
Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8005be1b>] child_rip+0x0/0x11
Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8005be25>] child_rip+0xa/0x11
Jan 10 18:45:17 hpcmds kernel:
Jan 10 18:45:17 hpcmds kernel: ll_mdt_100    S ffff81022c2315e0     0  
31377      1         31378 31376  
[<ffffffff88611617>] :ptlrpc:ptlrpc_main+0x13c7/0x1d70
Jan 10 18:45:17 hpcmds kernel:  (L-TLB)
Jan 10 18:45:17 hpcmds kernel:  ffff81022c2315e0 ffff81022c231510  
ffff81022c231510 000000000000000a
Jan 10 18:45:17 hpcmds kernel:  ffff8102206c90c0 ffff81013396f040  
000abe5ed9a075d0 [<ffffffff88610250>] :ptlrpc:ptlrpc_main+0x0/0x1d70
Jan 10 18:45:17 hpcmds kernel:  0000000000009a43
Jan 10 18:45:17 hpcmds kernel:  ffff8102206c92a8 0000000000000002  
ffffffff885ceb51 ffffffffffffffff
Jan 10 18:45:17 hpcmds kernel: Call Trace:
Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80086a74>]  
default_wake_function+0x0/0xe
Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8005be1b>] child_rip+0x0/0x11
Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8005be25>] child_rip+0xa/0x11
Jan 10 18:45:17 hpcmds kernel:
Jan 10 18:45:17 hpcmds kernel: ll_mdt_17     S ffff81011b57f6a0     0  
26533      1         26534 26532 (L-TLB)
Jan 10 18:45:17 hpcmds kernel:  ffff81011b57f6a0  
[<ffffffff885ceb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0
Jan 10 18:45:17 hpcmds kernel:  ffff81011b57f5d0 ffff81011b57f5d0  
000000000000000a
Jan 10 18:45:17 hpcmds kernel:  ffff810118f48040 ffff8101339010c0  
000abe5ed9897be8 0000000000008bd4
Jan 10 18:45:17 hpcmds kernel:  ffff810118f48228 0000000000000001  
ffffffff885ceb51 ffffffffffffffff
Jan 10 18:45:17 hpcmds kernel: Call Trace:
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff88610250>] :ptlrpc:ptlrpc_main+0x0/0x1d70
Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8005be1b>] child_rip+0x0/0x11
Jan 10 18:45:17 hpcmds kernel:
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff885e6650>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff885ceb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0
Jan 10 18:45:17 hpcmds kernel:  [<ffffffff800611f7>] schedule_timeout 
+0x8a/0xad
Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80092c5e>] process_timeout 
+0x0/0x5
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff885e6650>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
Jan 10 18:45:17 hpcmds kernel:  [<ffffffff800611f7>] schedule_timeout 
+0x8a/0xad
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff885e7aad>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0
Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80092c5e>] process_timeout 
+0x0/0x5
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff885d05e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff885e7aad>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0
Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80086a74>]  
default_wake_function+0x0/0xe
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff885d05e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff885cce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/0x90
Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80086a74>]  
default_wake_function+0x0/0xe
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff885e4460>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
Jan 10 18:45:17 hpcmds kernel:   
[<ffffffff885cce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/0x90


Followed by tons more of the above tracebacks.







More information about the lustre-discuss mailing list