[Lustre-discuss] Lustre Falling over

Wojciech Turek wjt27 at cam.ac.uk
Fri Jan 11 04:58:20 PST 2008


Hi,

I forgot to ask, if your mds is failing over what happens on failover  
mds? Does it recover connections to all clients? Can you also  
describe you failover solution (e.g. heartbeat etc.)?
cheers

Wojciech Turek
On 11 Jan 2008, at 12:18, Charles Taylor wrote:

>
> Were are running a patched (mballoc and others) 1.6.3 lustre on
> x86_64 platform under Centos4.5.   Kernel is ...
>
> Linux submit.ufhpc 2.6.18-8.1.14.el5Lustre #1 SMP Fri Oct 12 15:51:56
> EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
>
> and is the same on the clients and server side.   We have both tcp
> and o2ib NIDS.
>
> It seems to work pretty well most of the time but yesterday a large
> (256-way) job started and it seems that the MDS pretty much just fell
> over.    This occurred at 18:36 (see log below).    The whole cluster
> went into recovery mode as though we had rebooted the MDS.    Prior
> to that we were getting some "target send reply msg" and "no handle
> for file close" errors.
>
> We know that each thread of the 256-way job opens a writes to a file
> independently but roughly concurrently so it probably got hit pretty
> hard all at once.   Nonetheless, we hit it just as hard in our pre-
> production testing and it held up very well.   So the questions are...
>
> 1. Does anyone recognize the errors below as known bugs for which
> there are patches/fixes we missed?
>
> 2. Looking at the tracebacks, I can't help but wonder if there is
> some ldlm parameter we need to tweak for a cluster of our size.
>
> Note that we have 320 o2ib clients and 80 tcp clients (400 nodes/1600
> cores).    Our MDS is a 4-way Opteron 2212 w/ 8GB RAM.     We have 8
> OSS's and 24 OST's (3 per OSS).   Each OSS is a 4-way Opteron 275
> with 4GB RAM and 3 2Gb/s FC HCAs connected to Xyratex 4900FC storage
> arrays.
>
> Any ideas?   Suggestions?
>
> Thanks,
>
> Charlie Taylor
> UF HPC Center
>
> Jan 10 13:34:54 hpcmds kernel: LustreError: 24816:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> req at ffff810122fe1e00 x183667281/t0 o35-
>> 775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:34:54 hpcmds kernel: LustreError: 24816:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 2 previous similar messages
> Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(mds_open.c:
> 1474:mds_close()) @@@ no handle for file close ino 30269183: cookie
> 0x23710b5b3ed594eb  req at ffff810116300000 x183716808/t0 o35-
>> 775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(mds_open.c:
> 1474:mds_close()) Skipped 6 previous similar messages
> Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> req at ffff810116300000 x183716808/t0 o35-
>> 775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 6 previous similar messages
> Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(mds_open.c:
> 1474:mds_close()) @@@ no handle for file close ino 30298600: cookie
> 0x23710b5b3c868998  req at ffff810119cf1400 x183809041/t0 o35-
>> 775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(mds_open.c:
> 1474:mds_close()) Skipped 1 previous similar message
> Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> req at ffff810119cf1400 x183809041/t0 o35-
>> 775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 1 previous similar message
> Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(mds_open.c:
> 1474:mds_close()) @@@ no handle for file close ino 30205356: cookie
> 0x23710b5b3dfc4c91  req at ffff8101189a3600 x183817543/t0 o35-
>> 775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(mds_open.c:
> 1474:mds_close()) Skipped 5 previous similar messages
> Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> req at ffff8101189a3600 x183817543/t0 o35-
>> 775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 5 previous similar messages
> Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(mds_open.c:
> 1474:mds_close()) @@@ no handle for file close ino 24185989: cookie
> 0x23710b5b3ed9874c  req at ffff81012fa67c00 x183830524/t0 o35-
>> 775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(mds_open.c:
> 1474:mds_close()) Skipped 1 previous similar message
> Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> req at ffff81012fa67c00 x183830524/t0 o35-
>> 775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 1 previous similar message
> Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(mds_open.c:
> 1474:mds_close()) @@@ no handle for file close ino 28959573: cookie
> 0x23710b5b3eda7dd4  req at ffff8101161d0c00 x183871936/t0 o35-
>> 775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(mds_open.c:
> 1474:mds_close()) Skipped 1 previous similar message
> Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> req at ffff8101161d0c00 x183871936/t0 o35-
>> 775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 1 previous similar message
> Jan 10 13:51:11 hpcmds kernel: LustreError: 3730:0:(mds_open.c:
> 1474:mds_close()) @@@ no handle for file close ino 30137171: cookie
> 0x23710b5b3c790ddb  req at ffff810119320400 x183904052/t0 o35-
>> 775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:51:11 hpcmds kernel: LustreError: 3730:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> req at ffff810119320400 x183904052/t0 o35-
>> 775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:53:00 hpcmds kernel: Lustre: ufhpc-MDT0000: haven't heard
> from client 8514a628-1f97-0759-0932-de3e470cbb4f (at
> 10.13.26.48 at o2ib) in 2252 seconds. I think it's dead, and I am
> evicting it.
> Jan 10 13:53:00 hpcmds kernel: Lustre: MGS: haven't heard from client
> 3e041a15-91e3-5c94-78fa-3b511d86bcd9 (at 10.13.26.48 at o2ib) in 2252
> seconds. I think it's dead, and I am evicting it.
> Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(mds_open.c:
> 1474:mds_close()) @@@ no handle for file close ino 30296257: cookie
> 0x23710b5b3c18eca5  req at ffff81011f3d3000 x184041701/t0 o35-
>> 775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(mds_open.c:
> 1474:mds_close()) Skipped 2 previous similar messages
> Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> req at ffff81011f3d3000 x184041701/t0 o35-
>> 775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 2 previous similar messages
> Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(mds_open.c:
> 1474:mds_close()) @@@ no handle for file close ino 29841036: cookie
> 0x23710b5b3d470bf2  req at ffff81011860cc00 x184196858/t0 o35-
>> 775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(mds_open.c:
> 1474:mds_close()) Skipped 4 previous similar messages
> Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> req at ffff81011860cc00 x184196858/t0 o35-
>> 775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 4 previous similar messages
> Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(mds_open.c:
> 1474:mds_close()) @@@ no handle for file close ino 30101082: cookie
> 0x23710b5b3e8e62f1  req at ffff810117d1e400 x184423975/t0 o35-
>> 775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(mds_open.c:
> 1474:mds_close()) Skipped 3 previous similar messages
> Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> req at ffff810117d1e400 x184423975/t0 o35-
>> 775cf9dd-3a19-227f-8fe6-60b97358df87 at NET_0x500000a0d1851_UUID:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 3 previous similar messages
> Jan 10 18:36:57 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c:
> 210:waiting_locks_callback()) ### lock callback timer expired:
> evicting client e23549c4-cd61-
> b385-8f2d-1c125e41e7ef at NET_0x500000a0d1a78_UUID nid
> 10.13.26.120 at o2ib  ns: mds-ufhpc-MDT0000_UUID lock:
> ffff8100bb893680/0x23710b5b59ab4680 lrc: 1/0,0 mode: CR/CR res:
> 30066994/507692538 bits 0x3 rrc: 309 type: IBT flags: 4000030 remote:
> 0x4a51d1af1d561084 expref: 13 pid 31402
> Jan 10 18:36:57 hpcmds kernel: LustreError: 31365:0:(handler.c:
> 1498:mds_handle()) operation 101 on unconnected MDS from
> 12345-10.13.26.120 at o2ib
> Jan 10 18:36:57 hpcmds kernel: LustreError: 31365:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-107)
> req at ffff8101191b9400 x10882458/t0 o101-><?>@<?>:-1 lens 512/0 ref 0
> fl Interpret:/0/0 rc -107/0
> Jan 10 18:36:57 hpcmds kernel: LustreError: 31365:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 12 previous similar messages
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff8856bd62>] :obdclass:class_handle2object+0xd2/0x160
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885f04e0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x6b0
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885ec6a0>] :ptlrpc:ldlm_handle_enqueue+0xda0/0x1420
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff888430bd>] :mds:mds_handle
> +0x478d/0x59af
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff8860a030>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff8856bd62>] :obdclass:class_handle2object+0xd2/0x160
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885f04e0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x6b0
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff88607be5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff888430bd>] :mds:mds_handle
> +0x478d/0x59af
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff8860a030>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff8856bd62>] :obdclass:class_handle2object+0xd2/0x160
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff88611617>] :ptlrpc:ptlrpc_main+0x13c7/0x1d70
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff88607be5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80086a74>]
> default_wake_function+0x0/0xe
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8001c26c>] __mod_timer
> +0xb0/0xbe
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff8860a030>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8005be25>] child_rip+0xa/ 
> 0x11
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff88611617>] :ptlrpc:ptlrpc_main+0x13c7/0x1d70
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff88607be5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff88610250>] :ptlrpc:ptlrpc_main+0x0/0x1d70
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80086a74>]
> default_wake_function+0x0/0xe
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8001c26c>] __mod_timer
> +0xb0/0xbe
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8005be1b>] child_rip 
> +0x0/0x11
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8005be25>] child_rip+0xa/ 
> 0x11
> Jan 10 18:45:17 hpcmds kernel:
> Jan 10 18:45:17 hpcmds kernel: ll_mdt_100    S ffff81022c2315e0     0
> 31377      1         31378 31376
> [<ffffffff88611617>] :ptlrpc:ptlrpc_main+0x13c7/0x1d70
> Jan 10 18:45:17 hpcmds kernel:  (L-TLB)
> Jan 10 18:45:17 hpcmds kernel:  ffff81022c2315e0 ffff81022c231510
> ffff81022c231510 000000000000000a
> Jan 10 18:45:17 hpcmds kernel:  ffff8102206c90c0 ffff81013396f040
> 000abe5ed9a075d0 [<ffffffff88610250>] :ptlrpc:ptlrpc_main+0x0/0x1d70
> Jan 10 18:45:17 hpcmds kernel:  0000000000009a43
> Jan 10 18:45:17 hpcmds kernel:  ffff8102206c92a8 0000000000000002
> ffffffff885ceb51 ffffffffffffffff
> Jan 10 18:45:17 hpcmds kernel: Call Trace:
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80086a74>]
> default_wake_function+0x0/0xe
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8005be1b>] child_rip 
> +0x0/0x11
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8005be25>] child_rip+0xa/ 
> 0x11
> Jan 10 18:45:17 hpcmds kernel:
> Jan 10 18:45:17 hpcmds kernel: ll_mdt_17     S ffff81011b57f6a0     0
> 26533      1         26534 26532 (L-TLB)
> Jan 10 18:45:17 hpcmds kernel:  ffff81011b57f6a0
> [<ffffffff885ceb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0
> Jan 10 18:45:17 hpcmds kernel:  ffff81011b57f5d0 ffff81011b57f5d0
> 000000000000000a
> Jan 10 18:45:17 hpcmds kernel:  ffff810118f48040 ffff8101339010c0
> 000abe5ed9897be8 0000000000008bd4
> Jan 10 18:45:17 hpcmds kernel:  ffff810118f48228 0000000000000001
> ffffffff885ceb51 ffffffffffffffff
> Jan 10 18:45:17 hpcmds kernel: Call Trace:
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff88610250>] :ptlrpc:ptlrpc_main+0x0/0x1d70
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8005be1b>] child_rip 
> +0x0/0x11
> Jan 10 18:45:17 hpcmds kernel:
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885e6650>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885ceb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff800611f7>] schedule_timeout
> +0x8a/0xad
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80092c5e>] process_timeout
> +0x0/0x5
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885e6650>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff800611f7>] schedule_timeout
> +0x8a/0xad
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885e7aad>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80092c5e>] process_timeout
> +0x0/0x5
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885d05e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885e7aad>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80086a74>]
> default_wake_function+0x0/0xe
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885d05e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885cce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/ 
> 0x90
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80086a74>]
> default_wake_function+0x0/0xe
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885e4460>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885cce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/ 
> 0x90
>
>
> Followed by tons more of the above tracebacks.
>
>
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss




More information about the lustre-discuss mailing list