[Lustre-discuss] How long should an MDS failover event block

stumped botemout at gmail.com
Wed Apr 30 07:53:37 PDT 2008


Greetings,

I'm using Lustre 1.6.4.3 and have two MDS servers in a Heartbeat
failover pair, plus two OSSes (each with one OST).  A client mounts
the filesystem with

  mount -t lustre mds01@tcp0:mds02@tcp0:/i3_lfs3  /mnt/lustre/i3_lfs3

and everything is great.

I then run this on the client:

while true; do echo "----------- $(date)"; touch $(date +%s); sleep 2;
done
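Incidentally, a variant of that loop that times each touch reports the
stall directly instead of leaving a gap between timestamps (finite
iterations and throwaway file names here, just for illustration):

```shell
# Measure how long each metadata operation blocks, so an MDS failover
# shows up as a per-op latency rather than a missing timestamp.
for i in 1 2 3; do
  start=$(date +%s)
  touch "probe.$i"          # metadata op that must reach the MDS
  end=$(date +%s)
  echo "----------- $(date)  touch took $((end - start))s"
  sleep 1
done
```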

I then turn off power to mds01, and the output of the above script
is:

----------- Wed Apr 30 08:31:29 MDT 2008
----------- Wed Apr 30 08:31:31 MDT 2008
----------- Wed Apr 30 08:34:53 MDT 2008
----------- Wed Apr 30 08:38:15 MDT 2008
----------- Wed Apr 30 08:38:17 MDT 2008

In other words, my client blocks for almost 7 minutes (two stalls of
roughly 3.5 minutes each).

Is this what I should expect?

Can I reasonably shrink this time?  Since some of it is spent while
the MDT is being "recovered" - as shown by
     cat /proc/fs/lustre/mds/i3_lfs3-MDT0000/recovery_status
- should I expect this time to grow as the file system fills up
(eventually it'll hold about 700TB)?
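For reference, here's what I've been poking at while it recovers: the
recovery_status file above, and the global Lustre timeout, which I
assume (from the 1.6 manual) drives both the client reconnect interval
and the server's recovery window.  The paths and values are from my
setup and may well need adjusting:

```shell
# Watch recovery progress on the backup MDS during failover
# (adjust fsname/MDT index for your setup):
watch -n 5 cat /proc/fs/lustre/mds/i3_lfs3-MDT0000/recovery_status

# Global Lustre RPC timeout (default 100s in 1.6); I assume lowering
# it would shrink the blocked time, at the cost of more spurious
# timeouts on a loaded network:
cat /proc/sys/lustre/timeout
echo 30 > /proc/sys/lustre/timeout
```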

Also, the MDS emits the stack trace below, which doesn't look good;
should I be worried?  As I said, after 7 minutes or so everything
seems to be working (though I'm not really doing much with the
filesystem yet).

Thanks
JR



Apr 30 08:36:33 mds02 kernel: Lustre: 30256:0:(ldlm_lib.c:747:target_handle_connect()) i3_lfs3-MDT0000: refuse reconnection from b1d47bed-6eca-7ccf-45e8-c39f930b361f@10.200.20.63@tcp to 0xffff810203c6a000; still busy with 3 active RPCs
Apr 30 08:36:33 mds02 kernel: LustreError: 30256:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-16) req@ffff810227188e00 x19317/t0 o38->b1d47bed-6eca-7ccf-45e8-c39f930b361f@NET_0x200000ac8143f_UUID:-1 lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
Apr 30 08:37:00 mds02 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 30253: it was inactive for 100s
Apr 30 08:37:00 mds02 kernel: Lustre: 0:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for process 30253
Apr 30 08:37:00 mds02 kernel: ll_mdt_12     S ffff810001071780     0 30253      1         30254 30252 (L-TLB)
Apr 30 08:37:00 mds02 kernel: ffff810201c5b798 0000000000000046 ffff810201c5b7b8 000000000000000a
Apr 30 08:37:00 mds02 kernel:        ffff8102024559d8 ffff810202455790 ffff8102278ad040 00003b4ad4efe62e
Apr 30 08:37:00 mds02 kernel:        00000000000068eb 0000000500000000
Apr 30 08:37:00 mds02 kernel: Call Trace: <ffffffff8013acbf>{lock_timer_base+27} <ffffffff884f0f70>{:ptlrpc:ldlm_expired_completion_wait+0}
Apr 30 08:37:00 mds02 kernel:        <ffffffff802d91a1>{schedule_timeout+138} <ffffffff8013b1d8>{process_timeout+0}
Apr 30 08:37:00 mds02 kernel:        <ffffffff884f23cd>{:ptlrpc:ldlm_completion_ast+861}
Apr 30 08:37:00 mds02 kernel:        <ffffffff884da629>{:ptlrpc:ldlm_lock_enqueue+1369} <ffffffff8012ae07>{default_wake_function+0}
Apr 30 08:37:00 mds02 kernel:        <ffffffff884d6e8a>{:ptlrpc:ldlm_lock_addref_internal_nolock+58}
Apr 30 08:37:00 mds02 kernel:        <ffffffff884eed90>{:ptlrpc:ldlm_blocking_ast+0} <ffffffff884f0004>{:ptlrpc:ldlm_cli_enqueue_local+1108}
Apr 30 08:37:00 mds02 kernel:        <ffffffff8874ce27>{:mds:mds_fid2locked_dentry+471} <ffffffff884f2070>{:ptlrpc:ldlm_completion_ast+0}
Apr 30 08:37:00 mds02 kernel:        <ffffffff8874d5e7>{:mds:mds_getattr_lock+1783} <ffffffff887588a1>{:mds:mds_intent_policy+1489}
Apr 30 08:37:00 mds02 kernel:        <ffffffff884dd7d6>{:ptlrpc:ldlm_resource_putref+438}
Apr 30 08:37:00 mds02 kernel:        <ffffffff884da1c3>{:ptlrpc:ldlm_lock_enqueue+243} <ffffffff884d7bfd>{:ptlrpc:ldlm_lock_create+2477}
Apr 30 08:37:00 mds02 kernel:        <ffffffff8850f1e5>{:ptlrpc:lustre_swab_buf+197} <ffffffff884fa8c0>{:ptlrpc:ldlm_server_completion_ast+0}
Apr 30 08:37:00 mds02 kernel:        <ffffffff884f6ff7>{:ptlrpc:ldlm_handle_enqueue+3495}
Apr 30 08:37:00 mds02 kernel:        <ffffffff884fae30>{:ptlrpc:ldlm_server_blocking_ast+0}
Apr 30 08:37:00 mds02 kernel:        <ffffffff88753fad>{:mds:mds_handle+18141} <ffffffff88479d92>{:obdclass:class_handle2object+210}
Apr 30 08:37:00 mds02 kernel:        <ffffffff88511660>{:ptlrpc:lustre_swab_ptlrpc_body+0}
Apr 30 08:37:00 mds02 kernel:        <ffffffff8850f1e5>{:ptlrpc:lustre_swab_buf+197} <ffffffff8013adf9>{__mod_timer+173}
Apr 30 08:37:00 mds02 kernel:        <ffffffff88519327>{:ptlrpc:ptlrpc_main+4903} <ffffffff8012ae07>{default_wake_function+0}
Apr 30 08:37:00 mds02 kernel:        <ffffffff8010bb8a>{child_rip+8} <ffffffff88518000>{:ptlrpc:ptlrpc_main+0}
Apr 30 08:37:00 mds02 kernel:        <ffffffff8010bb82>{child_rip+0}
Apr 30 08:37:00 mds02 kernel: LustreError: 30253:0:(ldlm_request.c:64:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1209566120, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-i3_lfs3-MDT0000_UUID lock: ffff8102002cda80/0xdd61b8a90a25a009 lrc: 3/1,0 mode: --/CR res: 1096876033/1115803706 bits 0x3 rrc: 4 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 30253
Apr 30 08:37:00 mds02 kernel: LustreError: dumping log to /tmp/lustre-log.1209566220.30253