[Lustre-discuss] How long should an MDS failover event block

Andreas Dilger adilger@sun.com
Fri May 2 15:14:46 PDT 2008


On Apr 30, 2008  07:53 -0700, stumped wrote:
> I'm using 1.6.4.3 and have 2 MDS servers with Heartbeat.
> I have 2 OSSes (each with one OST).  A client mounts the filesystem
> with
> 
>   mount -t lustre mds01@tcp0:mds02@tcp0:/i3_lfs3  /mnt/lustre/i3_lfs3
> 
> and everything is great.
> 
> I then run this on the client:
> 
> while true; do echo "----------- $(date)"; touch $(date +%s); sleep 2;
> done
> 
> So I turn off the power on mds01 and the output of the above script
> is:
> 
> ----------- Wed Apr 30 08:31:29 MDT 2008
> ----------- Wed Apr 30 08:31:31 MDT 2008
> ----------- Wed Apr 30 08:34:53 MDT 2008
> ----------- Wed Apr 30 08:38:15 MDT 2008
> ----------- Wed Apr 30 08:38:17 MDT 2008
> 
> In other words, my client blocks for almost 7 minutes.
> 
> Is this what I should expect?

Yes, this is pretty reasonable.  We are working to reduce this delay
in 1.8 with a feature called "adaptive timeouts" that will tune the
Lustre recovery timeout as a function of how busy the servers are.

Currently the default timeout is 100s, and for large clusters (say, 500
or more clients) the recommended timeout is 300s.  Under heavy load a
server is in essence undergoing a denial-of-service attack from its own
clients, and each client might see only 1/1000th of the server's idle
performance, which is sometimes hard to distinguish from server failure.
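
If you do decide to change the timeout, something along the following
lines should work.  This is only a sketch - double-check the exact
/proc path and conf_param syntax against the manual for your 1.6.x
release:

  # change the timeout on the running system (repeat on servers and clients)
  echo 300 > /proc/sys/lustre/timeout

  # or set it persistently for the whole filesystem, run once on the MGS
  lctl conf_param i3_lfs3.sys.timeout=300

Keep the trade-off in mind: a shorter timeout speeds up failover, but it
also makes a merely-busy server look dead sooner, which is exactly what
adaptive timeouts is meant to handle automatically.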

> Can I reasonably shrink this time?  Since some of the time is spent
> with the MDT being "recovered" - as shown by doing
>      cat /proc/fs/lustre/mds/i3_lfs3-MDT0000/recovery_status
> can I expect that as my filesystem has more data on it (eventually
> it'll be about 700TB) this time will also increase?

The recovery time is not a function of the amount of data on the
filesystem.  It is mostly a function of the recovery timeout and of how
quickly the clients reconnect and replay their outstanding requests.
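
If you want to see how long recovery actually takes on your
configuration, you can poll the recovery_status file you mention while
a failover is in progress, for example:

  watch -n 5 cat /proc/fs/lustre/mds/i3_lfs3-MDT0000/recovery_status

That will show whether recovery is still in progress and give you a
feel for where the time is going.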

> Also, the MDS gives up this stack trace; doesn't look good; should I
> be worried?  As I said, after 7 minutes or so everything seems to be
> working (though I'm not really doing much with the filesystem yet).
> 
> 
> Apr 30 08:36:33 mds02 kernel: Lustre: 30256:0:(ldlm_lib.c:747:target_handle_connect()) i3_lfs3-MDT0000: refuse reconnection from b1d47bed-6eca-7ccf-45e8-c39f930b361f@10.200.20.63@tcp to 0xffff810203c6a000; still busy with 3 active RPCs
> Apr 30 08:36:33 mds02 kernel: LustreError: 30256:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-16) req@ffff810227188e00 x19317/t0 o38->b1d47bed-6eca-7ccf-45e8-c39f930b361f@NET_0x200000ac8143f_UUID:-1 lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
> Apr 30 08:37:00 mds02 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 30253: it was inactive for 100s
> Apr 30 08:37:00 mds02 kernel: LustreError: 30253:0:(ldlm_request.c:64:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1209566120, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-i3_lfs3-MDT0000_UUID lock: ffff8102002cda80/0xdd61b8a90a25a009 lrc: 3/1,0 mode: --/CR res: 1096876033/1115803706 bits 0x3 rrc: 4 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 30253
> Apr 30 08:37:00 mds02 kernel: LustreError: dumping log to /tmp/lustre-log.1209566220.30253

Probably not ideal, but not fatal.  There should be a message shortly
after this one reporting that the thread became responsive again after
120s or so...

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



