[Lustre-discuss] root on lustre and timeouts

Andreas Dilger adilger at sun.com
Wed Apr 29 13:48:34 PDT 2009


On Apr 29, 2009  10:39 -0400, Robin Humble wrote:
> we are (happily) using read-only root-on-Lustre in production with
> oneSIS, but have noticed something odd...
> 
> if a root-on-Lustre client node has been up for more than 10 or 12 hours
> then it survives an MDS failure/failover/reboot event(*), but if the
> client is newly rebooted and has been up for less than this time, then
> it doesn't successfully reconnect after an MDS event and the node is
> ~dead.
> 
> by trial and error I've also found that if I rsync /lib64, /bin, and
> /sbin from Lustre to a root ramdisk, 'echo 3 > /proc/sys/vm/drop_caches',
> and symlink the rest of dirs to Lustre then the node sails through MDS
> events. leaving out any one of the dirs/steps leads to a dead node. so
> it looks like the Lustre kernel's recovery process is somehow tied to
> userspace via apps in /bin and /sbin?
> 
> I can reproduce the weird 10-12hr behaviour at will by changing the
> clock on nodes in a toy Lustre test setup. ie.
>  - servers and client all have the correct time
>  - reboot client node
>  - stop ntpd everywhere
>  - use 'date --set ...' to set all clocks to be X hours in the future
>  - cause an MDS event(*)
>  - wait for recovery to complete
>  - if X <= ~10 to 12 then the client will be dead

This shouldn't really happen.  We of course test failover with client
uptimes a lot less than 10-12h without problems, though not with root
filesystems on Lustre.  If you can provide any MDS console messages
that are unique to the failing short-lived clients (vs. the long-lived
ones), that might point us in the right direction.
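
For example (just a sketch -- adjust the paths, and on a root-on-Lustre
node make sure the output goes somewhere local rather than onto Lustre),
capturing logs on one short-lived and one long-lived client around the
same MDS event would give us something to compare against the MDS console:

	# before triggering the MDS event, clear the existing logs so that
	# only messages from the event itself remain
	client# dmesg -c > /dev/null
	client# lctl clear

	# ... trigger the MDS failover/reboot, wait for recovery ...

	# afterwards, save what each client logged (example paths; use a
	# location that is not on Lustre, e.g. the ramdisk)
	client# dmesg > /tmp/dmesg.$(hostname).log
	client# lctl dk /tmp/lustre-debug.$(hostname).log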

One of the few things that is time-dependent on the client is the
DLM lock LRU list.  Idle locks will expire from the client cache
over time.  You can force a flush of the client's MDS lock cache with:

	# check how many metadata locks client currently has
	client# cat /proc/fs/lustre/ldlm/namespaces/*mdc*/lru_size

	# drop all currently-unused metadata locks from the client's LRU
	client# echo clear > /proc/fs/lustre/ldlm/namespaces/*mdc*/lru_size
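
It might also be worth checking whether the 10-12h window lines up with
the lock LRU aging on your clients.  Assuming your version exposes it
(and note the units have varied between releases), something like:

	# how long an idle lock may sit in the LRU before being cancelled
	client# cat /proc/fs/lustre/ldlm/namespaces/*mdc*/lru_max_age

A quick test would be to flush the lock cache (as above) on a
freshly-booted client just before causing an MDS event; if that client
then survives, it would point at the lock LRU aging.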

The MGC shouldn't be the culprit since it only holds a single lock
that never expires.
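
If you want to confirm that anyway, the per-namespace lock counts can be
checked (assuming lock_count is present in your version); the MGC entry
should stay at one:

	# number of locks currently held in the MGC namespace
	client# cat /proc/fs/lustre/ldlm/namespaces/MGC*/lock_count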

> it's no big deal to put those 3 dirs into ramdisk as they're really
> small (and the part-on-ramdisk model is nice and flexible too), so
> we'll probably move to running in this way anyway, but I'm still
> curious as to why a kernel-only system like Lustre a) cares about
> userspace at all during recovery b) why it has a 10-12hr timescale :-)

It would be good to know the root cause of this problem, as it may
expose a defect in another part of the code.

> changing the contents of /proc/sys/lnet/upcall into some path stat'able
> without Lustre being up doesn't change anything.

There are no longer any upcalls needed on the client for recovery, and
pointing the upcall at a binary that lives on Lustre when Lustre itself
is not accessible is always a bad idea.
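
For reference, what you describe amounts to something like the
following; /bin/true is only an example of a binary that lives on your
ramdisk copy of /bin:

	# see what the client upcall currently points at
	client# cat /proc/sys/lnet/upcall

	# point it at something stat'able without Lustre being mounted
	client# echo /bin/true > /proc/sys/lnet/upcall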

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



