[Lustre-discuss] root on lustre and timeouts

Robin Humble robin.humble+lustre at anu.edu.au
Wed Apr 29 07:39:20 PDT 2009


we are (happily) using read-only root-on-Lustre in production with
oneSIS, but have noticed something odd...

if a root-on-Lustre client node has been up for more than 10 or 12 hours
then it survives an MDS failure/failover/reboot event(*), but if the
client is newly rebooted and has been up for less than this time, then
it doesn't successfully reconnect after an MDS event and the node is
~dead.

by trial and error I've also found that if I rsync /lib64, /bin, and
/sbin from Lustre to a root ramdisk, 'echo 3 > /proc/sys/vm/drop_caches',
and symlink the rest of the dirs to Lustre, then the node sails through
MDS events (rough commands below). leaving out any one of the dirs/steps
leads to a dead node. so it looks like the Lustre kernel's recovery
process is somehow tied to userspace via apps in /bin and /sbin?
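
a minimal sketch of those steps, assuming a tmpfs at /mnt/ram (the mount
point and the bind mounts are just illustrative, not exactly how our
oneSIS image is laid out):

    # copy the three dirs off Lustre into a small ramdisk
    mount -t tmpfs tmpfs /mnt/ram
    for d in lib64 bin sbin; do
        rsync -a /$d/ /mnt/ram/$d/
    done
    # make /lib64 /bin /sbin resolve to the ramdisk copies
    for d in lib64 bin sbin; do
        mount --bind /mnt/ram/$d /$d
    done
    # drop pagecache/dentries/inodes so nothing is still served from Lustre
    echo 3 > /proc/sys/vm/drop_caches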

I can reproduce the weird 10-12hr behaviour at will by changing the
clock on nodes in a toy Lustre test setup, i.e. (commands sketched below):
 - servers and client all have the correct time
 - reboot client node
 - stop ntpd everywhere
 - use 'date --set ...' to set all clocks to be X hours in the future
 - cause a MDS event(*)
 - wait for recovery to complete
 - if X <= ~10 to 12 then the client will be dead
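
in command form that test is roughly (using X=6 hours as an example value):

    # on the servers and the freshly rebooted client
    service ntpd stop
    date --set "$(date -d '6 hours')"    # jump all clocks X hours ahead
    # now trigger the MDS event(*) and let recovery finish;
    # with X below ~10-12 hours the client never reconnects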

it's no big deal to put those 3 dirs into a ramdisk as they're really
small (and the part-on-ramdisk model is nice and flexible too), so
we'll probably move to running this way anyway, but I'm still curious
as to a) why a kernel-only system like Lustre cares about userspace at
all during recovery, and b) why it has a 10-12hr timescale :-)

changing the contents of /proc/sys/lnet/upcall to some path that is
stat'able without Lustre being up doesn't change anything.
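
concretely, that change looks something like this (the path is just an
example; the point is only that the target lives off Lustre):

    # point the lnet upcall at something reachable without Lustre mounted
    cp /bin/true /mnt/ram/lustre-upcall
    echo /mnt/ram/lustre-upcall > /proc/sys/lnet/upcall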

BTW, OSS reboot/failover is handled fine with root on Lustre, as are
regular (non-root-on-Lustre) clients - this behaviour seems to be
limited to MDS/MGS failures when all/almost-all of the OS is on Lustre.

our setup is patchless 1.6.4.3 clients, 1.6.6 servers, rhel5.2/5.3
x86_64, but the behaviour seems the same with much newer Lustre too,
e.g. patched b_release_1_8_0.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

(*) umount mdt and mgs, lustre_rmmod, wait 10 mins, mount them again
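
    in command form that's roughly (mount points and devices here are
    examples, not our real names):

    # on the MDS/MGS node
    umount /mnt/mdt
    umount /mnt/mgs
    lustre_rmmod
    sleep 600                               # wait ~10 minutes
    mount -t lustre /dev/mgs_dev /mnt/mgs
    mount -t lustre /dev/mdt_dev /mnt/mdt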


