[Lustre-discuss] root on lustre and timeouts

Robin Humble robin.humble+lustre at anu.edu.au
Thu Apr 30 08:48:56 PDT 2009


On Wed, Apr 29, 2009 at 10:42:44AM -0500, Troy Benjegerdes wrote:
>On Wed, Apr 29, 2009 at 10:39:20AM -0400, Robin Humble wrote:
>> we are (happily) using read-only root-on-Lustre in production with
>> oneSIS, but have noticed something odd...
>> 
>> if a root-on-Lustre client node has been up for more than 10 or 12 hours
>> then it survives an MDS failure/failover/reboot event(*), but if the
>> client is newly rebooted and has been up for less than this time, then
>> it doesn't successfully reconnect after an MDS event and the node is
>> ~dead.
>> 
>> by trial and error I've also found that if I rsync /lib64, /bin, and
>> /sbin from Lustre to a root ramdisk, 'echo 3 > /proc/sys/vm/drop_caches',
>> and symlink the rest of dirs to Lustre then the node sails through MDS
>> events. leaving out any one of the dirs/steps leads to a dead node. so
>> it looks like the Lustre kernel's recovery process is somehow tied to
>> userspace via apps in /bin and /sbin?
>
>Now that's interesting... What distro are you using? I have been toying
>with the idea of modifying the Debian initramfs-tools boot ramdisk to
>include busybox and dropbear-ssh in order to debug these kinds of
>root-network-filesystem bugs.

yeah, putting an ssh server into the initramfs is certainly possible.
I've mostly used IPMI Serial-over-LAN, lots of echoes, and occasionally
dropping into /bin/ash to debug problems.

>In my case, I'm running AFS as the root
>filesystem, and I have the 'afsd' in the ramdisk that gets started at
>boot. I'm wondering if the Lustre binaries that are necessary could be
>placed in the initrd as well.

cool.

I'm mostly working with CentOS 5.2 and 5.3, with a oneSIS initramfs as
a starting point.  http://onesis.sourceforge.net/

for pure root-on-Lustre, and with a recent kernel that accepts a huge
initramfs (RHEL/CentOS kernels are too old), the minimum initramfs
requirements would probably just be a 64bit busybox build with
/sbin/mount.lustre and piles of IB and Lustre kernel modules.
the /init script can then be altered slightly to mount a Lustre fs and
then bind mount the OS image sub-directory to the right place before
you switch_root to it, and after that it's just like the normal oneSIS
NFS read-only root...
nothing particularly tricky.
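
a minimal sketch of those extra /init steps (the MGS NID, fs name and
image path here are all made up):

    # mount the Lustre fs, bind the OS image into place, then pivot
    modprobe lustre
    mount -t lustre 10.0.0.1@o2ib:/testfs /mnt/lustre
    mount --bind /mnt/lustre/images/centos-5.3 /sysroot
    exec switch_root /sysroot /sbin/init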

for those older kernels I found the IB modules would fit but the
Lustre modules were too large for the initramfs to handle (I thought the
bad old days of initrd size limitations were over?!). so I needed to
rsync the correct /lib/modules/`uname -r`/ tree, or just specific
modules, into the ramfs before I could fire up Lustre. hence rsync
itself needed to be in the initramfs.
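
e.g. something along these lines, assuming a boot server reachable over
ssh (the server name is made up):

    # pull the matching modules tree into the ramfs before modprobe'ing
    rsync -a bootserver:/lib/modules/`uname -r`/ /lib/modules/`uname -r`/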

once rsync is there, things get pretty flexible, and hybrid approaches
with some/all of the OS in a ramdisk (or on local disk) and some on
Lustre become pretty easy to play with :-)

following the oneSIS approach, I pass a bunch of possible boot variants
to the /init script via /proc/cmdline, so a single initramfs can be
pointed at different OS root images on different Lustre fs's, do
different bind mounts, or be told to install various parts of the OS
onto different media.
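
the cmdline handling in /init is just the usual shell loop - a rough
sketch (the option names here are made up):

    # parse boot variants off the kernel command line
    for opt in $(cat /proc/cmdline); do
        case $opt in
            lustre_root=*) LUSTRE_ROOT=${opt#lustre_root=} ;;
            image=*)       IMAGE=${opt#image=} ;;
        esac
    done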

in production for OSS's and MDS's we use an all-on-ramdisk Lustre-free
(for obvious reasons) variant, and we will probably migrate our current
pure Lustre root compute nodes to the hybrid model soon.

hopefully I'll tidy/generalise the code and push some of this back to
oneSIS at some stage.

the key changes I made from the basic oneSIS initramfs are probably:
 - compile up a 64bit busybox (big filesystems didn't seem to work with
   32bit busybox IIRC) against glibc (not uClibc), as glibc is needed
   for rsync anyway. I just used the oneSIS busybox bbconfig 'cos I
   don't know much about busybox.
 - get ssh and rsync running in the initramfs. I put the cluster's
   usual ssh in there rather than dropbear as I needed it working without
   a password. an rsync server to boot from would also be possible, and
   then maybe ssh wouldn't be needed in the initramfs. quite a few shared
   libs are needed to get ssh and rsync working (see the sketch after
   this list).
 - put mount.lustre into the initramfs
 - include IB and (if they fit) Lustre modules in the initramfs
 - start editing /init to mount, rsync, bind mount, ... things to where
   you want them to be.
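
gathering those shared libs is mostly mechanical - a rough sketch, where
$INITRAMFS and the binary paths are illustrative:

    # copy ssh/rsync plus whatever shared libs ldd says they pull in
    # (don't forget the dynamic loader, ld-linux*.so, as well)
    for b in /usr/bin/ssh /usr/bin/rsync; do
        cp $b $INITRAMFS/usr/bin/
        ldd $b | awk '/=> \//{print $3}' | xargs -I{} cp {} $INITRAMFS/lib64/
    done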

>It would be nice if various distros could work 'out of the box' with
>readonly network filesystems. 

definitely.

sadly the hybrid approach (which will probably always need quite a bit
of tweaking) ultimately might be the best way forward, as it's good to
have the option of offloading some commonly used libs and dirs from
Lustre and having them in local ram or on local SSD/disk/USB, etc. - a
bit more scalable.

having said that, we haven't noticed any scalability problems with 150+
clients yet, except a little load on the MDS when all nodes execute the
same command at once (cexec, pdsh etc.).

BTW, as was pointed out in one talk at this year's LUG, Lustre 1.8's
OSS read cache should help things like root-on-Lustre because small,
commonly used files will likely be cached on the OSS's and won't result
in disk accesses.

cheers,
robin


