[Lustre-discuss] root on lustre and timeouts

Robin Humble robin.humble+lustre@anu.edu.au
Thu Apr 30 08:20:42 PDT 2009


On Wed, Apr 29, 2009 at 02:48:34PM -0600, Andreas Dilger wrote:
>On Apr 29, 2009  10:39 -0400, Robin Humble wrote:
>> we are (happily) using read-only root-on-Lustre in production with
>> oneSIS, but have noticed something odd...
>> 
>> if a root-on-Lustre client node has been up for more than 10 or 12 hours
>> then it survives an MDS failure/failover/reboot event(*), but if the
>> client is newly rebooted and has been up for less than this time, then
>> it doesn't successfully reconnect after an MDS event and the node is
>> ~dead.
>> 
>> by trial and error I've also found that if I rsync /lib64, /bin, and
>> /sbin from Lustre to a root ramdisk, 'echo 3 > /proc/sys/vm/drop_caches',
>> and symlink the rest of dirs to Lustre then the node sails through MDS
>> events. leaving out any one of the dirs/steps leads to a dead node. so
>> it looks like the Lustre kernel's recovery process is somehow tied to
>> userspace via apps in /bin and /sbin?
>> 
>> I can reproduce the weird 10-12hr behaviour at will by changing the
>> clock on nodes in a toy Lustre test setup. ie.
>>  - servers and client all have the correct time
>>  - reboot client node
>>  - stop ntpd everywhere
>>  - use 'date --set ...' to set all clocks to be X hours in the future
>>  - cause a MDS event(*)
>>  - wait for recovery to complete
>>  - if X <= ~10 to 12 then the client will be dead
>
>This shouldn't really happen.  We of course test failover with client
>uptimes a lot less than 10-12h without problems, though not with root
>filesystems on Lustre.  Providing any MDS console messages that are
>unique to a failing short-lived client vs a long-lived client might
>point us in the right direction.

sadly, there's not much there... our consoles drop characters over IPMI,
so the logs below are from syslog.
fox3 (.213) is the MDS/MGS, fox4 and fox6 (.214, .216) are OSSes, and fox1 (.211) is the client.
the MDS and OSSes run 1.6.6; the client runs 1.6.4.3.
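
for reference, the clock-forward version of the test is roughly the below
(a sketch - the MDT mountpoint is made up for illustration, and the MDS
event (*) here is just an unmount/remount of the MDT, though a failover or
reboot of fox3 behaves the same). X well above ~12 gives the "success" logs
below; a freshly rebooted client with clocks left alone gives the "fail"
logs further down:

  # client fox1 freshly rebooted, then on all of fox1,3,4,6:
  service ntpd stop
  date --set "$(date -d '+13 hours')"   # X=13 here; X <= ~10-12 leaves the client dead

  # MDS event (*) on fox3, e.g. stop/restart the MDT (on /dev/md1):
  umount /mnt/mdt
  mount -t lustre /dev/md1 /mnt/mdt

  # watch recovery on fox3:
  cat /proc/fs/lustre/mds/test-MDT0000/recovery_status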

success (setting clocks forward):

Apr 30 22:58:31 fox3 kernel: Lustre: MGS MGS started
Apr 30 22:58:31 fox3 kernel: Lustre: Server MGS on device /dev/md0 has started
...
Apr 30 22:58:31 fox3 kernel: Lustre: Enabling user_xattr
Apr 30 22:58:31 fox3 kernel: Lustre: 8758:0:(mds_fs.c:493:mds_init_server_data()) RECOVERY: service test-MDT0000, 1 recoverable clients, last_transno 180390470883
Apr 30 22:58:31 fox3 kernel: Lustre: MDT test-MDT0000 now serving test-MDT0000_UUID (test-MDT0000/50ae1644-01d5-bc1a-4e65-954d53f99b50), but will be in recovery for at least 5:00, or until 1 client reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/mds/test-MDT0000/recovery_status.
Apr 30 22:58:31 fox3 kernel: Lustre: 8758:0:(lproc_mds.c:273:lprocfs_wr_group_upcall()) test-MDT0000: group upcall set to /usr/sbin/l_getgroups
Apr 30 22:58:31 fox3 kernel: Lustre: test-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups
Apr 30 22:58:31 fox3 kernel: Lustre: 8758:0:(mds_lov.c:1070:mds_notify()) MDS test-MDT0000: in recovery, not resetting orphans on test-OST0000_UUID
Apr 30 22:58:31 fox3 kernel: Lustre: 8758:0:(mds_lov.c:1070:mds_notify()) MDS test-MDT0000: in recovery, not resetting orphans on test-OST0001_UUID
Apr 30 22:58:31 fox3 kernel: Lustre: Server test-MDT0000 on device /dev/md1 has started
Apr 30 22:58:32 fox6 kernel: Lustre: 11544:0:(import.c:736:ptlrpc_connect_interpret()) MGS@MGC10.8.30.213@o2ib_0 changed server handle from 0x18a8172888292c35 to 0xfdb564f8289dd54
Apr 30 22:58:32 fox6 kernel: but is still in recovery
Apr 30 22:58:32 fox6 kernel: Lustre: MGC10.8.30.213@o2ib: Reactivating import
Apr 30 22:58:32 fox6 kernel: Lustre: MGC10.8.30.213@o2ib: Connection restored to service MGS using nid 10.8.30.213@o2ib.
Apr 30 22:58:38 fox4 kernel: Lustre: 11583:0:(import.c:736:ptlrpc_connect_interpret()) MGS@MGC10.8.30.213@o2ib_0 changed server handle from 0x18a8172888292c0b to 0xfdb564f8289dd69
Apr 30 22:58:38 fox4 kernel: but is still in recovery
Apr 30 22:58:38 fox4 kernel: Lustre: MGC10.8.30.213@o2ib: Reactivating import
Apr 30 22:58:38 fox4 kernel: Lustre: MGC10.8.30.213@o2ib: Connection restored to service MGS using nid 10.8.30.213@o2ib.
Apr 30 22:58:38 fox4 kernel: Lustre: MGC10.8.30.213@o2ib: Connection restored to service MGS using nid 10.8.30.213@o2ib.
Apr 30 22:58:47 fox1 kernel: Lustre: MGC10.8.30.213@o2ib: Reactivating import
Apr 30 22:58:47 fox1 kernel: Lustre: MGC10.8.30.213@o2ib: Connection restored to service MGS using nid 10.8.30.213@o2ib.
Apr 30 22:58:47 fox3 kernel: Lustre: 8690:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) test-MDT0000: starting recovery timer
Apr 30 22:58:47 fox1 kernel: LustreError: 724:0:(client.c:1750:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff81025ec35800 x1960/t180390441407 o101->test-MDT0000_UUID@10.8.30.213@o2ib:12 lens 440/768 ref 2 fl Complete:RP/4/0 rc 301/301
Apr 30 22:58:47 fox1 kernel: LustreError: 724:0:(client.c:1750:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff81025dd4c400 x5132/t180390443405 o101->test-MDT0000_UUID@10.8.30.213@o2ib:12 lens 440/768 ref 2 fl Complete:RP/4/0 rc 301/301
Apr 30 22:58:47 fox3 kernel: Lustre: test-MDT0000: sending delayed replies to recovered clients
Apr 30 22:58:47 fox3 kernel: Lustre: test-MDT0000: recovery complete: rc 0
Apr 30 22:58:47 fox1 kernel: Lustre: test-MDT0000-mdc-ffff81025a2b7800: Connection restored to service test-MDT0000 using nid 10.8.30.213@o2ib.
Apr 30 22:58:47 fox6 kernel: Lustre: test-OST0000: received MDS connection from 10.8.30.213@o2ib
Apr 30 22:58:47 fox3 kernel: Lustre: MDS test-MDT0000: test-OST0000_UUID now active, resetting orphans
Apr 30 22:58:47 fox4 kernel: Lustre: test-OST0001: received MDS connection from 10.8.30.213@o2ib
Apr 30 22:58:47 fox3 kernel: Lustre: MDS test-MDT0000: test-OST0001_UUID now active, resetting orphans
Apr 30 23:03:07 fox3 kernel: Lustre: MGS: haven't heard from client 83d64a6f-bf4f-6d1f-231f-a3e0b08b9ab7 (at 10.8.30.211@o2ib) in 243 seconds. I think it's dead, and I am evicting it.

the last line above confuses me - it looks like something failed, but
the node is completely happy :-/

 # cexec fox:3 cat /proc/fs/lustre/mds/test-MDT0000/recovery_status
status: COMPLETE
recovery_start: 1241136707
recovery_duration: 0
completed_clients: 1/1
replayed_requests: 0
last_transno: 180390470883


fail (node recently booted):

Apr 30 23:40:43 fox3 kernel: Lustre: MGS MGS started
Apr 30 23:40:43 fox3 kernel: Lustre: Server MGS on device /dev/md0 has started
...
Apr 30 23:40:44 fox3 kernel: Lustre: Enabling user_xattr
Apr 30 23:40:44 fox3 kernel: Lustre: 10739:0:(mds_fs.c:493:mds_init_server_data()) RECOVERY: service test-MDT0000, 1 recoverable clients, last_transno 180390483549
Apr 30 23:40:44 fox3 kernel: Lustre: MDT test-MDT0000 now serving test-MDT0000_UUID (test-MDT0000/50ae1644-01d5-bc1a-4e65-954d53f99b50), but will be in recovery for at least 5:00, or until 1 client reconnect. During this time new clients will not be allowed to connect. Recovery progress can be monitored by watching /proc/fs/lustre/mds/test-MDT0000/recovery_status.
Apr 30 23:40:44 fox3 kernel: Lustre: 10739:0:(lproc_mds.c:273:lprocfs_wr_group_upcall()) test-MDT0000: group upcall set to /usr/sbin/l_getgroups
Apr 30 23:40:44 fox3 kernel: Lustre: test-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups
Apr 30 23:40:44 fox3 kernel: Lustre: 10739:0:(mds_lov.c:1070:mds_notify()) MDS test-MDT0000: in recovery, not resetting orphans on test-OST0000_UUID
Apr 30 23:40:44 fox3 kernel: Lustre: 10739:0:(mds_lov.c:1070:mds_notify()) MDS test-MDT0000: in recovery, not resetting orphans on test-OST0001_UUID
Apr 30 23:40:44 fox3 kernel: Lustre: Server test-MDT0000 on device /dev/md1 has started
Apr 30 23:40:52 fox3 kernel: Lustre: 10671:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) test-MDT0000: starting recovery timer
Apr 30 23:41:02 fox6 kernel: Lustre: 11544:0:(import.c:736:ptlrpc_connect_interpret()) MGS@MGC10.8.30.213@o2ib_0 changed server handle from 0xfdb564f8289dd54 to 0xc63759a4fedd9a8
Apr 30 23:41:02 fox6 kernel: but is still in recovery
Apr 30 23:41:02 fox6 kernel: Lustre: MGC10.8.30.213@o2ib: Reactivating import
Apr 30 23:41:02 fox6 kernel: Lustre: MGC10.8.30.213@o2ib: Connection restored to service MGS using nid 10.8.30.213@o2ib.
...

 # cexec fox:3 cat /proc/fs/lustre/mds/test-MDT0000/recovery_status
status: RECOVERING
recovery_start: 1241098852
time_remaining: 278
connected_clients: 1/1
completed_clients: 0/1
replayed_requests: 0/??
queued_requests: 0
next_transno: 180390483550

...
Apr 30 23:42:23 fox4 kernel: Lustre: 11583:0:(import.c:736:ptlrpc_connect_interpret()) MGS@MGC10.8.30.213@o2ib_0 changed server handle from 0xfdb564f8289dd69 to 0xc63759a4fedd9bd
Apr 30 23:42:23 fox4 kernel: but is still in recovery
Apr 30 23:42:23 fox4 kernel: Lustre: MGC10.8.30.213@o2ib: Reactivating import
Apr 30 23:42:23 fox4 kernel: Lustre: MGC10.8.30.213@o2ib: Connection restored to service MGS using nid 10.8.30.213@o2ib.
Apr 30 23:45:52 fox3 kernel: LustreError: 0:0:(ldlm_lib.c:1161:target_recovery_expired()) test-MDT0000: recovery timed out, aborting

 # cexec fox:3 cat /proc/fs/lustre/mds/test-MDT0000/recovery_status
status: RECOVERING
recovery_start: 1241098852
time_remaining: 0
connected_clients: 1/1
completed_clients: 0/1
replayed_requests: 0/??
queued_requests: 0
next_transno: 180390483550


and 'RECOVERING' doesn't end anytime soon in this case - only when I
reset the node (or perhaps wait 10+ hours for something to time out and
get evicted) does it go to COMPLETE.

I'm not sure, but I think that with the newer Lustre versions I tried,
recovery does complete in a timely manner and goes to COMPLETE, yet the
client is still dead.
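
for completeness, the workaround that lets a freshly booted node survive
is roughly the below - a sketch of what our oneSIS boot ends up doing,
with a made-up /lustre_root path standing in for the Lustre-backed root;
in reality / is the ramdisk and the remaining top-level dirs are symlinks
back onto Lustre:

  # copy the three small dirs off Lustre into the root ramdisk
  rsync -a /lustre_root/lib64 /lustre_root/bin /lustre_root/sbin /
  # drop cached pages/dentries/inodes so nothing in them is still backed by Lustre
  echo 3 > /proc/sys/vm/drop_caches
  # everything else stays a symlink onto the Lustre root, e.g.
  ln -s /lustre_root/usr /usr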

>One of the few things that is time dependent on the client is the
>DLM lock LRU list.  Idle locks will expire from the client cache
>over time.  You can force a flush of the client's MDS lock cache with:

cool - thanks - makes sense.

>	# check how many metadata locks client currently has
>	client# cat /proc/fs/lustre/ldlm/namespaces/*mdc*/lru_size
>
>	client# echo clear > /proc/fs/lustre/ldlm/namespaces/*mdc*/lru_size
>
>The MGC shouldn't be the culprit since it only holds a single lock
>that never expires.

darn, that doesn't seem to help :-/
I also echoed 'clear' into all of
  /proc/fs/lustre/ldlm/namespaces/*/lru_size
and that didn't seem to change anything either.
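
for the record, what I ran on the client was roughly this (a sketch):

  # flush the lock LRU of every ldlm namespace, not just the mdc one
  for f in /proc/fs/lustre/ldlm/namespaces/*/lru_size; do
      echo clear > $f
  done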

cheers,
robin

>> it's no big deal to put those 3 dirs into ramdisk as they're really
>> small (and the part-on-ramdisk model is nice and flexible too), so
>> we'll probably move to running in this way anyway, but I'm still
>> curious as to why a kernel-only system like Lustre a) cares about
>> userspace at all during recovery b) why it has a 10-12hr timescale :-)
>
>It would be good to know the root cause of this problem, as it may
>expose a defect in another part of the code.
>
>> changing the contents of /proc/sys/lnet/upcall into some path stat'able
>> without Lustre being up doesn't change anything.
>
>There are no longer upcalls needed on the client for recovery, and having
>the upcall inside Lustre when Lustre itself is not accessible is always a
>bad idea.
>
>Cheers, Andreas
>--
>Andreas Dilger
>Sr. Staff Engineer, Lustre Group
>Sun Microsystems of Canada, Inc.
>
>_______________________________________________
>Lustre-discuss mailing list
>Lustre-discuss@lists.lustre.org
>http://lists.lustre.org/mailman/listinfo/lustre-discuss


