[Lustre-discuss] Re-activating a partial lustre disk--update

Ms. Megan Larko dobsonunit at gmail.com
Wed Sep 3 08:39:28 PDT 2008


Hi,

I tried again to mount read-only my partial lustre disk.   On a client
I issued the command:
>> mount -t lustre -o ro ic-mds1@o2ib:/crew4 /crew4

The command-line prompt returned immediately, but the disk could not be
accessed.  Looking at the MGS/MDT server's /var/log/messages file, I saw
that the mount again tried to re-activate the damaged (but not
non-existent) crew4-OST0000 and crew4-OST0002 parts of this volume.
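
(For reference, I was checking from the client with roughly the commands
below; I'm reconstructing them from memory, so treat this as a sketch.
"lfs df" reports per-OST state as seen by the client.)

>> mount | grep crew4    # confirm the lustre mount is present
>> lfs df /crew4         # per-OST space and state for this file system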

From messages:
Sep  3 11:16:03 mds1 kernel: LustreError:
3361:0:(genops.c:1005:class_disconnect_stale_exports()) crew4-MDT0000:
disconnecting 2 stale clients
Sep  3 11:16:03 mds1 kernel: Lustre: crew4-MDT0000: sending delayed
replies to recovered clients
Sep  3 11:16:03 mds1 kernel: Lustre:
3361:0:(quota_master.c:1100:mds_quota_recovery()) Not all osts are
active, abort quota recovery
Sep  3 11:16:03 mds1 kernel: Lustre:
3361:0:(quota_master.c:1100:mds_quota_recovery()) Not all osts are
active, abort quota recovery
Sep  3 11:16:03 mds1 kernel: Lustre: crew4-MDT0000: recovery complete: rc 0
Sep  3 11:16:03 mds1 kernel: Lustre: MDS crew4-MDT0000:
crew4-OST0001_UUID now active, resetting orphans
Sep  3 11:16:03 mds1 kernel: LustreError:
20824:0:(mds_lov.c:705:__mds_lov_synchronize()) crew4-OST0000_UUID
failed at update_mds: -108
Sep  3 11:16:03 mds1 kernel: LustreError:
20824:0:(mds_lov.c:748:__mds_lov_synchronize()) crew4-OST0000_UUID
sync failed -108, deactivating
Sep  3 11:16:03 mds1 kernel: LustreError:
20826:0:(mds_lov.c:705:__mds_lov_synchronize()) crew4-OST0002_UUID
failed at update_mds: -108
Sep  3 11:16:03 mds1 kernel: LustreError:
20826:0:(mds_lov.c:748:__mds_lov_synchronize()) crew4-OST0002_UUID
sync failed -108, deactivating

I again used lctl, as in my previous post, to deactivate the device IDs
associated with the failed hardware (crew4-OST0000 and crew4-OST0002); a
rough sketch of the commands is below.
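
(The device numbers are taken from my "lctl dl" output further down, so
they will differ on another system; I'm quoting from memory rather than
shell history.)

>> lctl dl                       # list devices; note the osc entries for the dead OSTs
>> lctl --device 14 deactivate   # crew4-OST0000-osc
>> lctl --device 16 deactivate   # crew4-OST0002-osc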

From messages:
Sep  3 11:20:45 mds1 kernel: Lustre: setting import crew4-OST0000_UUID
INACTIVE by administrator request
Sep  3 11:21:04 mds1 kernel: Lustre: setting import crew4-OST0002_UUID
INACTIVE by administrator request

So my current understanding is that the "recovery" status before the
mount was unchanged from when I left the office yesterday...
[root@mds1 crew4-MDT0000]# cat recovery_status
status: RECOVERING
recovery_start: 1220380113
time remaining: 0
connected_clients: 0/2
completed_clients: 0/2
replayed_requests: 0/??
queued_requests: 0
next_transno: 112339940
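
(That recovery_status is the proc file for the MDT; assuming I have the
1.6 layout right, the full path would be something like:

>> cat /proc/fs/lustre/mds/crew4-MDT0000/recovery_status
)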

And that the client is not able to use a partial lustre disk...

From messages on client:
Sep  3 11:22:58 crew01 kernel: Lustre: setting import
crew4-MDT0000_UUID INACTIVE by administrator request
Sep  3 11:22:58 crew01 kernel: Lustre: setting import
crew4-OST0000_UUID INACTIVE by administrator request
Sep  3 11:22:58 crew01 kernel: LustreError:
8832:0:(llite_lib.c:1520:ll_statfs_internal()) obd_statfs fails: rc =
-5

...and the mount hangs, I guess waiting for the bad OSTs to return.

However, on the MGS/MDS the bad disks are automatically re-activated???
  I ran the "lctl dl" below ten minutes after the above transactions:
13 UP lov crew4-mdtlov crew4-mdtlov_UUID 4
 14 UP osc crew4-OST0000-osc crew4-mdtlov_UUID 5
 15 UP osc crew4-OST0001-osc crew4-mdtlov_UUID 5
 16 UP osc crew4-OST0002-osc crew4-mdtlov_UUID 5
 17 UP mds crew4-MDT0000 crew4-MDT0000_UUID 5
 18 UP osc crew4-OST0003-osc crew4-mdtlov_UUID 5
 19 UP osc crew4-OST0004-osc crew4-mdtlov_UUID 5

The crew4-OST0000 and crew4-OST0002 devices are again listed as UP.  Why?
Can I echo a zero or a one to a /proc/fs/lustre file somewhere to keep
these volumes from being re-activated?
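
(If I am reading the manual correctly, the persistent form would be
something like the lines below, run on the MGS.  I have not tried this
yet, so treat it as a guess rather than a working recipe.)

>> lctl conf_param crew4-OST0000.osc.active=0   # mark the OST permanently inactive
>> lctl conf_param crew4-OST0002.osc.active=0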

This is CentOS 4, Linux kernel 2.6.18-53.1.13.el5, with lustre-1.6.4.3smp.

Have a nice day!
megan


