[Lustre-discuss] Re-activating a partial lustre disk--update
Ms. Megan Larko
dobsonunit@gmail.com
Wed Sep 3 08:39:28 PDT 2008
Hi,
I tried again to mount my partial Lustre disk read-only. On a client
I issued the command:
>>mount -t lustre -o ro ic-mds1@o2ib:/crew4 /crew4
The command-line prompt returned immediately, but the disk could not be
accessed. Looking at the /var/log/messages file on the MGS/MDT server,
I saw that the mount again tried to re-activate crew4-OST0000 and
crew4-OST0002, the damaged (but not non-existent) parts of this
volume.
From messages:
Sep 3 11:16:03 mds1 kernel: LustreError:
3361:0:(genops.c:1005:class_disconnect_stale_exports()) crew4-MDT0000:
disconnecting 2 stale clients
Sep 3 11:16:03 mds1 kernel: Lustre: crew4-MDT0000: sending delayed
replies to recovered clients
Sep 3 11:16:03 mds1 kernel: Lustre:
3361:0:(quota_master.c:1100:mds_quota_recovery()) Not all osts are
active, abort quota recovery
Sep 3 11:16:03 mds1 kernel: Lustre:
3361:0:(quota_master.c:1100:mds_quota_recovery()) Not all osts are
active, abort quota recovery
Sep 3 11:16:03 mds1 kernel: Lustre: crew4-MDT0000: recovery complete: rc 0
Sep 3 11:16:03 mds1 kernel: Lustre: MDS crew4-MDT0000:
crew4-OST0001_UUID now active, resetting orphans
Sep 3 11:16:03 mds1 kernel: LustreError:
20824:0:(mds_lov.c:705:__mds_lov_synchronize()) crew4-OST0000_UUID
failed at update_mds: -108
Sep 3 11:16:03 mds1 kernel: LustreError:
20824:0:(mds_lov.c:748:__mds_lov_synchronize()) crew4-OST0000_UUID
sync failed -108, deactivating
Sep 3 11:16:03 mds1 kernel: LustreError:
20826:0:(mds_lov.c:705:__mds_lov_synchronize()) crew4-OST0002_UUID
failed at update_mds: -108
Sep 3 11:16:03 mds1 kernel: LustreError:
20826:0:(mds_lov.c:748:__mds_lov_synchronize()) crew4-OST0002_UUID
sync failed -108, deactivating
I again used lctl, as in my previous post, to deactivate the device
IDs associated with the failed hardware (crew4-OST0000 and
crew4-OST0002).
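For reference, a sketch of that deactivation step, assuming the
Lustre 1.6 "lctl --device N deactivate" syntax. The heredoc below
reproduces the relevant OSC device lines from the lctl dl output
later in this post, since local device numbers vary per system:

```shell
# Stand-in for `lctl dl | grep osc` on the MDS (device numbers 14 and
# 16 are from this post; yours will differ).
cat <<'EOF' > /tmp/lctl_dl.txt
14 UP osc crew4-OST0000-osc crew4-mdtlov_UUID 5
16 UP osc crew4-OST0002-osc crew4-mdtlov_UUID 5
EOF
# Pick out the local device numbers of the OSCs for the failed OSTs.
awk '/crew4-OST0000-osc|crew4-OST0002-osc/ {print $1}' /tmp/lctl_dl.txt
# On the MDS, each of those numbers would then be deactivated with:
#   lctl --device 14 deactivate
#   lctl --device 16 deactivate
```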
From messages:
Sep 3 11:20:45 mds1 kernel: Lustre: setting import crew4-OST0000_UUID
INACTIVE by administrator request
Sep 3 11:21:04 mds1 kernel: Lustre: setting import crew4-OST0002_UUID
INACTIVE by administrator request
So my current understanding is that the "recovery" status before the
mount was unchanged from when I left the office yesterday...
[root@mds1 crew4-MDT0000]# cat recovery_status
status: RECOVERING
recovery_start: 1220380113
time remaining: 0
connected_clients: 0/2
completed_clients: 0/2
replayed_requests: 0/??
queued_requests: 0
next_transno: 112339940
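As an aside, the stuck fields above can be pulled out with a quick awk
one-liner. A sketch, using a copy of the quoted output rather than the
live /proc file (which on this system appears to live at
/proc/fs/lustre/mds/crew4-MDT0000/recovery_status; path assumed from
the prompt above):

```shell
# Copy of the recovery_status output quoted above (stand-in for the
# live /proc file).
cat <<'EOF' > /tmp/recovery_status
status: RECOVERING
recovery_start: 1220380113
time remaining: 0
connected_clients: 0/2
completed_clients: 0/2
replayed_requests: 0/??
queued_requests: 0
next_transno: 112339940
EOF
# Summarize why recovery is stuck: no clients have reconnected, so
# replay never completes.
awk -F': ' '$1 == "status" {s = $2}
            $1 == "connected_clients" {c = $2}
            END {print s " with " c " clients connected"}' /tmp/recovery_status
```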
And that the client is not able to use a partial lustre disk...
From messages on client:
Sep 3 11:22:58 crew01 kernel: Lustre: setting import
crew4-MDT0000_UUID INACTIVE by administrator request
Sep 3 11:22:58 crew01 kernel: Lustre: setting import
crew4-OST0000_UUID INACTIVE by administrator request
Sep 3 11:22:58 crew01 kernel: LustreError:
8832:0:(llite_lib.c:1520:ll_statfs_internal()) obd_statfs fails: rc =
-5
...and the mount hangs, I guess waiting for the bad OSTs to return.
However, on the MGS/MDS the bad disks are automatically re-activated???
I ran the lctl dl below ten minutes after the above transactions:
13 UP lov crew4-mdtlov crew4-mdtlov_UUID 4
14 UP osc crew4-OST0000-osc crew4-mdtlov_UUID 5
15 UP osc crew4-OST0001-osc crew4-mdtlov_UUID 5
16 UP osc crew4-OST0002-osc crew4-mdtlov_UUID 5
17 UP mds crew4-MDT0000 crew4-MDT0000_UUID 5
18 UP osc crew4-OST0003-osc crew4-mdtlov_UUID 5
19 UP osc crew4-OST0004-osc crew4-mdtlov_UUID 5
crew4-OST0000 and crew4-OST0002 are again listed as UP. Why? Can
I echo a zero or a one to a /proc/fs/lustre file somewhere to keep
these volumes from being re-activated?
This is CentOS 4 linux kernel 2.6.18-53.1.13.el5 with lustre-1.6.4.3smp.
Have a nice day!
megan