[Lustre-discuss] Re-activating a partial lustre disk

Ms. Megan Larko dobsonunit at gmail.com
Thu Sep 4 11:34:26 PDT 2008


Greetings!

Following the long recovery period I am able to mount the disk.  The
mount command returns almost immediately.  The difficulty is that the
mounted disk cannot be used: commands such as "ls", "df", or "cd"
hang.  Eventually I run "fuser -km /crew4" and "umount -f /crew4" to
clear the hung processes and free the command line.  So the disk now
mounts but is unusable for all practical purposes.

The log files contain the following information:
The MGS/MDS:
Sep  4 14:07:13 mds1 kernel: LustreError: 11-0: an error occurred
while communicating with 172.18.0.14 at o2ib. The ost_connect operation
failed with -19
Sep  4 14:07:13 mds1 kernel: Lustre: Client crew4-client has started
Sep  4 14:07:13 mds1 kernel: LustreError: Skipped 1 previous similar message


The OSS:
Sep  4 14:10:56 oss3 kernel: LustreError:
4881:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-19)  req at ffff8103e7ba5a00 x5186/t0 o8-><?>@<?>:-1 lens 240/0 ref 0
fl Interpret:/0/0 rc -19/0
Sep  4 14:10:56 oss3 kernel: LustreError:
4881:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 7 previous
similar messages
Sep  4 14:10:56 oss3 kernel: LustreError: Skipped 7 previous similar messages
Sep  4 14:15:04 oss3 kernel: LustreError: 11-0: an error occurred
while communicating with 0 at lo. The ost_connect operation failed with
-19

The client box on which the /crew4 disk was mounted (and yes, it
appears properly in mtab FWIW):
First---  its own MGS:
Sep  4 14:07:13 mds1 kernel: LustreError: 11-0: an error occurred
while communicating with 172.18.0.14 at o2ib. The ost_connect operation
failed with -19
Sep  4 14:07:13 mds1 kernel: Lustre: Client crew4-client has started
Sep  4 14:07:13 mds1 kernel: LustreError: Skipped 1 previous similar message


Second-- its own OSS:
Sep  4 14:10:56 oss3 kernel: LustreError:
4881:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-19)  req at ffff8103e7ba5a00 x5186/t0 o8-><?>@<?>:-1 lens 240/0 ref 0
fl Interpret:/0/0 rc -19/0
Sep  4 14:10:56 oss3 kernel: LustreError:
4881:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 7 previous
similar messages
Sep  4 14:10:56 oss3 kernel: LustreError: Skipped 7 previous similar messages

Third--  from a real client (which I'd rather not risk hanging):
Sep  3 16:31:05 crew01 kernel: Lustre: Client crew4-client has started
Sep  3 16:31:05 crew01 kernel: LustreError: 11-0: an error occurred
while communicating with 172.18.0.14 at o2ib. The ost_connect operation
failed with -19
Sep  3 16:31:05 crew01 kernel: LustreError: Skipped 1 previous similar message
Sep  3 16:35:15 crew01 kernel: LustreError: 11-0: an error occurred
while communicating with 172.18.0.14 at o2ib. The ost_connect operation
failed with -19
Sep  3 16:35:15 crew01 kernel: LustreError: Skipped 1 previous similar message
Sep  3 16:35:27 crew01 mountd[3994]: authenticated unmount request
from crewtape1.iges.org:1015 for /crew3 (/crew3)
Sep  3 16:39:25 crew01 kernel: LustreError: 11-0: an error occurred
while communicating with 172.18.0.14 at o2ib. The ost_connect operation
failed with -19

Note that the above messages on the real client continue to appear
until I unmount the /crew4 Lustre disk.
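For reference, the recurring return code -19 in these messages is the Linux errno ENODEV ("No such device"), i.e. the server being contacted does not have the requested OST configured. The mapping can be confirmed with a quick one-liner (assuming python3 is on hand):

```shell
# Map the -19 from the Lustre logs to its Linux errno name and message.
python3 -c 'import errno, os; print(errno.errorcode[19], "-", os.strerror(19))'
# ENODEV - No such device
```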

On MGS/MDS:
lctl > dl
  0 UP mgs MGS MGS 13
  1 UP mgc MGC172.18.0.10 at o2ib 81039216-0261-c74d-3f2f-a504788ad8f8 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov crew2-mdtlov crew2-mdtlov_UUID 4
  4 UP mds crew2-MDT0000 crew2mds_UUID 9
  5 UP osc crew2-OST0000-osc crew2-mdtlov_UUID 5
  6 UP osc crew2-OST0001-osc crew2-mdtlov_UUID 5
  7 UP osc crew2-OST0002-osc crew2-mdtlov_UUID 5
  8 UP lov crew3-mdtlov crew3-mdtlov_UUID 4
  9 UP mds crew3-MDT0000 crew3mds_UUID 9
 10 UP osc crew3-OST0000-osc crew3-mdtlov_UUID 5
 11 UP osc crew3-OST0001-osc crew3-mdtlov_UUID 5
 12 UP osc crew3-OST0002-osc crew3-mdtlov_UUID 5
 13 UP lov crew4-mdtlov crew4-mdtlov_UUID 4
 14 UP osc crew4-OST0000-osc crew4-mdtlov_UUID 5
 15 UP osc crew4-OST0001-osc crew4-mdtlov_UUID 5
 16 UP osc crew4-OST0002-osc crew4-mdtlov_UUID 5
 17 UP mds crew4-MDT0000 crew4-MDT0000_UUID 9
 18 UP osc crew4-OST0003-osc crew4-mdtlov_UUID 5
 19 UP osc crew4-OST0004-osc crew4-mdtlov_UUID 5

The other Lustre disks, /crew2 and /crew3, are working just fine with
no errors.  The lctl dl output on the MGS/MDS still shows
crew4-OST0000 and crew4-OST0002 as "UP", even though they have been
specifically deactivated.

On the OSS hosting the /crew4 disks, lctl dl shows the following:
  0 UP mgc MGC172.18.0.10 at o2ib b4c1b639-11d5-9092-c0d0-cebc2365afec 5
  1 UP ost OSS OSS_uuid 3
  2 UP obdfilter crew4-OST0001 crew4-OST0001_UUID 11
  3 UP obdfilter crew4-OST0003 crew4-OST0003_UUID 11
  4 UP obdfilter crew4-OST0004 crew4-OST0004_UUID 11

Most of the errors are ost_connect failures, and the OSS listing
above no longer contains crew4-OST0000 or crew4-OST0002.  Is the
MGS/MDT disk crew4-MDT0000 still trying to use all of the OSTs?  Do I
need to dummy up some hardware with 8 TB partitions formatted for
Lustre and named crew4-OST0000 and crew4-OST0002 on the OSS to
"trick" Lustre into connecting with OSTs which have been deactivated?
I would like to be able to get a few files from this damaged disk if
possible.  However, if that is not to be, I will learn and move on.
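One possible approach, sketched here under assumptions I can't verify against your setup (the device numbers 14 and 16 are taken from the lctl dl listing above, and behavior may vary by Lustre version): deactivating the OSC devices for the lost OSTs on the MGS/MDS node should stop the MDS from retrying connections to them, which may let clients use the filesystem and read files whose objects live on the surviving OSTs.

```shell
# On the MGS/MDS: mark the OSCs for the lost OSTs inactive so the MDS
# stops attempting ost_connect to them.  Device numbers 14 and 16 are
# crew4-OST0000-osc and crew4-OST0002-osc in the "lctl dl" output above.
lctl --device 14 deactivate
lctl --device 16 deactivate

# Verify the state change; the two devices should no longer show "UP".
lctl dl
```

Files striped across the deactivated OSTs would still return I/O errors for the missing objects, but files held entirely on crew4-OST0001/0003/0004 may be recoverable.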

Enjoy your day!
megan


