[Lustre-discuss] Replacing an MDT

Ms. Megan Larko dobsonunit at gmail.com
Fri Mar 20 10:22:13 PDT 2009


Hello,

I have a Lustre system (still 1.6.3) whose MDT was too small and ran
out of inodes.  We removed files to pull it back from the edge, then
unmounted the Lustre disk from the clients and started to replace the
MDT on the MGS.

I followed the instructions in Chapter 15 of the Lustre Manual under
"Backup and Restore".  I had no trouble unmounting the MDT, remounting
it as -t ldiskfs, and running the getfattr and tar commands.  I
unmounted the original disk, mounted a new, larger disk as -t ldiskfs,
and proceeded to restore the data to the bigger disk via setfattr and
tar extraction of the backup I had taken hours before from the
original MDT.  (Recall that all clients have had this disk unmounted,
so no activity should have occurred on it.)  When I mount the new,
larger disk as -t lustre as the MDT I see no mount errors, but the
following errors appear in the MGS /var/log/messages (no client access
at this point):

Mar 20 12:39:40 mds1 kernel: Lustre: MDT crew8-MDT0000 now serving dev
(f8a0e9b5-c2f1-8297-4ead-e34c9680b3cf) with recovery enabled
Mar 20 12:39:40 mds1 kernel: Lustre: Server crew8-MDT0000 on device
/dev/METADATA2/LV2 has started
Mar 20 12:39:40 mds1 kernel: LustreError: 11-0: an error occurred
while communicating with 192.168.64.215@o2ib. The ost_connect
operation failed with -114
Mar 20 12:39:40 mds1 kernel: LustreError: 11-0: an error occurred
while communicating with 192.168.64.215@o2ib. The ost_connect
operation failed with -114
Mar 20 12:39:40 mds1 kernel: LustreError: Skipped 6 previous similar messages
Mar 20 12:39:40 mds1 kernel: LustreError:
3460:0:(llog_lvfs.c:597:llog_lvfs_create()) error looking up logfile
0xa65662:0x9c30d2f6: rc -2
Mar 20 12:39:40 mds1 kernel: LustreError:
3460:0:(osc_request.c:3446:osc_llog_init()) failed
LLOG_MDS_OST_ORIG_CTXT
Mar 20 12:39:40 mds1 kernel: LustreError:
3460:0:(osc_request.c:3457:osc_llog_init()) osc 'crew8-OST0000-osc'
tgt 'crew8-MDT0000' cnt 1 catid ffffc200050f8000 rc=-2
Mar 20 12:39:40 mds1 kernel: LustreError:
3460:0:(osc_request.c:3459:osc_llog_init()) logid 0xa65662:0x9c30d2f6
Mar 20 12:39:40 mds1 kernel: LustreError:
3460:0:(lov_log.c:214:lov_llog_init()) error osc_llog_init idx 0 osc
'crew8-OST0000-osc' tgt 'crew8-MDT0000' (rc=-2)
Mar 20 12:39:40 mds1 kernel: LustreError:
3460:0:(mds_log.c:207:mds_llog_init()) lov_llog_init err -2
Mar 20 12:39:40 mds1 kernel: LustreError:
3460:0:(llog_obd.c:392:llog_cat_initialize()) rc: -2
Mar 20 12:40:05 mds1 kernel: LustreError: 11-0: an error occurred
while communicating with 192.168.64.215@o2ib. The ost_connect
operation failed with -114
Mar 20 12:40:05 mds1 kernel: LustreError: Skipped 4 previous similar messages
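
For readers decoding the log above: Lustre return codes are negated
Linux errno values.  The small sketch below maps only the two codes
that appear in this log; the helper name decode_rc is made up for
illustration and is not part of any Lustre tool.

```shell
#!/bin/sh
# Map the negative return codes seen in the log to Linux errno names.
# decode_rc is a hypothetical helper; it covers only the two codes in this log.
decode_rc() {
    case "${1#-}" in                 # strip the leading minus sign
        2)   echo "ENOENT (No such file or directory)" ;;
        114) echo "EALREADY (Operation already in progress)" ;;
        *)   echo "unknown ($1)" ;;
    esac
}

for rc in -114 -2; do
    printf '%s => %s\n' "$rc" "$(decode_rc "$rc")"
done
```

So the ost_connect failures report "already in progress", while the
llog lookup reports a missing file.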

df shows the volume (crew8-MDT0000) mounted, as is the other disk
(crew2-MDT0000):
/dev/METADATA1/LV1    204G  4.7G  190G   3% /srv/lustre/mds/crew2-MDT0000
/dev/METADATA2/LV2    204G  5.3G  187G   3% /srv/lustre/mds/crew8-MDT0000

The lctl dl shows all of the disks as being up:
[root@mds1 ~]# lctl
lctl > dl
  0 UP mgs MGS MGS 5
  1 UP mgc MGC192.168.64.210@o2ib b09fab05-c2ad-8ebb-553e-0e35f2fba17a 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov crew2-mdtlov crew2-mdtlov_UUID 4
  4 UP mds crew2-MDT0000 crew2mds_UUID 9
  5 UP osc crew2-OST0000-osc crew2-mdtlov_UUID 5
  6 UP osc crew2-OST0001-osc crew2-mdtlov_UUID 5
  7 UP osc crew2-OST0002-osc crew2-mdtlov_UUID 5
  8 UP lov crew8-mdtlov crew8-mdtlov_UUID 4
  9 UP mds crew8-MDT0000 crew8-MDT0000_UUID 15
 10 UP osc crew8-OST0000-osc crew8-mdtlov_UUID 5
 11 UP osc crew8-OST0001-osc crew8-mdtlov_UUID 5
 12 UP osc crew8-OST0002-osc crew8-mdtlov_UUID 5
 13 UP osc crew8-OST0003-osc crew8-mdtlov_UUID 5
 14 UP osc crew8-OST0004-osc crew8-mdtlov_UUID 5
 15 UP osc crew8-OST0005-osc crew8-mdtlov_UUID 5
 16 UP osc crew8-OST0006-osc crew8-mdtlov_UUID 5
 17 UP osc crew8-OST0007-osc crew8-mdtlov_UUID 5
 18 UP osc crew8-OST0008-osc crew8-mdtlov_UUID 5
 19 UP osc crew8-OST0009-osc crew8-mdtlov_UUID 5
 20 UP osc crew8-OST000a-osc crew8-mdtlov_UUID 5
 21 UP osc crew8-OST000b-osc crew8-mdtlov_UUID 5
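
A listing like the one above can also be checked mechanically rather
than by eye.  The following is a minimal sketch that assumes the
column layout shown (index, status, type, name, UUID, refcount); the
function name not_up and the three sample lines are chosen here for
illustration only.

```shell
#!/bin/sh
# Print every device whose status column (field 2) is not UP.
# not_up is a hypothetical helper name.
not_up() {
    awk '$2 != "UP" { print $4 " is " $2 }'
}

# Sample input: three lines copied from the listing above.
# On a live server the real output could be piped in: lctl dl | not_up
sample='  8 UP lov crew8-mdtlov crew8-mdtlov_UUID 4
  9 UP mds crew8-MDT0000 crew8-MDT0000_UUID 15
 10 UP osc crew8-OST0000-osc crew8-mdtlov_UUID 5'

# Prints nothing when every device is UP.
printf '%s\n' "$sample" | not_up
```

An empty result only says that the devices are configured and UP; it
does not show that the OSC connections to the OSTs actually succeeded.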

Does this have anything to do with "Remove the recovery logs (now
invalid), run 'rm OBJECTS/* CATALOGS'"?
Should I just copy or rsync specific files from the smaller
crew8-MDT0000 to the new, larger crew8-MDT0000?
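
For reference, the restore sequence described above, including the
quoted 'rm OBJECTS/* CATALOGS' step, might look roughly like the
following.  This is only a sketch that prints the commands instead of
running them.  The temporary mount point /mnt/mdt, the archive name,
the ea.bak file name and the exact tar/setfattr flags are assumptions,
not a record of what was actually typed.

```shell
#!/bin/sh
# Dry run: print a plausible restore sequence for a file-level MDT backup.
# Nothing is executed.  restore_plan and all paths are hypothetical.
restore_plan() {
    dev=$1      # new, larger MDT device
    mnt=$2      # temporary ldiskfs mount point (assumed)
    backup=$3   # tar archive taken from the old MDT (assumed name)
    # ea.bak is assumed to be the getfattr dump, stored inside the archive.
    cat <<EOF
mount -t ldiskfs $dev $mnt
cd $mnt
tar xzpf $backup --sparse
setfattr --restore=ea.bak
rm OBJECTS/* CATALOGS
cd /
umount $mnt
mount -t lustre $dev /srv/lustre/mds/crew8-MDT0000
EOF
}

restore_plan /dev/METADATA2/LV2 /mnt/mdt /root/crew8-mdt.tgz
```

If the rm step was skipped, the log catalog copied over from the old
MDT might be what the llog lookup fails on, though that is a guess.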

megan


