[Lustre-discuss] Changing lustre network numbers...

Ms. Megan Larko dobsonunit at gmail.com
Tue Dec 9 09:54:44 PST 2008


Greetings,

Our 2.6.18-53.1.13.el5_lustre.1.6.4.3smp lustre system was wonderfully
stable for the last few months until today, when I tried to change it
to use another network. Our group uses InfiniBand (IB) for the lustre
network. I shut down all the systems (the bldg had a scheduled power
outage today, so it was a good time to adjust the network; it is being
re-wired into a new smart IB switch shared with another research group
so that we can share data). I set up the new IB IP numbers and set my
CentOS 5.1 not to bring up IB on boot. The computers came up nicely
without Lustre. I finalized the new config and tested it via ssh and
ping. The new IB IP numbers are working. To allow lustre to use the
new IP number scheme on the OSTs, I ran the following:

[root@oss1 ~]# tunefs.lustre --erase-params --writeconf --mgsnode=ic-mds1@o2ib /dev/sdb1
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target:     crew2-OST0000
Index:      0
UUID:       crew2d1_UUID
Lustre FS:  crew2
Mount type: ldiskfs
Flags:      0x402
              (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=172.18.0.10@o2ib


   Permanent disk data:
Target:     crew2-OST0000
Index:      0
UUID:       crew2d1_UUID
Lustre FS:  crew2
Mount type: ldiskfs
Flags:      0x542
              (OST update writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.64.210@o2ib

Writing CONFIGS/mountdata

(Yes, I did remember to change the /dev/abc appropriately each time.)
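
For reference, the Flags values printed by tunefs.lustre can be read as
bitmasks. The sketch below decodes them using only the bit names that
the output in this message prints next to each value (MDT, OST, MGS,
update, writeconf). The 0x400 bit is set in every value but is never
named in the output, so it is left unnamed here rather than guessed.
The function names are made up for illustration.

```python
# Bit names inferred from the tunefs.lustre output quoted in this message.
# 0x400 is set in every Flags value shown but never named there, so it is
# labelled as unnamed instead of being given a guessed meaning.
FLAG_BITS = {
    0x001: "MDT",
    0x002: "OST",
    0x004: "MGS",
    0x040: "update",
    0x100: "writeconf",
    0x400: "unnamed-0x400",
}


def decode_flags(value):
    """Return the names of the known bits set in a Flags value."""
    return [name for bit, name in sorted(FLAG_BITS.items()) if value & bit]


def changed_flags(before, after):
    """Return the names of the bits that differ between two Flags values."""
    return decode_flags(before ^ after)


if __name__ == "__main__":
    # (previous, permanent) Flags pairs from the three tunefs.lustre runs.
    for before, after in [(0x402, 0x542), (0x405, 0x505), (0x401, 0x501)]:
        print(hex(before), "->", hex(after), "changed:", changed_flags(before, after))
```

Read this way, each run turned on the writeconf bit, and the OST run
additionally turned on the update bit.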

The MGS/MDS is where I am confused. On the mds1 box (192.168.64.210),
I ran the following for the combined MGS/MDT disk:
[root@mds1 ~]# tunefs.lustre --mgs --writeconf --mgsnode=ic-mds1@o2ib /dev/METADATA1/LV1
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

     Read previous values:
Target:     crew2-MDT0000
Index:      0
UUID:       crew2mds_UUID
Lustre FS:  crew2
Mount type: ldiskfs
Flags:      0x405
              (MDT MGS )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr,
Parameters:

   Permanent disk data:
Target:     crew2-MDT0000
Index:      0
UUID:       crew2mds_UUID
Lustre FS:  crew2
Mount type: ldiskfs
Flags:      0x505
              (MDT MGS writeconf )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr,
Parameters: mgsnode=192.168.64.210@o2ib

Writing CONFIGS/mountdata
[root@mds1 ~]# mount -a -t lustre

I do not know if that was the correct form of the command for the
combined MGS/MDT on the mgs/mds computer.

For the two other MDTs on the mgs/mds computer, I ran:
[root@mds1 ~]# tunefs.lustre --writeconf --mgsnode=ic-mds1@o2ib /dev/md0
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target:     crew3-MDT0000
Index:      0
UUID:       crew3mds_UUID
Lustre FS:  crew3
Mount type: ldiskfs
Flags:      0x401
              (MDT )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr,errors=remount-ro,iopen_nopriv,user_xattr
Parameters: mgsnode=172.18.0.10@o2ib


   Permanent disk data:
Target:     crew3-MDT0000
Index:      0
UUID:       crew3mds_UUID
Lustre FS:  crew3
Mount type: ldiskfs
Flags:      0x501
              (MDT writeconf )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr,errors=remount-ro,iopen_nopriv,user_xattr
Parameters: mgsnode=172.18.0.10@o2ib mgsnode=192.168.64.210@o2ib

Writing CONFIGS/mountdata
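
Note that this run did not pass --erase-params, and its Permanent disk
data lists both the old and the new mgsnode value, whereas the OST run
lists only the new one. Whether the leftover entry contributes to the
mount failure is not established here. The sketch below only shows one
way to spot such leftovers in a Parameters line. The 192.168.64.0/24
subnet is an assumption (the netmask is not stated in this message),
and the function names are made up for illustration.

```python
import ipaddress
import re

# Matches mgsnode=<address>@<network>. " at " is accepted as well because
# the list archive rewrites "@" that way.
MGSNODE_RE = re.compile(r"mgsnode=(\S+?)(?:@| at )(\S+)")


def mgsnodes(parameters_line):
    """Return (address, network) pairs found in a Parameters line."""
    return MGSNODE_RE.findall(parameters_line)


def stale_mgsnodes(parameters_line, new_subnet):
    """Return mgsnode IP addresses that fall outside new_subnet."""
    subnet = ipaddress.ip_network(new_subnet)
    stale = []
    for addr, _net in mgsnodes(parameters_line):
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            continue  # a hostname rather than an IP; cannot be checked here
        if ip not in subnet:
            stale.append(addr)
    return stale


if __name__ == "__main__":
    line = "Parameters: mgsnode=172.18.0.10@o2ib mgsnode=192.168.64.210@o2ib"
    # The /24 below is an assumed netmask for the new IB network.
    print(stale_mgsnodes(line, "192.168.64.0/24"))
```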

I can successfully lctl ping across the new IB network numbers:
[root@crew01 ~]# lctl ping 192.168.64.210@o2ib
12345-0@lo
12345-192.168.64.210@o2ib

I cannot mount any of my lustre disks now.  The error on the client is:
[root@crew01 ~]# tail /var/log/messages
Dec  9 12:23:30 crew01 perfquery: ibpanic: [5751] madrpc_init: can't
init UMAD library: (No such file or directory)
Dec  9 12:23:40 crew01 perfquery: ibpanic: [5752] madrpc_init: can't
init UMAD library: (No such file or directory)
Dec  9 12:23:40 crew01 kernel: LustreError: 11-0: an error occurred
while communicating with 192.168.64.210@o2ib. The mds_connect
operation failed with -11
Dec  9 12:23:50 crew01 perfquery: ibpanic: [5753] madrpc_init: can't
init UMAD library: (No such file or directory)
Dec  9 12:24:00 crew01 perfquery: ibpanic: [5754] madrpc_init: can't
init UMAD library: (No such file or directory)
Dec  9 12:24:10 crew01 perfquery: ibpanic: [5755] madrpc_init: can't
init UMAD library: (No such file or directory)
Dec  9 12:24:20 crew01 perfquery: ibpanic: [5756] madrpc_init: can't
init UMAD library: (No such file or directory)
Dec  9 12:24:30 crew01 perfquery: ibpanic: [5757] madrpc_init: can't
init UMAD library: (No such file or directory)
Dec  9 12:24:30 crew01 kernel: LustreError: 11-0: an error occurred
while communicating with 192.168.64.210@o2ib. The mds_connect
operation failed with -11
Dec  9 12:24:30 crew01 kernel: LustreError: Skipped 1 previous similar message

The errors on the mgs/mds are:
Dec  9 12:20:45 mds1 kernel: Lustre: crew2-MDT0000: temporarily
refusing client connection from 192.168.64.211@o2ib
Dec  9 12:20:45 mds1 kernel: Lustre: Skipped 18 previous similar messages
Dec  9 12:20:45 mds1 kernel: LustreError:
4486:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-11)  req@ffff81006d752800 x6/t0 o38-><?>@<?>:-1 lens 240/0 ref 0 fl
Interpret:/0/0 rc -11/0
Dec  9 12:20:45 mds1 kernel: LustreError:
4486:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 18 previous
similar messages
[root@mds1 ~]# tail /var/log/messages
Dec  9 12:19:08 mds1 kernel: LDISKFS FS on sdf, internal journal
Dec  9 12:19:08 mds1 kernel: LDISKFS-fs: mounted filesystem with
ordered data mode.
Dec  9 12:20:45 mds1 kernel: Lustre: crew2-MDT0000: temporarily
refusing client connection from 192.168.64.211@o2ib
Dec  9 12:20:45 mds1 kernel: Lustre: Skipped 18 previous similar messages
Dec  9 12:20:45 mds1 kernel: LustreError:
4486:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-11)  req@ffff81006d752800 x6/t0 o38-><?>@<?>:-1 lens 240/0 ref 0 fl
Interpret:/0/0 rc -11/0
Dec  9 12:20:45 mds1 kernel: LustreError:
4486:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 18 previous
similar messages
Dec  9 12:22:00 mds1 kernel: Lustre: crew2-MDT0000: temporarily
refusing client connection from 192.168.64.211@o2ib
Dec  9 12:22:00 mds1 kernel: Lustre: Skipped 2 previous similar messages
Dec  9 12:22:00 mds1 kernel: LustreError:
4489:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error
(-11)  req@ffff81006faf9c00 x13/t0 o38-><?>@<?>:-1 lens 240/0 ref 0 fl
Interpret:/0/0 rc -11/0
Dec  9 12:22:00 mds1 kernel: LustreError:
4489:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 2 previous
similar messages
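
Both the client and the server log report -11. Lustre kernel messages
report negated Linux errno values, and on Linux errno 11 is EAGAIN
("try again"), which fits the "temporarily refusing client connection"
wording on the MDS. A small sketch for looking up such codes, meant to
be run on a Linux host since errno numbering differs between platforms
(the function name is made up for illustration):

```python
import errno
import os


def describe_rc(rc):
    """Map a negative kernel return code such as -11 to (name, message).

    The lookup uses the errno table of the machine running the script,
    so it matches Lustre's codes only when run on Linux.
    """
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # -11 is the code reported by mds_connect and target_send_reply_msg().
    print(describe_rc(-11))
```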

On my mds/mgs computer, my working device list under old IP numbers
looked like this:
lctl > dl
  0 UP mgs MGS MGS 11
  1 UP mgc MGC172.18.0.10@o2ib bd220344-9aa1-c2d5-d65c-19038700158a 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov crew2-mdtlov crew2-mdtlov_UUID 4
  4 UP mds crew2-MDT0000 crew2mds_UUID 6
  5 UP osc crew2-OST0000-osc crew2-mdtlov_UUID 5
  6 UP osc crew2-OST0001-osc crew2-mdtlov_UUID 5
  7 UP osc crew2-OST0002-osc crew2-mdtlov_UUID 5
  8 UP mgc MGC172.18.0.10@o2ib 3a773da4-9688-423e-4bc7-af8b90db36a3 5
  9 UP lov crew3-mdtlov crew3-mdtlov_UUID 4
 10 UP mds crew3-MDT0000 crew3mds_UUID 6
 11 UP osc crew3-OST0000-osc crew3-mdtlov_UUID 5
 12 UP osc crew3-OST0001-osc crew3-mdtlov_UUID 5
 13 UP osc crew3-OST0002-osc crew3-mdtlov_UUID 5
 14 UP lov crew8-mdtlov crew8-mdtlov_UUID 4
 15 UP mds crew8-MDT0000 crew8-MDT0000_UUID 15
 16 UP osc crew8-OST0000-osc crew8-mdtlov_UUID 5
 17 UP osc crew8-OST0001-osc crew8-mdtlov_UUID 5
 18 UP osc crew8-OST0002-osc crew8-mdtlov_UUID 5
 19 UP osc crew8-OST0003-osc crew8-mdtlov_UUID 5
 20 UP osc crew8-OST0004-osc crew8-mdtlov_UUID 5
 21 UP osc crew8-OST0005-osc crew8-mdtlov_UUID 5
 22 UP osc crew8-OST0006-osc crew8-mdtlov_UUID 5
 23 UP osc crew8-OST0007-osc crew8-mdtlov_UUID 5
 24 UP osc crew8-OST0008-osc crew8-mdtlov_UUID 5
 25 UP osc crew8-OST0009-osc crew8-mdtlov_UUID 5
 26 UP osc crew8-OST000a-osc crew8-mdtlov_UUID 5
 27 UP osc crew8-OST000b-osc crew8-mdtlov_UUID 5

Since making the lustre conf changes on the mgs/mds computer, my device
list looks like this:
lctl > dl
  0 UP mgs MGS MGS 5
  1 UP mgc MGC192.168.64.210@o2ib 70d8bc53-c08b-e79c-5698-6b86b20f6aac 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov crew2-mdtlov crew2-mdtlov_UUID 4
  4 UP mds crew2-MDT0000 crew2mds_UUID 3
  5 UP lov crew3-mdtlov crew3-mdtlov_UUID 4
  6 UP mds crew3-MDT0000 crew3mds_UUID 3
  7 UP lov crew8-mdtlov crew8-mdtlov_UUID 4
  8 UP mds crew8-MDT0000 crew8-MDT0000_UUID 3
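
Comparing the two listings by eye: the new one contains no osc devices
at all and only one mgc entry, where the old one had an osc per OST and
two mgc entries. A minimal sketch of the same comparison done
mechanically, assuming each `dl` line has the layout index, status,
type, name, uuid, refcount as in the output above (the function names
are made up for illustration):

```python
def parse_dl(text):
    """Parse `lctl dl` output into a set of (type, name) pairs."""
    devices = set()
    for line in text.splitlines():
        # The list archive rewrites "@" as " at "; undo that before splitting.
        fields = line.replace(" at ", "@").split()
        # Assumed layout: index, status, type, name, uuid, refcount.
        if len(fields) >= 4 and fields[0].isdigit():
            devices.add((fields[2], fields[3]))
    return devices


def missing_devices(before, after):
    """Return devices present in `before` but absent from `after`, sorted."""
    return sorted(parse_dl(before) - parse_dl(after))


if __name__ == "__main__":
    # A few lines copied from the two listings in this message.
    old = """
      3 UP lov crew2-mdtlov crew2-mdtlov_UUID 4
      4 UP mds crew2-MDT0000 crew2mds_UUID 6
      5 UP osc crew2-OST0000-osc crew2-mdtlov_UUID 5
      6 UP osc crew2-OST0001-osc crew2-mdtlov_UUID 5
    """
    new = """
      3 UP lov crew2-mdtlov crew2-mdtlov_UUID 4
      4 UP mds crew2-MDT0000 crew2mds_UUID 3
    """
    for dev_type, name in missing_devices(old, new):
        print(dev_type, name)
```

Pasting the full listings in place of the excerpts gives the complete
set of devices that have not come back.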

Where have I erred in changing the IP numbers for my Lustre network?
I hope someone can guide me as to how to fix it.

Thank you.
Megan Larko


