[Lustre-discuss] Multirail IB Configuration Issue

mages, brian brian.mages at emc.com
Wed Feb 13 09:14:37 PST 2013


Hi Grégoire,

Thanks for the reply.

I thought that might be the issue as well.  However, it doesn't explain why the other two clients work successfully when both interfaces are on the same network.

...Brian

From: Gregoire Pichon [mailto:gregoire.pichon at bull.net]
Sent: Wednesday, February 13, 2013 4:19 AM
To: mages, brian; lustre-discuss at lists.lustre.org
Subject: RE: Multirail IB Configuration Issue

Hi,

The two LNet networks need separate IPoIB networks.
o2ib0  - 10.0.0.0/24
o2ib1 - 10.1.0.0/24

Nodes with 2 physical interfaces (like your servers) have ib0 on o2ib0 and ib1 on o2ib1.
Nodes with 1 physical interface (like your clients) have 2 logical interfaces: ib0 on o2ib0 and ib0:0 on o2ib1.
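
As a rough sketch of what that mapping could look like as LNet module options (typically placed in a modprobe configuration file): the helper name lnet_options is made up for illustration, and using an ib0:0 alias on the clients is an assumption based on the layout described above, not something tested on your nodes.

```shell
# Hypothetical helper: build an "options lnet networks=..." line from
# arguments of the form <lnet-network>:<interface>.
lnet_options() {
    list=""
    for pair in "$@"; do
        net=${pair%%:*}     # text before the first colon, e.g. o2ib0
        iface=${pair#*:}    # text after the first colon, e.g. ib0 or ib0:0
        list="${list:+$list,}${net}(${iface})"
    done
    printf 'options lnet networks="%s"\n' "$list"
}

# Servers: two physical interfaces, one per LNet network.
lnet_options o2ib0:ib0 o2ib1:ib1
# Clients: one physical interface plus an alias (assumed to be ib0:0).
lnet_options o2ib0:ib0 o2ib1:ib0:0
```

The first call prints options lnet networks="o2ib0(ib0),o2ib1(ib1)" and the second prints options lnet networks="o2ib0(ib0),o2ib1(ib0:0)". Each interface would still need an IPoIB address in the matching network (10.0.0.0/24 or 10.1.0.0/24 in the example above).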

Grégoire.

From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of mages, brian
Sent: Tuesday, February 12, 2013 21:34
To: lustre-discuss at lists.lustre.org
Subject: [Lustre-discuss] Multirail IB Configuration Issue

Hi,

I'm having difficulty getting one of my clients to work with a multirail IB configuration.  Here's what I've got:

 #   Host     OS Version  Lustre Version  Function      Storage                     Interface ib0     Interface ib1
 1.  bmr1-s7  CentOS 5.7  2.1.1           MGS,MDS,OSS1  mdt,mdt2,ost1->6,ost13->18  192.168.1.25/24   192.168.1.35/24
 2.  bmr1-s8  CentOS 5.7  2.1.1           OSS2          ost7->12,ost19->24          192.168.1.26/24   192.168.1.36/24
 3.  bmr1-s5  CentOS 5.7  2.1.1           OSS3          ost25->30                   192.168.1.20/24   192.168.1.30/24
 4.  bmr1-s6  CentOS 5.7  2.1.1           OSS4          ost31->36                   192.168.1.21/24   192.168.1.31/24
 5.  bmr2-s9  CentOS 5.7  2.1.1           Client        n/a                         192.168.1.209/24

The "/lustre" filesystem consists of mdt and ost1->12 (using bmr1-s7 and bmr1-s8).
The "/lustre2" filesystem consists of mdt2 and ost13->36 (using bmr1-s7, bmr1-s8, bmr1-s5, and bmr1-s6).
On each OSS, half the OSTs are available only on ib0 and the other half only on ib1.
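
As a sanity check on the addressing, here is a small sketch (the helper name net24 is made up for illustration, and it only handles /24 networks) showing that the ib0 and ib1 addresses of a server land in the same IPoIB network:

```shell
# Hypothetical helper: print the /24 network a dotted-quad address belongs to.
net24() {
    printf '%s\n' "${1%.*}.0/24"   # drop the last octet, append .0/24
}

# ib0 and ib1 addresses of bmr1-s7, taken from the host list above.
net24 192.168.1.25    # prints 192.168.1.0/24
net24 192.168.1.35    # prints 192.168.1.0/24

# Addresses in two distinct /24 networks give different results
# (10.0.0.1 and 10.1.0.1 are made-up example addresses).
if [ "$(net24 10.0.0.1)" != "$(net24 10.1.0.1)" ]; then
    echo "separate IPoIB networks"
fi
```

Both server addresses map to 192.168.1.0/24, so the two rails are not distinguishable by IP network alone.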

From bmr1-s5 and bmr1-s6 (used as clients), I can successfully mount and access "/lustre".  I can also successfully mount "/lustre2".

From bmr2-s9, I can mount neither "/lustre" nor "/lustre2".  Originally, the issue with bmr2-s9 was that it was running 1.8.6-wc1 (server on CentOS 5.6).  Since this config (i.e., multirail) wasn't supported on that version, I upgraded to 2.1.1.  I first tried installing and testing the 2.1.1 client, without success.  Then, since mounting had worked with the 2.1.1 server on both bmr1-s5 and bmr1-s6, I tried that next.  Unfortunately, it still didn't work.

1a) Here's what I see on the client when I try to mount "/lustre":

[root@bmr2-s9 ~]# mount -t lustre 192.168.1.25@o2ib:/lustre /mnt/lustre
mount.lustre: mount 192.168.1.25@o2ib:/lustre at /mnt/lustre failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
[root@bmr2-s9 ~]#

1b) Here's an excerpt from "/var/log/messages" on the client (after executing the above command):

Feb 12 15:00:54 bmr2-s9 kernel: Lustre: 5512:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGC192.168.1.25@o2ib->MGC192.168.1.25@o2ib_0 netid 50000: select flavor null
Feb 12 15:00:54 bmr2-s9 kernel: Lustre: MGC192.168.1.25@o2ib: Reactivating import
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5523:0:(ldlm_lib.c:357:client_obd_setup()) can't add initial connection
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5523:0:(obd_config.c:522:class_setup()) setup lustre-OST0001-osc-ffff81045d783c00 failed (-2)
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5523:0:(obd_config.c:1361:class_config_llog_handler()) Err -2 on cfg command:
Feb 12 15:00:54 bmr2-s9 kernel: Lustre:    cmd=cf003 0:lustre-OST0001-osc  1:lustre-OST0001_UUID  2:192.168.1.35@o2ib1
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 15c-8: MGC192.168.1.25@o2ib: The configuration from log 'lustre-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(llite_lib.c:950:ll_fill_super()) Unable to process log: -2
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 4923:0:(lov_obd.c:927:lov_cleanup()) lov tgt 0 not cleaned! deathrow=0, lovrc=1
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(obd_config.c:567:class_cleanup()) Device 5 not setup
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Feb 12 15:00:54 bmr2-s9 kernel: Lustre: client ffff81045d783c00 umount complete
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(obd_mount.c:2164:lustre_fill_super()) Unable to mount  (-2)

1c) Here's an excerpt from "/var/log/messages" on the server (after executing the above command):

Feb 12 15:00:54 bmr1-s7 kernel: Lustre: 25911:0:(ldlm_lib.c:877:target_handle_connect()) MGS: connection from 2e13dea0-ec9c-0fbd-0f95-7b16246f2626@192.168.1.209@o2ib t0 exp 0000000000000000 cur 1360699254 last 0
Feb 12 15:00:54 bmr1-s7 kernel: Lustre: 25911:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGS->NET_0x50000c0a801d1_UUID netid 50000: select flavor null

2a) Here's what I see on the client when I try to mount "/lustre" (using the other interface):

[root@bmr2-s9 ~]# mount -t lustre 192.168.1.25@o2ib:/lustre /mnt/lustre
mount.lustre: mount 192.168.1.25@o2ib:/lustre at /mnt/lustre failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
[root@bmr2-s9 ~]# mount -t lustre 192.168.1.35@o2ib:/lustre /mnt/lustre
mount.lustre: mount 192.168.1.35@o2ib:/lustre at /mnt/lustre failed: Invalid argument
This may have multiple causes.
Is 'lustre' the correct filesystem name?
Are the mount options correct?
Check the syslog for more info.
[root@bmr2-s9 ~]#

2b) Here's an excerpt from "/var/log/messages" on the client (after executing the above command):

Feb 12 15:06:57 bmr2-s9 kernel: Lustre: 5580:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGC192.168.1.35@o2ib->MGC192.168.1.35@o2ib_0 netid 50000: select flavor null
Feb 12 15:06:57 bmr2-s9 kernel: Lustre: 4948:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1426793186721863 sent from MGC192.168.1.35@o2ib to NID 192.168.1.35@o2ib has failed due to network error: [sent 1360699617] [real_sent 1360699617] [current 1360699617] [deadline 5s] [delay -5s]  req@ffff81043b76e400 x1426793186721863/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 368/512 e 0 to 1 dl 1360699622 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Feb 12 15:06:57 bmr2-s9 kernel: Lustre: 4948:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Feb 12 15:06:57 bmr2-s9 kernel: LustreError: 3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 192.168.1.35@o2ib rejected: o2iblnd fatal error
Feb 12 15:06:57 bmr2-s9 kernel: LustreError: 3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) Skipped 1 previous similar message
Feb 12 15:07:03 bmr2-s9 kernel: LustreError: 5580:0:(client.c:1049:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff81043b76e000 x1426793186721864/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 296/352 e 0 to 0 dl 0 ref 2 fl Rpc:W/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:03 bmr2-s9 kernel: LustreError: 5580:0:(client.c:1049:ptlrpc_import_delay_req()) Skipped 6 previous similar messages
Feb 12 15:07:22 bmr2-s9 kernel: Lustre: 4949:0:(import.c:526:import_select_connection()) MGC192.168.1.35@o2ib: tried all connections, increasing latency to 5s
Feb 12 15:07:22 bmr2-s9 kernel: Lustre: 4948:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1426793186721868 sent from MGC192.168.1.35@o2ib to NID 192.168.1.35@o2ib has failed due to network error: [sent 1360699642] [real_sent 1360699642] [current 1360699642] [deadline 10s] [delay -10s]  req@ffff810430e30800 x1426793186721868/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 368/512 e 0 to 1 dl 1360699652 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:22 bmr2-s9 kernel: LustreError: 3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 192.168.1.35@o2ib rejected: o2iblnd fatal error
Feb 12 15:07:24 bmr2-s9 kernel: LustreError: 5591:0:(client.c:1049:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff81045d7ce800 x1426793186721867/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 296/352 e 0 to 0 dl 0 ref 2 fl Rpc:W/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:24 bmr2-s9 kernel: LustreError: 5591:0:(client.c:1049:ptlrpc_import_delay_req()) Skipped 1 previous similar message
Feb 12 15:07:47 bmr2-s9 kernel: Lustre: 4949:0:(import.c:526:import_select_connection()) MGC192.168.1.35@o2ib: tried all connections, increasing latency to 10s
Feb 12 15:07:47 bmr2-s9 kernel: Lustre: 4948:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1426793186721872 sent from MGC192.168.1.35@o2ib to NID 192.168.1.35@o2ib has failed due to network error: [sent 1360699667] [real_sent 1360699667] [current 1360699667] [deadline 15s] [delay -15s]  req@ffff810444576c00 x1426793186721872/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 368/512 e 0 to 1 dl 1360699682 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:47 bmr2-s9 kernel: LustreError: 3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 192.168.1.35@o2ib rejected: o2iblnd fatal error
Feb 12 15:07:54 bmr2-s9 kernel: LustreError: 156-2: The client profile 'lustre-client' could not be read from the MGS.  Does that filesystem exist?
Feb 12 15:07:54 bmr2-s9 kernel: Lustre: client ffff81045f465800 umount complete
Feb 12 15:07:54 bmr2-s9 kernel: LustreError: 5580:0:(obd_mount.c:2164:lustre_fill_super()) Unable to mount  (-22)

2c) Here's an excerpt from "/var/log/messages" on the server (after executing the above command):

Feb 12 15:06:57 bmr1-s7 kernel: LustreError: 9274:0:(o2iblnd_cb.c:2247:kiblnd_passive_connect()) Can't accept 192.168.1.209@o2ib on 192.168.1.25@o2ib (ib1:1:192.168.1.35): bad dst nid 192.168.1.35@o2ib
Feb 12 15:06:57 bmr1-s7 kernel: LustreError: 9274:0:(o2iblnd_cb.c:2247:kiblnd_passive_connect()) Skipped 2 previous similar messages
Feb 12 15:07:22 bmr1-s7 kernel: LustreError: 9274:0:(o2iblnd_cb.c:2247:kiblnd_passive_connect()) Can't accept 192.168.1.209@o2ib on 192.168.1.25@o2ib (ib1:1:192.168.1.35): bad dst nid 192.168.1.35@o2ib
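
The mount command in 2a addresses the MGS as 192.168.1.35@o2ib, whereas the mgsnode parameters shown in 3) list that address as 192.168.1.35@o2ib1, which appears to be what the "bad dst nid" message is complaining about. A small sketch to compare the network part of two NIDs follows; the helper name nid_net is made up for illustration, and it assumes that a network name without a trailing number (o2ib, tcp) means network 0.

```shell
# Hypothetical helper: print the LNet network part of an address@network NID,
# writing a bare network name such as "o2ib" as "o2ib0".
nid_net() {
    net=${1#*@}               # text after the first "@"
    case $net in
        *[0-9]) ;;            # already ends in a network number
        *) net="${net}0" ;;   # assumption: no number means network 0
    esac
    printf '%s\n' "$net"
}

nid_net 192.168.1.35@o2ib     # prints o2ib0 (NID used in the mount command)
nid_net 192.168.1.35@o2ib1    # prints o2ib1 (NID listed in mgsnode)
```

The two results differ, so the client is asking for 192.168.1.35 on a different LNet network than the one the server has that address configured on.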

3) Here's what one of the MDTs looks like (the other is similarly configured):

[root@bmr1-s7 ~]# tunefs.lustre --dryrun --writeconf /dev/sdp
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target:     lustre-MDT0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x5
              (MDT MGS )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1


   Permanent disk data:
Target:     lustre-MDT0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x105
              (MDT MGS writeconf )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1

exiting before disk write.
[root@bmr1-s7 ~]#

4) Here's what one of the OSTs looks like (the others are similarly configured):

[root@bmr1-s7 ~]# tunefs.lustre --dryrun --writeconf /dev/sdf
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target:     lustre-OST0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x2
              (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1 network=o2ib0


   Permanent disk data:
Target:     lustre-OST0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x102
              (OST writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1 network=o2ib0

exiting before disk write.
[root@bmr1-s7 ~]#

I'd appreciate any help or direction on a potential resolution.  Let me know what additional information is needed, if any.  Hopefully, I'm just missing something simple.

Thanks in advance,
...Brian

