[Lustre-discuss] o2ib cant ping/mount Infiniband NID

subbu kl subbukl at gmail.com
Fri Jan 16 02:08:29 PST 2009


Liang,

Right; you reproduced the exact problem. But as you can see in my previous
mail, I think I have solved that problem by manually assigning an IP to ib0
(check the line "# ifconfig ib0 172.24.198.111" and the "Added LNI" lines).
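
For reference, the ordering Liang describes in the quoted reply below (an
address on ib0 before ko2iblnd is loaded) could be checked with a small shell
sketch. The function names has_inet_addr and load_o2iblnd_if_ready are
hypothetical, and the pattern assumes ifconfig prints either the older
"inet addr:" form shown in this thread or the newer "inet " form.

```shell
#!/bin/sh

# Succeeds if ifconfig-style text on stdin contains an IPv4 address line.
# Matches "inet addr:172..." (old net-tools) and "inet 172..." (newer),
# but not "inet6 ...".
has_inet_addr() {
    grep -Eq 'inet (addr:)?[0-9]+\.'
}

# Hypothetical wrapper: load the o2ib LND only when ib0 already has an
# IPv4 address, mirroring the order suggested in the thread.
load_o2iblnd_if_ready() {
    if ifconfig ib0 2>/dev/null | has_inet_addr; then
        modprobe ko2iblnd
    else
        echo "ib0 has no IPv4 address; assign one before loading ko2iblnd" >&2
        return 1
    fi
}
```

Calling load_o2iblnd_if_ready as root would then refuse to load the module
while ib0 is up but unaddressed, which is the case that produced -99 earlier.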

We are back to square one now, I guess! LNET is up with the manually assigned
IPs. A normal ping succeeds between the machines, but lctl ping does not.

So my current problem is this:
# lctl ping 172.24.198.112@o2ib
failed to ping 172.24.198.112@o2ib: Input/output error

/var/log/messages:

Jan 16 10:24:14 p128 kernel: Lustre:
2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback())
172.24.198.112@o2ib: ROUTE ERROR -22
Jan 16 10:24:14 p128 kernel: Lustre:
2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting messages
for 172.24.198.112@o2ib: connection failed
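
As a reading aid for the negative numbers in these logs, here is a small
sketch that maps them to Linux errno names. errno_name is a hypothetical
helper, and its table covers only the four values that appear in this thread
(-99 is the one Liang names in the quoted reply below); the numbers are the
generic Linux values and are an assumption on other platforms.

```shell
#!/bin/sh

# Map a kernel-style return code (with or without the leading minus)
# to its Linux errno name. Only the values seen in this thread are listed.
errno_name() {
    case "${1#-}" in
        5)   echo EIO ;;            # Input/output error
        22)  echo EINVAL ;;         # Invalid argument
        99)  echo EADDRNOTAVAIL ;;  # Cannot assign requested address
        100) echo ENETDOWN ;;       # Network is down
        *)   echo unknown ;;
    esac
}
```

With this, "errno_name -22" gives the name behind the ROUTE ERROR above, and
the "Input/output error" reported by lctl ping corresponds to EIO.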

how can I get rid of this connection problem?

~subbu


On Fri, Jan 16, 2009 at 2:11 PM, Liang Zhen <Zhen.Liang at sun.com> wrote:

> Subbu,
>
> We don't have any tips for setting up IPoIB. It looks like Linux can't find
> the ifaddr of ib0 on the MDS (-99 is EADDRNOTAVAIL), so I think it's because
> you didn't assign any address to ib0 (or failed to assign one) before
> loading o2iblnd on the first try.
> I can reproduce exactly the same error by:
> 1. modprobe ib_ipoib
> 2. ifconfig ib0 up  // without assigning any address
> 3. modprobe ko2iblnd
> 4. lctl network up
>
> Regards
> Liang
>
> subbu kl:
>
>> Liang,
>> after executing the following echo:
>> echo +neterror > /proc/sys/lnet/printk
>>
>> lctl ping now shows the following error:
>>
>> # lctl ping 172.24.198.112@o2ib
>> failed to ping 172.24.198.112@o2ib: Input/output error
>>
>> Jan 16 10:24:14 p128 kernel: Lustre:
>> 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) 172.24.198.112@o2ib:
>> ROUTE ERROR -22
>> Jan 16 10:24:14 p128 kernel: Lustre:
>> 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting messages
>> for 172.24.198.112@o2ib: connection failed
>>
>> Looks like some problem with the "IB connection manager"!
>>
>> 1. Do we have any help docs for setting up IPoIB with Lustre? The Lustre
>> operations manual has very little info about this. I think I am missing
>> some part of the IPoIB setup here.
>> 2. Or is the manual assignment of IP addresses to "ib0" creating some
>> problem?
>>
>>
>> Some more supporting info:
>> A subnet manager of the following version is also running: OpenSM 3.1.8
>>
>> Initially I got this error for MDS mount
>>
>> Jan 16 09:45:20 p128 kernel: LustreError:
>> 4991:0:(linux-tcpip.c:124:libcfs_ipif_query()) Can't get IP address for
>> interface ib0
>> Jan 16 09:45:20 p128 kernel: LustreError:
>> 4991:0:(o2iblnd.c:1563:kiblnd_startup()) Can't query IPoIB interface ib0:
>> -99
>> Jan 16 09:45:21 p128 kernel: LustreError: 105-4: Error -100 starting up
>> LNI o2ib
>> Jan 16 09:45:21 p128 kernel: LustreError:
>> 4991:0:(events.c:707:ptlrpc_init_portals()) network initialisation failed
>> Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting ptlrpc
>> (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/ptlrpc.ko):
>> Input/output error
>> Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting osc
>> (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/osc.ko):
>> Unknown symbol in module, or unknown parameter (see dmesg)
>> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_prep_enqueue_req
>> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_resource_get
>> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol
>> ptlrpc_lprocfs_register_obd
>> .
>> .
>> .
>>
>> then I manually set the IP address for ib0 as follows:
>> # ifconfig ib0 172.24.198.111
>>
>> [root@p186 ~]# ifconfig ib0
>> ib0       Link encap:InfiniBand  HWaddr
>> 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>>          inet addr:172.24.198.112  Bcast:172.24.255.255  Mask:255.255.0.0
>>          UP BROADCAST MULTICAST  MTU:65520  Metric:1
>>          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>          collisions:0 txqueuelen:256
>>          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>
>> then it mounted successfully
>>
>> Jan 16 09:47:09 p128 kernel: Lustre: Added LNI 172.24.198.111@o2ib[8/64]
>> Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started
>> Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter
>> lustre-MDT0000.mdt.group_upcall in log lustre-MDT0000
>> Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr
>> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000: new disk,
>> initializing
>> Jan 16 09:47:09 p128 kernel: Lustre: MDT lustre-MDT0000 now serving dev
>> (lustre-MDT0000/64db1fc7-03ba-9803-4d20-ab0d2aa66116) with recovery enabled
>> Jan 16 09:47:09 p128 kernel: Lustre:
>> 5274:0:(lproc_mds.c:262:lprocfs_wr_group_upcall()) lustre-MDT0000: group
>> upcall set to /usr/sbin/l_getgroups
>> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000.mdt: set parameter
>> group_upcall=/usr/sbin/l_getgroups
>> Jan 16 09:47:09 p128 kernel: Lustre: Server lustre-MDT0000 on device
>> /dev/loop0 has started
>> .
>> .
>> .
>>
>>
>> ~subbu
>>
>>
>> On Thu, Jan 15, 2009 at 8:37 PM, Liang Zhen <Zhen.Liang at sun.com> wrote:
>>
>>    Subbu,
>>
>>    I'd suggest:
>>    1) make sure ko2iblnd has been brought up (please check if there
>>    is any error message when startup ko2iblnd)
>>    2) echo +neterror > /proc/sys/lnet/printk, then try with lctl
>>    ping, if it still can't work please post error messages
>>
>>    Regards
>>    Liang
>>
>>    subbu kl:
>>
>>        The problem is similar to
>>
>> http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html
>>        But looking at that thread, I could not really find the solution
>>        to the problem.
>>
>>        I have two RHEL5 Linux servers installed with following packages -
>>
>>        kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1
>>        kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
>>        lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
>>        lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
>>        lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
>>        e2fsprogs-1.40.7.sun3-0redhat
>>
>>
>>        machine 1: with ib0 IP address : 172.24.198.111
>>        machine 2: with ib0 IP address : 172.24.198.112
>>
>>        /etc/modprobe.conf contains
>>        options lnet networks=o2ib
>>
>>        TCP networking worked fine. Now I am trying the InfiniBand
>>        network and am having difficulty communicating between the IB
>>        nodes. The mount attempt throws the following error:
>>
>>        [root@p186 ~]# mount -t lustre -o loop /tmp/lustre-ost1 /mnt/ost1
>>        mount.lustre: mount /dev/loop0 at /mnt/ost1 failed:
>>        Input/output error
>>        Is the MGS running?
>>
>>        /var/log/messages :
>>        Jan 15 16:55:25 p186 kernel: kjournald starting.  Commit
>>        interval 5 seconds
>>        Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
>>        Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem
>>        with ordered data mode.
>>        Jan 15 16:55:25 p186 kernel: kjournald starting.  Commit
>>        interval 5 seconds
>>        Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
>>        Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem
>>        with ordered data mode.
>>        Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file extents enabled
>>        Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc enabled
>>        Jan 15 16:55:30 p186 kernel: Lustre: Request x7 sent from
>>        MGC172.24.198.111@o2ib to NID 172.24.198.111@o2ib 5s ago has
>>        timed out (limit 5s).
>>        Jan 15 16:55:30 p186 kernel: LustreError:
>>        7193:0:(obd_mount.c:1062:server_start_targets()) Required
>>        registration failed for lustre-OSTffff: -5
>>        Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: Communication
>>        error with the MGS.  Is the MGS running?
>>        Jan 15 16:55:30 p186 kernel: LustreError:
>>        7193:0:(obd_mount.c:1597:server_fill_super()) Unable to start
>>        targets: -5
>>        Jan 15 16:55:30 p186 kernel: LustreError:
>>        7193:0:(obd_mount.c:1382:server_put_super()) no obd lustre-OSTffff
>>        Jan 15 16:55:30 p186 kernel: LustreError:
>>        7193:0:(obd_mount.c:119:server_deregister_mount())
>>        lustre-OSTffff not registered
>>        Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 blocks 0
>>        reqs (0 success)
>>        Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 extents
>>        scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost
>>        Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 generated
>>        and it took 0
>>        Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0
>>        preallocated, 0 discarded
>>        Jan 15 16:55:30 p186 kernel: Lustre: server umount
>>        lustre-OSTffff complete
>>        Jan 15 16:55:30 p186 kernel: LustreError:
>>        7193:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount
>>         (-5)
>>
>>        All attempts to ping the IB NIDs, local and remote, also failed.
>>        I can ping the IP addresses:
>>        [root@p186 ~]# ping 172.24.198.112
>>        PING 172.24.198.112 (172.24.198.112) 56(84) bytes of data.
>>        64 bytes from 172.24.198.112: icmp_seq=1 ttl=64 time=0.052 ms
>>        64 bytes from 172.24.198.112: icmp_seq=2 ttl=64 time=0.024 ms
>>
>>
>>        --- 172.24.198.112 ping statistics ---
>>        2 packets transmitted, 2 received, 0% packet loss, time 1000ms
>>        rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms
>>        [root@p186 ~]# ping 172.24.198.111
>>        PING 172.24.198.111 (172.24.198.111) 56(84) bytes of data.
>>        64 bytes from 172.24.198.111: icmp_seq=1 ttl=64 time=2.16 ms
>>        64 bytes from 172.24.198.111: icmp_seq=2 ttl=64 time=0.296 ms
>>
>>
>>        --- 172.24.198.111 ping statistics ---
>>        2 packets transmitted, 2 received, 0% packet loss, time 1000ms
>>        rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms
>>
>>        but can't ping the NIDs:
>>        [root@p186 ~]# lctl ping 172.24.198.112@o2ib
>>        failed to ping 172.24.198.112@o2ib: Input/output error
>>        [root@p186 ~]# lctl ping 172.24.198.111@o2ib
>>        failed to ping 172.24.198.111@o2ib: Input/output error
>>
>>        Any idea why LNET can't ping the NIDs?
>>
>>        Some more configuration details:
>>        [root@p186 ~]# ibstat
>>        CA 'mthca0'
>>               CA type: MT23108
>>               Number of ports: 2
>>               Firmware version: 3.5.0
>>               Hardware version: a1
>>               Node GUID: 0x0002c9020021550c
>>
>>        Machines are connected via IB switch.
>>
>>        Looking forward to your help.
>>
>>        ~subbu
>>
>>  ------------------------------------------------------------------------
>>
>>        _______________________________________________
>>        Lustre-discuss mailing list
>>        Lustre-discuss at lists.lustre.org
>>        <mailto:Lustre-discuss at lists.lustre.org>
>>        http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>>
>>
>>
>>
>> --
>> . . . s u b b u
>> "You've got to be original, because if you're like someone else, what do
>> they need you for?"
>
>


-- 
. . . s u b b u
"You've got to be original, because if you're like someone else, what do
they need you for?"