[Lustre-discuss] o2ib cant ping/mount Infiniband NID

Liang Zhen Zhen.Liang at Sun.COM
Fri Jan 16 00:41:51 PST 2009


Subbu,

We don't have any tips for setting up IPoIB. It looks like Linux can't 
find the ifaddr of ib0 on the MDS (-99 is EADDRNOTAVAIL), so I think you 
didn't assign an address to ib0 (or the assignment failed) before 
loading o2iblnd on the first try.
I can reproduce exactly the same error with:
1. modprobe ib_ipoib
2. ifconfig ib0 up  // without assigning any address
3. modprobe ko2iblnd
4. lctl network up
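
To avoid hitting the sequence above by accident, something like the following could be run before loading the LND. This is only a rough sketch, not a tested procedure: the function names has_ipv4 and check_iface are made up for illustration, and it assumes the iproute2 "ip" tool is installed on the node.

```shell
#!/bin/sh
# Sketch only: has_ipv4 and check_iface are hypothetical helper names.

# has_ipv4 reads interface listing text on stdin and succeeds if any line
# carries an IPv4 address, in either "ip addr" style ("inet 1.2.3.4/16")
# or old ifconfig style ("inet addr:1.2.3.4").
has_ipv4() {
    grep -Eq 'inet (addr:)?[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'
}

# check_iface succeeds only if the named interface currently has an IPv4
# address. Assumption: "ip" from iproute2 is available.
check_iface() {
    iface=$1
    if ip -o -4 addr show dev "$iface" 2>/dev/null | has_ipv4; then
        echo "$iface has an IPv4 address"
    else
        echo "$iface has no IPv4 address; assign one before loading ko2iblnd" >&2
        return 1
    fi
}
```

With these defined, the load step could be guarded as "check_iface ib0 && modprobe ko2iblnd && lctl network up", so the module is only loaded once ib0 actually has an address.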

Regards
Liang

subbu kl:
> Liang,
> after executing the following echo:
> echo +neterror > /proc/sys/lnet/printk
>
> lctl ping now shows the following error:
>
> # lctl ping 172.24.198.112@o2ib
> failed to ping 172.24.198.112@o2ib: Input/output error
>
> Jan 16 10:24:14 p128 kernel: Lustre: 
> 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) 172.24.198.112@o2ib: 
> ROUTE ERROR -22
> Jan 16 10:24:14 p128 kernel: Lustre: 
> 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting 
> messages for 172.24.198.112@o2ib: connection failed
>
> Looks like there is some problem with the "IB connection manager"!
>
> 1. Are there any help docs for setting up IPoIB with Lustre? The Lustre 
> operation manual has very minimal info about this. I think I am 
> missing some part of the IPoIB setup here.
> 2. Or is the manual assignment of IP addresses to "ib0" creating 
> some problem?
>
>
> Some more supporting info:
> A subnet manager of the following version is also running: OpenSM 3.1.8
>
> Initially I got this error on the MDS mount:
>
> Jan 16 09:45:20 p128 kernel: LustreError: 
> 4991:0:(linux-tcpip.c:124:libcfs_ipif_query()) Can't get IP address 
> for interface ib0
> Jan 16 09:45:20 p128 kernel: LustreError: 
> 4991:0:(o2iblnd.c:1563:kiblnd_startup()) Can't query IPoIB interface 
> ib0: -99
> Jan 16 09:45:21 p128 kernel: LustreError: 105-4: Error -100 starting 
> up LNI o2ib
> Jan 16 09:45:21 p128 kernel: LustreError: 
> 4991:0:(events.c:707:ptlrpc_init_portals()) network initialisation failed
> Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting ptlrpc 
> (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/ptlrpc.ko): 
> Input/output error
> Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting osc 
> (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/osc.ko): 
> Unknown symbol in module, or unknown parameter (see dmesg)
> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_prep_enqueue_req
> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_resource_get
> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol 
> ptlrpc_lprocfs_register_obd
> .
> .
> .
>
> Then I manually set the IP address for ib0 as follows:
> ifconfig ib0 172.24.198.111
>
> [root@p186 ~]# ifconfig ib0
> ib0       Link encap:InfiniBand  HWaddr 
> 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>           inet addr:172.24.198.112  Bcast:172.24.255.255  Mask:255.255.0.0
>           UP BROADCAST MULTICAST  MTU:65520  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:256
>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>
> Then it mounted successfully:
>
> Jan 16 09:47:09 p128 kernel: Lustre: Added LNI 172.24.198.111@o2ib [8/64]
> Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started
> Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter 
> lustre-MDT0000.mdt.group_upcall in log lustre-MDT0000
> Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr
> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000: new disk, 
> initializing
> Jan 16 09:47:09 p128 kernel: Lustre: MDT lustre-MDT0000 now serving 
> dev (lustre-MDT0000/64db1fc7-03ba-9803-4d20-ab0d2aa66116) with 
> recovery enabled
> Jan 16 09:47:09 p128 kernel: Lustre: 
> 5274:0:(lproc_mds.c:262:lprocfs_wr_group_upcall()) lustre-MDT0000: 
> group upcall set to /usr/sbin/l_getgroups
> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000.mdt: set parameter 
> group_upcall=/usr/sbin/l_getgroups
> Jan 16 09:47:09 p128 kernel: Lustre: Server lustre-MDT0000 on device 
> /dev/loop0 has started
> .
> .
> .
>
>
> ~subbu
>
>
> On Thu, Jan 15, 2009 at 8:37 PM, Liang Zhen <Zhen.Liang at sun.com> wrote:
>
>     Subbu,
>
>     I'd suggest:
>     1) make sure ko2iblnd has been brought up (please check whether
>     there is any error message when ko2iblnd starts up)
>     2) echo +neterror > /proc/sys/lnet/printk, then try lctl ping
>     again; if it still doesn't work, please post the error messages
>
>     Regards
>     Liang
>
>     subbu kl:
>
>         The problem is similar to
>         http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html
>         but from reading that thread I could not really find the
>         solution to the problem.
>
>         I have two RHEL5 Linux servers installed with following packages -
>
>         kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1
>         kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
>         lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
>         lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
>         lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
>         e2fsprogs-1.40.7.sun3-0redhat
>
>
>         machine 1: with ib0 IP address : 172.24.198.111
>         machine 2: with ib0 IP address : 172.24.198.112
>
>         /etc/modprobe.conf contains
>         options lnet networks=o2ib
>
>         TCP networking worked fine. Now I am trying the InfiniBand
>         network and finding it difficult to communicate with the IB
>         nodes. The mount attempt throws the following error:
>
>         [root@p186 ~]# mount -t lustre -o loop /tmp/lustre-ost1 /mnt/ost1
>         mount.lustre: mount /dev/loop0 at /mnt/ost1 failed:
>         Input/output error
>         Is the MGS running?
>
>         /var/log/messages :
>         Jan 15 16:55:25 p186 kernel: kjournald starting.  Commit
>         interval 5 seconds
>         Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
>         Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem
>         with ordered data mode.
>         Jan 15 16:55:25 p186 kernel: kjournald starting.  Commit
>         interval 5 seconds
>         Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
>         Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem
>         with ordered data mode.
>         Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file extents enabled
>         Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc enabled
>         Jan 15 16:55:30 p186 kernel: Lustre: Request x7 sent from
>         MGC172.24.198.111@o2ib to NID 172.24.198.111@o2ib 5s ago has
>         timed out (limit 5s).
>         Jan 15 16:55:30 p186 kernel: LustreError:
>         7193:0:(obd_mount.c:1062:server_start_targets()) Required
>         registration failed for lustre-OSTffff: -5
>         Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: Communication
>         error with the MGS.  Is the MGS running?
>         Jan 15 16:55:30 p186 kernel: LustreError:
>         7193:0:(obd_mount.c:1597:server_fill_super()) Unable to start
>         targets: -5
>         Jan 15 16:55:30 p186 kernel: LustreError:
>         7193:0:(obd_mount.c:1382:server_put_super()) no obd lustre-OSTffff
>         Jan 15 16:55:30 p186 kernel: LustreError:
>         7193:0:(obd_mount.c:119:server_deregister_mount())
>         lustre-OSTffff not registered
>         Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 blocks 0
>         reqs (0 success)
>         Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 extents
>         scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost
>         Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 generated
>         and it took 0
>         Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0
>         preallocated, 0 discarded
>         Jan 15 16:55:30 p186 kernel: Lustre: server umount
>         lustre-OSTffff complete
>         Jan 15 16:55:30 p186 kernel: LustreError:
>         7193:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount
>          (-5)
>
>         All attempts to ping the IB NIDs, local and remote, also failed.
>         I can ping the IP addresses:
>         [root@p186 ~]# ping 172.24.198.112
>         PING 172.24.198.112 (172.24.198.112) 56(84) bytes of data.
>         64 bytes from 172.24.198.112: icmp_seq=1 ttl=64 time=0.052 ms
>         64 bytes from 172.24.198.112: icmp_seq=2 ttl=64 time=0.024 ms
>
>
>         --- 172.24.198.112 ping statistics ---
>         2 packets transmitted, 2 received, 0% packet loss, time 1000ms
>         rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms
>         [root@p186 ~]# ping 172.24.198.111
>         PING 172.24.198.111 (172.24.198.111) 56(84) bytes of data.
>         64 bytes from 172.24.198.111: icmp_seq=1 ttl=64 time=2.16 ms
>         64 bytes from 172.24.198.111: icmp_seq=2 ttl=64 time=0.296 ms
>
>
>         --- 172.24.198.111 ping statistics ---
>         2 packets transmitted, 2 received, 0% packet loss, time 1000ms
>         rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms
>
>         but I can't ping the NIDs:
>         [root@p186 ~]# lctl ping 172.24.198.112@o2ib
>         failed to ping 172.24.198.112@o2ib: Input/output error
>         [root@p186 ~]# lctl ping 172.24.198.111@o2ib
>         failed to ping 172.24.198.111@o2ib: Input/output error
>
>         Any idea why LNET can't ping the NIDs?
>
>         Some more configuration details:
>         [root@p186 ~]# ibstat
>         CA 'mthca0'
>                CA type: MT23108
>                Number of ports: 2
>                Firmware version: 3.5.0
>                Hardware version: a1
>                Node GUID: 0x0002c9020021550c
>
>         Machines are connected via IB switch.
>
>         Looking forward to your help.
>
>         ~subbu
>         ------------------------------------------------------------------------
>
>         _______________________________________________
>         Lustre-discuss mailing list
>         Lustre-discuss at lists.lustre.org
>         http://lists.lustre.org/mailman/listinfo/lustre-discuss
>          
>
>
>
>
>
> -- 
> . . . s u b b u
> "You've got to be original, because if you're like someone else, what 
> do they need you for?"



