[Lustre-discuss] o2ib cant ping/mount Infiniband NID
Liang Zhen
Zhen.Liang at Sun.COM
Fri Jan 16 00:41:51 PST 2009
Subbu,
We don't have any specific tips for setting up IPoIB. It looks like Linux
couldn't find the interface address of ib0 on the MDS (-99 is
EADDRNOTAVAIL), so I think no address had been assigned to ib0 (or the
assignment failed) before o2iblnd was loaded on your first try.
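For reference, the negative numbers in these logs are Linux errno values
with the sign flipped. Here is a minimal sketch of a lookup for the codes
that show up in this thread. The helper name lnet_errname is made up for
illustration and is not part of lctl or Lustre:

```shell
#!/bin/sh
# Hypothetical helper: map a (possibly negative) return code from the
# Lustre logs to its Linux errno name. Only the codes that appear in
# this thread are listed.
lnet_errname() {
    case "${1#-}" in          # strip a leading minus sign, if any
        5)   echo "EIO" ;;            # Input/output error
        22)  echo "EINVAL" ;;         # Invalid argument
        99)  echo "EADDRNOTAVAIL" ;;  # Cannot assign requested address
        100) echo "ENETDOWN" ;;       # Network is down
        *)   echo "unknown" ;;
    esac
}

# Decode the codes seen in the kernel messages.
for rc in -99 -100 -22 -5; do
    printf '%s -> %s\n' "$rc" "$(lnet_errname "$rc")"
done
```

So "Can't query IPoIB interface ib0: -99" says the interface has no usable
address, and the "-100" that follows for the LNI is ENETDOWN.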
I can reproduce exactly the same error with:
1. modprobe ib_ipoib
2. ifconfig ib0 up // without assigning any address
3. modprobe ko2iblnd
4. lctl network up
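
To avoid hitting this, one option is to check that ib0 really has an IPv4
address before loading the LND. This is a rough sketch, assuming the
iproute2 "ip" tool is installed. has_ipv4 and load_o2iblnd_if_ready are
hypothetical helper names, not Lustre commands:

```shell
#!/bin/sh
# Succeeds if the interface listing on stdin shows an IPv4 address.
# Accepts both "inet addr:a.b.c.d" (older ifconfig output) and
# "inet a.b.c.d/nn" (ip addr output).
has_ipv4() {
    grep -Eq 'inet (addr:)?[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'
}

# Hypothetical wrapper: load ko2iblnd and bring LNET up only when the
# IPoIB interface (default ib0) already has an address assigned.
load_o2iblnd_if_ready() {
    iface="${1:-ib0}"
    if ip -4 -o addr show dev "$iface" 2>/dev/null | has_ipv4; then
        modprobe ko2iblnd && lctl network up
    else
        echo "$iface has no IPv4 address; assign one before loading ko2iblnd" >&2
        return 1
    fi
}
```

Run as root it would be called as "load_o2iblnd_if_ready ib0". The check is
kept separate from the wrapper so it can also be fed saved ifconfig output.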
Regards
Liang
subbu kl:
> Liang,
> after executing the following echo:
> echo +neterror > /proc/sys/lnet/printk
>
> now lctl ping shows the following error
>
> # lctl ping 172.24.198.112@o2ib
> failed to ping 172.24.198.112@o2ib: Input/output error
>
> Jan 16 10:24:14 p128 kernel: Lustre:
> 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) 172.24.198.112@o2ib:
> ROUTE ERROR -22
> Jan 16 10:24:14 p128 kernel: Lustre:
> 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting
> messages for 172.24.198.112@o2ib: connection failed
>
> Looks like some problem with the "IB connection manager"!
>
> 1. Do we have any help docs for setting up IPoIB with Lustre? The Lustre
> operation manual has very minimal info about this. I think I am
> missing some part of the IPoIB setup here.
> 2. Or is the manual assignment of IP addresses to "ib0" creating
> some problem?
>
>
> Some more supporting info:
> A subnet manager of the following version is also running: OpenSM 3.1.8
>
> Initially I got this error for MDS mount
>
> Jan 16 09:45:20 p128 kernel: LustreError:
> 4991:0:(linux-tcpip.c:124:libcfs_ipif_query()) Can't get IP address
> for interface ib0
> Jan 16 09:45:20 p128 kernel: LustreError:
> 4991:0:(o2iblnd.c:1563:kiblnd_startup()) Can't query IPoIB interface
> ib0: -99
> Jan 16 09:45:21 p128 kernel: LustreError: 105-4: Error -100 starting
> up LNI o2ib
> Jan 16 09:45:21 p128 kernel: LustreError:
> 4991:0:(events.c:707:ptlrpc_init_portals()) network initialisation failed
> Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting ptlrpc
> (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/ptlrpc.ko):
> Input/output error
> Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting osc
> (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/osc.ko):
> Unknown symbol in module, or unknown parameter (see dmesg)
> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_prep_enqueue_req
> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_resource_get
> Jan 16 09:45:21 p128 kernel: osc: Unknown symbol
> ptlrpc_lprocfs_register_obd
> .
> .
> .
>
> then I manually set the IP address for ib0 as follows:
> ifconfig ib0 172.24.198.111
>
> [root@p186 ~]# ifconfig ib0
> ib0 Link encap:InfiniBand HWaddr
> 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
> inet addr:172.24.198.112 Bcast:172.24.255.255 Mask:255.255.0.0
> UP BROADCAST MULTICAST MTU:65520 Metric:1
> RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:256
> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
>
> then it mounted successfully
>
> Jan 16 09:47:09 p128 kernel: Lustre: Added LNI 172.24.198.111@o2ib [8/64]
> Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started
> Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter
> lustre-MDT0000.mdt.group_upcall in log lustre-MDT0000
> Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr
> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000: new disk,
> initializing
> Jan 16 09:47:09 p128 kernel: Lustre: MDT lustre-MDT0000 now serving
> dev (lustre-MDT0000/64db1fc7-03ba-9803-4d20-ab0d2aa66116) with
> recovery enabled
> Jan 16 09:47:09 p128 kernel: Lustre:
> 5274:0:(lproc_mds.c:262:lprocfs_wr_group_upcall()) lustre-MDT0000:
> group upcall set to /usr/sbin/l_getgroups
> Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000.mdt: set parameter
> group_upcall=/usr/sbin/l_getgroups
> Jan 16 09:47:09 p128 kernel: Lustre: Server lustre-MDT0000 on device
> /dev/loop0 has started
> .
> .
> .
>
>
> ~subbu
>
>
> On Thu, Jan 15, 2009 at 8:37 PM, Liang Zhen <Zhen.Liang at sun.com> wrote:
>
> Subbu,
>
> I'd suggest:
> 1) Make sure ko2iblnd has been brought up (please check whether there
> is any error message when ko2iblnd starts up).
> 2) echo +neterror > /proc/sys/lnet/printk, then try lctl ping
> again. If it still doesn't work, please post the error messages.
>
> Regards
> Liang
>
> subbu kl:
>
> The problem is similar to
> http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html
> but from reading that thread I could not really find the solution
> to the problem.
>
> I have two RHEL5 Linux servers installed with following packages -
>
> kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1
> kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
> lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
> lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
> lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
> e2fsprogs-1.40.7.sun3-0redhat
>
>
> machine 1: with ib0 IP address : 172.24.198.111
> machine 2: with ib0 IP address : 172.24.198.112
>
> /etc/modprobe.conf contains
> options lnet networks=o2ib
>
> TCP networking worked fine. Now I am trying the InfiniBand
> network and am having difficulty communicating between the IB nodes.
> Attempting to mount throws the following error:
>
> [root@p186 ~]# mount -t lustre -o loop /tmp/lustre-ost1 /mnt/ost1
> mount.lustre: mount /dev/loop0 at /mnt/ost1 failed:
> Input/output error
> Is the MGS running?
>
> /var/log/messages :
> Jan 15 16:55:25 p186 kernel: kjournald starting. Commit
> interval 5 seconds
> Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem
> with ordered data mode.
> Jan 15 16:55:25 p186 kernel: kjournald starting. Commit
> interval 5 seconds
> Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem
> with ordered data mode.
> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file extents enabled
> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc enabled
> Jan 15 16:55:30 p186 kernel: Lustre: Request x7 sent from
> MGC172.24.198.111@o2ib to NID 172.24.198.111@o2ib 5s ago has
> timed out (limit 5s).
> Jan 15 16:55:30 p186 kernel: LustreError:
> 7193:0:(obd_mount.c:1062:server_start_targets()) Required
> registration failed for lustre-OSTffff: -5
> Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: Communication
> error with the MGS. Is the MGS running?
> Jan 15 16:55:30 p186 kernel: LustreError:
> 7193:0:(obd_mount.c:1597:server_fill_super()) Unable to start
> targets: -5
> Jan 15 16:55:30 p186 kernel: LustreError:
> 7193:0:(obd_mount.c:1382:server_put_super()) no obd lustre-OSTffff
> Jan 15 16:55:30 p186 kernel: LustreError:
> 7193:0:(obd_mount.c:119:server_deregister_mount())
> lustre-OSTffff not registered
> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 blocks 0
> reqs (0 success)
> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 extents
> scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost
> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 generated
> and it took 0
> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0
> preallocated, 0 discarded
> Jan 15 16:55:30 p186 kernel: Lustre: server umount
> lustre-OSTffff complete
> Jan 15 16:55:30 p186 kernel: LustreError:
> 7193:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount
> (-5)
>
> All attempts to ping the IB NIDs, local and remote, also failed.
> I can ping the IP addresses:
> [root@p186 ~]# ping 172.24.198.112
> PING 172.24.198.112 (172.24.198.112) 56(84) bytes of data.
> 64 bytes from 172.24.198.112: icmp_seq=1 ttl=64 time=0.052 ms
> 64 bytes from 172.24.198.112: icmp_seq=2 ttl=64 time=0.024 ms
>
>
> --- 172.24.198.112 ping statistics ---
> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms
> rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms
> [root@p186 ~]# ping 172.24.198.111
> PING 172.24.198.111 (172.24.198.111) 56(84) bytes of data.
> 64 bytes from 172.24.198.111: icmp_seq=1 ttl=64 time=2.16 ms
> 64 bytes from 172.24.198.111: icmp_seq=2 ttl=64 time=0.296 ms
>
>
> --- 172.24.198.111 ping statistics ---
> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms
> rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms
>
> but can't ping the NIDs:
> [root@p186 ~]# lctl ping 172.24.198.112@o2ib
> failed to ping 172.24.198.112@o2ib: Input/output error
> [root@p186 ~]# lctl ping 172.24.198.111@o2ib
> failed to ping 172.24.198.111@o2ib: Input/output error
>
> Any idea why LNET can't ping the NIDs?
>
> some more configurations:
> [root@p186 ~]# ibstat
> CA 'mthca0'
> CA type: MT23108
> Number of ports: 2
> Firmware version: 3.5.0
> Hardware version: a1
> Node GUID: 0x0002c9020021550c
>
> Machines are connected via IB switch.
>
> Looking forward to your help.
>
> ~subbu
> ------------------------------------------------------------------------
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
> --
> . . . s u b b u
> "You've got to be original, because if you're like someone else, what
> do they need you for?"