<span style="font-family: verdana,sans-serif; color: rgb(51, 51, 153);">Liang,</span><br style="font-family: verdana,sans-serif; color: rgb(51, 51, 153);"><span style="font-family: verdana,sans-serif; color: rgb(51, 51, 153);">After running the following echo command:</span><br>
<span style="font-family: courier new,monospace;">echo +neterror > /proc/sys/lnet/printk</span><br><br><span style="font-family: verdana,sans-serif; color: rgb(51, 51, 153);">lctl ping now shows the following error:</span><br>
<br><span style="font-family: courier new,monospace;">
# lctl ping 172.24.198.112@o2ib</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">
failed to ping 172.24.198.112@o2ib: Input/output error</span><br style="font-family: courier new,monospace;">
<br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">
Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) 172.24.198.112@o2ib: ROUTE ERROR -22</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">
Jan 16 10:24:14 p128 kernel: Lustre:
2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting
messages for 172.24.198.112@o2ib: connection failed</span><br>
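<br><span style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);">For reading these logs, it may help to note that the negative numbers in Lustre kernel messages (such as ROUTE ERROR -22) are normally negated Linux errno values. The following is only a sketch: the table is limited to the codes that appear in this thread, and the names LINUX_ERRNO, CODE_RE and decode_errors are made up for illustration.</span><br>
<pre>
```python
import re

# Linux errno values for the codes seen in the kernel messages of this thread.
# Hard-coded so the lookup does not depend on the platform running the script.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    22: ("EINVAL", "Invalid argument"),
    99: ("EADDRNOTAVAIL", "Cannot assign requested address"),
    100: ("ENETDOWN", "Network is down"),
}

# A minus sign at the start of the line, or after whitespace or "(", followed
# by digits. Tokens such as "105-4" or "2.6.18-53" are skipped on purpose.
CODE_RE = re.compile(r"(?:^|[\s(])-(\d+)\b")


def decode_errors(line):
    """Return (code, name, description) for each known negative errno in a log line."""
    found = []
    for match in CODE_RE.finditer(line):
        code = int(match.group(1))
        if code in LINUX_ERRNO:
            name, text = LINUX_ERRNO[code]
            found.append((-code, name, text))
    return found


if __name__ == "__main__":
    sample = "(o2iblnd_cb.c:2687:kiblnd_cm_callback()) 172.24.198.112@o2ib: ROUTE ERROR -22"
    for code, name, text in decode_errors(sample):
        print(code, name, text)
```
</pre>
<span style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);">Read this way, -22 is EINVAL (invalid argument), -99 is EADDRNOTAVAIL, -100 is ENETDOWN and -5 is EIO.</span><br>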
<br style="color: rgb(0, 0, 102);"><span style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);">This looks like a problem with the IB connection manager.</span><br style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);">
<br style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);"><span style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);">1. Are there any help docs on setting up IPoIB with Lustre? The Lustre operations manual has very little information about this. I think I am missing some part of the IPoIB setup here.</span><br style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);">
<span style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);">2. Or is the manual assignment of an IP address to "ib0" causing the problem?</span><br style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);">
<br><br style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);"><b><span style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);">Some more supporting info:</span><br style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);">
</b><span style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);">A subnet manager is also running, version OpenSM 3.1.8.</span><br style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);"><br style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);">
<span style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);">Initially I got this error when mounting the MDS:</span><br><br><span style="font-family: courier new,monospace;">Jan 16 09:45:20 p128 kernel: LustreError: 4991:0:(linux-tcpip.c:124:libcfs_ipif_query()) Can't get IP address for interface ib0</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">Jan 16 09:45:20 p128 kernel: LustreError: 4991:0:(o2iblnd.c:1563:kiblnd_startup()) Can't query IPoIB interface ib0: -99</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">Jan 16 09:45:21 p128 kernel: LustreError: 105-4: Error -100 starting up LNI o2ib</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Jan 16 09:45:21 p128 kernel: LustreError: 4991:0:(events.c:707:ptlrpc_init_portals()) network initialisation failed</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting ptlrpc (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/ptlrpc.ko): Input/output error</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting osc (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/osc.ko): Unknown symbol in module, or unknown parameter (see dmesg)</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_prep_enqueue_req</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_resource_get</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ptlrpc_lprocfs_register_obd</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">.</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">.</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">.</span><br><br><span style="color: rgb(0, 0, 102); font-family: verdana,sans-serif;">Then I manually set the IP address for ib0 as follows:</span><br>
<span style="font-family: courier new,monospace;">ifconfig ib0 172.24.198.111</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">[root@p186 ~]# ifconfig ib0</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">ib0 Link encap:InfiniBand HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> inet addr:172.24.198.112 Bcast:172.24.255.255 Mask:255.255.0.0</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> UP BROADCAST MULTICAST MTU:65520 Metric:1</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> RX packets:0 errors:0 dropped:0 overruns:0 frame:0</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> collisions:0 txqueuelen:256</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;"> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)</span><br><br><span style="color: rgb(0, 0, 102); font-family: verdana,sans-serif;">After that, it mounted successfully:</span><br>
<br><span style="font-family: courier new,monospace;">Jan 16 09:47:09 p128 kernel: Lustre: Added LNI 172.24.198.111@o2ib [8/64]</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter lustre-MDT0000.mdt.group_upcall in log lustre-MDT0000</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000: new disk, initializing</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Jan 16 09:47:09 p128 kernel: Lustre: MDT lustre-MDT0000 now serving dev (lustre-MDT0000/64db1fc7-03ba-9803-4d20-ab0d2aa66116) with recovery enabled</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">Jan 16 09:47:09 p128 kernel: Lustre: 5274:0:(lproc_mds.c:262:lprocfs_wr_group_upcall()) lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Jan 16 09:47:09 p128 kernel: Lustre: Server lustre-MDT0000 on device /dev/loop0 has started</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">.</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">.</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">.</span><br style="font-family: courier new,monospace;">
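<br><span style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);">One thing that might be worth checking in the ifconfig ib0 output above: the flags line reads UP BROADCAST MULTICAST without RUNNING, and the RX/TX counters are zero. With the classic net-tools ifconfig, a missing RUNNING flag usually means the driver does not consider the link operational, so (as an assumption, not a confirmed diagnosis) the IPoIB link may not be fully up. Below is a small sketch that extracts these fields from ifconfig text; ipoib_status and SAMPLE are hypothetical names, and the parsing assumes the old net-tools output format shown in this message.</span><br>
<pre>
```python
import re

# Excerpt of the `ifconfig ib0` output quoted in this message.
SAMPLE = """ib0 Link encap:InfiniBand HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
 inet addr:172.24.198.112 Bcast:172.24.255.255 Mask:255.255.0.0
 UP BROADCAST MULTICAST MTU:65520 Metric:1
 RX packets:0 errors:0 dropped:0 overruns:0 frame:0
"""


def ipoib_status(ifconfig_text):
    """Summarise one interface block of classic net-tools ifconfig output."""
    addr = re.search(r"inet addr:(\d+\.\d+\.\d+\.\d+)", ifconfig_text)

    # The flags (UP, RUNNING, ...) sit on the same line as "MTU:".
    flags_line = ""
    for line in ifconfig_text.splitlines():
        if "MTU:" in line:
            flags_line = line
            break
    flags = flags_line.split("MTU:")[0].split()

    return {
        "ipv4": addr.group(1) if addr else None,
        "up": "UP" in flags,
        "running": "RUNNING" in flags,
    }


if __name__ == "__main__":
    print(ipoib_status(SAMPLE))
```
</pre>
<span style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);">If "running" comes back false on either node, checking the port state with ibstat before retrying lctl ping seems a reasonable next step.</span><br>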
<br><br><span style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);">~subbu</span><br style="font-family: verdana,sans-serif; color: rgb(0, 0, 102);"><br><br><div class="gmail_quote">On Thu, Jan 15, 2009 at 8:37 PM, Liang Zhen <span dir="ltr"><<a href="mailto:Zhen.Liang@sun.com">Zhen.Liang@sun.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Subbu,<br>
<br>
I'd suggest:<br>
1) make sure ko2iblnd has been brought up (please check whether there are any error messages when ko2iblnd starts up)<br>
2) echo +neterror > /proc/sys/lnet/printk, then try lctl ping again; if it still doesn't work, please post the error messages<br>
<br>
Regards<br>
Liang<br>
<br>
subbu kl:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div><div></div><div class="Wj3C7c">
The problem is similar to <a href="http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html" target="_blank">http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html</a><br>
but I could not find a solution to the problem in that thread.<br>
<br>
I have two RHEL5 Linux servers with the following packages installed:<br>
<br>
kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1<br>
kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp<br>
lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp<br>
lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp<br>
lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp<br>
e2fsprogs-1.40.7.sun3-0redhat<br>
<br>
<br>
machine 1: with ib0 IP address : 172.24.198.111<br>
machine 2: with ib0 IP address : 172.24.198.112<br>
<br>
/etc/modprobe.conf contains<br>
options lnet networks=o2ib<br>
<br>
TCP networking worked fine. Now I am trying the InfiniBand network and am having difficulty communicating between the IB nodes. The mount attempt throws the following error:<br>
<br>
[root@p186 ~]# mount -t lustre -o loop /tmp/lustre-ost1 /mnt/ost1<br>
mount.lustre: mount /dev/loop0 at /mnt/ost1 failed: Input/output error<br>
Is the MGS running?<br>
<br>
/var/log/messages :<br>
Jan 15 16:55:25 p186 kernel: kjournald starting. Commit interval 5 seconds<br>
Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal<br>
Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.<br>
Jan 15 16:55:25 p186 kernel: kjournald starting. Commit interval 5 seconds<br>
Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal<br>
Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.<br>
Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file extents enabled<br>
Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc enabled<br>
Jan 15 16:55:30 p186 kernel: Lustre: Request x7 sent from MGC172.24.198.111@o2ib to NID 172.24.198.111@o2ib 5s ago has timed out (limit 5s).<br>
Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1062:server_start_targets()) Required registration failed for lustre-OSTffff: -5<br>
Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: Communication error with the MGS. Is the MGS running?<br>
Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1597:server_fill_super()) Unable to start targets: -5<br>
Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1382:server_put_super()) no obd lustre-OSTffff<br>
Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:119:server_deregister_mount()) lustre-OSTffff not registered<br>
Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success)<br>
Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost<br>
Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 generated and it took 0<br>
Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 preallocated, 0 discarded<br>
Jan 15 16:55:30 p186 kernel: Lustre: server umount lustre-OSTffff complete<br>
Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount (-5)<br>
<br>
All attempts to ping the IB NIDs, local and remote, also failed.<br>
I can ping the IP addresses:<br>
[root@p186 ~]# ping 172.24.198.112<br>
PING 172.24.198.112 (172.24.198.112) 56(84) bytes of data.<br></div></div>
64 bytes from 172.24.198.112: icmp_seq=1 ttl=64 time=0.052 ms<br>
64 bytes from 172.24.198.112: icmp_seq=2 ttl=64 time=0.024 ms<div class="Ih2E3d"><br>
<br>
--- 172.24.198.112 ping statistics ---<br>
2 packets transmitted, 2 received, 0% packet loss, time 1000ms<br>
rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms<br>
[root@p186 ~]# ping 172.24.198.111<br>
PING 172.24.198.111 (172.24.198.111) 56(84) bytes of data.<br></div>
64 bytes from 172.24.198.111: icmp_seq=1 ttl=64 time=2.16 ms<br>
64 bytes from 172.24.198.111: icmp_seq=2 ttl=64 time=0.296 ms<div class="Ih2E3d"><br>
<br>
--- 172.24.198.111 ping statistics ---<br>
2 packets transmitted, 2 received, 0% packet loss, time 1000ms<br>
rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms<br>
<br>
but I can't ping the NIDs:<br>
[root@p186 ~]# lctl ping 172.24.198.112@o2ib<br>
failed to ping 172.24.198.112@o2ib: Input/output error<br>
[root@p186 ~]# lctl ping 172.24.198.111@o2ib<br>
failed to ping 172.24.198.111@o2ib: Input/output error<br>
<br>
Any idea why LNET can't ping the NIDs?<br>
<br>
some more configurations:<br>
[root@p186 ~]# ibstat<br>
CA 'mthca0'<br>
CA type: MT23108<br>
Number of ports: 2<br>
Firmware version: 3.5.0<br>
Hardware version: a1<br>
Node GUID: 0x0002c9020021550c<br>
<br>
Machines are connected via IB switch.<br>
<br>
Looking forward to your help.<br>
<br>
~subbu<br></div>
</blockquote>
<br>
</blockquote></div><br><br clear="all"><br>-- <br>. . . s u b b u<br>"You've got to be original, because if you're like someone else, what do they need you for?"<br>