[Lustre-discuss] o2ib cant ping/mount Infiniband NID

Liang Zhen Zhen.Liang at Sun.COM
Thu Jan 15 07:07:09 PST 2009


Subbu,

I'd suggest:
1) make sure ko2iblnd has been brought up (please check whether there are 
any error messages when ko2iblnd starts up)
2) echo +neterror > /proc/sys/lnet/printk, then try lctl ping again; if 
it still doesn't work, please post the error messages
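
As a rough sketch of the two checks above (not an official procedure): the function names are made up for illustration, and the only commands assumed are lsmod, dmesg and lctl ping. Run as root on the node that fails to ping.

```shell
# Build an LNET NID string from an address and a network name.
nid() {
    printf '%s@%s\n' "$1" "$2"
}

# 1) Is ko2iblnd loaded, and did it log anything when it started?
check_ko2iblnd() {
    if lsmod | grep -q '^ko2iblnd'; then
        echo "ko2iblnd is loaded"
    else
        echo "ko2iblnd is NOT loaded"
    fi
    # Show recent kernel messages that mention o2ib or LNET.
    dmesg | grep -i -E 'o2ib|lnet' | tail -n 20
}

# 2) Turn on network error messages, retry the ping, then show what was logged.
ping_with_neterror() {
    echo +neterror > /proc/sys/lnet/printk
    lctl ping "$(nid "$1" o2ib)"
    dmesg | tail -n 20
}
```

For example: check_ko2iblnd; ping_with_neterror 172.24.198.112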

Regards
Liang

subbu kl:
> The problem is similar to 
> http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html
> but I could not find a solution in that thread.
>
> I have two RHEL5 Linux servers with the following packages installed:
>
> kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1
> kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
> lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
> lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
> lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
> e2fsprogs-1.40.7.sun3-0redhat
>
>
> machine 1: with ib0 IP address : 172.24.198.111
> machine 2: with ib0 IP address : 172.24.198.112
>
> /etc/modprobe.conf contains
> options lnet networks=o2ib
>
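
One way to confirm that the networks=o2ib setting took effect is to compare the output of lctl list_nids with the NID expected for the node. A minimal sketch, assuming lctl list_nids prints one NID per line; has_nid and check_local_nid are hypothetical helper names, not part of Lustre.

```shell
# Succeed if the NID in $2 appears as a whole line in the list given in $1.
has_nid() {
    printf '%s\n' "$1" | grep -qxF "$2"
}

# Compare what LNET reports against the NID expected for this node.
check_local_nid() {
    expected="$1"
    actual="$(lctl list_nids)"
    if has_nid "$actual" "$expected"; then
        echo "LNET is up on $expected"
    else
        echo "expected $expected, lctl list_nids reported: $actual"
    fi
}
```

On machine 1 this would be: check_local_nid 172.24.198.111@o2ib
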
> TCP networking worked fine. Now I am trying the InfiniBand network and 
> the IB nodes cannot communicate; the mount attempt gives me the 
> following error:
>
> [root@p186 ~]# mount -t lustre -o loop /tmp/lustre-ost1 /mnt/ost1
> mount.lustre: mount /dev/loop0 at /mnt/ost1 failed: Input/output error
> Is the MGS running?
>
> /var/log/messages :
> Jan 15 16:55:25 p186 kernel: kjournald starting.  Commit interval 5 
> seconds
> Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with 
> ordered data mode.
> Jan 15 16:55:25 p186 kernel: kjournald starting.  Commit interval 5 
> seconds
> Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with 
> ordered data mode.
> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file extents enabled
> Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc enabled
> Jan 15 16:55:30 p186 kernel: Lustre: Request x7 sent from 
> MGC172.24.198.111@o2ib to NID 172.24.198.111@o2ib 5s ago has timed out 
> (limit 5s).
> Jan 15 16:55:30 p186 kernel: LustreError: 
> 7193:0:(obd_mount.c:1062:server_start_targets()) Required registration 
> failed for lustre-OSTffff: -5
> Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: Communication error 
> with the MGS.  Is the MGS running?
> Jan 15 16:55:30 p186 kernel: LustreError: 
> 7193:0:(obd_mount.c:1597:server_fill_super()) Unable to start targets: -5
> Jan 15 16:55:30 p186 kernel: LustreError: 
> 7193:0:(obd_mount.c:1382:server_put_super()) no obd lustre-OSTffff
> Jan 15 16:55:30 p186 kernel: LustreError: 
> 7193:0:(obd_mount.c:119:server_deregister_mount()) lustre-OSTffff not 
> registered
> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 
> success)
> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0 
> goal hits, 0 2^N hits, 0 breaks, 0 lost
> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 generated and it 
> took 0
> Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 preallocated, 0 
> discarded
> Jan 15 16:55:30 p186 kernel: Lustre: server umount lustre-OSTffff complete
> Jan 15 16:55:30 p186 kernel: LustreError: 
> 7193:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount  (-5)
>
> All attempts to lctl ping the IB NIDs, local and remote, also failed.
> I can ping the IP addresses:
> [root@p186 ~]# ping 172.24.198.112
> PING 172.24.198.112 (172.24.198.112) 56(84) bytes of data.
> 64 bytes from 172.24.198.112: icmp_seq=1 ttl=64 time=0.052 ms
> 64 bytes from 172.24.198.112: icmp_seq=2 ttl=64 time=0.024 ms
>
> --- 172.24.198.112 ping statistics ---
> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms
> rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms
> [root@p186 ~]# ping 172.24.198.111
> PING 172.24.198.111 (172.24.198.111) 56(84) bytes of data.
> 64 bytes from 172.24.198.111: icmp_seq=1 ttl=64 time=2.16 ms
> 64 bytes from 172.24.198.111: icmp_seq=2 ttl=64 time=0.296 ms
>
> --- 172.24.198.111 ping statistics ---
> 2 packets transmitted, 2 received, 0% packet loss, time 1000ms
> rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms
>
> but I can't ping the NIDs:
> [root@p186 ~]# lctl ping 172.24.198.112@o2ib
> failed to ping 172.24.198.112@o2ib: Input/output error
> [root@p186 ~]# lctl ping 172.24.198.111@o2ib
> failed to ping 172.24.198.111@o2ib: Input/output error
>
> Any idea why LNET can't ping the NIDs?
>
> Some more configuration details:
> [root@p186 ~]# ibstat
> CA 'mthca0'
>         CA type: MT23108
>         Number of ports: 2
>         Firmware version: 3.5.0
>         Hardware version: a1
>         Node GUID: 0x0002c9020021550c
>
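
The ibstat output quoted above stops before the per-port section, so it does not show whether a port is up. Assuming the usual ibstat layout, where each port has its own "State:" line, something like the following would count the ports reported as Active. count_active_ports is a hypothetical helper name.

```shell
# Count ports reported as Active. Reads ibstat-style text on stdin.
# The pattern is anchored so that "Physical state:" lines are not counted.
count_active_ports() {
    # grep -c prints 0 but exits non-zero when nothing matches; keep the status clean.
    grep -c -E '^[[:space:]]*State:[[:space:]]*Active' || true
}
```

Usage: ibstat | count_active_ports
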
> Machines are connected via IB switch.
>
> Looking forward to your help.
>
> ~subbu