[Lustre-discuss] o2ib can't ping/mount InfiniBand NID

subbu kl subbukl at gmail.com
Thu Jan 15 03:55:14 PST 2009


The problem is similar to
http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html
but I could not find a solution to it in that thread.

I have two RHEL5 Linux servers with the following packages installed:

kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1
kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
e2fsprogs-1.40.7.sun3-0redhat


machine 1: with ib0 IP address : 172.24.198.111
machine 2: with ib0 IP address : 172.24.198.112

/etc/modprobe.conf contains
options lnet networks=o2ib
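
To double-check which networks LNet will be asked to bring up, the option can be read back from the file. The sketch below is only an illustration: the function name lnet_networks_from_conf is made up here, and it assumes the networks= value sits on a single "options lnet" line with no embedded spaces.

```shell
# Hypothetical helper: print the value of networks= from the "options lnet" line.
# Defaults to /etc/modprobe.conf; pass another path as the first argument.
lnet_networks_from_conf() {
    awk '$1 == "options" && $2 == "lnet" {
        for (i = 3; i <= NF; i++) {
            if ($i ~ /^networks=/) {
                v = $i
                sub(/^networks=/, "", v)   # drop the key
                gsub(/"/, "", v)           # drop optional quotes
                print v
            }
        }
    }' "${1:-/etc/modprobe.conf}"
}
```

With the file shown above, lnet_networks_from_conf prints o2ib. The Lustre documentation also describes a form that names the interface explicitly, such as networks=o2ib0(ib0); whether that makes a difference for this failure is not established here.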

TCP networking worked fine. I am now trying the InfiniBand network and
cannot communicate with the IB nodes. The mount attempt throws the
following error:

[root@p186 ~]# mount -t lustre -o loop /tmp/lustre-ost1 /mnt/ost1
mount.lustre: mount /dev/loop0 at /mnt/ost1 failed: Input/output error
Is the MGS running?

/var/log/messages :
Jan 15 16:55:25 p186 kernel: kjournald starting.  Commit interval 5 seconds
Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with ordered
data mode.
Jan 15 16:55:25 p186 kernel: kjournald starting.  Commit interval 5 seconds
Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with ordered
data mode.
Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file extents enabled
Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc enabled
Jan 15 16:55:30 p186 kernel: Lustre: Request x7 sent from
MGC172.24.198.111@o2ib to NID 172.24.198.111@o2ib 5s ago has timed out
(limit 5s).
Jan 15 16:55:30 p186 kernel: LustreError:
7193:0:(obd_mount.c:1062:server_start_targets()) Required registration
failed for lustre-OSTffff: -5
Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: Communication error with
the MGS.  Is the MGS running?
Jan 15 16:55:30 p186 kernel: LustreError:
7193:0:(obd_mount.c:1597:server_fill_super()) Unable to start targets: -5
Jan 15 16:55:30 p186 kernel: LustreError:
7193:0:(obd_mount.c:1382:server_put_super()) no obd lustre-OSTffff
Jan 15 16:55:30 p186 kernel: LustreError:
7193:0:(obd_mount.c:119:server_deregister_mount()) lustre-OSTffff not
registered
Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0
success)
Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0 goal
hits, 0 2^N hits, 0 breaks, 0 lost
Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 generated and it took 0
Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 preallocated, 0
discarded
Jan 15 16:55:30 p186 kernel: Lustre: server umount lustre-OSTffff complete
Jan 15 16:55:30 p186 kernel: LustreError:
7193:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount  (-5)

All attempts to ping the IB NIDs, local and remote, also failed.
I can ping the IP addresses:
[root@p186 ~]# ping 172.24.198.112
PING 172.24.198.112 (172.24.198.112) 56(84) bytes of data.
64 bytes from 172.24.198.112: icmp_seq=1 ttl=64 time=0.052 ms
64 bytes from 172.24.198.112: icmp_seq=2 ttl=64 time=0.024 ms

--- 172.24.198.112 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms
[root@p186 ~]# ping 172.24.198.111
PING 172.24.198.111 (172.24.198.111) 56(84) bytes of data.
64 bytes from 172.24.198.111: icmp_seq=1 ttl=64 time=2.16 ms
64 bytes from 172.24.198.111: icmp_seq=2 ttl=64 time=0.296 ms

--- 172.24.198.111 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms

but I can't ping the NIDs:
[root@p186 ~]# lctl ping 172.24.198.112@o2ib
failed to ping 172.24.198.112@o2ib: Input/output error
[root@p186 ~]# lctl ping 172.24.198.111@o2ib
failed to ping 172.24.198.111@o2ib: Input/output error

Any idea why LNet can't ping the NIDs?
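
A minimal sketch of the checks that could narrow this down on each node, assuming root access and the lustre utilities on the PATH. The function names make_nid and check_lnet_peer are made up for illustration; lctl list_nids, lctl ping and the ko2iblnd module name are the standard ones.

```shell
# Hypothetical helper: build an LNet NID string from an IP address and a network name.
make_nid() {
    printf '%s@%s\n' "$1" "$2"
}

# Hypothetical helper: run basic LNet checks against one peer.
# $1 = peer IP address, $2 = LNet network name (defaults to o2ib).
check_lnet_peer() {
    nid=$(make_nid "$1" "${2:-o2ib}")
    if ! command -v lctl >/dev/null 2>&1; then
        echo "lctl not found; are the lustre utilities installed?" >&2
        return 2
    fi
    # Warn if the o2ib LND module is not loaded.
    lsmod | grep -q '^ko2iblnd' || echo "ko2iblnd module is not loaded" >&2
    # Show the NIDs LNet has configured locally.
    lctl list_nids
    # Finally try the peer itself; its exit status is the function's result.
    lctl ping "$nid"
}
```

For example, check_lnet_peer 172.24.198.112 would ping 172.24.198.112@o2ib after printing the local NIDs. If list_nids shows no o2ib entry, the o2ib network did not come up locally and the ping cannot be expected to succeed.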

Some more configuration details:
[root@p186 ~]# ibstat
CA 'mthca0'
        CA type: MT23108
        Number of ports: 2
        Firmware version: 3.5.0
        Hardware version: a1
        Node GUID: 0x0002c9020021550c

Machines are connected via IB switch.
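
The ibstat output quoted above stops before the per-port section, where ibstat normally prints a "State:" line for each port. A small sketch for counting active ports from that output follows; the function name count_active_ports is made up here, and it assumes the usual "State: Active" wording.

```shell
# Hypothetical helper: count ports reported as "State: Active" in ibstat output
# read from standard input. "Physical state:" lines are not matched because the
# pattern requires "State:" to be the first word on the line.
count_active_ports() {
    grep -c '^[[:space:]]*State:[[:space:]]*Active' || true
}
```

Usage would be: ibstat | count_active_ports. A result of 0 would mean no port has an active link, in which case o2ib could not work regardless of the LNet settings.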

Looking forward to your help.

~subbu