[Lustre-discuss] o2ib cant ping/mount Infiniband NID

Liang Zhen Zhen.Liang at Sun.COM
Fri Jan 23 18:36:04 PST 2009


Subbu,
I don't think tcpdump would show anything on ib0 even if lctl ping succeeded, 
because we only need IPoIB to establish the connection, not for the data transfer itself.
I think we need the following information to diagnose this:
1. modprobe.conf from both IB nodes
2. ifconfig output from both nodes
3. the routing table on both nodes
4. the result of lctl ping to each node's own NID, run on both nodes, to see whether any error appears (with +neterror enabled)
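
When reading the +neterror output, it may help to translate the negative return codes (for example the -110 and -113 in your log) into errno names. Below is a minimal sketch, assuming it runs on a Linux host so that Python's errno table matches the kernel's numbering; the helper names describe_rc and to_signed32 are made up for illustration and are not part of Lustre.

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Map a negative kernel return code (e.g. -110) to its errno name.

    Assumes the script runs on Linux, so the local errno table matches
    the numbering used in the Lustre kernel log.
    """
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{rc}: {name} ({os.strerror(code)})"


def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value from the log as a signed int.

    The debug log prints rc three ways, e.g. "rc=4294967291 : -5 : fffffffb";
    this reproduces the middle value from the first one.
    """
    return value - (1 << 32) if value >= (1 << 31) else value


if __name__ == "__main__":
    # Return codes that appear in this thread's logs.
    for rc in (-5, -22, -99, -100, -110, -113):
        print(describe_rc(rc))
    print(to_signed32(4294967291))
```

The list in the main block is just the set of codes quoted in this thread; any other code from the log can be passed to describe_rc the same way.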

Regards
Liang

subbu kl:
> The problem remains the same: when I run lctl ping with tcpdump 4.0.0 running, I don't 
> see any activity on ib0!
>
> Here is another, exhaustive Lustre debug log that I took during lctl ping. Do you see 
> any problem in it?
>
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(module.c:160:libcfs_psdev_open()) Process entered
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(module.c:164:libcfs_psdev_open()) kmalloced 'ldu': 8 at 
> f5bc6620 (tot 7258558).
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(module.c:171:libcfs_psdev_open()) Process leaving (rc=0 : 0 : 0)
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(module.c:228:libcfs_ioctl()) Process entered
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(linux-module.c:49:libcfs_ioctl_getdata()) Process entered
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(linux-module.c:90:libcfs_ioctl_getdata()) Process leaving 
> (rc=0 : 0 : 0)
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(api-ni.c:1223:LNetNIInit()) refs 1
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(api-ni.c:1614:lnet_ping()) kmalloced 'info': 144 at f0b95880 
> (tot 7258702).
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(lib-lnet.h:251:lnet_eq_alloc()) kmalloced 'eq': 48 at 
> efda1a00 (tot 7258750).
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(lib-eq.c:72:LNetEQAlloc()) kmalloced 'eq->eq_events': 240 at 
> f0b95c80 (tot 7258990).
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(lib-lnet.h:279:lnet_md_alloc()) kmalloced 'md': 84 at 
> ed16acc0 (tot 7259074).
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(lib-lnet.h:327:lnet_msg_alloc()) kmalloced 'msg': 268 at 
> f205a400 (tot 7259342).
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(lib-move.c:2395:LNetGet()) LNetGet -> 12345-172.24.198.140 at o2ib
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(o2iblnd_cb.c:1531:kiblnd_send()) sending 0 bytes in 0 frags 
> to 12345-172.24.198.140 at o2ib
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(o2iblnd.c:312:kiblnd_create_peer()) kmalloced 'peer': 56 at 
> efda18c0 (tot 7259398).
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(o2iblnd_cb.c:1501:kiblnd_launch_tx()) peer[efda18c0] -> 
> 172.24.198.140 at o2ib (1)++
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(o2iblnd_cb.c:1380:kiblnd_connect_peer()) peer[efda18c0] -> 
> 172.24.198.140 at o2ib (2)++
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(o2iblnd_cb.c:1507:kiblnd_launch_tx()) peer[efda18c0] -> 
> 172.24.198.140 at o2ib (3)--
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(lib-eq.c:209:LNetEQPoll()) Process entered
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(lib-eq.c:146:lib_get_event()) Process entered
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, 
> eq->size: 2
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 14294:0:(lib-eq.c:152:lib_get_event()) Process leaving (rc=0 : 0 : 0)
> Jan 23 17:23:39 p186 kernel: Lustre: 
> 2782:0:(o2iblnd_cb.c:2682:kiblnd_cm_callback()) 172.24.198.140 at o2ib 
> Addr resolved: 0
> Jan 23 17:23:40 p186 kernel: Lustre: 
> 14294:0:(lib-eq.c:146:lib_get_event()) Process entered
> Jan 23 17:23:40 p186 kernel: Lustre: 
> 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, 
> eq->size: 2
> Jan 23 17:23:40 p186 kernel: Lustre: 
> 14294:0:(lib-eq.c:152:lib_get_event()) Process leaving (rc=0 : 0 : 0)
> Jan 23 17:23:40 p186 kernel: Lustre: 
> 14294:0:(lib-eq.c:239:LNetEQPoll()) Process leaving (rc=0 : 0 : 0)
> Jan 23 17:23:40 p186 kernel: Lustre: 
> 14294:0:(api-ni.c:1665:lnet_ping()) poll 0(-1 -1)
> Jan 23 17:23:40 p186 kernel: Lustre: 
> 14294:0:(lib-md.c:69:lnet_md_unlink()) Queueing unlink of md ed16acc0
> Jan 23 17:23:40 p186 kernel: Lustre: 
> 14294:0:(lib-eq.c:209:LNetEQPoll()) Process entered
> Jan 23 17:23:40 p186 kernel: Lustre: 
> 14294:0:(lib-eq.c:146:lib_get_event()) Process entered
> Jan 23 17:23:40 p186 kernel: Lustre: 
> 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, 
> eq->size: 2
> Jan 23 17:23:40 p186 kernel: Lustre: 
> 14294:0:(lib-eq.c:152:lib_get_event()) Process leaving (rc=0 : 0 : 0)
> Jan 23 17:23:56 p186 kernel: Lustre: 
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving 
> (rc=4294962944 : -4352 : ffffef00)
> Jan 23 17:23:56 p186 kernel: Lustre: 
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving 
> (rc=4294966784 : -512 : fffffe00)
> Jan 23 17:23:56 p186 kernel: Lustre: 
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2817 
> : 2817 : b01)
> Jan 23 17:23:56 p186 kernel: Lustre: 
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2047 
> : 2047 : 7ff)
> Jan 23 17:23:56 p186 kernel: Lustre: 
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving 
> (rc=4294740832 : -226464 : fffc8b60)
> Jan 23 17:23:56 p186 kernel: Lustre: 
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving 
> (rc=4286216485 : -8750811 : ff7a7925)
> Jan 23 17:23:56 p186 kernel: Lustre: 
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving 
> (rc=5821091 : 5821091 : 58d2a3)
> Jan 23 17:23:56 p186 kernel: Lustre: 
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving 
> (rc=3356952 : 3356952 : 333918)
> Jan 23 17:23:56 p186 kernel: Lustre: 
> 8276:0:(pinger.c:193:ptlrpc_pinger_main()) next ping in 25000 (8510847)
> Jan 23 17:24:21 p186 kernel: Lustre: 
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving 
> (rc=4294962944 : -4352 : ffffef00)
> Jan 23 17:24:21 p186 kernel: Lustre: 
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving 
> (rc=4294966784 : -512 : fffffe00)
> Jan 23 17:24:21 p186 kernel: Lustre: 
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2817 
> : 2817 : b01)
> Jan 23 17:24:21 p186 kernel: Lustre: 
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving (rc=2047 
> : 2047 : 7ff)
> Jan 23 17:24:21 p186 kernel: Lustre: 
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving 
> (rc=4294740832 : -226464 : fffc8b60)
> Jan 23 17:24:21 p186 kernel: Lustre: 
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving 
> (rc=4286216485 : -8750811 : ff7a7925)
> Jan 23 17:24:21 p186 kernel: Lustre: 
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving 
> (rc=5821091 : 5821091 : 58d2a3)
> Jan 23 17:24:21 p186 kernel: Lustre: 
> 8276:0:(lvfs_lib.c:173:lprocfs_read_helper()) Process leaving 
> (rc=3356952 : 3356952 : 333918)
> Jan 23 17:24:21 p186 kernel: Lustre: 
> 8276:0:(pinger.c:193:ptlrpc_pinger_main()) next ping in 25000 (8535847)
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 2794:0:(o2iblnd_cb.c:2704:kiblnd_cm_callback()) 172.24.198.140 at o2ib: 
> ROUTE ERROR -110
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 2794:0:(o2iblnd.c:422:kiblnd_unlink_peer_locked()) peer[efda18c0] -> 
> 172.24.198.140 at o2ib (2)--
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 2794:0:(router.c:151:lnet_notify()) 172.24.198.141 at o2ib notifying 
> 172.24.198.140 at o2ib: down
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 2794:0:(router.c:82:lnet_notify_locked()) Old news
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 2794:0:(o2iblnd_cb.c:2118:kiblnd_peer_connect_failed()) Deleting 
> messages for 172.24.198.140 at o2ib: connection failed
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 2794:0:(lib-md.c:73:lnet_md_unlink()) Unlinking md ed16acc0
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 2794:0:(lib-lnet.h:301:lnet_md_free()) kfreed 'md': 84 at ed16acc0 
> (tot 7259314).
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 2794:0:(lib-lnet.h:344:lnet_msg_free()) kfreed 'msg': 268 at f205a400 
> (tot 7259046).
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 2794:0:(o2iblnd_cb.c:2706:kiblnd_cm_callback()) peer[efda18c0] -> 
> 172.24.198.140 at o2ib (1)--
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 2794:0:(o2iblnd.c:357:kiblnd_destroy_peer()) kfreed 'peer': 56 at 
> efda18c0 (tot 7258990).
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 14294:0:(lib-eq.c:146:lib_get_event()) Process entered
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 14294:0:(lib-eq.c:149:lib_get_event()) event: f0b95cf8, sequence: 1, 
> eq->size: 2
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 14294:0:(lib-eq.c:170:lib_get_event()) Process leaving (rc=1 : 1 : 1)
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 14294:0:(lib-eq.c:232:LNetEQPoll()) Process leaving (rc=1 : 1 : 1)
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 14294:0:(api-ni.c:1665:lnet_ping()) poll 1(4 -113) unlinked
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 14294:0:(lib-lnet.h:259:lnet_eq_free()) kfreed 'eq': 48 at efda1a00 
> (tot 7258942).
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 14294:0:(lib-eq.c:135:LNetEQFree()) kfreed 'events': 240 at f0b95c80 
> (tot 7258702).
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 14294:0:(api-ni.c:1772:lnet_ping()) kfreed 'info': 144 at f0b95880 
> (tot 7258558).
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 14294:0:(module.c:336:libcfs_ioctl()) Process leaving (rc=4294967291 : 
> -5 : fffffffb)
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 14294:0:(module.c:178:libcfs_psdev_release()) Process entered
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 14294:0:(module.c:183:libcfs_psdev_release()) kfreed 'ldu': 8 at 
> f5bc6620 (tot 7258550).
> Jan 23 17:24:29 p186 kernel: Lustre: 
> 14294:0:(module.c:187:libcfs_psdev_release()) Process leaving (rc=0 : 
> 0 : 0)
>
> ~subbu
>
> On Fri, Jan 16, 2009 at 3:38 PM, subbu kl <subbukl at gmail.com> wrote:
>
>     Liang,
>
>     Right; you reproduced the exact problem. But as you can see in my
>     previous mail, I think I have solved that problem by manually
>     assigning an IP to ib0 (check the line # ifconfig ib0 172.24.198.111
>     and the *"Added LNI" lines*).
>
>     We are back to square one now, I guess! LNET is up with manually
>     assigned IPs. Normal ping succeeds between the machines, but lctl ping does not.
>
>     so my current problem is this :
>
>     # lctl ping 172.24.198.112 at o2ib
>     failed to ping 172.24.198.112 at o2ib: Input/output error
>
>     /var/log/messages:
>
>
>     Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2687:
>     kiblnd_cm_callback()) 172.24.198.112 at o2ib: ROUTE ERROR -22
>     Jan 16 10:24:14 p128 kernel: Lustre:
>     2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting
>     messages for 172.24.198.112 at o2ib: connection failed
>
>     how can I get rid of this connection problem?
>
>     ~subbu
>
>
>
>     On Fri, Jan 16, 2009 at 2:11 PM, Liang Zhen <Zhen.Liang at sun.com> wrote:
>
>         Subbu,
>
>         We don't have any tips for setting up IPoIB. It looks like Linux can't
>         find the ifaddr of ib0 on the MDS (-99 is EADDRNOTAVAIL), so I
>         think it's because you didn't assign any address to ib0 (or
>         failed to assign one) before loading o2iblnd on
>         the first try.
>         I can reproduce exactly the same error with:
>         1. modprobe ib_ipoib
>         2. ifconfig ib0 up  // without assigning any address
>         3. modprobe ko2iblnd
>         4. lctl network up
>
>         Regards
>         Liang
>
>         subbu kl:
>
>             Liang,
>             after executing the following echo:
>             echo +neterror > /proc/sys/lnet/printk
>
>             lctl ping now shows the following error:
>
>             # lctl ping 172.24.198.112 at o2ib
>             failed to ping 172.24.198.112 at o2ib: Input/output error
>
>             Jan 16 10:24:14 p128 kernel: Lustre:
>             2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback())
>             172.24.198.112 at o2ib: ROUTE ERROR -22
>             Jan 16 10:24:14 p128 kernel: Lustre:
>             2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed())
>             Deleting messages for 172.24.198.112 at o2ib: connection failed
>
>             Looks like some problem with "IB connection manager" !
>
>             1. Do we have any help docs for setting up IPoIB and Lustre? The
>             Lustre operations manual has very little info about this.
>             I think I am missing some part of the IPoIB setup here.
>             2. Or is the manual assignment of IP addresses to "ib0"
>             creating some problem?
>
>
>             *Some more supporting info :
>             *subnet manager of following version is also running :
>             OpenSM 3.1.8
>
>             Initially I got this error for MDS mount
>
>             Jan 16 09:45:20 p128 kernel: LustreError:
>             4991:0:(linux-tcpip.c:124:libcfs_ipif_query()) Can't get
>             IP address for interface ib0
>             Jan 16 09:45:20 p128 kernel: LustreError:
>             4991:0:(o2iblnd.c:1563:kiblnd_startup()) Can't query IPoIB
>             interface ib0: -99
>             Jan 16 09:45:21 p128 kernel: LustreError: 105-4: Error
>             -100 starting up LNI o2ib
>             Jan 16 09:45:21 p128 kernel: LustreError:
>             4991:0:(events.c:707:ptlrpc_init_portals()) network
>             initialisation failed
>             Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting
>             ptlrpc
>             (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/ptlrpc.ko):
>             Input/output error
>             Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting
>             osc
>             (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/osc.ko):
>             Unknown symbol in module, or unknown parameter (see dmesg)
>             Jan 16 09:45:21 p128 kernel: osc: Unknown symbol
>             ldlm_prep_enqueue_req
>             Jan 16 09:45:21 p128 kernel: osc: Unknown symbol
>             ldlm_resource_get
>             Jan 16 09:45:21 p128 kernel: osc: Unknown symbol
>             ptlrpc_lprocfs_register_obd
>             .
>             .
>             .
>
>             Then I manually set the IP address for ib0 as follows:
>             # ifconfig ib0 172.24.198.111
>
>             [root at p186 ~]# ifconfig ib0
>             ib0       Link encap:InfiniBand  HWaddr
>             80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>                      inet addr:172.24.198.112  Bcast:172.24.255.255
>              Mask:255.255.0.0
>                      UP BROADCAST MULTICAST  MTU:65520  Metric:1
>                      RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>                      TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>                      collisions:0 txqueuelen:256
>                      RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>
>             Then it mounted successfully:
>
>             *Jan 16 09:47:09 p128 kernel: Lustre: Added LNI
>             172.24.198.111 at o2ib [8/64]
>             Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started*
>             Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter
>             lustre-MDT0000.mdt.group_upcall in log lustre-MDT0000
>             Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr
>             Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000: new
>             disk, initializing
>             Jan 16 09:47:09 p128 kernel: Lustre: MDT lustre-MDT0000
>             now serving dev
>             (lustre-MDT0000/64db1fc7-03ba-9803-4d20-ab0d2aa66116) with
>             recovery enabled
>             Jan 16 09:47:09 p128 kernel: Lustre:
>             5274:0:(lproc_mds.c:262:lprocfs_wr_group_upcall())
>             lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups
>             Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT0000.mdt:
>             set parameter group_upcall=/usr/sbin/l_getgroups
>             Jan 16 09:47:09 p128 kernel: Lustre: Server lustre-MDT0000
>             on device /dev/loop0 has started
>             .
>             .
>             .
>
>
>             ~subbu
>
>
>             On Thu, Jan 15, 2009 at 8:37 PM, Liang Zhen
>             <Zhen.Liang at sun.com> wrote:
>
>                Subbu,
>
>                I'd suggest:
>                1) make sure ko2iblnd has been brought up (please check
>             if there
>                is any error message when startup ko2iblnd)
>                2) echo +neterror > /proc/sys/lnet/printk, then try
>             with lctl
>                ping, if it still can't work please post error messages
>
>                Regards
>                Liang
>
>                subbu kl:
>
>                    The problem is similar to
>                    http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html
>                    but I could not find a solution to it in that
>                    thread.
>
>                    I have two RHEL5 Linux servers installed with
>             following packages -
>
>                    kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1
>                    kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
>                  
>              lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
>                    lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
>                  
>              lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
>                    e2fsprogs-1.40.7.sun3-0redhat
>
>
>                    machine 1: with ib0 IP address : 172.24.198.111
>                    machine 2: with ib0 IP address : 172.24.198.112
>
>                    /etc/modprobe.conf contains
>                    options lnet networks=o2ib
>
>                    TCP networking worked fine. Now I am trying the
>                    Infiniband network and finding it difficult to communicate with
>                    the IB nodes. The mount attempt throws the following error:
>
>                    [root at p186 ~]# mount -t lustre -o loop
>             /tmp/lustre-ost1 /mnt/ost1
>                    mount.lustre: mount /dev/loop0 at /mnt/ost1 failed:
>                    Input/output error
>                    Is the MGS running?
>
>                    /var/log/messages :
>                    Jan 15 16:55:25 p186 kernel: kjournald starting.
>              Commit
>                    interval 5 seconds
>                    Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0,
>             internal journal
>                    Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted
>             filesystem
>                    with ordered data mode.
>                    Jan 15 16:55:25 p186 kernel: kjournald starting.
>              Commit
>                    interval 5 seconds
>                    Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0,
>             internal journal
>                    Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted
>             filesystem
>                    with ordered data mode.
>                    Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file
>             extents enabled
>                    Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc
>             enabled
>                    Jan 15 16:55:30 p186 kernel: Lustre: Request x7
>             sent from
>                    MGC172.24.198.111 at o2ib to NID 172.24.198.111 at o2ib
>             5s ago has
>                    timed out (limit 5s).
>                    Jan 15 16:55:30 p186 kernel: LustreError:
>                    7193:0:(obd_mount.c:1062:server_start_targets())
>             Required
>                    registration failed for lustre-OSTffff: -5
>                    Jan 15 16:55:30 p186 kernel: LustreError: 15f-b:
>             Communication
>                    error with the MGS.  Is the MGS running?
>                    Jan 15 16:55:30 p186 kernel: LustreError:
>                    7193:0:(obd_mount.c:1597:server_fill_super())
>             Unable to start
>                    targets: -5
>                    Jan 15 16:55:30 p186 kernel: LustreError:
>                    7193:0:(obd_mount.c:1382:server_put_super()) no obd
>             lustre-OSTffff
>                    Jan 15 16:55:30 p186 kernel: LustreError:
>                    7193:0:(obd_mount.c:119:server_deregister_mount())
>                    lustre-OSTffff not registered
>                    Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0
>             blocks 0
>                    reqs (0 success)
>                    Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0
>             extents
>                    scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost
>                    Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0
>             generated
>                    and it took 0
>                    Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0
>                    preallocated, 0 discarded
>                    Jan 15 16:55:30 p186 kernel: Lustre: server umount
>                    lustre-OSTffff complete
>                    Jan 15 16:55:30 p186 kernel: LustreError:
>                    7193:0:(obd_mount.c:1951:lustre_fill_super())
>             Unable to mount
>                     (-5)
>
>                    All attempts to ping the IB NIDs, local and remote, also failed.
>                    I can ping the IP addresses:
>                    [root at p186 ~]# ping 172.24.198.112
>                    PING 172.24.198.112 (172.24.198.112) 56(84) bytes
>             of data.
>                    64 bytes from 172.24.198.112: icmp_seq=1 ttl=64 time=0.052 ms
>                    64 bytes from 172.24.198.112: icmp_seq=2 ttl=64 time=0.024 ms
>
>
>                    --- 172.24.198.112 ping statistics ---
>                    2 packets transmitted, 2 received, 0% packet loss,
>             time 1000ms
>                    rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms
>                    [root at p186 ~]# ping 172.24.198.111
>                    PING 172.24.198.111 (172.24.198.111) 56(84) bytes
>             of data.
>                    64 bytes from 172.24.198.111: icmp_seq=1 ttl=64 time=2.16 ms
>                    64 bytes from 172.24.198.111: icmp_seq=2 ttl=64 time=0.296 ms
>
>
>                    --- 172.24.198.111 ping statistics ---
>                    2 packets transmitted, 2 received, 0% packet loss,
>             time 1000ms
>                    rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms
>
>                    but I can't ping the NIDs:
>                    [root at p186 ~]# lctl ping 172.24.198.112 at o2ib
>                    failed to ping 172.24.198.112 at o2ib: Input/output error
>                    [root at p186 ~]# lctl ping 172.24.198.111 at o2ib
>                    failed to ping 172.24.198.111 at o2ib: Input/output error
>
>                    Any idea why LNET can't ping the NIDs?
>
>                    some more configurations:
>                    [root at p186 ~]# ibstat
>                    CA 'mthca0'
>                           CA type: MT23108
>                           Number of ports: 2
>                           Firmware version: 3.5.0
>                           Hardware version: a1
>                           Node GUID: 0x0002c9020021550c
>
>                    Machines are connected via IB switch.
>
>                    Looking forward to your help.
>
>                    ~subbu
>                  
>              ------------------------------------------------------------------------
>
>                    _______________________________________________
>                    Lustre-discuss mailing list
>                    Lustre-discuss at lists.lustre.org
>             <mailto:Lustre-discuss at lists.lustre.org>
>                    <mailto:Lustre-discuss at lists.lustre.org
>             <mailto:Lustre-discuss at lists.lustre.org>>
>
>                    http://lists.lustre.org/mailman/listinfo/lustre-discuss
>                    
>
>
>
>
>             -- 
>             . . . s u b b u
>             "You've got to be original, because if you're like someone
>             else, what do they need you for?"
>
>
>
>
>
>     -- 
>     . . . s u b b u
>     "You've got to be original, because if you're like someone else,
>     what do they need you for?"
>
>
>
>
> -- 
> . . . s u b b u
> "You've got to be original, because if you're like someone else, what 
> do they need you for?"



