[Lustre-discuss] lctl ping error "Unexpected version" between Lustre 1.8.1.1 and 1.8.2

Alexander Bugl alexander.bugl at zmaw.de
Wed Feb 17 01:01:21 PST 2010


Hi,

I have a Lustre 1.8.1.1 system (MDS and OSS nodes, all on CentOS 5.3) with
Lustre 1.6.4.3 clients (Debian etch), running without problems.

I now have 4 additional OSS nodes, which I set up using the new Lustre
1.8.2. But I can't lctl ping between 1.8.1.1 nodes and 1.8.2 nodes using
InfiniBand. To be more precise:

OSS node 1:
[root@oss01 ~]# ifconfig | grep -C1 ib0
ib0       Link encap:InfiniBand  HWaddr ...
inet addr:172.16.30.134  Bcast:172.16.30.255  Mask:255.255.255.0
[root@oss01 ~]# uname -a
Linux oss01 2.6.18-164.11.1.el5_lustre.1.8.2 #1 SMP Fri Jan 22 19:11:17
MST 2010 x86_64 x86_64 x86_64 GNU/Linux

OSS node 5:
[root@oss05 ~]# ifconfig | grep -C1 ib0
ib0       Link encap:InfiniBand  HWaddr ...
inet addr:172.16.30.138  Bcast:172.16.30.255  Mask:255.255.255.0
[root@oss05 ~]# uname -a
Linux oss05 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1 SMP Tue Oct 6 05:48:57
MDT 2009 x86_64 x86_64 x86_64 GNU/Linux

The InfiniBand network is up and running; I can ping oss01 from oss05 and
vice versa:
[root@oss01 ~]# ping 172.16.30.138
PING 172.16.30.138 (172.16.30.138) 56(84) bytes of data.
64 bytes from 172.16.30.138: icmp_seq=1 ttl=64 time=0.125 ms
64 bytes from 172.16.30.138: icmp_seq=2 ttl=64 time=0.083 ms
[root@oss05 ~]# ping 172.16.30.134
PING 172.16.30.134 (172.16.30.134) 56(84) bytes of data.
64 bytes from 172.16.30.134: icmp_seq=1 ttl=64 time=2.19 ms
64 bytes from 172.16.30.134: icmp_seq=2 ttl=64 time=0.076 ms

And I am able to lctl ping the machines on their own addresses:
[root@oss01 ~]# lctl ping 172.16.30.134@o2ib
12345-0@lo
12345-172.16.30.134@o2ib
[root@oss05 ~]# lctl ping 172.16.30.138@o2ib
12345-0@lo
12345-172.16.30.138@o2ib

But I can't lctl ping the other machine:
[root@oss01 ~]# lctl ping 172.16.30.138@o2ib
failed to ping 172.16.30.138@o2ib: Protocol error
[root@oss05 ~]# lctl ping 172.16.30.134@o2ib
failed to ping 172.16.30.134@o2ib: Protocol error

The dmesg/messages output is somewhat longer, but the only error logged on
each node is this one:
[root@oss01 ~]# dmesg |tail -1
LustreError: 8855:0:(api-ni.c:1781:lnet_ping())
12345-172.16.30.138@o2ib: Unexpected version 0x1
[root@oss05 ~]# dmesg |tail -1
LustreError: 19249:0:(api-ni.c:1735:lnet_ping())
12345-172.16.30.134@o2ib: Unexpected version 0x2
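
To compare what each side reports at a glance, the peer NID and version code
can be pulled out of the log with a small filter. This is only a sketch: the
function name summarize_ping_versions is made up for illustration, and it
assumes the message looks exactly like the two samples above, with NIDs in
the usual address@network form.

```shell
# Hypothetical helper: print "<peer NID> <version>" for every
# "Unexpected version" LNET ping error found on stdin.
# The message may be wrapped across two lines (as in the samples above),
# so newlines are folded into spaces before matching.
summarize_ping_versions() {
    tr '\n' ' ' |
        grep -oE '12345-[0-9.]+@[a-z0-9]+: Unexpected version 0x[0-9a-fA-F]+' |
        sed -E 's/^12345-([^:]+): Unexpected version (0x[0-9a-fA-F]+)$/\1 \2/'
}

# Example use on a node:
#   dmesg | summarize_ping_versions
```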

I did not find anything regarding "Unexpected version 0x?" using Google ...

So it seems I can't mix 1.8.1.1 nodes and 1.8.2 nodes. That would be no
major problem, because I could upgrade the "older" MDS and OSS nodes to
1.8.2 as well, but I currently can't upgrade the 1.6.4.3 Lustre clients.
And the client nodes can't be lctl ping'ed from Lustre 1.8.2 either
(172.16.30.70 being one client IP):
[root@oss01 ~]# lctl ping 172.16.30.70@o2ib
failed to ping 172.16.30.70@o2ib: Protocol error
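
The individual checks above can be repeated over a list of NIDs from any one
node. A rough sketch follows: check_nids is a made-up helper name, and it
assumes that lctl ping exits with a non-zero status when the ping fails,
which may not hold for every Lustre version.

```shell
# Hypothetical helper: lctl ping each NID given as an argument and
# print one "ok" or "FAILED" line per NID.
# Assumption: lctl ping returns a non-zero exit status on failure.
check_nids() {
    for nid in "$@"; do
        if lctl ping "$nid" >/dev/null 2>&1; then
            echo "$nid ok"
        else
            echo "$nid FAILED"
        fi
    done
}

# Example with the two OSS nodes and the one client from this post:
#   check_nids 172.16.30.134@o2ib 172.16.30.138@o2ib 172.16.30.70@o2ib
```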

I have almost no InfiniBand know-how (I inherited this system), so sorry
if my question is a stupid one:

What is going on here, and is there a simple way to solve this problem of
no LNET connectivity between Lustre 1.8.2 and the older 1.8.1.1/1.6.4.3
nodes?

With regards, Alex

-- 
Alexander Bugl,  Central IT Services, ZMAW
Max  Planck  Institute   for   Meteorology
Bundesstrasse 53, D-20146 Hamburg, Germany
tel +49-40-41173-351, fax -356, room PE048


