[Lustre-discuss] 1.8.1(-ish) client vs. 1.6.7.2 server

Robin Humble robin.humble+lustre at anu.edu.au
Wed Jul 15 08:22:26 PDT 2009


On Wed, Jul 15, 2009 at 10:10:06AM -0400, Brian J. Murrell wrote:
>On Wed, 2009-07-15 at 08:46 -0400, Robin Humble wrote:
>> 
>>   Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.244@o2ib failed: 5
>>   Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.244@o2ib failed: 5
>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.244@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>   Lustre: MGC10.8.30.244@o2ib: Reactivating import
>>   Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.245@o2ib failed: 5
>>   Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.245@o2ib failed: 5
>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.245@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>   Lustre: Client system-client has started
>>   Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.201@o2ib failed: 5
>>   ... last message repeated 17 times ...
>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.201@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.202@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>   Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.203@o2ib failed: 5
>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.203@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.204@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>   Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.205@o2ib failed: 5
>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.205@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.206@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.207@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.208@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>   Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.208@o2ib failed: 5
>
>These are all LND errors.  What versions of OFED are you using on each
>end?

all kernels are compiled with the rhel5 kernel tree's standard OFED.
I think 1.3.2 is what's in rhel5.3/centos5.3?

>> looks like it succeeds in the end, but only after a struggle.
>Is it completely stable and performant after the struggle?  Do the error
>messages stop?

the fs's appear to be fine.

the error messages only appear on the initial mount of the first lustre fs.
subsequent mounts of other lustre fs's don't get any messages, so it
seems like it's just an extremely noisy protocol/version negotiation
the first time the 1.8.1 lnet fires up and tries to talk to 1.6.7.2
servers?
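
one way to check that the noise really is confined to that first
negotiation is to tally the LND messages per peer NID. the following is
only a minimal python sketch: the function name summarize() and the
regular expressions are illustrative assumptions based on the two message
shapes quoted above, not anything shipped with lustre. it accepts the NID
separator as either "@" or " at " (as mail archives tend to render it),
and it assumes a syslog "last message repeated N times" line refers to
the most recently matched message.

```python
import re
from collections import Counter

# Patterns for the two message kinds seen in the quoted log excerpt.
RX_FAILED = re.compile(
    r"kiblnd_rx_complete\(\)\) Rx from (\S+?)(?:@| at )(\w+) failed: (-?\d+)")
RETRYING = re.compile(
    r"kiblnd_reconnect\(\)\) (\S+?)(?:@| at )(\w+): retrying \(([^)]*)\)")
REPEATED = re.compile(r"last message repeated (\d+) times")


def summarize(lines):
    """Return (rx_failed, retries): Counters keyed by peer NID."""
    rx_failed, retries = Counter(), Counter()
    last = None  # (counter, nid) of the most recently matched message
    for line in lines:
        m = RX_FAILED.search(line)
        if m:
            nid = f"{m.group(1)}@{m.group(2)}"
            rx_failed[nid] += 1
            last = (rx_failed, nid)
            continue
        m = RETRYING.search(line)
        if m:
            nid = f"{m.group(1)}@{m.group(2)}"
            retries[nid] += 1
            last = (retries, nid)
            continue
        m = REPEATED.search(line)
        if m and last is not None:
            # Assumption: the repeat count belongs to the previous match.
            counter, nid = last
            counter[nid] += int(m.group(1))
    return rx_failed, retries


# Small demo using lines in the same shape as the excerpt above.
demo = [
    "Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) "
    "Rx from 10.8.30.201@o2ib failed: 5",
    "... last message repeated 17 times ...",
    "Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) "
    "10.8.30.201@o2ib: retrying (version negotiation), 12, 11, "
    "queue_dep: 8, max_frag: 256, msg_size: 4096",
]
failed, retried = summarize(demo)
for nid in sorted(set(failed) | set(retried)):
    print(nid, "rx_failed:", failed[nid], "retries:", retried[nid])
```

running it over the kernel log captured after each mount should show
non-zero counts only for the first one, if the theory above holds.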

another data point is that the above errors don't happen with
2.6.18-128.1.14.el5 patched with 1.8.0.1 and using the same in-kernel
OFED, so it's probably something that changed between 1.8.0.1 and
1.8.1-pre.
or I guess it could be a rhel change between 2.6.18-128.1.14.el5 and
2.6.18-128.1.16.el5, but that seems less likely.
I can spin up a 2.6.18-128.1.14.el5 with b_release_1_8_1 if you like...

>> BTW, should I be using cvs tag v1_8_1_RC1 instead of b_release_1_8_1?
>> I'm confused about which is closest to the final 1.8.1 :-/
>
>b_release_1_8_1 is the branch and v1_8_1_RC1 is the tag (i.e. snapshot
>in time from the branch) which is getting tested from that branch which
>has the potential to become 1.8.1 if the testing pans out.  It is
>entirely possible that even when v1_8_1_RCn becomes the final release,
>there will be patches dangling on the tip of b_release_1_8_1 that are
>not release blockers but there in case we need a 1.8.1.1.
>
>So the choice is yours.  If you want to be using exactly what could
>potentially be the GA release, you should stick to using the most recent
>tags.  If you want to test ahead of what could be the GA, use the branch
>tip.

cool. thanks for the explanation.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility


