[Lustre-discuss] 1.8.1(-ish) client vs. 1.6.7.2 server

Robin Humble robin.humble+lustre at anu.edu.au
Wed Jul 15 10:44:40 PDT 2009


On Wed, Jul 15, 2009 at 11:22:26AM -0400, Robin Humble wrote:
>On Wed, Jul 15, 2009 at 10:10:06AM -0400, Brian J. Murrell wrote:
>>On Wed, 2009-07-15 at 08:46 -0400, Robin Humble wrote:
>>> 
>>>   Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.244@o2ib failed: 5
>>>   Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.244@o2ib failed: 5
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.244@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>>   Lustre: MGC10.8.30.244@o2ib: Reactivating import
>>>   Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.245@o2ib failed: 5
>>>   Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.245@o2ib failed: 5
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.245@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>>   Lustre: Client system-client has started
>>>   Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.201@o2ib failed: 5
>>>   ... last message repeated 17 times ...
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.201@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.202@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>>   Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.203@o2ib failed: 5
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.203@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.204@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>>   Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.205@o2ib failed: 5
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.205@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.206@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.207@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30.208@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
>>>   Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30.208@o2ib failed: 5
>>
>>> looks like it succeeds in the end, but only after a struggle.
>>Is it completely stable and performant after the struggle?  Do the error
>>messages stop?
>the fs's appear to be fine.

hmmm - actually, the fs's are _mostly_ fine... but sometimes i/o that
happens right after the above errors fails completely. after a few more
trials, this seems to happen about 40% of the time... :-/

eg. (+/- some missing characters from crappy IPMI SoL) you can see that
rsync has managed to list the files on the newly mounted lustre fs, but
then gets i/o errors when trying to copy the files off lustre to ramdisk ->

...
rsync: readlink "/mnt/lustre_system/lib64/libattr.so.1.1.0.1.0" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libc-2.5.so" faile(5)
rsync: readlink "/mnt/lustre_system/lib64/libcrypt-2.5.so" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libdevmapper-event.a.1.02" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libdevmapper-event.so.1.02" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libexpat.so.0.5.0" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib6-2.0.a" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libgmodule-2.0.aject-2.0.a" failed: Input/output error (5)
...

so maybe lnet has renegotiated a connection to the MDS ok, but not to
the OSS's yet.
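
if that guess is right, one workaround would be to hold off on the rsync
until every server NID answers an lnet ping. a minimal sketch of that idea
follows; it is untested against this setup. `lctl ping <nid>` is a real
command, but treating its exit status as "connection renegotiated" is an
assumption, and the names wait_for_nids, lctl_ping, probe, attempts and
delay are all made up for illustration.

```python
import subprocess
import time


def lctl_ping(nid, timeout=5):
    """Return True if `lctl ping <nid>` exits 0.

    Assumption: a zero exit status means the peer answered over lnet.
    Needs root and a loaded lnet module on the client.
    """
    try:
        result = subprocess.run(
            ["lctl", "ping", nid],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def wait_for_nids(nids, probe=lctl_ping, attempts=10, delay=2.0,
                  sleep=time.sleep):
    """Probe each NID until all answer or the attempts run out.

    Returns the NIDs that never answered; an empty list means every
    server responded at least once.
    """
    pending = list(nids)
    for attempt in range(attempts):
        # Keep only the NIDs that still fail the probe.
        pending = [nid for nid in pending if not probe(nid)]
        if not pending:
            break
        if attempt < attempts - 1:
            sleep(delay)
    return pending
```

usage would be something like
wait_for_nids(["10.8.30.244@o2ib", "10.8.30.201@o2ib"]) with the full list
of server NIDs, and only starting the copy if it returns an empty list.
the probe is passed in as a parameter so the retry loop can be exercised
without a lustre client.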

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility


