[Lustre-discuss] lnet_try_match_md()) Matching packet from 12345-10.5.203.250@tcp, match 19154486 length 728 too big
Michael D. Seymour
seymour at cita.utoronto.ca
Fri May 22 13:38:09 PDT 2009
Hi all,
I hope you can help us with some connection problems we are having with our
Lustre filesystem. The filesystem, roc, consists of 6 OSSs with one OST per OSS.
Each OSS uses the Lustre 1.6.7 RHEL 5 kernel on CentOS 5.1 (one unit uses CentOS 5.3).
The MDS uses CentOS 5.1 and Lustre 1.6.7. 203 RHEL-based clients mount the
filesystem and all use Lustre 1.6.7. All are connected via a Gigabit Ethernet switch
stack.
One client running CentOS 5.2 re-exports the Lustre filesystem via NFS on a
different network.
We get the following messages on a particular client:
May 22 15:07:45 trinity kernel: LustreError:
5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from
12345-10.5.203.250@tcp, match 19154486 length 728 too big: 704 left, 704 allowed
May 22 15:07:45 trinity kernel: LustreError:
5111:0:(lib-move.c:110:lnet_try_match_md()) Skipped 3 previous similar messages
May 22 15:12:45 trinity kernel: Lustre: Request x19154486 sent from
roc-MDT0000-mdc-000001044e1d4c00 to NID 10.5.203.250@tcp 300s ago has timed out
(limit 300s).
May 22 15:12:45 trinity kernel: Lustre: Skipped 3 previous similar messages
May 22 15:12:45 trinity kernel: Lustre: roc-MDT0000-mdc-000001044e1d4c00:
Connection to service roc-MDT0000 via nid 10.5.203.250@tcp was lost; in progress
operations using this service will wait for recovery to complete.
May 22 15:12:45 trinity kernel: Lustre: Skipped 3 previous similar messages
May 22 15:12:45 trinity kernel: Lustre: roc-MDT0000-mdc-000001044e1d4c00:
Connection restored to service roc-MDT0000 using nid 10.5.203.250@tcp.
May 22 15:12:45 trinity kernel: Lustre: Skipped 4 previous similar messages
[root@trinity ~]# cat /proc/fs/lustre/lov/roc-clilov-000001044e1d4c00/uuid
84adb9a1-8959-fcf5-cc72-81c6a1e171b8
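In the client log above, the match number in the "too big" message (19154486) is the same number as the request that times out 300s later (x19154486). For anyone who wants to check their own logs for the same pattern, here is a minimal Python sketch. It is not part of any Lustre tool: the function name correlate and the regular expressions are assumptions based only on the two message formats shown above, and the log lines are pasted with the mail line-wrapping undone.

```python
import re

# Two client log lines from above, rejoined into single lines.
LOG = [
    "May 22 15:07:45 trinity kernel: LustreError: "
    "5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from "
    "12345-10.5.203.250@tcp, match 19154486 length 728 too big: "
    "704 left, 704 allowed",
    "May 22 15:12:45 trinity kernel: Lustre: Request x19154486 sent from "
    "roc-MDT0000-mdc-000001044e1d4c00 to NID 10.5.203.250@tcp 300s ago "
    "has timed out (limit 300s).",
]

# Patterns written against the two message formats above (assumed, not
# taken from Lustre documentation).
TOO_BIG = re.compile(
    r"match (\d+) length (\d+) too big: (\d+) left, (\d+) allowed"
)
TIMED_OUT = re.compile(
    r"Request x(\d+) sent from .*? (\d+)s ago has timed out"
)


def correlate(lines):
    """Pair each 'too big' packet with a later timeout of the same number.

    Returns a list of (number, overrun_bytes, seconds_until_timeout).
    """
    overrun_by_match = {}
    results = []
    for line in lines:
        m = TOO_BIG.search(line)
        if m:
            match_bits, length, _left, allowed = map(int, m.groups())
            # How many bytes the packet exceeded the allowed size by.
            overrun_by_match[match_bits] = length - allowed
            continue
        m = TIMED_OUT.search(line)
        if m:
            xid, age = int(m.group(1)), int(m.group(2))
            if xid in overrun_by_match:
                results.append((xid, overrun_by_match[xid], age))
    return results


if __name__ == "__main__":
    for xid, overrun, age in correlate(LOG):
        print(f"x{xid}: packet {overrun} bytes over limit, "
              f"request timed out after {age}s")
```

On the two lines above this reports a 24-byte overrun (728 versus 704 allowed) for the request that later timed out.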
On the MDS containing roc-MDT0000:
May 22 15:12:45 rocpile kernel: Lustre:
19236:0:(ldlm_lib.c:538:target_handle_reconnect()) roc-MDT0000:
84adb9a1-8959-fcf5-cc72-81c6a1e171b8 reconnecting
May 22 15:12:45 rocpile kernel: Lustre:
19236:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 4 previous similar
messages
Any idea what could be causing this? Bug 11332 looks similar, but it was
closed after other related bugs were fixed.
Thanks,
Mike
--
Michael D. Seymour Phone: 416-978-8497
Scientific Computing Support Fax: 416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto