[Lustre-discuss] lnet_try_match_md()) Matching packet from 12345-10.5.203.250@tcp, match 19154486 length 728 too big

Michael D. Seymour seymour at cita.utoronto.ca
Fri May 22 13:38:09 PDT 2009


Hi all,

I hope you can help us with some connection problems we are having with our 
Lustre file system. The file system, roc, consists of 6 OSSs with one OST per OSS. 
Each OSS runs the Lustre 1.6.7 RHEL 5 kernel on CentOS 5.1 (one unit runs CentOS 5.3). 
The MDS runs CentOS 5.1 and Lustre 1.6.7. 203 RHEL-based clients mount the 
file system, and all use Lustre 1.6.7. All are connected via a Gigabit Ethernet 
switch stack.

One client running CentOS 5.2 re-exports the Lustre filesystem via NFS on a 
different network.

We get the following messages on a particular client:

May 22 15:07:45 trinity kernel: LustreError: 
5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from 
12345-10.5.203.250@tcp, match 19154486 length 728 too big: 704 left, 704 allowed
May 22 15:07:45 trinity kernel: LustreError: 
5111:0:(lib-move.c:110:lnet_try_match_md()) Skipped 3 previous similar messages
May 22 15:12:45 trinity kernel: Lustre: Request x19154486 sent from 
roc-MDT0000-mdc-000001044e1d4c00 to NID 10.5.203.250@tcp 300s ago has timed out 
(limit 300s).
May 22 15:12:45 trinity kernel: Lustre: Skipped 3 previous similar messages
May 22 15:12:45 trinity kernel: Lustre: roc-MDT0000-mdc-000001044e1d4c00: 
Connection to service roc-MDT0000 via nid 10.5.203.250@tcp was lost; in progress 
operations using this service will wait for recovery to complete.
May 22 15:12:45 trinity kernel: Lustre: Skipped 3 previous similar messages
May 22 15:12:45 trinity kernel: Lustre: roc-MDT0000-mdc-000001044e1d4c00: 
Connection restored to service roc-MDT0000 using nid 10.5.203.250@tcp.
May 22 15:12:45 trinity kernel: Lustre: Skipped 4 previous similar messages
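
For anyone comparing these messages on their own clients, here is a minimal Python sketch that pulls the numbers out of the two key lines and checks whether the "too big" match value is the same as the request ID that later timed out. It assumes the wrapped syslog lines have been re-joined into single lines. The regular expressions and the helper names parse_too_big and timed_out_xids are illustrative only and are not part of Lustre. Reading the equal numbers as "the oversized packet belongs to the request that timed out" is an inference from this log, not a confirmed diagnosis.

```python
import re

# Patterns for the two client-side messages, applied after the wrapped
# syslog lines have been re-joined into single lines (an assumption).
TOO_BIG = re.compile(
    r"Matching packet from (?P<nid>.+?), match (?P<match>\d+) "
    r"length (?P<length>\d+) too big: "
    r"(?P<left>\d+) left, (?P<allowed>\d+) allowed"
)
TIMED_OUT = re.compile(
    r"Request x(?P<xid>\d+) sent from \S+ to NID (?P<nid>.+?) "
    r"(?P<age>\d+)s ago has timed out"
)


def parse_too_big(line):
    """Return the fields of a 'too big' message as a dict, or None."""
    m = TOO_BIG.search(line)
    if m is None:
        return None
    length = int(m.group("length"))
    allowed = int(m.group("allowed"))
    return {
        "nid": m.group("nid"),
        "match": int(m.group("match")),
        "length": length,
        "allowed": allowed,
        # Bytes by which the incoming packet exceeds the allowed size.
        "overflow": length - allowed,
    }


def timed_out_xids(lines):
    """Collect the request IDs (the number after 'x') that timed out."""
    xids = set()
    for line in lines:
        m = TIMED_OUT.search(line)
        if m is not None:
            xids.add(int(m.group("xid")))
    return xids


# The two messages from the client log above, re-joined.
log = [
    "LustreError: 5111:0:(lib-move.c:110:lnet_try_match_md()) Matching "
    "packet from 12345-10.5.203.250@tcp, match 19154486 length 728 "
    "too big: 704 left, 704 allowed",
    "Lustre: Request x19154486 sent from roc-MDT0000-mdc-000001044e1d4c00 "
    "to NID 10.5.203.250@tcp 300s ago has timed out (limit 300s).",
]

dropped = [p for p in map(parse_too_big, log) if p is not None]
stalled = timed_out_xids(log)
for p in dropped:
    print(p["match"], p["overflow"], p["match"] in stalled)
```

On the excerpt above, this reports that the packet with match 19154486 is 24 bytes larger than the 704 bytes allowed, and that request x19154486 is the one that timed out 300s later.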

[root@trinity ~]# cat /proc/fs/lustre/lov/roc-clilov-000001044e1d4c00/uuid
84adb9a1-8959-fcf5-cc72-81c6a1e171b8

On the MDS containing roc-MDT0000:

May 22 15:12:45 rocpile kernel: Lustre: 
19236:0:(ldlm_lib.c:538:target_handle_reconnect()) roc-MDT0000: 
84adb9a1-8959-fcf5-cc72-81c6a1e171b8 reconnecting
May 22 15:12:45 rocpile kernel: Lustre: 
19236:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 4 previous similar 
messages

Any idea what could be causing this? Bug 11332 looks similar, but it was 
closed after other related bugs were fixed.

Thanks,
Mike

-- 
Michael D. Seymour                 Phone: 416-978-8497
Scientific Computing Support       Fax: 416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto


