[Lustre-discuss] reading file hangs on Lustre 1.6.4 node

Brian J. Murrell Brian.Murrell at Sun.COM
Wed Dec 12 11:37:35 PST 2007


On Wed, 2007-12-12 at 18:52 +0300, Anatoly Oreshkin wrote:
> 
> In this case read test reads a number of files and then hangs on some file.
> The command dmesg issued on client node gives such error messages:
> 
> LustreError: 5017:0:(socklnd.c:1599:ksocknal_destroy_conn()) Completing partial
> receive from 12345-85.142.10.197@tcp, ip 85.142.10.197:988, with error

I'm not really sure of the origin or exact meaning of this message, but
the gist seems pretty clear: this looks like a networking issue.

...
> hw tcp v4 csum failed
> hw tcp v4 csum failed

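And this makes it look even more like a networking issue: as far as I
can tell, the kernel is saying that the TCP checksum supplied by the
NIC hardware did not verify for an incoming packet.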

> Dmesg issued on head node gives errors:
> 
> LustreError: 15048:0:(ost_handler.c:821:ost_brw_read()) @@@ timeout on bulk PUT
>   req@cc6af600 x4566962/t0
> o3->629198c9-085d-f95a-462f-b5e535904a3d@NET_0x20000c0a80102_UUID:-1 lens
> 384/336 ref 0 fl Interpret:/0/0 rc 0/0
> Lustre: 15048:0:(ost_handler.c:881:ost_brw_read()) vtrak1fs-OST0000: ignoring
> bulk IO comm error with
> 629198c9-085d-f95a-462f-b5e535904a3d@NET_0x20000c0a80102_UUID id
> 12345-192.168.1.2@tcp - client will retry
> Lustre: 14987:0:(ldlm_lib.c:519:target_handle_reconnect()) vtrak1fs-OST0000:
> 629198c9-085d-f95a-462f-b5e535904a3d reconnecting

These are all further indications of networking issues: the OST times
out on a bulk transfer to the client, and the client then reconnects.

I think you need to start debugging your network.
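
As a rough starting point, something like the following could tally the
network-related messages quoted above, so you can compare nodes or see
whether the counts grow while the read test runs. This is only a sketch:
the patterns are copied from the log lines in this thread, while the
category names and the summarize() helper are made up for illustration
and are not part of Lustre.

```python
import re
import sys
from collections import Counter

# Patterns are taken from the messages quoted in this thread.
# The category names on the left are hypothetical labels for this sketch.
PATTERNS = {
    "tcp_csum_failed": re.compile(r"hw tcp v4 csum failed"),
    "partial_receive": re.compile(r"Completing partial\s+receive"),
    "bulk_timeout": re.compile(r"timeout on bulk (?:PUT|GET)"),
    "bulk_comm_error": re.compile(r"ignoring\s+bulk IO comm error"),
    "reconnecting": re.compile(r"\breconnecting\b"),
}


def summarize(log_text: str) -> Counter:
    """Count occurrences of each known pattern in dmesg-style text.

    The whole text is scanned at once (not line by line) so that
    messages wrapped across lines, as in the mail above, still match.
    """
    counts = Counter()
    for name, pattern in PATTERNS.items():
        n = sum(1 for _ in pattern.finditer(log_text))
        if n:
            counts[name] = n
    return counts


if __name__ == "__main__":
    # Read dmesg output from stdin and print the most frequent categories first.
    for name, n in summarize(sys.stdin.read()).most_common():
        print(f"{name}: {n}")
```

Saved under some name of your choosing (say summarize_dmesg.py), it
could be run on each node as "dmesg | python summarize_dmesg.py". It
only counts messages; it does not tell you which link, NIC or switch is
at fault.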

b.




