[Lustre-discuss] reading file hangs on Lustre 1.6.4 node
Brian J. Murrell
Brian.Murrell at Sun.COM
Wed Dec 12 11:37:35 PST 2007
On Wed, 2007-12-12 at 18:52 +0300, Anatoly Oreshkin wrote:
>
> In this case read test reads a number of files and then hangs on some file.
> The command dmesg issued on client node gives such error messages:
>
> LustreError: 5017:0:(socklnd.c:1599:ksocknal_destroy_conn()) Completing partial
> receive from 12345-85.142.10.197@tcp, ip 85.142.10.197:988, with error
I'm not sure of the exact origin or meaning of this message, but the
overall picture seems pretty clear: this looks like a networking issue.
...
> hw tcp v4 csum failed
> hw tcp v4 csum failed
And this makes it look even more like a networking issue.
> Dmesg issued on head node gives errors:
>
> LustreError: 15048:0:(ost_handler.c:821:ost_brw_read()) @@@ timeout on bulk PUT
> req@cc6af600 x4566962/t0
> o3->629198c9-085d-f95a-462f-b5e535904a3d@NET_0x20000c0a80102_UUID:-1 lens
> 384/336 ref 0 fl Interpret:/0/0 rc 0/0
> Lustre: 15048:0:(ost_handler.c:881:ost_brw_read()) vtrak1fs-OST0000: ignoring
> bulk IO comm error with
> 629198c9-085d-f95a-462f-b5e535904a3d@NET_0x20000c0a80102_UUID id
> 12345-192.168.1.2@tcp - client will retry
> Lustre: 14987:0:(ldlm_lib.c:519:target_handle_reconnect()) vtrak1fs-OST0000:
> 629198c9-085d-f95a-462f-b5e535904a3d reconnecting
These are all further indications of networking issues.
I think you need to start debugging your network.
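As a possible starting point, here is a rough sketch (not a Lustre tool,
just an illustration) that tallies the two kinds of messages quoted above
from saved dmesg output: the "hw tcp v4 csum failed" lines, and Lustre
messages that name a peer NID. The function name summarize() and the
regular expression are my own assumptions; the pattern assumes the NID
appears as <pid>-<ip>@tcp in the raw dmesg output, with each message on a
single line.

```python
import re
from collections import Counter

# Kernel message quoted in this thread for TCP hardware checksum failures.
CSUM_MSG = "hw tcp v4 csum failed"

# Assumed shape of a TCP NID in raw Lustre log lines, e.g. 12345-192.168.1.2@tcp.
# The capture group is the peer's IPv4 address.
NID_RE = re.compile(r"\b\d+-(\d{1,3}(?:\.\d{1,3}){3})@tcp")


def summarize(lines):
    """Return (checksum_failure_count, Counter of Lustre messages per peer IP)."""
    csum_failures = 0
    messages_by_peer = Counter()
    for line in lines:
        if CSUM_MSG in line:
            csum_failures += 1
        elif "Lustre" in line:
            # Count the line once for every peer NID it mentions.
            for ip in NID_RE.findall(line):
                messages_by_peer[ip] += 1
    return csum_failures, messages_by_peer


# Sample input built from the messages quoted in this thread,
# joined back into one line per message.
sample = [
    "LustreError: 5017:0:(socklnd.c:1599:ksocknal_destroy_conn()) "
    "Completing partial receive from 12345-85.142.10.197@tcp, "
    "ip 85.142.10.197:988, with error",
    "hw tcp v4 csum failed",
    "hw tcp v4 csum failed",
    "Lustre: 15048:0:(ost_handler.c:881:ost_brw_read()) vtrak1fs-OST0000: "
    "ignoring bulk IO comm error with "
    "629198c9-085d-f95a-462f-b5e535904a3d@NET_0x20000c0a80102_UUID id "
    "12345-192.168.1.2@tcp - client will retry",
]

failures, peers = summarize(sample)
print("checksum failures:", failures)
for ip, count in peers.most_common():
    print(ip, count)
```

To run it against a real node, you could pass an open file (or sys.stdin
with dmesg piped in) to summarize() instead of the sample list. If the
checksum failures and the Lustre errors cluster around one peer or one
interface, that is where I would start looking.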
b.