[Lustre-discuss] reading file hangs on Lustre 1.6.4 node

Anatoly Oreshkin Anatoly.Oreshkin at pnpi.spb.ru
Wed Dec 12 07:52:59 PST 2007


Hello,

I am a novice to Lustre.
I've installed Lustre 1.6.4 on Scientific Linux 4.4 with
kernel 2.6.9-55.0.9.EL_lustre.1.6.4smp

The MGS, MDS and OST servers are all installed on the head node.
The MGS and MDS have their storage on different disks.

MGS server on /dev/sdb1:
/usr/sbin/mkfs.lustre --fsname=vtrak1fs --mgs /dev/sdb1

MDS server on /dev/sdc1:
/usr/sbin/mkfs.lustre --fsname=vtrak1fs --mdt --mgsnode=head_node@tcp0 /dev/sdc1

OST storage is based on RAID5 and connected via SCSI directly to the head node.
OST1 server on /dev/sdg1:

/usr/sbin/mkfs.lustre --fsname=vtrak1fs --ost --mgsnode=head_node@tcp0 /dev/sdg1
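
(For reference, I understand the target configuration written by mkfs.lustre can be read back without touching the disk, e.g. for the OST:

/usr/sbin/tunefs.lustre --dryrun /dev/sdg1

which should print the target name, fsname and mgsnode parameters. I am assuming --dryrun behaves this way in 1.6.4.)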


On the client node Lustre is mounted with:

mount -t lustre head_node@tcp0:/vtrak1fs /vtrak1
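
(As a quick sanity check of the mount, I assume something like

lfs df -h /vtrak1

run on the client should list the MDT and OST targets with their free space.)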

TCP networking is used for communication between the nodes.
The file /etc/modprobe.conf contains the line:

options lnet networks=tcp
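
(If it matters, my understanding is that LNET can also be bound to a specific interface, e.g.

options lnet networks=tcp0(eth0)

where eth0 is only an example interface name; I have not changed this.)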

The command /usr/sbin/lctl list_nids issued on the head node gives:

85.142.10.197@tcp

For testing purposes I read all files on the head node from OST1.
All files were read successfully.

Then I started the same read test of all files from OST1 on the client node
with address 192.168.1.2.

The command /usr/sbin/lctl list_nids issued on the client node gives:
192.168.1.2@tcp
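
(Would a plain LNET ping between the two NIDs be a useful first check? For example, from the client:

/usr/sbin/lctl ping 85.142.10.197@tcp

and from the head node:

/usr/sbin/lctl ping 192.168.1.2@tcp)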

In this case the read test reads a number of files and then hangs on some file.
dmesg on the client node gives error messages such as:

LustreError: 5017:0:(socklnd.c:1599:ksocknal_destroy_conn()) Completing partial receive from 12345-85.142.10.197@tcp, ip 85.142.10.197:988, with error
LustreError: 5017:0:(events.c:134:client_bulk_callback()) event type 1, status -5, desc ca9d3c00
LustreError: 5019:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at 1197447164, 150s ago)  req@ca9d3200 x4566962/t0 o3->vtrak1fs-OST0000_UUID@85.142.10.197@tcp:28 lens 384/336 ref 2 fl Rpc:/0/0 rc 0/-22
LustreError: 5019:0:(client.c:975:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Lustre: vtrak1fs-OST0000-osc-f7f53200: Connection to service vtrak1fs-OST0000 via nid 85.142.10.197@tcp was lost; in progress operations using this service will wait for recovery to complete.
Lustre: vtrak1fs-OST0000-osc-f7f53200: Connection restored to service vtrak1fs-OST0000 using nid 85.142.10.197@tcp.
Lustre: Skipped 1 previous similar message
hw tcp v4 csum failed
hw tcp v4 csum failed
...


dmesg on the head node gives errors:

LustreError: 15048:0:(ost_handler.c:821:ost_brw_read()) @@@ timeout on bulk PUT req@cc6af600 x4566962/t0 o3->629198c9-085d-f95a-462f-b5e535904a3d@NET_0x20000c0a80102_UUID:-1 lens 384/336 ref 0 fl Interpret:/0/0 rc 0/0
Lustre: 15048:0:(ost_handler.c:881:ost_brw_read()) vtrak1fs-OST0000: ignoring bulk IO comm error with 629198c9-085d-f95a-462f-b5e535904a3d@NET_0x20000c0a80102_UUID id 12345-192.168.1.2@tcp - client will retry
Lustre: 14987:0:(ldlm_lib.c:519:target_handle_reconnect()) vtrak1fs-OST0000: 629198c9-085d-f95a-462f-b5e535904a3d reconnecting

On the Lustre client, data checksums are disabled by default:

cat /proc/fs/lustre/llite/vtrak1fs-f7f53200/checksum_pages -> 0
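
(Given the "hw tcp v4 csum failed" messages above, would it make sense to re-enable Lustre data checksums on the client while testing, e.g.

echo 1 > /proc/fs/lustre/llite/vtrak1fs-f7f53200/checksum_pages

and/or to turn off receive checksum offloading on the NIC, e.g.

/sbin/ethtool -K eth0 rx off

where eth0 is only a guess at the interface name?)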

What might be the reason(s)?

Any hints? How can I trace the problem?
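
(For tracing, is the LNET/Lustre kernel debug log the right tool? My understanding is that something like

echo -1 > /proc/sys/lnet/debug      # enable all debug flags (my assumption)
# ... reproduce the hang ...
/usr/sbin/lctl dk /tmp/lustre-debug.log

would dump the debug buffer to a file, but I am not sure which debug flags are actually useful here.)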

Thank you.





