[Lustre-discuss] reading file hangs on Lustre 1.6.4 node
Anatoly Oreshkin
Anatoly.Oreshkin at pnpi.spb.ru
Wed Dec 12 07:52:59 PST 2007
Hello,
I am new to Lustre.
I've installed Lustre 1.6.4 on Scientific Linux 4.4 with
kernel 2.6.9-55.0.9.EL_lustre.1.6.4smp
The MGS, MDS and OST servers are all installed on the head node.
The MGS and MDS servers have their storage on different disks.
MGS server on /dev/sdb1 disk
/usr/sbin/mkfs.lustre --fsname=vtrak1fs --mgs /dev/sdb1
MDS server on /dev/sdc1
/usr/sbin/mkfs.lustre --fsname=vtrak1fs --mdt --mgsnode=head_node@tcp0 /dev/sdc1
The OST storage is RAID5-based and connected directly to the head node via SCSI.
OST1 server on /dev/sdg1
/usr/sbin/mkfs.lustre --fsname=vtrak1fs --ost --mgsnode=head_node@tcp0 /dev/sdg1
On the client node Lustre is started with mount:
mount -t lustre head_node@tcp0:/vtrak1fs /vtrak1
TCP networking is used for communication between the nodes.
The file /etc/modprobe.conf contains the line:
options lnet networks=tcp
The command /usr/sbin/lctl list_nids issued on the head node gives:
85.142.10.197@tcp
As a test, I read all files from OST1 on the head node.
All files were read successfully.
Then I ran the same read test of all files from OST1 on the client node
with address 192.168.1.2.
The command /usr/sbin/lctl list_nids issued on the client node gives:
192.168.1.2@tcp
In this case the read test reads a number of files and then hangs on some file.
The command dmesg issued on the client node gives these error messages:
LustreError: 5017:0:(socklnd.c:1599:ksocknal_destroy_conn()) Completing partial receive from 12345-85.142.10.197@tcp, ip 85.142.10.197:988, with error
LustreError: 5017:0:(events.c:134:client_bulk_callback()) event type 1, status -5, desc ca9d3c00
LustreError: 5019:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at 1197447164, 150s ago) req@ca9d3200 x4566962/t0 o3->vtrak1fs-OST0000_UUID@85.142.10.197@tcp:28 lens 384/336 ref 2 fl Rpc:/0/0 rc 0/-22
LustreError: 5019:0:(client.c:975:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Lustre: vtrak1fs-OST0000-osc-f7f53200: Connection to service vtrak1fs-OST0000 via nid 85.142.10.197@tcp was lost; in progress operations using this service will wait for recovery to complete.
Lustre: vtrak1fs-OST0000-osc-f7f53200: Connection restored to service vtrak1fs-OST0000 using nid 85.142.10.197@tcp.
Lustre: Skipped 1 previous similar message
hw tcp v4 csum failed
hw tcp v4 csum failed
...
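To judge how often the "hw tcp v4 csum failed" message appears relative to the Lustre errors, the kernel log can be filtered and counted. The following is only a minimal sketch: the function name count_csum_failures is made up for illustration, and the interface name eth0 in the comments is an assumption, not something taken from this setup.

```shell
# Count kernel "hw tcp v4 csum failed" messages in dmesg-style text
# read from standard input. (Hypothetical helper name.)
count_csum_failures() {
    grep -c 'hw tcp v4 csum failed'
}

# Possible use on the client node (needs the real kernel log, so it is
# left commented out here):
#   dmesg | count_csum_failures
#
# The NIC offload settings could then be inspected with ethtool;
# "eth0" is an assumed interface name:
#   ethtool -k eth0
```

A count that grows while the read test is running would suggest that the checksum messages and the bulk read timeouts are related, although it would not prove it.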
dmesg issued on the head node gives these errors:
LustreError: 15048:0:(ost_handler.c:821:ost_brw_read()) @@@ timeout on bulk PUT req@cc6af600 x4566962/t0 o3->629198c9-085d-f95a-462f-b5e535904a3d@NET_0x20000c0a80102_UUID:-1 lens 384/336 ref 0 fl Interpret:/0/0 rc 0/0
Lustre: 15048:0:(ost_handler.c:881:ost_brw_read()) vtrak1fs-OST0000: ignoring bulk IO comm error with 629198c9-085d-f95a-462f-b5e535904a3d@NET_0x20000c0a80102_UUID id 12345-192.168.1.2@tcp - client will retry
Lustre: 14987:0:(ldlm_lib.c:519:target_handle_reconnect()) vtrak1fs-OST0000: 629198c9-085d-f95a-462f-b5e535904a3d reconnecting
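The peer name NET_0x20000c0a80102_UUID in the head node log appears to carry the client address in hexadecimal. The sketch below decodes it under the assumption that the last 8 hex digits are the IPv4 address and the leading digits (0x20000 here) identify the LNET network; the function name nid_uuid_to_ip is made up for illustration.

```shell
# Decode the address part of a peer UUID such as
# NET_0x20000c0a80102_UUID into a dotted-quad IPv4 address.
# Assumption: the last 8 hex digits encode the IPv4 address.
nid_uuid_to_ip() (
    hex=${1#NET_0x}     # strip the "NET_0x" prefix
    hex=${hex%_UUID}    # strip the "_UUID" suffix
    # Keep only the last 8 hex digits (4 bytes of address).
    addr=$(printf '%s' "$hex" | tail -c 8)
    a=$(printf '%s' "$addr" | cut -c1-2)
    b=$(printf '%s' "$addr" | cut -c3-4)
    c=$(printf '%s' "$addr" | cut -c5-6)
    d=$(printf '%s' "$addr" | cut -c7-8)
    # printf accepts 0x-prefixed constants for %d.
    printf '%d.%d.%d.%d\n' "0x$a" "0x$b" "0x$c" "0x$d"
)

nid_uuid_to_ip NET_0x20000c0a80102_UUID   # prints 192.168.1.2
```

The result matches the client NID 192.168.1.2@tcp reported by lctl list_nids, so the server does see the client under that address.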
On the Lustre client, data checksums are disabled by default:
cat /proc/fs/lustre/llite/vtrak1fs-f7f53200/checksum_pages -> 0
What might be the reason(s)?
Any hints? How can I trace the problem?
Thank you.