[Lustre-discuss] reading file hangs on Lustre 1.6.4 node
    Anatoly Oreshkin 
    Anatoly.Oreshkin at pnpi.spb.ru
       
    Wed Dec 12 07:52:59 PST 2007
    
    
  
Hello,
I am a novice to Lustre.
I've installed Lustre 1.6.4 on Scientific Linux 4.4 with
kernel 2.6.9-55.0.9.EL_lustre.1.6.4smp
The MGS, MDS and OST servers are all installed on the head node.
The MGS and MDS have their storage on different disks.
The MGS is on /dev/sdb1:
/usr/sbin/mkfs.lustre --fsname=vtrak1fs  --mgs /dev/sdb1
The MDT is on /dev/sdc1:
/usr/sbin/mkfs.lustre --fsname=vtrak1fs --mdt --mgsnode=head_node@tcp0 /dev/sdc1
The OST storage is a RAID5 array connected directly to the head node via SCSI.
OST1 is on /dev/sdg1:
/usr/sbin/mkfs.lustre --fsname=vtrak1fs --ost --mgsnode=head_node@tcp0 /dev/sdg1
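(For completeness: each target is then started by mounting it with the Lustre
file system type; the mount points below are only examples, not necessarily
the ones I used.)
mount -t lustre /dev/sdb1 /mnt/mgs
mount -t lustre /dev/sdc1 /mnt/mdt
mount -t lustre /dev/sdg1 /mnt/ost1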
On the client node Lustre is mounted with:
mount -t lustre head_node@tcp0:/vtrak1fs /vtrak1
TCP networking is used for communication between the nodes.
The file /etc/modprobe.conf contains the line:
options lnet networks=tcp
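(As far as I understand, the network could also be pinned to a specific
interface, for example
options lnet networks=tcp0(eth0)
where eth0 is just a placeholder, but I have kept the default shown above.)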
The command /usr/sbin/lctl list_nids issued on the head node gives:
85.142.10.197@tcp
For testing purposes I read all the files on OST1 from the head node.
All files were read successfully.
Then I started the same read test of all files on OST1 from the client node
with address 192.168.1.2.
The command /usr/sbin/lctl list_nids issued on the client node gives:
192.168.1.2@tcp
In this case the read test reads a number of files and then hangs on some file.
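(For reference, the read test is essentially a loop of the following form;
the path and commands are only illustrative:)
for f in /vtrak1/*; do dd if="$f" of=/dev/null bs=1M; done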
The command dmesg issued on the client node gives error messages like these:
LustreError: 5017:0:(socklnd.c:1599:ksocknal_destroy_conn()) Completing partial
receive from 12345-85.142.10.197@tcp, ip 85.142.10.197:988, with error
LustreError: 5017:0:(events.c:134:client_bulk_callback()) event type 1, status 
-5, desc ca9d3c00
LustreError: 5019:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout 
(sent at 1197447164, 150s ago)  req@ca9d3200 x4566962/t0 
o3->vtrak1fs-OST0000_UUID@85.142.10.197@tcp:28 lens 384/336 ref 2 fl Rpc:/0/0 
rc 0/-22
LustreError: 5019:0:(client.c:975:ptlrpc_expire_one_request()) Skipped 8 
previous similar messages
Lustre: vtrak1fs-OST0000-osc-f7f53200: Connection to service vtrak1fs-OST0000 
via nid 85.142.10.197@tcp was lost; in progress operations using this service 
will wait for recovery to complete.
Lustre: vtrak1fs-OST0000-osc-f7f53200: Connection restored to service 
vtrak1fs-OST0000 using nid 85.142.10.197@tcp.
Lustre: Skipped 1 previous similar message
hw tcp v4 csum failed
hw tcp v4 csum failed
...
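(The last "hw tcp v4 csum failed" lines make me suspect TCP checksum offload
on the client NIC; I assume something like
ethtool -K eth0 rx off tx off
would turn the offload off for a test, with eth0 standing in for the actual
interface, but I have not tried this yet.)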
dmesg issued on the head node gives these errors:
LustreError: 15048:0:(ost_handler.c:821:ost_brw_read()) @@@ timeout on bulk PUT
  req@cc6af600 x4566962/t0 
o3->629198c9-085d-f95a-462f-b5e535904a3d@NET_0x20000c0a80102_UUID:-1 lens 
384/336 ref 0 fl Interpret:/0/0 rc 0/0
Lustre: 15048:0:(ost_handler.c:881:ost_brw_read()) vtrak1fs-OST0000: ignoring 
bulk IO comm error with 
629198c9-085d-f95a-462f-b5e535904a3d@NET_0x20000c0a80102_UUID id 
12345-192.168.1.2@tcp - client will retry
Lustre: 14987:0:(ldlm_lib.c:519:target_handle_reconnect()) vtrak1fs-OST0000: 
629198c9-085d-f95a-462f-b5e535904a3d reconnecting
On the Lustre client, data checksums are disabled by default:
cat /proc/fs/lustre/llite/vtrak1fs-f7f53200/checksum_pages -> 0
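(If it would help the diagnosis, I assume checksums could be enabled for a
test with something like
echo 1 > /proc/fs/lustre/llite/vtrak1fs-f7f53200/checksum_pages
but please correct me if that is not the right switch on 1.6.4.)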
What might be the reason(s)?
Any hints? How can I trace the problem?
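(For tracing, I assume the LNET debug log is the place to look, e.g. on the
client, after turning debugging up and reproducing the hang:
echo -1 > /proc/sys/lnet/debug
lctl dk /tmp/lustre-debug.log
but I am not sure which debug flags are most useful here.)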
Thank you.
    
    