[Lustre-discuss] lustre client 1.6.5.1 hangs

Heiko Schroeter schroete at iup.physik.uni-bremen.de
Thu Jul 10 01:25:25 PDT 2008


Hello,

we have a _test_ setup for a Lustre 1.6.5.1 installation with 2 RAID systems 
(64-bit systems) providing 4 OSTs of 6 TB each, plus one combined MDS and MDT 
server (32-bit system, for testing only).

OST lustre mkfs:
mkfs.lustre --param="failover.mode=failout" --fsname scia --ost \
    --mkfsoptions='-i 2097152 -E stride=16 -b 4096' \
    --mgsnode=mds1lustre@tcp0 /dev/sdb
(Our files on the system are quite large, 100 MB+.)

Kernel: Vanilla Kernel 2.6.22.19, lustre compiled from the sources on Gentoo 
2008.0

The client mount point is /misc/testfs via automount.
The access can be done through a link from /mnt/testfs -> /misc/testfs

The following procedure hangs a client:
1) copy files to the lustre system
2) do a 'du -sh /mnt/testfs/willi' while copying
3) unmount an OST (here OST0003) while copying
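
The client-side part of the steps above can be sketched roughly as follows. 
This is only an illustration: the function name repro_client and the default 
SRC_DIR path are made up here, and DST_DIR simply reuses the directory from 
step 2. Step 3 (unmounting the OST) is done by hand on the OSS while the copy 
is still running.

```shell
# Hypothetical defaults; adjust to the local setup.
SRC_DIR=${SRC_DIR:-/data/src}            # assumed source of large files
DST_DIR=${DST_DIR:-/mnt/testfs/willi}    # directory on the Lustre mount

# repro_client: hypothetical helper covering steps 1 and 2 on the client.
repro_client() {
    # Step 1: copy files to the Lustre file system in the background.
    cp -r "$SRC_DIR"/. "$DST_DIR"/ &
    cp_pid=$!

    # Step 2: run 'du' while the copy is in progress.
    du -sh "$DST_DIR"

    # Wait for the copy to finish (never reached if the client hangs).
    wait "$cp_pid"
}

# Step 3 is not scripted: unmount the OST on the OSS during the copy.
```

With large files the copy runs long enough to unmount the OST by hand before 
'du' has finished.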

The 'du' job hangs and the Lustre file system can no longer be accessed on 
this client, even from other logins. The only way to restore normal operation 
is, IMHO, a hard reset of the machine. A reboot hangs because the file system 
is still active.
Other clients and their mount points are not affected as long as they do not 
access the file system with 'du', 'ls' or similar.
I know that this is drastic, but our users may trigger it in production.

Deactivating/reactivating or remounting the OST does not have any effect on 
the 'du' job. The 'du' job (#29665, see process list below) and the 
corresponding Lustre thread (#29694) cannot be killed manually.

This behaviour is reproducible. OST0003 is not reactivated on the client 
side, although the MDS does reactivate it. It seems that this information 
does not propagate to the client. See the last lines of dmesg below.
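
To compare what the client and the MDS each logged for the OST, the relevant 
dmesg lines can be filtered with a small shell helper. This is a minimal 
sketch: the name ost_events is made up here (it is not a Lustre tool), and the 
patterns are taken only from the messages in the dmesg output quoted below.

```shell
# ost_events: hypothetical helper, not part of Lustre.
# Reads kernel log text on stdin and prints only the lines that mention
# the given OST together with an eviction or activation state change.
ost_events() {
    ost="$1"
    grep -F -- "$ost" |
        grep -E 'evicted|INACTIVE|active=[01]|now active|Connection restored|deactivating|reactivating'
}

# Usage, on a client or on the MDS:
#   dmesg | ost_events scia-OST0003
```

On the MDS this should list both the deactivation and the reactivation, while 
on the affected client only the eviction shows up.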

What is the proper way (besides avoiding the use of 'du') to reactivate the 
client file system ?

Thanks and Regards
Heiko




The process list on the CLIENT:
<snip>
root     29175  5026  0 08:36 ?        00:00:00 sshd: laura [priv]
laura   29177 29175  0 08:36 ?        00:00:01 sshd: laura@pts/0
laura   29178 29177  0 08:36 pts/0    00:00:00 -bash
laura   29665 29178  0 09:15 pts/0    00:00:03 du -sh /mnt/testfs/foo/fam/
schell   29694     2  0 09:15 ?        00:00:00 [ll_sa_29665]
root     29695  4846  0 09:15 ?        00:00:00 /usr/sbin/automount --timeout 60 --pid-file /var/run/autofs.misc.pid /misc yp auto.misc
<snap>

and CLIENT dmesg:
Lustre: 5361:0:(import.c:395:import_select_connection()) scia-OST0003-osc-ffff8100ea24a000: tried all connections, increasing latency to 6s
Lustre: 5361:0:(import.c:395:import_select_connection()) Skipped 10 previous similar messages
LustreError: 11-0: an error occurred while communicating with 192.168.16.97@tcp. The ost_connect operation failed with -19
LustreError: Skipped 20 previous similar messages
Lustre: 5361:0:(import.c:395:import_select_connection()) scia-OST0003-osc-ffff8100ea24a000: tried all connections, increasing latency to 51s
Lustre: 5361:0:(import.c:395:import_select_connection()) Skipped 20 previous similar messages
LustreError: 11-0: an error occurred while communicating with 192.168.16.97@tcp. The ost_connect operation failed with -19
LustreError: Skipped 24 previous similar messages
Lustre: 5361:0:(import.c:395:import_select_connection()) scia-OST0003-osc-ffff8100ea24a000: tried all connections, increasing latency to 51s
Lustre: 5361:0:(import.c:395:import_select_connection()) Skipped 24 previous similar messages
LustreError: 167-0: This client was evicted by scia-OST0003; in progress operations using this service will fail.

The MDS dmesg:
<snip>
Lustre: 6108:0:(import.c:395:import_select_connection()) scia-OST0003-osc: tried all connections, increasing latency to 51s
Lustre: 6108:0:(import.c:395:import_select_connection()) Skipped 10 previous similar messages
LustreError: 11-0: an error occurred while communicating with 192.168.16.97@tcp. The ost_connect operation failed with -19
LustreError: Skipped 10 previous similar messages
Lustre: 6108:0:(import.c:395:import_select_connection()) scia-OST0003-osc: tried all connections, increasing latency to 51s
Lustre: 6108:0:(import.c:395:import_select_connection()) Skipped 20 previous similar messages
Lustre: Permanently deactivating scia-OST0003
Lustre: Setting parameter scia-OST0003-osc.osc.active in log scia-client
Lustre: Skipped 3 previous similar messages
Lustre: setting import scia-OST0003_UUID INACTIVE by administrator request
Lustre: scia-OST0003-osc.osc: set parameter active=0
Lustre: Skipped 2 previous similar messages
Lustre: scia-MDT0000: haven't heard from client 9111f740-b7a7-e2ff-b672-288a66decfab (at 192.168.16.106@tcp) in 1269 seconds. I think it's dead, and I am evicting it.
Lustre: Permanently reactivating scia-OST0003
Lustre: Modifying parameter scia-OST0003-osc.osc.active in log scia-client
Lustre: Skipped 1 previous similar message
Lustre: 15406:0:(import.c:395:import_select_connection()) scia-OST0003-osc: tried all connections, increasing latency to 51s
Lustre: 15406:0:(import.c:395:import_select_connection()) Skipped 2 previous similar messages
LustreError: 167-0: This client was evicted by scia-OST0003; in progress operations using this service will fail.
Lustre: scia-OST0003-osc: Connection restored to service scia-OST0003 using nid 192.168.16.97@tcp.
Lustre: scia-OST0003-osc.osc: set parameter active=1
Lustre: MDS scia-MDT0000: scia-OST0003_UUID now active, resetting orphans
<snap>



