[Lustre-discuss] lustre client 1.6.5.1 hangs
Heiko Schroeter
schroete at iup.physik.uni-bremen.de
Thu Jul 10 01:25:25 PDT 2008
Hello,
we have a _test_ setup of a Lustre 1.6.5.1 installation with 2 RAID systems
(64-bit) providing 4 OSTs of 6TB each, and one combined MDS and MDT server
(32-bit system, for testing only).
OST lustre mkfs:
mkfs.lustre --param="failover.mode=failout" --fsname scia --ost \
    --mkfsoptions='-i 2097152 -E stride=16 -b 4096' \
    --mgsnode=mds1lustre@tcp0 /dev/sdb
(Our files on the system are quite large, 100MB+.)
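
For reference, a rough sketch of what the mkfs options above work out to.
It assumes that "6TB" means 6 TiB per OST (the exact device size is not
stated), that -i is the bytes-per-inode ratio and that stride is counted in
filesystem blocks, as in mke2fs. The real inode count will differ slightly
because of rounding per block group.

```shell
#!/bin/sh
# Rough arithmetic for the mkfs options quoted in this post.
# Assumption: each OST is 6 TiB; the actual device size may differ.
ost_bytes=$((6 * 1024 * 1024 * 1024 * 1024))
bytes_per_inode=2097152   # from -i 2097152 (one inode per 2 MiB)
block_size=4096           # from -b 4096
stride_blocks=16          # from -E stride=16

inodes=$((ost_bytes / bytes_per_inode))
stride_kib=$((stride_blocks * block_size / 1024))

echo "approx. inodes per OST: $inodes"
echo "stride in KiB: $stride_kib"
```

Under these assumptions this gives about 3.1 million inodes per OST and a
64 KiB stride, which fits a file system holding few, large files.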
Kernel: Vanilla Kernel 2.6.22.19, lustre compiled from the sources on Gentoo
2008.0
The client mount point is /misc/testfs via automount.
It is accessed through a link /mnt/testfs -> /misc/testfs
The following procedure hangs a client:
1) copy files to the lustre system
2) do a 'du -sh /mnt/testfs/willi' while copying
3) unmount an OST (here OST0003) while copying
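
The three steps above can be sketched as a small script. This is only an
illustration: SRC_DIR, OSS_HOST and OST_MOUNT are hypothetical placeholders
(the post does not name the source directory, the OSS host or the OST mount
point), and using ssh to reach the OSS is an assumption. By default it only
prints what it would do.

```shell
#!/bin/sh
# Dry-run sketch of the reproduction steps. Nothing is executed unless
# DRY_RUN=0 is set. WARNING: a live run is expected to hang the client.
# SRC_DIR, OSS_HOST and OST_MOUNT are hypothetical placeholders.
DRY_RUN=${DRY_RUN:-1}
SRC_DIR=${SRC_DIR:-/data/source}
OSS_HOST=${OSS_HOST:-oss2}
OST_MOUNT=${OST_MOUNT:-/mnt/ost0003}
TARGET=/mnt/testfs/willi

# Run a command in the foreground, or just print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Run a command in the background, or just print it in dry-run mode.
run_bg() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run in background: $*"
    else
        "$@" &
    fi
}

reproduce_hang() {
    # 1) copy files to the Lustre file system
    run_bg cp -r "$SRC_DIR" "$TARGET"
    # 2) walk the target directory with du while the copy is running
    run_bg du -sh "$TARGET"
    # 3) unmount the OST on its server while the copy is running
    run ssh "$OSS_HOST" umount "$OST_MOUNT"
    # In a live run this wait may never return if du hangs as described.
    [ "$DRY_RUN" = "1" ] || wait
}

reproduce_hang
```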
The 'du' job hangs and the lustre file system cannot be accessed any longer
on this client, even from other logins. The only way to restore normal
operation is IMHO a hard reset of the machine. A reboot hangs because the
filesystem is still active.
Other clients and their mount points are not affected as long as they do not
access the file system with 'du', 'ls' or the like.
I know that this is drastic, but our users may trigger it in production.
Deactivating/reactivating or remounting the OST does not have any effect on
the 'du' job. The 'du' job (#29665, see process list below) and the
corresponding lustre thread (#29694) cannot be killed manually.
This behaviour is reproducible. OST0003 is not reactivated on the client
side, although it is on the MDS. It seems that this information does not
propagate to the client. See the last lines of dmesg below.
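
One way to check what the client itself believes about each OST is to read
the per-OSC "active" flag. The sketch below assumes that this flag is exposed
as /proc/fs/lustre/osc/<name>/active (1 = active, 0 = inactive) on this
Lustre version; that path is an assumption, not something shown in the logs
here. OSC_ROOT can be pointed elsewhere for testing.

```shell
#!/bin/sh
# Print the "active" flag of every OSC known to this client.
# Assumption: OSC state lives under /proc/fs/lustre/osc/<name>/active.
OSC_ROOT=${OSC_ROOT:-/proc/fs/lustre/osc}

show_osc_state() {
    for f in "$OSC_ROOT"/*/active; do
        # Skip the unexpanded glob when no OSC entries exist.
        [ -r "$f" ] || continue
        name=$(basename "$(dirname "$f")")
        state=$(cat "$f")
        echo "$name active=$state"
    done
}

show_osc_state
```

If the OSC for OST0003 still shows active=0 on the client while the MDS log
reports active=1, that would be consistent with the state not propagating.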
What is the proper way (besides avoiding the use of 'du') to reactivate the
client file system?
Thanks and Regards
Heiko
The process list on the CLIENT:
<snip>
root     29175  5026  0 08:36 ?        00:00:00 sshd: laura [priv]
laura    29177 29175  0 08:36 ?        00:00:01 sshd: laura@pts/0
laura    29178 29177  0 08:36 pts/0    00:00:00 -bash
laura    29665 29178  0 09:15 pts/0    00:00:03 du -sh /mnt/testfs/foo/fam/
schell   29694     2  0 09:15 ?        00:00:00 [ll_sa_29665]
root     29695  4846  0 09:15 ?        00:00:00 /usr/sbin/automount --timeout 60 --pid-file /var/run/autofs.misc.pid /misc yp auto.misc
<snap>
and CLIENT dmesg:
Lustre: 5361:0:(import.c:395:import_select_connection()) scia-OST0003-osc-ffff8100ea24a000: tried all connections, increasing latency to 6s
Lustre: 5361:0:(import.c:395:import_select_connection()) Skipped 10 previous similar messages
LustreError: 11-0: an error occurred while communicating with 192.168.16.97@tcp. The ost_connect operation failed with -19
LustreError: Skipped 20 previous similar messages
Lustre: 5361:0:(import.c:395:import_select_connection()) scia-OST0003-osc-ffff8100ea24a000: tried all connections, increasing latency to 51s
Lustre: 5361:0:(import.c:395:import_select_connection()) Skipped 20 previous similar messages
LustreError: 11-0: an error occurred while communicating with 192.168.16.97@tcp. The ost_connect operation failed with -19
LustreError: Skipped 24 previous similar messages
Lustre: 5361:0:(import.c:395:import_select_connection()) scia-OST0003-osc-ffff8100ea24a000: tried all connections, increasing latency to 51s
Lustre: 5361:0:(import.c:395:import_select_connection()) Skipped 24 previous similar messages
LustreError: 167-0: This client was evicted by scia-OST0003; in progress operations using this service will fail.
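
The "-19" in the ost_connect lines is a negative Linux errno; 19 is ENODEV
("No such device"), which fits an OST that has been unmounted on its server.
A small sketch for summarising such lines from dmesg output follows. The
function names are made up for this example, and only a few errno values are
mapped.

```shell
#!/bin/sh
# Summarise "The <op> operation failed with -N" lines from Lustre logs.
# errno values are the Linux ones; only a few are mapped here.
errno_name() {
    case "$1" in
        19)  echo "ENODEV (no such device)" ;;
        107) echo "ENOTCONN (transport endpoint is not connected)" ;;
        110) echo "ETIMEDOUT (connection timed out)" ;;
        *)   echo "unknown" ;;
    esac
}

# Reads log text on stdin and prints one line per distinct
# operation/errno pair.
summarise_failures() {
    sed -n 's/.*The \([a-z_]*\) operation failed with -\([0-9]*\).*/\1 \2/p' |
    sort -u |
    while read -r op code; do
        echo "$op failed with -$code: $(errno_name "$code")"
    done
}

# Example with a fragment of the log above:
printf '%s\n' 'The ost_connect operation failed with -19' | summarise_failures
```

On a live client the same function could be fed with "dmesg | summarise_failures".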
The MDS dmesg:
<snip>
Lustre: 6108:0:(import.c:395:import_select_connection()) scia-OST0003-osc: tried all connections, increasing latency to 51s
Lustre: 6108:0:(import.c:395:import_select_connection()) Skipped 10 previous similar messages
LustreError: 11-0: an error occurred while communicating with 192.168.16.97@tcp. The ost_connect operation failed with -19
LustreError: Skipped 10 previous similar messages
Lustre: 6108:0:(import.c:395:import_select_connection()) scia-OST0003-osc: tried all connections, increasing latency to 51s
Lustre: 6108:0:(import.c:395:import_select_connection()) Skipped 20 previous similar messages
Lustre: Permanently deactivating scia-OST0003
Lustre: Setting parameter scia-OST0003-osc.osc.active in log scia-client
Lustre: Skipped 3 previous similar messages
Lustre: setting import scia-OST0003_UUID INACTIVE by administrator request
Lustre: scia-OST0003-osc.osc: set parameter active=0
Lustre: Skipped 2 previous similar messages
Lustre: scia-MDT0000: haven't heard from client 9111f740-b7a7-e2ff-b672-288a66decfab (at 192.168.16.106@tcp) in 1269 seconds. I think it's dead, and I am evicting it.
Lustre: Permanently reactivating scia-OST0003
Lustre: Modifying parameter scia-OST0003-osc.osc.active in log scia-client
Lustre: Skipped 1 previous similar message
Lustre: 15406:0:(import.c:395:import_select_connection()) scia-OST0003-osc: tried all connections, increasing latency to 51s
Lustre: 15406:0:(import.c:395:import_select_connection()) Skipped 2 previous similar messages
LustreError: 167-0: This client was evicted by scia-OST0003; in progress operations using this service will fail.
Lustre: scia-OST0003-osc: Connection restored to service scia-OST0003 using nid 192.168.16.97@tcp.
Lustre: scia-OST0003-osc.osc: set parameter active=1
Lustre: MDS scia-MDT0000: scia-OST0003_UUID now active, resetting orphans
<snap>