[Lustre-discuss] lustre client 1.6.5.1 hangs

Lundgren, Andrew Andrew.Lundgren at Level3.com
Thu Jul 10 08:21:04 PDT 2008


We are experiencing the same problem with 1.6.4.2.  We thought it was the statahead problem, but after turning off the statahead code we hit the same hang again.  I had hoped that going to 1.6.5 would resolve the issue.  If you open a bug, would you mind sending the bug number to the list?  I would like to get on the CC list.
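For reference, the statahead workaround we used was along these lines; this is a sketch only, and the availability of the `set_param` form vs. the /proc path may differ between 1.6.x minor releases:

```shell
# Disable client-side statahead on a Lustre 1.6.x client
# (assumption: the llite 'statahead_max' tunable exists in this release).
lctl set_param llite.*.statahead_max=0

# On older 1.6.4.x clients without set_param support, the /proc file
# can be written directly instead:
#   for f in /proc/fs/lustre/llite/*/statahead_max; do echo 0 > "$f"; done
```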

> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org
> [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of
> Heiko Schroeter
> Sent: Thursday, July 10, 2008 2:25 AM
> To: lustre-discuss at clusterfs.com
> Subject: [Lustre-discuss] lustre client 1.6.5.1 hangs
>
> Hello,
>
> we have a _test_ setup for a Lustre 1.6.5.1 installation with two RAID
> systems (64-bit) providing 4 OSTs of 6 TB each, and one combined MDS and
> MDT server (32-bit system, for testing only).
>
> OST lustre mkfs:
> "mkfs.lustre --param="failover.mode=failout" --fsname scia --ost
>    --mkfsoptions='-i 2097152 -E stride=16 -b 4096'
>    --mgsnode=mds1lustre@tcp0 /dev/sdb"
> (Our files on the system are quite large, 100 MB+.)
>
> Kernel: vanilla kernel 2.6.22.19; Lustre compiled from the sources on
> Gentoo 2008.0.
>
> The client mount point is /misc/testfs via automount.
> The access can be done through a link from /mnt/testfs -> /misc/testfs
>
> The following procedure hangs a client:
> 1) copy files to the lustre system
> 2) do a 'du -sh /mnt/testfs/willi' while copying
> 3) unmount an OST (here OST0003) while copying
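> The three steps can be sketched as a shell sequence (the paths, the
> example directory 'willi', and the OST device /dev/sdb are assumptions
> taken from this report):
>
> ```shell
> # 1) start copying large files onto the Lustre mount (on the client)
> cp /data/large/*.dat /mnt/testfs/willi/ &
>
> # 2) stat the tree while the copy is still running (on the client)
> du -sh /mnt/testfs/willi &
>
> # 3) meanwhile, on the OSS node (not the client), take OST0003 offline:
> #    umount /dev/sdb
> ```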
>
> The 'du' job hangs and the Lustre file system can no longer be accessed
> on this client, even from other logins. The only way we have found to
> restore normal operation is a hard reset of the machine; a reboot hangs
> because the file system is still active.
> Other clients and their mount points are not affected as long as they do
> not access the file system with 'du', 'ls' or similar.
> I know this is drastic, but it may happen in production with our users.
>
> Deactivating/reactivating or remounting the OST has no effect on the
> 'du' job. The 'du' job (#29665, see process list below) and the
> corresponding Lustre thread (#29694) cannot be killed manually.
>
> This behaviour is reproducible. OST0003 is not reactivated on the
> client side even though the MDS reactivates it; that information
> apparently does not propagate to the client. See the last lines of
> dmesg below.
>
> What is the proper way (besides avoiding the use of 'du') to reactivate
> the client file system?
>
> Thanks and Regards
> Heiko
>
>
>
>
> The process list on the CLIENT:
> <snip>
> root     29175  5026  0 08:36 ?        00:00:00 sshd: laura [priv]
> laura    29177 29175  0 08:36 ?        00:00:01 sshd: laura@pts/0
> laura    29178 29177  0 08:36 pts/0    00:00:00 -bash
> laura    29665 29178  0 09:15 pts/0    00:00:03 du -sh /mnt/testfs/foo/fam/
> schell   29694     2  0 09:15 ?        00:00:00 [ll_sa_29665]
> root     29695  4846  0 09:15 ?        00:00:00 /usr/sbin/automount --timeout 60 --pid-file /var/run/autofs.misc.pid /misc yp auto.misc
> <snap>
>
> and CLIENT dmesg:
> Lustre: 5361:0:(import.c:395:import_select_connection()) scia-OST0003-osc-ffff8100ea24a000: tried all connections, increasing latency to 6s
> Lustre: 5361:0:(import.c:395:import_select_connection()) Skipped 10 previous similar messages
> LustreError: 11-0: an error occurred while communicating with 192.168.16.97@tcp. The ost_connect operation failed with -19
> LustreError: Skipped 20 previous similar messages
> Lustre: 5361:0:(import.c:395:import_select_connection()) scia-OST0003-osc-ffff8100ea24a000: tried all connections, increasing latency to 51s
> Lustre: 5361:0:(import.c:395:import_select_connection()) Skipped 20 previous similar messages
> LustreError: 11-0: an error occurred while communicating with 192.168.16.97@tcp. The ost_connect operation failed with -19
> LustreError: Skipped 24 previous similar messages
> Lustre: 5361:0:(import.c:395:import_select_connection()) scia-OST0003-osc-ffff8100ea24a000: tried all connections, increasing latency to 51s
> Lustre: 5361:0:(import.c:395:import_select_connection()) Skipped 24 previous similar messages
> LustreError: 167-0: This client was evicted by scia-OST0003; in progress operations using this service will fail.
>
> The MDS dmesg:
> <snip>
> Lustre: 6108:0:(import.c:395:import_select_connection()) scia-OST0003-osc: tried all connections, increasing latency to 51s
> Lustre: 6108:0:(import.c:395:import_select_connection()) Skipped 10 previous similar messages
> LustreError: 11-0: an error occurred while communicating with 192.168.16.97@tcp. The ost_connect operation failed with -19
> LustreError: Skipped 10 previous similar messages
> Lustre: 6108:0:(import.c:395:import_select_connection()) scia-OST0003-osc: tried all connections, increasing latency to 51s
> Lustre: 6108:0:(import.c:395:import_select_connection()) Skipped 20 previous similar messages
> Lustre: Permanently deactivating scia-OST0003
> Lustre: Setting parameter scia-OST0003-osc.osc.active in log scia-client
> Lustre: Skipped 3 previous similar messages
> Lustre: setting import scia-OST0003_UUID INACTIVE by administrator request
> Lustre: scia-OST0003-osc.osc: set parameter active=0
> Lustre: Skipped 2 previous similar messages
> Lustre: scia-MDT0000: haven't heard from client 9111f740-b7a7-e2ff-b672-288a66decfab (at 192.168.16.106@tcp) in 1269 seconds. I think it's dead, and I am evicting it.
> Lustre: Permanently reactivating scia-OST0003
> Lustre: Modifying parameter scia-OST0003-osc.osc.active in log scia-client
> Lustre: Skipped 1 previous similar message
> Lustre: 15406:0:(import.c:395:import_select_connection()) scia-OST0003-osc: tried all connections, increasing latency to 51s
> Lustre: 15406:0:(import.c:395:import_select_connection()) Skipped 2 previous similar messages
> LustreError: 167-0: This client was evicted by scia-OST0003; in progress operations using this service will fail.
> Lustre: scia-OST0003-osc: Connection restored to service scia-OST0003 using nid 192.168.16.97@tcp.
> Lustre: scia-OST0003-osc.osc: set parameter active=1
> Lustre: MDS scia-MDT0000: scia-OST0003_UUID now active, resetting orphans
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
