[Lustre-discuss] Clients frozen during pressure test

Tue Aug 4 10:43:47 PDT 2009

On Aug 03, 2009  22:10 +0800, Lu Wang wrote:
> I am running a pressure test on a new 10-OSS Lustre file system using 70 client nodes. (Each server has a 10Gb Ethernet connection and each client has a 1Gb Ethernet connection; each OSS has 3 OSTs on 3 RAID6 volumes.)
> Each time, after about 4 hours, the clients begin to freeze one after another. The command "lfs check osts" shows that the frozen clients cannot access some OSTs:
> 		error: check 'testfs-OST0007-osc-c9b82800': Resource temporarily unavailable (11)
> 		error: check 'testfs-OST0008-osc-c9b82800': Resource temporarily unavailable (11)
> 		error: check 'testfs-OST0009-osc-c9b82800': Resource temporarily unavailable (11)
> 
> and the command "lctl ping server" reports "Input/output error".
> 				
> However, the servers are not very busy (util% < 10) when the clients are frozen. My questions are:
> 1. Why can't the clients reconnect when the servers are not busy?
> 2. I have set timeout=1000; do I need to increase it further?
> 3. Are there any other variables that need to be tuned under heavy load?

Using a timeout of 1000s is not good at all.  That means clients will
wait AT LEAST 1000s (more than 16 minutes!) before even detecting that an
RPC failed.  If you are running Lustre 1.8 you don't even need to set the
timeouts - they are scaled automatically based on the load at the server.

You should check for error messages on the server that might indicate
why it is having a problem.  If all of the clients are reporting the
same OSTs, especially if those OSTs are all on the same OSS, then that
is a strong sign there is something wrong with that OSS.
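
One way to check whether the clients agree on the failing OSTs is to
collect the "lfs check osts" output from each frozen client and tally the
OST names.  The following is only a minimal sketch: it assumes the error
lines look exactly like the ones quoted above, and the function names and
the client names in the usage note are hypothetical, not part of any
Lustre tool.

```python
import re
from collections import Counter

# Matches lines in the format quoted above, e.g.
#   error: check 'testfs-OST0007-osc-c9b82800': Resource temporarily unavailable (11)
# The pattern is an assumption based on those three lines only.
ERROR_RE = re.compile(
    r"error: check '(?P<fs>[\w-]+?)-(?P<ost>OST[0-9a-fA-F]{4})-osc-[0-9a-fA-F]+': "
    r"(?P<msg>.+) \((?P<errno>\d+)\)"
)


def failing_osts(output):
    """Return the set of OST names reported as errors in one client's output."""
    return {m.group("ost") for m in ERROR_RE.finditer(output)}


def tally(outputs_by_client):
    """Count, for each OST name, how many clients reported an error for it."""
    counts = Counter()
    for output in outputs_by_client.values():
        counts.update(failing_osts(output))
    return counts
```

For example, tally({"client01": text1, "client02": text2}) returns a
Counter keyed by OST name.  If the same few OSTs are reported by nearly
every frozen client, the OSS serving them is the first place to look
for errors.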

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.