[Lustre-discuss] Fw: Re: Clients frozen during pressure test

Tue Aug 4 16:30:45 PDT 2009

Hi, 
After modifying the TCP stack on the OSSs, it seems that the failed clients can reconnect again. The pressure test has now been running for 12 hours.

"You should check for error messages on the server that might indicate
why it is having a problem.  If all of the clients are reporting the
same OSTs, possibly on the same OSS, then that is a good sign there is
something wrong with that OSS."

-- Different clients reported lost connections to different OSTs, so the problem might be caused by the failure of a particular OST.
--------------				 
Lu Wang
2009-08-05

-------------------------------------------------------------
From: Andreas Dilger
Sent: 2009-08-05 01:41:03
To: Lu Wang
Cc: lustre-discuss
Subject: Re: [Lustre-discuss] Clients frozen during pressure test

On Aug 03, 2009  22:10 +0800, Lu Wang wrote:
> I am running a pressure test on a new 10-OSS Lustre file system using 70 client nodes. (Each server has a 10Gb Ethernet connection, each client has a 1Gb Ethernet connection, and each OSS has 3 OSTs on 3 RAID6 volumes.)
> Each time, after about 4 hours, the clients begin to freeze one after another. The command "lfs check osts" shows that the frozen clients cannot access some OSTs:
> 		error: check 'testfs-OST0007-osc-c9b82800': Resource temporarily unavailable (11)
> 		error: check 'testfs-OST0008-osc-c9b82800': Resource temporarily unavailable (11)
> 		error: check 'testfs-OST0009-osc-c9b82800': Resource temporarily unavailable (11)
> 
> and the command "lctl ping server" reports "Input/output error".
>
> However, the servers are not very busy (util% < 10) when the clients freeze. My questions are:
> 1. Why can't the clients reconnect when the servers are not busy?
> 2. I have set timeout=1000; do I need to increase the timeout further?
> 3. Are there any other variables that need to be tuned under heavy load?

Using a timeout of 1000s is not good at all.  That means clients will
wait AT LEAST 1000s (16 minutes!) before even detecting that an RPC
failed.  If you are running Lustre 1.8 you don't even need to set the
timeouts - they are scaled automatically based on the load at the server.

You should check for error messages on the server that might indicate
why it is having a problem.  If all of the clients are reporting the
same OSTs, possibly on the same OSS, then that is a good sign there is
something wrong with that OSS.
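
As a rough illustration of that check, here is a minimal Python sketch that takes the "lfs check osts" output collected from several clients and lists which clients report each failing OST. The function names and the client names in the usage note are made up for illustration, and the regular expression assumes error lines in the same format as the ones quoted above. Mapping OSTs to their OSS is left out, since that depends on how the file system was laid out.

```python
import re
from collections import defaultdict

# Assumed line format (taken from the error lines quoted in this thread):
#   error: check 'testfs-OST0007-osc-c9b82800': Resource temporarily unavailable (11)
ERROR_RE = re.compile(
    r"error: check '(?P<fs>[\w-]+?)-(?P<ost>OST[0-9a-fA-F]{4})-osc-[0-9a-fA-F]+'"
)


def failing_osts(lfs_output):
    """Return the set of OST names that appear in error lines of the output."""
    return {m.group("ost") for m in ERROR_RE.finditer(lfs_output)}


def clients_per_ost(outputs_by_client):
    """Map each failing OST to the sorted list of clients that report it.

    outputs_by_client: dict of {client name: text of `lfs check osts` output}.
    """
    result = defaultdict(list)
    for client, output in sorted(outputs_by_client.items()):
        for ost in sorted(failing_osts(output)):
            result[ost].append(client)
    return dict(result)
```

For example, calling clients_per_ost({"client01": text1, "client02": text2}) with the saved output of two clients returns a dictionary keyed by OST name. If the same few OSTs show up for nearly every client, that points at the OSS serving them; if the failing OSTs differ from client to client, the problem is more likely elsewhere (for example on the network).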

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.