[Lustre-discuss] Frequent OSS Crashes with heavy load

wanglu wanglu at ihep.ac.cn
Mon Nov 10 23:52:59 PST 2008


 Hi all, 
    Since there are jobs running on the cluster, I can't run the PIOS test now. I am afraid this situation may happen again later. Does Lustre have any mechanism to handle oversubscription gracefully instead of crashing the kernel? Users can accept that their jobs slow down, but they cannot accept their jobs dying because an OSS crashed.
Or is there any other reason that might cause the OSSs to crash?
    Thank you very much! 
    

------------------				 
wanglu
2008-11-11

-------------------------------------------------------------
From: Wang lu
Date: 2008-11-11 01:01:12
To: Brian J. Murrell
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] Frequent OSS Crashes with heavy load

Thanks a lot. I will go on tomorrow.

Brian J. Murrell wrote:

> On Mon, 2008-11-10 at 16:42 +0000, Wang lu wrote:
>> I already have 512 (the maximum number of) I/O threads running. Some of them are in
>> "Dead" state. Is it safe to conclude that the OSS is oversubscribed? 
> 
> Until you do some analysis of your storage with the iokit, one cannot
> really draw any conclusions.  However, since you are already at the
> maximum number of OST threads, oversubscription is certainly plausible.
> 
> Try a simple experiment: halve the number to 256 and see whether you get
> any drop-off in throughput to the storage devices.  If not, then you can
> safely assume that 512 was too many, or at least unnecessary.  You can
> repeat the halving if you wish.  If you reach a value of OST threads
> where your throughput is lower than it should be, you've gone too low.
> 
> But really, the iokit is the more efficient and accurate way to
> determine this.
> 
> b.
> 
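The halving experiment Brian describes can be sketched as below. This is a hedged sketch only: it assumes a Lustre version that exposes the OSS I/O service thread ceiling through the `ost.OSS.ost_io.threads_max` tunable and the `oss_num_threads` module option; verify the exact parameter names on your own OSS with `lctl get_param -N 'ost.*'` before relying on them.

```shell
# Sketch: halve the OSS I/O service threads and observe throughput.
# Assumes the ost.OSS.ost_io.* tunables exist on this Lustre version;
# check first with: lctl get_param -N 'ost.OSS.ost_io.*'

# Show the current ceiling and how many threads have actually started
lctl get_param ost.OSS.ost_io.threads_max
lctl get_param ost.OSS.ost_io.threads_started

# Lower the ceiling from 512 to 256 at runtime (already-started threads
# are not killed; the count only shrinks as threads go idle, so a
# restart applies the new limit immediately)
lctl set_param ost.OSS.ost_io.threads_max=256

# To make the setting persistent across reboots, set the module option
# on the OSS instead, e.g. in /etc/modprobe.conf:
#   options ost oss_num_threads=256
```

If aggregate throughput to the OSTs is unchanged at 256, the extra threads were only adding memory pressure; if it drops noticeably, raise the value again. As Brian notes, the lustre-iokit (e.g. obdfilter-survey) is the more systematic way to find where the curve flattens.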

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
