[Lustre-discuss] Frequent OSS Crashes with heavy load

Andreas Dilger adilger at sun.com
Tue Nov 11 01:50:58 PST 2008


On Nov 11, 2008  15:52 +0800, wanglu wrote:
>     Since there are jobs running on the cluster, I can't run the PIOS test now. I am afraid this situation may happen again later. Does Lustre have some way to deal with over-subscription other than a kernel crash? Users can accept that their jobs slow down, but they cannot accept their jobs dying because the OSSs crash.
> Or is there any other reason that may cause the OSSs to crash?

You can increase the Lustre timeout, either temporarily on all clients and servers:

    lctl set_param timeout=200

or permanently in the filesystem configuration (on the MGS only):

    lctl conf_param {fsname}.sys.timeout=200
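
As a rough sketch of how the two forms above differ, the small wrapper below only prints the command that would be run, so it can be reviewed before touching a live system. The function name `timeout_cmd` and the filesystem name `testfs` are made up for illustration; only the two `lctl` invocations themselves come from this message.

```shell
#!/bin/sh
# timeout_cmd: hypothetical helper that prints (does not run) the lctl
# command for a temporary or a permanent timeout change.
timeout_cmd() {
    mode=$1      # "temp" (run on every client and server) or "perm" (run on the MGS only)
    value=$2     # timeout in seconds
    fsname=$3    # filesystem name, only used for "perm"
    case "$mode" in
        temp) echo "lctl set_param timeout=$value" ;;
        perm) echo "lctl conf_param $fsname.sys.timeout=$value" ;;
        *)    echo "usage: timeout_cmd temp|perm VALUE [FSNAME]" >&2; return 1 ;;
    esac
}

timeout_cmd temp 200          # prints: lctl set_param timeout=200
timeout_cmd perm 200 testfs   # prints: lctl conf_param testfs.sys.timeout=200
```

After applying the temporary form, `lctl get_param timeout` on a node should show the value currently in effect.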

> wanglu
> 2008-11-11
> 
> -------------------------------------------------------------
> From: Wang lu
> Sent: 2008-11-11 01:01:12
> To: Brian J. Murrell
> Cc: lustre-discuss at lists.lustre.org
> Subject: Re: [Lustre-discuss] Frequent OSS Crashes with heavy load
> 
> Thanks a lot. I will continue tomorrow.
> 
> Brian J. Murrell wrote:
> 
> > On Mon, 2008-11-10 at 16:42 +0000, Wang lu wrote:
> >> I already have 512 (the maximum number) IO threads running. Some of them are
> >> in "Dead" status. Is it safe to conclude that the OSS is oversubscribed?
> > 
> > Until you do some analysis of your storage with the iokit, one cannot
> > really draw any conclusions, however if you are already at the maximum
> > value of OST threads, it would not be difficult to believe that perhaps
> > this is a possibility.
> > 
> > Try a simple experiment and halve the number to 256 and see if you have
> > any drop off in throughput to the storage devices.  If not, then you can
> > easily assume that 512 was either too much or not necessary.  You can
> > try doing this again if you wish.  If you get to a value of OST threads
> > where your throughput is lower than it should be, you've gone too low.
> > 
> > But really, the iokit is the more efficient and accurate way to
> > determine this.
> > 
> > b.
> > 
> 
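
The halving experiment described above can be written down as a small decision rule. The sketch below assumes you have already measured throughput by hand at each thread count and feeds those numbers in as "threads MB/s" pairs, highest thread count first. The function name `pick_threads`, the 5% tolerance, and the sample figures are all invented for illustration; they are not measurements from this cluster, and how the thread count itself is changed is not shown here.

```shell
#!/bin/sh
# pick_threads: read "threads throughput" pairs (highest thread count first)
# and print the lowest thread count whose throughput stays within
# TOLERANCE percent of the first (baseline) measurement.
# Throughput values must be integers (shell arithmetic).
pick_threads() {
    tolerance=${1:-5}   # allowed drop in percent; an assumption, not a Lustre default
    baseline=""
    best=""
    while read -r threads mbps; do
        [ -z "$threads" ] && continue
        if [ -z "$baseline" ]; then
            baseline=$mbps
            best=$threads
            continue
        fi
        # Stop at the first measurement below the allowed floor: "gone too low".
        floor=$(( baseline * (100 - tolerance) / 100 ))
        if [ "$mbps" -lt "$floor" ]; then
            break
        fi
        best=$threads
    done
    echo "$best"
}

# Made-up example numbers (MB/s), purely illustrative:
printf '512 800\n256 805\n128 790\n64 600\n' | pick_threads 5   # prints 128
```

With these sample numbers the baseline is 800 MB/s, so the floor is 760 MB/s; 256 and 128 threads stay above it and 64 does not, which is why 128 is reported.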
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



