[Lustre-discuss] Frequent OSS Crashes with heavy load
Andreas Dilger
adilger at sun.com
Tue Nov 11 01:50:58 PST 2008
On Nov 11, 2008 15:52 +0800, wanglu wrote:
> Since there are jobs running on the cluster, I can't run the PIOS test now. I am afraid this situation may happen again later. Does Lustre have some way to deal with over-subscription other than a kernel crash? Users can accept that their jobs slow down, but they cannot accept their jobs dying because the OSSs crash.
> Or is there any other reason that may cause the OSSs to crash?
You can increase the Lustre timeout, temporarily on all clients & servers:
lctl set_param timeout=200
or permanently in the filesystem configuration (on the MGS only):
lctl conf_param {fsname}.sys.timeout=200
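
As a rough sketch of how the two commands above could be wrapped for
repeated use: the helper names (run, set_timeout_temporary,
set_timeout_permanent) and the filesystem name "testfs" are made up for
illustration and are not part of Lustre. By default the helpers only print
the command; set RUN=1 to actually execute it.

```shell
#!/bin/sh
# Sketch only: wrappers around the two lctl commands from this thread.
# The function names and the RUN switch are hypothetical conveniences.

run() {
    # Print the command; execute it only when RUN=1 is set.
    echo "$*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# Temporary change: must be run on every client and server.
# $1 = timeout in seconds
set_timeout_temporary() {
    run lctl set_param "timeout=$1"
}

# Permanent change: run once, on the MGS only.
# $1 = filesystem name (placeholder, e.g. "testfs"), $2 = timeout in seconds
set_timeout_permanent() {
    run lctl conf_param "$1.sys.timeout=$2"
}
```

For example, "set_timeout_permanent testfs 200" prints
"lctl conf_param testfs.sys.timeout=200" and, without RUN=1, changes nothing.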
> wanglu
> 2008-11-11
>
> -------------------------------------------------------------
> From: Wang lu
> Sent: 2008-11-11 01:01:12
> To: Brian J. Murrell
> Cc: lustre-discuss at lists.lustre.org
> Subject: Re: [Lustre-discuss] Frequent OSS Crashes with heavy load
>
> Thanks a lot. I will go on tomorrow.
>
> Brian J. Murrell wrote:
>
> > On Mon, 2008-11-10 at 16:42 +0000, Wang lu wrote:
> >> I already have 512 (the maximum number) IO threads running. Some of them are in "Dead"
> >> status. Is it safe to conclude that the OSS is oversubscribed?
> >
> > Until you do some analysis of your storage with the iokit, one cannot
> > really draw any conclusions, however if you are already at the maximum
> > value of OST threads, it would not be difficult to believe that perhaps
> > this is a possibility.
> >
> > Try a simple experiment and halve the number to 256 and see if you have
> > any drop off in throughput to the storage devices. If not, then you can
> > easily assume that 512 was either too much or not necessary. You can
> > try doing this again if you wish. If you get to a value of OST threads
> > where your throughput is lower than it should be, you've gone too low.
> >
> > But really, the iokit is the more efficient and accurate way to
> > determine this.
> >
> > b.
> >
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.