[Lustre-discuss] Frequent OSS Crashes with heavy load

Brian J. Murrell Brian.Murrell at Sun.COM
Mon Nov 10 06:58:28 PST 2008


On Mon, 2008-11-10 at 14:49 +0000, Wang lu wrote:
> Thanks, Brian,
>    During a "crash", I can neither SSH to the OSS server nor
> start a new console on the machine directly. A "df" takes over 10 seconds.

Yeah, sounds like the OSS is quite "backed up".

>    Our system has 3 server nodes: 1 server for the MDS and 2 servers for the
> 2 OSSs. Each OSS has 2 disk arrays attached. The total space is 57TB.

Ahhh.  OK.  Your description made it sound like you were running all of
those on a single node.  The reality is that Lustre doesn't do
anything magic: if you only use a single node, you likely won't see any
better performance than, say, plain NFS.  But I digress.

>    The problem may be caused by oversubscription, since the %util and load
> average are both high. However, I do not know: 1. How do I estimate the
> optimum number of OST threads? Do you have any suggestions?

Use our iokit.

> 2. What is the relationship between the number of OST threads and the number
> of Lustre client nodes?

There is none.  What matters is the point of diminishing returns on
driving your storage as you add more threads.  Most storage can benefit
from having multiple threads from a single machine driving it, up to a
point of saturation.  There is no point in driving the storage beyond
that point.  The iokit will test your storage, throwing
more and more threads at it.  When you look at the output you will find
a number of threads beyond which you get no further increase in
performance.  That number is your optimum number of OST threads.
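
To illustrate how to read that kind of output, here is a minimal Python
sketch.  It assumes you have already copied the results into a mapping of
thread count to throughput by hand.  The function name saturation_point,
the 5% tolerance and the sample numbers are all made up for illustration;
they are not iokit output and not part of any Lustre tool.

```python
def saturation_point(results, tolerance=0.05):
    """Return the smallest thread count whose throughput is within
    `tolerance` (a fraction) of the best throughput observed.

    `results` maps thread count -> measured throughput (any unit).
    """
    if not results:
        raise ValueError("no measurements given")
    best = max(results.values())
    # Walk thread counts in increasing order and stop at the first one
    # that is "close enough" to the best result.
    for threads in sorted(results):
        if results[threads] >= best * (1.0 - tolerance):
            return threads


# Made-up placeholder numbers (threads -> MB/s), NOT real measurements.
sample = {1: 100.0, 2: 190.0, 4: 340.0, 8: 520.0,
          16: 600.0, 32: 610.0, 64: 605.0}

print(saturation_point(sample))  # prints 16 for the placeholder data
```

With the placeholder data, going from 16 to 32 threads gains under 2%, so
16 is reported as the saturation point.  The tolerance is a judgment call.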

Certainly you can achieve this without the iokit by just playing with
the number of OST threads, adjusting up and down (think about doing a
binary search, for example) until you find your sweet spot.  This method
is of course more time-consuming.
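
The binary search idea could be sketched in Python roughly as follows.
This is only a sketch under stated assumptions: measure() is a
hypothetical callable that you would have to supply yourself (set the
thread count, rerun your benchmark, return the throughput), the upper
bound of 256 is arbitrary, and the search assumes throughput does not
decrease as threads are added up to that bound.  If your storage gets
slower when oversubscribed, lower the bound first.

```python
def find_sweet_spot(measure, lo=1, hi=256, tolerance=0.05):
    """Binary-search for the smallest thread count whose throughput is
    within `tolerance` of the throughput measured at `hi` threads.

    Assumes measure(n) is non-decreasing in n on [lo, hi].
    """
    target = measure(hi) * (1.0 - tolerance)
    while lo < hi:
        mid = (lo + hi) // 2
        if measure(mid) >= target:
            hi = mid        # mid is already saturated; try fewer threads
        else:
            lo = mid + 1    # mid is too slow; need more threads
    return lo


def fake_measure(threads):
    # Stand-in for a real benchmark run: throughput grows linearly and
    # flattens at 48 threads.  Purely illustrative.
    return float(min(threads, 48)) * 10.0


print(find_sweet_spot(fake_measure))  # prints 46 for the stand-in function
```

Each call to measure() is a full benchmark run, so this needs on the
order of log2(hi) runs rather than one run per candidate thread count.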

b.
