[Lustre-discuss] brief 'hangs' on file operations

Thu Sep 2 09:21:36 PDT 2010

On Thursday, September 02, 2010, Andreas Dilger wrote:
> On 2010-09-02, at 06:43, Tina Friedrich wrote:
> > Causing most grief at the moment is that we sometimes see delays
> > writing files. From the writing client's end, it simply looks as if I/O
> > stops for a while (we've seen 'pauses' of anything up to 10 seconds).
> > This appears to be independent of what client does the writing, and
> > software doing the writing. We investigated this a bit using strace and
> > dd; the 'slow' calls appear to always be either open, write, or close
> > calls. Usually, these take well below 0.001s; in around 0.5% or 1% of
> > cases, they take up to multiple seconds. It does not seem to be
> > associated with any specific OST, OSS, client or anything; there is
> > nothing in any log files or any exceptional load on MDS or OSS or any of
> > the clients.
> 
> This is most likely associated with delays in committing the journal on the
> MDT or OST, which can happen if the journal fills completely.  Having
> larger journals can help, if you have enough RAM to keep them all in
> memory and not overflow.  Alternately, if you make the journals small it
> will limit the latency, at the cost of reducing overall performance.  A
> third alternative might be to use SSDs for the journal devices.

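For anyone wanting to reproduce the measurement quoted above without strace, here is a minimal Python sketch that times individual open, write and close calls and reports the slow ones. The function name probe, the file name latency_probe.tmp, the 64 KiB block size and the 0.1 s threshold are all illustrative assumptions, not anything taken from Tina's setup.

```python
import os
import time


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def probe(directory, iterations=1000, block=b"x" * 65536, threshold=0.1):
    """Time open/write/close on a scratch file in `directory`.

    Returns (total_calls, slow), where slow is a list of
    (iteration, operation, seconds) for calls slower than `threshold`.
    The defaults are arbitrary values chosen for illustration.
    """
    path = os.path.join(directory, "latency_probe.tmp")  # hypothetical name
    slow = []
    total = 0
    for i in range(iterations):
        fd, t_open = timed(os.open, path,
                           os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        # os.write may write fewer bytes than requested; that is ignored
        # here because only the latency of the call is of interest.
        _, t_write = timed(os.write, fd, block)
        _, t_close = timed(os.close, fd)
        for op, elapsed in (("open", t_open), ("write", t_write),
                            ("close", t_close)):
            total += 1
            if elapsed > threshold:
                slow.append((i, op, elapsed))
    if os.path.exists(path):
        os.remove(path)
    return total, slow
```

Calling probe() on a directory inside the Lustre mount and comparing len(slow) to the total gives a rough slow-call fraction to set against the 0.5% to 1% figure quoted above. It measures from a single client only and says nothing about which server caused a delay.
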
As Diamond uses DDN hardware, updating to Lustre 1.8 and enabling the async 
journal feature should help performance in general. I guess it might also 
reduce those delays, as writes are better optimized.

A question, though. Tina, do you use our DDN udev rules, which tune the 
devices for optimized performance? If not, please send a mail to 
support at ddn.com and ask for a recent udev rpm (available for RHEL5 only 
so far; it *might* also work on SLES11, but udev syntax changes too often, 
IMHO). Please put [lustre] into the subject line, as the Lustre team 
maintains them.

Cheers,
Bernd