[Lustre-discuss] slow journal/commitrw on OSTs lead to crash

Mon Apr 13 21:33:20 PDT 2009

On Apr 12, 2009  16:41 +0100, Peter Grandi wrote:
>    «Ideally, the RAID configuration should allow the 1 MB Lustre RPCs
>     to evenly fit only one RAID stripe without requiring an expensive
>     read-modify-write cycle.»
> 
> As usual, I would be wary of using RAID5 or RAID6 as an OST, as RAID10
> is nearly always so much nicer. On the other hand a RAID5 or RAID6 with
> a stripe size of 4KiB might be vaguely tolerable (as Linux on x86
> variants uses 4KiB as block size anyhow).

Actually, the ldiskfs allocator (mballoc, also used in ext4) is optimized
for RAID configurations.  It allocates RAID-stripe sized and aligned
chunks if possible, specifically to avoid read-modify-write, and also
to avoid fragmented free space.  In the absence of other information,
mballoc will assume a 1MB RAID stripe width, but recent Lustre + e2fsprogs
releases allow storing the RAID stripe width in the superblock via tune2fs.
It is also possible to have the clients match their RPC size and alignment
to the RAID stripe size/alignment, though the optimal RPC size is not yet
passed automatically from the underlying RAID all the way to the client.
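
To make the alignment arithmetic concrete, here is a minimal sketch.  The
8+2 RAID6 geometry, the 128KiB chunk size, and the helper names are
assumptions picked for illustration, not values from any particular system,
and the model ignores controller write caches that may coalesce writes.

```python
KIB = 1024
MIB = 1024 * KIB


def full_stripe_bytes(data_disks, chunk_bytes):
    """Bytes of file data in one full RAID stripe (parity disks excluded)."""
    return data_disks * chunk_bytes


def needs_rmw(offset, length, stripe_bytes):
    """True if a write starts or ends inside a stripe, i.e. it only
    partially covers some stripe and parity must be read and recomputed."""
    return offset % stripe_bytes != 0 or (offset + length) % stripe_bytes != 0


def ext_stripe_options(data_disks, chunk_bytes, block_bytes=4096):
    """Stride and stripe width expressed in filesystem blocks, the unit
    assumed here for the stride/stripe_width extended options."""
    stride = chunk_bytes // block_bytes
    return stride, stride * data_disks


# Assumed geometry: 8 data disks, 128KiB chunk per disk -> 1MiB full stripe.
stripe = full_stripe_bytes(8, 128 * KIB)

aligned = needs_rmw(0, MIB, stripe)        # 1MiB RPC on a stripe boundary
shifted = needs_rmw(4096, MIB, stripe)     # same RPC shifted by one block
stride, stripe_width = ext_stripe_options(8, 128 * KIB)
```

With this assumed geometry a 1MiB RPC that starts on a stripe boundary
covers exactly one stripe, while the same RPC shifted by a single 4KiB
block touches two stripes partially.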

> This seems to indicate that finding space becomes slower. That seems to
> be a minor issue. There are two bigger issues:
> 
> * A fundamental issue is that nearby or contiguous stretches of blocks
>   become scarce, as free blocks tend to be widely scattered. This is
>   independent of file system design (except those that have a compacting
>   collector).

This is true of almost all filesystems.  The best way to avoid it is
to allocate blocks contiguously in the first place, so that freeing a
file also leaves contiguous free blocks behind.
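
A toy model of the point above, as a sketch only: the file count, file
size, and the two layouts are made-up parameters, and a real allocator is
far more involved.  It compares the free extents left behind when one file
is deleted from a contiguous layout and from a fully interleaved one.

```python
def free_extents(bitmap):
    """Return the length of each run of free (False) blocks in the bitmap."""
    runs, run = [], 0
    for used in bitmap:
        if used:
            if run:
                runs.append(run)
            run = 0
        else:
            run += 1
    if run:
        runs.append(run)
    return runs


def layout(contiguous, nfiles=4, blocks_per_file=8):
    """Return a list mapping each block index to the id of the owning file."""
    if contiguous:
        # File 0 gets blocks 0..7, file 1 gets blocks 8..15, and so on.
        return [f for f in range(nfiles) for _ in range(blocks_per_file)]
    # Files take turns, one block at a time: 0,1,2,3,0,1,2,3,...
    return [f for _ in range(blocks_per_file) for f in range(nfiles)]


def free_after_delete(owners, victim):
    """Free extents left after deleting every block owned by `victim`."""
    return free_extents([owner != victim for owner in owners])


contig = free_after_delete(layout(True), victim=1)
interleaved = free_after_delete(layout(False), victim=1)
```

In this model the contiguous layout leaves a single 8-block hole, while
the interleaved layout leaves eight separate 1-block holes, none of which
could hold a stripe-sized allocation.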

> There are other potential issues; 'ext3' and 'ext4' for example use
> relatively small allocation groups, and try to keep some space free in
> every group, and this works well when there is lots of free space; also
> when there is little disk space, it becomes much more difficult to find
> parity RAID aligned free space, and this means that parity RAID based
> OSTs end up with many more RMW cycles than otherwise when writing.

While it is true that ext3 and ext4 have relatively small allocation
groups (128MB), the allocators are totally different and as a result
the performance of ext4 (which is largely based on Lustre ldiskfs)
is much better than that of ext3.
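
For reference, the 128MB figure follows from the group's block bitmap
occupying a single filesystem block.  A small sketch of that arithmetic,
assuming 4KiB blocks and the 1MB default stripe width mentioned earlier
(the function names are illustrative only):

```python
MIB = 1024 * 1024


def group_bytes(block_bytes=4096):
    """Size of one allocation group when its block bitmap is one block:
    each bitmap bit tracks one block, so a group holds block_bytes * 8."""
    blocks_per_group = block_bytes * 8
    return blocks_per_group * block_bytes


def stripes_per_group(stripe_bytes, block_bytes=4096):
    """How many full, aligned RAID stripes fit in one allocation group."""
    return group_bytes(block_bytes) // stripe_bytes


group_mib = group_bytes() // MIB
stripes = stripes_per_group(1 * MIB)
```

Under these assumptions each group has room for at most 128 aligned 1MB
stripes, before any space taken by group metadata is subtracted.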

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.