[Lustre-discuss] slow journal/commitrw on OSTs lead to crash

Peter Grandi pg_lus at lus.for.sabi.co.UK
Sun Apr 12 08:41:50 PDT 2009


[ ... ]

>> I ran into a similar scenario with lustre when I hit 80% full as
>> well. Exact same problem with journal commits and disks seemingly
>> unusable. iostat on the disks (DDN 9500 array) shows huge numbers of
>> small reads. Almost like the disk is being scanned.

This reminds me of a different issue:

  http://lists.lustre.org/pipermail/lustre-discuss/2008-November/009124.html

   «We have experienced all these errors when we have a big job that is
    writing many small chunks.  when the writes are ... say 80 bytes and
    the block size is 4k bytes, the back end storage can slow down with
    read block, modify block, write block, to such and extent as to
    cause the slow commitrw and slow journal messages very similar to
    yours.»

Those reads can be part of RMW (read-modify-write) cycles for sub-stripe
writes, which are fairly common, for example on journal writes.

  http://manual.lustre.org/manual/LustreManual16_HTML/RAID.html#50548852_pgfId-1288359

   «Do not put an ext3 journal onto RAID5. As the journal is written
    linearly and synchronously, in most cases writes do not fill whole
    stripes and RAID5 has to read parities.»

   «Ideally, the RAID configuration should allow the 1 MB Lustre RPCs
    to evenly fit only one RAID stripe without requiring an expensive
    read-modify-write cycle.»

As usual, I would be wary of using RAID5 or RAID6 as an OST, as RAID10
is nearly always so much nicer (see the small-write arithmetic below).
On the other hand, a RAID5 or RAID6 with a stripe size of 4KiB might be
vaguely tolerable (as Linux on x86 variants uses 4KiB as the block size
anyhow).
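The usual small-random-write arithmetic shows why RAID10 comes out
ahead; the per-write member-I/O counts below are the textbook figures,
and the per-disk IOPS number is purely an illustrative assumption:

  # Member I/Os needed per small random write (textbook figures).
  WRITE_IOS = {
      "RAID10": 2,   # write both mirror copies
      "RAID5":  4,   # read old data + old parity, write both back
      "RAID6":  6,   # as RAID5, but with two parity blocks
  }

  iops_per_disk = 150    # assumed per-disk random IOPS, illustrative
  ndisks = 10

  for level, cost in WRITE_IOS.items():
      print(f"{level}: ~{ndisks * iops_per_disk / cost:.0f} "
            f"small random writes/s from {ndisks} disks")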

> [ ... ] as ext2/3 filesystems get full, they become less efficient.
> That's not surprising as a filesystem certainly can fall into the
> category of resources that become less efficient as they become more
> full due to the overhead of finding suitable allocations. [ ... ]

This seems to indicate that finding free space becomes slower, which is
a relatively minor issue. There are two bigger issues:

* A fundamental issue is that nearby or contiguous stretches of blocks
  become scarce, as free blocks tend to be widely scattered. This is
  independent of file system design (except those that have a compacting
  collector).

* A more incidental one is that since most file systems tend to allocate
  free blocks sequentially starting from the beginning of the disk, the
  last blocks to remain free tend to be in the inner cylinders. That is,
  the odds are that in a 50% full filesystem the allocated half is mostly
  in the outer cylinders and the free blocks are mostly in the inner
  cylinders, which can be much slower (a rough sketch of the effect
  follows this list).
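A back-of-the-envelope sketch of the second point, with purely
illustrative radii and transfer rate for a 3.5" drive:

  # Constant angular velocity plus roughly constant linear bit density
  # means sequential throughput scales with track radius; the numbers
  # below are illustrative guesses, not measurements.
  outer_radius_mm = 46.0
  inner_radius_mm = 20.0
  outer_mb_s = 120.0       # assumed streaming rate on the outer tracks

  inner_mb_s = outer_mb_s * inner_radius_mm / outer_radius_mm
  print(f"outer: ~{outer_mb_s:.0f} MB/s, inner: ~{inner_mb_s:.0f} MB/s")
  # -> outer: ~120 MB/s, inner: ~52 MB/s, i.e. less than half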

There are other potential issues: 'ext3' and 'ext4', for example, use
relatively small allocation groups and try to keep some space free in
every group, which works well when there is lots of free space. Also,
when little free space is left, it becomes much more difficult to find
parity-RAID-aligned free extents, which means that parity RAID based
OSTs end up with many more RMW cycles when writing than they otherwise
would.
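A toy illustration of the alignment point: a write can only avoid the
parity RMW if the allocator can hand it a region that starts on a
stripe boundary and covers whole stripes. The geometry and the free
extent list below are made up for the example:

  STRIPE_BLOCKS = 256          # e.g. 1MiB stripe / 4KiB blocks, assumed

  def aligned_stripes(start, length):
      """Whole, aligned stripes inside a free extent (in FS blocks)."""
      first = (start + STRIPE_BLOCKS - 1) // STRIPE_BLOCKS * STRIPE_BLOCKS
      return max(0, (start + length - first) // STRIPE_BLOCKS)

  # Mostly empty filesystem: one big free extent, plenty of aligned room.
  print(aligned_stripes(0, 1_000_000))                     # 3906
  # Nearly full filesystem: the same free space is scattered in small,
  # unaligned extents, none of which fits even a single stripe.
  scattered = [(513, 100), (9000, 200), (20321, 150)]
  print(sum(aligned_stripes(s, l) for s, l in scattered))  # 0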


Overall, people who like *sustainable* performance have a big problem:
performance numbers drive sales, and the numbers that matter in practice
are the "peak" numbers obtained from a fresh install and a quick
benchmark run. So in the storage development community there is a lot of
focus on performance as in a freshly loaded, nearly empty filesystem on
the outer tracks of a storage system doing IO in large aligned blocks.


