[Lustre-discuss] How to estimate the time for e2fsck on OST

Peter Grandi pg_lus at lus.for.sabi.co.UK
Tue Aug 4 15:47:21 PDT 2009


[ ... ]

adilger> Putting 4 OSTs on a single disk doesn't make sense.
adilger> A single OST can be up to 8TB, and if you have multiple
adilger> OSTs on the same disk(s) it will cause terrible
adilger> performance problems due to seeking.

Uhm, not exactly, that's a quick but simplistic answer: things
are more complicated than that.

In most cases the amount of seeking depends on the access patterns
and on the number of disks, not on how many filesystems a disk holds.

Suppose that you have a 1TB disk and divide it into one or two
filesystems: for a given file set (an assumption relaxed later) and
a given access pattern, the same bits of the disk will be accessed.

The two filesystems end up being mostly super-cylinder-groups, that
is, mostly disjoint free-space allocation pools. There are secondary
effects from the disjoint free-space allocation (one filesystem
means allocations can spread all over the disk, while two
filesystems restrict allocations to two separate pools, which most
likely will improve clustering).

Also, two separate filesystems are more resilient to serious
mangling, and might fsck faster (because of the better clustering)
if checked sequentially.

But the assumption of a "given file set" does not hold if the two
filesystems are part of the same Lustre filesystem *and* striping
is happening. In that case two objects that are parts of the same
Lustre file will usually end up on the two partitions, and Lustre
will assume that they can be fetched in parallel when they really
cannot, which may reduce performance.

But the overall effect will not be big; it will mostly be the same
as if the max object size had been doubled, because again
performance depends mostly on file access patterns and number of
drives. For small files, though, it will halve the number of disks
across which they can be striped, but this can be countered by
halving the max object size.

Consider this example: a max object size of 1MiB, a 100MiB file,
10 drives, and striping.

With one filesystem per drive you can read 10MiB in parallel in
1MiB objects (stripe size 10MiB). With two filesystems per drive you
can read 20MiB in parallel (stripe size 20MiB) in 2x1MiB objects
that are serialized by the drive.

If the max object size is changed to 512KiB in the case of two
filesystems per drive, you can still read 10MiB in parallel in
2x512KiB objects (back to the 10MiB stripe size).
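
Just to make the arithmetic above easy to check, a tiny Python
sketch; the names are made up for illustration, nothing here is a
Lustre API:

  # Purely illustrative sketch of the stripe arithmetic above;
  # the names are invented, nothing here is Lustre-specific.

  def stripe_round(drives, osts_per_drive, object_size_mib):
      """Data moved per full-width stripe 'round'."""
      stripe_width = drives * osts_per_drive        # OSTs in the stripe
      stripe_size = stripe_width * object_size_mib  # MiB issued in parallel
      per_drive = osts_per_drive * object_size_mib  # MiB serialized per drive
      return stripe_size, per_drive

  # The three cases from the example above (10 drives, 100MiB file):
  print(stripe_round(10, 1, 1.0))  # (10.0, 1.0)  one OST/drive, 1MiB objects
  print(stripe_round(10, 2, 1.0))  # (20.0, 2.0)  two OSTs/drive, 1MiB objects
  print(stripe_round(10, 2, 0.5))  # (10.0, 1.0)  two OSTs/drive, 512KiB objects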

Now one might argue that in the 10x1MiB case the 1MiB is likely to
be more contiguous than in the 10x2x512KiB case, with the two
512KiB objects forced into different halves of the disk, but then
let me point out that the 100MiB file striped across the 10 drives
in 1MiB objects has got 10x1MiB objects per drive anyhow, and
whether they are clustered or not is mostly up to luck.

So the issue really is whether 20x512KiB objects per drive are
going to be less clustered than 10x1MiB objects, and my guess is
that it does not matter a lot, and in some cases it might be of
benefit.

Anyhow, there is a case where two OSTs per drive are most likely of
benefit: when the two OSTs belong to two Lustre filesystems, one
faster (outer-track OSTs) and used more often, and one slower
(inner-track OSTs) and used less often. That amounts to a crude
form of hand-clustering.
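
A back-of-the-envelope sketch of why the outer-track OST is the
"faster" one: with zoned recording at constant RPM, sequential
throughput scales roughly with track radius. The radii and the
outer-zone rate below are assumed numbers for illustration, not
measurements of any particular drive:

  # Rough estimate only: the radii and the outer-zone rate are
  # assumptions, not measurements of any particular drive.
  outer_radius_mm = 45.0    # outermost data track of a 3.5" platter
  inner_radius_mm = 20.0    # innermost data track
  outer_rate_mb_s = 100.0   # assumed sequential rate on the outer zone

  # Constant RPM + zoned recording: rate roughly proportional to radius.
  inner_rate_mb_s = outer_rate_mb_s * inner_radius_mm / outer_radius_mm
  print("inner zone ~%.0f MB/s vs outer %.0f MB/s"
        % (inner_rate_mb_s, outer_rate_mb_s))
  # -> roughly 44 MB/s vs 100 MB/s: about a 2x spread between the OSTs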

Still, though, performance likely depends more on the overall file
access patterns and the number of disks than on whether each disk
is split into two distinct allocation pools.

Note 1: a fair bit also depends on the in-cylinder-group allocation
policy of 'ldiskfs' and on how often the allocator will switch to a
different cylinder group.

Note 2: maybe there is some special issue within Lustre that makes
it rather less effective with multiple partitions per disk.

Note 3: in many if not most (just a guess) Lustre installations the
"disk" is actually a SAN RAID pool, and each OST is a LUN of that
SAN RAID pool, and that LUN is in effect a slice of a partition off
each disk. Now this may not be at all what Lustre should be
about :-).

Amazing barely related discovery BTW: while searching info on the
current cylinder group policies of file system designs in the 'ext'
family, I found that there was an interesting filesystem called
"ext4" in 1997, which has some elements reminiscent of Lustre (or
the original UNIX filesystem design):

  http://www.cs.cmu.edu/~mihaib/fs/fs.html
    "A Dual-Disk File System: ext4 Mihai Budiu April 16, 1997"

So RedHat and Linus should change the name of the recently
introduced one to 'ext5'.



