[lustre-discuss] Chunk of file -> LNET node

Thu Mar 2 16:46:05 PST 2017

On Mar 2, 2017, at 12:31, François Tessier <ftessier at anl.gov> wrote:
> 
> Hello,
> 
> Correct me if I'm wrong: when a file is created on a Lustre fs, a set of
> OSTs (depending on the stripe count) is assigned.

... a set of OST objects is assigned.

> It means that the chunks of file (of size stripe_size) will be distributed
> among these OSTs. To each OST corresponds a set of LNET nodes.

I'd say "Each OST is hosted by an OSS node".

> From an application point of view, when the file is effectively written, the
> chunks are sent to the OST(s) through the corresponding set of LNET nodes.

s/LNET/OSS/ yes.

> My questions are:
> 
> - How to know (if possible using the Lustre API), for each chunk, what
> is the corresponding LNET node?

After the fact, this is relatively straightforward.  You can use the FIEMAP
ioctl (via the "filefrag" utility from Lustre e2fsprogs) running on any client
to report exactly the placement of each byte of the file on each OST.

In advance of actual file IO (or also after the fact), the formula for each
file is basically:

    fetch file layout via llapi_layout_get_by_path() or similar
    stripe_index = (logical file offset / stripe_size) % stripe_count
    OST index = llapi_layout_ost_index_get(layout, stripe_index)
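
For illustration, here is a minimal sketch of that arithmetic in Python. It is
not Lustre code: the function names are hypothetical, and the list of OST
indices is made up for the example. In practice that list would come from the
file's layout (e.g. via the llapi_layout calls above, or "lfs getstripe").

```python
def stripe_index(offset: int, stripe_size: int, stripe_count: int) -> int:
    """Return the 0-based stripe number that holds the byte at `offset`."""
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; stripe_size and stripe_count > 0")
    # Chunk number within the file, wrapped round-robin over the stripes.
    return (offset // stripe_size) % stripe_count


def ost_for_offset(offset: int, stripe_size: int, ost_indices: list) -> int:
    """Map a file offset to an OST index.

    `ost_indices[i]` is assumed to be the OST index backing stripe i of the
    file, in layout order (hypothetical input for this sketch).
    """
    return ost_indices[stripe_index(offset, stripe_size, len(ost_indices))]


if __name__ == "__main__":
    MiB = 1 << 20
    osts = [12, 7, 3, 9]  # made-up OST indices for a 4-stripe file
    for off in (0, MiB, 5 * MiB + 10, 8 * MiB):
        print(off, stripe_index(off, MiB, len(osts)), ost_for_offset(off, MiB, osts))
```

With a 1MB stripe_size and 4 stripes, offset 5MB+10 falls in chunk 5, which
wraps to stripe 1, i.e. the second OST in the file's layout.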

> - Is this distribution decided at file creation? In other words, is this
> distribution based only on offsets in file?

Yes, round-robin (RAID-0) striping is currently the only form of file layout,
and the OST object allocation is done when the file is first opened.  The
OST object used for a given chunk is selected round-robin, based only on the
file offset, as shown above.  It is possible to "change" the layout of a file
after it has been written using the "lfs migrate" command, though this
essentially rewrites the file content after the fact to map it to new
objects/OSTs as requested.

We are also working on new features, PFL for the Lustre 2.10 release (see
http://wiki.lustre.org/images/1/1a/Progressive-File-Layouts_Hammond.pdf) and DoM
for 2.11 (see http://wiki.lustre.org/images/8/8f/LUG2014-DataOnMDT-Pershin.pdf),
that will allow each file's layout to have different segments based on the file
offset, so that it is possible to have a different stripe count, stripe size, and
even a different class of storage based on the file offset (e.g. SSD for the
first 1MB index, HDD for the rest of the file).

This will allow a great deal of flexibility for file layouts if applications/libraries
need it, and will improve "out of the box" performance for users that don't want to
deal with the details.
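
To make the "segments based on the file offset" idea concrete, here is a rough
Python sketch of looking up which segment covers a given offset. This is only
an illustration: the Segment type, its field names, and the example values are
all hypothetical, and it does not reflect the actual Lustre layout format or API.

```python
from typing import NamedTuple, Optional, Sequence


class Segment(NamedTuple):
    """One offset range of a file with its own striping (hypothetical type)."""
    start: int            # first byte covered (inclusive)
    end: Optional[int]    # first byte NOT covered; None means "to end of file"
    stripe_size: int
    stripe_count: int
    storage: str          # illustrative storage-class label, e.g. "ssd" or "hdd"


def segment_for_offset(segments: Sequence[Segment], offset: int) -> Segment:
    """Return the segment whose [start, end) range contains `offset`."""
    for seg in segments:
        if seg.start <= offset and (seg.end is None or offset < seg.end):
            return seg
    raise ValueError("no segment covers offset %d" % offset)


MiB = 1 << 20

# Made-up example values: first 1MB on SSD with a single stripe,
# rest of the file on HDD with wider striping.
EXAMPLE_LAYOUT = [
    Segment(start=0, end=MiB, stripe_size=MiB, stripe_count=1, storage="ssd"),
    Segment(start=MiB, end=None, stripe_size=4 * MiB, stripe_count=8, storage="hdd"),
]

if __name__ == "__main__":
    for off in (0, MiB - 1, MiB, 100 * MiB):
        print(off, segment_for_offset(EXAMPLE_LAYOUT, off).storage)
```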

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation