[Lustre-devel] Wide striping
Alex Kulyavtsev
aik at fnal.gov
Thu Oct 20 09:24:53 PDT 2011
On Oct 3, 2011, at 3:15 PM, Nathan Rutman wrote:
> ... snip...
> We have been thinking about a different wide-striping method that
> doesn't have these problems. The basic idea is to create a new
> stripe type that encodes the list of OSTs compactly, and then using
> the same (or a calculable) object identifier (or FID) on all these
> OSTs.
>
>
> Our version of widestriping does not involve increasing the EA size
> at all, but instead utilizes a new stripe pattern. (This will not
> be understandable by older Lustre versions, which will generate an
> error locally, or potentially we can convert into the BZ-4424 form
> if the layout fits in that format). A bitmap will identify which
> OSTs hold a stripe of this file. The bitmap should probably fit into
> current ext4 EA size limit, giving us ~32k stripes.
>
> Some OST’s may be down at file creation time, or new OSTs added
> later; hence there will likely be holes in the bitmap (but
> relatively few).
1) There will also be holes when OST pools are used: if a file can be
written only to the OSTs of a specific OST pool, and the OSTs configured
in that pool do not form a contiguous set, then there will be holes in
the OST bitmap even if all OSTs are online.
2) "relatively few holes [in bitmap]" - did you consider compressing the
bitmap, e.g. with BBC or WAH as described at
en.wikipedia.org/wiki/Bitmap_index#Compression ? Reportedly you can do
bitwise operations without decompression. This way you can go up in the
number of stripes (well, 32k is a big number). It may also help control
RPC size: you may represent wide striping with a few integers that
effectively describe contiguous blocks of OSTs and the holes between
them. The size of the descriptor is then a function of the number of
blocks and holes and, to a lesser extent, of the number of stripes.
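To illustrate the point about descriptor size, here is a minimal sketch in Python. It uses plain run-length extents, not BBC or WAH themselves, and the function names are made up for this example. The bitmap is written as a string of '0'/'1' characters with index 0 on the left, which is an assumption of the sketch, not a proposed on-disk or wire format.

```python
def bitmap_to_extents(bits):
    """Return inclusive (first, last) index pairs, one per run of '1' bits."""
    extents = []
    start = None
    for i, b in enumerate(bits):
        if b == "1" and start is None:
            start = i                       # a new run of set bits begins
        elif b != "1" and start is not None:
            extents.append((start, i - 1))  # the run ended on the previous bit
            start = None
    if start is not None:                   # run reaches the end of the bitmap
        extents.append((start, len(bits) - 1))
    return extents


def extents_to_bitmap(extents, width):
    """Rebuild the '0'/'1' string of the given width from inclusive extents."""
    bits = ["0"] * width
    for first, last in extents:
        for i in range(first, last + 1):
            bits[i] = "1"
    return "".join(bits)


# Two contiguous blocks of OSTs need two integer pairs, however wide they are.
example = "0000000111111111000000111111"
extents = bitmap_to_extents(example)
```

The number of pairs grows with the number of blocks and holes, not with the number of OSTs covered by each block.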
More:
It is possible to have two bitmaps:
0000000111111111000000111111 - (1) describes the general "blocks" of OSTs
= ((beg1,end1),(beg2,end2))
0000000000000010100100100000 - (2) describes "corrections": drop
two OSTs, add two OSTs; here 4 bits, compressed to X bytes
0000000111111101100100011111 - the OST map, computed on the client as the
bitwise XOR of uncompressed maps (1) and (2)
Each of the two maps is compressed for transfer, and thus should not take
much space.
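The three bitmaps above can be checked with a short Python sketch. The function and variable names are hypothetical, and the bitmaps are handled as '0'/'1' strings purely for illustration; compression for transfer is left out.

```python
BLOCKS      = "0000000111111111000000111111"  # map (1): general blocks of OSTs
CORRECTIONS = "0000000000000010100100100000"  # map (2): OSTs to drop or add


def xor_bitmaps(a, b):
    """Bitwise XOR of two equal-length '0'/'1' strings, keeping leading zeros."""
    if len(a) != len(b):
        raise ValueError("bitmaps must have the same width")
    return format(int(a, 2) ^ int(b, 2), "0{}b".format(len(a)))


ost_map = xor_bitmaps(BLOCKS, CORRECTIONS)

# A correction bit inside a block drops that OST; one outside a block adds it.
dropped = [i for i, (b, c) in enumerate(zip(BLOCKS, CORRECTIONS))
           if b == "1" and c == "1"]
added = [i for i, (b, c) in enumerate(zip(BLOCKS, CORRECTIONS))
         if b == "0" and c == "1"]

print(ost_map)  # 0000000111111101100100011111
```

With bit 0 taken as the leftmost position, the corrections drop the OSTs at indices 14 and 22 and add those at 16 and 19.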
3) If the metadata file format is going to be changed, is this the right
time to reserve descriptors for keeping a few replicas of the file data?
In that case we need the number of replicas and a layout descriptor
for each replica. Each replica may have a different number of stripes,
so you can have a widely striped file replica on SAS disks (or in
flash) and replicate it to slower disk storage with one or "few"
stripes for further tape archival.
I assume that after the initial writes a file has more or less "stable"
content. Replicas can be on different media types, like flash / SAS /
SATA, fast / cheap disks, effectively Hierarchical Storage.
I'm thinking about "lazy" replication, as you implemented to replicate
data to another file system, but in this case replication is within the
same Lustre file system. The client becomes aware of multiple replicas
and can choose which file replica to use (e.g. when some OSTs are down).
This eliminates the OST as a single point of failure.
Alex.