[Lustre-discuss] MDS disk recommendations

Andreas Dilger adilger at sun.com
Mon Jun 16 11:31:12 PDT 2008


On Jun 16, 2008  14:00 -0400, Brock Palen wrote:
> This paper does not talk about the MDS, though.  We have a Sun 2540
> with 12 15K RPM 300 GB drives.
>
> We plan to use 4 drives in a RAID 1+0, with the rest being spares.  What
> I am curious about are the following options:
>
> Stripe size
> Readahead on the MDS RAID

There is a discussion about MDS + RAID in the Lustre Manual, section 10.

When formatting a filesystem on a RAID device, it is beneficial to specify
additional parameters at the time of formatting. This ensures that the
filesystem is optimized for the underlying disk geometry. Use the
--mkfsoptions parameter to specify these options in the Lustre configuration.

For RAID5, RAID6, or RAID1+0 storage, specifying the -E stride={stride_size}
option improves the layout of the filesystem metadata, ensuring that no single
disk contains all of the allocation bitmaps. The stride_size parameter is in
units of 4096-byte blocks and represents the amount of contiguous data written
to a single disk before moving to the next disk. This is applicable to both
MDS and OST filesystems.
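
As a rough illustration of the stride arithmetic described above, the
following Python sketch converts a RAID chunk size into 4096-byte blocks and
builds the matching --mkfsoptions string. The helper names and the 64 KiB
chunk size are assumptions made for the example, not values taken from the
manual or from this thread; use whatever chunk size your array is actually
configured with.

```python
# Sketch only: derive the stride value from a RAID chunk size.
BLOCK_SIZE = 4096  # bytes per filesystem block, as stated in the text above


def stride_blocks(chunk_kib: int) -> int:
    """Return the stride in 4096-byte blocks for a chunk size given in KiB."""
    chunk_bytes = chunk_kib * 1024
    if chunk_bytes <= 0 or chunk_bytes % BLOCK_SIZE != 0:
        raise ValueError("chunk size must be a positive multiple of 4096 bytes")
    return chunk_bytes // BLOCK_SIZE


def mkfs_options(chunk_kib: int) -> str:
    """Build the --mkfsoptions argument that carries the stride setting."""
    return '--mkfsoptions="-E stride={}"'.format(stride_blocks(chunk_kib))


if __name__ == "__main__":
    # 64 KiB is a hypothetical chunk size chosen purely for illustration.
    print(mkfs_options(64))
```

With the assumed 64 KiB chunk, each disk receives 16 blocks before the array
moves to the next disk, so the stride is 16.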

Note - It is better to have the MDS on RAID1+0 than on RAID5 or RAID6.

For the MDT, use RAID1 with an internal journal and two disks from different
controllers.
If you need a larger MDT, create multiple RAID1 devices from pairs of
disks, and then make a RAID0 array of the RAID1 devices.  This ensures
maximum reliability because multiple disk failures only have a small
chance of hitting both disks in the same RAID1 device.

Doing the opposite (RAID1 of a pair of RAID0 devices) means that even two
disk failures have roughly a 50% chance of causing the loss of the whole MDT
device. The first failure disables an entire half of the mirror, and the
second failure has roughly a 50% chance of disabling the remaining half.
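
To make the comparison between the two layouts concrete, here is a small
Python sketch that enumerates every possible pair of failed disks and counts
how many pairs destroy the array. It assumes a simplified model in which
exactly two distinct disks fail and every pair is equally likely; the function
name and layout labels are made up for this example. Under that model the
RAID0+1 figure works out to n/(2n-1) for n disks per half, which is 2/3 for a
four-disk array and approaches the 50% quoted above only as the array grows.

```python
from fractions import Fraction
from itertools import combinations


def loss_probability(layout: str, pairs: int) -> Fraction:
    """Chance that two distinct failed disks destroy the whole array.

    "1+0": RAID0 striped across `pairs` RAID1 mirrors of two disks each.
    "0+1": RAID1 mirror of two RAID0 halves holding `pairs` disks each.
    """
    if layout not in ("1+0", "0+1"):
        raise ValueError("layout must be '1+0' or '0+1'")
    total_disks = 2 * pairs

    def fatal(a: int, b: int) -> bool:
        if layout == "1+0":
            # Disks 2k and 2k+1 form one mirror; losing both loses data.
            return a // 2 == b // 2
        # Disks [0, pairs) are one half; one failure in each half is fatal.
        return a // pairs != b // pairs

    cases = list(combinations(range(total_disks), 2))
    return Fraction(sum(fatal(a, b) for a, b in cases), len(cases))


if __name__ == "__main__":
    for n in (2, 6):
        print(2 * n, "disks:",
              "RAID1+0", loss_probability("1+0", n),
              "RAID0+1", loss_probability("0+1", n))
```

For the four-disk case the sketch gives 1/3 for RAID1+0 and 2/3 for RAID0+1,
which is consistent with the recommendation to stripe across mirrors rather
than mirror across stripes.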

> I did not find anything in the manual about this, other than to disable
> readahead on DDN hardware, but that sounded like it applied to OSTs, not
> the MDS.

Readahead will not have much benefit for the MDT, because most of the
IO is random.  The chunksize for RAID1 is mostly meaningless.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



