[Lustre-discuss] Anybody actually using Flash (Fusion IO specifically) for meta data?

Andreas Dilger adilger at whamcloud.com
Thu May 19 11:44:47 PDT 2011


On May 19, 2011, at 10:28, Kevin Van Maren wrote:
> Dardo D Kleiner - CONTRACTOR wrote:
>> Short answer: of course it works - they're just block devices after all - but you'll find that you won't realize the performance gains you might expect (at least not for an MDT).
>> 
> 
> Yes.  See the email thread "improving metadata performance" and Robin 
> Humble's talk at LUG.  The MDT disk is rarely the bottleneck (although 
> that could change with full size-on-mds support), which others had 
> discovered using a ram-based (tmpfs) MDT.

I will assert that MDT disk performance is rarely the bottleneck only for
filesystem-modifying operations, because the seek latency is largely hidden
by the linear IO of the journal, and because most metadata benchmarks are
run on test filesystems that are empty (i.e. the free inodes are all contiguous).
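For reference, the empty-filesystem numbers people quote usually come from a
metadata benchmark such as mdtest.  A typical invocation might look like the
following (the mount point, rank count, and file counts are placeholders, not
a recommended configuration):

```shell
# Hypothetical mdtest run against a freshly-formatted Lustre filesystem:
# 64 MPI ranks, each creating/stating/removing 1000 files, 3 iterations,
# each rank in its own working directory (-u).  On an empty MDT the free
# inodes are contiguous, so the results are a best case, not what an aged
# filesystem will deliver.
mpirun -np 64 mdtest -n 1000 -i 3 -u -d /mnt/lustre/mdtest
```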

I think that for real-world usage on aged filesystems, and/or for cold-cache
operations (just after mounting, or with a working set larger than fits in
RAM), an SSD can help significantly.
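For anyone who wants to experiment with SSD-backed metadata today, one
low-effort option is to put a flash cache (e.g. Facebook's flashcache, which
Dardo mentions below) in front of the MDT block device.  A rough sketch, with
all device paths being placeholders, and noting that the writeback ("back")
mode shown trades safety for speed:

```shell
# Hypothetical: create a writeback flashcache device "mdt_cache" using an
# SSD (/dev/fioa) as cache in front of the rotating MDT LUN (/dev/sdb).
flashcache_create -p back mdt_cache /dev/fioa /dev/sdb

# The cached device appears under device-mapper; mount it in place of the
# raw MDT device.  Note that a local PCIe cache also rules out failover,
# since the dirty cache blocks live only in this node.
mount -t lustre /dev/mapper/mdt_cache /mnt/mdt
```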

> As for putting the entire filesystem on flash, sure that would be pretty 
> nifty, but expensive.  Not being able to do failover, with storage on 
> internal PCIe cards, is a downside.

I doubt this will be possible for a long time to come, due to cost, even if
the PCI cards have external interfaces (as I've heard some high-end ones do).

>> Aside from simply being fast OSTs, there are several areas that would allow Lustre to take advantage of these kinds of devices:
>> 
>> 1) SMP scaling for the MDS - the problem right now is that the low latency of these devices really shines best when you have many threads scattering small I/O.  The current (1.8.x) Lustre MDS doesn't 
>> do this.
>> 
> 
> SMP scaling is a big issue.  In Lustre 1.8.x the maximum performance is 
> reached at no more than 8 CPUs (maybe fewer) for the MDT -- additional CPU 
> cores result in _lower_ performance.  There are patches for Lustre 2.x to 
> improve SMP scaling, but I haven't tested them under a real workload.
> 
>> 2) Flashcache/bcache over traditional disk storage (OST or MDT) - this can be done today, of course.  There are some interop issues in my testing, but when it works it does what it says it does.  It 
>> still won't really help an MDT though.
>> 3) Targeted device mapping of the metadata portions of an OST on traditional disk (e.g. extent lists) onto flash.
>> 
>> #1 is substantial work (ongoing, I believe).  #2 is pretty nifty - basically growing your local page cache beyond RAM - which helps when the "hot" working set is large.  #3 is trickier, and though I 
>> haven't tried it I understand there's real effort ongoing in this regard.
>> 
> 
> flex_bg is in ext4, which allows the inodes to be packed together.

As an FYI, a patch to enable flex_bg (and other ext4 features) by default was
just landed on the master branch for 2.1.  It also reduces the number of inodes
created on large OSTs (i.e. pretty much any new OST), and increases the number
of inodes created on the MDT.  That is more in line with typical Lustre usage
today, and testing so far has shown that flex_bg reduces mke2fs and e2fsck time
noticeably.  The higher MDT inode ratio is also helpful for flash users, since
it uses the space on the MDT more efficiently.
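To put the inode-ratio point in concrete terms: the MDT inode count is simply
the device size divided by the bytes-per-inode ratio chosen at format time
(the 2048-byte ratio and 100 GiB device size below are only illustrations,
not the new defaults):

```shell
# Hypothetical sizing math for a 100 GiB flash MDT formatted with
# "mke2fs -i 2048" (one inode per 2048 bytes of device space):
mdt_bytes=$((100 * 1024 * 1024 * 1024))
bytes_per_inode=2048
echo $((mdt_bytes / bytes_per_inode))   # prints 52428800 (~52M inodes)
```

Halving the bytes-per-inode ratio doubles the number of files the MDT can
hold, which is why packing inodes more densely matters on small, expensive
flash devices.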

>> Filesystem size in this discussion is mostly irrelevant for an MDT; it's just a question of whether the device is big enough for the number of objects (a few million is *not* many).  A huge number of 
>> clients thrashing about creating/modifying/deleting is where these things have the most potential.
>> 
>> - Dardo
>> 
>> On 5/16/11 2:58 PM, Carlson, Timothy S wrote:
>> 
>>> Folks,
>>> 
>>> I know that flash based technology gets talked about from time to time on the list, but I was wondering if anybody has actually implemented FusionIO devices for metadata. The last thread I can find on the mailing list that relates to this topic dates from 3 years ago. The software driving the Fusion cards has come quite a ways since then and I've got good experience using the device as a raw disk. I'm just fishing around to see if anybody has implemented one of these devices in a reasonably sized Lustre config where "reasonably" is left open to interpretation. I'm thinking >500T and a few million files.
>>> 
>>> Thanks!
>>> 
>>> Tim
>>> 
>>> 
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss


Cheers, Andreas
--
Andreas Dilger 
Principal Engineer
Whamcloud, Inc.
