[Lustre-discuss] Anybody actually using Flash (Fusion IO specifically) for meta data?

Kevin Van Maren kevin.van.maren at oracle.com
Thu May 19 09:28:07 PDT 2011


Dardo D Kleiner - CONTRACTOR wrote:
> Short answer: of course it works - they're just block devices after all - but you'll find that you won't realize the performance gains you might expect (at least not for an MDT).
>   

Yes.  See the email thread "improving metadata performance" and Robin 
Humble's talk at LUG.  The MDT disk is rarely the bottleneck (although 
that could change with full size-on-MDS support), as others 
discovered when testing with a RAM-based (tmpfs) MDT.

As for putting the entire filesystem on flash, sure that would be pretty 
nifty, but expensive.  Not being able to do failover, with storage on 
internal PCIe cards, is a downside.

> Aside from simply being fast OSTs, there are several areas that would allow Lustre to take advantage of these kinds of devices:
>
> 1) SMP scaling for the MDS - the problem right now is that the low latency of these devices really shines best when you have many threads scattering small I/O.  The current (1.8.x) Lustre MDS doesn't 
> do this.
>   
SMP scaling is a big issue.  In Lustre 1.8.x, MDT performance peaks at 
no more than 8 CPUs (maybe fewer); additional CPU cores 
result in _lower_ performance.  There are patches for Lustre 2.x to 
improve SMP scaling, but I haven't tested them under a workload.

> 2) Flashcache/bcache over traditional disk storage (OST or MDT) - this can be done today, of course.  There are some interop issues in my testing, but when it works it does what it says it does.  It 
> still won't really help an MDT though.
> 3) Targeted device mapping of the metadata portions of an OST on traditional disk (e.g. extent lists) onto flash.
>
> #1 is substantial work (ongoing I believe).  #2 is pretty nifty, basically grow your local page cache beyond RAM - helps when "hot" working set is large.  #3 is trickier and though I haven't tried it 
> I understand there's real effort ongoing in this regard.
>   

ext4 has the flex_bg feature, which allows the inodes to be packed together.
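
As a rough illustration of what "packed together" means, here is a 
simplified model in Python.  It is only a sketch, not the actual ext4 
allocator: it assumes the inode tables of every block group in a flex 
group are placed in the first group of that flex group, and the function 
names are made up for this example.

```python
from typing import List


def metadata_home_group(group: int, flex_size: int) -> int:
    """Block group assumed to hold the inode table for `group`.

    Simplified model: each run of `flex_size` consecutive block groups
    forms one flex group, and its metadata lives in the first member.
    """
    return group - (group % flex_size)


def inode_table_homes(num_groups: int, flex_size: int) -> List[int]:
    """Home group of the inode table for each of `num_groups` groups."""
    return [metadata_home_group(g, flex_size) for g in range(num_groups)]


if __name__ == "__main__":
    # flex_size 1 models no packing: every group keeps its own inode table.
    print(inode_table_homes(8, 1))  # [0, 1, 2, 3, 4, 5, 6, 7]
    # flex_size 4: inode tables cluster in groups 0 and 4.
    print(inode_table_homes(8, 4))  # [0, 0, 0, 0, 4, 4, 4, 4]
```

With the tables clustered like this, the metadata for an OST occupies a 
few contiguous regions rather than being spread across every block group, 
which is what would make mapping it onto a small flash device practical.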

> Filesystem size in this discussion is mostly irrelevant for an MDT; it's just a question of whether or not the device is big enough for the number of objects (a few million is *not* many).  A huge number of clients 
> thrashing about creating/modifying/deleting is where these things have the most potential.
>
> - Dardo
>
> On 5/16/11 2:58 PM, Carlson, Timothy S wrote:
>   
>> Folks,
>>
>> I know that flash-based technology gets talked about from time to time on the list, but I was wondering if anybody has actually implemented FusionIO devices for metadata. The last thread I can find on the mailing list that relates to this topic dates from 3 years ago. The software driving the Fusion cards has come quite a ways since then and I've got good experience using the device as a raw disk. I'm just fishing around to see if anybody has implemented one of these devices in a reasonably sized Lustre config where "reasonably" is left open to interpretation. I'm thinking >500T and a few million files.
>>
>> Thanks!
>>
>> Tim
>>
>>     
