[Lustre-discuss] is Luster ready for prime time?

Lind, Bobbie J bobbie.j.lind at intel.com
Mon Jan 21 11:25:17 PST 2013


Indivar,

I would be very interested to see what tuning parameters you have set for
Lustre and the storage for small files.  I have had similar setups in the
past and been stumped by small-file performance.
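
For comparison, something along the lines of the sketch below is what I
would be looking to line up side by side - it just shells out to lctl on a
client.  The parameter names are examples only and vary between Lustre
versions.

#!/usr/bin/env python
# Rough sketch: dump a few client-side Lustre tuning parameters in one go
# so they can be pasted into a reply.  Parameter names differ between
# Lustre versions; adjust the list for your release.
import subprocess

PARAMS = [
    "osc.*.max_rpcs_in_flight",   # RPCs kept in flight per OST
    "osc.*.max_dirty_mb",         # dirty cache allowed per OSC
    "llite.*.max_read_ahead_mb",  # client read-ahead window
    "llite.*.max_cached_mb",      # client page cache limit
]

# 'lctl get_param' accepts several patterns at once and prints name=value
# pairs, which is an easy format to compare in a thread.
subprocess.call(["lctl", "get_param"] + PARAMS)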

-- 
Bobbie Lind



>Date: Mon, 21 Jan 2013 11:24:32 -0500
>From: greg whynott <greg.whynott at gmail.com>
>Subject: Re: [Lustre-discuss] is Luster ready for prime time?
>To: Indivar Nair <indivar.nair at techterra.in>
>Cc: "lustre-discuss at lists.lustre.org"
>	<lustre-discuss at lists.lustre.org>
>Message-ID:
>	<CAKuzA1G4-W122LQrf3VKqADd=WrDgcAVx5hyAGJfZwwR8KKG2g at mail.gmail.com>
>Content-Type: text/plain; charset="utf-8"
>
>Thanks very much, Indivar - an informative read.  It is good to see others
>in our sector using the technology, and you make some good points.
>
>have a great day,
>greg
>
>
>
>On Sat, Jan 19, 2013 at 6:52 AM, Indivar Nair
><indivar.nair at techterra.in> wrote:
>
>>  Hi Greg,
>>
>> One of our customers had a similar requirement, and we deployed Lustre
>> 2.0.0.1 for them in July 2011.  Though there were a lot of problems
>> initially, all of them were sorted out over time.  They are quite happy
>> with it now.
>>
>> *Environment:*
>> It is a 150-artist studio with around 60 render nodes.  The studio mainly
>> uses Mocha, After Effects, Silhouette, SynthEyes, Maya, and Nuke, among
>> others.  They mainly work on 3D effects and stereoscopic conversions.
>> Around 45% of the artists and render nodes are on Linux and use the
>> native Lustre client.  All others access it through Samba.
>>
>> *Lustre Setup:*
>> It consists of 2 x Dell R610 as MDS nodes and 4 x Dell R710 as OSS nodes.
>> 2 x Dell MD3200 arrays with 12 x 1TB nearline SAS disks are used for
>> storage.  Each Dell MD3200 is shared between 2 OSS nodes for HA.
>>
>> Since the original plan (which didn't happen) was to move to a 100% Linux
>> environment, we didn't allocate separate Samba gateways; instead we run
>> Samba with CTDB on the OSS nodes.  Thankfully, we haven't had any issues
>> with that yet.
>>
>> *Performance:*
>> We get good throughput of 800 - 1000MB/s with Lustre caching.  The disks
>> themselves provide much lower speeds, but that is fine, as caching is in
>> effect most of the time.
>>
>> *Challenge:*
>> The challenge for us was to tune the storage for small files of 10 -
>> 50MB, totalling tens of GBs per shot.  An average shot consists of 2000 -
>> 4000 .dpx images.  Some scenes / shots also had millions of <1MB Maya
>> cache files.  This did tax the storage, especially the MDS; we fixed it
>> to an extent by adding more RAM to the MDS.
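>>
>> For anyone hitting the same wall, below is a minimal sketch of the kind
>> of MDS-side check we run, assuming shell access to the MDS and a working
>> lctl.  The parameter names vary between Lustre versions, so treat them as
>> examples rather than a definitive list.
>>
>> #!/usr/bin/env python
>> # Rough sketch: snapshot Lustre memory use and per-MDT metadata
>> # counters via lctl, run on the MDS itself.
>> import subprocess
>>
>> PATTERNS = [
>>     "memused",                    # memory used by the Lustre modules
>>     "mdt.*.md_stats",             # per-MDT open/getattr/etc. counters
>>     "ldlm.namespaces.*.lru_size", # lock LRUs; tiny files inflate these
>> ]
>>
>> for pattern in PATTERNS:
>>     try:
>>         out = subprocess.check_output(["lctl", "get_param", pattern])
>>         print(out.decode().strip())
>>     except (OSError, subprocess.CalledProcessError) as err:
>>         print("could not read %s: %s" % (pattern, err))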
>>
>> *Suggestions:*
>>
>> 1. Get the real number of small files (I mean <1MB ones) created / used
>> by all the software.  These are the ones that could give you the most
>> trouble.  Do not assume anything; measure it (see the census sketch after
>> this list).
>>
>> 2. Get the file sizes, counts and access patterns absolutely correct.
>> This is the key.  It is much easier to design and tune Lustre for large
>> files and large I/O.
>>
>> 3. Network tuning is as important as storage tuning.  Tune the switches,
>> each workstation, the render nodes, the Samba / NFS gateways, the OSS
>> nodes, the MDS nodes - everything.
>>
>> 4. Similarly, do not underestimate the Samba / NFS gateways.  Size and
>> tune them correctly too.
>>
>> 5. Use high-speed interconnects such as QDR InfiniBand or 40GigE,
>> especially for the backend connectivity between the Samba / NFS gateways
>> and the Lustre MDS / OSS nodes.
>>
>> 6. As far as possible, have a fixed directory pattern for all projects.
>> Separate working files (Maya, Nuke, etc.) from the data, i.e. frames /
>> images, videos, etc., at the top directory level itself.  This will help
>> you tune / manage the storage better: a different directory tree for
>> different file sizes or file access types (see the striping sketch
>> below).
>>
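>> To make points 1 and 2 concrete, below is a minimal sketch of the kind of
>> census script we run against a representative project tree before sizing
>> anything.  The 1MB threshold and the bucket edges are examples only -
>> adjust them to match your own pipeline.
>>
>> #!/usr/bin/env python
>> # Rough sketch: walk a project tree and bucket files by size, so the
>> # small-file count is measured rather than assumed.
>> import os
>> import sys
>>
>> BUCKETS = [                    # (label, upper bound in bytes)
>>     ("<1MB",    1 << 20),
>>     ("1-10MB",  10 << 20),
>>     ("10-50MB", 50 << 20),
>>     (">=50MB",  None),
>> ]
>>
>> def census(root):
>>     counts = dict((label, 0) for label, _ in BUCKETS)
>>     total = 0
>>     for dirpath, _dirnames, filenames in os.walk(root):
>>         for name in filenames:
>>             try:
>>                 size = os.path.getsize(os.path.join(dirpath, name))
>>             except OSError:
>>                 continue        # file vanished or unreadable; skip it
>>             total += 1
>>             for label, bound in BUCKETS:
>>                 if bound is None or size < bound:
>>                     counts[label] += 1
>>                     break
>>     return counts, total
>>
>> if __name__ == "__main__":
>>     root = sys.argv[1] if len(sys.argv) > 1 else "."
>>     counts, total = census(root)
>>     print("scanned %d files under %s" % (total, root))
>>     for label, _ in BUCKETS:
>>         print("  %-8s %d" % (label, counts[label]))
>>
>> Running it on a couple of finished shots rather than the whole filesystem
>> is usually enough to show whether the <1MB population is in the thousands
>> or in the millions.
>>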
>> If designed and tuned right, I think Lustre is the best storage currently
>> available for your kind of work.
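>>
>> As an illustration of point 6, here is a short sketch of how a fixed
>> directory layout lets you apply different default striping per tree with
>> 'lfs setstripe'.  The paths and stripe counts below are made up for
>> illustration - pick values that suit your own layout and OST count.
>>
>> #!/usr/bin/env python
>> # Rough sketch: give small-file working directories a single stripe and
>> # large frame directories a wide stripe.  Paths and counts are
>> # illustrative only.
>> import subprocess
>>
>> LAYOUT = {
>>     # small working files (Maya scenes, Nuke scripts): one stripe
>>     "/lustre/projects/showX/work":   ["-c", "1"],
>>     # large frame sequences and video: stripe across all OSTs
>>     "/lustre/projects/showX/frames": ["-c", "-1"],
>> }
>>
>> for path, opts in LAYOUT.items():
>>     cmd = ["lfs", "setstripe"] + opts + [path]
>>     try:
>>         subprocess.check_call(cmd)
>>         print("set default striping on %s" % path)
>>     except (OSError, subprocess.CalledProcessError) as err:
>>         print("failed on %s: %s" % (path, err))
>>
>> New files inherit the default striping of the directory they are created
>> in, so setting this once per project template is usually enough.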
>>
>> Hope this helps.
>>
>> Regards,
>>
>>
>> Indivar Nair
>>
>>
>> On Fri, Jan 18, 2013 at 1:51 AM, greg whynott
>>> <greg.whynott at gmail.com> wrote:
>>
>>> Hi Charles,
>>>
>>>   I received a few challenging off-list email messages along with a few
>>> fishing ones, but it's all good.  It's interesting how a post asking a
>>> question can make someone appear angry.  8)
>>>
>>> Our IO profiles from the different segments of our business do vary
>>> greatly.  The HPC side is more or less the typical load you would expect
>>> to see, depending on which software is in use for the job being run.  We
>>> have hundreds of artists and administrative staff who use the file
>>> system in a variety of ways.  Some examples would include, but are not
>>> limited to: saving out multiple revisions of Photoshop documents
>>> (typically in the hundreds of megabytes to 1GB+ range); video editing of
>>> stereoscopic 2K and 4K images (again from tens or hundreds of megabytes
>>> to gigabytes in size), including uncompressed video; Excel, Word and
>>> similar files; and thousands of project files (from software such as
>>> Maya, Nuke and similar), which also vary largely in size, from one to
>>> thousands of megabytes.
>>>
>>> The intention is to keep our databases and VM requirements on the
>>> existing file system, which is comprised of about 100 x 10k SAS drives;
>>> it works well.
>>>
>>> We did consider GPFS, but that consideration went out the door once I
>>> started talking to them and hammering some numbers into their online
>>> calculator.  Things got a bit crazy quickly.  They have different
>>> pricing for the different types and speeds of Intel CPUs.  I got the
>>> feeling they were trying to squeeze every penny they could out of
>>> customers.  It felt very Brocade-ish and left a bad taste with us.  It
>>> wouldn't have been much of a problem at some other shops I've worked at,
>>> but here we do have a finite budget to work within.
>>>
>>> The NAS vendors could all be considered scale-out, I suspect.  All three
>>> can scale out the storage and the front end.  NetApp C-mode can have up
>>> to 24 heads, BlueArc goes up to 4 or 8 depending on the class, and
>>> Isilon can go up to 24 nodes or more as well, if memory serves me
>>> correctly, and they all have a single-namespace solution in place.  They
>>> each have their limits, but for our use case those limits are largely
>>> moot.  We will not hit the limits of their scalability before we are
>>> considering a forklift refresh.  In our view, for what they offer it is
>>> pretty much a wash - any of them would meet our needs.  NetApp still has
>>> a silly aggregate/volume size limit; at least it is up to 90TB now (up
>>> from 9TB in the past, formatted fs usage).  In April it is supposed to
>>> go much higher.
>>>
>>>  The block storage idea in the mix - since all our HPC is Linux, the HPC
>>> nodes would all become Lustre clients.  To provide a gateway into the
>>> Lustre storage for non-Linux / non-Lustre hosts, the thinking was a
>>> clustered pair of Linux boxes running Samba / NFS which were also Lustre
>>> clients.  It's just an idea being bounced around at this point.  The
>>> data-serving requirements of the non-HPC parts of the business are much
>>> lower.  The video editors most likely would stay on our existing storage
>>> solution, as that is working out very well for them, but even if we did
>>> put them onto the Lustre FS, I think they would be fine.  Based on that,
>>> it didn't seem so crazy to consider block access in this manner.  That
>>> said, I think we would be one of the first in M&E to do so - pioneers,
>>> if you will...
>>>
>>>
>>> On diversifying - we will end up in the same boat for the same reasons.
>>>
>>>
>>> thanks Charles,
>>> greg
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jan 17, 2013 at 2:20 PM, Hammitt, Charles Allen <
>>> chammitt at email.unc.edu> wrote:
>>>
>>>>
>>>> Somewhat surprised that no one has responded yet; although it's likely
>>>> that the responses would be rather subjective - including mine, of
>>>> course!
>>>>
>>>> Generally I would say that it would be interesting to know more about
>>>> your datasets and intended workload; however, you mention this is to be
>>>> used as your day-to-day main business storage, so I imagine those
>>>> characteristics would greatly vary - mine certainly do; that much is
>>>> for sure!
>>>>
>>>> I don't really think uptime would be as much of an issue here; there
>>>> are lots of redundancies, recovery mechanisms, and plenty of stable
>>>> branches to choose from.  The question becomes: what are the
>>>> feature-set needs, the performance and usability for different file
>>>> types and workloads, and the general comfort level with greater
>>>> complexity and somewhat fewer resources?  That said, I'd personally be
>>>> a bit wary of using it as a general filesystem for *all* your needs.
>>>>
>>>> I do find it interesting that your short list is a wide-ranging mix of
>>>> storage and filesystem types: traditional NAS, scale-out NAS, and then
>>>> some block storage with a parallel filesystem in Lustre.  Why no GPFS
>>>> on the list for comparison?
>>>>
>>>> I currently manage, or have used in the past [BlueArc], all the storage
>>>> / filesystems on your list and more.  The reason is that different
>>>> storage and filesystem components are good at some things while not as
>>>> good at others.  So I diversify by putting the different storage /
>>>> filesystem pieces in the areas where they excel.
>>>>
>>>> Regards,
>>>>
>>>> Charles
>>>>
>>>> From: lustre-discuss-bounces at lists.lustre.org [mailto:
>>>> lustre-discuss-bounces at lists.lustre.org] On Behalf Of greg whynott
>>>> Sent: Thursday, January 17, 2013 12:18 PM
>>>> To: lustre-discuss at lists.lustre.org
>>>> Subject: [Lustre-discuss] is Luster ready for prime time?
>>>>
>>>> Hello,
>>>>
>>>>
>>>> I just signed up today, so please forgive me if this question has been
>>>> covered recently.  I am in a bit of a rush to get an answer on this, as
>>>> we need to make a decision soon; the idea of using Lustre was thrown
>>>> into the mix very late in the decision-making process.
>>>>
>>>>  We are looking to procure a new storage solution which will
>>>> predominantly be used for HPC output but will also be used as our main
>>>> business-centric storage for day-to-day use, meaning the file system
>>>> needs to be available 24/7/365.  The last time I was involved in
>>>> considering Lustre was about 6 years ago, and at that time it was being
>>>> considered for HPC scratch space only.
>>>>
>>>> Our VMs and databases would remain on non-Lustre storage, as we already
>>>> have that in place and it works well.  The Lustre file system would
>>>> potentially hold everything else.  Projects we work on typically take
>>>> up to 2 years to complete, and during that time we would want all
>>>> assets to remain on the file system.
>>>>
>>>> Some of the vendors on our short list include HDS (BlueArc), Isilon and
>>>> NetApp.  Last week we started bouncing around the idea of using Lustre.
>>>> I'd love to use it if it is considered stable enough to do so.
>>>>
>>>> Your thoughts and/or comments would be greatly appreciated.  Thanks for
>>>> your time.
>>>>
>>>> greg
>>>>
>>>
>>>
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>
>>>
>>



