[Lustre-discuss] is Luster ready for prime time?

Indivar Nair indivar.nair at techterra.in
Thu Jan 31 19:08:11 PST 2013


Hi Bobbie,

Small file performance is an issue.
It is the caching that balances it out. Due to the nature of the work, all
nodes in a given pool will always ask for the same set of files. So the
initial response to requests may be slow, but the subsequent ones are fine.

As I had mentioned earlier, we also had problems with listing large
directories. We worked around it by having a cron job on the Samba Gateway
stat the files in large directories at regular intervals, thereby
keeping the OSS vfs cache primed at all times.
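
A minimal sketch of what such a cache-priming job could look like. The function name, the root path you pass in, and the 1000-entry threshold for a "large" directory are all illustrative assumptions, not what runs at the customer site:

```shell
#!/bin/sh
# prime_dir_cache ROOT [MIN_ENTRIES]
# Walks ROOT and, for every directory holding at least MIN_ENTRIES entries,
# stats each entry so its attributes are pulled into the cache.
# ROOT and the default threshold of 1000 are assumptions for illustration.
prime_dir_cache() {
    root=$1
    min_entries=${2:-1000}
    find "$root" -type d | while IFS= read -r dir; do
        # Count entries (including dotfiles) in this directory.
        count=$(ls -A "$dir" | wc -l | tr -d ' ')
        if [ "$count" -ge "$min_entries" ]; then
            # ls -l stats every entry; the listing itself is discarded.
            ls -lA "$dir" > /dev/null
            echo "primed $dir ($count entries)"
        fi
    done
}
```

From cron you would call it with the project root, e.g. prime_dir_cache /mnt/lustre/projects (a hypothetical mount point), at whatever interval keeps the listings fast for you.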

Play around with these parameters on MDS, OSS and Gateway ...it works out
differently for everyone -
--------------------------------------------------------------------------------------------------------------------------------------------------------------
sysctl -w vm.vfs_cache_pressure=2
sysctl -w vm.dirty_ratio=15
sysctl -w vm.swappiness=90                 # Swapping out regularly makes more space for caches
sysctl -w vm.dirty_background_ratio=4
--------------------------------------------------------------------------------------------------------------------------------------------------------------
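
Since these values work out differently for everyone, it may help to record the current settings before changing them. A small sketch, assuming a helper name (show_vm_tunables) of my own invention; the optional directory argument exists only so the function can be tried against a scratch directory:

```shell
#!/bin/sh
# show_vm_tunables [PROC_ROOT]
# Prints the current value of each VM tunable listed above in
# "sysctl -w" form, so the output can be saved and replayed to roll back.
# PROC_ROOT defaults to /proc/sys/vm.
show_vm_tunables() {
    proc_root=${1:-/proc/sys/vm}
    for name in vfs_cache_pressure dirty_ratio swappiness dirty_background_ratio; do
        if [ -r "$proc_root/$name" ]; then
            echo "sysctl -w vm.$name=$(cat "$proc_root/$name")"
        fi
    done
}
```

Redirect the output to a file before tuning and you have a ready-made rollback script. Note that sysctl -w changes do not survive a reboot; put the values you settle on in /etc/sysctl.conf.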

On the Gateways / Clients, run after each time you mount Lustre -
--------------------------------------------------------------------------------------------------------------------------------------------------------------
pushd /proc/fs/lustre/osc
for ost in *-OST*
 do
  echo 32 > "${ost}/max_rpcs_in_flight"
 done
popd

lctl set_param osc.*.max_dirty_mb=512
--------------------------------------------------------------------------------------------------------------------------------------------------------------
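
Because these settings have to be reapplied after every mount, one option is to wrap them in a function that first checks that Lustre is actually mounted. This is only a sketch: the function names are mine, and it assumes your lctl accepts the same wildcard form for max_rpcs_in_flight as it does for max_dirty_mb above.

```shell
#!/bin/sh
# lustre_mounted [MOUNTS_FILE]
# Succeeds if MOUNTS_FILE (default /proc/mounts) lists at least one
# filesystem of type "lustre" (third field of each line).
lustre_mounted() {
    mounts_file=${1:-/proc/mounts}
    awk '$3 == "lustre" { found = 1 } END { exit !found }' "$mounts_file"
}

# tune_osc [MOUNTS_FILE]
# Applies the two client-side OSC settings from above. It does nothing
# unless Lustre is mounted and lctl is installed.
tune_osc() {
    lustre_mounted "$@" || { echo "no Lustre mount found" >&2; return 1; }
    command -v lctl > /dev/null || { echo "lctl not found" >&2; return 1; }
    # Quoted so the shell leaves the wildcard for lctl to expand.
    lctl set_param 'osc.*.max_rpcs_in_flight=32'
    lctl set_param 'osc.*.max_dirty_mb=512'
}
```

Calling tune_osc from whatever script mounts the filesystem saves you from forgetting the step on a gateway reboot.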

--------------------------------------------------------------------------------------------------------------------------------------------------------------
/proc/fs/lustre/llite/<fsname>-<uid>/max_read_ahead_mb         # at default 40MB, as most of our files are in the 10MB range
/proc/fs/lustre/llite/<fsname>-<uid>/max_read_ahead_whole_mb   # set to 10MB
/proc/fs/lustre/llite/*/statahead_max                          # set to 8192
--------------------------------------------------------------------------------------------------------------------------------------------------------------
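
A sketch of how the llite values above could be written without typing out each <fsname>-<uid> directory. The helper name set_llite_param is hypothetical, and the optional third argument is there only so it can be tried against a scratch directory instead of the real /proc tree:

```shell
#!/bin/sh
# set_llite_param NAME VALUE [LLITE_ROOT]
# Writes VALUE into NAME under every <fsname>-<uid> directory found in
# LLITE_ROOT (default /proc/fs/lustre/llite) and echoes the result.
set_llite_param() {
    name=$1
    value=$2
    llite_root=${3:-/proc/fs/lustre/llite}
    for f in "$llite_root"/*/"$name"; do
        # Skip when the glob matched nothing or the file is not writable.
        [ -w "$f" ] || continue
        echo "$value" > "$f"
        echo "$f = $(cat "$f")"
    done
}
```

For our values that would be set_llite_param max_read_ahead_whole_mb 10 and set_llite_param statahead_max 8192, run as root on each client after mounting.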

Regards,


Indivar Nair







On Tue, Jan 22, 2013 at 12:55 AM, Lind, Bobbie J <bobbie.j.lind at intel.com> wrote:

> Indivar,
>
> I would be very interested to see what tuning parameters you have set to
> tune Lustre and the storage for small files.  I have had similar setups in
> the past and been stumped by the small-file performance.
>
> --
> Bobbie Lind
>
>
>
> >Date: Mon, 21 Jan 2013 11:24:32 -0500
> >From: greg whynott <greg.whynott at gmail.com>
> >Subject: Re: [Lustre-discuss] is Luster ready for prime time?
> >To: Indivar Nair <indivar.nair at techterra.in>
> >Cc: "lustre-discuss at lists.lustre.org"
> >       <lustre-discuss at lists.lustre.org>
> >
> >Thanks very much Indivar, informative read. It is good to see others in
> >our sector are using the technology, and you have some good points.
> >
> >Have a great day,
> >greg
> >
> >
> >
> >On Sat, Jan 19, 2013 at 6:52 AM, Indivar Nair
> ><indivar.nair at techterra.in> wrote:
> >
> >>  Hi Greg,
> >>
> >> One of our customers had a similar requirement and we deployed Lustre
> >> 2.0.0.1 for them. This was in July 2011. Though there were a lot of
> >> problems initially, all of them were sorted out over time. They are
> >> quite happy with it now.
> >>
> >> *Environment:*
> >> It's a 150-artist studio with around 60 render nodes. The studio mainly
> >> uses Mocha, After Effects, Silhouette, Synth Eye, Maya, and Nuke among
> >> others. They mainly work on 3D Effects and Stereoscopy Conversions.
> >> Around 45% of the artists and render nodes are on Linux and use the
> >> native Lustre client. All others access it through Samba.
> >>
> >> *Lustre Setup:*
> >> It consists of 2 x Dell R610 as MDS Nodes, and 4 x Dell R710 as OSS
> >> Nodes. 2 x Dell MD3200 with 12x1TB SAS Nearline Disks are used for
> >> storage. Each Dell MD3200 is shared between 2 OSS nodes for H/A.
> >>
> >> Since the original plan (which didn't happen) was to move to a 100%
> >> Linux environment, we didn't allocate separate Samba Gateways and
> >> instead use the OSS nodes with CTDB for it. Thankfully, we haven't had
> >> any issues with that yet.
> >>
> >> *Performance:*
> >> We get a good THROUGHPUT of 800 - 1000MB/s with Lustre Caching. The
> >> disks themselves provide much lower speeds. But that is fine, as
> >> caching is in effect most of the time.
> >>
> >> *Challenge:*
> >> The challenge for us was to tune the storage for small files of 10 -
> >> 50MB each, totalling 10s of GBs. An average shot would consist of 2000 -
> >> 4000 .dpx images. Some Scenes / Shots also had millions of <1MB Maya
> >> Cache files. This did tax the storage, especially the MDS. We fixed it
> >> to an extent by adding more RAM to the MDS.
> >>
> >> *Suggestions:*
> >>
> >> 1. Get the real number of small files (I mean <1MB ones) created / used
> >> by all software. These are the ones that could give you the most
> >> trouble. Do not assume anything.
> >>
> >> 2. Get the file sizes, numbers and access patterns absolutely correct.
> >> This is the key.
> >>     It's easier to design and tune Lustre for large files and I/O.
> >>
> >> 3. Network tuning is as important as storage tuning. Tune Switches,
> >> each Workstation, Render Nodes, Samba / NFS Gateways, OSS Nodes, MDS
> >> Nodes, everything.
> >>
> >> 4. Similarly, do not underestimate the Samba / NFS Gateway. Size and
> >> tune them correctly too.
> >>
> >> 5. Use High Speed Switching like QDR Infiniband or 40GigE, especially
> >> for backend connectivity between Samba/NFS Gateway and Lustre MDS/OSS
> >> Nodes.
> >>
> >> 6. As far as possible, have a fixed directory pattern for all projects.
> >> Separate working files (Maya, Nuke, etc.) from the data, i.e. frames /
> >> images, videos, etc. at the top directory level itself. This will help
> >> you tune / manage the storage better. Use a different directory tree
> >> for different file sizes or file access types.
> >>
> >> If designed and tuned right, I think Lustre is the best storage currently
> >> available for your kind of work.
> >>
> >> Hope this helps.
> >>
> >> Regards,
> >>
> >>
> >> Indivar Nair
> >>
> >>
> >> On Fri, Jan 18, 2013 at 1:51 AM, greg whynott
> >> <greg.whynott at gmail.com> wrote:
> >>
> >>> Hi Charles,
> >>>
> >>>   I received a few challenging off-list email messages along with a few
> >>> fishing ones, but it's all good. It's interesting how a post asking a
> >>> question can make someone appear angry.  8)
> >>>
> >>> Our IO profiles from the different segments of our business do vary
> >>> greatly. The HPC is more or less the typical load you would expect to
> >>> see, depending on which software is in use for the job being run.
> >>> We have hundreds of artists and administrative staff who use the file
> >>> system in a variety of ways. Some examples would include, but are not
> >>> limited to: saving out multiple revisions of Photoshop documents
> >>> (typically in the hundreds of megs to +1gig range), video editing
> >>> (stereoscopic 2k and 4k images, again from 10's and 100's of megs to
> >>> gigs in size) including uncompressed video, Excel, Word and similar
> >>> files, and thousands of project files (from software such as Maya,
> >>> Nuke and similar); these also vary largely in size, from 1 to
> >>> thousands of megs.
> >>>
> >>> The intention is to keep our databases and VM requirements on the
> >>> existing file system, which is comprised of about 100 10k SAS drives;
> >>> it works well.
> >>>
> >>> We did consider GPFS but that consideration went out the door once I
> >>> started talking to them and hammering some numbers into their online
> >>> calculator. Things got a bit crazy quickly. They have different
> >>> pricing for the different types and speeds of Intel CPUs. I got the
> >>> feeling they were trying to squeeze every penny out of customers they
> >>> could. It felt very Brocade-ish and left a bad taste with us. It
> >>> wouldn't have been much of a problem at some other shops I've worked
> >>> at, but here we do have a finite budget to work within.
> >>>
> >>> The NAS vendors could all be considered scale-out, I suspect. All 3
> >>> can scale out the storage and front end. NA C-mode can have up to 24
> >>> heads, Blue Arc goes up to 4 or 8 depending on the class, Isilon can
> >>> go up to 24 nodes or more as well if memory serves me correctly, and
> >>> they all have a single name space solution in place. They each have
> >>> their limits, but for our use case they are really subjective. We
> >>> will not hit the limits of their scalability before we are
> >>> considering a forklift refresh. In our view, for what they offer it
> >>> is pretty much a wash for them - any would meet our needs. NetApp
> >>> still has a silly agg/vol size limit; at least it is up to 90TB now
> >>> (from 9 in the past (formatted fs use)). In April it is supposed to
> >>> go much higher.
> >>>
> >>>  The block storage idea in the mix - since all our HPC is Linux, they
> >>> all would become Lustre clients. To provide a gateway into the Lustre
> >>> storage for non-Linux/Lustre hosts, the thinking was a clustered pair
> >>> of Linux boxes running SAMBA/NFS which were also Lustre clients. It's
> >>> just an idea being bounced around at this point. The data serving
> >>> requirements of the non-HPC parts of the business are much less. The
> >>> video editors most likely would stay on our existing storage solution
> >>> as that is working out very well for them, but even if we did put
> >>> them onto the Lustre FS, I think they would be fine. Based on that, it
> >>> didn't seem so crazy to consider block access in this method. That
> >>> said, I think we would be one of the first in M&E to do so, pioneers
> >>> if you will...
> >>>
> >>>
> >>> diversify - we will end up in the same boat for the same reasons.
> >>>
> >>>
> >>> thanks Charles,
> >>> greg
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Thu, Jan 17, 2013 at 2:20 PM, Hammitt, Charles Allen <
> >>> chammitt at email.unc.edu> wrote:
> >>>
> >>>> Somewhat surprised that no one has responded yet; although it's likely
> >>>> that the responses would be rather subjective - including mine, of
> >>>> course!
> >>>>
> >>>> Generally I would say that it would be interesting to know more about
> >>>> your datasets and intended workload; however, you mention this is to
> >>>> be used as your day-to-day main business storage - so I imagine those
> >>>> characteristics would greatly vary... mine certainly do; that much is
> >>>> for sure!
> >>>>
> >>>> I don't really think uptime would be as much an issue here; there are
> >>>> lots of redundancies, recovery mechanisms, and plenty of stable
> >>>> branches to choose from - the question becomes what are the
> >>>> feature-set needs, performance usability for different file types and
> >>>> workloads, and general comfort level with greater complexity and
> >>>> somewhat fewer resources. That said, I'd personally be a bit wary of
> >>>> using it as a general filesystem for *all* your needs.
> >>>>
> >>>> I do find it interesting that your short list is a wide range mix of
> >>>> storage and filesystem types; traditional NAS, scale-out NAS, and
> >>>> then some block storage with a parallel filesystem in Lustre. Why no
> >>>> GPFS on the list for comparison?
> >>>>
> >>>> I currently manage, or have used in the past *[bluearc]*, all the
> >>>> storage / filesystems and more from your list. The reason is that
> >>>> different storage and filesystem components have some things they
> >>>> are good at... while other things they might not be as good at
> >>>> doing. So I diversify by putting different storage/filesystem
> >>>> component pieces in the areas where they excel at best...
> >>>>
> >>>> Regards,
> >>>>
> >>>> Charles
> >>>>
> >>>>
> >>>> *From:* lustre-discuss-bounces at lists.lustre.org [mailto:
> >>>> lustre-discuss-bounces at lists.lustre.org] *On Behalf Of *greg whynott
> >>>> *Sent:* Thursday, January 17, 2013 12:18 PM
> >>>> *To:* lustre-discuss at lists.lustre.org
> >>>>
> >>>> *Subject:* [Lustre-discuss] is Luster ready for prime time?
> >>>>
> >>>> Hello,
> >>>>
> >>>> Just signed up today, please forgive me if this question has been
> >>>> covered recently. I'm in a bit of a rush to get an answer on this as
> >>>> we need to make a decision soon; the idea of using Lustre was thrown
> >>>> into the mix very late in the decision making process.
> >>>>
> >>>> We are looking to procure a new storage solution which will
> >>>> predominantly be used for HPC output but will also be used as our main
> >>>> business-centric storage for day to day use, meaning the file system
> >>>> needs to be available 24/7/365. The last time I was involved in
> >>>> considering Lustre was about 6 years ago, and at that time it was
> >>>> being considered for scratch space for HPC usage only.
> >>>>
> >>>> Our VMs and databases would remain on non-Lustre storage as we already
> >>>> have that in place and it works well. The Lustre file system
> >>>> potentially would have everything else. Projects we work on typically
> >>>> take up to 2 years to complete and during that time we would want all
> >>>> assets to remain on the file system.
> >>>>
> >>>> Some of the vendors on our short list include HDS (Blue Arc), Isilon
> >>>> and NetApp. Last week we started bouncing the idea of using Lustre
> >>>> around. I'd love to use it if it is considered stable enough to do so.
> >>>>
> >>>> Your thoughts and/or comments would be greatly appreciated. Thanks
> >>>> for your time.
> >>>>
> >>>> greg
> >>>>
> >>>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Lustre-discuss mailing list
> >>> Lustre-discuss at lists.lustre.org
> >>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> >>>
> >>>
> >>