[lustre-discuss] Designing a new Lustre system

E.S. Rosenberg esr+lustre at mail.hebrew.edu
Thu Dec 21 10:32:49 PST 2017


Thanks for all the great answers!

Still looking for more info for #4...

Thanks again,
Eli

On Thu, Dec 21, 2017 at 12:26 AM, Mohr Jr, Richard Frank (Rick Mohr)
<rmohr at utk.edu> wrote:

> My $0.02 below.
>
> > On Dec 20, 2017, at 11:21 AM, E.S. Rosenberg
> > <esr+lustre at mail.hebrew.edu> wrote:
> >
> > 1. After my recent experience with failover, I wondered: is there any
> > reason not to set all machines within reasonable cable range as
> > potential failover nodes, so that in the very unlikely event that both
> > machines connected to a disk enclosure fail, a simple recabling +
> > manual mount would still work?
>
> That would probably work fine.  I don’t know if there are any drawbacks to
> having a long list of failover nodes.  I’m not sure how long it would take
> a client to time out, go to the next node, and then work its way down 4 or
> 5 more nodes.  But I suspect that would be a very unlikely scenario and
> probably not worth worrying about.
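>
> If you do go with a long list, a rough sketch of the config (the NIDs
> and device path below are made up, not a recipe) is just to repeat
> --servicenode for every node that could plausibly host the target:
>
>   tunefs.lustre --servicenode=10.0.0.11@o2ib \
>       --servicenode=10.0.0.12@o2ib \
>       --servicenode=10.0.0.13@o2ib \
>       /dev/mapper/ost0
>
>   # sanity-check what the target will register with the MGS
>   tunefs.lustre --dryrun /dev/mapper/ost0
>
> Whichever node actually has the enclosure cabled at the time can then
> mount the target by hand.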
>
> > 2. I'm trying to decide how to do metadata. On the one hand, I would
> > very much prefer to have a failover pair; on the other hand, when I
> > look at the load on the MDS it seems like a big waste to have even one
> > machine allocated exclusively to this. So I was thinking of instead
> > making all Lustre nodes MDS+OSS. As I understand it, this would
> > potentially provide better metadata performance if needed, allow me to
> > put small files on the MDS, and also provide better resilience. Am I
> > correct in these assumptions? Has anyone done something similar?
>
> As I believe Patrick mentioned, memory usage needs to be considered and
> not just CPU utilization.  For fast MDS access, you will want to cache a
> bunch of inodes.  For fast OSS access, you may want to aggressively cache
> file contents.  And then you have to consider memory usage for locking
> (which can be substantial in some cases).  These factors can be mitigated
> to a certain extent by tuning Lustre parameters.  I’m not saying that your
> idea wouldn’t work, but you may want to consider some of these things
> closely before making a decision.
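>
> To give a feel for the kind of knobs involved (the numbers here are
> made up, not recommendations), you can cap the clients' lock LRUs and
> keep the OSS read cache to small files:
>
>   # on clients: cap the DLM lock LRU per target so the servers are
>   # not asked to track an unbounded number of locks
>   lctl set_param ldlm.namespaces.*osc*.lru_size=1024
>   lctl set_param ldlm.namespaces.*mdc*.lru_size=1024
>
>   # on servers: see how many locks are actually granted right now
>   lctl get_param ldlm.namespaces.*.pool.granted
>
>   # on an ldiskfs OSS: only cache reads of files below this size
>   lctl set_param obdfilter.*.readcache_max_filesize=2M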
>
> > 3. An LLNL lecture at Open-ZFS last year seems to strongly suggest
> > using zfs over ldiskfs. Is this indeed 'the way to go for new systems',
> > or are both still fully valid options?
>
> Both are valid options.  ldiskfs is kind of the “tried and true”
> technology, but zfs has some nice features that make it appealing.  From a
> performance perspective, ldiskfs performs better than zfs for the MDT.  On
> the OSTs, I have had an easier time getting the most performance from my
> hardware using ldiskfs as well.  (ZFS hasn’t been bad, but I always seem to
> get better results for streaming IO with ldiskfs.  Maybe my zfs tuning
> skills are not up to snuff.)  Also, I have had the experience that, given
> MDTs of the same capacity, one formatted with zfs doesn’t seem to provide
> as many inodes as one formatted with ldiskfs.  (Again, this might be due
> to a lack of proper ZFS settings on my part.)
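>
> If you want to compare for yourself, the simplest check is the inode
> counts once the targets are formatted (fsname, NIDs, and devices below
> are placeholders):
>
>   # available vs. used inodes per MDT/OST
>   lfs df -i /mnt/testfs
>
>   # the backend is just the --backfstype chosen at format time
>   mkfs.lustre --fsname=testfs --mdt --index=0 \
>       --mgsnode=10.0.0.10@o2ib --backfstype=ldiskfs /dev/mapper/mdt0
>   mkfs.lustre --fsname=testfs --mdt --index=0 \
>       --mgsnode=10.0.0.10@o2ib --backfstype=zfs \
>       mdt0pool/mdt0 mirror /dev/sda /dev/sdb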
>
> With that being said, I recently worked on a file system where we needed
> to migrate the data off the MDT to some new storage in order to increase
> the MDT capacity.  We had made a conscious decision to use ZFS on the MDT
> even though the performance wasn’t as good as ldiskfs because we had
> foreseen the possibility of needing to increase the MDT storage capacity
> (or move to different storage).  When the time came to migrate to new
> storage, we were able to use zfs send/receive to move data using
> incremental snapshots.  This was much easier than trying to tar up the
> contents of an ldiskfs-backed MDT and untar it to the new storage.
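>
> The mechanics were basically standard ZFS (the pool/dataset names and
> destination host here are placeholders, and I'm glossing over the
> Lustre-side steps):
>
>   # first pass: full send while the old MDT is still in service
>   zfs snapshot mdt0pool/mdt0@migrate1
>   zfs send mdt0pool/mdt0@migrate1 | ssh newmds zfs receive newpool/mdt0
>
>   # after the MDT is stopped: send only the changes since the first pass
>   zfs snapshot mdt0pool/mdt0@migrate2
>   zfs send -i @migrate1 mdt0pool/mdt0@migrate2 | \
>       ssh newmds zfs receive -F newpool/mdt0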
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
>
>