[lustre-discuss] Designing a new Lustre system

Mohr Jr, Richard Frank (Rick Mohr) rmohr at utk.edu
Wed Dec 20 14:26:25 PST 2017

My $0.02 below.

> On Dec 20, 2017, at 11:21 AM, E.S. Rosenberg <esr+lustre at mail.hebrew.edu> wrote:
> 1. After my recent experience with failover I wondered is there any reason not to set all machines that are within reasonable cable range as potential failover nodes so that in the very unlikely event of both machines connected to a disk enclosure failing simple recabling + manual mount would still work?

That would probably work fine.  I don’t know of any drawbacks to having a long list of failover nodes, though I’m not sure how long it would take a client to time out, move on to the next node, and then work its way down 4 or 5 more nodes.  But I suspect that would be a very unlikely scenario and probably not worth worrying about.
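For what it's worth, a long failover list can be declared at format time with repeated --servicenode options; this is just an illustrative sketch (the NIDs, fsname, and device path below are placeholders, not a recommendation for any particular site):

```shell
# Hypothetical sketch: declare several potential failover servers for one OST.
# Every node listed with --servicenode is treated as a valid server for this
# target, and clients will try them in order during failover.
mkfs.lustre --ost --fsname=testfs --index=0 \
    --mgsnode=10.0.0.1@o2ib \
    --servicenode=10.0.0.2@o2ib \
    --servicenode=10.0.0.3@o2ib \
    --servicenode=10.0.0.4@o2ib \
    /dev/mapper/ost0
```

In the recabling scenario described above, any of the listed nodes could then mount the target manually after the enclosure is reattached.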

> 2. I'm trying to decide how to do metadata, on the one hand I would very much like/prefer to have a failover pair, on the other hand when I look at the load on the MDS it seems like a big waste to have even one machine allocated to this exclusively, so I was thinking instead to maybe make all Lustre nodes MDS+OSS, this would as I understand potentially provide better metadata performance if needed and also allow me to put small files on the MDS and also provide for better resilience. Am I correct in these assumptions? Has anyone done something similar?

As I believe Patrick mentioned, memory usage needs to be considered, not just CPU utilization.  For fast MDS access, you will want to cache a bunch of inodes.  For fast OSS access, you may want to aggressively cache file contents.  And then you have to consider memory usage for locking (which can be substantial in some cases).  These factors can be mitigated to a certain extent by tuning Lustre parameters.  I’m not saying that your idea wouldn’t work, but you may want to consider some of these things closely before making a decision.

> 3. An LLNL lecture at Open-ZFS last year seems to strongly suggest using zfs over ldiskfs, is this indeed 'the way to go for new systems' or are both still fully valid options?

Both are valid options.  ldiskfs is kind of the “tried and true” technology, but zfs has some nice features that make it appealing.  From a performance perspective, ldiskfs performs better than zfs for the MDT.  On the OSTs, I have had an easier time getting the most performance from my hardware using ldiskfs as well.  (ZFS hasn’t been bad, but I always seem to get better results for streaming I/O with ldiskfs.  Maybe my zfs tuning skills are not up to snuff.)  Also, in my experience, given MDTs of the same capacity, one formatted with zfs doesn’t seem to provide as many inodes as one formatted with ldiskfs.  (Again, this might be due to a lack of proper ZFS settings on my part.)
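The choice is just the backend filesystem type at format time, so the two options look nearly identical on the command line; this sketch uses placeholder names (fsname, NID, pool, and device) purely to show the shape of each command:

```shell
# Illustrative only: formatting the same MDT with either backend.
# ldiskfs-backed MDT (the ext4-derived, "tried and true" option):
mkfs.lustre --mdt --fsname=testfs --index=0 \
    --mgsnode=10.0.0.1@o2ib --backfstype=ldiskfs /dev/mapper/mdt0

# zfs-backed MDT (gains snapshots and send/receive; here the dataset is
# created inside an existing pool named mdtpool):
mkfs.lustre --mdt --fsname=testfs --index=0 \
    --mgsnode=10.0.0.1@o2ib --backfstype=zfs mdtpool/mdt0
```

Everything above the backend (clients, layouts, failover) behaves the same either way, which is why both remain fully valid choices.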

With that being said, I recently worked on a file system where we needed to migrate the data off the MDT to some new storage in order to increase the MDT capacity.  We had made a conscious decision to use ZFS on the MDT, even though the performance wasn’t as good as ldiskfs, because we had foreseen the possibility of needing to increase the MDT storage capacity (or move to different storage).  When the time came to migrate to new storage, we were able to use zfs send/receive to move data using incremental snapshots.  This was much easier than trying to tar up the contents of an ldiskfs-backed MDT and untar it to the new storage.
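The basic shape of that send/receive migration looks something like the following; the pool and dataset names are placeholders for the old and new MDT storage, not our actual configuration:

```shell
# Hedged sketch of an incremental-snapshot MDT migration.
# 1. Take a baseline snapshot and do the bulk copy while the target is
#    still in service (this transfer can take a long time):
zfs snapshot oldpool/mdt0@base
zfs send oldpool/mdt0@base | zfs receive newpool/mdt0

# 2. Stop the MDT, snapshot again, and send only the changes since @base.
#    The incremental stream is small, so the downtime window stays short:
zfs snapshot oldpool/mdt0@final
zfs send -i @base oldpool/mdt0@final | zfs receive newpool/mdt0
```

The incremental step is what makes this so much nicer than a tar-based copy: the final sync only moves blocks that changed after the baseline, rather than re-walking the whole MDT.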

Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
