<div dir="ltr"><div><div><div>Thanks for all the great answers!<br><br></div>Still looking for more info for #4...<br><br></div>Thanks again,<br></div>Eli<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Dec 21, 2017 at 12:26 AM, Mohr Jr, Richard Frank (Rick Mohr) <span dir="ltr"><<a href="mailto:rmohr@utk.edu" target="_blank">rmohr@utk.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">My $0.02 below.<br>

<span class=""><br>

> On Dec 20, 2017, at 11:21 AM, E.S. Rosenberg <<a href="mailto:esr%2Blustre@mail.hebrew.edu">esr+lustre@mail.hebrew.edu</a>> wrote:<br>

><br>

> 1. After my recent experience with failover I wondered is there any reason not to set all machines that are within reasonable cable range as potential failover nodes so that in the very unlikely event of both machines connected to a disk enclosure failing simple recabling + manual mount would still work?<br>

<br>

</span>That would probably work fine.  I don’t know if there are any drawbacks to having a long list of failover nodes.  I’m not sure how long it would take a client to time out, go to the next node, and then work its way down 4 or 5 more nodes.  But I suspect that would be a very unlikely scenario and probably not worth worrying about.<br>

<span class=""><br>

> 2. I'm trying to decide how to do metadata, on the one hand I would very much like/prefer to have a failover pair, on the other hand when I look at the load on the MDS it seems like a big waste to have even one machine allocated to this exclusively, so I was thinking instead to maybe make all Lustre nodes MDS+OSS, this would as I understand potentially provide better metadata performance if needed and also allow me to put small files on the MDS and also provide for better resilience. Am I correct in these assumptions? Has anyone done something similar?<br>

<br>

</span>As I believe Patrick mentioned, memory usage needs to be considered and not just CPU utilization.  For fast MDS access, you will want to cache a bunch of inodes.  For fast OSS access, you may want to aggressively cache file contents.  And then you have to consider memory usage for locking (which can be substantial in some cases).  These factors can be mitigated to a certain extent by tuning Lustre parameters.  I’m not saying that your idea wouldn’t work, but you may want to consider some of these things closely before making a decision.<br>

<span class=""><br>

> 3. An LLNL lecture at Open-ZFS last year seems to strongly suggest using zfs over ldiskfs, is this indeed 'the way to go for new systems' or are both still fully valid options?<br>

<br>

</span>Both are valid options.  ldiskfs is kind of the “tried and true” technology, but zfs has some nice features that make it appealing.  From a performance perspective, ldiskfs performs better than zfs for the MDT.  On the OSTs, I have had an easier time getting the most performance from my hardware using ldiskfs as well.  (ZFS hasn’t been bad, but I always seem to get better results for streaming IO with ldiskfs.  Maybe my zfs tuning skills are not up to snuff.)  Also, I have had the experience that, given MDTs of the same capacity, one formatted with zfs doesn’t seem to provide as many inodes as one formatted with ldiskfs.  (Again, this might be due to a lack of proper ZFS settings on my part.)<br>

<br>

With that being said, I recently worked on a file system where we needed to migrate the data off the MDT to some new storage in order to increase the MDT capacity.  We had made a conscious decision to use ZFS on the MDT even though the performance wasn’t as good as ldiskfs because we had foreseen the possibility of needing to increase the MDT storage capacity (or move to different storage).  When the time came to migrate to new storage, we were able to use zfs send/receive to move data using incremental snapshots.  This was much easier than trying to tar up the contents of an ldiskfs-backed MDT and untar it to the new storage.<br>
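<br>
For illustration only, here is a rough sketch of what a two-pass zfs send/receive migration can look like.  The dataset names (oldpool/mdt0, newpool/mdt0) and snapshot names (base, final) are hypothetical placeholders, and taking the MDT out of service before the final pass is an assumption on my part, not a record of the exact procedure used.  The function only prints the commands so they can be reviewed before anything touches a pool.<br>

```shell
#!/bin/sh
# Print (do not run) the commands for a two-pass zfs send/receive migration.
# All dataset and snapshot names are hypothetical placeholders.
migration_plan() {
    src=$1   # source dataset, e.g. oldpool/mdt0 (hypothetical)
    dst=$2   # destination dataset, e.g. newpool/mdt0 (hypothetical)

    # Pass 1: full copy of a base snapshot while the MDT is still in service.
    echo "zfs snapshot ${src}@base"
    echo "zfs send ${src}@base | zfs receive ${dst}"

    # Pass 2 (assumed to happen after the MDT is unmounted): send only the
    # blocks that changed between @base and @final. -F rolls the destination
    # back to its latest snapshot before applying the incremental stream.
    echo "zfs snapshot ${src}@final"
    echo "zfs send -i ${src}@base ${src}@final | zfs receive -F ${dst}"
}

migration_plan oldpool/mdt0 newpool/mdt0
```

The point of the two passes is that the bulk of the data moves while the file system is still up, so the downtime only has to cover the (usually much smaller) incremental stream.<br>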

<br>

--<br>

Rick Mohr<br>

Senior HPC System Administrator<br>

National Institute for Computational Sciences<br>

<a href="http://www.nics.tennessee.edu" rel="noreferrer" target="_blank">http://www.nics.tennessee.edu</a><br>

<br>

</blockquote></div><br></div>