[Lustre-discuss] NFS vs Lustre

John K. Dawson jkdawson at gmail.com
Sat Aug 29 11:15:08 PDT 2009


Lee,

Thanks for posting this. I found the background and perspective very  
interesting.

John

John K. Dawson
jkdawson at gmail.com
612-860-2388

On Aug 29, 2009, at 12:56 PM, Lee Ward wrote:

> You seem to be correct. Nobody ever seems to contrast NFS with these
> super file system solutions. That is interesting.
>
> It's Saturday, the family is out running around. I have time to think
> about this question. Unfortunately, for you, I do this more for  
> myself.
> Which means this is going to be a stream-of-consciousness thing far  
> more
> than a well organized discussion. Sorry.
>
> I'd begin by motivating both NFS and Lustre. Why do they exist? What
> problems do they solve?
>
> NFS first.
>
> Way back in the day, ethernet and the concept of a workstation got
> popular. There were many tools to copy files between machines but few
> ways to share a name space, that is, to have the directory hierarchy
> and its content directly accessible to an application on a foreign machine.
> This
> made file sharing awkward. The model was to copy the file or files to
> the workstation where the work was going to be done, do the work, and
> copy the results back to some, hopefully, well maintained central
> machine.
>
> There *were* solutions to this at the time. I recall an attractive
> alternative called RFS (I believe) from the Bell Labs folks, via some
> place in England if I'm remembering right, it's been a looong time  
> after
> all. It had issues though. The nastiest issue for me was that if a
> client went down the service side would freeze, at least partially.
> Since this could happen willy-nilly, depending on the user's wishes and
> how well the power button on his workstation was protected, together
> with the power cord and ethernet connection, this freezing of service
> for any amount of time was difficult to accept. This was so even in a
> rather small collection of machines.
>
> The problem with RFS (?) and its cousins was that they were all
> stateful. The service side depended on state that was held at the
> client. If the client went down, the service side couldn't continue
> without a whole lot of recovery, timeouts, etc. It was a very  
> *annoying*
> problem.
>
> In the latter half of the 1980s (am I remembering right?) SUN proposed
> an open protocol called NFS. An implementation using this protocol  
> could
> do most everything RFS(?) could but it didn't suffer the service-side
> hangs. It couldn't. It was stateless. If the client went down, the
> server just didn't care. If the server went down, the client had the
> opportunity to either give up on the local operation, usually with an
> error returned, or wait. It was always up to the user and for client
> failures the annoyance was limited to the user(s) on that client.
>
> SUN, also, wisely desired the protocol to be ubiquitous. They  
> published
> it. They wanted *everyone* to adopt it. More, they would help
> competitors. SUN held interoperability bake-a-thons to help with this.
>
> It looks like they succeeded, all around :)
>
> Let's sum up, then. The goals for NFS were:
>
> 1) Share a local file system name space across the network.
> 2) Do it in a robust, resilient way. Pesky FS issues because some user
> kicked the cord out of his workstation were unacceptable.
> 3) Make it ubiquitous. SUN was a workstation vendor. They sold servers
> but almost everyone had a VAX in their back pocket where they made the
> infrastructure investment. SUN needed the high-value machines to  
> support
> this protocol.
>
> Now Lustre.
>
> Lustre has a weird story and I'm not going to go into all of it. The
> shortest, relevant, part is that while there was at least one solution
> that DOE/NNSA felt acceptable, GPFS, it was not available on anything
> other than an IBM platform and because DOE/NNSA had a semi-formal  
> policy
> of buying from different vendors at each of the three labs we were  
> kind
> of stuck. Other file systems, existing and imminent, at the time were
> examined but they were all distributed file systems and we needed IO
> *bandwidth*. We needed lots, and lots of bandwidth.
>
> We also needed that ubiquitous thing that SUN had as one of their  
> goals.
> We didn't want to pay millions of dollars for another GPFS. We felt  
> that
> would only be painting ourselves into a corner. Whatever we did, the
> result *had* to be open. It also had to be attractive to smaller sites
> as we wanted to turn loose of the thing at some point. If it was
> attractive for smaller machines we felt we would win in the long term
> as, eventually, the cost to further and maintain this thing was spread
> across the community.
>
> As far as technical goals, I guess we just wanted GPFS, but open. More
> though, we wanted it to survive in our platform roadmaps for at  
> least a
> decade. The actual technical requirements for the contract that
> DOE/NNSA executed with HP (CFS was the sub-contractor responsible for
> development) can be found here:
>
> <http://www-cs-students.stanford.edu/~trj/SGS_PathForward_SOW.pdf>
>
> LLNL used to host this but it's no longer there? Oh well, hopefully  
> this
> link will be good for a while, at least.
>
> I'm just going to jump to the end and sum the goals up:
>
> 1) It must do *everything* NFS can. We relaxed the stateless thing
> though; see the next item for why.
> 2) It must support full POSIX semantics: last writer wins, POSIX locks,
> etc. (There's a small sketch of what that means just after this list.)
> 3) It must support all of the transports we are interested in.
> 4) It must be scalable, in that we can cheaply attach storage and both
> performance (reading *and* writing) and capacity within a single
> mounted file system increase in direct proportion.
> 5) We wanted it to be easy, administratively. Our goal was that it be
> no harder than NFS to set up and maintain. We were involving too many
> folks with PhDs in the operation of our machines at the time. Before
> you yell FAIL, I'll say we did try. I'll also say we didn't make CFS
> responsible for this part of the task. Don't blame them overly much,
> OK?
> 6) We recognized we were asking for a stateful system, so we wanted to
> mitigate that by having some focus on resiliency. These were big
> machines and clients died all the time.
> 7) While not in the SOW, we structured the contract to accomplish some
> future form of wide acceptance. We wanted it to be ubiquitous.
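>
> Since I mentioned POSIX locks in item 2, here's roughly what the file
> system has to honor, cluster-wide. A minimal sketch in plain C, error
> handling trimmed; nothing Lustre-specific about it:
>
>     #include <fcntl.h>
>     #include <unistd.h>
>
>     /* Take an exclusive byte-range lock on the first 4 KiB of fd.
>      * With full POSIX semantics this must exclude writers on *every*
>      * client of the file system, not just on this node. */
>     int lock_first_page(int fd)
>     {
>         struct flock fl = {0};
>         fl.l_type   = F_WRLCK;           /* exclusive write lock */
>         fl.l_whence = SEEK_SET;
>         fl.l_start  = 0;
>         fl.l_len    = 4096;
>         return fcntl(fd, F_SETLKW, &fl); /* block until granted */
>     }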
>
> That's a lot of goals! For the technical ones, the main ones are all
> pretty much structured to ask two things of what became Lustre. First,
> give us everything NFS functionally does but go far beyond it in
> performance. Second, give us everything NFS functionally does but make
> it completely equivalent to a local file system, semantically.
>
> There's a little more we have to consider. NFS4 is a different beast
> than NFS2 or NFS3. NFS{2,3} had some serious issues that became more
> prominent as time went by. First, security: it had none. Folks had
> bandaged on some different things to try to cure this but they weren't
> standard across platforms. Second, it couldn't do the full POSIX
> required semantics. That was attacked with the NFS lock protocols but
> it was such an afterthought it will always remain problematic. Third,
> new authorization possibilities introduced by Microsoft and then
> POSIX, called ACLs, had no way of being expressed.
>
> NFS4 addresses those by:
>
> 1) Introducing state. It can do full POSIX now without the lock
> servers. Lots of resiliency mechanisms were introduced to offset the
> downside of this, too.
> 2) Formalizing and offering standardized authentication headers.
> 3) Introducing ACLs that map to equivalents in POSIX and Microsoft
> (see the small ACL sketch below).
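>
> For concreteness on the ACL point, a hedged sketch of reading a file's
> POSIX ACL with the 1003.1e draft API (libacl on Linux; link with
> -lacl). The NFS4 ACL model is richer; servers map between the two:
>
>     #include <stdio.h>
>     #include <sys/types.h>
>     #include <sys/acl.h>
>
>     /* Print a file's access ACL in "user::rw-" style text. */
>     int print_acl(const char *path)
>     {
>         acl_t acl = acl_get_file(path, ACL_TYPE_ACCESS);
>         if (acl == NULL)
>             return -1;
>         char *text = acl_to_text(acl, NULL);
>         if (text != NULL) {
>             puts(text);
>             acl_free(text);
>         }
>         acl_free(acl);
>         return 0;
>     }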
>
> Strengths and Weaknesses of the Two
> -----------------------------------
>
> NFS4 does most everything Lustre can with one very important
> exception: IO bandwidth.
>
> Both seem able to deliver metadata performance at roughly the same
> speeds. File create, delete, and stat rates are about the same. NetApp
> seems to have a partial enhancement. They bought the Spinnaker goodies
> some time back and have deployed that technology, and redirection
> too(?), within their servers. The good about that is two users in
> different directories *could* leverage two servers, independently,  
> and,
> so, scale metadata performance. It's not guaranteed but at least there
> is the possibility. If the two users are in the same directory, it's  
> not
> much different, though, I'm thinking. Someone correct me if I'm wrong?
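>
> If you want to see this for yourself, a toy probe like the following,
> run from a directory on each mount under test, is illustrative (plain
> C; not a real benchmark, the numbers will be rough):
>
>     #include <stdio.h>
>     #include <fcntl.h>
>     #include <unistd.h>
>     #include <time.h>
>     #include <sys/stat.h>
>
>     #define N 1000
>
>     /* Time N create+stat+unlink cycles in the current directory. */
>     int main(void)
>     {
>         char name[64];
>         struct stat sb;
>         struct timespec t0, t1;
>
>         clock_gettime(CLOCK_MONOTONIC, &t0);
>         for (int i = 0; i < N; i++) {
>             snprintf(name, sizeof(name), "md_probe.%d", i);
>             int fd = open(name, O_CREAT | O_WRONLY, 0644);
>             if (fd >= 0)
>                 close(fd);
>             stat(name, &sb);
>             unlink(name);
>         }
>         clock_gettime(CLOCK_MONOTONIC, &t1);
>
>         double s = (t1.tv_sec - t0.tv_sec)
>                  + (t1.tv_nsec - t0.tv_nsec) / 1e9;
>         printf("%.0f metadata cycles/sec\n", N / s);
>         return 0;
>     }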
>
> Both can offer full POSIX now. It's nasty in both cases but, yes, in
> theory you can export mail directory hierarchies with locking.
>
> The NFS client and server are far easier to set up and maintain. The
> tools to debug issues are advanced. While the Lustre folks have done
> much to improve this area, NFS is just leaps and bounds ahead. It's
> easier to deal with NFS than Lustre. Just far, far easier, still.
>
> NFS is just built into everything. My TV has it, for heck's sake.
> Lustre
> is, seemingly, always an add-on. It's also a moving target. We're
> constantly futzing with it, upgrading, and patching. Lustre might be
> compilable most everywhere we care about but building it isn't  
> trivial.
> The supplied modules are great but, still, moving targets in that we
> wait for SUN to catch up to the vendor supplied changes that affect
> Lustre. Given Lustre's size and interaction with other components in  
> the
> OS, that happens far more frequently than desired. NFS just plain wins
> the ubiquity argument at present.
>
> NFS IO performance does *not* scale. It's still an in-band protocol.  
> The
> data is carried in the same message as the request and is,  
> practically,
> limited in size. Reads are more scalable than writes; a popular
> file segment can be satisfied from the server's cache, but even that
> develops issues at some point. For writes, NFS3 and NFS4 help in that they
> directly support write-behind so that a client doesn't have to wait  
> for
> data to go to disk, but it's just not enough. If one streams data
> to/from the store, it can be larger than the cache. A client that
> reads a file already made "hot", but at a very different rate, just
> loses. A client, writing, is always looking for free memory to buffer
> content.
> Again, too many of these, simultaneously, and performance descends to
> the native speed of the attached back-end store and that store can  
> only
> get so big.
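>
> To put rough, made-up numbers on that: say the filer has a single
> 10 Gb/s link, call it 1 GB/s of wire bandwidth on a good day. One
> streaming client can saturate that all by itself; a hundred clients
> still share the same 1 GB/s, so roughly 10 MB/s each. Adding clients
> never adds server bandwidth.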
>
> Lustre IO performance *does* scale. It uses a 3rd-party transfer.
> Requests are made to the metadata server and IO moves directly between
> the affected storage component(s) and the client. The more storage
> components, the less possibility of contention between clients and the
> more data can be accepted/supplied per unit time.
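>
> From the programmer's side, the knob that buys this is striping. A
> hedged sketch using liblustreapi (API details from memory, check the
> headers on your release; link with -llustreapi):
>
>     #include <lustre/liblustreapi.h>
>
>     /* Create a file striped across 8 OSTs, so IO on it fans out to
>      * 8 storage servers at once. */
>     int make_striped_file(const char *path)
>     {
>         return llapi_file_create(path,
>                                  1 << 20, /* 1 MiB stripe size */
>                                  -1,      /* let the MDS pick the
>                                            * starting OST */
>                                  8,       /* stripe count */
>                                  0);      /* default RAID0 pattern */
>     }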
>
> NFS4 has a proposed extension, called pNFS, to address this problem.  
> It
> just introduces the 3rd-party data transfers that Lustre enjoys. If  
> and
> when that is a standard, and is well supported by clients and vendors,
> the really big technical difference will virtually disappear. It's  
> been
> a long time coming, though. It's still not there. Will it ever be,
> really?
>
> The answer to the NFS vs. Lustre question comes down to the workload  
> for
> a given application then, since they do have overlap in their solution
> space. If I were asked to look at a platform and recommend a  
> solution I
> would worry about IO bandwidth requirements. If the platform in
> question were read-mostly and, practically, never needed sustained
> read or write bandwidth, NFS would be an easy choice. I'd even think hard
> about
> NFS if the platform created many files but all were very small; today's
> filers have very respectable IOPS rates. If it came down to IO
> bandwidth, I'm still on the parallel file system bandwagon. NFS just
> can't deal with that at present and I do still have the folks, in  
> house,
> to manage the administrative burden.
>
> Done. That was useful for me. I think five years ago I might have  
> opted
> for Lustre in the "create many small files" case, where I would  
> consider
> NFS today, so re-examining the motivations, relative strengths, and
> weaknesses of both was useful. As I said, I did this more as a
> self-exercise than anything else but I hope you can find something
> useful here, too. The family is back from their errands, too :) Best
> wishes and good luck.
>
> 		--Lee
>
>
> On Wed, 2009-08-26 at 04:11 -0600, Tharindu Rukshan Bamunuarachchi
> wrote:
>> hi All,
>>
>> I need to prepare a small report on “NFS vs. Lustre”.
>>
>> I could find a lot of resources about Lustre vs. (CXFS, GPFS, GFS) …
>>
>> Can you guys please provide a few tips … URLs … etc.
>>
>> cheers,
>>
>> __
>>
>> tharindu
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
