[Lustre-discuss] NFS vs Lustre

Sat Aug 29 10:56:40 PDT 2009

You seem to be correct. Nobody ever seems to contrast NFS with these
super file system solutions. That is interesting.

It's Saturday, the family is out running around. I have time to think
about this question. Unfortunately, for you, I do this more for myself.
Which means this is going to be a stream-of-consciousness thing far more
than a well organized discussion. Sorry.

I'd begin by motivating both NFS and Lustre. Why do they exist? What
problems do they solve?

NFS first.

Way back in the day, Ethernet and the concept of a workstation got
popular. There were many tools to copy files between machines but few
ways to share a name space; that is, to have the directory hierarchy and
its content directly accessible to an application on a foreign machine. This
made file sharing awkward. The model was to copy the file or files to
the workstation where the work was going to be done, do the work, and
copy the results back to some, hopefully, well maintained central
machine.

There *were* solutions to this at the time. I recall an attractive
alternative called RFS (I believe) from the Bell Labs folks, via some
place in England if I'm remembering right, it's been a looong time after
all. It had issues though. The nastiest issue for me was that if a
client went down the service side would freeze, at least partially.
Since this could happen willy-nilly, depending on the user's wishes and
how well the power button on his workstation was protected, together
with the power cord and ethernet connection, this freezing of service
for any amount of time was difficult to accept. This was so even in a
rather small collection of machines.

The problem with RFS (?) and its cousins was that they were all
stateful. The service side depended on state that was held at the
client. If the client went down, the service side couldn't continue
without a whole lot of recovery, timeouts, etc. It was a very *annoying*
problem.

In the latter half of the 1980's (am I remembering right?) SUN proposed
an open protocol called NFS. An implementation using this protocol could
do most everything RFS(?) could but it didn't suffer the service-side
hangs. It couldn't. It was stateless. If the client went down, the
server just didn't care. If the server went down, the client had the
opportunity to either give up on the local operation, usually with an
error returned, or wait. It was always up to the user and for client
failures the annoyance was limited to the user(s) on that client.

SUN, also, wisely desired the protocol to be ubiquitous. They published
it. They wanted *everyone* to adopt it. More, they would help
competitors. SUN held interoperability bake-a-thons to help with this.

It looks like they succeeded, all around :)

Let's sum up, then. The goals for NFS were:

1) Share a local file system name space across the network.
2) Do it in a robust, resilient way. Pesky FS issues because some user
kicked the cord out of his workstation were unacceptable.
3) Make it ubiquitous. SUN was a workstation vendor. They sold servers
but almost everyone had a VAX in their back pocket where they made the
infrastructure investment. SUN needed the high-value machines to support
this protocol.

Now Lustre.

Lustre has a weird story and I'm not going to go into all of it. The
shortest, relevant, part is that while there was at least one solution
that DOE/NNSA felt acceptable, GPFS, it was not available on anything
other than an IBM platform and because DOE/NNSA had a semi-formal policy
of buying from different vendors at each of the three labs we were kind
of stuck. Other file systems, existing and imminent, at the time were
examined but they were all distributed file systems and we needed IO
*bandwidth*. We needed lots and lots of bandwidth.

We also needed that ubiquitous thing that SUN had as one of their goals.
We didn't want to pay millions of dollars for another GPFS. We felt that
would only be painting ourselves into a corner. Whatever we did, the
result *had* to be open. It also had to be attractive to smaller sites
as we wanted to turn loose of the thing at some point. If it was
attractive for smaller machines we felt we would win in the long term
as, eventually, the cost to further and maintain this thing was spread
across the community.

As far as technical goals, I guess we just wanted GPFS, but open. More
though, we wanted it to survive in our platform roadmaps for at least a
decade. The actual technical requirements for the contract that DOE/NNSA
executed with HP (CFS was the sub-contractor responsible for
development) can be found here:

<http://www-cs-students.stanford.edu/~trj/SGS_PathForward_SOW.pdf>

LLNL used to host this but it's no longer there? Oh well, hopefully this
link will be good for a while, at least.

I'm just going to jump to the end and sum the goals up:

1) It must do *everything* NFS can. We relaxed the stateless thing
though, see the next item for why.
2) It must support full POSIX semantics: last writer wins, POSIX locks,
etc.
3) It must support all of the transports we are interested in.
4) It must be scalable, in that we can cheaply attach storage and both
performance (reading *and* writing) and capacity within a single mounted
file system increase in direct proportion.
5) We wanted it to be easy, administratively. Our goal was that it be no
harder than NFS to set up and maintain. We were involving too many folks
with PhDs in the operation of our machines at the time. Before you yell
FAIL, I'll say we did try. I'll also say we didn't make CFS responsible
for this part of the task. Don't blame them overly much, OK?
6) We recognized we were asking for a stateful system, so we wanted to
mitigate that by having some focus on resiliency. These were big
machines and clients died all the time.
7) While not in the SOW, we structured the contract to accomplish some
future form of wide acceptance. We wanted it to be ubiquitous.

That's a lot of goals! For the technical ones, the main ones are all
pretty much structured to ask two things of what became Lustre. First,
give us everything NFS functionally does but go far beyond it in
performance. Second, give us everything NFS functionally does but make
it completely equivalent to a local file system, semantically.

There's a little more we have to consider. NFS4 is a different beast
than NFS2 or NFS3. NFS{2,3} had some serious issues that became more
prominent as time went by. First, security: it had none. Folks had
bandaged on some different things to try to cure this but they weren't
standard across platforms. Second, it couldn't do the full POSIX
required semantics. That was attacked with the NFS lock protocols but it
was such an after-thought it will always remain problematic. Third, new
authorization possibilities introduced by Microsoft and then POSIX,
called ACLs, had no way of being accomplished.

NFS4 addresses those by:

1) Introducing state. Can do full POSIX now without the lock servers.
Lots of resiliency mechanisms introduced to offset the downside of this,
too.
2) Formalizing and offering standardized authentication headers.
3) Introducing ACLs that map to equivalents in POSIX and Microsoft.

Strengths and Weaknesses of the Two
-----------------------------------

NFS4 does most everything Lustre can with one very important exception,
IO bandwidth.

Both seem able to deliver metadata performance at roughly the same
speeds. File create, delete, and stat rates are about the same. NetApp
seems to have a partial enhancement. They bought the Spinnaker goodies
some time back and have deployed that technology, and redirection
too(?), within their servers. The good thing about that is two users in
different directories *could* leverage two servers, independently, and,
so, scale metadata performance. It's not guaranteed but at least there
is the possibility. If the two users are in the same directory, it's not
much different, though, I'm thinking. Someone correct me if I'm wrong?

Both can offer full POSIX now. It's nasty in both cases but, yes, in
theory you can export mail directory hierarchies with locking.

The NFS client and server are far easier to set up and maintain. The
tools to debug issues are advanced. While the Lustre folks have done
much to improve this area, NFS is just leaps and bounds ahead. It's
easier to deal with NFS than Lustre. Just far, far easier, still.

NFS is just built in to everything. My TV has it, for heck's sake. Lustre
is, seemingly, always an add-on. It's also a moving target. We're
constantly futzing with it, upgrading, and patching. Lustre might be
compilable most everywhere we care about but building it isn't trivial.
The supplied modules are great but, still, moving targets in that we
wait for SUN to catch up to the vendor-supplied changes that affect
Lustre. Given Lustre's size and interaction with other components in the
OS, that happens far more frequently than desired. NFS just plain wins
the ubiquity argument at present.

NFS IO performance does *not* scale. It's still an in-band protocol. The
data is carried in the same message as the request and is, practically,
limited in size. Reads are more scalable than writes; a popular
file-segment can be satisfied from the cache on reads but develops
issues at some point. For writes, NFS3 and NFS4 help in that they
directly support write-behind so that a client doesn't have to wait for
data to go to disk, but it's just not enough. If one streams data
to/from the store, it can be larger than the cache. A client that might
read a file already made "hot" but at a very different rate just loses.
A client, writing, is always looking for free memory to buffer content.
Again, too many of these, simultaneously, and performance descends to
the native speed of the attached back-end store and that store can only
get so big.

Lustre IO performance *does* scale. It uses a 3rd-party transfer.
Requests are made to the metadata server and IO moves directly between
the affected storage component(s) and the client. The more storage
components, the less possibility of contention between clients and the
more data can be accepted/supplied per unit time.
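
A toy model of the two paragraphs above, as a sketch only. The function
names and the bandwidth figures are made up for illustration; they are
not measurements. It assumes every client streams as fast as it can and
that load spreads evenly across the storage components.

```python
def in_band_bandwidth(clients, client_bw, server_bw):
    """Aggregate bandwidth when all data passes through one server,
    as with in-band NFS: the single server is the ceiling."""
    return min(clients * client_bw, server_bw)


def third_party_bandwidth(clients, client_bw, targets, target_bw):
    """Aggregate bandwidth when clients move data directly to and from
    the storage components, as with Lustre: the ceiling grows with the
    number of components (assuming an even spread of load)."""
    return min(clients * client_bw, targets * target_bw)


if __name__ == "__main__":
    # Hypothetical numbers, in MB/s, chosen only to show the shape.
    for targets in (1, 4, 16):
        print(targets,
              in_band_bandwidth(64, 100, 500),
              third_party_bandwidth(64, 100, targets, 500))
```

In this model the in-band figure stays pinned at the one server's rate
no matter how many clients show up, while the third-party figure rises
with each storage component added until the clients themselves become
the limit.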

NFS4 has a proposed extension, called pNFS, to address this problem. It
just introduces the 3rd-party data transfers that Lustre enjoys. If and
when that is a standard, and is well supported by clients and vendors,
the really big technical difference will virtually disappear. It's been
a long time coming, though. It's still not there. Will it ever be,
really?

The answer to the NFS vs. Lustre question comes down to the workload for
a given application then, since they do have overlap in their solution
space. If I were asked to look at a platform and recommend a solution I
would worry about IO bandwidth requirements. If the platform in question
were read-mostly and, practically, never needed sustained read or
write bandwidth, NFS would be an easy choice. I'd even think hard about
NFS if the platform created many files but all were very small; today's
filers have very respectable IOPS rates. If it came down to IO
bandwidth, I'm still on the parallel file system bandwagon. NFS just
can't deal with that at present and I do still have the folks, in house,
to manage the administrative burden.
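
The rule of thumb in the paragraph above could be sketched roughly like
this. The function name, its two parameters, and the wording of the
returned strings are all hypothetical, and the last branch (no staff in
house) is my own extrapolation, not something argued above.

```python
def recommend(needs_sustained_bandwidth: bool, has_admin_staff: bool) -> str:
    """Very rough NFS-vs-parallel-FS rule of thumb (illustrative only)."""
    if not needs_sustained_bandwidth:
        # Read-mostly platforms, or many very small files: NFS is easy.
        return "NFS"
    if has_admin_staff:
        # Bandwidth-bound, with people on hand for the admin burden.
        return "parallel file system"
    # Assumption: bandwidth still decides, but the burden must be planned for.
    return "parallel file system, after budgeting for administration"


if __name__ == "__main__":
    print(recommend(needs_sustained_bandwidth=False, has_admin_staff=True))
    print(recommend(needs_sustained_bandwidth=True, has_admin_staff=True))
```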

Done. That was useful for me. I think five years ago I might have opted
for Lustre in the "create many small files" case, where I would consider
NFS today, so re-examining the motivations, relative strengths, and
weaknesses of both was useful. As I said, I did this more as a
self-exercise than anything else but I hope you can find something
useful here, too. The family is back from their errands, too :) Best
wishes and good luck.

		--Lee

On Wed, 2009-08-26 at 04:11 -0600, Tharindu Rukshan Bamunuarachchi
wrote:
> hi All,
> 
>  
> 
> I need to prepare small report on “NFS vs. Lustre” ?
> 
>  
> 
> I could find lot of resources about Lustre vs. (CXFS, GPFS, GFS) …
> 
>  
> 
> Can you guys please provide few tips … URLs … etc.
> 
>  
> 
>  
> 
>  
> 
>  
> 
> cheers,
> 
> __
> 
> tharindu