[Lustre-devel] 2 primitives for the NRS

Fri Jan 4 10:19:44 PST 2008

Wonderful questions! 

> I am trying to plug in an I/O request scheduler into OSS before
> read/write requests get dispatched to the obdfilter. 

If you base your code off b1_6, you can take a free ride on the initial
request processing done by Nathan's adaptive timeout code.

> What I am using
> is hashed bins with a basic rb tree, assuming it would be fairly
> reasonable to handle the number of I/O requests that can reach an
> OSS. My interface calls are very similar to yours, except the lack
> of plug-in comparison. I would not have more to suggest on this. 
> But I do have a couple of questions to check for possible thoughts if
> you have.
> 
> (1) How are you going to order the requests, say the read/write
> ones? I assume you made it flexible with a plug-in compare(). 

Yes - because I don't know yet what's going to work best.  And maybe
different services might want different orders.  And generalisation
when it doesn't hurt performance is a hard habit to break :)
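
To make the plug-in compare() idea concrete, here is a minimal sketch in
Python rather than the actual server code.  The class name RequestHeap,
its methods, and the by_offset comparison are all hypothetical names
chosen for illustration, not the NRS interface.

```python
import heapq
from functools import cmp_to_key


class RequestHeap:
    """Priority queue whose ordering is supplied by the caller (hypothetical API)."""

    def __init__(self, compare):
        # compare(a, b) returns <0, 0 or >0, like a C comparison callback.
        self._key = cmp_to_key(compare)
        self._heap = []
        self._seq = 0  # arrival counter, breaks ties in FIFO order

    def add(self, request):
        heapq.heappush(self._heap, (self._key(request), self._seq, request))
        self._seq += 1

    def remove_first(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


# Example policy (an assumption): order requests by starting file offset.
def by_offset(a, b):
    return a["offset"] - b["offset"]


heap = RequestHeap(by_offset)
for off in (4096, 0, 8192):
    heap.add({"offset": off})
order = [heap.remove_first()["offset"] for _ in range(len(heap))]
```

A different service could pass a different compare function without
touching the heap itself.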

My first thoughts are about fairness to all clients in the face of
unfairness elsewhere - e.g. a gridlocked network - so I'm thinking of
something that picks buffered RPC requests round-robin on client ID.
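
A rough sketch of round-robin on client ID, assuming one FIFO queue per
client and a rotating list of clients that have work pending.  The name
RoundRobinByClient and its methods are hypothetical; this is only one of
several ways the policy could be expressed.

```python
from collections import OrderedDict, deque


class RoundRobinByClient:
    """Serve one buffered request per client in turn (illustrative only)."""

    def __init__(self):
        # client_id -> deque of pending requests, kept in rotation order.
        self._queues = OrderedDict()

    def add(self, client_id, request):
        self._queues.setdefault(client_id, deque()).append(request)

    def pick(self):
        """Return (client_id, request) for the next client, or None if idle."""
        if not self._queues:
            return None
        client_id, queue = next(iter(self._queues.items()))
        request = queue.popleft()
        if queue:
            # Client still has work: send it to the back of the rotation.
            self._queues.move_to_end(client_id)
        else:
            del self._queues[client_id]
        return client_id, request


rr = RoundRobinByClient()
for req in ("a1", "a2", "a3"):
    rr.add("A", req)          # client A floods the server first
rr.add("B", "b1")             # client B arrives later with one request
served = [rr.pick()[1] for _ in range(4)]
```

Here client B is served second even though three requests from client A
were buffered ahead of it.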

This is probably good for workloads of large numbers of single-client
jobs to ensure that no individual client can be starved.  However I
also suggest that it's good for I/O performed by a single job spread
over many clients.  

I base this on the idea that a good backend filesystem should and can
optimize disk utilisation.  When a file is written, the file<->disk
offset mapping is fixed for subsequent reads, so I want the NRS to
make I/O request execution order repeatable in the face of network
"noise" and races between clients.  Without this repeatability, we
have to fall back on the disk elevator to re-create the "good" disk
I/O stream on subsequent reads.  Surely it cannot do as good a job as
the NRS, since it must have orders of magnitude fewer requests to play
with - bulk buffers must be allocated by the time it sees them.
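
To illustrate what "repeatable" could mean here, the sketch below tags
each buffered request with its position in its own client's stream and
then orders by (position, client ID).  It assumes the whole batch is
buffered before dispatch and that requests from any one client arrive in
order; canonical_order is a hypothetical helper, not part of the NRS.

```python
from collections import defaultdict


def canonical_order(arrivals):
    """arrivals: list of (client_id, request) in network arrival order."""
    seen = defaultdict(int)  # client_id -> requests seen so far
    tagged = []
    for client_id, request in arrivals:
        tagged.append((seen[client_id], client_id, request))
        seen[client_id] += 1
    # Order by per-client position first, then by client ID.
    tagged.sort(key=lambda t: (t[0], t[1]))
    return [(client_id, request) for _, client_id, request in tagged]


# Two runs of the same job with different network interleavings.
run1 = [("c1", "w0"), ("c2", "w0"), ("c1", "w1"), ("c2", "w1")]
run2 = [("c2", "w0"), ("c2", "w1"), ("c1", "w0"), ("c1", "w1")]
same = canonical_order(run1) == canonical_order(run2)
```

Both runs come out in the same execution order, so races between
clients no longer change what the backend sees.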

See http://arch.lustre.org/index.php?title=Network_Request_Scheduler

I'm afraid I don't yet have anything even half baked to say on
write vs. read order etc.  I'd still want some empirical evidence first.

> Would the order of the I/O requests based on object ID have some
> relevance to their locality on the disks?

I think it might make more sense for the backend F/S to use a job ID
to help it create sequential disk offsets for the whole I/O pattern
rather than anything coming from one individual client.

> I was assuming at least the requests
> can get smoothed out with the objID ordering.
> 
> (2) Have you checked the overhead when there are many concurrent
> threads competing for the locks associated with your heap?  The
> performance impact thereof?

I've only done sequential timings so far.  NRS ops could be "Amdahl
moments" for the whole server, so fat SMPs might need more care.

> (3) Do you anticipate to merge the requests in any way, or possibly
> batch execute them?

Yes, but I'm such a lazy sod that I hope the disk elevator will smile
on me.  If not and I have to roll up my sleeves - so be it.

    Cheers,
              Eric