[Lustre-devel] 2 primitives for the NRS
Eric Barton
eeb at sun.com
Fri Jan 4 10:19:44 PST 2008
Wonderful questions!
> I am trying to plug in an I/O request scheduler into OSS before
> read/write requests get dispatched to the obdfilter.
If you base your code off b1_6, you can take a free ride on the initial
request processing done by Nathan's adaptive timeout code.
> What I am using
> is hashed bins with a basic rb tree, assuming it would be fairly
> reasonable to handle the number of I/O requests that can reach an
> OSS. My interface calls are very similar to yours, except the lack
> of plug-in comparison. I would not have more to suggest on this.
> But I do have a couple of questions to check for possible thoughts if
> you have.
>
> (1) How are you going to order the requests, say the read/write
> ones? I assume you made it flexible with a plug-in compare().
Yes - because I don't know yet what's going to work best. And maybe
different services might want different orders. And generalisation
when it doesn't hurt performance is a hard habit to break :)
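To make the plug-in compare() idea concrete, here is a minimal sketch in Python rather than the actual Lustre C code. The names `RequestHeap`, `add`, `remove_min` and `by_object_then_offset` are hypothetical, as are the request fields `objid` and `offset`; it only illustrates a heap whose ordering is supplied by the caller.

```python
import functools
import heapq
import itertools


class RequestHeap:
    """Binary heap whose ordering comes from a caller-supplied compare()."""

    def __init__(self, compare):
        # compare(a, b) returns <0, 0 or >0, like a C comparison callback.
        self._key = functools.cmp_to_key(compare)
        self._heap = []
        # Arrival counter: breaks ties so equal requests stay in FIFO order.
        self._seq = itertools.count()

    def add(self, req):
        heapq.heappush(self._heap, (self._key(req), next(self._seq), req))

    def remove_min(self):
        """Return the request that compare() orders first, or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def by_object_then_offset(a, b):
    """One possible policy (an assumption): order by object ID, then offset."""
    ka = (a["objid"], a["offset"])
    kb = (b["objid"], b["offset"])
    return (ka > kb) - (ka < kb)


heap = RequestHeap(by_object_then_offset)
for objid, offset in [(7, 4096), (3, 0), (7, 0), (3, 8192)]:
    heap.add({"objid": objid, "offset": offset})

order = []
while len(heap):
    req = heap.remove_min()
    order.append((req["objid"], req["offset"]))
```

Swapping the policy means passing a different compare function; the heap itself does not change.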
My first thoughts are about fairness to all clients in the face of
unfairness elsewhere - e.g. a gridlocked network - so I'm thinking of
something that picks buffered RPC requests round-robin on client ID.
This is probably good for workloads of large numbers of single-client
jobs to ensure that no individual client can be starved. However I
also suggest that it's good for I/O performed by a single job spread
over many clients.
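A rough sketch of round-robin selection on client ID, again in Python and only as an illustration: `RoundRobinScheduler`, `enqueue` and `dequeue` are made-up names, and per-client FIFO queues are an assumption about how buffered requests might be held.

```python
from collections import OrderedDict, deque


class RoundRobinScheduler:
    """Serve one buffered request per client in turn."""

    def __init__(self):
        # client_id -> FIFO of that client's buffered requests.
        # Dictionary order is the current rotation order.
        self._queues = OrderedDict()

    def enqueue(self, client_id, request):
        self._queues.setdefault(client_id, deque()).append(request)

    def dequeue(self):
        """Return the next request, or None when nothing is buffered."""
        if not self._queues:
            return None
        # Take the client at the head of the rotation.
        client_id, queue = self._queues.popitem(last=False)
        request = queue.popleft()
        if queue:
            # Client still has work: send it to the back of the rotation.
            self._queues[client_id] = queue
        return request


sched = RoundRobinScheduler()
for client, req in [("A", "a1"), ("A", "a2"), ("A", "a3"),
                    ("B", "b1"), ("C", "c1"), ("C", "c2")]:
    sched.enqueue(client, req)

served = []
while True:
    req = sched.dequeue()
    if req is None:
        break
    served.append(req)
```

Client A has three requests queued ahead of everyone else, yet B and C are each served before A's second request, which is the anti-starvation property described above.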
I base this on the idea that a good backend filesystem should and can
optimize disk utilisation. When a file is written, the file<->disk
offset mapping is fixed for subsequent reads, so I want the NRS to
make I/O request execution order repeatable in the face of network
"noise" and races between clients. Without this repeatability, we
have to fall back on the disk elevator to re-create the "good" disk
I/O stream on subsequent reads. Surely it cannot do as good a job
as the NRS, since it must have orders of magnitude fewer requests to
play with - bulk buffers must be allocated by the time it sees them.
See http://arch.lustre.org/index.php?title=Network_Request_Scheduler
I'm afraid I don't yet have anything even half baked to say on
write v. read order etc. I'd still want some empirical evidence first.
> Would the order of the I/O requests based on object ID have some
> relevance to their locality on the disks?
I think it might make more sense for the backend F/S to use a job ID
to help it create sequential disk offsets for the whole I/O pattern
rather than anything coming from one individual client.
> I was assuming at least the requests
> can get smoothed out with the objID ordering.
>
> (2) Have you checked the overhead when there are many concurrent
> threads competing for the locks associated with your heap? The
> performance impact thereof?
I've only done sequential timings so far. NRS ops could be "Amdahl
moments" for the whole server, so fat SMPs might need more
care.
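For a feel of what an "Amdahl moment" costs, the standard Amdahl's law bound can be worked out directly. The serial fraction below is an illustrative assumption, not a measurement of the NRS, and `amdahl_speedup` is just a helper name for this sketch.

```python
def amdahl_speedup(serial_fraction, n_cpus):
    """Upper bound on speedup when serial_fraction of the work cannot
    run in parallel (Amdahl's law)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cpus)


# Assumed example: 5% of per-request work is spent under one scheduler lock.
for n in (2, 4, 8, 16, 32):
    print(n, round(amdahl_speedup(0.05, n), 2))
```

With these assumed figures the bound flattens well short of the CPU count, which is why the locking around the heap would matter on a wide SMP.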
> (3) Do you anticipate to merge the requests in any way, or possibly
> batch execute them?
Yes, but I'm such a lazy sod that I hope the disk elevator will smile
on me. If not and I have to roll up my sleeves - so be it.
Cheers,
Eric