[Lustre-devel] protocol backoffs
He.Huang at Sun.COM
Tue Mar 17 08:28:44 PDT 2009
On Mon, Mar 16, 2009 at 01:41:40PM -0700, Andrew C. Uselton wrote:
> Howdy Isaac,
> Nice to meet you. As Eric suggested I am also cc:ing Nick Henke,
> since he might find this an interesting discussion. For all you
> lustre-devel dwellers out there, feel free to chime in.
Hello Andrew, please see my comments inline.
> The "frank_jag" page shows data collected during 4 tests with 256 tasks
> (4 tasks per node on 64 nodes). The target is a single file striped
> across all OSTs of the Lustre file system. Two tests are on Franklin
> and two on Jaguar. Each machine runs a test using the POSIX I/O
> interface and another using the MPI-I/O interface. In the third column
> the Franklin, MPI-I/O test has extremely long delays in the reads in the
> middle phase, but not during the other reads or any of the writes. This
I've got zero knowledge of MPI-IO. Could you please elaborate a bit
on how these "delays in the reads" are measured and what "the
middle phase" is?
> does not happen for POSIX, nor does it happen for Jaguar using MPI-I/O.
> The results shown are entirely reproducible and not due to interference
> from other jobs on the system. The only difference between the Franklin
> and Jaguar configurations is that Jaguar has 144 OSTs on 72 OSSs instead
> of 80 OSTs on 20 OSSs.
Not sure about Franklin, but on Jaguar, depending on the file-system in
use, the OSSs could reside in either the Sea-Star network or an IB
network (accessed via lnet routers). I think it might be worthwhile to
double check what server network had been used.
> Eric put the notion in my head that we may be looking at a
> contention issue in the Sea-Star network. Since the I/O is being necked
> down to 20 OSSs in the case of Franklin, this seems plausible. If you
> guys have a moment to consider the subject I'd like to think about:
> a) Why would contention introduce the catastrophic delays rather than
> just slow things down generally and more or less evenly? Is there some
> form of back-off in the protocol(s) that could occasionally get kicked
> up to tens of seconds?
It involves many layers:
1. At Lustre/PTLRPC layer, there is a limit on the number of in-flight
RPCs to a server. This is end-to-end, and the limit could change at
runtime.
2. At lnet/lnd layer, for ptllnd and o2iblnd, there's a credit-based
mechanism to prevent a sending node from overrunning buffers at the
remote end. This is not end-to-end, and the number of pre-granted
credits doesn't change over runtime.
3. Cray Portals and the Sea-Star network run beneath lnet/ptllnd,
and I'd think that there could also be some similar mechanisms.
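
To make the credit idea in item 2 concrete, here is a minimal toy sketch in Python. It is not the actual ptllnd or o2iblnd code; the class and method names (CreditedPeer, send, credit_returned) are made up for illustration, and it assumes one credit maps to one receive buffer at the remote end. The number of credits is fixed when the peer is created, and a message that finds no credit available simply waits in a queue until an earlier one is acknowledged.

```python
from collections import deque


class CreditedPeer:
    """Toy model of a per-peer, fixed-credit send path (hypothetical names)."""

    def __init__(self, credits):
        self.credits = credits      # pre-granted, never changes afterwards
        self.available = credits    # credits not currently tied up by a message
        self.queued = deque()       # messages waiting for a credit
        self.on_wire = []           # messages occupying a remote buffer

    def send(self, msg):
        """Send now if a credit is available, otherwise queue the message."""
        if self.available > 0:
            self.available -= 1
            self.on_wire.append(msg)
            return True
        self.queued.append(msg)
        return False

    def credit_returned(self):
        """The remote end freed a buffer; hand the credit to the oldest queued message."""
        if not self.on_wire:
            return  # nothing outstanding, so there is no credit to return
        self.on_wire.pop(0)
        self.available += 1
        if self.queued:
            self.send(self.queued.popleft())


peer = CreditedPeer(credits=2)
results = [peer.send(f"msg{i}") for i in range(4)]
print(results)             # [True, True, False, False]
peer.credit_returned()
print(peer.on_wire)        # ['msg1', 'msg2']
print(list(peer.queued))   # ['msg3']
```

In this toy model a queued message makes no progress until the remote end returns a credit, so a slow or congested receiver stalls everything queued behind it. Whether that is what produces the long read delays here is only a guess at this point.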