[Lustre-devel] SeaStar message priority
Oleg.Drokin at Sun.COM
Tue Mar 31 21:43:10 PDT 2009
It came to my attention that the SeaStar network does not implement
message priorities, for various reasons.
I really think there is a very valid case for priorities of some
sort, to allow MPI and other
latency-critical traffic to go in front of bulk IO traffic on the
Consider this test I was running the other day on Jaguar. The
application writes 250 MB of data from every
core with a plain write() system call. The write() syscall returns
very fast (less than 0.5 sec == 400+ MB/sec
app-perceived bandwidth) because the data just goes into the memory
cache, to be flushed later.
Then I do two MPI barriers back to back, with nothing in between.
If I run it at sufficient scale (say 1200 cores), the first barrier
takes 4.5 seconds to complete and
the second one 1.5 seconds, presumably because the MPI RPCs are stuck
behind huge bulk data requests on the clients
(at least, I do not have any other good explanation).
This makes for a lot of wasted time in applications that would like
to use the buffering capabilities provided
by the OS.
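
To illustrate the suspected effect, here is a minimal toy model of one
outgoing queue on a client. It is only a sketch under stated assumptions,
not a description of how SeaStar or Lustre actually schedules traffic:
the function name barrier_msg_delay, the 1 MB bulk request size, and the
100 MB/sec link rate are all hypothetical values chosen for illustration.
A tiny barrier message is queued behind the bulk requests that flush the
cached data, and the model reports how long it waits with and without
priorities.

```python
import heapq

URGENT, BULK = 0, 1  # lower value is served first when priorities are on


def barrier_msg_delay(bulk_mb, n_bulk, link_mb_s, use_priority):
    """Seconds until a tiny urgent message leaves a single outgoing queue.

    Toy model (all parameters are illustrative assumptions): n_bulk bulk
    requests of bulk_mb megabytes each are queued first, and the urgent
    message arrives just after the first bulk request went on the wire.
    A request already in flight is never preempted.
    """
    # The first bulk request is already being transmitted.
    clock = bulk_mb / link_mb_s if n_bulk > 0 else 0.0

    queue = []
    for seq in range(1, n_bulk):
        prio = BULK if use_priority else 0  # FIFO: everyone has equal priority
        heapq.heappush(queue, (prio, seq, bulk_mb, False))
    # The urgent message has the highest sequence number, so in FIFO mode
    # it is served last; with priorities it jumps ahead of queued bulk data.
    heapq.heappush(queue, (URGENT if use_priority else 0, n_bulk, 0.0, True))

    while queue:
        _prio, _seq, size_mb, is_urgent = heapq.heappop(queue)
        clock += size_mb / link_mb_s
        if is_urgent:
            return clock
    return clock


if __name__ == "__main__":
    # 250 MB of cached data per core, flushed as 250 hypothetical 1 MB requests.
    for mode in (False, True):
        delay = barrier_msg_delay(1.0, 250, 100.0, use_priority=mode)
        print(f"priorities={mode}: barrier message waits {delay:.2f} s")
```

With these made-up numbers the barrier message waits about 2.5 seconds
in the plain FIFO case and about 0.01 seconds when it is allowed to go
ahead of the queued bulk requests. The absolute values mean nothing; the
point is only that the FIFO delay grows with the amount of buffered data.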
Do you think message priorities of this kind could be implemented, if not
for the current revision then at least for the next