[Lustre-discuss] Lustre MPI-IO performance on CNL
Marty Barnaby
mlbarna at sandia.gov
Thu Mar 6 08:18:15 PST 2008
I tried Direct I/O last year, but it didn't seem to be working at the
time, so I gave up and haven't gone back to it.
For file-per-processor vs. shared-file, I ran many different benchmark
trials, but never a real head-to-head comparison. All my efforts were on
our redstorm:/scratch_grande:
/home/mlbarna> lfs getstripe -v /scratch_grande | grep ACTIVE | wc -l
320
/home/mlbarna> lfs getstripe -v /scratch_grande | grep -v ACTIVE
OBDS:
/scratch_grande/
default stripe_count: 4 stripe_size: 2097152 stripe_offset: -1
/scratch_grande/test.sh
lmm_magic: 0x0BD10BD0
lmm_object_gr: 0
lmm_object_id: 0x4e92503
lmm_stripe_count: 4
lmm_stripe_size: 2097152
lmm_stripe_pattern: 1
obdidx objid objid group
281 2777792 0x2a62c0 0
282 2780317 0x2a6c9d 0
283 2778125 0x2a640d 0
284 2778316 0x2a64cc 0
My one-file-per-processor mode was executed with a NetCDF benchmark code
someone had put together. I can't remember the final numbers or the
processor count, but at the time we were interested in actual scientific
computing usage patterns, so we used only an 80-400 KB range of
per-processor blocksizes, which will never demonstrate a maximal byte
rate on a huge Lustre FS. The one point I do know is that performance
was always highest when the directory the files were written into had
been set with lfs setstripe using the values 0 -1 1 (default
stripe_size, stripe_offset -1, stripe_count 1). I found no improvement
in adjusting the stripe_size from the default 2 MB, but, for large
processor count runs, a stripe_count of 1 was patently fastest.
My best MPI-IO collective-write benchmark to a shared file, again with a
simple, custom program, wrote into a directory defined with the lfs
setstripe settings 0 -1 160. I found my apex of 26 GB/s running on only
160 processors, with a per-processor blocksize of 20 MB.
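For scale, the reported peak can be divided across the writers and the
stripes. This is only back-of-the-envelope arithmetic, assuming "GB"
means 10**9 bytes (the post does not say) and that the load was spread
evenly over all 160 processes and all 160 stripes; the function name is
made up for this sketch.

```python
# Assumption: GB = 10**9 bytes and MB = 10**6 bytes; the post does not specify.
AGGREGATE_BYTES_PER_S = 26 * 10**9  # reported peak aggregate write rate
PROCESSES = 160                     # writers in that run
STRIPE_COUNT = 160                  # third value of the setstripe settings 0 -1 160


def per_unit_rate_mb(aggregate_bytes_per_s, units):
    """Average rate per unit (process or stripe) in MB/s, assuming an even split."""
    return aggregate_bytes_per_s / units / 10**6


per_process = per_unit_rate_mb(AGGREGATE_BYTES_PER_S, PROCESSES)
per_stripe = per_unit_rate_mb(AGGREGATE_BYTES_PER_S, STRIPE_COUNT)
print(f"{per_process:.1f} MB/s per process, {per_stripe:.1f} MB/s per stripe")
```

Under those assumptions each process, and each stripe, averages about
162.5 MB/s.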
To clarify my use of blocksize: the NetCDF trials are something like
running IOR with '-b 100m -t 80k', and for the MPI-IO collective I'd
have '-b 100m -t 20m'. Limiting the -b option is not important; one
would want it to be as large as the available memory allows.
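To make the two IOR-style settings concrete, the sketch below counts how
many transfers make up one block in each case. It treats the k/m/g
suffixes as powers of 1024, which is an assumption of this sketch, and
the helper names are hypothetical rather than anything taken from IOR.

```python
# Assumption: size suffixes are powers of 1024 (k = KiB, m = MiB, g = GiB).
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text):
    """Turn a string such as '80k' or '100m' into a byte count."""
    text = text.strip().lower()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def transfers_per_block(block, transfer):
    """Number of transfer-sized writes needed to cover one block."""
    block_bytes, transfer_bytes = parse_size(block), parse_size(transfer)
    if block_bytes % transfer_bytes:
        raise ValueError("block size is not a multiple of transfer size")
    return block_bytes // transfer_bytes


# NetCDF-like trial: many small writes per block.
print(transfers_per_block("100m", "80k"))
# MPI-IO collective trial: a few large writes per block.
print(transfers_per_block("100m", "20m"))
```

With these values the small-transfer case issues 1280 writes per block
and the large-transfer case issues 5, which is the contrast the two
benchmark styles are meant to capture.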
Both the benchmarking codes I employed differed somewhat from the
approach in IOR. They each simply malloced a single buffer of the
specified blocksize, and, after opening the file or files, iterated over
a barriered loop, appending the same buffer for 'n' many rotations.
Usually, the timer is stopped as soon as the loop is exited, before the
files are closed.
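The loop structure described above might look roughly like the following
single-process sketch. It is not the original benchmark code: it uses
plain POSIX-style calls from Python's os module instead of NetCDF or
MPI-IO, and the barrier argument is a hypothetical stand-in for
something like MPI_Barrier.

```python
import os
import tempfile
import time


def append_benchmark(path, blocksize, rotations, barrier=lambda: None):
    """Append one reused buffer `rotations` times; return (bytes written, seconds)."""
    buf = memoryview(bytes(blocksize))  # single buffer, allocated once
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        start = time.perf_counter()
        for _ in range(rotations):
            barrier()  # hypothetical stand-in for a barrier across processes
            written = 0
            while written < blocksize:  # os.write may write fewer bytes than asked
                written += os.write(fd, buf[written:])
        # Timer stops when the loop exits, before the file is closed.
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return blocksize * rotations, elapsed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        total, seconds = append_benchmark(os.path.join(tmp, "out.dat"), 1 << 20, 8)
        print(f"wrote {total} bytes in {seconds:.6f} s")
```

Stopping the timer before the close means any flushing done at close
time is not included in the measured rate, which is worth keeping in
mind when comparing against tools that time the close as well.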
I recently completed some modifications to my own copy of IOR so that it
executes more like this. I moved the loop for repetitions inside the
file open and close, and adjusted the offset to be continuous, so every
blocksize of transfers appends to the end of the still-open file; the
total written to the file is then the sum of the products of the
blocksize and the repetitions. I have this basically working for POSIX
single-shared-file, and also PNetCDF.
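One way to picture the continuous offsets for a single shared file is
sketched below. The layout is an assumption: the post does not say how
the ranks are interleaved, so this sketch simply has each repetition
append one round of blocks in rank order. The function names are
hypothetical.

```python
def shared_file_offset(rank, repetition, nprocs, blocksize):
    """Byte offset of a rank's block in a given repetition (assumed layout)."""
    return (repetition * nprocs + rank) * blocksize


def total_bytes(nprocs, blocksize, repetitions):
    """Total written: each process contributes blocksize * repetitions."""
    return sum(blocksize * repetitions for _ in range(nprocs))


if __name__ == "__main__":
    nprocs, blocksize, repetitions = 4, 20 * 1024**2, 3
    for rep in range(repetitions):
        row = [shared_file_offset(r, rep, nprocs, blocksize) for r in range(nprocs)]
        print(f"repetition {rep}: offsets {row}")
    print("total bytes:", total_bytes(nprocs, blocksize, repetitions))
```

With this layout the blocks tile the file without gaps or overlap, so
the file size after the last repetition equals the computed total.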
MLB
Weikuan Yu wrote:
>> What is the stripe_size of this test? 4M? If it is 4M, then
>> transfer_size is a little
>> bigger(64M). And we have seen this situation before, finally it seems
>> because client hold
>> too much lock in each write(because of lustre down-forward extent lock
>> policy) which might
>> block other client writing, so impact the parallel of the whole system.
>> Maybe you could try
>> decrease transfer size to stripe_size. Or increase stripe_size to 64M
>> and see how is it?
>>
>
> Yes, the situation between shared file and separated files has been seen
> before. But I have never seen an explanation regarding CNL. BTW, this
> performance difference between shared/separated stays the same,
> regardless what transfer size is.
>
> Anybody wants to post a reason regarding direct I/O too?
>
> --Weikuan