[Lustre-discuss] Lustre MPI-IO performance on CNL

Weikuan Yu weikuan.yu at gmail.com
Thu Mar 6 11:02:51 PST 2008


Thanks for the information. There are choices to make about the stripe 
count, depending on the targeted access pattern.

Is redstorm running under CNL or Catamount?

--Weikuan

Marty Barnaby wrote:
> I had tried Direct I/O last year and it didn't seem to be working at 
> the time, so I gave up and haven't been back to it since.
> 
> For file-per-processor vs. shared-file, I ran many different benchmark 
> trials, but never really head-to-head. My efforts were all on our 
> redstorm:/scratch_grande:
> 
> /home/mlbarna> lfs getstripe -v /scratch_grande | grep ACTIVE | wc -l
> 320
> /home/mlbarna> lfs getstripe -v /scratch_grande | grep -v ACTIVE
> OBDS:
> /scratch_grande/
> default stripe_count: 4 stripe_size: 2097152 stripe_offset: -1
> /scratch_grande/test.sh
> lmm_magic:          0x0BD10BD0
> lmm_object_gr:      0
> lmm_object_id:      0x4e92503
> lmm_stripe_count:   4
> lmm_stripe_size:    2097152
> lmm_stripe_pattern: 1
>         obdidx           objid          objid            group
>            281         2777792       0x2a62c0                0
>            282         2780317       0x2a6c9d                0
>            283         2778125       0x2a640d                0
>            284         2778316       0x2a64cc                0
> 
> My one-file-per-processor runs used a NetCDF benchmark code someone 
> had put together. I can't remember the final numbers, or the processor 
> count, but at the time we were interested in actual scientific-computing 
> usage patterns, so we used only an 80-400 KB range of per-processor 
> blocksizes, which will never demonstrate a maximal byte rate on a huge 
> Lustre FS. The one point here I do know is that performance was always 
> highest when the directory the files were written into had been set up 
> with lfs setstripe and the values 0 -1 1. I found no improvement from 
> adjusting the stripe_size away from the default 2 MB, but, for large 
> processor-count runs, a stripe_count of 1 was patently fastest.
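> 
> For reference, that layout can be applied to a directory with the old 
> positional form of lfs setstripe (stripe_size, stripe_offset, 
> stripe_count); the directory name here is just a placeholder:
> 
>    lfs setstripe /scratch_grande/mydir_fpp 0 -1 1
> 
> i.e. default stripe_size, any starting OST, one stripe per file. Files 
> created in that directory then inherit the single-stripe layout.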
> 
> My maximal benchmarking of MPI-IO collective writes to a shared file, 
> again with a simple, one-off program, wrote into a directory defined 
> with the lfs setstripe settings 0 -1 160. I found my apex of 26 GB/s 
> running on only 160 processors, with a per-processor blocksize of 20 MB.
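> 
> That directory would have been set up similarly, something like
> 
>    lfs setstripe /scratch_grande/mydir_shared 0 -1 160
> 
> which stripes every new file across 160 OSTs.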
> 
> To clarify my use of blocksize: the NetCDF trials are something like 
> running IOR with '-b 100m -t 80k', and for the MPI-IO collective I'd 
> have '-b 100m -t 20m'. The exact -b value is not important; one would 
> want it to be as large as the available memory allows.
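> 
> To make the two cases concrete in IOR terms (the paths here are only 
> placeholders for what the real runs used), they would look roughly like:
> 
>    # file-per-processor, small transfers
>    IOR -a POSIX -F -b 100m -t 80k -o /scratch_grande/fpp/testfile
> 
>    # single shared file, MPI-IO collective, large transfers
>    IOR -a MPIIO -c -b 100m -t 20m -o /scratch_grande/shared/testfile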
> 
> Both of the benchmarking codes I employed differed somewhat from the 
> approach in IOR. Each simply malloced a single buffer of the specified 
> blocksize and, after opening the file or files, iterated in a barriered 
> loop, appending the same buffer for 'n' rotations. Usually the timer is 
> stopped as soon as the loop exits, before the files are closed.
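> 
> As a minimal sketch of that loop shape for the shared-file MPI-IO case 
> (this is not either of the actual codes; the file name, sizes, and 
> error handling are simplified):
> 
>    #include <mpi.h>
>    #include <stdio.h>
>    #include <stdlib.h>
> 
>    int main(int argc, char **argv)
>    {
>        MPI_File fh;
>        MPI_Offset blocksize = 20 * 1024 * 1024;   /* per-process block, e.g. 20 MB */
>        int n = 16, rank, nprocs, i;               /* 'n' rotations of the same buffer */
>        char *buf;
>        double t0, t1;
> 
>        MPI_Init(&argc, &argv);
>        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
> 
>        buf = malloc(blocksize);                   /* one buffer, reused every iteration */
> 
>        MPI_File_open(MPI_COMM_WORLD, "testfile",
>                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
> 
>        MPI_Barrier(MPI_COMM_WORLD);
>        t0 = MPI_Wtime();
>        for (i = 0; i < n; i++) {
>            /* each pass appends one block per rank right after the previous pass */
>            MPI_Offset off = ((MPI_Offset)i * nprocs + rank) * blocksize;
>            MPI_File_write_at_all(fh, off, buf, (int)blocksize, MPI_BYTE,
>                                  MPI_STATUS_IGNORE);
>            MPI_Barrier(MPI_COMM_WORLD);           /* barriered loop */
>        }
>        t1 = MPI_Wtime();                          /* timer stops before the close */
> 
>        MPI_File_close(&fh);
>        if (rank == 0)
>            printf("%.2f MB/s\n",
>                   (double)blocksize * nprocs * n / (t1 - t0) / 1.0e6);
>        free(buf);
>        MPI_Finalize();
>        return 0;
>    }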
> 
> I recently completed some modifications to my own copy of IOR to 
> execute more like this. I moved the repetitions loop inside the file 
> open and close, and made the offset continuous, so every blocksize of 
> transfers appends to the end of the still-open file; the total written 
> to the file is then the product of the blocksize and the repetitions. 
> I have this basically working for POSIX single-shared-file, and also 
> PnetCDF.
> 
> MLB
> 
> 
> 
> Weikuan Yu wrote:
>>> What is the stripe_size of this test? 4 MB? If it is 4 MB, then the
>>> transfer_size (64 MB) is quite a bit bigger. We have seen this
>>> situation before; in the end it seemed to be because each client
>>> holds too many locks per write (because of Lustre's down-forward
>>> extent lock policy), which can block other clients' writes and so
>>> hurt the parallelism of the whole system. Maybe you could try
>>> decreasing the transfer size to the stripe_size, or increasing the
>>> stripe_size to 64 MB, and see how it goes?
>>>     
>>
>> Yes, the situation between a shared file and separate files has been
>> seen before. But I have never seen an explanation that applies to CNL.
>> BTW, this performance difference between shared/separate stays the
>> same regardless of what the transfer size is.
>>
>> Does anybody want to post a reason regarding direct I/O too?
>>
>> --Weikuan
>>

-- 
Weikuan Yu <+> 1-865-574-7990
http://ft.ornl.gov/~wyu/


