[Lustre-discuss] Off-topic: largest existing Lustre file system?

Marty Barnaby mlbarna at sandia.gov
Thu Jan 31 11:25:26 PST 2008


I concur that at 160, my processor count was low. At the time, I had 
access to as many as 1000 processors on our big Cray XT3, Redstorm, but 
now it is not available to me at all. Through my trials, I found that, 
for appending to a single, shared file, matching the lfs maximum stripe 
count of 160 with an equivalent job size was the only combination that 
got me up to this rate. My interest in this case ends here, because 
real usage involves processor counts in the thousands.
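
For anyone trying to reproduce that kind of run, here is a minimal C 
sketch (not the code I actually ran) of how a stripe count matching the 
job size can be requested through ROMIO's striping hints when the shared 
file is created; the file name and stripe size below are invented:

/* Minimal sketch: request a Lustre stripe count of 160 through ROMIO's
 * striping hints.  The hints only take effect when the file is created. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&info);

    /* "striping_factor" is the stripe count, "striping_unit" the stripe
     * size in bytes; both values here are illustrative. */
    MPI_Info_set(info, "striping_factor", "160");
    MPI_Info_set(info, "striping_unit", "1048576");

    MPI_File_open(MPI_COMM_WORLD, "dump.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... MPI_File_write_all calls go here ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}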

I'm not certain about the distinction between collective I/O and shared 
files. It seems the latter is to be determined by the application 
authors and their users. With, for instance, a mesh-type application, 
there might be trade-offs; but having at least all the data for one 
computed field or dynamic variable, for one time step saved to the 
output (we usually call these dumps), stored in the same file is 
regularly more advantageous for the entire work cycle. In fact, having 
all the output data in a single file seems like the most desirable 
approach.
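
As a purely hypothetical illustration of what I mean by a dump, the 
sketch below maps one block-decomposed 2-D field, for one time step, 
into disjoint regions of a single shared file through an MPI-IO file 
view; every name, dimension, and decomposition here is invented, and it 
assumes the row count divides evenly among the ranks:

/* Hypothetical dump of one 2-D field for one time step into one file. */
#include <mpi.h>
#include <stdlib.h>

#define NX 1024            /* global field dimensions (invented) */
#define NY 1024

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_File fh;
    MPI_Datatype filetype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* 1-D decomposition: each rank owns NX/nprocs rows of the field. */
    int gsizes[2] = { NX, NY };
    int lsizes[2] = { NX / nprocs, NY };
    int starts[2] = { rank * (NX / nprocs), 0 };
    double *field = calloc((size_t)lsizes[0] * lsizes[1], sizeof(double));

    MPI_Type_create_subarray(2, gsizes, lsizes, starts, MPI_ORDER_C,
                             MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "field_t0001.dump",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* Each rank's block lands in its own disjoint region of the one file. */
    MPI_File_write_all(fh, field, lsizes[0] * lsizes[1], MPI_DOUBLE,
                       MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(field);
    MPI_Finalize();
    return 0;
}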

Though MPI-IO independent writing operations can be on the lowest level, 
mustn't there still be some global coordination to determine where each 
processor's chunk of a communicator's complete vector of values is 
written, respectively, in a shared file? Further, what we now call 
two-phase aggregation, as a means of turning many tiny block-size 
writes into a small number of large ones to leverage the file system's 
greater efficiency for this type of activity, has potential benefits. 
However, I've seen the considerable ROMIO provisions for this 
technique used incorrectly, delivering a decrease in performance.
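
To make the question concrete, here is a rough C sketch of the pattern I 
have in mind: the only global coordination is an offset derived from the 
rank, and the ROMIO collective-buffering hint switches two-phase 
aggregation on or off. The buffer size, file name, and hint value are 
illustrative only:

/* Each rank writes a disjoint 20 MB chunk of one shared file; the
 * romio_cb_write hint toggles two-phase aggregation (collective
 * buffering). */
#include <mpi.h>
#include <stdlib.h>

#define CHUNK (20 * 1024 * 1024)   /* bytes per rank per dump (invented) */

int main(int argc, char **argv)
{
    int rank;
    char *buf;
    MPI_File fh;
    MPI_Info info;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = calloc(CHUNK, 1);

    MPI_Info_create(&info);
    /* "enable" forces aggregation; "disable" lets every rank stream its
     * chunk straight to the file system. */
    MPI_Info_set(info, "romio_cb_write", "enable");

    MPI_File_open(MPI_COMM_WORLD, "dump.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Disjoint placement: rank i owns bytes [i*CHUNK, (i+1)*CHUNK). */
    offset = (MPI_Offset)rank * CHUNK;
    MPI_File_write_at_all(fh, offset, buf, CHUNK, MPI_BYTE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(buf);
    MPI_Finalize();
    return 0;
}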

Marty Barnaby


Weikuan Yu wrote:
> I'll throw in some of my experience for discussion, as Shane mentioned 
> my name here :)
>
> (1) First, I am not under the impression that collective I/O is 
> designed to reveal the peak performance of a particular system. Well, 
> there are publications claiming that collective I/O might be a preferred 
> case for some particular architecture, e.g., for BG/P (HPCA06)... But I 
> am not fully positive or clear on the context of the claim.
>
> (2) With the intended disjoint access as Marty mentioned, using 
> collective I/O would only add a phase of coordination before each 
> process streams its data to the file system.
>
> (3) The testing Marty described below seems small, both in terms of 
> process count and the amount of data. Aside from whether this is good 
> for revealing the system's best performance, I can confirm that there 
> are applications at ORNL that have much larger process counts and bigger 
> data volumes from each process.
>
>  > In tests on redstorm from last year, I appended to a single, open file
>  > at a rate of 26 GB/s. I had to use exceptional parameters to achieve
>  > this, however: the file had an LFS stripe-count of 160, and I sent a
>  > 20 MB buffer, respectively, from each process of a 160-processor job,
>  > for an aggregate of 3.2 GB per write_all operation. I consider this
>  > configuration out of the range of any normal usage.
>
> IIRC, something regarding collective I/O has been discussed earlier, 
> also with Marty (?).
>
> At the upcoming IPDPS08 conference, ORNL has two papers on the I/O 
> performance of Jaguar, so you may find the numbers interesting. The 
> final versions of the papers should be available from me or Mark Fahey 
> (the author of the other paper).
>
> --Weikuan
>
>
> Canon, Richard Shane wrote:
>   
>> Marty,
>>
>> Our benchmark measurements were made using IOR doing POSIX IO to a
>> single shared file (I believe).  
>>
>> Since you mentioned MPI-IO...  Weikuan Yu (at ORNL) has done some work
>> to improve the MPI-IO Lustre ADIO driver.  Also, we have been sponsoring
>> work through a Lustre Centre of Excellence to further improve the ADIO
>> driver.  I'm optimistic that this can make collective IO perform at a
>> level that one would expect.  File-per-process runs often do run faster,
>> up until the metadata activity associated with creating 10k+ files
>> starts to slow things down.  I'm a firm believer that collective IO
>> through libraries like MPI-IO, HDF5, and pNetCDF is the way things
>> should move.  It should be possible to embed enough intelligence in
>> these middle layers to do good stripe alignment, automatically tune
>> stripe counts and stripe widths, etc.  Some of this will hopefully be
>> accomplished with the improvements being made to the ADIO.
>>
>> --Shane
>>
>> -----Original Message-----
>> From: lustre-discuss-bounces at lists.lustre.org
>> [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of mlbarna
>> Sent: Wednesday, January 16, 2008 12:43 PM
>> To: lustre-discuss at clusterfs.com
>> Subject: Re: [Lustre-discuss] Off-topic: largest existing Lustre file
>> system?
>>
>> Could you elaborate on the benchmarking application(s) that provided
>> these bandwidth numbers? I have a particular interest in MPI-coded
>> programs that perform collective I/O. In discussions, I find this topic
>> sometimes confused; my meaning is streamed appending, with all the data
>> from all the processors for a single, atomic write operation filling
>> disjoint sections of the same file. In MPI-IO, the MPI_File_write_all*
>> family seems to define my focus area, run with or without two-phase
>> aggregation. Imitating the operation with simple POSIX I/O is
>> acceptable, as far as I am concerned.
>>
>> In tests on redstorm from last year, I appended to a single, open file
>> at a rate of 26 GB/s. I had to use exceptional parameters to achieve
>> this, however: the file had an LFS stripe-count of 160, and I sent a
>> 20 MB buffer, respectively, from each process of a 160-processor job,
>> for an aggregate of 3.2 GB per write_all operation. I consider this
>> configuration out of the range of any normal usage.
>>
>> I believe that a faster rate could be achieved by a similar program that
>> wrote independently--that is, one-file-per-processor--such as via
>> NetCDF.
>> For this case, I would set the LFS stripe-count down to one.
>>
>>
>> Marty Barnaby
>>
>>
>>
>> On 1/14/08 4:11 PM, "Canon, Richard Shane" <canonrs at ornl.gov> wrote:
>>
>>     
>>> Jeff,
>>>
>>> I'm not aware of any.  For parallel file systems it is usually
>>> bandwidth centric.
>>>
>>> --Shane
>>>
>>> -----Original Message-----
>>> From: Kennedy, Jeffrey [mailto:jkennedy at qualcomm.com]
>>> Sent: Monday, January 14, 2008 4:56 PM
>>> To: Canon, Richard Shane; lustre-discuss at clusterfs.com
>>> Subject: RE: [Lustre-discuss] Off-topic: largest existing Lustre file
>>> system?
>>>
>>> Any specs on IOPS rather than throughput?
>>>
>>> Thanks.
>>>
>>> Jeff Kennedy
>>> QCT Engineering Compute
>>> 858-651-6592
>>>  
>>>       
>>>> -----Original Message-----
>>>> From: lustre-discuss-bounces at clusterfs.com [mailto:lustre-discuss-
>>>> bounces at clusterfs.com] On Behalf Of Canon, Richard Shane
>>>> Sent: Monday, January 14, 2008 1:49 PM
>>>> To: lustre-discuss at clusterfs.com
>>>> Subject: Re: [Lustre-discuss] Off-topic: largest existing Lustre file
>>>> system?
>>>>
>>>>
>>>> Klaus,
>>>>
>>>> Here are some that I know are pretty large.
>>>>
>>>> * RedStorm - I think it has two roughly 50 GB/s file systems.  The
>>>> capacity may not be quite as large though.  I think they used FC
>>>> drives.  It was DDN 8500 although that may have changed.
>>>> * CEA - I think they have a file system approaching 100 GB/s.  I
>>>> think it is DDN 9550.  Not sure about the capacities.
>>>> * TACC has a large Thumper-based system.  Not sure of the specs.
>>>> * ORNL - We have a 44 GB/s file system with around 800 TB of total
>>>> capacity.  That is DDN 9550.  We also have two new file systems (20
>>>> GB/s and 10 GB/s, currently LSI XBB2 and DDN 9550 respectively).
>>>> Those have around 800 TB each (after RAID6).
>>>> * We are planning a 200 GB/s, around 10 PB file system now.
>>>>
>>>> --Shane
>>>>
>>>> -----Original Message-----
>>>> From: lustre-discuss-bounces at clusterfs.com
>>>> [mailto:lustre-discuss-bounces at clusterfs.com] On Behalf Of D. Marc
>>>> Stearman
>>>> Sent: Monday, January 14, 2008 4:37 PM
>>>> To: lustre-discuss at clusterfs.com
>>>> Subject: Re: [Lustre-discuss] Off-topic: largest existing Lustre file
>>>> system?
>>>>
>>>> Klaus,
>>>>
>>>> We currently have a 1.2 PB Lustre filesystem that we will be expanding
>>>> to 2.4 PB in the near future.  I'm not sure about the highest sustained
>>>> IOPS, but we did have a user peak at 19 GB/s to one of our 500 TB
>>>> filesystems recently.  The backend for that was 16 DDN 8500 couplets
>>>> with write-cache turned OFF.
>>>>
>>>> -Marc
>>>>
>>>> ----
>>>> D. Marc Stearman
>>>> LC Lustre Systems Administrator
>>>> marc at llnl.gov
>>>> 925.423.9670
>>>> Pager: 1.888.203.0641
>>>>
>>>>
>>>> On Jan 14, 2008, at 12:41 PM, Klaus Steden wrote:
>>>>
>>>>         
>>>>> Hi there,
>>>>>
>>>>> I was asked by a friend of a business contact of mine the other day
>>>>> to share
>>>>> some information about Lustre; seems he's planning to build what
>>>>> will eventually be about a 3 PB file system.
>>>>>
>>>>> The CFS website doesn't appear to have any information on field
>>>>> deployments
>>>>> worth bragging about, so I figured I'd ask, just for fun; does
>>>>> anyone know:
>>>>>
>>>>> - the size of the largest working Lustre file system currently in
>>>>> the field
>>>>> - the highest sustained number of IOPS seen with Lustre, and what
>>>>> the backend was?
>>>>>
>>>>> cheers,
>>>>> Klaus
>>>>>
