[Lustre-discuss] Off-topic: largest existing Lustre file system?

Tom.Wang Tom.Wang at Sun.COM
Thu Jan 31 13:11:43 PST 2008


Marty Barnaby wrote:
> I concur that at 160, my processor count was low. At the time, I had 
> access to as many as 1000 on our big Cray XT3, Redstorm, but now it is 
> not available to me at all. Through my trials, I found that, for 
> appending to a single, shared file, matching the lfs maximum stripe 
> count of 160 with an equivalent job size was the only combination 
> that got me up to this rate. My interest in this case ends here, 
> because real usage involves processor counts in the thousands.
>
> I'm not certain about the distinctions of collective I/O and shared 
> files. It seems the latter is to be determined by the application 
> authors and their users. With, for instance, a mesh-type application, 
> there might be trade-offs; but having, at least, all the data for one 
> computed field or dynamic, for one time step saved to the 
> output (we usually call these dumps), stored in the same file is 
> regularly more advantageous for the entire work cycle. Actually, 
> having all the output data in a single file seems like the most 
> desirable approach.
>
> Though MPI-IO independent writing operations can be on the lowest 
> level, mustn't there still be some global coordination to determine 
> where each processor's chunk of a communicator's complete vector of 
> values is written, respectively, in a shared file? Further, what we 
> now call two-phase aggregation, as a means of turning many tiny 
> block-size writes into a small number of large ones to leverage the 
> greater efficiency of the FS in responding to this type of activity, 
> has potential benefits. However, I've seen the considerable ROMIO 
> provisions for this technique used incorrectly, delivering a decrease 
> in performance.
Collective I/O should reorganize the data among the I/O nodes into 
some kind of pattern the underlying file system prefers, IMHO.
As for Lustre, it is sometimes sensitive to I/O size (or alignment), 
especially on Redstorm: since there is no client cache, the client
sends each I/O request to the server as soon as it arrives from the 
application. So collective I/O is the only layer, apart from the
application itself, where I/O requests can be optimized on the client. 
But the current two-phase aggregation does not take much
file-system-specific information into account. For example, it splits 
data evenly across all the I/O clients instead of considering
stripe size (alignment) and which OST the data goes to (sometimes an 
even client I/O load != an even OST I/O load); data sieving sometimes 
uses read-modify-write improperly; and there is no aggregation for 
non-interleaved data.
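
To make the data-sieving and buffering knobs concrete: ROMIO exposes 
them as hints. A minimal sketch in C (the hint names are ROMIO's 
documented ones; the values are only illustrative and should match the 
file's real stripe settings):

    #include <mpi.h>

    /* Sketch: disable data sieving for writes and size the two-phase
     * collective buffer to the stripe size, via documented ROMIO hints.
     * Values are illustrative only. */
    int open_with_hints(MPI_Comm comm, const char *path, MPI_File *fh)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_ds_write", "disable"); /* avoid RMW */
        MPI_Info_set(info, "romio_cb_write", "enable");  /* keep two-phase */
        MPI_Info_set(info, "cb_buffer_size", "1048576"); /* 1 MB stripe */
        int rc = MPI_File_open(comm, (char *)path,
                               MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
        MPI_Info_free(&info);
        return rc;
    }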

Our current experience is to be careful with collective I/O and 
data sieving unless we are very clear about
how the ADIO driver will reorganize the data. For example, for the 
HDF5 lib, the mpiposix layer (no collective I/O or data sieving) is 
favorable. But I do agree with Shane: collective I/O with HDF5 and 
pNetCDF should be the way moving forward.
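
For reference, switching HDF5 to that layer is a single property-list 
call. A sketch (the use_gpfs argument follows the HDF5 1.6-era 
signature; check your HDF5 version):

    #include <hdf5.h>
    #include <mpi.h>

    /* Sketch: create a file through HDF5's MPI-POSIX driver, which
     * bypasses MPI-IO collectives and data sieving entirely. */
    hid_t create_mpiposix_file(const char *path, MPI_Comm comm)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpiposix(fapl, comm, 0 /* use_gpfs */);
        hid_t file = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
        H5Pclose(fapl);
        return file;
    }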

WangDi
>
> Marty Barnaby
>
>
> Weikuan Yu wrote:
>> I would throw in some of my experience for discussion, as Shane mentioned 
>> my name here :)
>>
>> (1) First, I am not under the impression that collective I/O is 
>> designed to reveal the peak performance of a particular system. Well, 
>> there are publications claiming that collective I/O might be a preferred 
>> case for some particular architecture, e.g., for BG/P (HPCA06)... But I 
>> am not fully positive or clear on the context of the claim.
>>
>> (2) With the intended disjoint access as Marty mentioned, using 
>> collective I/O would only add a phase of coordination before each 
>> process streams its data to the file system.
>>
>> (3) The tests Marty described below seem small, both in terms of 
>> process count and the amount of data. Aside from whether this is good 
>> for revealing the system's best performance, I can confirm that there 
>> are applications at ORNL that have much larger process counts and 
>> bigger data volumes from each process.
>>
>>  > In tests on redstorm from last year, I appended to a single, open
>>  > file at a rate of 26 GB/s. I had to use exceptional parameters to
>>  > achieve this, however: the file had an LFS stripe-count of 160, and
>>  > I sent a 20 MB buffer, respectively, from each process of a
>>  > 160-processor job, for an aggregate of 3.2 GB per write_all
>>  > operation. I consider this configuration out of the range of any
>>  > normal usage.
>>
>> IIRC, something regarding collective I/O has been discussed earlier, 
>> also with Marty (?).
>>
>> In the upcoming IPDPS08 conference, ORNL has two papers on the I/O 
>> performance of Jaguar, so you may find the numbers interesting. The 
>> final versions of the papers should be available from me or Mark Fahey 
>> (the author of the other paper).
>>
>> --Weikuan
>>
>>
>> Canon, Richard Shane wrote:
>>   
>>> Marty,
>>>
>>> Our benchmark measurements were made using IOR doing POSIX IO to a
>>> single shared file (I believe).  
>>>
>>> Since you mentioned MPI-IO...  Weikuan Yu (at ORNL) has done some work
>>> to improve the MPI-IO Lustre ADIO driver.  Also, we have been sponsoring
>>> work through a Lustre Centre of Excellence to further improve the ADIO
>>> driver.  I'm optimistic that this can make collective IO perform at a
>>> level that one would expect.  File-per-process runs often do run faster
>>> up until the metadata activity associated with creating 10k+ files
>>> starts to slow things down.  I'm a firm believer that collective IO
>>> through libraries like MPI-IO, HDF5, and pNetCDF is the way things
>>> should move.  It should be possible to embed enough intelligence in
>>> these middle layers to do good stripe alignment, automatically tune
>>> stripe counts and stripe widths, etc.  Some of this will hopefully be
>>> accomplished with the improvements being made to the ADIO.
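>>>
>>> (As a sketch of what such a layer could do today: ROMIO already accepts
>>> striping hints at file-creation time, so the middle layer could set them
>>> on the application's behalf. The hint names below are ROMIO's documented
>>> ones; whether a given Lustre ADIO driver honors them is another matter.)
>>>
>>>     #include <mpi.h>
>>>
>>>     /* Sketch: choose stripe settings at create time via ROMIO's
>>>      * documented striping hints; the values are illustrative. */
>>>     int create_striped(const char *path, MPI_File *fh)
>>>     {
>>>         MPI_Info info;
>>>         MPI_Info_create(&info);
>>>         MPI_Info_set(info, "striping_factor", "16");    /* stripe count */
>>>         MPI_Info_set(info, "striping_unit", "1048576"); /* 1 MB stripes */
>>>         int rc = MPI_File_open(MPI_COMM_WORLD, (char *)path,
>>>                                MPI_MODE_CREATE | MPI_MODE_WRONLY,
>>>                                info, fh);
>>>         MPI_Info_free(&info);
>>>         return rc;
>>>     }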
>>>
>>> --Shane
>>>
>>> -----Original Message-----
>>> From: lustre-discuss-bounces at lists.lustre.org
>>> [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of mlbarna
>>> Sent: Wednesday, January 16, 2008 12:43 PM
>>> To: lustre-discuss at clusterfs.com
>>> Subject: Re: [Lustre-discuss] Off-topic: largest existing Lustre file
>>> system?
>>>
>>> Could you elaborate on the benchmarking application(s) that provided
>>> these bandwidth numbers? I have a particular interest in MPI-coded
>>> programs that perform collective I/O. In discussions, I find this topic
>>> sometimes confused; my meaning is streamed appending, with all the data
>>> from all the processors for a single, atomic write operation filling
>>> disjoint sections of the same file. In MPI-IO, the MPI_File_write_all*
>>> family seems to define my focus area, run with or without two-phase
>>> aggregation. Imitating the operation with simple POSIX I/O is
>>> acceptable, as far as I am concerned.
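>>>
>>> (For concreteness, a minimal sketch of the pattern I mean, with each
>>> rank's disjoint offset derived from its rank; the 20 MB chunk matches
>>> the test below, and the file name is made up:)
>>>
>>>     #include <mpi.h>
>>>     #include <stdlib.h>
>>>     #include <string.h>
>>>
>>>     #define CHUNK (20 * 1024 * 1024)  /* 20 MB per process */
>>>
>>>     int main(int argc, char **argv)
>>>     {
>>>         int rank;
>>>         MPI_Init(&argc, &argv);
>>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>
>>>         char *buf = malloc(CHUNK);
>>>         memset(buf, 0, CHUNK);        /* stand-in for real field data */
>>>
>>>         MPI_File fh;
>>>         MPI_File_open(MPI_COMM_WORLD, "shared.out",
>>>                       MPI_MODE_CREATE | MPI_MODE_WRONLY,
>>>                       MPI_INFO_NULL, &fh);
>>>
>>>         /* Every rank fills a disjoint section of the same file; the
>>>          * _all variant makes the write collective (two-phase eligible). */
>>>         MPI_Offset off = (MPI_Offset)rank * CHUNK;
>>>         MPI_File_write_at_all(fh, off, buf, CHUNK, MPI_BYTE,
>>>                               MPI_STATUS_IGNORE);
>>>
>>>         MPI_File_close(&fh);
>>>         free(buf);
>>>         MPI_Finalize();
>>>         return 0;
>>>     }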
>>>
>>> In tests on redstorm from last year, I appended to a single, open file
>>> at a rate of 26 GB/s. I had to use exceptional parameters to achieve
>>> this, however: the file had an LFS stripe-count of 160, and I sent a
>>> 20 MB buffer, respectively, from each process of a 160-processor job,
>>> for an aggregate of 3.2 GB per write_all operation. I consider this
>>> configuration out of the range of any normal usage.
>>>
>>> I believe that a faster rate could be achieved by a similar program
>>> that wrote independently--that is, one-file-per-processor--such as via
>>> NetCDF. For this case, I would set the LFS stripe-count down to one.
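>>>
>>> (Roughly, the independent variant; each rank writes its own file, so no
>>> coordination is needed. The "out.%04d" naming scheme is made up, and the
>>> stripe count would be set to one beforehand, e.g. with lfs setstripe on
>>> the target directory:)
>>>
>>>     #include <mpi.h>
>>>     #include <stdio.h>
>>>
>>>     /* Sketch: one-file-per-processor output via plain POSIX stdio. */
>>>     void write_own_file(const void *data, size_t n)
>>>     {
>>>         int rank;
>>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>         char name[32];
>>>         snprintf(name, sizeof(name), "out.%04d", rank);
>>>         FILE *f = fopen(name, "wb");
>>>         fwrite(data, 1, n, f);
>>>         fclose(f);
>>>     }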
>>>
>>>
>>> Marty Barnaby
>>>
>>>
>>>
>>> On 1/14/08 4:11 PM, "Canon, Richard Shane" <canonrs at ornl.gov> wrote:
>>>
>>>> Jeff,
>>>>
>>>> I'm not aware of any.  For parallel file systems it is usually
>>>> bandwidth centric.
>>>>
>>>> --Shane
>>>>
>>>> -----Original Message-----
>>>> From: Kennedy, Jeffrey [mailto:jkennedy at qualcomm.com]
>>>> Sent: Monday, January 14, 2008 4:56 PM
>>>> To: Canon, Richard Shane; lustre-discuss at clusterfs.com
>>>> Subject: RE: [Lustre-discuss] Off-topic: largest existing Lustre file
>>>> system?
>>>>
>>>> Any specs on IOPS rather than throughput?
>>>>
>>>> Thanks.
>>>>
>>>> Jeff Kennedy
>>>> QCT Engineering Compute
>>>> 858-651-6592
>>>>  
>>>>       
>>>>> -----Original Message-----
>>>>> From: lustre-discuss-bounces at clusterfs.com [mailto:lustre-discuss-
>>>>> bounces at clusterfs.com] On Behalf Of Canon, Richard Shane
>>>>> Sent: Monday, January 14, 2008 1:49 PM
>>>>> To: lustre-discuss at clusterfs.com
>>>>> Subject: Re: [Lustre-discuss] Off-topic: largest existing Lustre file
>>>>> system?
>>>>>
>>>>>
>>>>> Klaus,
>>>>>
>>>>> Here are some that I know are pretty large.
>>>>>
>>>>> * RedStorm - I think it has two roughly 50 GB/s file systems.  The
>>>>> capacity may not be quite as large though.  I think they used FC
>>>>> drives.  It was DDN 8500 although that may have changed.
>>>>> * CEA - I think they have a file system approaching 100 GB/s.  I
>>>>> think it is DDN 9550.  Not sure about the capacities.
>>>>> * TACC has a large Thumper-based system.  Not sure of the specs.
>>>>> * ORNL - We have a 44 GB/s file system with around 800 TB of total
>>>>> capacity.  That is DDN 9550.  We also have two new file systems
>>>>> (20 GB/s and 10 GB/s, currently LSI XBB2 and DDN 9550 respectively).
>>>>> Those have around 800 TB each (after RAID6).
>>>>> * We are planning a 200 GB/s, around 10 PB file system now.
>>>>>
>>>>> --Shane
>>>>>
>>>>> -----Original Message-----
>>>>> From: lustre-discuss-bounces at clusterfs.com
>>>>> [mailto:lustre-discuss-bounces at clusterfs.com] On Behalf Of D. Marc
>>>>> Stearman
>>>>> Sent: Monday, January 14, 2008 4:37 PM
>>>>> To: lustre-discuss at clusterfs.com
>>>>> Subject: Re: [Lustre-discuss] Off-topic: largest existing Lustre file
>>>>> system?
>>>>>
>>>>> Klaus,
>>>>>
>>>>> We currently have a 1.2 PB lustre filesystem that we will be expanding
>>>>> to 2.4 PB in the near future.  I'm not sure about the highest sustained
>>>>> IOPS, but we did have a user peak at 19 GB/s to one of our 500 TB
>>>>> filesystems recently.  The backend for that was 16 DDN 8500 couplets
>>>>> with write-cache turned OFF.
>>>>>
>>>>> -Marc
>>>>>
>>>>> ----
>>>>> D. Marc Stearman
>>>>> LC Lustre Systems Administrator
>>>>> marc at llnl.gov
>>>>> 925.423.9670
>>>>> Pager: 1.888.203.0641
>>>>>
>>>>>
>>>>> On Jan 14, 2008, at 12:41 PM, Klaus Steden wrote:
>>>>>
>>>>>         
>>>>>> Hi there,
>>>>>>
>>>>>> I was asked by a friend of a business contact of mine the other day
>>>>>> to share some information about Lustre; seems he's planning to build
>>>>>> what will eventually be about a 3 PB file system.
>>>>>>
>>>>>> The CFS website doesn't appear to have any information on field
>>>>>> deployments worth bragging about, so I figured I'd ask, just for fun;
>>>>>> does anyone know:
>>>>>>
>>>>>> - the size of the largest working Lustre file system currently in
>>>>>>   the field
>>>>>> - the highest sustained number of IOPS seen with Lustre, and what
>>>>>>   the backend was?
>>>>>>
>>>>>> cheers,
>>>>>> Klaus
>>>>>>
>



