[Lustre-discuss] Off-topic: largest existing Lustre file system?

Weikuan Yu weikuan.yu at gmail.com
Thu Jan 31 09:13:18 PST 2008


I'll throw in some of my experience for discussion, since Shane 
mentioned my name here :)

(1) First, I am not under the impression that collective I/O is 
designed to reveal the peak performance of a particular system. There 
are publications claiming that collective I/O might be a preferred 
approach on some particular architectures, e.g., BG/P (HPCA06), but I 
am not entirely clear on the context of that claim.

(2) With the intended disjoint access that Marty mentioned, using 
collective I/O would only add a coordination phase before each process 
streams its data to the file system.
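
For concreteness, here is a minimal sketch of the disjoint access 
pattern in question (illustrative only; the file name and buffer size 
are arbitrary). Each rank owns one contiguous block of a shared file 
and writes it with a single collective call, so the only cost added 
over independent streaming is the coordination/aggregation step inside 
the MPI-IO layer:

/* Minimal sketch (illustrative only): each rank writes one contiguous,
 * disjoint 1 MiB block of a shared file with a single collective call.
 * The coordination phase mentioned above happens inside
 * MPI_File_write_at_all; the file name and buffer size are made up. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int bufsize = 1 << 20;              /* 1 MiB per rank (example) */
    char *buf = malloc(bufsize);
    for (int i = 0; i < bufsize; i++)
        buf[i] = (char)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Disjoint sections of the same file:
     * rank r owns [r * bufsize, (r + 1) * bufsize). */
    MPI_Offset offset = (MPI_Offset)rank * bufsize;

    /* Collective write: the MPI-IO layer may aggregate (two-phase)
     * before the data is streamed to the file system. */
    MPI_File_write_at_all(fh, offset, buf, bufsize, MPI_BYTE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}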

(3) The test Marty describes below seems small, both in terms of 
process count and the amount of data. Setting aside whether it is a 
good way to reveal the system's best performance, I can confirm that 
there are applications at ORNL with much larger process counts and a 
larger data volume from each process.

 > In tests on redstorm from last year, I appended to a single, open
 > file at a rate of 26 GB/s. I had to use exceptional parameters to
 > achieve this, however: the file had an LFS stripe count of 160, and
 > each process in a 160-processor job sent a 20 MB buffer, for an
 > aggregate of 3.2 GB per write_all operation. I consider this
 > configuration outside the range of any normal usage.
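
To make the quoted configuration concrete, here is a rough sketch of 
what such a run might look like through MPI-IO (illustrative only, not 
the actual test code). The "striping_factor" and "striping_unit" hints 
are ROMIO's; whether a given MPI's Lustre ADIO driver honors them 
varies, so pre-creating the file with "lfs setstripe -c 160" is the 
safer route. The file name and the 1 MiB stripe size are arbitrary:

/* Rough sketch of the quoted configuration (not the actual test code):
 * 160 ranks, 20 MB per rank, one shared file with a requested stripe
 * count of 160.  The hint names are ROMIO's; support depends on the
 * Lustre ADIO driver in use. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);       /* 160 in the quoted test */

    const int bufsize = 20 * 1024 * 1024;         /* 20 MB per rank */
    char *buf = calloc(bufsize, 1);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "160");    /* LFS stripe count  */
    MPI_Info_set(info, "striping_unit", "1048576");  /* 1 MiB stripe size */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "append_test.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* One collective write lands nprocs * 20 MB = 3.2 GB at 160 ranks. */
    MPI_Offset offset = (MPI_Offset)rank * bufsize;
    MPI_File_write_at_all(fh, offset, buf, bufsize, MPI_BYTE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(buf);
    MPI_Finalize();
    return 0;
}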

IIRC, something regarding collective I/O has been discussed earlier, 
also with Marty (?).

In the upcoming IPDPS08 conference, ORNL has two papers on the I/O 
performance of Jaguar, so you may find the numbers interesting. The 
final versions of the papers should be available from me or from Mark 
Fahey (the author of the other paper).

--Weikuan


Canon, Richard Shane wrote:
> Marty,
> 
> Our benchmark measurements were made using IOR doing POSIX IO to a
> single shared file (I believe).  
> 
> Since you mentioned MPI-IO...  Weikuan Yu (at ORNL) has done some work
> to improve the MPI-IO Lustre ADIO driver.  We have also been sponsoring
> work through a Lustre Centre of Excellence to further improve the ADIO
> driver.  I'm optimistic that this can make collective IO perform at the
> level one would expect.  File-per-process runs often do run faster, up
> until the metadata activity associated with creating 10k+ files starts
> to slow things down.  I'm a firm believer that collective IO through
> libraries like MPI-IO, HDF5, and pNetCDF is the way things should move.
> It should be possible to embed enough intelligence in these middle
> layers to do good stripe alignment, automatically tune stripe counts
> and stripe widths, etc.  Some of this will hopefully be accomplished
> with the improvements being made to the ADIO.
> 
> --Shane
> 
> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org
> [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of mlbarna
> Sent: Wednesday, January 16, 2008 12:43 PM
> To: lustre-discuss at clusterfs.com
> Subject: Re: [Lustre-discuss] Off-topic: largest existing Lustre file
> system?
> 
> Could you elaborate on the benchmarking application(s) that produced
> these bandwidth numbers?  I have a particular interest in MPI-coded
> programs that perform collective I/O.  In discussions, I find this
> topic is sometimes confused; what I mean is streamed, appending writes
> in which all the data from all the processors fills disjoint sections
> of the same file in a single, atomic write operation.  In MPI-IO, the
> MPI_File_write_all* family seems to define my focus area, run with or
> without two-phase aggregation.  Imitating the operation with simple
> POSIX I/O is acceptable, as far as I am concerned.
> 
> In tests on redstorm from last year, I appended to a single, open
> file at a rate of 26 GB/s. I had to use exceptional parameters to
> achieve this, however: the file had an LFS stripe count of 160, and
> each process in a 160-processor job sent a 20 MB buffer, for an
> aggregate of 3.2 GB per write_all operation. I consider this
> configuration outside the range of any normal usage.
> 
> I believe a faster rate could be achieved by a similar program that
> wrote independently--that is, one file per processor--such as via
> NetCDF.  For that case, I would set the LFS stripe count down to one.
> 
> 
> Marty Barnaby
> 
> 
> 
> On 1/14/08 4:11 PM, "Canon, Richard Shane" <canonrs at ornl.gov> wrote:
> 
>> Jeff,
>>
>> I'm not aware of any.  For parallel file systems it is usually
> bandwidth
>> centric.
>>
>> --Shane
>>
>> -----Original Message-----
>> From: Kennedy, Jeffrey [mailto:jkennedy at qualcomm.com]
>> Sent: Monday, January 14, 2008 4:56 PM
>> To: Canon, Richard Shane; lustre-discuss at clusterfs.com
>> Subject: RE: [Lustre-discuss] Off-topic: largest existing Lustre file
>> system?
>>
>> Any specs on IOPS rather than throughput?
>>
>> Thanks.
>>
>> Jeff Kennedy
>> QCT Engineering Compute
>> 858-651-6592
>>  
>>> -----Original Message-----
>>> From: lustre-discuss-bounces at clusterfs.com [mailto:lustre-discuss-
>>> bounces at clusterfs.com] On Behalf Of Canon, Richard Shane
>>> Sent: Monday, January 14, 2008 1:49 PM
>>> To: lustre-discuss at clusterfs.com
>>> Subject: Re: [Lustre-discuss] Off-topic: largest existing Lustre file
>>> system?
>>>
>>>
>>> Klaus,
>>>
>>> Here are some that I know are pretty large.
>>>
>>> * RedStorm - I think it has two roughly 50 GB/s file systems.  The
>>> capacity may not be quite as large, though.  I think they used FC
>>> drives.  It was DDN 8500, although that may have changed.
>>> * CEA - I think they have a file system approaching 100 GB/s.  I
>>> think it is DDN 9550.  Not sure about the capacities.
>>> * TACC has a large Thumper-based system.  Not sure of the specs.
>>> * ORNL - We have a 44 GB/s file system with around 800 TB of total
>>> capacity.  That is DDN 9550.  We also have two new file systems
>>> (20 GB/s and 10 GB/s, currently LSI XBB2 and DDN 9550 respectively).
>>> Those have around 800 TB each (after RAID6).
>>> * We are planning a 200 GB/s, around 10 PB file system now.
>>>
>>> --Shane
>>>
>>> -----Original Message-----
>>> From: lustre-discuss-bounces at clusterfs.com
>>> [mailto:lustre-discuss-bounces at clusterfs.com] On Behalf Of D. Marc
>>> Stearman
>>> Sent: Monday, January 14, 2008 4:37 PM
>>> To: lustre-discuss at clusterfs.com
>>> Subject: Re: [Lustre-discuss] Off-topic: largest existing Lustre file
>>> system?
>>>
>>> Klaus,
>>>
>>> We currently have a 1.2 PB Lustre filesystem that we will be
>>> expanding to 2.4 PB in the near future.  I'm not sure about the
>>> highest sustained IOPS, but we did have a user peak at 19 GB/s to
>>> one of our 500 TB filesystems recently.  The backend for that was
>>> 16 DDN 8500 couplets with write-cache turned OFF.
>>>
>>> -Marc
>>>
>>> ----
>>> D. Marc Stearman
>>> LC Lustre Systems Administrator
>>> marc at llnl.gov
>>> 925.423.9670
>>> Pager: 1.888.203.0641
>>>
>>>
>>> On Jan 14, 2008, at 12:41 PM, Klaus Steden wrote:
>>>
>>>> Hi there,
>>>>
>>>> I was asked by a friend of a business contact of mine the other
>>>> day to share some information about Lustre; it seems he's planning
>>>> to build what will eventually be about a 3 PB file system.
>>>>
>>>> The CFS website doesn't appear to have any information on field
>>>> deployments worth bragging about, so I figured I'd ask, just for
>>>> fun; does anyone know:
>>>>
>>>> - the size of the largest working Lustre file system currently in
>>>>   the field?
>>>> - the highest sustained number of IOPS seen with Lustre, and what
>>>>   the backend was?
>>>>
>>>> cheers,
>>>> Klaus
>>>>