<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

I concur that at 160, my processor count was low. At the time, I had

access to as many as 1000 on our big Cray XT3, Redstorm, but now it is

not available to me at all. Through my trials, I found that, for

appending to a single, shared file, matching the lfs maximum stripe

count of 160 with an equal job size was the only combination

that got me up to this rate. My interest in this case ends here,

because real usage involves processor counts in the thousands.<br>

<br>

I'm not certain about the distinctions of collective I/O and shared

files. It seems the latter is to be determined by the application

authors and their users. With, for instance, a mesh-type application,

there might be trade-offs; but having, at least, all the data for one

computed field or dynamic, for one time step that is saved to the

output (we usually call these dumps), stored in the same file is

regularly more advantageous for the entire work cycle. Actually, having

all the output data in a single file seems like the most desirable

approach.<br>

<br>

Though MPI-IO independent writing operations can be on the lowest

level, mustn't there still be some global coordination to determine

where each processor's chunk of a communicator's complete vector of

values is written, respectively, in a shared file? Further, what we now

call two-phase aggregation, as a means of turning many, tiny block-size

writes into a small number of large ones to leverage the greater

efficiency of the FS to respond to this type of activity, has potential

benefits. However, I've seen the considerable ROMIO provisions for this

technique used incorrectly, delivering a decrease in performance.<br>
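<br>

As a rough illustration of that coordination (a sketch only, not taken from any application discussed here): when each process already knows its own chunk size, the one global step needed for disjoint placement is an exclusive prefix sum over the sizes, which is what MPI_Exscan computes. The Python below does that sum serially. The function name exclusive_offsets and the decimal 20 MB chunk size are assumptions made for the example.<br>

<pre wrap="">
```python
def exclusive_offsets(chunk_sizes, base=0):
    """Return (offsets, end) for disjoint chunks laid out back to back.

    chunk_sizes[i] is the number of bytes rank i will write.
    offsets[i] is where rank i starts: the sum of all earlier chunk sizes
    plus base (an exclusive prefix sum). end is the first byte past the
    last chunk, i.e. the base for the next appended dump.
    """
    offsets = []
    position = base
    for size in chunk_sizes:
        offsets.append(position)  # rank starts where the previous one ended
        position += size
    return offsets, position


if __name__ == "__main__":
    MB = 10**6  # decimal megabytes, assumed for illustration
    offsets, end = exclusive_offsets([20 * MB] * 160)
    # Offsets of rank 1 and rank 159, then the total bytes per dump.
    print(offsets[1], offsets[159], end)
```
</pre>

With equal chunks this reduces to rank times chunk size; the prefix sum only matters when chunk sizes differ between processes, as they might for an unevenly partitioned mesh.<br>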

<br>

Marty Barnaby<br>

<br>

<br>

Weikuan Yu wrote:

<blockquote cite="mid:47A201AE.6040801@gmail.com" type="cite">

  <pre wrap="">I would throw in some of my experience for discussion as Shane mentioned 

my name here :)

(1) First, I am not under the impression that the collective I/O is 

designed to reveal the peak performance of a particular system. Well, 

there are publications claiming that collective I/O might be a preferred 

case for some particular architecture, e.g., for BG/P (HPCA06)... But I 

am not fully positive or clear on the context of the claim.

(2) With the intended disjoint access as Marty mentioned, using 

collective I/O would only add a phase of coordination before each 

process streams its data to the file system.

(3) The testing as Marty described below seems small, both in terms of 

process counts and the amount of data. Aside from whether this is good 

for revealing the system's best performance, I can confirm that there 

are applications at ORNL that have much larger process counts and bigger 

data volume from each process.

 > In tests on redstorm from last year, I appended to a single, open file

 > at a

 > rate of 26 GB/s. I had to use exceptional parameters to achieve this

 > however: the file had an LFS stripe-count of 160, and I sent a 20 MB

 > buffer,

 > respectively, from each of a 160, total processor job, for an aggregate

 > of

 > 3.2 GB per write_all operation. I consider this configuration out of the

 > range of any normal usage.

IIRC, something regarding collective I/O has been discussed earlier, 

also with Marty (?).

In the upcoming IPDPS08 conference, ORNL has two papers on I/O 

performance of Jaguar. So you may find the numbers interesting. The 

final version of the papers should be available from me or Mark Fahey 

(the author of another paper).

--Weikuan

Canon, Richard Shane wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap="">Marty,

Our benchmark measurements were made using IOR doing POSIX IO to a

single shared file (I believe).  

Since you mentioned MPI-IO...  Weikuan Yu (at ORNL) has done some work

to improve the MPI-IO Lustre ADIO driver.  Also, we have been sponsoring

work through a Lustre Centre of Excellence to further improve the ADIO

driver.  I'm optimistic that this can make collective IO perform at a

level that one would expect.  File-per-process runs often do run faster

up until the metadata activity associated with creating 10k+ files

starts to slow things down.  I'm a firm believer that collective IO

through libraries like MPI-IO, HDF5, and pNetCDF is the way things

should move.  It should be possible to embed enough intelligence in

these middle layers to do good stripe alignment, automatically tune

stripe counts, and stripe width, etc. Some of this will hopefully be

accomplished with the improvements being made to the ADIO.

--Shane

-----Original Message-----

From: <a class="moz-txt-link-abbreviated" href="mailto:lustre-discuss-bounces@lists.lustre.org">lustre-discuss-bounces@lists.lustre.org</a>

[<a class="moz-txt-link-freetext" href="mailto:lustre-discuss-bounces@lists.lustre.org">mailto:lustre-discuss-bounces@lists.lustre.org</a>] On Behalf Of mlbarna

Sent: Wednesday, January 16, 2008 12:43 PM

To: <a class="moz-txt-link-abbreviated" href="mailto:lustre-discuss@clusterfs.com">lustre-discuss@clusterfs.com</a>

Subject: Re: [Lustre-discuss] Off-topic: largest existing Lustre file

system?

Could you elaborate on the benchmarking application(s) run that provided

these bandwidth numbers? I have a particular interest in MPI coded

programs

that perform collective I/O. In discussions, I find this topic sometimes

confused; my meaning is streamed, appending with all the data from all

the

processors for a single, atomic write operation filling disjoint

sections of

the same file. In MPI-IO, the MPI_File_write_all* family seems to define

my

focus area, run with or without two-phase aggregation. Imitating the

operation with simple, POSIX I/O is acceptable, as far as I am

concerned.

In tests on redstorm from last year, I appended to a single, open file

at a

rate of 26 GB/s. I had to use exceptional parameters to achieve this

however: the file had an LFS stripe-count of 160, and I sent a 20 MB

buffer,

respectively, from each of a 160, total processor job, for an aggregate

of

3.2 GB per write_all operation. I consider this configuration out of the

range of any normal usage.

I believe that a faster rate could be achieved by a similar program that

wrote independently--that is, one-file-per-processor--such as via

NetCDF.

For this case, I would set the LFS stripe-count down to one.

Marty Barnaby

On 1/14/08 4:11 PM, "Canon, Richard Shane" <a class="moz-txt-link-rfc2396E" href="mailto:canonrs@ornl.gov">&lt;canonrs@ornl.gov&gt;</a> wrote:

    </pre>

    <blockquote type="cite">

      <pre wrap="">Jeff,

I'm not aware of any.  For parallel file systems it is usually

bandwidth centric.

--Shane

-----Original Message-----

From: Kennedy, Jeffrey [<a class="moz-txt-link-freetext" href="mailto:jkennedy@qualcomm.com">mailto:jkennedy@qualcomm.com</a>]

Sent: Monday, January 14, 2008 4:56 PM

To: Canon, Richard Shane; <a class="moz-txt-link-abbreviated" href="mailto:lustre-discuss@clusterfs.com">lustre-discuss@clusterfs.com</a>

Subject: RE: [Lustre-discuss] Off-topic: largest existing Lustre file

system?

Any spec's on IOPS rather than throughput?

Thanks.

Jeff Kennedy

QCT Engineering Compute

858-651-6592

      </pre>

      <blockquote type="cite">

        <pre wrap="">-----Original Message-----

From: <a class="moz-txt-link-abbreviated" href="mailto:lustre-discuss-bounces@clusterfs.com">lustre-discuss-bounces@clusterfs.com</a>

[<a class="moz-txt-link-freetext" href="mailto:lustre-discuss-bounces@clusterfs.com">mailto:lustre-discuss-bounces@clusterfs.com</a>] On Behalf Of Canon, Richard Shane

Sent: Monday, January 14, 2008 1:49 PM

To: <a class="moz-txt-link-abbreviated" href="mailto:lustre-discuss@clusterfs.com">lustre-discuss@clusterfs.com</a>

Subject: Re: [Lustre-discuss] Off-topic: largest existing Lustre file

system?

Klaus,

Here are some that I know are pretty large.

* RedStorm - I think it has two roughly 50 GB/s file systems.  The

capacity may not be quite as large though.  I think they used FC

drives.  It was DDN 8500 although that may have changed.

* CEA - I think they have a file system approaching 100 GB/s.  I

think it is DDN 9550.  Not sure about the capacities.

* TACC has a large Thumper based system.  Not sure of the specs.

* ORNL - We have a 44 GB/s file system with around 800 TB of total

capacity.  That is DDN 9550.  We also have two new file systems (20 GB/s

and 10 GB/s currently LSI XBB2 and DDN 9550 respectively).  Those have

around 800 TB each (after RAID6).

* We are planning a 200 GB/s, around 10 PB file system now.

--Shane

-----Original Message-----

From: <a class="moz-txt-link-abbreviated" href="mailto:lustre-discuss-bounces@clusterfs.com">lustre-discuss-bounces@clusterfs.com</a>

[<a class="moz-txt-link-freetext" href="mailto:lustre-discuss-bounces@clusterfs.com">mailto:lustre-discuss-bounces@clusterfs.com</a>] On Behalf Of D. Marc

Stearman

Sent: Monday, January 14, 2008 4:37 PM

To: <a class="moz-txt-link-abbreviated" href="mailto:lustre-discuss@clusterfs.com">lustre-discuss@clusterfs.com</a>

Subject: Re: [Lustre-discuss] Off-topic: largest existing Lustre file

system?

Klaus,

We currently have a 1.2PB lustre filesystem that we will be expanding

to 2.4PB in the near future.  I'm not sure about the highest sustained

IOPS, but we did have a user peak 19GB/s to one of our 500TB

filesystems recently. The backend for that was 16 DDN 8500 couplets

with write-cache turned OFF.

-Marc

----

D. Marc Stearman

LC Lustre Systems Administrator

<a class="moz-txt-link-abbreviated" href="mailto:marc@llnl.gov">marc@llnl.gov</a>

925.423.9670

Pager: 1.888.203.0641

On Jan 14, 2008, at 12:41 PM, Klaus Steden wrote:

        </pre>

        <blockquote type="cite">

          <pre wrap="">Hi there,

I was asked by a friend of a business contact of mine the other day

to share

some information about Lustre; seems he's planning to build what will

eventually be about a 3 PB file system.

The CFS website doesn't appear to have any information on field

deployments

worth bragging about, so I figured I'd ask, just for fun; does

anyone know:

- the size of the largest working Lustre file system currently in

the field

- the highest sustained number of IOPS seen with Lustre, and what the

backend was?

cheers,

Klaus

_______________________________________________

Lustre-discuss mailing list

<a class="moz-txt-link-abbreviated" href="mailto:Lustre-discuss@clusterfs.com">Lustre-discuss@clusterfs.com</a>

<a class="moz-txt-link-freetext" href="https://mail.clusterfs.com/mailman/listinfo/lustre-discuss">https://mail.clusterfs.com/mailman/listinfo/lustre-discuss</a>

          </pre>

        </blockquote>


      </blockquote>


    </blockquote>

    <pre wrap="">

_______________________________________________

Lustre-discuss mailing list

<a class="moz-txt-link-abbreviated" href="mailto:Lustre-discuss@lists.lustre.org">Lustre-discuss@lists.lustre.org</a>

<a class="moz-txt-link-freetext" href="http://lists.lustre.org/mailman/listinfo/lustre-discuss">http://lists.lustre.org/mailman/listinfo/lustre-discuss</a>


    </pre>

  </blockquote>


</blockquote>

<br>

</body>

</html>