[lustre-discuss] problem getting high performance output to single file

Tue May 19 10:44:21 PDT 2015

Thanks for the suggestion! When I had each rank run on a separate compute node/host, I saw parallel performance (4 seconds for the 6 GB of writing). When I ran the MPI job on one host (the hosts have 12 cores, and by default we pack ranks onto as few hosts as possible), the writes happened serially: each rank finished about 2 seconds after the previous one. I'm told that the hosts can handle a lot of I/O, but it seems there are some issues with getting that to work well. I believe we get good performance with different ranks on one host reading from different files. I'll look into tuning the MPI/HDF5 parameters now, with an eye toward designing my application to write from different hosts. My initial tests with MPI showed degraded performance when I used different hosts for the writing, but maybe there are some parameters that will help. I can try the Open MPI forum at that point.

best,

David Schneider
________________________________________
From: Mohr Jr, Richard Frank (Rick Mohr) [rmohr at utk.edu]
Sent: Tuesday, May 19, 2015 9:15 AM
To: Schneider, David A.
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [lustre-discuss] problem getting high performance output to single file

> On May 19, 2015, at 11:40 AM, Schneider, David A. <davidsch at slac.stanford.edu> wrote:
>
> When working with HDF5 and MPI, I have seen a number of references to tuning parameters, but I haven't dug into this yet. I first want to make sure Lustre has the high output performance at a basic level. I tried writing a C program that uses simple POSIX calls (open and looping over writes), but I don't see much increase in performance (I've tried 8 and 19 OSTs, 1 MB and 4 MB chunks, writing a 6 GB file).
>
> Does anyone know if this should work? What is the simplest C program I could write to see an increase in output performance after I stripe? Do I need separate processes/threads with separate file handles?

If you are looking for a simple shared-file test, you could try something like this:

1) Create a file with a stripe size of 1 GB and a stripe count of 6.

2) Write an MPI program where each process writes 1 GB of sequential data.  Each process should first seek to (mpi_rank)*(1GB) and then write 1 GB.  This will ensure that all processes are writing to non-overlapping parts of the file.

3) Start the program running on 6 nodes (1 process per node).

In a scenario like that, you should effectively be getting file-per-process speeds even though you are writing to a shared file: with a 1 GB stripe size and non-overlapping 1 GB regions, each process is writing to a different OST.
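
As a rough illustration of step 2, here is a minimal sketch in C of the per-process write. It is not a complete MPI program: the helper names region_offset and write_region are made up for this example, and the rank is assumed to be passed in by the caller (for instance, from MPI_Comm_rank). It also assumes the file was already created with the desired striping, e.g. with something like "lfs setstripe -c 6 -S 1G <file>", so it opens the file without truncating it.

```c
#define _XOPEN_SOURCE 700
#define _FILE_OFFSET_BITS 64   /* 64-bit off_t so multi-GB offsets work */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: byte offset where a rank's region starts. */
static off_t region_offset(int rank, off_t region_bytes)
{
    assert(rank >= 0);
    return (off_t)rank * region_bytes;
}

/*
 * Hypothetical helper: write region_bytes of filler data into path,
 * starting at rank * region_bytes, in chunks of chunk_bytes.
 * Returns 0 on success, -1 on error.
 */
static int write_region(const char *path, int rank,
                        off_t region_bytes, size_t chunk_bytes)
{
    /* No O_TRUNC: an existing (pre-striped) file keeps its contents. */
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    char *buf = malloc(chunk_bytes);
    if (buf == NULL) {
        close(fd);
        return -1;
    }
    /* Fill with a per-rank letter so regions are easy to tell apart. */
    memset(buf, 'A' + (rank % 26), chunk_bytes);

    off_t start = region_offset(rank, region_bytes);
    off_t done = 0;
    int rc = 0;
    while (done < region_bytes) {
        size_t want = chunk_bytes;
        if ((off_t)want > region_bytes - done)
            want = (size_t)(region_bytes - done);
        /* pwrite takes an explicit offset, so no separate seek is needed. */
        ssize_t n = pwrite(fd, buf, want, start + done);
        if (n <= 0) {
            rc = -1;
            break;
        }
        done += n;
    }

    free(buf);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

In an MPI program, each process would call something like write_region(path, rank, (off_t)1 << 30, 4 << 20) after MPI_Init and MPI_Comm_rank, ideally with a barrier before and after so the elapsed time can be measured across all ranks. The 4 MB chunk size here is only a placeholder to experiment with.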

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu