[Lustre-discuss] abnormally long ftruncates on Cray XT4

Mark Howison MHowison at lbl.gov
Fri Dec 11 10:58:19 PST 2009


Hi Andreas,

Sorry, I should have provided a better description of the trace plot.
Yes, the vertical axis shows MPI task number: the first 48 tasks (mod 4,
because of the quad-core nodes) are the writers, and indeed there are a
few OSTs that are an order of magnitude slower than the others. Usually
we see that undesirable behavior when we have many more writers than
OSTs (for instance, 1K tasks hitting 48 OSTs)... but now it has started
happening even when we use collective buffering to carefully match the
number of writers to the number of OSTs. It is possible that this is
just contention from other users on the system; Andrew Uselton is going
to help me obtain LMT data to verify that. Contention has been less of
an issue since the large IO hardware upgrade to Franklin last spring,
though.
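
For reference, here is a minimal sketch of roughly how that writer/OST
matching can be requested from the MPI-IO layer. It assumes the standard
ROMIO hint names ("striping_factor", "cb_nodes", "romio_cb_write") are
honored by Cray's MPI-IO; the file name and the value 48 are just
illustrative, not our actual code:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&info);
    /* ask for one stripe per OST at file creation time */
    MPI_Info_set(info, "striping_factor", "48");
    /* use the same number of collective-buffering aggregators,
     * so each aggregator writes to exactly one OST */
    MPI_Info_set(info, "cb_nodes", "48");
    MPI_Info_set(info, "romio_cb_write", "enable");

    MPI_File_open(MPI_COMM_WORLD, "output.h5",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    /* ... collective writes ... */
    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}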

By any chance, have there been changes in recent releases of Lustre
that would affect how stripes are assigned to OSTs? For instance, is
it no longer the case that they are assigned round-robin? Has some
type of load balancing been introduced? That would break our 1-1
writer-to-OST pattern.
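
To see where the stripes of one of our output files actually landed,
something along these lines should work -- a rough, untested sketch that
assumes the liblustreapi interface (llapi_file_get_stripe() and struct
lov_user_md) on our Lustre version; "lfs getstripe <file>" reports the
same OST indices from the command line:

#include <stdio.h>
#include <stdlib.h>
#include <lustre/liblustreapi.h>

int main(int argc, char **argv)
{
    struct lov_user_md *lum;
    int i, rc;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    /* leave room for the per-stripe object array */
    lum = malloc(sizeof(*lum) +
                 LOV_MAX_STRIPE_COUNT * sizeof(struct lov_user_ost_data));
    rc = llapi_file_get_stripe(argv[1], lum);
    if (rc != 0) {
        fprintf(stderr, "llapi_file_get_stripe failed: %d\n", rc);
        return 1;
    }
    for (i = 0; i < lum->lmm_stripe_count; i++)
        printf("stripe %d -> OST %u\n", i, lum->lmm_objects[i].l_ost_idx);
    free(lum);
    return 0;
}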

The purple indicates the ftruncate, which is called only from task 0
(that is how MPI_File_set_size is implemented). The salmon at the far
right actually indicates fsyncs, not reads, and the brown (hard to
see) is fclose.
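
For completeness, the call in question looks roughly like this (a sketch,
not our actual code; the 235 GiB size and file name are illustrative).
MPI_File_set_size() is collective, and my understanding -- an assumption
about Cray's ROMIO-derived MPI-IO, not something I have verified in its
source -- is that a single rank performs the ftruncate while the others
wait, which would explain why a slow truncate on task 0 stalls everyone:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Offset nbytes = (MPI_Offset)235 << 30;   /* illustrative: 235 GiB */

    MPI_Init(&argc, &argv);
    MPI_File_open(MPI_COMM_WORLD, "output.h5",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* collective call: every rank enters it, but only one rank ends up
     * issuing the ftruncate() on the shared file */
    MPI_File_set_size(fh, nbytes);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}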

I will ask our systems people if they can locate logs and verify
whether there were any errors at the time of the ftruncate, but I
didn't receive any errors on the client side.

Thanks for helping us look into this,

Mark

On Fri, Dec 11, 2009 at 2:03 AM, Andreas Dilger <adilger at sun.com> wrote:
> On 2009-12-10, at 13:55, Mark Howison wrote:
>>
>> On Franklin, a Cray XT at NERSC with a Lustre /scratch filesystem, we
>> have noticed excessively long return times on ftruncate calls that are
>> issued through HDF5 or the MPI-IO layer (through MPI_File_set_size(),
>> for instance). Here is an IO trace plot that shows 235GB written to a
>> shared HDF5 file in 65s, followed by an ftruncate that lasts about 50s:
>>
>> http://vis.lbl.gov/~mhowison/vorpal/n2048.cb.align.183/tag.png
>
> To clarify - is the vertical axis for different servers, or is it for
> clients?  It definitely looks like 1 or 2 of the servers are much slower
> than the others, as shown by the "solid" line of writes, compared to the
> others, which are very sparse.
>
> Then, the single purple line at the top is presumably the truncate in
> progress?
>
> Finally, at the far right, is that for reads?
>
> My original guess would have been that all of your clients are doing a
> truncate at the same time, and this is causing lock contention, but even
> that shouldn't cause such a long delay.
>
> Another possibility is that the large file is fragmented on disk, and the
> truncate is taking a long time, but I also find it hard to believe it would
> take this long.
>
> Presumably there are no error messages during this time?
>
>> However, we've also seen this long ftruncate problem with several
>> other IO patterns in addition to collective buffering in MPI-IO: for
>> instance, when bypassing MPI-IO in HDF5 and using the MPI-POSIX
>> driver instead, and also with unstructured 1D grids.
>>
>> Any ideas on what might cause these long ftruncates? We plan on
>> analyzing LMT data from the metadata server to determine if it is
>> simply contention with other users, but we are suspicious of the
>> consistency and magnitude of these hangs.
>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>


