[Lustre-discuss] abnormally long ftruncates on Cray XT4
Andreas Dilger
adilger at sun.com
Fri Dec 11 02:03:50 PST 2009
On 2009-12-10, at 13:55, Mark Howison wrote:
> On Franklin, a Cray XT at NERSC with a Lustre /scratch filesystem, we
> have noticed excessively long return times on ftruncate calls that are
> issued through HDF5 or the MPI-IO layer (through MPI_File_set_size(),
> for instance). Here is an I/O trace plot that shows 235GB written to a
> shared HDF5 file in 65s, followed by an ftruncate that lasts about 50s:
>
> http://vis.lbl.gov/~mhowison/vorpal/n2048.cb.align.183/tag.png
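A useful first step before digging into server logs is to time the truncate in isolation, outside of HDF5 and MPI-IO. The sketch below is not the original trace workload -- the file name and sizes are illustrative assumptions -- but it shows how to separate the cost of the ftruncate syscall itself from the surrounding writes. On a local filesystem a truncate of a sparse file returns almost immediately; a 50s stall on /scratch would therefore point at server-side work (lock revocation, extent removal) rather than the client syscall.

```python
# Hedged sketch: measure ftruncate in isolation. File name and sizes
# are illustrative only, not from the trace in this thread.
import os
import time

path = "trunc_demo.dat"

# Create a sparse file claiming 1 GiB without writing any data blocks.
with open(path, "wb") as f:
    f.truncate(1 << 30)

start = time.perf_counter()
os.truncate(path, 1 << 20)  # shrink to 1 MiB, as MPI_File_set_size would
elapsed = time.perf_counter() - start

final_size = os.path.getsize(path)
print(f"ftruncate took {elapsed * 1000:.3f} ms, final size {final_size} bytes")
os.remove(path)
```

Running the same measurement on the Lustre /scratch mount versus a local disk would show whether the delay is intrinsic to the call or specific to the filesystem.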
To clarify - does the vertical axis show different servers, or clients?
It definitely looks like one or two of the servers are much slower than
the others, as shown by the "solid" line of writes, compared to the
others, which are very sparse.
Then, the single purple line at the top is presumably the truncate in
progress?
Finally, at the far right, is that for reads?
My original guess would have been that all of your clients are doing a
truncate at the same time, and this is causing lock contention, but
even that shouldn't cause such a long delay.
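That contention hypothesis can be illustrated with a small sketch: many workers all truncating the same shared file at once, which mimics every client issuing its own ftruncate. The file name, worker count, and sizes below are made up for illustration, and a local filesystem serializes these cheaply in the kernel, so this shows only the access pattern, not Lustre's lock-revocation cost.

```python
# Hedged sketch of the contention pattern: N workers all truncating one
# shared file, as when every MPI rank calls ftruncate. Names and counts
# are illustrative assumptions; local FS locking is far cheaper than
# Lustre's distributed extent locks.
import os
import threading

PATH = "shared_demo.dat"
NWORKERS = 8
TARGET = 1 << 20  # 1 MiB

with open(PATH, "wb") as f:
    f.truncate(1 << 24)  # start at 16 MiB

def truncate_worker():
    # Each worker redundantly truncates the shared file to the same size.
    os.truncate(PATH, TARGET)

threads = [threading.Thread(target=truncate_worker) for _ in range(NWORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_size = os.path.getsize(PATH)
print(f"size after {NWORKERS} concurrent truncates: {final_size}")
os.remove(PATH)
```

The usual mitigation for this pattern is to have a single rank perform the truncate (or rely on the MPI-IO layer's collective MPI_File_set_size) rather than every client issuing its own call.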
Another possibility is that the large file is fragmented on disk, so
the truncate has many extents to walk and free, but I also find it hard
to believe it would take this long.
Presumably there are no error messages during this time?
> However, we've also seen this long ftruncate problem with several
> other I/O patterns in addition to collective buffering in MPI-IO: for
> instance, when bypassing MPI-IO in HDF5 and instead using the
> MPI-POSIX driver, and when using unstructured 1D grids.
>
> Any ideas on what might cause these long ftruncates? We plan on
> analyzing LMT data from the metadata server to determine if it is
> simply contention with other users, but we are suspicious of the
> consistency and magnitude of these hangs.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.