[Lustre-discuss] abnormally long ftruncates on Cray XT4

Thu Dec 10 12:55:29 PST 2009

Hi,

On Franklin, a Cray XT4 at NERSC with a Lustre /scratch filesystem, we
have noticed excessively long return times on ftruncate calls that are
issued through HDF5 or the MPI-IO layer (through MPI_File_set_size(),
for instance). Here is an I/O trace plot that shows 235GB written to a
shared HDF5 file in 65s, followed by an ftruncate that lasts about 50s:

http://vis.lbl.gov/~mhowison/vorpal/n2048.cb.align.183/tag.png
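
For anyone who wants to check the ftruncate time in isolation, outside
HDF5 and MPI-IO, here is a minimal sketch in Python. It only times a
single POSIX ftruncate on an existing file. The helper name
time_ftruncate and the temporary test file are hypothetical, for
illustration only; to be meaningful, the file would need to live on
the filesystem under test rather than in the default temp directory.

```python
import os
import tempfile
import time


def time_ftruncate(path, new_size):
    """Return the wall-clock seconds spent in one ftruncate call on `path`."""
    fd = os.open(path, os.O_RDWR)
    try:
        start = time.perf_counter()
        os.ftruncate(fd, new_size)  # the call being measured
        return time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical 4MB test file in the default temp directory; replace
    # with a path on the filesystem being investigated.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\0" * (4 * 1024 * 1024))
        path = f.name
    try:
        elapsed = time_ftruncate(path, 1024)
        print(f"ftruncate took {elapsed:.6f} s")
    finally:
        os.remove(path)
```

This measures only the client-side return time of the call; it says
nothing about where the time is spent on the servers.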

(Full details: With collective buffering enabled in the MPI-IO layer,
the I/O pattern is essentially a series of 4MB writes issued from 48
nodes that have been designated as aggregator/writer nodes. The number
of writer nodes matches the 48 OSTs that store the file, and the write
size matches the 4MB stripe width. This sets up a pattern that *looks*
to the OSTs essentially the same as if we had 48 single-stripe files
and 48 nodes each writing to its own file. This has been the most
effective way we have found to stage shared-file writes on Lustre.)
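
The alignment argument above can be sketched with a little arithmetic.
This is a simplified model, not a description of the actual file
layout: it assumes plain round-robin striping (stripe k of the file
lands on OST slot k mod 48) and assumes each writer issues one
stripe-aligned 4MB block per round at offset (round * 48 + rank) * 4MB.
The function names are hypothetical.

```python
STRIPE_SIZE = 4 * 1024 * 1024  # 4MB stripe width, as described above
STRIPE_COUNT = 48              # OSTs that store the file
NUM_WRITERS = 48               # aggregator/writer nodes


def ost_index(offset, stripe_size=STRIPE_SIZE, stripe_count=STRIPE_COUNT):
    """Assumed round-robin model: stripe k maps to OST slot k % stripe_count."""
    return (offset // stripe_size) % stripe_count


def writer_offsets(rank, rounds, stripe_size=STRIPE_SIZE,
                   num_writers=NUM_WRITERS):
    """Assumed layout: writer `rank` writes one aligned block per round."""
    return [(r * num_writers + rank) * stripe_size for r in range(rounds)]


def osts_touched(rank, rounds):
    """Set of OST slots that writer `rank` hits over `rounds` rounds."""
    return {ost_index(off) for off in writer_offsets(rank, rounds)}


if __name__ == "__main__":
    for rank in (0, 1, 47):
        print(rank, sorted(osts_touched(rank, rounds=10)))
```

Under these assumptions every writer touches exactly one OST slot, and
no two writers share a slot, which is the sense in which the pattern
resembles 48 single-stripe files.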

However, we've also seen this long ftruncate problem with several
other I/O patterns in addition to collective buffering in MPI-IO: for
instance, when bypassing MPI-IO in HDF5 and instead using the
MPI-POSIX driver and with unstructured 1D grids.

Any ideas on what might cause these long ftruncates? We plan on
analyzing LMT data from the metadata server to determine if it is
simply contention with other users, but we are suspicious of the
consistency and magnitude of these hangs.

Thanks,

Mark Howison
NERSC Analytics
mhowison at lbl.gov