[Lustre-discuss] abnormally long ftruncates on Cray XT4
Andreas Dilger
adilger at sun.com
Fri Dec 11 02:03:50 PST 2009
On 2009-12-10, at 13:55, Mark Howison wrote:
> On Franklin, a Cray XT at NERSC with a Lustre /scratch filesystem, we
> have noticed excessively long return times on ftruncate calls that are
> issued through HDF5 or the MPI-IO layer (through MPI_File_set_size(),
> for instance). Here is an I/O trace plot that shows 235GB written to a
> shared HDF5 file in 65s, followed by an ftruncate that lasts about 50s:
>
> http://vis.lbl.gov/~mhowison/vorpal/n2048.cb.align.183/tag.png
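A useful first step before digging into server logs is to time the truncate in isolation, outside of HDF5 and MPI-IO. The sketch below is not the original trace workload -- the file name and sizes are illustrative assumptions -- but it shows how to separate the cost of the ftruncate syscall itself from the surrounding writes. On a local filesystem a truncate of a sparse file returns almost immediately; a 50s stall on /scratch would therefore point at server-side work (lock revocation, extent removal) rather than the client syscall.

```python
# Hedged sketch: measure ftruncate in isolation. File name and sizes
# are illustrative only, not from the trace in this thread.
import os
import time

path = "trunc_demo.dat"

# Create a sparse file claiming 1 GiB without writing any data blocks.
with open(path, "wb") as f:
    f.truncate(1 << 30)

start = time.perf_counter()
os.truncate(path, 1 << 20)  # shrink to 1 MiB, as MPI_File_set_size would
elapsed = time.perf_counter() - start

final_size = os.path.getsize(path)
print(f"ftruncate took {elapsed * 1000:.3f} ms, final size {final_size} bytes")
os.remove(path)
```

Running the same measurement on the Lustre /scratch mount versus a local disk would show whether the delay is intrinsic to the call or specific to the filesystem.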
To clarify - does the vertical axis show different servers, or clients?
It definitely looks like one or two of the servers are much slower than
the others, as shown by the "solid" line of writes, compared to the
others, which are very sparse.
Then, the single purple line at the top is presumably the truncate in
progress?
Finally, at the far right, is that for reads?
My original guess would have been that all of your clients are doing a
truncate at the same time, and this is causing lock contention, but
even that shouldn't cause such a long delay.
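That contention hypothesis can be illustrated with a small sketch: many workers all truncating the same shared file at once, which mimics every client issuing its own ftruncate. The file name, worker count, and sizes below are made up for illustration, and a local filesystem serializes these cheaply in the kernel, so this shows only the access pattern, not Lustre's lock-revocation cost.

```python
# Hedged sketch of the contention pattern: N workers all truncating one
# shared file, as when every MPI rank calls ftruncate. Names and counts
# are illustrative assumptions; local FS locking is far cheaper than
# Lustre's distributed extent locks.
import os
import threading

PATH = "shared_demo.dat"
NWORKERS = 8
TARGET = 1 << 20  # 1 MiB

with open(PATH, "wb") as f:
    f.truncate(1 << 24)  # start at 16 MiB

def truncate_worker():
    # Each worker redundantly truncates the shared file to the same size.
    os.truncate(PATH, TARGET)

threads = [threading.Thread(target=truncate_worker) for _ in range(NWORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_size = os.path.getsize(PATH)
print(f"size after {NWORKERS} concurrent truncates: {final_size}")
os.remove(PATH)
```

The usual mitigation for this pattern is to have a single rank perform the truncate (or rely on the MPI-IO layer's collective MPI_File_set_size) rather than every client issuing its own call.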
Another possibility is that the large file is fragmented on disk, so
the truncate has many extents to walk and free, but I also find it hard
to believe it would take this long.
Presumably there are no error messages during this time?
> However, we've also seen this long ftruncate problem with several
> other I/O patterns in addition to collective buffering in MPI-IO: for
> instance, when bypassing MPI-IO in HDF5 and instead using the
> MPI-POSIX driver, and when using unstructured 1D grids.
>
> Any ideas on what might cause these long ftruncates? We plan on
> analyzing LMT data from the metadata server to determine if it is
> simply contention with other users, but we are suspicious of the
> consistency and magnitude of these hangs.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.