[Lustre-discuss] abnormally long ftruncates on Cray XT4

Mark Howison MHowison at lbl.gov
Wed Dec 23 13:22:16 PST 2009


Hi Andreas,

Our systems people looked through the error logs and didn't report any
errors directly related to my job. There were, however, some errors
around the same time that were associated with other jobs (pasted below).

Thanks
Mark

Based on your timestamps, I am assuming this is the run:

Dec 7 17:00:13 nid04099 aprun[54251]: apid=1820824, Starting,
user=43251, cmd_line="aprun -n 2048 ./vorpalio -a 4 -t 10 -b 80 80 80
-n 32 8 8 ", num_nodes=512,
node_list=1440-1462,1470-1471,1489-1491,1493-1499,1534-1535,1537,1563-1568,1582-1600,1643-1644,1648-1650,1739-1745,1762-1776,2033-2038,2065-2067,2105-2114,2141-2168,2239-2244,2248-2255,2341-2367,2372-2374,2762-2795,2837-2864,2952-2995,3234-3245,3287-3296,3302-3305,3330-3339,3866-3909,3911-3933,3946-4023,12796-12839

No Lustre errors during the timeframe requested:

From Messages:
Dec 7 15:30:02 nid00263 kernel: LustreError:
42196:0:(llite_nfs.c:252:ll_get_parent()) failure -13 inode 356483075
get parent
Dec 7 19:22:13 nid00256 kernel: LustreError:
92924:0:(file.c:2653:ll_inode_revalidate_fini()) failure -43 inode
422182913

None in the console log either. However, just after the requested time period:
[2009-12-07 17:03:07][c4-1c0s0n3]Lustre:
5555:0:(rw.c:1830:ll_readpage()) ino 169248734 page 1024 (4194304) not
covered by a lock (mmap?). check debug logs.
[2009-12-07 17:03:07][c4-1c0s0n2]Lustre:
5539:0:(rw.c:1830:ll_readpage()) ino 169248734 page 1024 (4194304) not
covered by a lock (mmap?). check debug logs.
[2009-12-07 17:03:07][c4-1c0s0n3]Lustre:
5557:0:(rw.c:1830:ll_readpage()) ino 169248734 page 1025 (4198400) not
covered by a lock (mmap?). check debug logs.
[2009-12-07 17:03:07][c4-1c0s0n3]Lustre:
5555:0:(rw.c:1830:ll_readpage()) Skipped 316 previous similar messages
[2009-12-07 17:03:12][c4-1c0s0n3]Lustre:
5557:0:(rw.c:1830:ll_readpage()) ino 169248729 page 1024 (4194304) not
covered by a lock (mmap?). check debug logs.
[2009-12-07 17:03:12][c4-1c0s0n3]Lustre:
5557:0:(rw.c:1830:ll_readpage()) Skipped 1 previous similar message

but neither of these nodes (1250 and 1251) was part of this job.

On Fri, Dec 11, 2009 at 10:58 AM, Mark Howison <MHowison at lbl.gov> wrote:
> Hi Andreas,
>
> Sorry, I should have provided a better description of the trace plot.
> Yes, the vertical axis shows the MPI task number: the first 48 tasks
> (mod 4 because of quad cores) are the writers, and indeed there are a
> few OSTs that are an order of magnitude slower than the others.
> Usually we see that undesirable behavior when we have many more
> writers than OSTs (for instance 1K tasks hitting 48 OSTs)... now it
> has started happening even when we use collective buffering to
> carefully match the number of writers to the number of OSTs. It is
> possible that this is just contention from other users on the system.
> Andrew Uselton is going to help me obtain LMT data to verify that.
> Contention has been less of an issue since the large IO hardware
> upgrade to Franklin last spring, though.
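>
> To be concrete, here is a minimal sketch of the hint setup I mean. The
> hint names are the standard ROMIO/Cray MPI-IO ones, but the file name
> and the value 48 are only illustrative and change per run:
>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     MPI_Info info;
>     MPI_File fh;
>
>     MPI_Init(&argc, &argv);
>
>     MPI_Info_create(&info);
>     /* Stripe the (new) file over 48 OSTs and use 48 collective
>      * buffering aggregators, one writer per OST. */
>     MPI_Info_set(info, "striping_factor", "48");
>     MPI_Info_set(info, "cb_nodes", "48");
>     MPI_Info_set(info, "romio_cb_write", "enable");
>
>     MPI_File_open(MPI_COMM_WORLD, "scratch_file",
>                   MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
>     /* ... collective writes would go here ... */
>     MPI_File_close(&fh);
>
>     MPI_Info_free(&info);
>     MPI_Finalize();
>     return 0;
> }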
>
> By any chance, have there been changes in recent releases of Lustre
> that would affect how stripes are assigned to OSTs? For instance, is
> it no longer the case that they are assigned round-robin? Has some
> kind of load balancing been introduced? That would break our 1-to-1
> writer-to-OST pattern.
>
> The purple indicates ftruncate, which is only called from task 0 (that
> is how MPI_File_set_size is implemented). The salmon at the far right
> actually indicates fsyncs, not reads. And the brown (hard to see) is
> fclose.
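>
> For reference, a minimal stand-in for the write-then-resize pattern
> (the file name and sizes are made up; the point is just that the
> resize is a single collective MPI_File_set_size, which the trace shows
> as one ftruncate issued from task 0):
>
> #include <mpi.h>
> #include <stdlib.h>
>
> int main(int argc, char **argv)
> {
>     const int nbytes = 4 * 1024 * 1024;   /* 4 MB per task, made up */
>     MPI_File fh;
>     int rank;
>     char *buf;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     buf = calloc(nbytes, 1);
>
>     MPI_File_open(MPI_COMM_WORLD, "testfile",
>                   MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
>
>     /* Each task writes one contiguous block at its own offset. */
>     MPI_File_write_at_all(fh, (MPI_Offset)rank * nbytes, buf, nbytes,
>                           MPI_BYTE, MPI_STATUS_IGNORE);
>
>     /* Collective resize; this is what shows up as the long ftruncate
>      * from task 0 (purple) in the trace. */
>     MPI_File_set_size(fh, (MPI_Offset)1024 * nbytes);
>
>     MPI_File_close(&fh);
>     free(buf);
>     MPI_Finalize();
>     return 0;
> }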
>
> I will ask our systems people if they can locate the logs and verify
> whether there were any errors at the time of the ftruncate, but I
> didn't see any errors on the client side.
>
> Thanks for helping us look into this,
>
> Mark
>
> On Fri, Dec 11, 2009 at 2:03 AM, Andreas Dilger <adilger at sun.com> wrote:
>> On 2009-12-10, at 13:55, Mark Howison wrote:
>>>
>>> On Franklin, a Cray XT at NERSC with a Lustre /scratch filesystem, we
>>> have noticed excessively long return times on ftruncate calls that are
>>> issued through HDF5 or the MPI-IO layer (through MPI_File_set_size(),
>>> for instance). Here is an IO trace plot that shows 235GB written to a
>>> shared HDF5 file in 65s, followed by an ftruncate that lasts about 50s:
>>>
>>> http://vis.lbl.gov/~mhowison/vorpal/n2048.cb.align.183/tag.png
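>>>
>>> For reference, a stripped-down stand-in for this IO path (HDF5 on top
>>> of the MPI-IO driver; the file and dataset names and the tiny dataset
>>> size are invented, the real runs write a ~235GB shared file):
>>>
>>> #include <hdf5.h>
>>> #include <mpi.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     MPI_Init(&argc, &argv);
>>>
>>>     /* Open the file through the MPI-IO virtual file driver. */
>>>     hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
>>>     H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
>>>     hid_t file = H5Fcreate("fields.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
>>>
>>>     /* Tiny dataset, written collectively (every task writes the same
>>>      * values here, just to exercise the path). */
>>>     hsize_t dims[1] = { 1024 };
>>>     hid_t space = H5Screate_simple(1, dims, NULL);
>>>     hid_t dset  = H5Dcreate(file, "E", H5T_NATIVE_DOUBLE, space,
>>>                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
>>>
>>>     double data[1024] = { 0 };
>>>     hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
>>>     H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
>>>     H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, dxpl, data);
>>>
>>>     H5Pclose(dxpl);
>>>     H5Dclose(dset);
>>>     H5Sclose(space);
>>>     /* In our traces the long ftruncate appears around here: HDF5 sets
>>>      * the file size via MPI_File_set_size when flushing/closing. */
>>>     H5Fclose(file);
>>>     H5Pclose(fapl);
>>>     MPI_Finalize();
>>>     return 0;
>>> }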
>>
>> To clarify: is the vertical axis for different servers, or for
>> clients? It definitely looks like 1 or 2 of the servers are much slower
>> than the others, as shown by the "solid" line of writes, compared to the
>> others, which are very sparse.
>>
>> Then, the single purple line at the top is presumably the truncate in
>> progress?
>>
>> Finally, at the far right, is that for reads?
>>
>> My original guess would have been that all of your clients are doing a
>> truncate at the same time, and this is causing lock contention, but even
>> that shouldn't cause such a long delay.
>>
>> Another possibility is that the large file is fragmented on disk, and the
>> truncate is taking a long time, but I also find it hard to believe it would
>> take this long.
>>
>> Presumably there are no error messages during this time?
>>
>>> However, we've also seen this long ftruncate problem with several
>>> other IO patterns in addition to collective buffering in MPI-IO: for
>>> instance, when bypassing MPI-IO in HDF5 and using the MPI-POSIX
>>> driver instead, and with unstructured 1D grids.
>>>
>>> Any ideas on what might cause these long ftruncates? We plan on
>>> analyzing LMT data from the metadata server to determine if it is
>>> simply contention with other users, but we are suspicious of the
>>> consistency and magnitude of these hangs.
>>
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Sr. Staff Engineer, Lustre Group
>> Sun Microsystems of Canada, Inc.
>>
>>
>


