[lustre-discuss] liblustreapi.so llapi_layout_get_by_fd() taking a long time to complete

John Bauer bauerj at iodoctors.com
Fri Nov 25 06:56:57 PST 2022


Andreas,

Thanks for the reply.  After days of debugging, thinking my 
instrumentation code was at the root of the problem, this morning I 
discovered that the jobs sometimes have the 260 second pause even when 
not using my instrumentation package.  The intermittent nature sent me 
on that wild goose chase.  What's odd is that the 260 second delay now 
shows up during the MPI_File_write_at() phase of the job.  I no longer 
invoke my package's option to set and get the striping info for the 
file, so I am no longer calling llapi_layout_get_by_fd().  This is an 
h5perf run:

mpirun -n 16 h5perf -A mpiio -B 2K -e 256M -i 1 -p 16 -P 16 -w -x 128M 
-X 128M -I -o ~/h5perf/${output}.log

Here are a few of my instrumentation plots of the pwrite64() calls for 
all the ranks, with file position on the Y axis and time on the X.  Note 
that each rank does 131072 pwrite64() calls of 2 KiB, strided by 
32 KiB, so there is lots of potential for lock contention.  There is 
always one rank that gets off to a good start, I suppose because once 
it is in the lead it does not have to deal with lock contention.  
What's strange about this run is that it is the rank in the lead that 
hits the "road block" first and has the 260 second delay before 
resuming its writes.  Some of the trailing ranks blow through the 
"road block" and continue writing.  The other ranks also block on the 
same area of the file and pause for the 260 seconds.  Again, all ranks 
are on different nodes (hosts).  Why would some ranks pause and some 
not?  Why would the lead rank pause at all?  Today I will try to 
associate which OSTs/OSSs are behind the road block.  Is there 
something in /proc/fs/lustre/osc/ that sheds light on RPC timeouts?
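For reference, the access pattern described above can be reconstructed 
with a small sketch.  The constants mirror the h5perf options from the 
command line; the helper name is illustrative, not part of h5perf:

```c
/* Per-rank file offsets for the run above: 16 ranks interleaved, each
 * issuing 131072 pwrite64() calls of 2 KiB, strided by 32 KiB
 * (= 16 ranks x 2 KiB), for 256 MiB per rank. */
enum { NRANKS = 16, BLOCK = 2048, NWRITES = 131072 };
#define STRIDE ((long long)NRANKS * BLOCK)   /* 32 KiB */

/* Offset of rank's i-th 2 KiB write.  Neighboring ranks' blocks are
 * only 2 KiB apart, so every 32 KiB window of the file is touched by
 * all 16 ranks -- plenty of opportunity for extent-lock contention. */
static long long write_offset(int rank, long long i)
{
	return (long long)rank * BLOCK + i * STRIDE;
}
```

Each rank writes NWRITES * BLOCK = 268435456 bytes in total, matching 
the -e 256M per-rank size.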

John

https://www.dropbox.com/s/6uac96xhr79pj73/h5perf_1.png?dl=0

https://www.dropbox.com/s/tebm1iy0jipnqgx/h5perf_2.png?dl=0

https://www.dropbox.com/s/ydgzstkg9qrk6z4/h5perf_3.png?dl=0
On 11/24/22 20:47, Andreas Dilger wrote:
> On Nov 22, 2022, at 13:57, John Bauer <bauerj at iodoctors.com> wrote:
>>
>> Hi all,
>>
>> I am making a call to *llapi_layout_get_by_fd()*  from each rank of a 
>> 16 rank MPI job.  One rank per node.
>>
>> About 75% of the time, one of the ranks, typically rank 0, takes a 
>> very long time to complete this call.  I have placed fprintf() calls 
>> with wall clock timers around the call.  When it does take a long 
>> time, it is generally about 260 seconds; otherwise it takes only 
>> microseconds.
>>
>> How I access llapi_layout_get_by_fd():
>>
>> liblustreapi = dlopen("liblustreapi.so", RTLD_LAZY);
>> LLAPI.layout_get_by_fd = dlsym(liblustreapi, "llapi_layout_get_by_fd");
>>
>> How I call llapi_layout_get_by_fd():
>>
>> if (dbg) fprintf(stderr, "%s %12.8f %s() before LLAPI.layout_get_by_fd()\n", host, rtc(), __func__);
>> struct llapi_layout *layout = (*LLAPI.layout_get_by_fd)(fd, 0);
>> if (dbg) fprintf(stderr, "%s %12.8f %s() after LLAPI.layout_get_by_fd()\n", host, rtc(), __func__);
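[A slightly hardened version of that loader pattern, as a self-contained 
sketch: the generic helper name load_sym() is illustrative, with 
dlerror() checking added around both calls.]

```c
#include <dlfcn.h>
#include <stdio.h>

/* Generic dlopen/dlsym helper with error reporting; load_sym() is an
 * illustrative name, not part of liblustreapi. */
static void *load_sym(const char *lib, const char *sym)
{
	void *handle = dlopen(lib, RTLD_LAZY);
	if (handle == NULL) {
		fprintf(stderr, "dlopen(%s): %s\n", lib, dlerror());
		return NULL;
	}
	dlerror();                    /* clear any stale error state */
	void *fn = dlsym(handle, sym);
	const char *err = dlerror();
	if (err != NULL) {
		fprintf(stderr, "dlsym(%s): %s\n", sym, err);
		return NULL;
	}
	return fn;
}

/* Usage on a Lustre client (prototype from <lustre/lustreapi.h>):
 *   struct llapi_layout *(*get_by_fd)(int, uint32_t) =
 *           load_sym("liblustreapi.so", "llapi_layout_get_by_fd");
 */
```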
>>
>> The resulting prints from rank 0:
>>
>> r401i2n10   7.22477698 LustreLayout_get_by_fd() before LLAPI.layout_get_by_fd()
>> r401i2n10 269.52539992 LustreLayout_get_by_fd() after LLAPI.layout_get_by_fd()
>>
>> Any ideas on what might be triggering this?  The layout returned 
>> seems to be correct every time, whether the call takes a long time 
>> or not: it has the correct striping information, but the component 
>> has no OSTs, since the component has not yet been instantiated for 
>> the new file.
>>
>
> Running under strace/ltrace would show where the slowdown is, and 
> Lustre kernel debug logs would be needed to isolate this to a specific 
> piece of code.  Given the length of time it is likely that an RPC is 
> timing out (presumably nothing is printed on the console logs), but 
> you'd need to look at exactly what is happening.
>
> It's a *bit* strange, because this call is essentially equivalent to 
> "getxattr", but over the years a bunch of cruft has been added and it 
> is probably doing a lot more than it should...  You could potentially 
> use (approximately):
>
>        fgetxattr(fd, "lustre.lov", buf, buflen);
>        llapi_layout_get_by_xattr(buf, buflen, 0);
>
> but then we wouldn't know what is making this slow and you couldn't 
> submit a patch to fix it...
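[A minimal sketch of that suggested shortcut; the wrapper name is an 
editorial invention, and on a non-Lustre filesystem the fgetxattr() 
call simply fails with ENODATA/ENOTSUP rather than returning a layout:]

```c
#include <sys/types.h>
#include <sys/xattr.h>

/* Fetch the raw striping xattr for an open file descriptor.  On a
 * Lustre file this returns the layout blob that
 * llapi_layout_get_by_xattr() parses; on any other filesystem it
 * returns -1 with errno set to ENODATA or ENOTSUP. */
static ssize_t get_lov_xattr(int fd, void *buf, size_t buflen)
{
	return fgetxattr(fd, "lustre.lov", buf, buflen);
}

/* On success (rc > 0), hand the blob to liblustreapi:
 *   struct llapi_layout *layout =
 *           llapi_layout_get_by_xattr(buf, (size_t)rc, 0);
 */
```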
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud
>
