[Lustre-discuss] IOR Single File -- lock callback timer expired
Jeffrey Alan Bennett
jab at sdsc.edu
Mon Dec 15 10:37:35 PST 2008
I am seeing this same issue when using IOR in POSIX mode, and I have other problems with IOR as well. For example, when I run IOR with MPI-IO, the test sometimes hangs indefinitely partway through. I am only using 4 Lustre clients, with files striped over 28 OSTs.
Jeff
> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org
> [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Roger Spellman
> Sent: Monday, December 15, 2008 9:48 AM
> To: Andreas Dilger
> Cc: lustre-discuss at lists.lustre.org
> Subject: Re: [Lustre-discuss] IOR Single File -- lock callback timer expired
>
> Andreas,
>
> Thanks.
>
> > If the file is not striped over multiple OSTs it may be that the 1
> > (default) OST that this file is striped over is being overloaded.
>
> The file is striped over many OSTs. The customer has tested
> between 8 and 18 stripes, to my knowledge.
>
> As far as I can tell, I can control how many RPCs are
> outstanding from each client to each OST. However, I cannot
> control the total number of outstanding RPCs from a single
> client. So, it is possible that many (or even all) of the
> 128 clients have outstanding I/Os to the same OST, even if
> the file is striped. Do you agree?
>
> Is there a proc file like max_rpcs_in_flight that is
> per-client, not per-client/per-OST pair?
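[Editor's note: a hedged sketch of how one might inspect and bound this on a Lustre 1.6-era client. The `max_rpcs_in_flight` tunable is per-OSC (i.e. per client/OST pair), so the client-wide ceiling on outstanding write RPCs is the sum of this value across all OSC devices; there is no single per-client knob. The /proc paths are as documented for Lustre 1.6 but may differ by version.]

```shell
# List the per-OST RPC cap for every OSC device on this client.
for f in /proc/fs/lustre/osc/*/max_rpcs_in_flight; do
    echo "$f: $(cat "$f")"
done

# To bound the client-wide total, lower the per-OST cap on every OSC.
# The value 4 is illustrative; default is 8 in Lustre 1.6.
for f in /proc/fs/lustre/osc/*/max_rpcs_in_flight; do
    echo 4 > "$f"
done
```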
>
> > llstat -i 1 /proc/fs/lustre/ost/OSS/ost_io/stats
>
> Do you want this command to be run WHILE the test is going on?
>
> Thanks again.
>
> -Roger
>
> > -----Original Message-----
> > From: Andreas.Dilger at sun.com [mailto:Andreas.Dilger at sun.com] On Behalf Of
> > Andreas Dilger
> > Sent: Friday, December 12, 2008 8:11 PM
> > To: Roger Spellman
> > Cc: lustre-discuss at lists.lustre.org
> > Subject: Re: [Lustre-discuss] IOR Single File -- lock callback timer
> > expired
> >
> > On Dec 10, 2008 13:21 -0500, Roger Spellman wrote:
> > > I have a customer running IOR on 128 clients, using IOR's POSIX mode to
> > > create a single file.
> > >
> > > The clients are running Lustre 1.6.6. The servers are running Lustre
> > > 1.6.5.
> >
> > If the file is not striped over multiple OSTs it may be that the 1
> > (default)
> > OST that this file is striped over is being overloaded.
> >
> > > mpiexec noticed that job rank 0 with PID 7520 on node whitney160
> > > exited on signal 42 (Real-time signal 8).
> > >
> > > Looking at the logs on the servers, I see a bunch of messages like the
> > > following:
> > >
> > > Dec 9 18:23:38 ts-sandia-02 kernel: LustreError:
> > > 0:0:(ldlm_lockd.c:234:waiting_locks_callback()) ### lock callback timer
> > > expired after 116s: evicting client at 192.168.121.32 at o2ib ns:
> > > filter-scratch-OST0000_UUID lock: ffff810014239600/0x6316855aa9d9f014
> > > lrc: 1/0,0 mode: PW/PW res: 5987/0 rrc: 373 type: EXT
> > > [1409286144->1442840575] (req 1409286144->1410334719) flags: 20
> > > remote: 0x77037709d529258a expref: 28 pi
> > >
> > >
> > > What might be causing this?
> >
> > This indicates that (from the OST's POV) the client hasn't cancelled
> > the lock, nor done any writes under this lock in the past 2 minutes.
> >
> > It would be worthwhile for you to check the RPC IO stats to see how long
> > writes are taking on this OST:
> >
> > llstat -i 1 /proc/fs/lustre/ost/OSS/ost_io/stats
> >
> > > Can I fix this problem by extending timers, such as
> > > /proc/sys/lustre/timeout and /proc/sys/lustre/ldlm_timeout ?
> >
> > Increasing /proc/sys/lustre/timeout would likely help.
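[Editor's note: a hedged example of adjusting the Lustre 1.6-era obd timeout via /proc. The value 300 is illustrative, not a recommendation; the setting should generally be applied consistently on clients and servers.]

```shell
# Show the current obd timeout in seconds (default is 100 in Lustre 1.6).
cat /proc/sys/lustre/timeout

# Raise it to tolerate slower lock cancellations under heavy load.
echo 300 > /proc/sys/lustre/timeout
```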
> >
> > Cheers, Andreas
> > --
> > Andreas Dilger
> > Sr. Staff Engineer, Lustre Group
> > Sun Microsystems of Canada, Inc.
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>