[Lustre-discuss] IOR Single File -- lock callback timer expired

Jeffrey Alan Bennett jab at sdsc.edu
Mon Dec 15 10:37:35 PST 2008


I am having this same issue when using IOR with POSIX, and I have other issues with IOR as well. For example, when I run IOR with MPI-IO, it sometimes hangs indefinitely in the middle of the test. I am only using 4 Lustre clients, and the files are striped over 28 OSTs.

Jeff
 

> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org 
> [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of 
> Roger Spellman
> Sent: Monday, December 15, 2008 9:48 AM
> To: Andreas Dilger
> Cc: lustre-discuss at lists.lustre.org
> Subject: Re: [Lustre-discuss] IOR Single File -- lock callback timer expired
> 
> Andreas,
> 
> Thanks.  
> 
> > If the file is not striped over multiple OSTs it may be that the 1
> > (default)
> > OST that this file is striped over is being overloaded.
> 
> The file is striped over many OSTs.  The customer has tested 
> between 8 and 18 stripes, to my knowledge.  
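
For what it's worth, this is roughly how we pre-create the shared file
when testing different stripe counts (file name, stripe count, and
stripe size below are illustrative, not the customer's actual values):

```shell
# Hedged sketch: widen striping on the IOR target file and verify it.
# /mnt/lustre/ior.out, -c 18, and -s 4m are example values only.
FILE=/mnt/lustre/ior.out

if command -v lfs >/dev/null 2>&1; then
    # Pre-create the shared file striped across 18 OSTs, 4 MB stripes.
    lfs setstripe -c 18 -s 4m "$FILE"
    # Confirm the layout actually applied before starting the run.
    lfs getstripe "$FILE"
else
    echo "lfs not found; run this on a Lustre client"
fi
```

Running IOR against a file pre-created this way sidesteps the default
stripe count on the parent directory.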
> 
> As far as I can tell, I can control how many RPCs are 
> outstanding from each client to each OST.  However, I cannot 
> control the total number of outstanding RPCs from a single 
> client.  So, it is possible that many (or even all) of the 
> 128 clients have outstanding I/Os to the same OST, even if 
> the file is striped.  Do you agree?  
> 
> Is there a proc file like max_rpcs_in_flight that is 
> per-client, not per-client/per-OST pair?
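> 
As far as I know, 1.6 only exposes the per-OSC (i.e. per client/OST
pair) knob, so the workaround I've seen is to lower that cap on every
OSC device on the client; something like the following sketch (the
value 4 is illustrative):

```shell
# Hedged sketch: Lustre 1.6 exposes max_rpcs_in_flight per OSC device
# under /proc, not as a single client-wide limit.  Lowering it on every
# OSC caps the total a client can have outstanding per OST.
NEW_CAP=4
for f in /proc/fs/lustre/osc/*/max_rpcs_in_flight; do
    [ -e "$f" ] || continue    # glob didn't match: no Lustre OSCs here
    echo "$f: $(cat "$f") -> $NEW_CAP"
    echo "$NEW_CAP" > "$f"
done
```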
> 
> > 	llstat -i 1 /proc/fs/lustre/ost/OSS/ost_io/stats
> 
> Do you want this command to be run WHILE the test is going on?
> 
> Thanks again.
> 
> -Roger
> 
> > -----Original Message-----
> > From: Andreas.Dilger at sun.com [mailto:Andreas.Dilger at sun.com]
> > On Behalf Of Andreas Dilger
> > Sent: Friday, December 12, 2008 8:11 PM
> > To: Roger Spellman
> > Cc: lustre-discuss at lists.lustre.org
> > Subject: Re: [Lustre-discuss] IOR Single File -- lock callback timer
> > expired
> > 
> > On Dec 10, 2008  13:21 -0500, Roger Spellman wrote:
> > > I have a customer running IOR on 128 clients, using IOR's POSIX
> > > mode to create a single file.
> > >
> > > The clients are running Lustre 1.6.6.  The servers are running
> > > Lustre 1.6.5.
> > 
> > If the file is not striped over multiple OSTs it may be that the 1
> > (default)
> > OST that this file is striped over is being overloaded.
> > 
> > >     mpiexec noticed that job rank 0 with PID 7520 on node
> > >     whitney160 exited on signal 42 (Real-time signal 8).
> > >
> > > Looking at the logs on the servers, I see a bunch of messages like
> > > the following:
> > >
> > > Dec  9 18:23:38 ts-sandia-02 kernel: LustreError:
> > > 0:0:(ldlm_lockd.c:234:waiting_locks_callback()) ### lock callback
> > > timer expired after 116s: evicting client at 192.168.121.32 at o2ib
> > > ns: filter-scratch-OST0000_UUID lock: ffff810014239600/0x6316855aa9d9f014
> > > lrc: 1/0,0 mode: PW/PW res: 5987/0 rrc: 373 type: EXT
> > > [1409286144->1442840575] (req 1409286144->1410334719) flags: 20
> > > remote: 0x77037709d529258a expref: 28 pi
> > >
> > >
> > > What might be causing this?
> > 
> > This indicates that (from the OST's point of view) the client has
> > neither cancelled the lock nor done any writes under it in the past
> > 2 minutes.
> > 
> > It would be worthwhile for you to check the RPC IO stats to see how
> > long writes are taking on this OST:
> > 
> > 	llstat -i 1 /proc/fs/lustre/ost/OSS/ost_io/stats
> > 
> > > Can I fix this problem by extending timers, such as 
> > > /proc/sys/lustre/timeout and /proc/sys/lustre/ldlm_timeout ?
> > 
> > Increasing /proc/sys/lustre/timeout would likely help.
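> > 
Concretely, something along these lines on each server and client
(300 is just an example value, not a recommendation from Andreas):

```shell
# Hedged sketch: obd_timeout ("timeout") is in seconds; 100 is the 1.6
# default.  ldlm_timeout should stay well below obd_timeout.  This does
# not persist across reboots.
if [ -e /proc/sys/lustre/timeout ]; then
    echo "current obd_timeout: $(cat /proc/sys/lustre/timeout)"
    echo 300 > /proc/sys/lustre/timeout
else
    echo "/proc/sys/lustre/timeout not present (not a Lustre node)"
fi
```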
> > 
> > Cheers, Andreas
> > --
> > Andreas Dilger
> > Sr. Staff Engineer, Lustre Group
> > Sun Microsystems of Canada, Inc.
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 


More information about the lustre-discuss mailing list