[Lustre-discuss] IOR Single File -- lock callback timer expired

Roger Spellman roger at terascala.com
Mon Dec 15 09:47:30 PST 2008


Andreas,

Thanks.  

> If the file is not striped over multiple OSTs it may be that the 1
> (default)
> OST that this file is striped over is being overloaded.

The file is striped over many OSTs.  The customer has tested between 8
and 18 stripes, to my knowledge.  
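(For reference, this is roughly how we have been checking and setting the
striping.  The paths below are just placeholders for the customer's actual
IOR file and test directory, and I'm assuming the option-style lfs syntax
that our 1.6 clients accept:)

	# show the stripe count and the OST objects backing the shared file
	lfs getstripe /mnt/scratch/ior.file

	# stripe new files created in the test directory across 16 OSTs
	lfs setstripe -c 16 /mnt/scratch/iordir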

As far as I can tell, I can control how many RPCs are outstanding from
each client to each OST.  However, I cannot control the total number of
outstanding RPCs from a single client.  So, it is possible that many (or
even all) of the 128 clients have outstanding I/Os to the same OST, even
if the file is striped.  Do you agree?  

Is there a proc file like max_rpcs_in_flight that is per-client, not
per-client/per-OST pair?
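(In case it helps, this is the per-OST knob I have been adjusting on the
clients; I have not found an aggregate per-client equivalent.  The osc
directory names are globbed here as a placeholder for whatever "lctl dl"
reports on the real system:)

	# current per-OST limit on one client (one value per OSC device)
	cat /proc/fs/lustre/osc/*/max_rpcs_in_flight

	# raise the per-OST limit for every OSC on this client to 16
	for f in /proc/fs/lustre/osc/*/max_rpcs_in_flight; do
		echo 16 > $f
	done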

> 	llstat -i 1 /proc/fs/lustre/ost/OSS/ost_io/stats

Do you want this command to be run WHILE the test is going on?

Thanks again.

-Roger

> -----Original Message-----
> From: Andreas.Dilger at sun.com [mailto:Andreas.Dilger at sun.com]
> On Behalf Of Andreas Dilger
> Sent: Friday, December 12, 2008 8:11 PM
> To: Roger Spellman
> Cc: lustre-discuss at lists.lustre.org
> Subject: Re: [Lustre-discuss] IOR Single File -- lock callback timer
> expired
> 
> On Dec 10, 2008  13:21 -0500, Roger Spellman wrote:
> > I have a customer running IOR on 128 clients, using IOR's POSIX
> > mode to create a single file.
> >
> > The clients are running Lustre 1.6.6.  The servers are running
> > Lustre 1.6.5.
> 
> If the file is not striped over multiple OSTs it may be that the 1
> (default)
> OST that this file is striped over is being overloaded.
> 
> >     mpiexec noticed that job rank 0 with PID 7520 on node whitney160
> > exited on signal 42 (Real-time signal 8).
> >
> > Looking at the logs on the servers, I see a bunch of messages like
> > the following:
> >
> > Dec  9 18:23:38 ts-sandia-02 kernel: LustreError:
> > 0:0:(ldlm_lockd.c:234:waiting_locks_callback()) ### lock callback
> > timer expired after 116s: evicting client at 192.168.121.32 at o2ib
> > ns: filter-scratch-OST0000_UUID lock:
> > ffff810014239600/0x6316855aa9d9f014
> > lrc: 1/0,0 mode: PW/PW res: 5987/0 rrc: 373 type: EXT
> > [1409286144->1442840575] (req 1409286144->1410334719) flags: 20
> > remote: 0x77037709d529258a expref: 28 pi
> >
> >
> > What might be causing this?
> 
> This indicates that (from the OST's POV) the client hasn't cancelled
> the lock, nor done any writes under this lock in the past 2 minutes.
> 
> It would be worthwhile for you to check the RPC IO stats to see how
> long writes are taking on this OST:
> 
> 	llstat -i 1 /proc/fs/lustre/ost/OSS/ost_io/stats
> 
> > Can I fix this problem by extending timers, such as
> > /proc/sys/lustre/timeout and /proc/sys/lustre/ldlm_timeout ?
> 
> Increasing /proc/sys/lustre/timeout would likely help.
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
