[Lustre-discuss] IOR Single File -- lock callback timer expired

Andreas Dilger adilger at sun.com
Fri Dec 12 17:10:35 PST 2008


On Dec 10, 2008  13:21 -0500, Roger Spellman wrote:
> I have a customer running IOR on 128 clients, using IOR's POSIX mode to
> create a single file.  
> 
> The clients are running Lustre 1.6.6.  The servers are running Lustre
> 1.6.5.

If the file is not striped over multiple OSTs, the single OST it lives on
(a stripe count of 1 is the default) may simply be overloaded.
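
A rough sketch of how one might check and widen the striping. The function
names are made up for illustration, and the lfs option syntax has changed
between Lustre releases, so compare against "lfs help setstripe" on your
version before relying on it. Striping is fixed when a file is created, so
the default has to be set on the directory before IOR creates the file.

```shell
# Sketch only: "show_stripes" and "stripe_dir_wide" are made-up names.

# Print the stripe layout (stripe count and OST objects) of an existing file.
show_stripes() {
    lfs getstripe "$1"
}

# Set a directory's default layout so that files created in it afterwards
# are striped over all available OSTs (-c -1).  A file that already exists
# keeps the layout it was created with.
stripe_dir_wide() {
    lfs setstripe -c -1 "$1"
}
```

If show_stripes reports a single object on OST0000 for the IOR file, all
128 clients are writing to that one OST.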

>     mpiexec noticed that job rank 0 with PID 7520 on node whitney160
> exited on signal 42 (Real-time signal 8).
> 
> Looking at the logs on the servers, I see a bunch of messages like the
> following:
> 
> Dec  9 18:23:38 ts-sandia-02 kernel: LustreError:
> 0:0:(ldlm_lockd.c:234:waiting_locks_callback()) ### lock callback timer
> expired after 116s: evicting client at 192.168.121.32@o2ib  ns:
> filter-scratch-OST0000_UUID lock: ffff810014239600/0x6316855aa9d9f014
> lrc: 1/0,0 mode: PW/PW res: 5987/0 rrc: 373 type: EXT
> [1409286144->1442840575] (req 1409286144->1410334719) flags: 20
> remote: 0x77037709d529258a expref: 28 pi
>  
> 
> What might be causing this?

This indicates that, from the OST's point of view, the client has neither
cancelled the lock nor done any writes under it in the past 2 minutes.
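
To see which clients are being evicted and from which OST, something like
the following could be run over the server syslog. It is only a sketch: the
function name is invented, and it assumes each message sits on one line in
the log file (the wrapping above comes from the mail) with the client NID
in the usual address@network form.

```shell
# Sketch only: "evictions" is a made-up helper name.
# Reads syslog lines on stdin and prints, for every "lock callback timer
# expired" message:  <timeout in seconds> <client NID> <OST namespace>
evictions() {
    sed -n 's/.*lock callback timer expired after \([0-9][0-9]*\)s: evicting client at \([^ ]*\) .*ns: \([^ ]*\) .*/\1 \2 \3/p'
}
```

Piping the result through "sort | uniq -c" shows whether the evictions are
spread over many clients or concentrated on a few.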

It would be worthwhile for you to check the RPC IO stats to see how long
writes are taking on this OST:

	llstat -i 1 /proc/fs/lustre/ost/OSS/ost_io/stats

> Can I fix this problem by extending timers, such as
> /proc/sys/lustre/timeout and /proc/sys/lustre/ldlm_timeout ?

Increasing /proc/sys/lustre/timeout would likely help.
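
A minimal sketch of checking and raising the value. The helper name and the
300 in the usage line are illustrative assumptions, not recommended values.
A write to /proc needs root and does not survive a reboot.

```shell
# Sketch only: "set_lustre_timeout" is a made-up helper name.
# Usage: set_lustre_timeout <seconds> [file]
set_lustre_timeout() {
    new=$1
    file=${2:-/proc/sys/lustre/timeout}   # path as given in this thread
    echo "old: $(cat "$file")"            # show the current value first
    echo "$new" > "$file"                 # write the new timeout
    echo "new: $(cat "$file")"            # read it back to confirm
}

# Example (as root):  set_lustre_timeout 300
```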

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



