[Lustre-discuss] IOR Single File -- lock callback timer expired

Roger Spellman roger at terascala.com
Wed Dec 10 10:21:22 PST 2008


I have a customer running IOR on 128 clients, using IOR's POSIX mode to
write a single shared file.

The clients are running Lustre 1.6.6.  The servers are running Lustre
1.6.5.

 

The following is the IOR output:

 

    /usr/bin/lfs setstripe /scratch1/test/bm_runs 2097152 -1 18
    IOR-2.10.1: MPI Coordinated Test of Parallel I/O

    Run began: Tue Dec  9 09:19:37 2008
    Command line used: /home/test/IOR/bin/IOR-2.10.1 -b 32g -t 1m -i 1 -a POSIX -E -g -C -w -r -v -d 2 -o /scratch1/test/bm_runs/IOR.stripe.18.1
    Machine: Linux whitney160
    Start time skew across all tasks: 0.02 sec
    Path: /scratch1/test/bm_runs
    FS: 118.7 TiB   Used FS: 2.6%   Inodes: 300.5 Mi   Used Inodes: 0.0%
    Participating tasks: 128
    Using reorderTasks '-C' (expecting block, not cyclic, task assignment)

    Summary:
        api                = POSIX
        test filename      = /scratch1/test/bm_runs/IOR.stripe.18.1
        access             = single-shared-file
        pattern            = segmented (1 segment)
        ordering           = sequential offsets
        clients            = 128 (1 per node)
        repetitions        = 1
        xfersize           = 1 MiB
        blocksize          = 32 GiB
        aggregate filesize = 4096 GiB

    delaying 2 seconds . . .
    Commencing write performance test.
    Tue Dec  9 09:19:39 2008

    ** error **
    ERROR in aiori-POSIX.c (line 247): transfer failed.
    ERROR: No locks available
    ** exiting **
    [whitney287:07469] MPI_ABORT invoked on rank 127 in communicator MPI_COMM_WORLD with errorcode -1
    mpiexec noticed that job rank 0 with PID 7520 on node whitney160 exited on signal 42 (Real-time signal 8).
    110 additional processes aborted (not shown)
    16 processes killed (possibly by Open MPI)
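
As a sanity check on the sizes in the IOR summary, the arithmetic can be sketched as below. The task count, block size and transfer size come from the output above. Reading 2097152 and 18 in the lfs setstripe command as the stripe size in bytes and the stripe count is my assumption about the positional arguments. The variable names are only illustrative.

```python
# Sizes taken from the IOR summary; stripe values are an assumed reading
# of the positional arguments in "lfs setstripe ... 2097152 -1 18".
GIB = 1024 ** 3
MIB = 1024 ** 2

tasks = 128                 # "Participating tasks: 128"
block_size = 32 * GIB       # -b 32g, contiguous region written by each task
transfer_size = 1 * MIB     # -t 1m, size of each write() call
stripe_size = 2097152       # assumed: stripe size in bytes (2 MiB)
stripe_count = 18           # assumed: number of OSTs the file is striped over

aggregate = tasks * block_size
transfers_per_task = block_size // transfer_size
stripes_per_block = block_size // stripe_size
per_ost_gib = aggregate / stripe_count / GIB

print(f"aggregate file size : {aggregate // GIB} GiB")
print(f"writes per task     : {transfers_per_task}")
print(f"stripes per block   : {stripes_per_block}")
print(f"data per OST object : {per_ost_gib:.1f} GiB")
```

The first line of output should agree with the "aggregate filesize = 4096 GiB" reported by IOR.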

    

Looking at the logs on the servers, I see a bunch of messages like the
following:

 

Dec  9 18:23:38 ts-sandia-02 kernel: LustreError:
0:0:(ldlm_lockd.c:234:waiting_locks_callback()) ### lock callback timer
expired after 116s: evicting client at 192.168.121.32 at o2ib  ns:
filter-scratch-OST0000_UUID lock: ffff810014239600/0x6316855aa9d9f014
lrc: 1/0,0 mode: PW/PW res: 5987/0 rrc: 373 type: EXT
[1409286144->1442840575] (req 1409286144->1410334719) flags: 2

0 remote: 0x77037709d529258a expref: 28 pi
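
To make the extent numbers in that message easier to read, here is a small sketch that converts them to sizes. It assumes the bracketed pair is the extent the lock currently covers and the "req" pair is the extent originally requested, both as inclusive byte offsets. That is my reading of the log format, not something I have confirmed. The function and constant names are hypothetical.

```python
import re

# Fragment copied from the server log above, with the wrapped lines joined.
LOG_FRAGMENT = "type: EXT [1409286144->1442840575] (req 1409286144->1410334719)"

# Matches "[start->end] (req start->end)".
EXTENT_RE = re.compile(r"\[(\d+)->(\d+)\] \(req (\d+)->(\d+)\)")
MIB = 1024 ** 2


def extent_sizes(fragment):
    """Return (granted_bytes, requested_bytes), assuming inclusive end offsets."""
    match = EXTENT_RE.search(fragment)
    if match is None:
        raise ValueError("no extent found in log fragment")
    g_start, g_end, r_start, r_end = (int(x) for x in match.groups())
    return g_end - g_start + 1, r_end - r_start + 1


granted, requested = extent_sizes(LOG_FRAGMENT)
print(f"granted: {granted // MIB} MiB, requested: {requested // MIB} MiB")
# prints "granted: 32 MiB, requested: 1 MiB"
```

Under that reading, the requested extent is exactly 1 MiB, which matches the IOR transfer size, while the lock itself covers 32 MiB.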

 

What might be causing this?

 

Can I fix this problem by increasing timeouts such as
/proc/sys/lustre/timeout and /proc/sys/lustre/ldlm_timeout?

 

Are there other timers I can try?
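
For reference, the current values of the two timers named above can be recorded before any change with something like the following sketch. It only reads the two /proc files and reports "unavailable" if a file is missing or does not contain an integer. The helper name read_timers is hypothetical, and whether every node exposes both files is an assumption.

```python
from pathlib import Path

# The two tunables mentioned in the question above.
TIMER_PATHS = (
    "/proc/sys/lustre/timeout",
    "/proc/sys/lustre/ldlm_timeout",
)


def read_timers(paths=TIMER_PATHS):
    """Map each path to its integer value, or None if it cannot be read."""
    values = {}
    for path in paths:
        try:
            values[path] = int(Path(path).read_text().strip())
        except (OSError, ValueError):
            # File missing (e.g. Lustre not loaded) or contents not an integer.
            values[path] = None
    return values


if __name__ == "__main__":
    for path, value in read_timers().items():
        print(path, value if value is not None else "unavailable")
```

Running this on a client and on each server would show whether the timers are set consistently across nodes.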

 

 

Thanks for your help.

 

Roger Spellman

Staff Engineer

Terascala, Inc.

508-588-1501
