[Lustre-discuss] IOR Single File -- lock callback timer expired
Roger Spellman
roger at terascala.com
Wed Dec 10 10:21:22 PST 2008
I have a customer running IOR on 128 clients, using IOR's POSIX mode to
create a single file.
The clients are running Lustre 1.6.6. The servers are running Lustre
1.6.5.
The following is the IOR output:
/usr/bin/lfs setstripe /scratch1/test/bm_runs 2097152 -1 18
IOR-2.10.1: MPI Coordinated Test of Parallel I/O
Run began: Tue Dec 9 09:19:37 2008
Command line used: /home/test/IOR/bin/IOR-2.10.1 -b 32g -t 1m -i 1
-a POSIX -E -g -C -w -r -v -d 2 -o
/scratch1/test/bm_runs/IOR.stripe.18.1
Machine: Linux whitney160
Start time skew across all tasks: 0.02 sec
Path: /scratch1/test/bm_runs
FS: 118.7 TiB Used FS: 2.6% Inodes: 300.5 Mi Used Inodes: 0.0%
Participating tasks: 128
Using reorderTasks '-C' (expecting block, not cyclic, task
assignment)
Summary:
api = POSIX
test filename = /scratch1/test/bm_runs/IOR.stripe.18.1
access = single-shared-file
pattern = segmented (1 segment)
ordering = sequential offsets
clients = 128 (1 per node)
repetitions = 1
xfersize = 1 MiB
blocksize = 32 GiB
aggregate filesize = 4096 GiB
delaying 2 seconds . . .
Commencing write performance test.
Tue Dec 9 09:19:39 2008
** error **
ERROR in aiori-POSIX.c (line 247): transfer failed.
ERROR: No locks available
** exiting **
[whitney287:07469] MPI_ABORT invoked on rank 127 in communicator
MPI_COMM_WORLD with errorcode -1
mpiexec noticed that job rank 0 with PID 7520 on node whitney160
exited on signal 42 (Real-time signal 8).
110 additional processes aborted (not shown)
16 processes killed (possibly by Open MPI)
Looking at the logs on the servers, I see a bunch of messages like the
following:
Dec 9 18:23:38 ts-sandia-02 kernel: LustreError:
0:0:(ldlm_lockd.c:234:waiting_locks_callback()) ### lock callback timer
expired after 116s: evicting client at 192.168.121.32@o2ib ns:
filter-scratch-OST0000_UUID lock: ffff810014239600/0x6316855aa9d9f014
lrc: 1/0,0 mode: PW/PW res: 5987/0 rrc: 373 type: EXT
[1409286144->1442840575] (req 1409286144->1410334719) flags: 2
0 remote: 0x77037709d529258a expref: 28 pi
What might be causing this?
Can I fix this problem by increasing timeouts such as
/proc/sys/lustre/timeout and /proc/sys/lustre/ldlm_timeout?
Are there other timers I can try?
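If timeout tuning is the right direction, here is a minimal sketch of what I would try on the servers. The specific values are illustrative guesses on my part, not tested recommendations:

```shell
# Inspect the current Lustre timeouts (paths as in Lustre 1.6):
cat /proc/sys/lustre/timeout       # obd_timeout: overall RPC timeout
cat /proc/sys/lustre/ldlm_timeout  # lock callback (AST) timeout

# Illustrative increase -- values are guesses, not recommendations:
echo 300 > /proc/sys/lustre/timeout
echo 60  > /proc/sys/lustre/ldlm_timeout
```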
Thanks for your help.
Roger Spellman
Staff Engineer
Terascala, Inc.
508-588-1501