[Lustre-discuss] LustreError: lock callback timer expired after

Simon Latapie simon.latapie at oxalya.com
Mon Mar 30 04:06:13 PDT 2009


Greetings,

I currently have a lustre system with 1 MDS, 2 OSS with 2 OSTs each, and 
37 lustre clients (1 login and 36 compute nodes), all using infiniband 
as lustre network (o2ib). All nodes are on 1.6.5.1 patched kernel.

For the past two months, several times a month, the login node seems to 
be permanently evicted from the OSTs. The OSTs show a "lock callback 
timer expired after ..." error, then the login tries to reconnect, and 
fail. As lustre mount is the home directory of the cluster, users can't 
have access to it, and can't log in anymore. The only way I found to 
stop this is to reboot the login node (umount -f stucks). After the 
reboot, the login simply reconnects to the OST, and everything is okay 
until the next "lock callback timer" issue.
Compute nodes doesn't seem to be affected by this problem. Only login 
node does.
There is no memory problem (no swap, no memory leaks), neither on OSTs 
or login node.
There is network error (no packet loss detected), in IB or IPoIB.
Expiration time can be very random: from about 300s to 9000s.

I tried to reproduce the bug, for example by stressing the login with a 
lot of simultaneous creation/read/write of files, or with big IOs (using 
IOR from lustre-iokit), or by stressing the compute nodes and the login 
at the same time, but none of these worked.

I also tried to increase the global timeout on the OSTs and login to 
300, but the problem still reappeared.

Any tip or suggestion would be very appreciated.

By the way, is there any way to extract the list of connections between 
lock ids and files for a single node ?



Here is an example of logs for the problem:

There is no error on the /var/log/messages or lustre debug log at the 
supposed time of lock callback, on any machine.

/var/log/messages on an OSS:

Mar 27 17:28:51 lustre-oss-0-1 kernel: LustreError: 
0:0:(ldlm_lockd.c:234:waiting_locks_callback()) ### lock callback timer 
expired after 339s: evicting client at 192.168.198.203 at o2ib  ns: 
filter-lustre-OST0001_UUID lock: 00000101a5de8280/0xc8fe937f513470ae 
lrc: 1/0,0 mode: PW/PW res: 10811159/0 rrc: 8 type: EXT 
[0->18446744073709551615] (req 1118208->1122303) flags: 20 remote: 
0xc9f5de2876c8bde0 expref: 3006 pid: 6678
Mar 27 17:29:01 lustre-oss-0-1 kernel: LustreError: 
8203:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error 
(-107)  req at 00000101fd830c00 x193375/t0 o13-><?>@<?>:0/0 lens 128/0 e 0 
to 0 dl 1238171441 ref 1 fl Interpret:/0/0 rc -107/0
Mar 27 17:29:01 lustre-oss-0-1 kernel: LustreError: 
8203:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped 630 previous 
similar messages

debug log on the login:

00000100:02020000:7:1238171222.003080:0:11481:0:(client.c:788:ptlrpc_check_status()) 
11-0: an error occurred while communicating with 192.168.198.206 at o2ib. 
The ost_statfs operation failed with -107
00000100:00080000:7:1238171222.003091:0:11481:0:(recover.c:188:ptlrpc_request_handle_notconn()) 
import lustre-OST0000-osc-0000010821d74800 of 
lustre-OST0000_UUID at 192.168.198.206@o2ib abruptly disconnected: reconnecting
00000100:02000400:7:1238171222.003097:0:11481:0:(import.c:133:ptlrpc_set_import_discon()) 
lustre-OST0000-osc-0000010821d74800: Connection to service 
lustre-OST0000 via nid 192.168.198.206 at o2ib was lost; in progress 
operations using this service will wait for recovery to complete.
00000100:00080000:7:1238171222.003103:0:11481:0:(import.c:141:ptlrpc_set_import_discon()) 
00000104362e3c00 lustre-OST0000_UUID: changing import state from FULL to 
DISCONN
00000100:00080000:7:1238171222.003108:0:11481:0:(import.c:465:ptlrpc_connect_import()) 
00000104362e3c00 lustre-OST0000_UUID: changing import state from DISCONN 
to CONNECTING
00000100:00080000:7:1238171222.003111:0:11481:0:(import.c:342:import_select_connection()) 
lustre-OST0000-osc-0000010821d74800: connect to NID 192.168.198.206 at o2ib 
last attempt 4294943333
00000100:00080000:7:1238171222.003118:0:11481:0:(import.c:423:import_select_connection()) 
lustre-OST0000-osc-0000010821d74800: import 00000104362e3c00 using 
connection 192.168.198.206 at o2ib/192.168.198.206 at o2ib
00000100:00080000:7:1238171222.003238:0:2274:0:(import.c:737:ptlrpc_connect_interpret()) 
@@@ evicting (not initial connect and flags reconnect/recovering not 
set: 4)  req at 000001041c3a0600 x193244/t0 
o8->lustre-OST0000_UUID at 192.168.198.206@o2ib:28/4 lens 304/456 e 0 to 
100 dl 1238171321 ref 1 fl Interpret:R/0/0 rc 0/0
00000100:00080000:7:1238171222.003244:0:2274:0:(import.c:740:ptlrpc_connect_interpret()) 
00000104362e3c00 lustre-OST0000_UUID: changing import state from 
CONNECTING to EVICTED
00000100:02020000:7:1238171222.003246:0:2274:0:(import.c:1049:ptlrpc_import_recovery_state_machine()) 
167-0: This client was evicted by lustre-OST0000; in progress operations 
using this service will fail.
00000100:00080000:7:1238171222.003251:0:2274:0:(import.c:1053:ptlrpc_import_recovery_state_machine()) 
evicted from lustre-OST0000_UUID at 192.168.198.206@o2ib; invalidating
00000100:00080000:7:1238171222.003292:0:11486:0:(import.c:1016:ptlrpc_invalidate_import_thread()) 
thread invalidate import lustre-OST0000-osc-0000010821d74800 to 
lustre-OST0000_UUID at 192.168.198.206@o2ib
00000100:00080000:7:1238171222.003295:0:11486:0:(import.c:168:ptlrpc_deactivate_and_unlock_import()) 
setting import lustre-OST0000_UUID INVALID
00020000:01000000:7:1238171222.003300:0:11486:0:(lov_obd.c:505:lov_set_osc_active()) 
Marking OSC lustre-OST0000_UUID inactive
00000100:00080000:7:1238171222.181183:0:11486:0:(import.c:1025:ptlrpc_invalidate_import_thread()) 
00000104362e3c00 lustre-OST0000_UUID: changing import state from EVICTED 
to RECOVER
00000100:00080000:7:1238171222.181189:0:11486:0:(import.c:1103:ptlrpc_import_recovery_state_machine()) 
reconnected to lustre-OST0000_UUID at 192.168.198.206@o2ib
00000100:00080000:7:1238171222.181192:0:11486:0:(import.c:1108:ptlrpc_import_recovery_state_machine()) 
00000104362e3c00 lustre-OST0000_UUID: changing import state from RECOVER 
to FULL
00020000:01000000:7:1238171222.181196:0:11486:0:(lov_obd.c:505:lov_set_osc_active()) 
Marking OSC lustre-OST0000_UUID active
00000100:02000000:7:1238171222.181202:0:11486:0:(import.c:1116:ptlrpc_import_recovery_state_machine()) 
lustre-OST0000-osc-0000010821d74800: Connection restored to service 
lustre-OST0000 using nid 192.168.198.206 at o2ib.
00000100:02020000:5:1238171341.066101:0:11794:0:(client.c:788:ptlrpc_check_status()) 
11-0: an error occurred while communicating with 192.168.198.204 at o2ib. 
The ost_statfs operation failed with -107
00000100:00080000:5:1238171341.066113:0:11794:0:(recover.c:188:ptlrpc_request_handle_notconn()) 
import lustre-OST0001-osc-0000010821d74800 of 
lustre-OST0001_UUID at 192.168.198.204@o2ib abruptly disconnected: reconnecting
00000100:02000400:5:1238171341.066117:0:11794:0:(import.c:133:ptlrpc_set_import_discon()) 
lustre-OST0001-osc-0000010821d74800: Connection to service 
lustre-OST0001 via nid 192.168.198.204 at o2ib was lost; in progress 
operations using this service will wait for recovery to complete.
00000100:00080000:5:1238171341.066126:0:11794:0:(import.c:141:ptlrpc_set_import_discon()) 
0000010821d14c00 lustre-OST0001_UUID: changing import state from FULL to 
DISCONN
00000100:00080000:5:1238171341.066130:0:11794:0:(import.c:465:ptlrpc_connect_import()) 
0000010821d14c00 lustre-OST0001_UUID: changing import state from DISCONN 
to CONNECTING
00000100:00080000:5:1238171341.066133:0:11794:0:(import.c:342:import_select_connection()) 
lustre-OST0001-osc-0000010821d74800: connect to NID 192.168.198.204 at o2ib 
last attempt 4294943333
00000100:00080000:5:1238171341.066140:0:11794:0:(import.c:423:import_select_connection()) 
lustre-OST0001-osc-0000010821d74800: import 0000010821d14c00 using 
connection 192.168.198.204 at o2ib/192.168.198.204 at o2ib
00000100:00080000:7:1238171341.093184:0:2274:0:(import.c:737:ptlrpc_connect_interpret()) 
@@@ evicting (not initial connect and flags reconnect/recovering not 
set: 4)  req at 000001082130d400 x193378/t0 
o8->lustre-OST0001_UUID at 192.168.198.204@o2ib:28/4 lens 304/456 e 0 to 
100 dl 1238171441 ref 1 fl Interpret:R/0/0 rc 0/0
00000100:00080000:7:1238171341.093193:0:2274:0:(import.c:740:ptlrpc_connect_interpret()) 
0000010821d14c00 lustre-OST0001_UUID: changing import state from 
CONNECTING to EVICTED
00000100:02020000:7:1238171341.093196:0:2274:0:(import.c:1049:ptlrpc_import_recovery_state_machine()) 
167-0: This client was evicted by lustre-OST0001; in progress operations 
using this service will fail.
00000100:00080000:7:1238171341.093203:0:2274:0:(import.c:1053:ptlrpc_import_recovery_state_machine()) 
evicted from lustre-OST0001_UUID at 192.168.198.204@o2ib; invalidating
00000100:00080000:7:1238171341.093279:0:11859:0:(import.c:1016:ptlrpc_invalidate_import_thread()) 
thread invalidate import lustre-OST0001-osc-0000010821d74800 to 
lustre-OST0001_UUID at 192.168.198.204@o2ib
00000100:00080000:7:1238171341.093282:0:11859:0:(import.c:168:ptlrpc_deactivate_and_unlock_import()) 
setting import lustre-OST0001_UUID INVALID
00020000:01000000:7:1238171341.093291:0:11859:0:(lov_obd.c:505:lov_set_osc_active()) 
Marking OSC lustre-OST0001_UUID inactive
00000080:00020000:5:1238171341.093306:0:11794:0:(llite_lib.c:1561:ll_statfs_internal()) 
obd_statfs fails: rc = -4


Thanks in advance,

-- 
Simon Latapie





More information about the lustre-discuss mailing list