[lustre-discuss] Client eviction during OST failover

Raj rajgautam at gmail.com
Thu Feb 2 13:24:03 PST 2017


Hello, I am testing 2.8.0 in server and clients are getting evicted when an
OST failover occurs resulting in IO error. Failover happens within 2 mins
and lustre timeouts are very comfortable (lustre timeout is set to 300,
at_min is 40, at_max is 400, ldlm_enqueue_min=260).
Do you see any thing abnormal in the below debug? Any help is appreciated.

Client side debug when OST000e is unmounted and mounted on an OSS node(it
was mounted again at 7:21):
Feb  2 07:20:25 hcrddm005 kernel: LustreError: 11-0:
hsi003-OST000e-osc-ffff8815c65f4400: Communicating with 10.10.150.21 at o2ib,
operation ost_read failed with -107.
Feb  2 07:20:25 hcrddm005 kernel: Lustre:
hsi003-OST000e-osc-ffff8815c65f4400: Connection to hsi003-OST000e (at
10.10.150.21 at o2ib) was lost; in progress operations using this service will
wait for recovery to complete
Feb  2 07:20:25 hcrddm005 kernel: LustreError: dumping log to
*/tmp/lustre-log.1486041625.111200*
Feb  2 07:20:25 hcrddm005 kernel: LustreError: Skipped 1 previous similar
message
Feb  2 07:21:27 hcrddm005 kernel: LustreError: 167-0:
hsi003-OST000e-osc-ffff8815c65f4400: This client was evicted by
hsi003-OST000e; in progress operations using this service will fail.
Feb  2 07:21:28 hcrddm005 kernel: LustreError:
78611:0:(osc_lock.c:832:osc_ldlm_completion_ast()) lock at ffff8819fd5eca98[3
3 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x1000e0000:0x26442:0x0] {
Feb  2 07:21:28 hcrddm005 kernel: LustreError:
78611:0:(osc_lock.c:832:osc_ldlm_completion_ast())
lovsub at ffff881be723c720: [0 ffff881fb55c26e0 R(1):[0,
18446744073709551615]@[0x200000bd1:0x1e:0x0]]
Feb  2 07:21:28 hcrddm005 kernel: LustreError:
78611:0:(osc_lock.c:832:osc_ldlm_completion_ast())     osc at ffff881ab01f5978:
ffff881d5cb4ebc0    0x20000001001 0x3e482ccc3f43b14c 3 ffff880b6711a5d0
size: 61045997568 mtime: 1485886851 atime: 0 ctime: 1485886851 blocks:
113896795
Feb  2 07:21:28 hcrddm005 kernel: LustreError:
78611:0:(osc_lock.c:832:osc_ldlm_completion_ast()) } lock at ffff8819fd5eca98
Feb  2 07:21:28 hcrddm005 kernel: LustreError:
78611:0:(osc_lock.c:832:osc_ldlm_completion_ast()) dlmlock returned -5
Feb  2 07:21:28 hcrddm005 kernel: LustreError:
78611:0:(ldlm_resource.c:809:ldlm_resource_complain())
hsi003-OST000e-osc-ffff8815c65f4400: namespace resource [0x26442:0x0:0x0].0
(ffff8812426c35c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
Feb  2 07:21:28 hcrddm005 kernel: LustreError:
78611:0:(ldlm_resource.c:1448:ldlm_resource_dump()) --- Resource:
[0x26442:0x0:0x0].0 (ffff8812426c35c0) refcount = 2
Feb  2 07:21:28 hcrddm005 kernel: LustreError:
78611:0:(ldlm_resource.c:1451:ldlm_resource_dump()) Granted locks (in
reverse order):
Feb  2 07:21:28 hcrddm005 kernel: LustreError:
78611:0:(ldlm_resource.c:1454:ldlm_resource_dump()) ### ### ns:
hsi003-OST000e-osc-ffff8815c65f4400 lock:
ffff881d5cb4ebc0/0x3e482ccc3f43b14c lrc: 3/0,0 mode: PR/PR res:
[0x26442:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req
0->18446744073709551615) flags: 0x126400000000 nid: local remote:
0xa7242e5b8ddab5d0 expref: -99 pid: 232385 timeout: 0 lvb_type: 1
Feb  2 07:21:28 hcrddm005 kernel: LustreError:
78611:0:(import.c:1296:ptlrpc_invalidate_import_thread()) dump the log upon
eviction
Feb  2 07:21:28 hcrddm005 kernel: LustreError: dumping log to
*/tmp/lustre-log.1486041688.78611*
Feb  2 07:21:28 hcrddm005 kernel: Lustre:
hsi003-OST000e-osc-ffff8815c65f4400: Connection restored to hsi003-OST000e
(at 10.10.150.21 at o2ib)
Feb  2 07:21:34 hcrddm005 kernel: LustreError:
78524:0:(cl_lock.c:1422:cl_unuse_try()) result = -5, this is unlikely!
Feb  2 07:21:34 hcrddm005 kernel: LustreError:
78524:0:(cl_lock.c:1437:cl_unuse_locked()) lock at ffff8817dddda078[2 0 0 1 0
00000000] R(1):[0, 18446744073709551615]@[0x200000bd1:0x1e:0x0] {
Feb  2 07:21:34 hcrddm005 kernel: LustreError:
78524:0:(cl_lock.c:1437:cl_unuse_locked())     vvp at ffff881c7a3823f8:
Feb  2 07:21:34 hcrddm005 kernel: LustreError:
78524:0:(cl_lock.c:1437:cl_unuse_locked())     lov at ffff881fb55c26e0: 1
Feb  2 07:21:34 hcrddm005 kernel: LustreError:
78524:0:(cl_lock.c:1437:cl_unuse_locked())     0 0: ---
Feb  2 07:21:34 hcrddm005 kernel: LustreError:
78524:0:(cl_lock.c:1437:cl_unuse_locked())
Feb  2 07:21:34 hcrddm005 kernel: LustreError:
78524:0:(cl_lock.c:1437:cl_unuse_locked()) } lock at ffff8817dddda078
Feb  2 07:21:34 hcrddm005 kernel: LustreError:
78524:0:(cl_lock.c:1437:cl_unuse_locked()) unuse return –5

[client]</lus/hsi003/raj>$ dd if=raj_14 of=/dev/null bs=1M
*dd: reading `raj_14': Input/output error*
8307+0 records in
8307+0 records out
8710520832 bytes (8.7 GB) copied, 72.793 s, 120 MB/s
[client]</lus/hsi003/raj>$

/tmp/lustre-log.1486041625.111200 says:
00000100:00080000:12.0:1486041625.090841:0:111200:0:(recover.c:249:ptlrpc_request_handle_notconn())
import hsi003-OST000e-osc-ffff8815c65f4400 of
hsi003-OST000e_UUID at 10.10.150.21@o2ib abruptly disconnected: reconnecting
00000100:02000400:12.0:1486041625.090846:0:111200:0:(import.c:167:ptlrpc_set_import_discon())
hsi003-OST000e-osc-ffff8815c65f4400: Connection to hsi003-OST000e (at
10.10.150.21 at o2ib) was lost; in progress operations using this service will
wait for recovery to complete
00000100:00080000:12.0:1486041625.090855:0:111200:0:(import.c:176:ptlrpc_set_import_discon())
ffff880a9ea66000 hsi003-OST000e_UUID: changing import state from FULL to
DISCONN
00000100:00080000:13.0:1486041625.090916:0:111200:0:(import.c:640:ptlrpc_connect_import())
ffff880a9ea66000 hsi003-OST000e_UUID: changing import state from DISCONN to
CONNECTING
00000100:00080000:13.0:1486041625.090923:0:111200:0:(import.c:485:import_select_connection())
hsi003-OST000e-osc-ffff8815c65f4400: connect to NID 10.10.150.21 at o2ib last
attempt 34103933580
00000100:00080000:13.0:1486041625.090930:0:111200:0:(import.c:563:import_select_connection())
hsi003-OST000e-osc-ffff8815c65f4400: import ffff880a9ea66000 using
connection 10.10.150.21 at o2ib/10.10.150.21 at o2ib
00000100:00080000:14.0:1486041625.091061:0:111188:0:(import.c:1158:ptlrpc_connect_interpret())
ffff880a9ea66000 hsi003-OST000e_UUID: changing import state from CONNECTING
to DISCONN
00000100:00080000:14.0:1486041625.091066:0:111188:0:(import.c:1204:ptlrpc_connect_interpret())
recovery of hsi003-OST000e_UUID on 10.10.150.21 at o2ib failed (-19)
00000100:00080000:11.0:1486041625.092288:0:111201:0:(client.c:1438:ptlrpc_send_new_req())
@@@ req from PID 0 waiting for recovery: (FULL != DISCONN)
 req at ffff880a2208d400 x1541968458196356/t0(0) o3->
hsi003-OST000e-osc-ffff8815c65f4400 at 10.10.150.21@o2ib:6/4 lens 488/432 e 0
to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
00000100:00080000:13.0:1486041625.093806:0:111202:0:(client.c:1438:ptlrpc_send_new_req())
@@@ req from PID 0 waiting for recovery: (FULL != DISCONN)
 req at ffff880c2caaac00 x1541968458196360/t0(0) o3->
hsi003-OST000e-osc-ffff8815c65f4400 at 10.10.150.21@o2ib:6/4 lens 488/432 e 0
to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
00000100:00080000:13.0:1486041625.098325:0:111205:0:(client.c:1438:ptlrpc_send_new_req())
@@@ req from PID 0 waiting for recovery: (FULL != DISCONN)
 req at ffff88070d29a800 x1541968458196372/t0(0) o3->
hsi003-OST000e-osc-ffff8815c65f4400 at 10.10.150.21@o2ib:6/4 lens 488/432 e 0
to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
00000100:00080000:13.0:1486041625.099788:0:111205:0:(client.c:1438:ptlrpc_send_new_req())
@@@ req from PID 0 waiting for recovery: (FULL != DISCONN)
 req at ffff8804d4fc0400 x1541968458196376/t0(0) o3->
hsi003-OST000e-osc-ffff8815c65f4400 at 10.10.150.21@o2ib:6/4 lens 488/432 e 0
to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
00000100:00080000:11.0:1486041625.101315:0:111207:0:(client.c:1438:ptlrpc_send_new_req())
@@@ req from PID 0 waiting for recovery: (FULL != DISCONN)
 req at ffff880ae155bc00 x1541968458196380/t0(0) o3->
hsi003-OST000e-osc-ffff8815c65f4400 at 10.10.150.21@o2ib:6/4 lens 488/432 e 0
to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
Debug log: 269934 lines, 269934 kept, 0 dropped, 0 bad.

*/tmp/lustre-log.1486041688.78611*
00000100:00080000:0.0:1486041681.536712:0:33379:0:(service.c:1079:ptlrpc_update_export_timer())
updating export 2189d151-cd58-3ada-30bd-fcff46ee16dc at 1486041681 exp
ffff881e810c7c00
10000000:01000000:1.0F:1486041681.536805:0:32111:0:(mgc_request.c:962:mgc_blocking_ast())
Lock res [0x333030697368:0x2:0x0].0 (hsi003)
10000000:01000000:10.0:1486041687.976410:0:111932:0:(mgc_request.c:512:do_requeue())
updating log hsi003-cliir
10000000:01000000:10.0:1486041687.976413:0:111932:0:(mgc_request.c:1848:mgc_process_log())
Process log hsi003-cliir:ffff8815c65f4400 from 163
10000000:01000000:10.0:1486041687.976418:0:111932:0:(mgc_request.c:1061:mgc_enqueue())
Enqueue for hsi003-cliir (res 0x333030697368)
00000020:00000080:10.0:1486041687.976986:0:111932:0:(obd_config.c:1107:class_process_config())
processing cmd: cf00f
00000100:00080000:10.0:1486041687.977019:0:111932:0:(import.c:191:ptlrpc_set_import_discon())
osc: import ffff880a9ea66000 already not connected (conn 9, was 0): DISCONN
00010000:00080000:10.0:1486041687.977031:0:111932:0:(ldlm_lib.c:98:import_set_conn())
imp ffff880a9ea66000 at hsi003-OST000e-osc-ffff8815c65f4400: found existing
conn 10.10.150.21 at o2ib, moved to head
00000100:00080000:10.0:1486041687.977034:0:111932:0:(import.c:640:ptlrpc_connect_import())
ffff880a9ea66000 hsi003-OST000e_UUID: changing import state from DISCONN to
CONNECTING
00000100:00080000:10.0:1486041687.977040:0:111932:0:(import.c:485:import_select_connection())
hsi003-OST000e-osc-ffff8815c65f4400: connect to NID 10.10.150.21 at o2ib last
attempt 0
00000100:00080000:10.0:1486041687.977045:0:111932:0:(import.c:563:import_select_connection())
hsi003-OST000e-osc-ffff8815c65f4400: import ffff880a9ea66000 using
connection 10.10.150.21 at o2ib/10.10.150.21 at o2ib
00000020:01000000:10.0:1486041687.977063:0:111932:0:(obd_config.c:1352:class_process_proc_param())
hsi003-OST000e-osc-ffff8815c65f4400.osc: Set parameter
import=connection=10.10.150.21 at o2ib::21
10000000:01000000:10.0:1486041687.977080:0:111932:0:(mgc_request.c:1879:mgc_process_log())
MGC10.10.150.15 at o2ib: configuration from log 'hsi003-cliir' succeeded (0).
00010000:00080000:14.0F:1486041687.977538:0:111188:0:(ldlm_request.c:1317:ldlm_cli_update_pool())
@@@ Zero SLV or Limit found (SLV: 0, Limit: 1611812)  req at ffff8818f261e800
x1541968458196396/t0(0)
o8->hsi003-OST000e-osc-ffff8815c65f4400 at 10.10.150.21@o2ib:28/4
lens 400/384 e 0 to 0 dl 1486041707 ref 1 fl Rpc:RN/0/0 rc 0/0
00000100:00080000:14.0:1486041687.977554:0:111188:0:(import.c:831:ptlrpc_connect_interpret())
hsi003-OST000e-osc-ffff8815c65f4400: connect to target with instance 21
00000001:00080000:14.0:1486041687.977569:0:111188:0:(lcommon_misc.c:77:cl_init_ea_size())
updating max_mdsize/max_cookiesize: 432/512
00000100:00080000:14.0:1486041687.977574:0:111188:0:(import.c:980:ptlrpc_connect_interpret())
@@@ hsi003-OST000e-osc-ffff8815c65f4400: evicting (reconnect/recover flags
not set: 4)  req at ffff8818f261e800 x1541968458196396/t0(0) o8->
hsi003-OST000e-osc-ffff8815c65f4400 at 10.10.150.21@o2ib:28/4 lens 400/384 e 0
to 0 dl 1486041707 ref 1 fl Interpret:RN/0/0 rc 0/0
00000100:00080000:14.0:1486041687.977582:0:111188:0:(import.c:983:ptlrpc_connect_interpret())
ffff880a9ea66000 hsi003-OST000e_UUID: changing import state from CONNECTING
to EVICTED
00000100:02020000:14.0:1486041687.977585:0:111188:0:(import.c:1346:ptlrpc_import_recovery_state_machine())
167-0: hsi003-OST000e-osc-ffff8815c65f4400: This client was evicted by
hsi003-OST000e; in progress operations using this service will fail.
00000100:00080000:14.0:1486041687.993395:0:111188:0:(import.c:1350:ptlrpc_import_recovery_state_machine())
evicted from hsi003-OST000e_UUID at 10.10.150.21@o2ib; invalidating
00000100:00080000:12.0:1486041687.993449:0:78611:0:(import.c:1291:ptlrpc_invalidate_import_thread())
thread invalidate import hsi003-OST000e-osc-ffff8815c65f4400 to
hsi003-OST000e_UUID at 10.10.150.21@o2ib
00000100:00080000:12.0:1486041687.993454:0:78611:0:(import.c:203:ptlrpc_deactivate_and_unlock_import())
setting import hsi003-OST000e_UUID INVALID
Thank you
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20170202/0baece30/attachment.htm>


More information about the lustre-discuss mailing list