[Lustre-discuss] A Failed client soft lockup one OSS

Michael Sternberg sternberg at anl.gov
Fri Mar 26 12:27:21 PDT 2010


+1 on this one, in my case using lustre-1.8.2 on RHEL-5.4 over o2ib, with patchless clients.

My OSS complains about hung service threads:

	Service thread pid 16590 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
	. . .
	Service thread pid 16590 completed after 2403.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

In this case, n337 (172.16.1.87) went dead, and another client (login2) suffered.  Solution was to reboot n337, the initial hung client.


NB:  I take the overload warning at its word.  The OST is one giant RAID5, which I have scheduled to split into several RAID 1+0 sets next week.


regards,
Michael


On Mar 26, 2010, at 2:29 , Lu Wang wrote:
> 	We find bug on Lustre 1.8.1.1. Sometimes one client's dead may cause soft lockup on OSS. The certain OSS may reach a high CPU System% usage, and then became unreachable through "lctl ping" from now and then.


Mar 13 18:45:13 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1329698739270352 sent from carbonfs-OST0000 to NID 172.16.1.87 at tcp 0s ago has failed due to network error (7s prior to deadline).
Mar 13 18:45:13 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Mar 13 18:45:13 oss01 kernel: LustreError: 16590:0:(events.c:66:request_out_callback()) Skipped 59 previous similar messages
Mar 13 18:45:13 oss01 kernel: LustreError: 16590:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  req at ffff810126fc0000 x1329698739270352/t0 o106->@NET_0x20000ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527520 ref 2 fl Rpc:/0/0 rc 0/0
Mar 13 18:45:13 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 12345-172.16.1.87 at tcp: -113
Mar 13 18:45:13 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) Skipped 57 previous similar messages
Mar 13 18:45:13 oss01 kernel:   req at ffff810126fc0000 x1329698739270352/t0 o106->@NET_0x20000ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527520 ref 1 fl Rpc:/0/0 rc 0/0
Mar 13 18:45:22 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1329698739270352 sent from carbonfs-OST0000 to NID 172.16.1.87 at tcp 0s ago has failed due to network error (7s prior to deadline).
Mar 13 18:45:22 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 454325 previous similar messages
Mar 13 18:45:22 oss01 kernel:   req at ffff810126fc0000 x1329698739270352/t0 o106->@NET_0x20000ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527529 ref 1 fl Rpc:/2/0 rc 0/0
Mar 13 18:45:29 login2 -- MARK --
Mar 13 18:45:41 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1329698739270352 sent from carbonfs-OST0000 to NID 172.16.1.87 at tcp 0s ago has failed due to network error (7s prior to deadline).
Mar 13 18:45:41 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 903192 previous similar messages
Mar 13 18:45:41 oss01 kernel:   req at ffff810126fc0000 x1329698739270352/t0 o106->@NET_0x20000ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527548 ref 1 fl Rpc:/2/0 rc 0/0
Mar 13 18:46:18 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1329698739270352 sent from carbonfs-OST0000 to NID 172.16.1.87 at tcp 0s ago has failed due to network error (7s prior to deadline).
Mar 13 18:46:18 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 1804452 previous similar messages
Mar 13 18:46:18 oss01 kernel:   req at ffff810126fc0000 x1329698739270352/t0 o106->@NET_0x20000ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527585 ref 1 fl Rpc:/2/0 rc 0/0
Mar 13 18:46:28 oss01 kernel: LustreError: 16590:0:(events.c:66:request_out_callback()) Skipped 3611781 previous similar messages
Mar 13 18:46:28 oss01 kernel: LustreError: 16590:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  req at ffff810126fc0000 x1329698739270352/t0 o106->@NET_0x20000ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527595 ref 2 fl Rpc:/2/0 rc 0/0
Mar 13 18:46:28 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 12345-172.16.1.87 at tcp: -113
Mar 13 18:46:28 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) Skipped 3611782 previous similar messages
Mar 13 18:47:27 oss01 kernel: LustreError: 138-a: carbonfs-OST0000: A client on nid 172.16.1.87 at tcp was evicted due to a lock blocking callback to 172.16.1.87 at tcp timed out: rc -107
Mar 13 18:47:33 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1329698739270352 sent from carbonfs-OST0000 to NID 172.16.1.87 at tcp 0s ago has failed due to network error (7s prior to deadline).
Mar 13 18:47:33 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 3623858 previous similar messages
Mar 13 18:47:33 oss01 kernel:   req at ffff810126fc0000 x1329698739270352/t0 o106->@NET_0x20000ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527660 ref 1 fl Rpc:/2/0 rc 0/0
Mar 13 18:48:33 oss01 kernel: 
Mar 13 18:48:33 oss01 kernel:  0002000000000000 ffffc20010c6d870 ffffffff8884d460 ffffffff00000000
Mar 13 18:48:33 oss01 kernel:  0004b95a9d0332d0 ffffffff889c41e5 ffffffff889c3beb ffff8101172516e0
Mar 13 18:48:33 oss01 kernel: Call Trace:
Mar 13 18:48:33 oss01 kernel:  [<ffffffff80047152>] try_to_wake_up+0x472/0x484
Mar 13 18:48:33 oss01 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Mar 13 18:48:33 oss01 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Mar 13 18:48:33 oss01 kernel:  [<ffffffff80062ff8>] thread_return+0x62/0xfe
Mar 13 18:48:33 oss01 kernel:  [<ffffffff8008ac95>] __wake_up_common+0x3e/0x68
Mar 13 18:48:33 oss01 kernel:  [<ffffffff8008c270>] __activate_task+0x56/0x6d
Mar 13 18:48:33 oss01 kernel:  [<ffffffff8008c86b>] default_wake_function+0x0/0xe
Mar 13 18:48:33 oss01 kernel:  [<ffffffff800b7076>] audit_syscall_exit+0x336/0x362
Mar 13 18:48:33 oss01 kernel:  [<ffffffff801522a0>] list_del+0xb/0x71
Mar 13 18:48:33 oss01 kernel:  [<ffffffff8887314b>] :lnet:LNetMDAttach+0x37b/0x4c0
Mar 13 18:48:33 oss01 kernel:  [<ffffffff88879b90>] :lnet:LNetPut+0x700/0x800
Mar 13 18:48:33 oss01 kernel:  [<ffffffff88879c06>] :lnet:LNetPut+0x776/0x800
Mar 13 18:48:33 oss01 kernel:  [<ffffffff88944380>] :ptlrpc:ldlm_lock_create+0x540/0x9f0
Mar 13 18:48:33 oss01 kernel:  [<ffffffff88947eb6>] :ptlrpc:ldlm_lock_enqueue+0x186/0xb20
Mar 13 18:48:33 oss01 kernel:  [<ffffffff8895b5f0>] :ptlrpc:ldlm_process_extent_lock+0x0/0xad0
Mar 13 18:48:33 oss01 kernel:  [<ffffffff88966a06>] :ptlrpc:ldlm_server_glimpse_ast+0x266/0x3b0
Mar 13 18:48:33 oss01 kernel:  [<ffffffff889693f9>] :ptlrpc:ldlm_handle_enqueue+0xbf9/0x11f0
Mar 13 18:48:33 oss01 kernel:  [<ffffffff889736f3>] :ptlrpc:interval_iterate_reverse+0x73/0x240
Mar 13 18:48:33 oss01 kernel:  [<ffffffff889763eb>] :ptlrpc:ptlrpc_expire_one_request+0x12b/0x630
Mar 13 18:48:33 oss01 kernel:  [<ffffffff88976975>] :ptlrpc:ptlrpc_at_set_req_timeout+0x85/0xd0
Mar 13 18:48:33 oss01 kernel:  [<ffffffff88977199>] :ptlrpc:ptlrpc_prep_req_pool+0x619/0x6b0
Mar 13 18:48:33 oss01 kernel:  [<ffffffff88977d26>] :ptlrpc:ptlrpc_check_reply+0x1c6/0x610
Mar 13 18:48:33 oss01 kernel:  [<ffffffff8897efc8>] :ptlrpc:ptlrpc_queue_wait+0x988/0x16f0
Mar 13 18:48:33 oss01 kernel:  [<ffffffff88982613>] :ptlrpc:ptl_send_buf+0x3f3/0x5b0
Mar 13 18:48:33 oss01 kernel:  [<ffffffff88985265>] :ptlrpc:ptl_send_rpc+0xb45/0xde0
Mar 13 18:48:33 oss01 kernel:  [<ffffffff889872e8>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Mar 13 18:48:33 oss01 kernel:  [<ffffffff8898d565>] :ptlrpc:lustre_msg_set_opc+0x45/0x120
Mar 13 18:48:33 oss01 kernel:  [<ffffffff8898f101>] :ptlrpc:request_out_callback+0xe1/0x1e0
Mar 13 18:48:33 oss01 kernel:  [<ffffffff8899185d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Mar 13 18:48:33 oss01 kernel:  [<ffffffff889944fe>] :ptlrpc:ptlrpc_server_handle_request+0xa8e/0x1130
Mar 13 18:48:33 oss01 kernel:  [<ffffffff88996d00>] :ptlrpc:ptlrpc_main+0x0/0x1420
Mar 13 18:48:33 oss01 kernel:  [<ffffffff88997f58>] :ptlrpc:ptlrpc_main+0x1258/0x1420
Mar 13 18:48:33 oss01 kernel:  ffffffff889bbca0 ffff810117356708 ffff810178f94940 ffffffff888711a4
Mar 13 18:48:33 oss01 kernel:  [<ffffffff88c3e5f0>] :ost:ost_blocking_ast+0x0/0xaa0
Mar 13 18:48:33 oss01 kernel:  [<ffffffff88c462f7>] :ost:ost_handle+0x4e17/0x53e0
Mar 13 18:48:33 oss01 kernel:  [<ffffffff88c90f4a>] :obdfilter:filter_intent_policy+0x65a/0x760
Mar 13 18:48:33 oss01 kernel: ll_ost_100    R  running task       0 16590      1         16591 16589 (L-TLB)
Mar 13 18:48:33 oss01 kernel: Lustre: 0:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process 16590
Mar 13 18:48:33 oss01 kernel: Lustre: Service thread pid 16590 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Mar 13 18:48:58 oss01 kernel: LustreError: 16590:0:(events.c:66:request_out_callback()) Skipped 7264682 previous similar messages
Mar 13 18:48:58 oss01 kernel: LustreError: 16590:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  req at ffff810126fc0000 x1329698739270352/t0 o106->@NET_0x20000ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268527745 ref 2 fl Rpc:/2/0 rc 0/0
Mar 13 18:48:58 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 12345-172.16.1.87 at tcp: -113
Mar 13 18:48:58 oss01 kernel: LustreError: 16590:0:(lib-move.c:2436:LNetPut()) Skipped 7264681 previous similar messages
Mar 13 18:50:03 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1329698739270352 sent from carbonfs-OST0000 to NID 172.16.1.87 at tcp 0s ago has failed due to network error (7s prior to deadline).
. . .
Mar 13 19:17:39 oss01 kernel: LustreError: 16549:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 4985591 previous similar messages
Mar 13 19:22:40 login2 kernel: LustreError: 4413:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Mar 13 19:22:40 login2 kernel: LustreError: 4413:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Mar 13 19:25:08 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1329698739270352 sent from carbonfs-OST0000 to NID 172.16.1.87 at tcp 7s ago has timed out (7s prior to deadline).
Mar 13 19:25:08 oss01 kernel: Lustre: 16590:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 20801286 previous similar messages
Mar 13 19:25:08 oss01 kernel:   req at ffff810126fc0000 x1329698739270352/t0 o106->@NET_0x20000ac100157_UUID:15/16 lens 296/424 e 0 to 1 dl 1268529908 ref 1 fl Rpc:/2/0 rc 0/0
Mar 13 19:25:15 login2 kernel: Lustre: carbonfs-OST0000-osc-ffff81042c2ffc00: Connection restored to service carbonfs-OST0000 using nid 172.17.130.1 at o2ib.
Mar 13 19:25:15 login2 kernel: Lustre: Skipped 1 previous similar message
Mar 13 19:25:15 oss01 kernel: Lustre: 16590:0:(service.c:1380:ptlrpc_server_handle_request()) @@@ Request x1329711551107702 took longer than estimated (812+1590s); client may timeout.  req at ffff81012fc25800 x1329711551107702/t0 o101->cc99a57b-05eb-215d-d552-1b3b78978588 at NET_0x50000ac116e02_UUID:0/0 lens 296/352 e 5 to 0 dl 1268528325 ref 1 fl Complete:/0/0 rc 301/301
Mar 13 19:25:15 oss01 kernel: LustreError: 137-5: UUID 'sandbox-OST0000_UUID' is not available  for connect (no target)
Mar 13 19:25:15 oss01 kernel: Lustre: Service thread pid 16590 completed after 2403.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Mar 13 19:50:51 oss01 -- MARK --

[root at n337 ~]# who -r
         run-level 3  Mar 13 19:22                   last=S





More information about the lustre-discuss mailing list