[Lustre-discuss] OST disconnecting

Michael Losapio mike.losapio at nyu.edu
Fri Aug 10 09:27:44 PDT 2012


We're having an OST issue stemming from a storage array
problem we had the other day and was hoping maybe someone could help us out...

The scenario we're seeing is as follows;

41 OST's mount perfectly fine.
OST_41 mounts and recovery appears to work however it appears that no
new objects can be created on the OST and after a while the MDS marks
it inactive.

(*Note - the device in the OS is named OST_41, multipathd has it as
dm-46 and in Lustre it is known as 0029 - It's a long story....)

Let me know if anyone needs the full dmesg logs - I've also run e2fsck on the
OST prior to mounting it and we're seeing the same problem. I should
note that 90% of the clients appear to be able to read from the OST
just fine - but writes are not working. The OSS is serving up numerous
other OSTs without issue.

Ex:
lfs setstripe --count 1 --index 41 filename

And the file gets created on OST-0... but something like lfs check
servers reports all OSTs are active.

Help would be much appreciated.

--Mike


Relevent MDS Log;

Lustre: MDS scratch-MDT0000: scratch-OST001f_UUID now active, resetting orphans
Lustre: 8405:0:(quota_master.c:1718:mds_quota_recovery()) Only 41/42
OSTs are active, abort quota recovery
Lustre: 8405:0:(quota_master.c:1718:mds_quota_recovery()) Skipped 1
previous similar message
Lustre: scratch-OST0021-osc: Connection restored to service
scratch-OST0021 using nid 10.0.1.235 at o2ib.
Lustre: Skipped 1 previous similar message
Lustre: MDS scratch-MDT0000: scratch-OST0021_UUID now active, resetting orphans
Lustre: Skipped 1 previous similar message
LustreError: 15966:0:(llog_server.c:466:llog_origin_handle_cancel())
Cancel 1 of 2 llog-records failed: -2
LustreError: 15966:0:(llog_server.c:466:llog_origin_handle_cancel())
Skipped 1 previous similar message
Lustre: 8406:0:(import.c:517:import_select_connection())
scratch-OST0029-osc: tried all connections, increasing latency to 16s
Lustre: 8406:0:(import.c:517:import_select_connection()) Skipped 218
previous similar messages
LustreError: 15971:0:(llog_server.c:466:llog_origin_handle_cancel())
Cancel 1 of 2 llog-records failed: -2
LustreError: 15959:0:(llog_server.c:466:llog_origin_handle_cancel())
Cancel 1 of 1 llog-records failed: -2
LustreError: 15966:0:(llog_server.c:466:llog_origin_handle_cancel())
Cancel 1 of 2 llog-records failed: -2
Lustre: 8405:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request
x1409470549792684 sent from scratch-OST0029-osc to NID 10.0.1.236 at o2ib
0s ago has failed due to network error (23s prior to deadline).
  req at ffff810335e53000 x1409470549792684/t0
o8->scratch-OST0029_UUID at 10.0.1.236@o2ib:28/4 lens 368/584 e 0 to 1 dl
1344610732 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 8405:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 196
previous similar messages
LustreError: 15970:0:(llog_server.c:466:llog_origin_handle_cancel())
Cancel 1 of 2 llog-records failed: -2
LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) error
creating fid 0x1a78fe63 sub-object on OST idx 12/4: rc = -107
LustreError: 8404:0:(lov_request.c:694:lov_update_create_set())
Skipped 18 previous similar messages
Lustre: 8406:0:(import.c:517:import_select_connection())
scratch-OST0029-osc: tried all connections, increasing latency to 25s
Lustre: 8406:0:(import.c:517:import_select_connection()) Skipped 11
previous similar messages
Lustre: 8405:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request
x1409470549796118 sent from scratch-OST0029-osc to NID 10.0.1.236 at o2ib
0s ago has failed due to network error (30s prior to deadline).
  req at ffff810349dfbc00 x1409470549796118/t0
o8->scratch-OST0029_UUID at 10.0.1.236@o2ib:28/4 lens 368/584 e 0 to 1 dl
1344611365 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 8405:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 10
previous similar messages
Lustre: 8406:0:(import.c:517:import_select_connection())
scratch-OST0029-osc: tried all connections, increasing latency to 10s
Lustre: 8406:0:(import.c:517:import_select_connection()) Skipped 14
previous similar messages
Lustre: 8405:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request
x1409470549799083 sent from scratch-OST0029-osc to NID 10.0.1.236 at o2ib
0s ago has failed due to network error (17s prior to deadline).
  req at ffff810333fb0c00 x1409470549799083/t0
o8->scratch-OST0029_UUID at 10.0.1.236@o2ib:28/4 lens 368/584 e 0 to 1 dl
1344611974 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 8405:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 15
previous similar messages
Lustre: 8405:0:(quota_master.c:1718:mds_quota_recovery()) Only 41/42
OSTs are active, abort quota recovery
Lustre: 8405:0:(quota_master.c:1718:mds_quota_recovery()) Skipped 12
previous similar messages
Lustre: scratch-OST0029-osc: Connection restored to service
scratch-OST0029 using nid 10.0.1.235 at o2ib.
Lustre: Skipped 12 previous similar messages
Lustre: MDS scratch-MDT0000: scratch-OST0029_UUID now active, resetting orphans
Lustre: Skipped 12 previous similar messages
LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) error
creating fid 0xe09537f sub-object on OST idx 41/4: rc = -107
LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) error
creating fid 0xe09538d sub-object on OST idx 41/4: rc = -107
LustreError: 8404:0:(lov_request.c:694:lov_update_create_set())
Skipped 1 previous similar message
LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) error
creating fid 0x11a04ba3 sub-object on OST idx 41/4: rc = -107
LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) error
creating fid 0xe09538e sub-object on OST idx 41/4: rc = -107
LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) error
creating fid 0x11a81070 sub-object on OST idx 41/4: rc = -107
LustreError: 8404:0:(lov_request.c:694:lov_update_create_set())
Skipped 2 previous similar messages
LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) error
creating fid 0x15790270 sub-object on OST idx 41/1: rc = -107
LustreError: 8404:0:(lov_request.c:694:lov_update_create_set())
Skipped 9 previous similar messages
LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) error
creating fid 0x11b82a3b sub-object on OST idx 41/4: rc = -107
LustreError: 8404:0:(lov_request.c:694:lov_update_create_set())
Skipped 5 previous similar messages
Lustre: 7455:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request
x1409470549800467 sent from scratch-OST0029-osc to NID 10.0.1.235 at o2ib
1103s ago has timed out (1103s prior to deadline).
  req at ffff81039f930c00 x1409470549800467/t0
o5->scratch-OST0029_UUID at 10.0.1.235@o2ib:28/4 lens 400/592 e 4 to 1 dl
1344613415 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 7455:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 1
previous similar message
Lustre: scratch-OST0029-osc: Connection to service scratch-OST0029 via
nid 10.0.1.235 at o2ib was lost; in progress operations using this
service will wait for recovery to complete.
LustreError: 7455:0:(osc_create.c:593:osc_create())
scratch-OST0029-osc: oscc recovery failed: -110
LustreError: 7455:0:(lov_obd.c:1150:lov_clear_orphans()) error in
orphan recovery on OST idx 41/42: rc = -110
LustreError: 7455:0:(mds_lov.c:1057:__mds_lov_synchronize())
scratch-OST0029_UUID failed at mds_lov_clear_orphans: -110
LustreError: 7455:0:(mds_lov.c:1066:__mds_lov_synchronize())
scratch-OST0029_UUID sync failed -110, deactivating
LustreError: 11-0: an error occurred while communicating with
10.0.1.235 at o2ib. The ost_connect operation failed with -16
LustreError: Skipped 3 previous similar messages
LustreError: 11-0: an error occurred while communicating with
10.0.1.235 at o2ib. The ost_connect operation failed with -16
LustreError: Skipped 2 previous similar messages
LustreError: 11-0: an error occurred while communicating with
10.0.1.235 at o2ib. The ost_connect operation failed with -16
LustreError: Skipped 5 previous similar messages
LustreError: 11-0: an error occurred while communicating with
10.0.1.235 at o2ib. The ost_connect operation failed with -16
LustreError: Skipped 10 previous similar messages
LustreError: 11-0: an error occurred while communicating with
10.0.1.235 at o2ib. The ost_connect operation failed with -16
LustreError: Skipped 22 previous similar messages
LustreError: 11-0: an error occurred while communicating with
10.0.1.235 at o2ib. The ost_connect operation failed with -16
LustreError: Skipped 42 previous similar messages


OST Log

LDISKFS-fs (dm-46): mounted filesystem with ordered data mode
JBD: barrier-based sync failed on dm-46-8 - disabling barriers
LDISKFS-fs (dm-46): barriers disabled
LDISKFS-fs (dm-46): mounted filesystem with ordered data mode
Lustre: 9381:0:(filter.c:1001:filter_init_server_data()) RECOVERY:
service scratch-OST0029, 388 recoverable clients, 0 delayed clients,
last_rcvd 85899346284
Lustre: 9381:0:(filter.c:1001:filter_init_server_data()) Skipped 1
previous similar message
Lustre: scratch-OST0029: Now serving scratch-OST0029 on
/dev/mapper/ost_scratch_41 with recovery enabled
Lustre: Skipped 1 previous similar message
Lustre: scratch-OST0029: Will be in recovery for at least 5:00, or
until 388 clients reconnect
Lustre: Skipped 1 previous similar message
Lustre: 31046:0:(ldlm_lib.c:1815:target_queue_last_replay_reply())
scratch-OST0029: 387 recoverable clients remain
Lustre: 31046:0:(ldlm_lib.c:1815:target_queue_last_replay_reply())
Skipped 26 previous similar messages
LustreError: 30900:0:(ldlm_lib.c:944:target_handle_connect())
scratch-OST0029: denying connection for new client 10.2.1.112 at o2ib2
(aa6ab4db-167b-53ac-8147-5823b6ce20de): 234 clients in recovery for
280s
Lustre: 30897:0:(ldlm_lib.c:1815:target_queue_last_replay_reply())
scratch-OST0029: 114 recoverable clients remain
Lustre: 30897:0:(ldlm_lib.c:1815:target_queue_last_replay_reply())
Skipped 272 previous similar messages
<A BUNCH OF DENIED CONNECTIONS BECAUSE OF RECOVERY>
Lustre: scratch-OST0029: Recovery period over after 5:00, of 388
clients 385 recovered and 3 were evicted.
Lustre: Skipped 2 previous similar messages
Lustre: scratch-OST0029: sending delayed replies to recovered clients
Lustre: Skipped 2 previous similar messages
Lustre: scratch-OST0029: received MDS connection from 10.0.1.240 at o2ib

Lustre: scratch-OST0029: sending delayed replies to recovered clients
Lustre: Skipped 2 previous similar messages
Lustre: scratch-OST0029: received MDS connection from 10.0.1.240 at o2ib
Lustre: Skipped 2 previous similar messages
LustreError: 30865:0:(filter.c:3427:filter_precreate())
scratch-OST0029: Serious error: objid 28055313 already exists; is this
filesystem corrupt?
LustreError: 30865:0:(filter.c:3428:filter_precreate()) LBUG
Pid: 30865, comm: ll_ost_40

Call Trace:
 [<ffffffff8854f6a1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
 [<ffffffff8854fbda>] lbug_with_loc+0x7a/0xd0 [libcfs]
 [<ffffffff889eb606>] filter_precreate+0xef6/0x19b0 [obdfilter]
 [<ffffffff885a33a6>] lnet_ni_send+0x96/0xe0 [lnet]
 [<ffffffff8000d5a8>] dput+0x2c/0x113
 [<ffffffff889f52fb>] filter_create+0x11ab/0x17c0 [obdfilter]
 [<ffffffff885a9230>] LNetPut+0x730/0x840 [lnet]
 [<ffffffff886b6323>] ptl_send_buf+0x3f3/0x5b0 [ptlrpc]
 [<ffffffff886bbf54>] lustre_msg_add_version+0x34/0x110 [ptlrpc]
 [<ffffffff886b6ac8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc]
 [<ffffffff886bed69>] lustre_pack_reply+0x29/0xb0 [ptlrpc]
 [<ffffffff889ad761>] ost_handle+0x1271/0x55b0 [ost]
 [<ffffffff8008dff2>] dequeue_task+0x18/0x37
 [<ffffffff886ca6d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
 [<ffffffff886cae35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
 [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
 [<ffffffff886cbdc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff886cae60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

LustreError: dumping log to /tmp/lustre-log.1344612315.30865
Lustre: Service thread pid 30865 was inactive for 200.00s. The thread
might be hung, or it might only be slow and will resume later. Dumping
the stack trace for debugging purposes:
Pid: 30865, comm: ll_ost_40

Lustre: 30941:0:(ldlm_lib.c:874:target_handle_connect())
scratch-OST0029: refuse reconnection from
41b8aeb5-2c37-88e7-d228-7a59ee9cac74 at 10.2.1.130@o2ib2 to
0xffff81023028c600; still busy with 3 active RPCs



More information about the lustre-discuss mailing list