[lustre-discuss] Lustre errors asking for help

Baranowski, Roman roman.baranowski at ubc.ca
Fri Jan 19 16:31:19 PST 2024


Dear Andreas, All,

Thanks for your response.  Just to give you some more information: yes, we have
run e2fsck on all OSTs, MDTs, and the MGS, and all come back completely clean.
We have also removed and recreated the quota files on them, again without any
issues.  The underlying storage upon which the OSTs and MDTs are built is fine.
We have run verifies on all LUNs, and they come back clean.  All the spinning
disks are fine, and the controllers report no errors or failures.  Similarly,
the InfiniBand fabric connecting all storage servers has also been checked, and
no errors or issues are present.
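
For reference, the checks on each unmounted ldiskfs target were roughly along
these lines (the device path here is illustrative, not our actual one):

e2fsck -f /dev/mapper/ost0004      # forced full check; came back clean
tune2fs -O ^quota /dev/mapper/ost0004   # drop the old quota files
tune2fs -O quota /dev/mapper/ost0004    # recreate them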

You are correct that no files are being created on those OSTs.  However, Lustre
is behaving poorly and sometimes goes into a hung state.  Today it was
reporting the following:

pdsh -g storage uptime
mds2:  14:39:27 up 4 days, 21:17,  0 users,  load average: 0.00, 0.00, 0.00
mds1:  14:39:27 up 4 days, 21:17,  0 users,  load average: 0.00, 0.00, 0.00
oss2:  14:39:27 up 4 days, 21:07,  0 users,  load average: 0.00, 0.00, 0.00
oss1:  14:39:27 up 4 days, 21:07,  0 users,  load average: 0.00, 0.00, 0.00
oss4:  14:39:27 up 4 days, 21:06,  0 users,  load average: 0.00, 0.00, 0.00
oss6:  14:39:27 up 4 days, 21:06,  0 users,  load average: 0.00, 0.00, 0.00
oss5:  14:39:27 up 4 days, 21:06,  0 users,  load average: 0.00, 0.00, 0.00
oss3:  14:39:27 up 4 days, 21:07,  0 users,  load average: 0.06, 0.03, 0.00

The load on the storage servers is effectively zero, yet messages like the
following are being produced on mds2 (the MDS serving the problematic OSTs)
and on the OSSes, as captured in our lustre.log:

Jan 19 14:36:57 oss5 kernel: : LustreError: 13751:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0004: unable to precreate: rc = -116
Jan 19 14:36:57 oss5 kernel: : LustreError: 13751:0:(ofd_obd.c:1348:ofd_create()) Skipped 76 previous similar messages
Jan 19 14:39:35 mds2 kernel: : Lustre: 24647:0:(service.c:1339:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
Jan 19 14:39:35 mds2 kernel: : Lustre: 24647:0:(service.c:1339:ptlrpc_at_send_early_reply()) Skipped 133 previous similar messages
Jan 19 14:40:57 oss3 kernel: : LustreError: 13903:0:(ofd_obd.c:1348:ofd_create()) scratch-OST0014: unable to precreate: rc = -116
Jan 19 14:40:57 oss3 kernel: : LustreError: 13903:0:(ofd_obd.c:1348:ofd_create()) Skipped 80 previous similar messages
Jan 19 14:43:31 mds2 kernel: : Lustre: scratch-MDT0000: Client e258850e-e603-8c7b-843f-66886cc67347 (at 192.168.113.1@o2ib) reconnecting
Jan 19 14:43:31 mds2 kernel: : Lustre: Skipped 4160 previous similar messages
Jan 19 14:43:31 mds2 kernel: : Lustre: scratch-MDT0000: Client e258850e-e603-8c7b-843f-66886cc67347 (at 192.168.113.1@o2ib) refused reconnection, still busy with 1 active RPCs
Jan 19 14:43:31 mds2 kernel: : Lustre: Skipped 4094 previous similar messages
Jan 19 14:44:30 mds2 kernel: : Lustre: lock timed out (enqueued at 1705703070, 1200s ago)
Jan 19 14:44:30 mds2 kernel: : LustreError: dumping log to /tmp/lustre-log.1705704270.4999
Jan 19 14:44:30 mds2 kernel: : Lustre: Skipped 10 previous similar messages
Jan 19 14:44:32 mds2 kernel: : LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 4307s: evicting client at 192.168.122.11@o2ib  ns: mdt-scratch-MDT0000_UUID lock: ffff8804455dc480/0x280a58b66b7affa7 lrc: 3/0,0 mode: PR/PR res: [0x2000733da:0x14d:0x0].0 bits 0x13 rrc: 267 type: IBT flags: 0x200400000020 nid: 192.168.122.11@o2ib remote: 0xa9053ebb6b8f23c1 expref: 7 pid: 28304 timeout: 4717246067 lvb_type: 0
Jan 19 14:44:32 mds2 kernel: : LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) Skipped 1 previous similar message
Jan 19 14:44:32 mds2 kernel: : Lustre: 26575:0:(service.c:2031:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:4253s); client may timeout.  req@ffff880127936800 x1788463595089220/t609893172938(0) o101->03cbede6-68a6-36ae-866e-74e858cf47f1@192.168.114.2@o2ib:0/0 lens 584/600 e 0 to 0 dl 1705700019 ref 1 fl Complete:/0/0 rc 0/0
Jan 19 14:44:32 mds2 kernel: : LustreError: 29019:0:(service.c:1999:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-192.168.128.8@o2ib: deadline 100:599s ago
Jan 19 14:44:32 mds2 kernel: : LustreError: 29019:0:(service.c:1999:ptlrpc_server_handle_request()) Skipped 73 previous similar messages
Jan 19 14:44:32 mds2 kernel: : LustreError: 24604:0:(ldlm_lockd.c:1309:ldlm_handle_enqueue0()) ### lock on disconnected export ffff88083e4f0c00 ns: mdt-scratch-MDT0000_UUID lock: ffff880bfaa7e240/0x280a58b66b7d0c30 lrc: 2/0,0 mode: --/CR res: [0x20002a5b6:0x15a:0x0].0 bits 0x0 rrc: 348 type: IBT flags: 0x0 nid: local remote: 0x38e6942c1a7c9132 expref: -99 pid: 24604 timeout: 0 lvb_type: 0
Jan 19 14:44:32 mds2 kernel: : LustreError: 24604:0:(ldlm_lockd.c:1309:ldlm_handle_enqueue0()) Skipped 2 previous similar messages
Jan 19 14:44:32 mds2 kernel: : Lustre: 26575:0:(service.c:2031:ptlrpc_server_handle_request()) Skipped 141 previous similar messages
Jan 19 14:45:02 oss4 kernel: : LustreError: 18223:0:(ofd_obd.c:1348:ofd_create()) scratch-OST000f: unable to precreate: rc = -116
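
For what it is worth, rc = -116 appears to be -ESTALE, so object precreation on
those OSTs is failing outright.  Would inspecting the precreate counters be the
right way to dig further?  Something like the following, though the exact
parameter names may differ on our Lustre version:

# on the MDS, for each problematic OST:
lctl get_param osp.scratch-OST0004*.prealloc_status
lctl get_param osp.scratch-OST0004*.prealloc_last_id osp.scratch-OST0004*.prealloc_next_id
# on the OSS, the last object id the OST itself knows about:
lctl get_param obdfilter.scratch-OST0004.last_id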

This makes it look like our MDT is unresponsive, yet the load on mds2 is
effectively zero.  The kern.log on mds2 is producing messages like the following:

Jan 19 14:44:30 mds2 kernel: : LNet: Service thread pid 4999 was inactive for 1200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jan 19 14:44:30 mds2 kernel: : LNet: Skipped 4 previous similar messages
Jan 19 14:44:30 mds2 kernel: : Pid: 4999, comm: mdt01_012
Jan 19 14:44:30 mds2 kernel: :
Jan 19 14:44:30 mds2 kernel: : Call Trace:
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa0616301>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa06066fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa08c50ea>] ldlm_completion_ast+0x57a/0x960 [ptlrpc]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa08c07f0>] ? ldlm_expired_completion_wait+0x0/0x390 [ptlrpc]
Jan 19 14:44:30 mds2 kernel: : [<ffffffff81063990>] ? default_wake_function+0x0/0x20
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa08c4798>] ldlm_cli_enqueue_local+0x1f8/0x5d0 [ptlrpc]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa08c4b70>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa0f57a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa08c4b70>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa0f57a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa0f5dc7b>] mdt_object_lock0+0x33b/0xaf0 [mdt]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa0f57a90>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa08c4b70>] ? ldlm_completion_ast+0x0/0x960 [ptlrpc]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa0f5e4f4>] mdt_object_lock+0x14/0x20 [mdt]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa0f6d5a9>] mdt_getattr_name_lock+0xe19/0x1980 [mdt]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa061cdc2>] ? cfs_hash_lookup+0x82/0xa0 [libcfs]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa08ee75c>] ? lustre_msg_clear_flags+0x6c/0xc0 [ptlrpc]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa0f6e3ad>] mdt_intent_getattr+0x29d/0x490 [mdt]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa0f5af1e>] mdt_intent_policy+0x39e/0x720 [mdt]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa08a5831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa08cc22f>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa0f5b3a6>] mdt_enqueue+0x46/0xe0 [mdt]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa0f61a97>] mdt_handle_common+0x647/0x16d0 [mdt]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa0f9b6e5>] mds_regular_handle+0x15/0x20 [mdt]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa08fe508>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa06065de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa0617dcf>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa08f5869>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa08ff89e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa08fedd0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan 19 14:44:30 mds2 kernel: : [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa08fedd0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa08fedd0>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
Jan 19 14:44:30 mds2 kernel: : [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Jan 19 14:44:30 mds2 kernel: :
Jan 19 14:44:30 mds2 kernel: : Pid: 8475, comm: mdt00_009
Jan 19 14:44:30 mds2 kernel: :
Jan 19 14:44:30 mds2 kernel: : Call Trace:
Jan 19 14:44:30 mds2 kernel: : [<ffffffffa0616301>] ? libcfs_debug_msg+0x41/0x50 [libcfs]

etc.....
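
To our eyes, the traces look like MDT service threads blocked in
ldlm_completion_ast, waiting on a DLM lock that is never granted.  Is watching
the lock counts on the MDS a sensible sanity check here, e.g.:

lctl get_param ldlm.namespaces.mdt-scratch-MDT0000_UUID.lock_count
lctl get_param ldlm.namespaces.*.pool.granted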

Again, we cannot figure out why Lustre does not like those three OSTs.  Is there
some command we can run that will tell us the source of the problem?
Unfortunately, we have no funds to upgrade our storage, so we have to find a
solution within the confines of our legacy hardware.
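
If our Lustre version supports it, would running LFSCK be a reasonable next
step?  Something along these lines:

lctl lfsck_start -M scratch-MDT0000 -t all
lctl get_param mdd.scratch-MDT0000.lfsck_namespace   # check progress/status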

Many Thanks and All the best

Roman