[lustre-discuss] Eternally Invalid Lock?

Ellis Wilson elliswilson at microsoft.com
Wed Apr 24 14:52:19 PDT 2024


Hi all,

(This is on 2.15.4 with very limited modifications, none to speak of in ldlm or similar)

Very rarely, when attempting to perform an lctl barrier_freeze, we run into a situation where it fails with EINVAL.  At that point all future lctl barrier operations (including rescan) return EINVAL, except barrier_status, which simply shows failed.  Things stay stuck that way until we reboot the MDS or we reach the heat death of the universe.  We have only tested the former, and it does work, though it's extremely heavyweight.

Since we've been able to reproduce this on a test system (which we're leaving in that state), we've collected traces, and these two stand out:

00010000:00010000:7.0:1713898822.714580:0:2921:0:(ldlm_lockd.c:2382:ldlm_callback_handler()) callback on lock 0x831100496f36c2f4 - lock disappeared
00010000:00010000:4.0:1713898822.710567:0:2676:0:(ldlm_lockd.c:2286:ldlm_callback_errmsg()) @@@ Operate with invalid parameter, NID=12345-0@lo lock=0x831100496f36c2f4: rc = 0  req@000000002ae045d9 x1783750461943232/t0(0) o106->8afc7dcc-fc22-4bb8-8ca1-b5df01779cf4@0@lo:64/0 lens 400/224 e 0 to 0 dl 1713899139 ref 1 fl Interpret:/0/0 rc -22/-22 job:''

This seems to correlate with this code in ldlm_callback_handler():
2390         /*
2391          * Force a known safe race, send a cancel to the server for a lock
2392          * which the server has already started a blocking callback on.
2393          */
2394         if (OBD_FAIL_CHECK(OBD_FAIL_LDLM_CANCEL_BL_CB_RACE) &&
2395             lustre_msg_get_opc(req->rq_reqmsg) == LDLM_BL_CALLBACK) {
2396                 rc = ldlm_cli_cancel(&dlm_req->lock_handle[0], 0);
2397                 if (rc < 0)
2398                         CERROR("ldlm_cli_cancel: %d\n", rc);
2399         }
2400
2401         lock = ldlm_handle2lock_long(&dlm_req->lock_handle[0], 0);
2402         if (!lock) {
2403                 CDEBUG(D_DLMTRACE,
2404                        "callback on lock %#llx - lock disappeared\n",
2405                        dlm_req->lock_handle[0].cookie);
2406                 rc = ldlm_callback_reply(req, -EINVAL);
2407                 ldlm_callback_errmsg(req, "Operate with invalid parameter", rc,
2408                                      &dlm_req->lock_handle[0]);
2409                 RETURN(0);
2410         }
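
As I read it, the callback only carries the 64-bit lock handle cookie, and if that cookie no longer resolves to a live lock the handler replies -EINVAL, which matches the rc -22/-22 in the req dump above.  Here's a toy, standalone sketch of that mechanism as I understand it (my own illustration, not Lustre code; the "toy_" names are made up):

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for an LDLM lock; the wire only ever carries the cookie. */
struct toy_lock {
        uint64_t cookie;
        struct toy_lock *next;
};

static struct toy_lock *live_locks;     /* stand-in for the handle hash */

/* Rough analogue of ldlm_handle2lock_long(): cookie -> lock, or NULL. */
static struct toy_lock *toy_handle2lock(uint64_t cookie)
{
        for (struct toy_lock *l = live_locks; l; l = l->next)
                if (l->cookie == cookie)
                        return l;
        return NULL;                    /* "lock disappeared" */
}

static int toy_callback_handler(uint64_t cookie)
{
        struct toy_lock *lock = toy_handle2lock(cookie);

        if (!lock) {
                printf("callback on lock %#llx - lock disappeared\n",
                       (unsigned long long)cookie);
                return -EINVAL;         /* what shows up as rc -22 */
        }
        /* ...otherwise run the blocking/completion callback... */
        return 0;
}

int main(void)
{
        /* No locks registered, so any cookie is stale by construction. */
        printf("rc = %d\n", toy_callback_handler(0x831100496f36c2f4ULL));
        return 0;
}

So the -EINVAL itself just means "I don't know that cookie"; it says nothing about why the cookie went stale.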

The weird thing is that this lock never expires.  0x831100496f36c2f4 is here to stay.  Attempting to clear it by writing "clear" to the lru_size parameter does nothing.

Practical questions:
1. I'm assuming NID=12345-0@lo is basically equivalent to "localhost" for Lustre, but if it's not, let me know.  I see the first part of this is defined as LNET_PID_LUSTRE, though I'm not 100% sure what PID stands for here (I doubt it's a process ID?).
2. Is there a tool I haven't been able to find that can instruct Lustre to drop a lock by ID?  That would be handy, though I realize I'm asking for a "turn off the airbags" button.
3. One of my engineers made the following comment in case it's helpful (see the decode sketch after the quote):
"It appears that the "lock disappeared" lock exists in the dump_namespaces output as the remote end of another lock but nowhere as a lock itself. It's also interesting that it seems like the same resource appears twice with different resource IDs but the same 3 part ID that looks like a FID:
00010000:00010000:10.0:1713987044.511342:0:1897178:0:(ldlm_resource.c:1783:ldlm_resource_dump()) --- Resource: [0x736665727473756c:0x5:0x0].0x0 (00000000c3f4ce61) refcount = 3
00010000:00010000:10.0:1713987044.511343:0:1897178:0:(ldlm_resource.c:1787:ldlm_resource_dump()) Granted locks (in reverse order):
00010000:00010000:10.0:1713987044.511343:0:1897178:0:(ldlm_resource.c:1790:ldlm_resource_dump()) ### ### ns: MGS lock: 00000000e2096801/0x831100496f36eea6 lrc: 2/0,0 mode: CR/CR res: [0x736665727473756c:0x5:0x0].0x0 rrc: 4 type: PLN flags: 0x40000000000000 nid: 0@lo remote: 0x831100496f36ee9f expref: 14 pid: 2699 timeout: 0 lvb_type: 0
00010000:00000040:10.0:1713987044.511344:0:1897178:0:(ldlm_resource.c:1600:ldlm_resource_putref()) putref res: 00000000c3f4ce61 count: 3
00010000:00010000:10.0:1713987044.511344:0:1897178:0:(ldlm_resource.c:1790:ldlm_resource_dump()) ### ### ns: MGS lock: 000000001007e670/0x831100496f36c2fb lrc: 2/0,0 mode: CR/CR res: [0x736665727473756c:0x5:0x0].0x0 rrc: 4 type: PLN flags: 0x40000000000000 nid: 0@lo remote: 0x831100496f36c2f4 expref: 14 pid: 2700 timeout: 0 lvb_type: 0
00010000:00000040:10.0:1713987044.511345:0:1897178:0:(ldlm_resource.c:1600:ldlm_resource_putref()) putref res: 00000000c3f4ce61 count: 3
00010000:00000040:10.0:1713987044.511346:0:1897178:0:(ldlm_resource.c:1600:ldlm_resource_putref()) putref res: 00000000c3f4ce61 count: 2
00010000:00000040:10.0:1713987044.511346:0:1897178:0:(ldlm_resource.c:1566:ldlm_resource_getref()) getref res: 000000001ab54d34 count: 35

00010000:00010000:10.0:1713987044.512238:0:1897178:0:(ldlm_resource.c:1783:ldlm_resource_dump()) --- Resource: [0x736665727473756c:0x5:0x0].0x0 (0000000019b90cb9) refcount = 2
00010000:00010000:10.0:1713987044.512239:0:1897178:0:(ldlm_resource.c:1787:ldlm_resource_dump()) Granted locks (in reverse order):
00010000:00010000:10.0:1713987044.512239:0:1897178:0:(ldlm_resource.c:1790:ldlm_resource_dump()) ### ### ns: MGC10.0.1.6@tcp lock: 00000000b9bf701e/0x831100496f36ee9f lrc: 1/0,0 mode: CR/CR res: [0x736665727473756c:0x5:0x0].0x0 rrc: 3 type: PLN flags: 0x1000000000000 nid: local remote: 0x831100496f36eea6 expref: -99 pid: 2702 timeout: 0 lvb_type: 0
00010000:00000040:10.0:1713987044.512240:0:1897178:0:(ldlm_resource.c:1600:ldlm_resource_putref()) putref res: 0000000019b90cb9 count: 2
00010000:00000040:10.0:1713987044.512240:0:1897178:0:(ldlm_resource.c:1600:ldlm_resource_putref()) putref res: 0000000019b90cb9 count: 1
00010000:00000040:10.0:1713987044.512241:0:1897178:0:(ldlm_resource.c:1566:ldlm_resource_getref()) getref res: 00000000d6eb813a count: 2"
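
For what it's worth, the first component of that resource ID is just a name packed into a u64 (the packing mgc_name2resid appears to do for MGC/MGS lock resources), so the two dumps are the same resource for the same filesystem, seen once from the MGS namespace and once from the local MGC namespace.  Quick standalone decode (my own sketch, not Lustre code; assumes a little-endian host):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        uint64_t res = 0x736665727473756cULL;   /* first res_id component from the dump */
        char name[sizeof(res) + 1] = { 0 };

        /* On a little-endian host the byte order matches the packed string. */
        memcpy(name, &res, sizeof(res));
        printf("resource [%#llx:0x5:0x0] -> \"%s\"\n",
               (unsigned long long)res, name);  /* prints "lustrefs", presumably the fsname */
        return 0;
}

And if I'm reading the headers right, the 0x5 in the second component looks like it could be the barrier config type (MGS_CFG_T_BARRIER), which would tie this resource directly to the barrier operations that keep failing -- but I haven't verified that, so treat it as a guess.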

Finally, note that I found this bug upstream:
https://jira.whamcloud.com/browse/LU-17412
But that one appears to be triggered by:
https://review.whamcloud.com/c/fs/lustre-release/+/52498, which only lands in 2.16.0, not in our versions (both of these hits were on 2.15.1 -- we're on 2.15.4 now, but I see no ldlm changes that suggest this would be fixed between those two).  That commit feels like it would help our cause here (since it handles EINVAL), but ironically it appears to trigger the pathology in the aforementioned bug rather than help it.

Any recommendations on further data I can gather?  I'll open a bug this week, but I thought I'd first ask whether somebody has good ideas or can mentally pattern-match this to some other bug I've overlooked that's been fixed in a later release.

Thanks as always!

ellis

