[lustre-discuss] MDS hangs for no apparent reason
Jose Manuel Martínez García
jose.martinez at scayle.es
Sat Jan 25 05:29:57 PST 2025
Hello again,
Yesterday the MDS server crashed twice (the whole machine).
The first crash was before 22:57; the second was at 00:15 today.
Below are the Lustre-related logs. The server was manually rebooted
after the first hang at 22:57 and Lustre started the MDT recovery.
After recovery, the whole system was working properly until around
23:00, when the data started to become inaccessible to the clients.
The server finally hung at 00:15, but the last Lustre log entry is
from 23:26.
Here I can see a line I had not seen before: "$$$ failed to
release quota space on glimpse 0!=60826269226353608"
Jan 24 22:57:08 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000: Imperative
Recovery not enabled, recovery window 300-900
Jan 24 22:57:08 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000: in recovery
but waiting for the first client to connect
Jan 24 22:57:08 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000: Will be in
recovery for at least 5:00, or until 125 clients reconnect
Jan 24 22:57:13 srv-lustre11 kernel: LustreError:
3949134:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
permitted during recovery req at 000000003892d67b x1812509961058304/t0(0)
o601->LUSTRE-MDT0000-lwp-OST0c20_UUID at 10.5.33.243@o2ib1:274/0 lens 336/0
e 0 to 0 dl 1737755839 ref 1 fl Interpret:/0/ffffffff rc 0/-1
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:13 srv-lustre11 kernel: LustreError:
3949134:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 1
previous similar message
Jan 24 22:57:20 srv-lustre11 kernel: LustreError:
3949407:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
permitted during recovery req at 000000009a279624 x1812509773308160/t0(0)
o601->LUSTRE-MDT0000-lwp-OST0fa7_UUID at 10.5.33.244@o2ib1:281/0 lens 336/0
e 0 to 0 dl 1737755846 ref 1 fl Interpret:/0/ffffffff rc 0/-1
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:20 srv-lustre11 kernel: LustreError:
3949407:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 9
previous similar messages
Jan 24 22:57:21 srv-lustre11 kernel: LustreError:
3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
permitted during recovery req at 000000000db38b1b x1812509961083456/t0(0)
o601->LUSTRE-MDT0000-lwp-OST0c1e_UUID at 10.5.33.243@o2ib1:282/0 lens 336/0
e 0 to 0 dl 1737755847 ref 1 fl Interpret:/0/ffffffff rc 0/-1
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:21 srv-lustre11 kernel: LustreError:
3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 12
previous similar messages
Jan 24 22:57:24 srv-lustre11 kernel: LustreError:
3949411:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
permitted during recovery req at 0000000034e830d1 x1812509773318336/t0(0)
o601->LUSTRE-MDT0000-lwp-OST0fa1_UUID at 10.5.33.244@o2ib1:285/0 lens 336/0
e 0 to 0 dl 1737755850 ref 1 fl Interpret:/0/ffffffff rc 0/-1
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:24 srv-lustre11 kernel: LustreError:
3949411:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 8
previous similar messages
Jan 24 22:57:30 srv-lustre11 kernel: LustreError:
3949406:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
permitted during recovery req at 00000000e40a36e5 x1812509961108224/t0(0)
o601->LUSTRE-MDT0000-lwp-OST0bbc_UUID at 10.5.33.243@o2ib1:291/0 lens 336/0
e 0 to 0 dl 1737755856 ref 1 fl Interpret:/0/ffffffff rc 0/-1
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:30 srv-lustre11 kernel: LustreError:
3949406:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 24
previous similar messages
Jan 24 22:57:38 srv-lustre11 kernel: LustreError:
3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
permitted during recovery req at 000000004a78941b x1812509961124480/t0(0)
o601->LUSTRE-MDT0000-lwp-OST0c1d_UUID at 10.5.33.243@o2ib1:299/0 lens 336/0
e 0 to 0 dl 1737755864 ref 1 fl Interpret:/0/ffffffff rc 0/-1
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:38 srv-lustre11 kernel: LustreError:
3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 57
previous similar messages
Jan 24 22:57:57 srv-lustre11 kernel: LustreError:
3949482:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
permitted during recovery req at 000000002220d707 x1812509773390720/t0(0)
o601->LUSTRE-MDT0000-lwp-OST139c_UUID at 10.5.33.244@o2ib1:318/0 lens 336/0
e 0 to 0 dl 1737755883 ref 1 fl Interpret:/0/ffffffff rc 0/-1
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:57 srv-lustre11 kernel: LustreError:
3949482:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 99
previous similar messages
Jan 24 22:58:15 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000: Recovery
over after 1:07, of 125 clients 125 recovered and 0 were evicted.
Jan 24 22:58:50 srv-lustre11 kernel: LustreError:
3949159:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much!
uuid:LUSTRE-MDT0000-lwp-OST0bc4_UUID release: 60826269226353608
granted:66040, total:13781524 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2949
enforced:1 hard:62914560 soft:52428800 granted:13781524 time:0 qunit:
262144 edquot:0 may_rel:0 revoke:0 default:yes
Jan 24 22:58:50 srv-lustre11 kernel: LustreError:
3949159:0:(qmt_lock.c:425:qmt_lvbo_update()) $$$ failed to release
quota space on glimpse 0!=60826269226353608 : rc = -22#012
qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2949 enforced:1 hard:62914560
soft:52428800 granted:13781524 time:0 qunit: 262144 edquot:0 may_rel:0
revoke:0 default:yes
Jan 24 23:08:52 srv-lustre11 kernel: Lustre: LUSTRE-OST1389-osc-MDT0000:
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:09:39 srv-lustre11 kernel: Lustre: LUSTRE-OST138b-osc-MDT0000:
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:10:24 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST138d-osc-MDT0000: operation ost_connect to node
10.5.33.245 at o2ib1 failed: rc = -19
Jan 24 23:10:32 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST13ef-osc-MDT0000: operation ost_connect to node
10.5.33.245 at o2ib1 failed: rc = -19
Jan 24 23:10:32 srv-lustre11 kernel: LustreError: Skipped 5 previous
similar messages
Jan 24 23:11:18 srv-lustre11 kernel: Lustre: LUSTRE-OST138d-osc-MDT0000:
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:11:25 srv-lustre11 kernel: Lustre: LUSTRE-OST138a-osc-MDT0000:
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:12:09 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST1390-osc-MDT0000: operation ost_connect to node
10.5.33.244 at o2ib1 failed: rc = -19
Jan 24 23:12:09 srv-lustre11 kernel: LustreError: Skipped 3 previous
similar messages
Jan 24 23:12:09 srv-lustre11 kernel: Lustre: LUSTRE-OST138e-osc-MDT0000:
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:12:23 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST13ef-osc-MDT0000: operation ost_connect to node
10.5.33.245 at o2ib1 failed: rc = -19
Jan 24 23:12:23 srv-lustre11 kernel: LustreError: Skipped 3 previous
similar messages
Jan 24 23:12:58 srv-lustre11 kernel: Lustre: LUSTRE-OST138f-osc-MDT0000:
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:13:46 srv-lustre11 kernel: Lustre: LUSTRE-OST1390-osc-MDT0000:
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:13:46 srv-lustre11 kernel: Lustre: Skipped 1 previous similar
message
Jan 24 23:14:35 srv-lustre11 kernel: Lustre: LUSTRE-OST1391-osc-MDT0000:
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:14:36 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST1392-osc-MDT0000: operation ost_connect to node
10.5.33.245 at o2ib1 failed: rc = -19
Jan 24 23:14:36 srv-lustre11 kernel: LustreError: Skipped 3 previous
similar messages
Jan 24 23:16:48 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST13ef-osc-MDT0000: operation ost_connect to node
10.5.33.244 at o2ib1 failed: rc = -19
Jan 24 23:16:48 srv-lustre11 kernel: LustreError: Skipped 4 previous
similar messages
Jan 24 23:17:02 srv-lustre11 kernel: Lustre: LUSTRE-OST03f3-osc-MDT0000:
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:17:02 srv-lustre11 kernel: Lustre: Skipped 1 previous similar
message
Jan 24 23:19:33 srv-lustre11 kernel: Lustre: LUSTRE-OST03f6-osc-MDT0000:
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:19:33 srv-lustre11 kernel: Lustre: Skipped 2 previous similar
messages
Jan 24 23:19:41 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST13ed-osc-MDT0000: operation ost_connect to node
10.5.33.244 at o2ib1 failed: rc = -19
Jan 24 23:19:41 srv-lustre11 kernel: LustreError: Skipped 3 previous
similar messages
Jan 24 23:22:11 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST13ed-osc-MDT0000: operation ost_connect to node
10.5.33.245 at o2ib1 failed: rc = -19
Jan 24 23:22:11 srv-lustre11 kernel: LustreError: Skipped 3 previous
similar messages
Jan 24 23:23:59 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST13ed-osc-MDT0000: operation ost_connect to node
10.5.33.245 at o2ib1 failed: rc = -19
Jan 24 23:23:59 srv-lustre11 kernel: LustreError: Skipped 3 previous
similar messages
Jan 24 23:24:29 srv-lustre11 kernel: Lustre: LUSTRE-OST03fc-osc-MDT0000:
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:24:29 srv-lustre11 kernel: Lustre: Skipped 5 previous similar
messages
Jan 24 23:26:33 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST13f0-osc-MDT0000: operation ost_connect to node
10.5.33.244 at o2ib1 failed: rc = -19
Jan 24 23:26:33 srv-lustre11 kernel: LustreError: Skipped 5 previous
similar messages
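
If it helps, this is how I can check the quota master's view of the
affected project ID against the actual usage. The "Release too much!" /
"failed to release quota space on glimpse" messages above refer to
project id 2949 in the data pool (dt-0x0). This is only a sketch: the
mount point is a placeholder and the exact parameter names may vary
between Lustre versions.

    # from a client: usage and limits for project 2949
    lfs quota -p 2949 /lustre

    # on the MDS (quota master): per-ID granted space for project quotas
    # in the data pool, as tracked by the QMT
    lctl get_param qmt.LUSTRE-QMT0000.dt-0x0.glb-prj

    # on an OSS (quota slave): state of the local quota slave
    lctl get_param osd-*.*.quota_slave.info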
Thanks.
Jose.
On 21/01/2025 at 10:34, Jose Manuel Martínez García wrote:
>
> Hello everybody.
>
>
> I am dealing with an issue on a relatively new Lustre installation.
> The Metadata Server (MDS) hangs at random: it can take anywhere from
> 30 minutes to 30 days, but it always ends up hanging, and I have not
> found a consistent pattern. The logs show nothing unusual at the time
> of the failure. The only messages I see continuously are these:
>
> [lun ene 20 14:17:10 2025] LustreError:
> 7068:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with
> -22, flags:0x4 qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1
> granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636
> qunit:262144 qtune:65536 edquot:0 default:yes
> [lun ene 20 14:17:10 2025] LustreError:
> 7068:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous
> similar messages
> [lun ene 20 14:21:52 2025] LustreError:
> 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much!
> uuid:LUSTRE-MDT0000-lwp-OST0c1f_UUID release: 15476132855418716160
> granted:262144, total:14257500 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2582
> enforced:1 hard:62914560 soft:52428800 granted:14257500 time:0 qunit:
> 262144 edquot:0 may_rel:0 revoke:0 default:yes
> [lun ene 20 14:21:52 2025] LustreError:
> 1947381:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much!
> uuid:LUSTRE-MDT0000-lwp-OST0fb2_UUID release: 13809297465413342331
> granted:66568, total:14179564 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2325
> enforced:1 hard:62914560 soft:52428800 granted:14179564 time:0 qunit:
> 262144 edquot:0 may_rel:0 revoke:0 default:yes
> [lun ene 20 14:21:52 2025] LustreError:
> 1947381:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 802 previous
> similar messages
> [lun ene 20 14:21:52 2025] LustreError:
> 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 802 previous
> similar messages
> [lun ene 20 14:27:24 2025] LustreError:
> 7047:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with
> -22, flags:0x4 qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1
> granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636
> qunit:262144 qtune:65536 edquot:0 default:yes
> [lun ene 20 14:27:24 2025] LustreError:
> 7047:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous
> similar messages
> [lun ene 20 14:31:52 2025] LustreError:
> 1844354:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much!
> uuid:LUSTRE-MDT0000-lwp-OST1399_UUID release: 12882711387029922688
> granted:66116, total:14078012 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2586
> enforced:1 hard:62914560 soft:52428800 granted:14078012 time:0 qunit:
> 262144 edquot:0 may_rel:0 revoke:0 default:yes
> [lun ene 20 14:31:52 2025] LustreError:
> 1844354:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 785 previous
> similar messages
> [lun ene 20 14:37:39 2025] LustreError:
> 7054:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with
> -22, flags:0x4 qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1
> granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636
> qunit:262144 qtune:65536 edquot:0 default:yes
> [lun ene 20 14:37:39 2025] LustreError:
> 7054:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous
> similar messages
> [lun ene 20 14:41:54 2025] LustreError:
> 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much!
> uuid:LUSTRE-MDT0000-lwp-OST0faa_UUID release: 13811459193234480169
> granted:65632, total:14179564 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2325
> enforced:1 hard:62914560 soft:52428800 granted:14179564 time:0 qunit:
> 262144 edquot:0 may_rel:0 revoke:0 default:yes
> [lun ene 20 14:41:54 2025] LustreError:
> 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 798 previous
> similar messages
> [lun ene 20 14:47:53 2025] LustreError:
> 7052:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with
> -22, flags:0x4 qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1
> granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636
> qunit:262144 qtune:65536 edquot:0 default:yes
> [lun ene 20 14:47:53 2025] LustreError:
> 7052:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous
> similar messages
>
> I have ruled out hardware failure since the MDS service has been moved
> between different servers, and it happens with all of them.
>
> Linux distribution: AlmaLinux release 8.10 (Cerulean Leopard)
> Kernel: Linux srv-lustre15 4.18.0-553.5.1.el8_lustre.x86_64 #1 SMP Fri
> Jun 28 18:44:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
> Lustre release: lustre-2.15.5-1.el8.x86_64
> Not using ZFS.
>
> Any ideas on where to continue investigating?
> Is the error appearing in dmesg a bug, or is it corruption of the
> quota database?
>
> The known quota bugs that might be related seem to have been fixed in
> version 2.15.
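>
> To rule out on-disk corruption of the quota accounting on an ldiskfs
> target, one thing I am considering (a sketch only, the device path is
> a placeholder; I would double-check the procedure against the Lustre
> manual first) is regenerating the quota files with the target
> unmounted:
>
>     # with the MDT (or OST) unmounted
>     tune2fs -O ^quota /dev/mapper/mdt0   # drop the quota feature
>     tune2fs -O quota /dev/mapper/mdt0    # re-enable it so the quota files are rebuilt
>     e2fsck -fp /dev/mapper/mdt0          # let e2fsck recheck the quota accounting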
>
>
> Thanks in advance.
>
> --
>
> Jose Manuel Martínez García
>
> Systems Coordinator
>
> Supercomputación de Castilla y León
>
> Tel: 987 293 174
>
> Edificio CRAI-TIC, Campus de Vegazana, s/n Universidad de León - 24071
> León, España
>
> <https://www.scayle.es/>
>