[lustre-discuss] MDS hangs for no apparent reason

Jose Manuel Martínez García jose.martinez at scayle.es
Sat Jan 25 05:29:57 PST 2025


Hello again,


Yesterday the MDS server crashed twice (the whole machine).

The first crash was just before 22:57; the second was at 00:15 today.

Here you can see the Lustre-related logs. The server was manually 
rebooted after the first hang at 22:57 and Lustre started the MDT 
recovery. After recovery, the whole system was working 'properly' until 
23:00, when the data started to become inaccessible to the clients. 
Finally, the server hung at 00:15, but the last Lustre log entry is at 23:26.


Here I can see a line I had not seen before: "$$$ failed to 
release quota space on glimpse 0!=60826269226353608"



Jan 24 22:57:08 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000: Imperative 
Recovery not enabled, recovery window 300-900
Jan 24 22:57:08 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000: in recovery 
but waiting for the first client to connect
Jan 24 22:57:08 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000: Will be in 
recovery for at least 5:00, or until 125 clients reconnect
Jan 24 22:57:13 srv-lustre11 kernel: LustreError: 
3949134:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not 
permitted during recovery  req at 000000003892d67b x1812509961058304/t0(0) 
o601->LUSTRE-MDT0000-lwp-OST0c20_UUID at 10.5.33.243@o2ib1:274/0 lens 336/0 
e 0 to 0 dl 1737755839 ref 1 fl Interpret:/0/ffffffff rc 0/-1 
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:13 srv-lustre11 kernel: LustreError: 
3949134:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 1 
previous similar message
Jan 24 22:57:20 srv-lustre11 kernel: LustreError: 
3949407:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not 
permitted during recovery  req at 000000009a279624 x1812509773308160/t0(0) 
o601->LUSTRE-MDT0000-lwp-OST0fa7_UUID at 10.5.33.244@o2ib1:281/0 lens 336/0 
e 0 to 0 dl 1737755846 ref 1 fl Interpret:/0/ffffffff rc 0/-1 
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:20 srv-lustre11 kernel: LustreError: 
3949407:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 9 
previous similar messages
Jan 24 22:57:21 srv-lustre11 kernel: LustreError: 
3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not 
permitted during recovery  req at 000000000db38b1b x1812509961083456/t0(0) 
o601->LUSTRE-MDT0000-lwp-OST0c1e_UUID at 10.5.33.243@o2ib1:282/0 lens 336/0 
e 0 to 0 dl 1737755847 ref 1 fl Interpret:/0/ffffffff rc 0/-1 
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:21 srv-lustre11 kernel: LustreError: 
3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 12 
previous similar messages
Jan 24 22:57:24 srv-lustre11 kernel: LustreError: 
3949411:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not 
permitted during recovery  req at 0000000034e830d1 x1812509773318336/t0(0) 
o601->LUSTRE-MDT0000-lwp-OST0fa1_UUID at 10.5.33.244@o2ib1:285/0 lens 336/0 
e 0 to 0 dl 1737755850 ref 1 fl Interpret:/0/ffffffff rc 0/-1 
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:24 srv-lustre11 kernel: LustreError: 
3949411:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 8 
previous similar messages
Jan 24 22:57:30 srv-lustre11 kernel: LustreError: 
3949406:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not 
permitted during recovery  req at 00000000e40a36e5 x1812509961108224/t0(0) 
o601->LUSTRE-MDT0000-lwp-OST0bbc_UUID at 10.5.33.243@o2ib1:291/0 lens 336/0 
e 0 to 0 dl 1737755856 ref 1 fl Interpret:/0/ffffffff rc 0/-1 
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:30 srv-lustre11 kernel: LustreError: 
3949406:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 24 
previous similar messages
Jan 24 22:57:38 srv-lustre11 kernel: LustreError: 
3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not 
permitted during recovery  req at 000000004a78941b x1812509961124480/t0(0) 
o601->LUSTRE-MDT0000-lwp-OST0c1d_UUID at 10.5.33.243@o2ib1:299/0 lens 336/0 
e 0 to 0 dl 1737755864 ref 1 fl Interpret:/0/ffffffff rc 0/-1 
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:38 srv-lustre11 kernel: LustreError: 
3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 57 
previous similar messages
Jan 24 22:57:57 srv-lustre11 kernel: LustreError: 
3949482:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not 
permitted during recovery  req at 000000002220d707 x1812509773390720/t0(0) 
o601->LUSTRE-MDT0000-lwp-OST139c_UUID at 10.5.33.244@o2ib1:318/0 lens 336/0 
e 0 to 0 dl 1737755883 ref 1 fl Interpret:/0/ffffffff rc 0/-1 
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:57 srv-lustre11 kernel: LustreError: 
3949482:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 99 
previous similar messages
Jan 24 22:58:15 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000: Recovery 
over after 1:07, of 125 clients 125 recovered and 0 were evicted.
Jan 24 22:58:50 srv-lustre11 kernel: LustreError: 
3949159:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much! 
uuid:LUSTRE-MDT0000-lwp-OST0bc4_UUID release: 60826269226353608 
granted:66040, total:13781524  qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2949 
enforced:1 hard:62914560 soft:52428800 granted:13781524 time:0 qunit: 
262144 edquot:0 may_rel:0 revoke:0 default:yes
Jan 24 22:58:50 srv-lustre11 kernel: LustreError: 
3949159:0:(qmt_lock.c:425:qmt_lvbo_update()) *$$$ failed to release 
quota space on glimpse 0!=60826269226353608* : rc = -22#012  
qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2949 enforced:1 hard:62914560 
soft:52428800 granted:13781524 time:0 qunit: 262144 edquot:0 may_rel:0 
revoke:0 default:yes
Jan 24 23:08:52 srv-lustre11 kernel: Lustre: LUSTRE-OST1389-osc-MDT0000: 
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:09:39 srv-lustre11 kernel: Lustre: LUSTRE-OST138b-osc-MDT0000: 
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:10:24 srv-lustre11 kernel: LustreError: 11-0: 
LUSTRE-OST138d-osc-MDT0000: operation ost_connect to node 
10.5.33.245 at o2ib1 failed: rc = -19
Jan 24 23:10:32 srv-lustre11 kernel: LustreError: 11-0: 
LUSTRE-OST13ef-osc-MDT0000: operation ost_connect to node 
10.5.33.245 at o2ib1 failed: rc = -19
Jan 24 23:10:32 srv-lustre11 kernel: LustreError: Skipped 5 previous 
similar messages
Jan 24 23:11:18 srv-lustre11 kernel: Lustre: LUSTRE-OST138d-osc-MDT0000: 
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:11:25 srv-lustre11 kernel: Lustre: LUSTRE-OST138a-osc-MDT0000: 
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:12:09 srv-lustre11 kernel: LustreError: 11-0: 
LUSTRE-OST1390-osc-MDT0000: operation ost_connect to node 
10.5.33.244 at o2ib1 failed: rc = -19
Jan 24 23:12:09 srv-lustre11 kernel: LustreError: Skipped 3 previous 
similar messages
Jan 24 23:12:09 srv-lustre11 kernel: Lustre: LUSTRE-OST138e-osc-MDT0000: 
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:12:23 srv-lustre11 kernel: LustreError: 11-0: 
LUSTRE-OST13ef-osc-MDT0000: operation ost_connect to node 
10.5.33.245 at o2ib1 failed: rc = -19
Jan 24 23:12:23 srv-lustre11 kernel: LustreError: Skipped 3 previous 
similar messages
Jan 24 23:12:58 srv-lustre11 kernel: Lustre: LUSTRE-OST138f-osc-MDT0000: 
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:13:46 srv-lustre11 kernel: Lustre: LUSTRE-OST1390-osc-MDT0000: 
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:13:46 srv-lustre11 kernel: Lustre: Skipped 1 previous similar 
message
Jan 24 23:14:35 srv-lustre11 kernel: Lustre: LUSTRE-OST1391-osc-MDT0000: 
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:14:36 srv-lustre11 kernel: LustreError: 11-0: 
LUSTRE-OST1392-osc-MDT0000: operation ost_connect to node 
10.5.33.245 at o2ib1 failed: rc = -19
Jan 24 23:14:36 srv-lustre11 kernel: LustreError: Skipped 3 previous 
similar messages
Jan 24 23:16:48 srv-lustre11 kernel: LustreError: 11-0: 
LUSTRE-OST13ef-osc-MDT0000: operation ost_connect to node 
10.5.33.244 at o2ib1 failed: rc = -19
Jan 24 23:16:48 srv-lustre11 kernel: LustreError: Skipped 4 previous 
similar messages
Jan 24 23:17:02 srv-lustre11 kernel: Lustre: LUSTRE-OST03f3-osc-MDT0000: 
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:17:02 srv-lustre11 kernel: Lustre: Skipped 1 previous similar 
message
Jan 24 23:19:33 srv-lustre11 kernel: Lustre: LUSTRE-OST03f6-osc-MDT0000: 
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:19:33 srv-lustre11 kernel: Lustre: Skipped 2 previous similar 
messages
Jan 24 23:19:41 srv-lustre11 kernel: LustreError: 11-0: 
LUSTRE-OST13ed-osc-MDT0000: operation ost_connect to node 
10.5.33.244 at o2ib1 failed: rc = -19
Jan 24 23:19:41 srv-lustre11 kernel: LustreError: Skipped 3 previous 
similar messages
Jan 24 23:22:11 srv-lustre11 kernel: LustreError: 11-0: 
LUSTRE-OST13ed-osc-MDT0000: operation ost_connect to node 
10.5.33.245 at o2ib1 failed: rc = -19
Jan 24 23:22:11 srv-lustre11 kernel: LustreError: Skipped 3 previous 
similar messages
Jan 24 23:23:59 srv-lustre11 kernel: LustreError: 11-0: 
LUSTRE-OST13ed-osc-MDT0000: operation ost_connect to node 
10.5.33.245 at o2ib1 failed: rc = -19
Jan 24 23:23:59 srv-lustre11 kernel: LustreError: Skipped 3 previous 
similar messages
Jan 24 23:24:29 srv-lustre11 kernel: Lustre: LUSTRE-OST03fc-osc-MDT0000: 
Connection restored to 10.5.33.245 at o2ib1 (at 10.5.33.245 at o2ib1)
Jan 24 23:24:29 srv-lustre11 kernel: Lustre: Skipped 5 previous similar 
messages
Jan 24 23:26:33 srv-lustre11 kernel: LustreError: 11-0: 
LUSTRE-OST13f0-osc-MDT0000: operation ost_connect to node 
10.5.33.244 at o2ib1 failed: rc = -19
Jan 24 23:26:33 srv-lustre11 kernel: LustreError: Skipped 5 previous 
similar messages
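
In case it helps to correlate these messages with the hang times, here is 
a minimal Python sketch that tallies the quota-related LustreError lines 
per minute (it assumes the messages land in /var/log/messages on the MDS 
in the same syslog format shown above; the path and patterns are only 
assumptions and should be adjusted):

import re
from collections import Counter

LOGFILE = "/var/log/messages"  # assumption: wherever these kernel messages are written
PATTERNS = ("qmt_dqacq0",          # "Release too much!"
            "qmt_lvbo_update",     # "failed to release quota space on glimpse"
            "qsd_req_completion")  # "DQACQ failed with -22"

counts = Counter()
with open(LOGFILE, errors="replace") as f:
    for line in f:
        if "LustreError" in line and any(p in line for p in PATTERNS):
            # Timestamp is the first three fields, e.g. "Jan 24 22:58:50";
            # truncate to the minute so bursts are grouped together.
            m = re.match(r"(\w{3}\s+\d+\s+\d{2}:\d{2})", line)
            if m:
                counts[m.group(1)] += 1

for minute, n in sorted(counts.items()):
    print(minute, n)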



Thanks.

Jose.


On 21/01/2025 at 10:34, Jose Manuel Martínez García wrote:
>
> Hello everybody.
>
>
> I am dealing with an issue on a relatively new Lustre installation. 
> The Metadata Server (MDS) hangs randomly: it can take anywhere from 
> 30 minutes to 30 days, but it always ends up hanging, with no 
> consistent pattern (at least, I haven't found one). The logs don't 
> show anything unusual at the time of the failure. The only thing I 
> continuously see are these messages:
>
> [lun ene 20 14:17:10 2025] LustreError: 
> 7068:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with 
> -22, flags:0x4  qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 
> granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636 
> qunit:262144 qtune:65536 edquot:0 default:yes
> [lun ene 20 14:17:10 2025] LustreError: 
> 7068:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous 
> similar messages
> [lun ene 20 14:21:52 2025] LustreError: 
> 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much! 
> uuid:LUSTRE-MDT0000-lwp-OST0c1f_UUID release: 15476132855418716160 
> granted:262144, total:14257500 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2582 
> enforced:1 hard:62914560 soft:52428800 granted:14257500 time:0 qunit: 
> 262144 edquot:0 may_rel:0 revoke:0 default:yes
> [lun ene 20 14:21:52 2025] LustreError: 
> 1947381:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much! 
> uuid:LUSTRE-MDT0000-lwp-OST0fb2_UUID release: 13809297465413342331 
> granted:66568, total:14179564 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2325 
> enforced:1 hard:62914560 soft:52428800 granted:14179564 time:0 qunit: 
> 262144 edquot:0 may_rel:0 revoke:0 default:yes
> [lun ene 20 14:21:52 2025] LustreError: 
> 1947381:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 802 previous 
> similar messages
> [lun ene 20 14:21:52 2025] LustreError: 
> 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 802 previous 
> similar messages
> [lun ene 20 14:27:24 2025] LustreError: 
> 7047:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with 
> -22, flags:0x4  qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 
> granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636 
> qunit:262144 qtune:65536 edquot:0 default:yes
> [lun ene 20 14:27:24 2025] LustreError: 
> 7047:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous 
> similar messages
> [lun ene 20 14:31:52 2025] LustreError: 
> 1844354:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much! 
> uuid:LUSTRE-MDT0000-lwp-OST1399_UUID release: 12882711387029922688 
> granted:66116, total:14078012 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2586 
> enforced:1 hard:62914560 soft:52428800 granted:14078012 time:0 qunit: 
> 262144 edquot:0 may_rel:0 revoke:0 default:yes
> [lun ene 20 14:31:52 2025] LustreError: 
> 1844354:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 785 previous 
> similar messages
> [lun ene 20 14:37:39 2025] LustreError: 
> 7054:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with 
> -22, flags:0x4  qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 
> granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636 
> qunit:262144 qtune:65536 edquot:0 default:yes
> [lun ene 20 14:37:39 2025] LustreError: 
> 7054:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous 
> similar messages
> [lun ene 20 14:41:54 2025] LustreError: 
> 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much! 
> uuid:LUSTRE-MDT0000-lwp-OST0faa_UUID release: 13811459193234480169 
> granted:65632, total:14179564 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2325 
> enforced:1 hard:62914560 soft:52428800 granted:14179564 time:0 qunit: 
> 262144 edquot:0 may_rel:0 revoke:0 default:yes
> [lun ene 20 14:41:54 2025] LustreError: 
> 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 798 previous 
> similar messages
> [lun ene 20 14:47:53 2025] LustreError: 
> 7052:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with 
> -22, flags:0x4  qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 
> granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636 
> qunit:262144 qtune:65536 edquot:0 default:yes
> [lun ene 20 14:47:53 2025] LustreError: 
> 7052:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous 
> similar messages
>
> I have ruled out hardware failure since the MDS service has been moved 
> between different servers, and it happens with all of them.
>
> Linux distribution: AlmaLinux release 8.10 (Cerulean Leopard)
> Kernel: Linux srv-lustre15 4.18.0-553.5.1.el8_lustre.x86_64 #1 SMP Fri 
> Jun 28 18:44:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
> Lustre release: lustre-2.15.5-1.el8.x86_64
> Not using ZFS.
>
> Any ideas on where to continue investigating?
> Is the error appearing in dmesg a bug, or is it corruption of the 
> quota database?
>
> The known quota-related bugs that might be responsible appear to have 
> been fixed by version 2.15.
>
>
> Thanks in advance.
>
> -- 
>
> Jose Manuel Martínez García
>
> Systems Coordinator
>
> Supercomputación de Castilla y León
>
> Tel: 987 293 174
>
> Edificio CRAI-TIC, Campus de Vegazana, s/n Universidad de León - 24071 
> León, España
>
> <https://www.scayle.es/>
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

