[lustre-discuss] MDS hangs for no apparent reason
Jose Manuel Martínez García
jose.martinez at scayle.es
Wed May 28 23:52:05 PDT 2025
Hi,
After disabling quotas, the MDS hasn't crashed in the last 4 months and
the whole system has been working smoothly.
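
In case it helps anyone else hitting the same messages: quota enforcement
was switched off from the MGS with lctl conf_param. This is only a sketch
from memory; the filesystem name "LUSTRE" is taken from the log prefixes
below, so adapt it to your own setup:

  # Run on the MGS node. This disables quota *enforcement* for the data
  # and metadata targets (space accounting itself keeps running);
  # "ugp" would re-enable user/group/project enforcement later.
  lctl conf_param LUSTRE.quota.ost=none
  lctl conf_param LUSTRE.quota.mdt=none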
On 25/01/2025 at 14:29, Jose Manuel Martínez García wrote:
>
> Hello again,
>
>
> Yesterday the MDS server crashed twice (the whole machine).
>
> The first one was before 22:57. The second one was at 00:15 today.
>
> Here you can see the Lustre-related logs. The server was manually
> rebooted after the first hang at 22:57 and Lustre started the MDT
> recovery. After recovery, the whole system was working 'properly'
> until 23:00, when data started to become inaccessible to the clients.
> Finally, the server hung at 00:15, but the last Lustre log entry is
> from 23:26.
>
>
> Here I can see a different line I have not seen before: "$$$ failed
> to release quota space on glimpse 0!=60826269226353608"
>
> Jan 24 22:57:08 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000:
> Imperative Recovery not enabled, recovery window 300-900
> Jan 24 22:57:08 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000: in
> recovery but waiting for the first client to connect
> Jan 24 22:57:08 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000: Will be
> in recovery for at least 5:00, or until 125 clients reconnect
> Jan 24 22:57:13 srv-lustre11 kernel: LustreError:
> 3949134:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
> permitted during recovery req at 000000003892d67b
> x1812509961058304/t0(0)
> o601->LUSTRE-MDT0000-lwp-OST0c20_UUID at 10.5.33.243@o2ib1:274/0 lens
> 336/0 e 0 to 0 dl 1737755839 ref 1 fl Interpret:/0/ffffffff rc 0/-1
> job:'lquota_wb_LUSTR.0'
> Jan 24 22:57:13 srv-lustre11 kernel: LustreError:
> 3949134:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 1
> previous similar message
> Jan 24 22:57:20 srv-lustre11 kernel: LustreError:
> 3949407:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
> permitted during recovery req at 000000009a279624
> x1812509773308160/t0(0)
> o601->LUSTRE-MDT0000-lwp-OST0fa7_UUID at 10.5.33.244@o2ib1:281/0 lens
> 336/0 e 0 to 0 dl 1737755846 ref 1 fl Interpret:/0/ffffffff rc 0/-1
> job:'lquota_wb_LUSTR.0'
> Jan 24 22:57:20 srv-lustre11 kernel: LustreError:
> 3949407:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 9
> previous similar messages
> Jan 24 22:57:21 srv-lustre11 kernel: LustreError:
> 3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
> permitted during recovery req at 000000000db38b1b
> x1812509961083456/t0(0)
> o601->LUSTRE-MDT0000-lwp-OST0c1e_UUID at 10.5.33.243@o2ib1:282/0 lens
> 336/0 e 0 to 0 dl 1737755847 ref 1 fl Interpret:/0/ffffffff rc 0/-1
> job:'lquota_wb_LUSTR.0'
> Jan 24 22:57:21 srv-lustre11 kernel: LustreError:
> 3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 12
> previous similar messages
> Jan 24 22:57:24 srv-lustre11 kernel: LustreError:
> 3949411:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
> permitted during recovery req at 0000000034e830d1
> x1812509773318336/t0(0)
> o601->LUSTRE-MDT0000-lwp-OST0fa1_UUID at 10.5.33.244@o2ib1:285/0 lens
> 336/0 e 0 to 0 dl 1737755850 ref 1 fl Interpret:/0/ffffffff rc 0/-1
> job:'lquota_wb_LUSTR.0'
> Jan 24 22:57:24 srv-lustre11 kernel: LustreError:
> 3949411:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 8
> previous similar messages
> Jan 24 22:57:30 srv-lustre11 kernel: LustreError:
> 3949406:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
> permitted during recovery req at 00000000e40a36e5
> x1812509961108224/t0(0)
> o601->LUSTRE-MDT0000-lwp-OST0bbc_UUID at 10.5.33.243@o2ib1:291/0 lens
> 336/0 e 0 to 0 dl 1737755856 ref 1 fl Interpret:/0/ffffffff rc 0/-1
> job:'lquota_wb_LUSTR.0'
> Jan 24 22:57:30 srv-lustre11 kernel: LustreError:
> 3949406:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 24
> previous similar messages
> Jan 24 22:57:38 srv-lustre11 kernel: LustreError:
> 3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
> permitted during recovery req at 000000004a78941b
> x1812509961124480/t0(0)
> o601->LUSTRE-MDT0000-lwp-OST0c1d_UUID at 10.5.33.243@o2ib1:299/0 lens
> 336/0 e 0 to 0 dl 1737755864 ref 1 fl Interpret:/0/ffffffff rc 0/-1
> job:'lquota_wb_LUSTR.0'
> Jan 24 22:57:38 srv-lustre11 kernel: LustreError:
> 3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 57
> previous similar messages
> Jan 24 22:57:57 srv-lustre11 kernel: LustreError:
> 3949482:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
> permitted during recovery req at 000000002220d707
> x1812509773390720/t0(0)
> o601->LUSTRE-MDT0000-lwp-OST139c_UUID at 10.5.33.244@o2ib1:318/0 lens
> 336/0 e 0 to 0 dl 1737755883 ref 1 fl Interpret:/0/ffffffff rc 0/-1
> job:'lquota_wb_LUSTR.0'
> Jan 24 22:57:57 srv-lustre11 kernel: LustreError:
> 3949482:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 99
> previous similar messages
> Jan 24 22:58:15 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000: Recovery
> over after 1:07, of 125 clients 125 recovered and 0 were evicted.
> Jan 24 22:58:50 srv-lustre11 kernel: LustreError:
> 3949159:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much!
> uuid:LUSTRE-MDT0000-lwp-OST0bc4_UUID release: 60826269226353608
> granted:66040, total:13781524 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2949
> enforced:1 hard:62914560 soft:52428800 granted:13781524 time:0 qunit:
> 262144 edquot:0 may_rel:0 revoke:0 default:yes
> Jan 24 22:58:50 srv-lustre11 kernel: LustreError:
> 3949159:0:(qmt_lock.c:425:qmt_lvbo_update()) *$$$ failed to release
> quota space on glimpse 0!=60826269226353608* : rc = -22#012
> qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2949 enforced:1 hard:62914560
> soft:52428800 granted:13781524 time:0 qunit: 262144 edquot:0 may_rel:0
> revoke:0 default:yes
> Jan 24 23:08:52 srv-lustre11 kernel: Lustre:
> LUSTRE-OST1389-osc-MDT0000: Connection restored to 10.5.33.245 at o2ib1
> (at 10.5.33.245 at o2ib1)
> Jan 24 23:09:39 srv-lustre11 kernel: Lustre:
> LUSTRE-OST138b-osc-MDT0000: Connection restored to 10.5.33.245 at o2ib1
> (at 10.5.33.245 at o2ib1)
> Jan 24 23:10:24 srv-lustre11 kernel: LustreError: 11-0:
> LUSTRE-OST138d-osc-MDT0000: operation ost_connect to node
> 10.5.33.245 at o2ib1 failed: rc = -19
> Jan 24 23:10:32 srv-lustre11 kernel: LustreError: 11-0:
> LUSTRE-OST13ef-osc-MDT0000: operation ost_connect to node
> 10.5.33.245 at o2ib1 failed: rc = -19
> Jan 24 23:10:32 srv-lustre11 kernel: LustreError: Skipped 5 previous
> similar messages
> Jan 24 23:11:18 srv-lustre11 kernel: Lustre:
> LUSTRE-OST138d-osc-MDT0000: Connection restored to 10.5.33.245 at o2ib1
> (at 10.5.33.245 at o2ib1)
> Jan 24 23:11:25 srv-lustre11 kernel: Lustre:
> LUSTRE-OST138a-osc-MDT0000: Connection restored to 10.5.33.245 at o2ib1
> (at 10.5.33.245 at o2ib1)
> Jan 24 23:12:09 srv-lustre11 kernel: LustreError: 11-0:
> LUSTRE-OST1390-osc-MDT0000: operation ost_connect to node
> 10.5.33.244 at o2ib1 failed: rc = -19
> Jan 24 23:12:09 srv-lustre11 kernel: LustreError: Skipped 3 previous
> similar messages
> Jan 24 23:12:09 srv-lustre11 kernel: Lustre:
> LUSTRE-OST138e-osc-MDT0000: Connection restored to 10.5.33.245 at o2ib1
> (at 10.5.33.245 at o2ib1)
> Jan 24 23:12:23 srv-lustre11 kernel: LustreError: 11-0:
> LUSTRE-OST13ef-osc-MDT0000: operation ost_connect to node
> 10.5.33.245 at o2ib1 failed: rc = -19
> Jan 24 23:12:23 srv-lustre11 kernel: LustreError: Skipped 3 previous
> similar messages
> Jan 24 23:12:58 srv-lustre11 kernel: Lustre:
> LUSTRE-OST138f-osc-MDT0000: Connection restored to 10.5.33.245 at o2ib1
> (at 10.5.33.245 at o2ib1)
> Jan 24 23:13:46 srv-lustre11 kernel: Lustre:
> LUSTRE-OST1390-osc-MDT0000: Connection restored to 10.5.33.245 at o2ib1
> (at 10.5.33.245 at o2ib1)
> Jan 24 23:13:46 srv-lustre11 kernel: Lustre: Skipped 1 previous
> similar message
> Jan 24 23:14:35 srv-lustre11 kernel: Lustre:
> LUSTRE-OST1391-osc-MDT0000: Connection restored to 10.5.33.245 at o2ib1
> (at 10.5.33.245 at o2ib1)
> Jan 24 23:14:36 srv-lustre11 kernel: LustreError: 11-0:
> LUSTRE-OST1392-osc-MDT0000: operation ost_connect to node
> 10.5.33.245 at o2ib1 failed: rc = -19
> Jan 24 23:14:36 srv-lustre11 kernel: LustreError: Skipped 3 previous
> similar messages
> Jan 24 23:16:48 srv-lustre11 kernel: LustreError: 11-0:
> LUSTRE-OST13ef-osc-MDT0000: operation ost_connect to node
> 10.5.33.244 at o2ib1 failed: rc = -19
> Jan 24 23:16:48 srv-lustre11 kernel: LustreError: Skipped 4 previous
> similar messages
> Jan 24 23:17:02 srv-lustre11 kernel: Lustre:
> LUSTRE-OST03f3-osc-MDT0000: Connection restored to 10.5.33.245 at o2ib1
> (at 10.5.33.245 at o2ib1)
> Jan 24 23:17:02 srv-lustre11 kernel: Lustre: Skipped 1 previous
> similar message
> Jan 24 23:19:33 srv-lustre11 kernel: Lustre:
> LUSTRE-OST03f6-osc-MDT0000: Connection restored to 10.5.33.245 at o2ib1
> (at 10.5.33.245 at o2ib1)
> Jan 24 23:19:33 srv-lustre11 kernel: Lustre: Skipped 2 previous
> similar messages
> Jan 24 23:19:41 srv-lustre11 kernel: LustreError: 11-0:
> LUSTRE-OST13ed-osc-MDT0000: operation ost_connect to node
> 10.5.33.244 at o2ib1 failed: rc = -19
> Jan 24 23:19:41 srv-lustre11 kernel: LustreError: Skipped 3 previous
> similar messages
> Jan 24 23:22:11 srv-lustre11 kernel: LustreError: 11-0:
> LUSTRE-OST13ed-osc-MDT0000: operation ost_connect to node
> 10.5.33.245 at o2ib1 failed: rc = -19
> Jan 24 23:22:11 srv-lustre11 kernel: LustreError: Skipped 3 previous
> similar messages
> Jan 24 23:23:59 srv-lustre11 kernel: LustreError: 11-0:
> LUSTRE-OST13ed-osc-MDT0000: operation ost_connect to node
> 10.5.33.245 at o2ib1 failed: rc = -19
> Jan 24 23:23:59 srv-lustre11 kernel: LustreError: Skipped 3 previous
> similar messages
> Jan 24 23:24:29 srv-lustre11 kernel: Lustre:
> LUSTRE-OST03fc-osc-MDT0000: Connection restored to 10.5.33.245 at o2ib1
> (at 10.5.33.245 at o2ib1)
> Jan 24 23:24:29 srv-lustre11 kernel: Lustre: Skipped 5 previous
> similar messages
> Jan 24 23:26:33 srv-lustre11 kernel: LustreError: 11-0:
> LUSTRE-OST13f0-osc-MDT0000: operation ost_connect to node
> 10.5.33.244 at o2ib1 failed: rc = -19
> Jan 24 23:26:33 srv-lustre11 kernel: LustreError: Skipped 5 previous
> similar messages
>
>
>
> Thanks.
>
> Jose.
>
>
> On 21/01/2025 at 10:34, Jose Manuel Martínez García wrote:
>>
>> Hello everybody.
>>
>>
>> I am dealing with an issue on a relatively new Lustre installation.
>> The Metadata Server (MDS) hangs randomly: it can take anywhere from
>> 30 minutes to 30 days, but it always ends up hanging without a
>> consistent pattern (at least, I haven't found one). The logs don't
>> show anything unusual at the time of the failure. The only thing I
>> continuously see are these messages:
>>
>> [lun ene 20 14:17:10 2025] LustreError:
>> 7068:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with
>> -22, flags:0x4 qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1
>> granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636
>> qunit:262144 qtune:65536 edquot:0 default:yes
>> [lun ene 20 14:17:10 2025] LustreError:
>> 7068:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous
>> similar messages
>> [lun ene 20 14:21:52 2025] LustreError:
>> 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much!
>> uuid:LUSTRE-MDT0000-lwp-OST0c1f_UUID release: 15476132855418716160
>> granted:262144, total:14257500 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2582
>> enforced:1 hard:62914560 soft:52428800 granted:14257500 time:0 qunit:
>> 262144 edquot:0 may_rel:0 revoke:0 default:yes
>> [lun ene 20 14:21:52 2025] LustreError:
>> 1947381:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much!
>> uuid:LUSTRE-MDT0000-lwp-OST0fb2_UUID release: 13809297465413342331
>> granted:66568, total:14179564 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2325
>> enforced:1 hard:62914560 soft:52428800 granted:14179564 time:0 qunit:
>> 262144 edquot:0 may_rel:0 revoke:0 default:yes
>> [lun ene 20 14:21:52 2025] LustreError:
>> 1947381:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 802 previous
>> similar messages
>> [lun ene 20 14:21:52 2025] LustreError:
>> 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 802 previous
>> similar messages
>> [lun ene 20 14:27:24 2025] LustreError:
>> 7047:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with
>> -22, flags:0x4 qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1
>> granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636
>> qunit:262144 qtune:65536 edquot:0 default:yes
>> [lun ene 20 14:27:24 2025] LustreError:
>> 7047:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous
>> similar messages
>> [lun ene 20 14:31:52 2025] LustreError:
>> 1844354:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much!
>> uuid:LUSTRE-MDT0000-lwp-OST1399_UUID release: 12882711387029922688
>> granted:66116, total:14078012 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2586
>> enforced:1 hard:62914560 soft:52428800 granted:14078012 time:0 qunit:
>> 262144 edquot:0 may_rel:0 revoke:0 default:yes
>> [lun ene 20 14:31:52 2025] LustreError:
>> 1844354:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 785 previous
>> similar messages
>> [lun ene 20 14:37:39 2025] LustreError:
>> 7054:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with
>> -22, flags:0x4 qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1
>> granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636
>> qunit:262144 qtune:65536 edquot:0 default:yes
>> [lun ene 20 14:37:39 2025] LustreError:
>> 7054:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous
>> similar messages
>> [lun ene 20 14:41:54 2025] LustreError:
>> 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much!
>> uuid:LUSTRE-MDT0000-lwp-OST0faa_UUID release: 13811459193234480169
>> granted:65632, total:14179564 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2325
>> enforced:1 hard:62914560 soft:52428800 granted:14179564 time:0 qunit:
>> 262144 edquot:0 may_rel:0 revoke:0 default:yes
>> [lun ene 20 14:41:54 2025] LustreError:
>> 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 798 previous
>> similar messages
>> [lun ene 20 14:47:53 2025] LustreError:
>> 7052:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with
>> -22, flags:0x4 qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1
>> granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636
>> qunit:262144 qtune:65536 edquot:0 default:yes
>> [lun ene 20 14:47:53 2025] LustreError:
>> 7052:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous
>> similar messages
>>
>> I have ruled out hardware failure since the MDS service has been
>> moved between different servers, and it happens with all of them.
>>
>> Linux distribution: AlmaLinux release 8.10 (Cerulean Leopard)
>> Kernel: Linux srv-lustre15 4.18.0-553.5.1.el8_lustre.x86_64 #1 SMP
>> Fri Jun 28 18:44:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
>> Lustre release: lustre-2.15.5-1.el8.x86_64
>> Not using ZFS.
>>
>> Any ideas on where to continue investigating?
>> Is the error appearing in dmesg a bug, or is it corruption in the
>> quota database?
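>>
>> For reference, what I am checking on the quota side is roughly the
>> following. The qmt parameter name is adapted from the glb-usr examples
>> in the manual (I am not sure of the exact name for project quotas) and
>> /mnt/lustre is a placeholder mount point, so treat this only as a sketch:
>>
>>   # On the MDS, dump the quota master's view of the dt pool seen in
>>   # the errors (qmt:LUSTRE-QMT0000 pool:dt-0x0)
>>   lctl get_param qmt.LUSTRE-QMT0000.dt-0x0.glb-prj
>>   # Compare against the accounted usage for the project id seen in
>>   # the errors (id:2325)
>>   lfs quota -p 2325 /mnt/lustre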
>>
>> The quota bugs that might be related seem to be already fixed in
>> version 2.15.
>>
>>
>> Thanks in advance.
>>
>
--
Jose Manuel Martínez García
Coordinador de Sistemas
Supercomputación de Castilla y León
Tel: 987 293 174
Edificio CRAI-TIC, Campus de Vegazana, s/n Universidad de León - 24071
León, España
<https://www.scayle.es/>