[lustre-discuss] MDS hangs for no apparent reason
Jose Manuel Martínez García
jose.martinez at scayle.es
Tue Jan 21 01:34:10 PST 2025
Hello everybody.
I am dealing with an issue on a relatively new Lustre installation.
The Metadata Server (MDS) hangs at random. It can run for anywhere from
30 minutes to 30 days before it hangs, and so far I have not found any
common pattern.
The logs don't show anything unusual at the time of the failure. The
only messages I see continuously are these:
[lun ene 20 14:17:10 2025] LustreError:
7068:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with
-22, flags:0x4 qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 granted:
16304159618662232032 pending:0 waiting:0 req:1 usage: 114636
qunit:262144 qtune:65536 edquot:0 default:yes
[lun ene 20 14:17:10 2025] LustreError:
7068:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous
similar messages
[lun ene 20 14:21:52 2025] LustreError:
1895328:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much!
uuid:LUSTRE-MDT0000-lwp-OST0c1f_UUID release: 15476132855418716160
granted:262144, total:14257500 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2582
enforced:1 hard:62914560 soft:52428800 granted:14257500 time:0 qunit:
262144 edquot:0 may_rel:0 revoke:0 default:yes
[lun ene 20 14:21:52 2025] LustreError:
1947381:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much!
uuid:LUSTRE-MDT0000-lwp-OST0fb2_UUID release: 13809297465413342331
granted:66568, total:14179564 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2325
enforced:1 hard:62914560 soft:52428800 granted:14179564 time:0 qunit:
262144 edquot:0 may_rel:0 revoke:0 default:yes
[lun ene 20 14:21:52 2025] LustreError:
1947381:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 802 previous similar
messages
[lun ene 20 14:21:52 2025] LustreError:
1895328:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 802 previous similar
messages
[lun ene 20 14:27:24 2025] LustreError:
7047:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with
-22, flags:0x4 qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 granted:
16304159618662232032 pending:0 waiting:0 req:1 usage: 114636
qunit:262144 qtune:65536 edquot:0 default:yes
[lun ene 20 14:27:24 2025] LustreError:
7047:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous
similar messages
[lun ene 20 14:31:52 2025] LustreError:
1844354:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much!
uuid:LUSTRE-MDT0000-lwp-OST1399_UUID release: 12882711387029922688
granted:66116, total:14078012 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2586
enforced:1 hard:62914560 soft:52428800 granted:14078012 time:0 qunit:
262144 edquot:0 may_rel:0 revoke:0 default:yes
[lun ene 20 14:31:52 2025] LustreError:
1844354:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 785 previous similar
messages
[lun ene 20 14:37:39 2025] LustreError:
7054:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with
-22, flags:0x4 qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 granted:
16304159618662232032 pending:0 waiting:0 req:1 usage: 114636
qunit:262144 qtune:65536 edquot:0 default:yes
[lun ene 20 14:37:39 2025] LustreError:
7054:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous
similar messages
[lun ene 20 14:41:54 2025] LustreError:
1895328:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much!
uuid:LUSTRE-MDT0000-lwp-OST0faa_UUID release: 13811459193234480169
granted:65632, total:14179564 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2325
enforced:1 hard:62914560 soft:52428800 granted:14179564 time:0 qunit:
262144 edquot:0 may_rel:0 revoke:0 default:yes
[lun ene 20 14:41:54 2025] LustreError:
1895328:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 798 previous similar
messages
[lun ene 20 14:47:53 2025] LustreError:
7052:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with
-22, flags:0x4 qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 granted:
16304159618662232032 pending:0 waiting:0 req:1 usage: 114636
qunit:262144 qtune:65536 edquot:0 default:yes
[lun ene 20 14:47:53 2025] LustreError:
7052:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous
similar messages
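To make the messages above easier to compare, here is a minimal Python sketch that pulls the numeric key:value fields out of one LustreError line, names the error code, and flags counters that do not fit in a signed 64-bit integer (as the "granted:" and "release:" values above do not). The function names are hypothetical, and the signed 64-bit threshold is only a plausibility heuristic assumed here, not a rule taken from the Lustre sources. It does not say whether the cause is a bug or on-disk quota corruption.

```python
import errno
import re

# Numeric "key:value" pairs such as "id:2325" or "granted: 16304159618662232032".
# The lookbehind skips fragments like "qsd_handler.c:340"; hexadecimal values
# such as "flags:0x4" do not match because of the trailing word boundary.
FIELD_RE = re.compile(r"(?<![\w.])([A-Za-z_]\w*):\s*(-?\d+)\b")
FAILED_RE = re.compile(r"failed with (-?\d+)")

INT64_MAX = 2**63 - 1  # assumed plausibility limit for a quota counter


def parse_fields(message):
    """Return {name: [values]}; a name may repeat (e.g. "granted" in qmt lines)."""
    fields = {}
    for name, value in FIELD_RE.findall(message):
        fields.setdefault(name, []).append(int(value))
    return fields


def suspicious(fields):
    """Return (name, value) pairs that exceed the signed 64-bit range."""
    return [(name, value)
            for name, values in fields.items()
            for value in values
            if value > INT64_MAX]


def to_signed64(value):
    """Reinterpret an unsigned 64-bit value as a signed one (two's complement)."""
    return value - 2**64 if value > INT64_MAX else value


def error_name(message):
    """Return the symbolic errno of a "failed with -N" message, or None."""
    match = FAILED_RE.search(message)
    if match is None:
        return None
    return errno.errorcode.get(abs(int(match.group(1))), "unknown")
```

Feeding each line of saved dmesg output through parse_fields() and suspicious() gives a quick list of which quota ids and OSTs report out-of-range counters; error_name() reports -22 as EINVAL.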
I have ruled out hardware failure, because the MDS service has been moved
between different servers and the hang occurs on all of them.
Linux distribution: AlmaLinux release 8.10 (Cerulean Leopard)
Kernel: Linux srv-lustre15 4.18.0-553.5.1.el8_lustre.x86_64 #1 SMP Fri
Jun 28 18:44:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Lustre release: lustre-2.15.5-1.el8.x86_64
Not using ZFS.
Any ideas on where to continue investigating?
Are the errors appearing in dmesg caused by a bug, or by corruption in
the quota database?
The known quota bugs that might be related seem to be already fixed in
version 2.15.
Thanks in advance.
--
Jose Manuel Martínez García
Systems Coordinator
Supercomputación de Castilla y León
Tel: 987 293 174
Edificio CRAI-TIC, Campus de Vegazana, s/n Universidad de León - 24071
León, España
<https://www.scayle.es/>