[lustre-discuss] MDS hangs for no apparent reason

Jose Manuel Martínez García jose.martinez at scayle.es
Tue Jan 21 01:34:10 PST 2025


Hello everybody.


I am dealing with an issue on a relatively new Lustre installation. 
The Metadata Server (MDS) hangs at random. It can take anywhere from 
30 minutes to 30 days, but it always ends up hanging, and I have not 
found any consistent pattern. The logs don't show anything unusual at 
the time of the failure. The only messages I see continuously are 
these:

[lun ene 20 14:17:10 2025] LustreError: 
7068:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with 
-22, flags:0x4  qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 granted: 
16304159618662232032 pending:0 waiting:0 req:1 usage: 114636 
qunit:262144 qtune:65536 edquot:0 default:yes
[lun ene 20 14:17:10 2025] LustreError: 
7068:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous 
similar messages
[lun ene 20 14:21:52 2025] LustreError: 
1895328:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much! 
uuid:LUSTRE-MDT0000-lwp-OST0c1f_UUID release: 15476132855418716160 
granted:262144, total:14257500 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2582 
enforced:1 hard:62914560 soft:52428800 granted:14257500 time:0 qunit: 
262144 edquot:0 may_rel:0 revoke:0 default:yes
[lun ene 20 14:21:52 2025] LustreError: 
1947381:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much! 
uuid:LUSTRE-MDT0000-lwp-OST0fb2_UUID release: 13809297465413342331 
granted:66568, total:14179564 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2325 
enforced:1 hard:62914560 soft:52428800 granted:14179564 time:0 qunit: 
262144 edquot:0 may_rel:0 revoke:0 default:yes
[lun ene 20 14:21:52 2025] LustreError: 
1947381:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 802 previous similar 
messages
[lun ene 20 14:21:52 2025] LustreError: 
1895328:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 802 previous similar 
messages
[lun ene 20 14:27:24 2025] LustreError: 
7047:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with 
-22, flags:0x4  qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 granted: 
16304159618662232032 pending:0 waiting:0 req:1 usage: 114636 
qunit:262144 qtune:65536 edquot:0 default:yes
[lun ene 20 14:27:24 2025] LustreError: 
7047:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous 
similar messages
[lun ene 20 14:31:52 2025] LustreError: 
1844354:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much! 
uuid:LUSTRE-MDT0000-lwp-OST1399_UUID release: 12882711387029922688 
granted:66116, total:14078012 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2586 
enforced:1 hard:62914560 soft:52428800 granted:14078012 time:0 qunit: 
262144 edquot:0 may_rel:0 revoke:0 default:yes
[lun ene 20 14:31:52 2025] LustreError: 
1844354:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 785 previous similar 
messages
[lun ene 20 14:37:39 2025] LustreError: 
7054:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with 
-22, flags:0x4  qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 granted: 
16304159618662232032 pending:0 waiting:0 req:1 usage: 114636 
qunit:262144 qtune:65536 edquot:0 default:yes
[lun ene 20 14:37:39 2025] LustreError: 
7054:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous 
similar messages
[lun ene 20 14:41:54 2025] LustreError: 
1895328:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much! 
uuid:LUSTRE-MDT0000-lwp-OST0faa_UUID release: 13811459193234480169 
granted:65632, total:14179564 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2325 
enforced:1 hard:62914560 soft:52428800 granted:14179564 time:0 qunit: 
262144 edquot:0 may_rel:0 revoke:0 default:yes
[lun ene 20 14:41:54 2025] LustreError: 
1895328:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 798 previous similar 
messages
[lun ene 20 14:47:53 2025] LustreError: 
7052:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with 
-22, flags:0x4  qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 granted: 
16304159618662232032 pending:0 waiting:0 req:1 usage: 114636 
qunit:262144 qtune:65536 edquot:0 default:yes
[lun ene 20 14:47:53 2025] LustreError: 
7052:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous 
similar messages

I have ruled out hardware failure, since the MDS service has been moved 
between different servers and the hang occurs on all of them.

Linux distribution: AlmaLinux release 8.10 (Cerulean Leopard)
Kernel: Linux srv-lustre15 4.18.0-553.5.1.el8_lustre.x86_64 #1 SMP Fri 
Jun 28 18:44:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Lustre release: lustre-2.15.5-1.el8.x86_64
Not using ZFS.

Any ideas on where to continue investigating?
Is the error appearing in dmesg a bug, or is it a corruption in the 
quota database?

The quota bugs I have found that might be related appear to be already 
fixed in version 2.15.
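
One thing that stands out in the messages above is the size of the "granted:" and "release:" values (for example 16304159618662232032), which are far larger than the hard limit shown in the same lines. A minimal sketch, assuming these fields are 64-bit counters printed as unsigned (an assumption, not something confirmed from the Lustre source): the helper names below are hypothetical, and the script only reinterprets the logged numbers as signed 64-bit values to see whether they look like a counter that went negative.

```python
import re

U64 = 1 << 64


def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as two's-complement signed."""
    return value - U64 if value >= (1 << 63) else value


# Matches "granted: 123", "granted:123", "release: 123", and so on.
FIELD_RE = re.compile(r"\b(granted|release):\s*(\d+)")


def suspicious_fields(line: str):
    """Yield (field, raw, signed) for values that are negative as int64."""
    for name, raw in FIELD_RE.findall(line):
        value = int(raw)
        signed = as_signed64(value)
        if signed < 0:
            yield name, value, signed


# One of the dmesg entries from this post, joined onto a single line.
sample = (
    "DQACQ failed with -22, flags:0x4 qsd:LUSTRE-OST138f qtype:prj id:2325 "
    "enforced:1 granted: 16304159618662232032 pending:0 waiting:0 req:1 "
    "usage: 114636 qunit:262144 qtune:65536 edquot:0 default:yes"
)

for name, raw, signed in suspicious_fields(sample):
    print(f"{name}: {raw} -> {signed} as signed 64-bit")
```

If the values do turn out negative when read this way, that would be consistent with an underflowed quota counter, but it does not by itself distinguish a bug from on-disk corruption.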


Thanks in advance.

-- 
Jose Manuel Martínez García
Systems Coordinator
Supercomputación de Castilla y León
Tel: 987 293 174
Edificio CRAI-TIC, Campus de Vegazana, s/n, Universidad de León, 24071 
León, Spain

<https://www.scayle.es/>
