[lustre-discuss] MDS Crash due to Null Pointer Dereference in qmt_id_lock_cb within lquota Module
zhliu
zhliu at rcf.rhic.bnl.gov
Mon Jul 8 11:56:21 PDT 2024
Hello,
I am writing to report an issue with our MDS (Lustre 2.15.4) that
resulted in a crash. We have experienced multiple MDS crashes since a
major upgrade from Lustre 2.12 to 2.15.4 several weeks ago. Below are
the details of our system and of the crash.
[root@atlasmds01 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.9 (Ootpa)
[root@atlasmds01 ~]# uname -a
Linux atlasmds01.usatlas.bnl.gov 4.18.0-513.9.1.el8_lustre.x86_64 #1 SMP Sat Dec 23 05:23:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Crash Details:
Our MDS experienced a crash. Investigating the vmcore with the crash
utility, we identified a null pointer dereference in the qmt_id_lock_cb
function within the lquota module. The call trace shows the sequence of
function calls leading to the crash.
Key functions in the trace: __die_body, no_context,
__bad_area_nosemaphore, do_page_fault, page_fault, qmt_id_lock_cb,
lu_env_info, qmt_glimpse_lock, qmt_reba_thread, and kthread.
The backtrace shows the function qmt_id_lock_cb in the lquota module
causing the issue; the functions qmt_glimpse_lock and qmt_reba_thread
are also involved. The crash occurred at qmt_id_lock_cb+0x69/0x100
[lquota].
The relevant part of the backtrace is as follows:
PID: 6472   TASK: ffff894e2d498000   CPU: 8   COMMAND: "qmt_reba_atlas0"
#0 [ffff97fc0891bb20] machine_kexec at ffffffff9d06d8c3
#1 [ffff97fc0891bb78] __crash_kexec at ffffffff9d1b757a
#2 [ffff97fc0891bc38] crash_kexec at ffffffff9d1b84b1
#3 [ffff97fc0891bc50] oops_end at ffffffff9d02be31
#4 [ffff97fc0891bc70] no_context at ffffffff9d07f923
#5 [ffff97fc0891bcc8] __bad_area_nosemaphore at ffffffff9d07fc9c
#6 [ffff97fc0891bd10] do_page_fault at ffffffff9d0808b7
#7 [ffff97fc0891bd40] page_fault at ffffffff9dc0116e
[exception RIP: qmt_id_lock_cb+105]
RIP: ffffffffc180c499 RSP: ffff97fc0891bdf0 RFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff894d1de6d700 RCX: 0000000000000000
RDX: ffff894e2d44dc20 RSI: 0000000000000010 RDI: ffff894abf858063
RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000004
R10: 0000000000000010 R11: f000000000000000 R12: ffff894d1de6d700
R13: ffff89536a9bbb20 R14: ffff894cfdbfa690 R15: ffff894cfdbfa640
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#8 [ffff97fc0891be10] qmt_glimpse_lock at ffffffffc180eba7 [lquota]
#9 [ffff97fc0891beb8] qmt_reba_thread at ffffffffc180ff2d [lquota]
#10 [ffff97fc0891bf10] kthread at ffffffff9d11e974
#11 [ffff97fc0891bf50] ret_from_fork at ffffffff9dc0024f
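In case it helps others reproduce the analysis, the backtrace above and
the disassembly below came from standard crash(8) commands against the
vmcore, roughly as follows (the vmlinux/vmcore paths reflect our setup
and are illustrative):

# crash /usr/lib/debug/lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/vmlinux /var/crash/<dump-dir>/vmcore
crash> mod -s lquota
crash> bt 6472
crash> dis qmt_id_lock_cb

mod -s loads the lquota module's debug symbols so its addresses
resolve, bt 6472 prints the stack of the qmt_reba_atlas0 kthread, and
dis produces the listing excerpted below.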
The disassembly of qmt_id_lock_cb shows that the fault occurs at the
instruction add 0x0(%rbp),%rax: %rbp is NULL, so the load from its
memory operand dereferences address 0x0 and triggers the page fault.
Here is the relevant portion of the disassembly:
0xffffffffc180c490 <qmt_id_lock_cb+96>: movslq 0x4(%rsp),%rax
0xffffffffc180c495 <qmt_id_lock_cb+101>: shl $0x4,%rax
0xffffffffc180c499 <qmt_id_lock_cb+105>: add 0x0(%rbp),%rax   <- Crash here
0xffffffffc180c49d <qmt_id_lock_cb+109>: testb $0xc,0x8(%rax)
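Mapping that back to source (this is our reading, so treat the details
as assumptions): the sequence computes idx * 16 and adds a base pointer
loaded from *(%rbp), i.e. an indexed access into an array of 16-byte
entries whose base pointer sits at offset 0 of whatever structure %rbp
points to. If the 2.15.4 qmt_id_lock_cb is the variant that tests
lqe->lqe_glbl_data->lqeg_arr[idx].lge_qunit_nu / .lge_edquot_nu (the two
bits behind the 0xc mask at offset 0x8), then %rbp == 0 would mean
lqe->lqe_glbl_data was NULL when the reba thread ran the glimpse
callback, e.g. freed or not yet allocated by a concurrent path. A
minimal userspace sketch of that access pattern follows; struct names
and layout are our assumption, not a quote of the exact source:

/* Userspace sketch of the faulting access pattern. Struct names and
 * layout follow our reading of the lquota headers and are assumptions,
 * not a quote of the 2.15.4 source. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct lqe_glbl_entry {
	uint64_t      lge_qunit;          /* offset 0x0 */
	unsigned long lge_edquot:1,       /* bit 0 of the byte at 0x8 */
	              lge_qunit_set:1,    /* bit 1 */
	              lge_qunit_nu:1,     /* bit 2 -> 0x4 */
	              lge_edquot_nu:1;    /* bit 3 -> 0x8; mask 0xc */
};	/* sizeof == 16 on x86_64, hence the "shl $0x4" applied to idx */

struct lqe_glbl_data {
	struct lqe_glbl_entry *lqeg_arr;  /* offset 0x0: the "add 0x0(%rbp),%rax" load */
	int                    lqeg_num_used;
};

/* lgd == NULL turns "lgd->lqeg_arr" into a load from address 0x0, which
 * is the %rbp == 0 page fault in the trace; here we guard against it. */
static bool needs_glimpse(const struct lqe_glbl_data *lgd, int idx)
{
	if (lgd == NULL)
		return false;
	return lgd->lqeg_arr[idx].lge_qunit_nu ||
	       lgd->lqeg_arr[idx].lge_edquot_nu;
}

int main(void)
{
	printf("sizeof(struct lqe_glbl_entry) = %zu\n",
	       sizeof(struct lqe_glbl_entry));
	printf("needs_glimpse(NULL, 4) = %d\n", needs_glimpse(NULL, 4));
	return 0;
}

Compiled on x86_64 this prints sizeof == 16, matching the shl $0x4, and
an unguarded call with a NULL lgd would fault the same way the MDS did.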
We would appreciate any insights or suggestions on how to resolve this
issue. Searching the archives, I did see that similar issues have been
reported, e.g. at
https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg17995.html
and also
https://www.mail-archive.com/search?l=lustre-discuss@lists.lustre.org&q=subject:%22%5C%5BLustre%5C-discuss%5C%5D+MDS%22&o=newest&f=1
(2024-05-05, Lustre 2.15.4).
Jane Liu
at BNL