[lustre-discuss] MDS Crash due to Null Pointer Dereference in qmt_id_lock_cb within lquota Module
zhliu
zhliu at rcf.rhic.bnl.gov
Mon Jul 8 11:56:21 PDT 2024
Hello,
I am writing to report an issue with our MDS (Lustre 2.15.4) that
resulted in a crash. We have experienced multiple MDS crashes since a
major upgrade from Lustre 2.12 to 2.15.4 several weeks ago. Below are
the details of our system and of the crash.
[root@atlasmds01 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.9 (Ootpa)
[root@atlasmds01 ~]# uname -a
Linux atlasmds01.usatlas.bnl.gov 4.18.0-513.9.1.el8_lustre.x86_64 #1 SMP Sat Dec 23 05:23:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Crash Details:
Our MDS experienced a crash. Investigating the vmcore with the crash
utility, we identified a null pointer dereference in the qmt_id_lock_cb
function within the lquota module. The call trace shows the sequence of
function calls leading to the crash.
Key functions in the trace: __die_body, no_context,
__bad_area_nosemaphore, do_page_fault, page_fault, qmt_id_lock_cb,
lu_env_info, qmt_glimpse_lock, qmt_reba_thread, and kthread.
The backtrace shows the function qmt_id_lock_cb in the lquota module
causing the issue; the functions qmt_glimpse_lock and qmt_reba_thread
are also involved. The crash occurred at qmt_id_lock_cb+0x69/0x100
[lquota].
The relevant part of the backtrace is as follows:
PID: 6472   TASK: ffff894e2d498000   CPU: 8   COMMAND: "qmt_reba_atlas0"
#0 [ffff97fc0891bb20] machine_kexec at ffffffff9d06d8c3
#1 [ffff97fc0891bb78] __crash_kexec at ffffffff9d1b757a
#2 [ffff97fc0891bc38] crash_kexec at ffffffff9d1b84b1
#3 [ffff97fc0891bc50] oops_end at ffffffff9d02be31
#4 [ffff97fc0891bc70] no_context at ffffffff9d07f923
#5 [ffff97fc0891bcc8] __bad_area_nosemaphore at ffffffff9d07fc9c
#6 [ffff97fc0891bd10] do_page_fault at ffffffff9d0808b7
#7 [ffff97fc0891bd40] page_fault at ffffffff9dc0116e
[exception RIP: qmt_id_lock_cb+105]
RIP: ffffffffc180c499 RSP: ffff97fc0891bdf0 RFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff894d1de6d700 RCX: 0000000000000000
RDX: ffff894e2d44dc20 RSI: 0000000000000010 RDI: ffff894abf858063
RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000004
R10: 0000000000000010 R11: f000000000000000 R12: ffff894d1de6d700
R13: ffff89536a9bbb20 R14: ffff894cfdbfa690 R15: ffff894cfdbfa640
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#8 [ffff97fc0891be10] qmt_glimpse_lock at ffffffffc180eba7 [lquota]
#9 [ffff97fc0891beb8] qmt_reba_thread at ffffffffc180ff2d [lquota]
#10 [ffff97fc0891bf10] kthread at ffffffff9d11e974
#11 [ffff97fc0891bf50] ret_from_fork at ffffffff9dc0024f
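In case it helps others reproduce the analysis, the backtrace above and
the disassembly below came from standard crash(8) commands against the
vmcore, roughly as follows (the vmlinux/vmcore paths reflect our setup
and are illustrative):

# crash /usr/lib/debug/lib/modules/4.18.0-513.9.1.el8_lustre.x86_64/vmlinux /var/crash/<dump-dir>/vmcore
crash> mod -s lquota
crash> bt 6472
crash> dis qmt_id_lock_cb

mod -s loads the lquota module's debug symbols so its addresses
resolve, bt 6472 prints the stack of the qmt_reba_atlas0 kthread, and
dis produces the listing excerpted below.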
The disassembly of qmt_id_lock_cb shows that the fault occurs at the
instruction add 0x0(%rbp),%rax: %rbp is NULL, so the load from its
memory operand dereferences address 0x0 and triggers the page fault.
Here is the relevant portion of the disassembly:
0xffffffffc180c490 <qmt_id_lock_cb+96>: movslq 0x4(%rsp),%rax
0xffffffffc180c495 <qmt_id_lock_cb+101>: shl $0x4,%rax
0xffffffffc180c499 <qmt_id_lock_cb+105>: add 0x0(%rbp),%rax   <- Crash here
0xffffffffc180c49d <qmt_id_lock_cb+109>: testb $0xc,0x8(%rax)
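Mapping that back to source (this is our reading, so treat the details
as assumptions): the sequence computes idx * 16 and adds a base pointer
loaded from *(%rbp), i.e. an indexed access into an array of 16-byte
entries whose base pointer sits at offset 0 of whatever structure %rbp
points to. If the 2.15.4 qmt_id_lock_cb is the variant that tests
lqe->lqe_glbl_data->lqeg_arr[idx].lge_qunit_nu / .lge_edquot_nu (the two
bits behind the 0xc mask at offset 0x8), then %rbp == 0 would mean
lqe->lqe_glbl_data was NULL when the reba thread ran the glimpse
callback, e.g. freed or not yet allocated by a concurrent path. A
minimal userspace sketch of that access pattern follows; struct names
and layout are our assumption, not a quote of the exact source:

/* Userspace sketch of the faulting access pattern. Struct names and
 * layout follow our reading of the lquota headers and are assumptions,
 * not a quote of the 2.15.4 source. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct lqe_glbl_entry {
	uint64_t      lge_qunit;          /* offset 0x0 */
	unsigned long lge_edquot:1,       /* bit 0 of the byte at 0x8 */
	              lge_qunit_set:1,    /* bit 1 */
	              lge_qunit_nu:1,     /* bit 2 -> 0x4 */
	              lge_edquot_nu:1;    /* bit 3 -> 0x8; mask 0xc */
};	/* sizeof == 16 on x86_64, hence the "shl $0x4" applied to idx */

struct lqe_glbl_data {
	struct lqe_glbl_entry *lqeg_arr;  /* offset 0x0: the "add 0x0(%rbp),%rax" load */
	int                    lqeg_num_used;
};

/* lgd == NULL turns "lgd->lqeg_arr" into a load from address 0x0, which
 * is the %rbp == 0 page fault in the trace; here we guard against it. */
static bool needs_glimpse(const struct lqe_glbl_data *lgd, int idx)
{
	if (lgd == NULL)
		return false;
	return lgd->lqeg_arr[idx].lge_qunit_nu ||
	       lgd->lqeg_arr[idx].lge_edquot_nu;
}

int main(void)
{
	printf("sizeof(struct lqe_glbl_entry) = %zu\n",
	       sizeof(struct lqe_glbl_entry));
	printf("needs_glimpse(NULL, 4) = %d\n", needs_glimpse(NULL, 4));
	return 0;
}

Compiled on x86_64 this prints sizeof == 16, matching the shl $0x4, and
an unguarded call with a NULL lgd would fault the same way the MDS did.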
We would appreciate any insights or suggestions on how to resolve this
issue. Searching the archives, I did see that similar issues have been
reported, e.g. at
https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg17995.html
and also
https://www.mail-archive.com/search?l=lustre-discuss@lists.lustre.org&q=subject:%22%5C%5BLustre%5C-discuss%5C%5D+MDS%22&o=newest&f=1
(2024-05-05, Lustre 2.15.4).
Jane Liu
at BNL