[lustre-discuss] 2.16.1 ptlrpcd infinite loop when machine runs out of RAM
Lewis Hyatt
lhyatt at gmail.com
Tue Feb 4 15:45:33 PST 2025
Hello-
I am running a Lustre 2.16.1 client
(g:718a3fd9c78507a589a657ec3745b4c553f66633) on Ubuntu 24.04.1 LTS
(kernel 6.8.0-51). The client uses InfiniBand with the in-kernel OFED
drivers.
I am seeing a mostly reproducible (though timing-dependent) issue that
occurs when a client node runs out of memory. When it happens, I see
messages like the following:
2025-02-04T18:02:01.300296-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11174853
2025-02-04T18:02:01.311854-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11174453
2025-02-04T18:02:01.417172-05:00 hostname kernel: message repeated 14 times: [ obd_memory max: 1282184920, obd_memory current: 11174453]
2025-02-04T18:02:01.417173-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11174021
2025-02-04T18:02:01.417184-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11173805
These messages are output to the syslog forever, with ptlrpcd spinning
at 100% CPU, while Lustre client processes that were waiting on I/O
remain hung.
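As far as I can tell, the obd_memory lines are printed by an OOM
notifier that obdclass registers with the kernel, so they show up each
time the allocator enters the OOM path. Just to illustrate the
mechanism I mean (a minimal sketch, not the actual Lustre code; the
names are made up):

/* Minimal sketch of a module hooking the kernel OOM notifier chain.
 * out_of_memory() runs the chain before picking a victim, so the
 * callback gets to log its own accounting every time OOM is entered. */
#include <linux/module.h>
#include <linux/notifier.h>
#include <linux/oom.h>
#include <linux/atomic.h>
#include <linux/printk.h>

static atomic64_t demo_memory;          /* stand-in for the obd_memory counter */
static atomic64_t demo_memory_max;

static int demo_oom_notify(struct notifier_block *self,
                           unsigned long unused, void *freed)
{
        pr_info("demo_memory max: %lld, demo_memory current: %lld\n",
                (long long)atomic64_read(&demo_memory_max),
                (long long)atomic64_read(&demo_memory));
        return NOTIFY_OK;       /* nothing reclaimed, *freed stays 0 */
}

static struct notifier_block demo_oom_nb = {
        .notifier_call = demo_oom_notify,
};

static int __init demo_init(void)
{
        return register_oom_notifier(&demo_oom_nb);
}

static void __exit demo_exit(void)
{
        unregister_oom_notifier(&demo_oom_nb);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

That at least explains where the lines come from; the part I don't
understand is why they never stop.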
In this instance the Linux oom-killer was also invoked and killed a
process; its output appeared a bit later, after the first messages
from obdclass_oom_handler:
2025-02-04T18:02:01.420654-05:00 hostname kernel: awk invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
2025-02-04T18:02:01.420663-05:00 hostname kernel: CPU: 1 PID: 49999 Comm: awk Tainted: G OE 6.8.0-51-generic #52-Ubuntu
2025-02-04T18:02:01.420664-05:00 hostname kernel: Hardware name: Supermicro X8DTU-6+/X8DTU-6+, BIOS 2.1a 07/08/11
2025-02-04T18:02:01.420665-05:00 hostname kernel: Call Trace:
2025-02-04T18:02:01.420665-05:00 hostname kernel: <TASK>
2025-02-04T18:02:01.420666-05:00 hostname kernel: dump_stack_lvl+0x76/0xa0
2025-02-04T18:02:01.420666-05:00 hostname kernel: dump_stack+0x10/0x20
2025-02-04T18:02:01.420667-05:00 hostname kernel: dump_header+0x47/0x1f0
2025-02-04T18:02:01.420668-05:00 hostname kernel: oom_kill_process+0x118/0x280
2025-02-04T18:02:01.420668-05:00 hostname kernel: ? oom_evaluate_task+0x143/0x1e0
2025-02-04T18:02:01.420669-05:00 hostname kernel: out_of_memory+0x103/0x350
2025-02-04T18:02:01.420669-05:00 hostname kernel: __alloc_pages_may_oom+0x10c/0x1d0
2025-02-04T18:02:01.420670-05:00 hostname kernel: __alloc_pages_slowpath.constprop.0+0x420/0x9f0
2025-02-04T18:02:01.420670-05:00 hostname kernel: __alloc_pages+0x31f/0x350
2025-02-04T18:02:01.420671-05:00 hostname kernel: alloc_pages_mpol+0x91/0x210
2025-02-04T18:02:01.420671-05:00 hostname kernel: folio_alloc+0x64/0x120
2025-02-04T18:02:01.420672-05:00 hostname kernel: ? filemap_get_entry+0xe5/0x160
2025-02-04T18:02:01.420672-05:00 hostname kernel: filemap_alloc_folio+0xf4/0x100
2025-02-04T18:02:01.420673-05:00 hostname kernel: __filemap_get_folio+0x14b/0x2f0
2025-02-04T18:02:01.420673-05:00 hostname kernel: filemap_fault+0x15c/0x8e0
2025-02-04T18:02:01.420674-05:00 hostname kernel: __do_fault+0x38/0x140
2025-02-04T18:02:01.420674-05:00 hostname kernel: do_read_fault+0x133/0x1d0
2025-02-04T18:02:01.420675-05:00 hostname kernel: do_fault+0x109/0x350
2025-02-04T18:02:01.420675-05:00 hostname kernel: handle_pte_fault+0x114/0x1d0
2025-02-04T18:02:01.420676-05:00 hostname kernel: __handle_mm_fault+0x653/0x790
2025-02-04T18:02:01.420677-05:00 hostname kernel: handle_mm_fault+0x18a/0x380
2025-02-04T18:02:01.420677-05:00 hostname kernel: do_user_addr_fault+0x169/0x670
2025-02-04T18:02:01.420678-05:00 hostname kernel: exc_page_fault+0x83/0x1b0
2025-02-04T18:02:01.420678-05:00 hostname kernel: asm_exc_page_fault+0x27/0x30
2025-02-04T18:02:01.420705-05:00 hostname kernel: RIP: 0033:0x44b932
2025-02-04T18:02:01.420707-05:00 hostname kernel: Code: Unable to access opcode bytes at 0x44b908.
2025-02-04T18:02:01.420707-05:00 hostname kernel: RSP: 002b:00007ffca65a1ac0 EFLAGS: 00010246
2025-02-04T18:02:01.420709-05:00 hostname kernel: RAX: 0000000039a2b350 RBX: 00000000399e1650 RCX: 373130325f646573
2025-02-04T18:02:01.420709-05:00 hostname kernel: RDX: 00000000399d51d0 RSI: 325f6465736f6c63 RDI: 00000000399d51d0
2025-02-04T18:02:01.420710-05:00 hostname kernel: RBP: 00007ffca65a1b70 R08: 00000000399cb010 R09: 0000000000000007
2025-02-04T18:02:01.420710-05:00 hostname kernel: R10: 00000000399d51d0 R11: fe1f64ced69f9b4f R12: 0000000000000000
2025-02-04T18:02:01.420711-05:00 hostname kernel: R13: 00000000004a3fa0 R14: 0000000039a0b080 R15: 0000000039a2b2e0
2025-02-04T18:02:01.420711-05:00 hostname kernel: </TASK>
2025-02-04T18:02:01.420712-05:00 hostname kernel: Mem-Info:
2025-02-04T18:02:01.420713-05:00 hostname kernel: active_anon:10285863 inactive_anon:1280438 isolated_anon:0
2025-02-04T18:02:01.420713-05:00 hostname kernel: active_file:312 inactive_file:750 isolated_file:0
2025-02-04T18:02:01.420714-05:00 hostname kernel: unevictable:41500 dirty:34 writeback:0
2025-02-04T18:02:01.420715-05:00 hostname kernel: slab_reclaimable:21229 slab_unreclaimable:33598
2025-02-04T18:02:01.420715-05:00 hostname kernel: mapped:12223 shmem:1218 pagetables:24740
2025-02-04T18:02:01.420715-05:00 hostname kernel: sec_pagetables:0 bounce:0
2025-02-04T18:02:01.420716-05:00 hostname kernel: kernel_misc_reclaimable:0
2025-02-04T18:02:01.420716-05:00 hostname kernel: free:64845 free_pcp:0 free_cma:0
2025-02-04T18:02:01.420717-05:00 hostname kernel: Node 0 active_anon:41143452kB inactive_anon:5121752kB active_file:1572kB inactive_file:2676kB unevictable:166000kB isolated(anon):0kB isolated(file):0kB mapped:48892kB dirty:136kB writeback:0kB shmem:4872kB shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:0kB writeback_tmp:0kB kernel_stack:7360kB pagetables:98960kB sec_pagetables:0kB all_unreclaimable? no
2025-02-04T18:02:01.420721-05:00 hostname kernel: Node 0 DMA free:11264kB boost:0kB min:20kB low:32kB high:44kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15964kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
2025-02-04T18:02:01.420722-05:00 hostname kernel: lowmem_reserve[]: 0 2926 48083 48083 48083
2025-02-04T18:02:01.420723-05:00 hostname kernel: Node 0 DMA32 free:184676kB boost:0kB min:4112kB low:7108kB high:10104kB reserved_highatomic:0KB active_anon:2755536kB inactive_anon:107204kB active_file:0kB inactive_file:732kB unevictable:0kB writepending:4kB present:3120640kB managed:3055104kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
2025-02-04T18:02:01.420723-05:00 hostname kernel: lowmem_reserve[]: 0 0 45156 45156 45156
2025-02-04T18:02:01.420724-05:00 hostname kernel: Node 0 Normal free:63440kB boost:0kB min:63448kB low:109688kB high:155928kB reserved_highatomic:0KB active_anon:38387492kB inactive_anon:5014972kB active_file:1288kB inactive_file:2392kB unevictable:166000kB writepending:0kB present:47185920kB managed:46247332kB mlocked:162928kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
2025-02-04T18:02:01.420725-05:00 hostname kernel: lowmem_reserve[]: 0 0 0 0 0
2025-02-04T18:02:01.420725-05:00 hostname kernel: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
2025-02-04T18:02:01.420726-05:00 hostname kernel: Node 0 DMA32: 35*4kB (UME) 60*8kB (UME) 127*16kB (UME) 119*32kB (UME) 117*64kB (UME) 91*128kB (UME) 83*256kB (UME) 65*512kB (UME) 40*1024kB (UME) 25*2048kB (UME) 3*4096kB (UE) = 184572kB
2025-02-04T18:02:01.420727-05:00 hostname kernel: Node 0 Normal: 1394*4kB (UME) 1801*8kB (UME) 1236*16kB (UME) 566*32kB (UME) 97*64kB (UM) 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 64208kB
2025-02-04T18:02:01.420728-05:00 hostname kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
2025-02-04T18:02:01.420728-05:00 hostname kernel: Node 0 hugepages_total=1024 hugepages_free=1024 hugepages_surp=0 hugepages_size=2048kB
2025-02-04T18:02:01.420729-05:00 hostname kernel: 14516 total pagecache pages
2025-02-04T18:02:01.420729-05:00 hostname kernel: 0 pages in swap cache
2025-02-04T18:02:01.420730-05:00 hostname kernel: Free swap = 13631420kB
2025-02-04T18:02:01.420730-05:00 hostname kernel: Total swap = 13631420kB
2025-02-04T18:02:01.420730-05:00 hostname kernel: 12580631 pages RAM
2025-02-04T18:02:01.420731-05:00 hostname kernel: 0 pages HighMem/MovableOnly
2025-02-04T18:02:01.420731-05:00 hostname kernel: 251182 pages reserved
2025-02-04T18:02:01.420732-05:00 hostname kernel: 0 pages hwpoisoned
...
2025-02-04T18:02:01.420887-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11195973
2025-02-04T18:02:01.421187-05:00 hostname kernel: message repeated 269 times: [ obd_memory max: 1282184920, obd_memory current: 11195973]
2025-02-04T18:02:01.421187-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11196765
2025-02-04T18:02:01.421192-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11200645
2025-02-04T18:02:01.421193-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11217029
2025-02-04T18:02:01.421194-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11217029
2025-02-04T18:02:01.421195-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11217429
2025-02-04T18:02:01.421195-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11217429
2025-02-04T18:02:01.421196-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11216645
2025-02-04T18:02:01.421197-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11197597
2025-02-04T18:02:01.421198-05:00 hostname kernel: obd_memory max: 1282184920, obd_memory current: 11208125
It doesn't happen in every test (presumably the out-of-memory
situation has to arise at a certain point in the execution), but I can
make it happen pretty reliably by using up all the RAM while a process
is reading from the Lustre filesystem; a rough reproducer sketch is
below. Is this a known issue already? Thanks!
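For reference, something along these lines is enough to trigger it on
my setup (the path and chunk sizes are illustrative, not exactly what
I run):

/* Reproducer sketch: keep one process reading from the Lustre mount
 * while another dirties anonymous memory until the machine runs out
 * of RAM. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/mnt/lustre/bigfile";   /* assumed Lustre client mount */
        int fd = open(path, O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        if (fork() == 0) {
                /* Child: read continuously so the client has I/O in
                 * flight when the OOM condition hits. */
                static char buf[1 << 20];
                for (;;) {
                        if (read(fd, buf, sizeof(buf)) <= 0)
                                lseek(fd, 0, SEEK_SET);   /* wrap at EOF/error */
                }
        }

        /* Parent: allocate and touch memory until the oom-killer fires. */
        for (;;) {
                size_t chunk = 256UL << 20;               /* 256 MiB per step */
                char *p = malloc(chunk);
                if (!p) {
                        sleep(1);
                        continue;
                }
                memset(p, 0xa5, chunk);                   /* fault the pages in */
        }
        return 0;
}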
-Lewis