[lustre-devel] lustre:pcc: Sanity-pcc 7a test hang(Both on Aarch64 and X86_64) discussion
kevin.zhao at linaro.org
Sun Mar 13 18:41:37 PDT 2022
Great! Thanks for that info, will take a look at that.
On Sat, 12 Mar 2022 at 08:02, Andreas Dilger <adilger at whamcloud.com> wrote:
> Qian has a patch https://review.whamcloud.com/40092 "LU-14003
> <https://jira.whamcloud.com/browse/LU-14003> pcc: rework PCC mmap
> implementation" that is changing the PCC MMAP code significantly, but is
> waiting for the 2.16.0 feature landing window to open. It needs to be
> refreshed, but it would be helpful if you could take a look through that
> patch to see if it would resolve the issue you are seeing.
> On Mar 11, 2022, at 01:18, Kevin Zhao via lustre-devel <
> lustre-devel at lists.lustre.org> wrote:
> Recently we've worked on the bug
> https://jira.whamcloud.com/browse/LU-14346. This bug makes the *mmap
> write* hang forever. It first occurred on Aarch64, but with a small
> change we *can easily reproduce it on X86_64*. For a more detailed
> analysis of this bug, you can also check the link
> The hang location is here
> <https://github.com/lustre/lustre-release/blob/master/lustre/tests/multiop.c#L725>:
>     case 'W':
>             for (i = 0; i < mmap_len && mmap_ptr; i += 4096)
>                     mmap_ptr[i] += junk++;
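The loop above can be exercised outside multiop with a sketch like the following. It is only an illustration: touch_mapped_file() is a hypothetical helper, not part of multiop.c, and it assumes the path points at a file on the filesystem under test (for this bug, a PCC-attached Lustre file). Each iteration performs a load and then a store on the same page of a MAP_SHARED mapping, which is the access pattern the analysis below is about.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Touch one byte per 4096-byte step of a shared, writable file mapping,
 * using the same read-modify-write statement as multiop's 'W' command.
 * Returns the number of bytes touched, or -1 on error.
 * Hypothetical helper for illustration; not taken from multiop.c.
 */
static long touch_mapped_file(const char *path, size_t mmap_len)
{
	char *mmap_ptr;
	char junk = 1;
	long touched = 0;
	size_t i;
	int fd;

	fd = open(path, O_RDWR | O_CREAT, 0644);
	if (fd < 0)
		return -1;

	/* Extend the file so every touched page is backed by file data. */
	if (ftruncate(fd, (off_t)mmap_len) < 0) {
		close(fd);
		return -1;
	}

	mmap_ptr = mmap(NULL, mmap_len, PROT_READ | PROT_WRITE,
			MAP_SHARED, fd, 0);
	if (mmap_ptr == MAP_FAILED) {
		close(fd);
		return -1;
	}

	for (i = 0; i < mmap_len; i += 4096) {
		mmap_ptr[i] += junk++;	/* load, add, store on the same page */
		touched++;
	}

	msync(mmap_ptr, mmap_len, MS_SYNC);
	munmap(mmap_ptr, mmap_len);
	close(fd);
	return touched;
}
```

Calling touch_mapped_file(path, 4 * 4096) touches four pages. On a healthy filesystem it returns promptly; with the failure injection described below, the store is where the process would be expected to get stuck.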
> *Bug Analysis: different behavior when running mmap_ptr[i] += junk++ on
> different platforms*
> Traditionally, this process is:
> 1. Read from mmap_ptr[i] first (executes the read page fault).
> 2. Write a value to the same page (executes page_mkwrite to change the
> page to writable).
> But the two platforms execute it quite differently.
> On the aarch64 platform: do_page_fault runs with no FAULT_FLAG_WRITE set,
> so handle_pte_fault calls do_read_fault:
> - do_read_fault:
>   - __do_fault -> calls ll_fault, gets a page from pcc_fault
>   - finish_fault (maps the returned page into the page tables)
>   - vmf->flags is VM_FAULT_LOCKED
> - the following write then calls do_wp_page --> do_page_mkwrite -->
>   ll_page_mkwrite
> On the X86_64 platform the mechanism is different: do_page_fault runs
> with *FAULT_FLAG_WRITE set*, so handle_pte_fault calls
> - do_shared_fault
>   - __do_fault -> calls ll_fault, gets a page from pcc_fault
>   - do_page_mkwrite -> calls ll_page_mkwrite
>   - finish_fault (maps the returned page into the page tables)
>   - fault_dirty_shared_page
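The two call chains above differ only in which branch of the fault path is taken. The following is a toy userspace model of that routing decision, not kernel code: struct toy_fault and toy_fault_route() are hypothetical names, and the do_cow_fault branch for private mappings is added from my reading of mm/memory.c rather than from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy stand-ins for what the fault path inspects (hypothetical names). */
struct toy_fault {
	bool pte_present;	/* a PTE already maps the page */
	bool pte_writable;	/* that PTE allows writes */
	bool write_access;	/* models FAULT_FLAG_WRITE */
	bool shared_mapping;	/* models a MAP_SHARED (VM_SHARED) vma */
};

/* Return the name of the handler a file-backed fault is routed to. */
static const char *toy_fault_route(const struct toy_fault *f)
{
	if (!f->pte_present) {
		/* No PTE yet: the routing depends on the access type. */
		if (!f->write_access)
			return "do_read_fault";
		return f->shared_mapping ? "do_shared_fault" : "do_cow_fault";
	}

	/* PTE exists but is read-only and we want to write. */
	if (f->write_access && !f->pte_writable)
		return "do_wp_page";

	return "no_handler";
}
```

In this model the aarch64 sequence is two faults (do_read_fault, then do_wp_page once the read-only PTE exists), while the X86_64 sequence is a single do_shared_fault.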
> *Bug Analysis: why it hangs forever*
> Also can check:
> for more details.
> Inject the failure 0x1412 OBD_FAIL_LLITE_PCC_DETACH_MKWRITE.
> page_mkwrite then returns VM_FAULT_RETRY | VM_FAULT_NOPAGE.
> On the retry, because the PTE is not NULL and vmf->flags has
> FAULT_FLAG_WRITE, do_wp_page is called again.
> So the next time we enter do_page_mkwrite again, and the write hangs.
> *Seeking a good solution*
> As the analysis above shows, *we want to let the kernel retry the
> mmap write (->fault() and ->page_mkwrite)*.
> In handle_pte_fault, if there is no page or the page is not mapped (no PTE
> found), then
> __do_page_fault will retry the memory fault handling.
> The easy fix here is to *remove the page and page table entry when we do
> fail injection in pcc_page_mkwrite*. But I haven't found a good way to
> do this, so I'm listing the info here and asking the community for help.
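The hang and the proposed direction can be illustrated with a small userspace simulation. This is a toy model under stated assumptions, not Lustre or kernel code: all toy_* names are hypothetical, and the model assumes that once the PTE is gone, a fresh ->fault() installs a page that is no longer affected by the PCC detach, so the following page_mkwrite succeeds.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_result { TOY_DONE, TOY_RETRY };

struct toy_state {
	bool pte_present;	/* page is still mapped in the page table */
	bool pcc_detached;	/* models the injected detach in page_mkwrite */
};

/* Models ->page_mkwrite: asks for a retry while the detach is pending. */
static enum toy_result toy_page_mkwrite(struct toy_state *s, bool drop_pte)
{
	if (s->pcc_detached) {
		if (drop_pte)
			s->pte_present = false;	/* the proposed fix */
		return TOY_RETRY;
	}
	return TOY_DONE;
}

/* Models ->fault: maps a fresh page (assumed to clear the detach state). */
static void toy_fault(struct toy_state *s)
{
	s->pte_present = true;
	s->pcc_detached = false;
}

/*
 * Returns the attempt on which the write fault completes, or -1 if it is
 * still retrying after max_attempts (the "hang" in this model).
 */
static int toy_write_fault(bool drop_pte, int max_attempts)
{
	struct toy_state s = { .pte_present = true, .pcc_detached = true };
	int attempt;

	for (attempt = 1; attempt <= max_attempts; attempt++) {
		/* Only a missing PTE sends the retry back through ->fault. */
		if (!s.pte_present)
			toy_fault(&s);
		if (toy_page_mkwrite(&s, drop_pte) == TOY_DONE)
			return attempt;
	}
	return -1;
}
```

With drop_pte set to false, every retry finds the PTE still present and goes straight back to page_mkwrite, so toy_write_fault() never completes. With drop_pte set to true, the second attempt goes through toy_fault() first and then succeeds. Whether the real ->fault() path behaves this way after a detach is exactly the open question in this thread.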
> One fix I tried:
> adding a call to *generic_error_remove_page*, but the mapped page still
> cannot be unmapped successfully. The error log is here
> I'm a newbie to Lustre and not very familiar with the memory
> management process, so please give some advice on this bug fix. Thanks in
> advance.
> Cheers, Andreas
> Andreas Dilger
> Lustre Principal Architect
Tech Lead, LDCG Cloud Infrastructure
Linaro Vertical Technologies
kevin.zhao at linaro.org | Mobile/Direct/Wechat: +86 18818270915