From ebcbc6ea7d8a604ad8504dae70a6ac1b1e64a0b7 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Mon, 14 Feb 2022 18:20:24 -0800 Subject: mm/munlock: delete page_mlock() and all its works We have recommended some applications to mlock their userspace, but that turns out to be counter-productive: when many processes mlock the same file, contention on rmap's i_mmap_rwsem can become intolerable at exit: it is needed for write, to remove any vma mapping that file from rmap's tree; but hogged for read by those with mlocks calling page_mlock() (formerly known as try_to_munlock()) on *each* page mapped from the file (the purpose being to find out whether another process has the page mlocked, so therefore it should not be unmlocked yet). Several optimizations have been made in the past: one is to skip page_mlock() when mapcount tells that nothing else has this page mapped; but that doesn't help at all when others do have it mapped. This time around, I initially intended to add a preliminary search of the rmap tree for overlapping VM_LOCKED ranges; but that gets messy with locking order, when in doubt whether a page is actually present; and risks adding even more contention on the i_mmap_rwsem. A solution would be much easier, if only there were space in struct page for an mlock_count... but actually, most of the time, there is space for it - an mlocked page spends most of its life on an unevictable LRU, but since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has been redundant. Let's try to reuse its page->lru. But leave that until a later patch: in this patch, clear the ground by removing page_mlock(), and all the infrastructure that has gathered around it - which mostly hinders understanding, and will make reviewing new additions harder. Don't mind those old comments about THPs, they date from before 4.5's refcounting rework: splitting is not a risk here. Just keep a minimal version of munlock_vma_page(), as reminder of what it should attend to (in particular, the odd way PGSTRANDED is counted out of PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range(). Move unchanged __mlock_posix_error_return() out of the way, down to above its caller: this series then makes no further change after mlock_fixup(). After this and each following commit, the kernel builds, boots and runs; but with deficiencies which may show up in testing of mlock and munlock. The system calls succeed or fail as before, and mlock remains effective in preventing page reclaim; but meminfo's Unevictable and Mlocked amounts may be shown too low after mlock, grow, then stay too high after munlock: with previously mlocked pages remaining unevictable for too long, until finally unmapped and freed and counts corrected. Normal service will be resumed in "mm/munlock: mlock_pte_range() when mlocking or munlocking". Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/rmap.h | 6 ------ 1 file changed, 6 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index e704b1a4c06c..dc48aa8c2c94 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -237,12 +237,6 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *); */ int folio_mkclean(struct folio *); -/* - * called in munlock()/munmap() path to check for other vmas holding - * the page mlocked. - */ -void page_mlock(struct page *page); - void remove_migration_ptes(struct page *old, struct page *new, bool locked); /* -- cgit v1.2.3 From cea86fe246b694a191804b47378eb9d77aefabec Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Mon, 14 Feb 2022 18:26:39 -0800 Subject: mm/munlock: rmap call mlock_vma_page() munlock_vma_page() Add vma argument to mlock_vma_page() and munlock_vma_page(), make them inline functions which check (vma->vm_flags & VM_LOCKED) before calling mlock_page() and munlock_page() in mm/mlock.c. Add bool compound to mlock_vma_page() and munlock_vma_page(): this is because we have understandable difficulty in accounting pte maps of THPs, and if passed a PageHead page, mlock_page() and munlock_page() cannot tell whether it's a pmd map to be counted or a pte map to be ignored. Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the others, and use that to call mlock_vma_page() at the end of the page adds, and munlock_vma_page() at the end of page_remove_rmap() (end or beginning? unimportant, but end was easier for assertions in testing). No page lock is required (although almost all adds happen to hold it): delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s. Certainly page lock did serialize with page migration, but I'm having difficulty explaining why that was ever important. Mlock accounting on THPs has been hard to define, differed between anon and file, involved PageDoubleMap in some places and not others, required clear_page_mlock() at some points. Keep it simple now: just count the pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks. page_add_new_anon_rmap() callers unchanged: they have long been calling lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED handling (it also checks for not VM_SPECIAL: I think that's overcautious, and inconsistent with other checks, that mmap_region() already prevents VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it). Signed-off-by: Hugh Dickins Acked-by: Vlastimil Babka Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/rmap.h | 17 ++++++++------- kernel/events/uprobes.c | 7 ++----- mm/huge_memory.c | 17 +++++++-------- mm/hugetlb.c | 4 ++-- mm/internal.h | 36 ++++++++++++++++++++++++++----- mm/khugepaged.c | 4 ++-- mm/ksm.c | 12 +---------- mm/memory.c | 45 +++++++++++++-------------------------- mm/migrate.c | 9 ++------ mm/mlock.c | 21 +++++++------------ mm/rmap.c | 56 +++++++++++++++++++++++-------------------------- mm/userfaultfd.c | 14 +++++++------ 12 files changed, 113 insertions(+), 129 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index dc48aa8c2c94..ac29b076082b 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -167,18 +167,19 @@ struct anon_vma *page_get_anon_vma(struct page *page); */ void page_move_anon_rmap(struct page *, struct vm_area_struct *); void page_add_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long, bool); + unsigned long address, bool compound); void do_page_add_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long, int); + unsigned long address, int flags); void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long, bool); -void page_add_file_rmap(struct page *, bool); -void page_remove_rmap(struct page *, bool); - + unsigned long address, bool compound); +void page_add_file_rmap(struct page *, struct vm_area_struct *, + bool compound); +void page_remove_rmap(struct page *, struct vm_area_struct *, + bool compound); void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long); + unsigned long address); void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long); + unsigned long address); static inline void page_dup_rmap(struct page *page, bool compound) { diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 6357c3580d07..eed2f7437d96 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -173,7 +173,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, return err; } - /* For try_to_free_swap() and munlock_vma_page() below */ + /* For try_to_free_swap() below */ lock_page(old_page); mmu_notifier_invalidate_range_start(&range); @@ -201,13 +201,10 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, set_pte_at_notify(mm, addr, pvmw.pte, mk_pte(new_page, vma->vm_page_prot)); - page_remove_rmap(old_page, false); + page_remove_rmap(old_page, vma, false); if (!page_mapped(old_page)) try_to_free_swap(old_page); page_vma_mapped_walk_done(&pvmw); - - if ((vma->vm_flags & VM_LOCKED) && !PageCompound(old_page)) - munlock_vma_page(old_page); put_page(old_page); err = 0; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 9a34b85ebcf8..d6477f48a27e 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1577,7 +1577,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, if (pmd_present(orig_pmd)) { page = pmd_page(orig_pmd); - page_remove_rmap(page, true); + page_remove_rmap(page, vma, true); VM_BUG_ON_PAGE(page_mapcount(page) < 0, page); VM_BUG_ON_PAGE(!PageHead(page), page); } else if (thp_migration_supported()) { @@ -1962,7 +1962,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, set_page_dirty(page); if (!PageReferenced(page) && pmd_young(old_pmd)) SetPageReferenced(page); - page_remove_rmap(page, true); + page_remove_rmap(page, vma, true); put_page(page); } add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR); @@ -2096,6 +2096,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, } } unlock_page_memcg(page); + + /* Above is effectively page_remove_rmap(page, vma, true) */ + munlock_vma_page(page, vma, true); } smp_wmb(); /* make pte visible before pmd */ @@ -2103,7 +2106,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, if (freeze) { for (i = 0; i < HPAGE_PMD_NR; i++) { - page_remove_rmap(page + i, false); + page_remove_rmap(page + i, vma, false); put_page(page + i); } } @@ -2163,8 +2166,6 @@ repeat: do_unlock_page = true; } } - if (PageMlocked(page)) - clear_page_mlock(page); } else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd))) goto out; __split_huge_pmd_locked(vma, pmd, range.start, freeze); @@ -3138,7 +3139,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw, if (pmd_soft_dirty(pmdval)) pmdswp = pmd_swp_mksoft_dirty(pmdswp); set_pmd_at(mm, address, pvmw->pmd, pmdswp); - page_remove_rmap(page, true); + page_remove_rmap(page, vma, true); put_page(page); } @@ -3168,10 +3169,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new) if (PageAnon(new)) page_add_anon_rmap(new, vma, mmun_start, true); else - page_add_file_rmap(new, true); + page_add_file_rmap(new, vma, true); set_pmd_at(mm, mmun_start, pvmw->pmd, pmde); - if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new)) - mlock_vma_page(new); update_mmu_cache_pmd(vma, address, pvmw->pmd); } #endif diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 61895cc01d09..43fb3155298e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5014,7 +5014,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct set_page_dirty(page); hugetlb_count_sub(pages_per_huge_page(h), mm); - page_remove_rmap(page, true); + page_remove_rmap(page, vma, true); spin_unlock(ptl); tlb_remove_page_size(tlb, page, huge_page_size(h)); @@ -5259,7 +5259,7 @@ retry_avoidcopy: /* Break COW */ huge_ptep_clear_flush(vma, haddr, ptep); mmu_notifier_invalidate_range(mm, range.start, range.end); - page_remove_rmap(old_page, true); + page_remove_rmap(old_page, vma, true); hugepage_add_new_anon_rmap(new_page, vma, haddr); set_huge_pte_at(mm, haddr, ptep, make_huge_pte(vma, new_page, 1)); diff --git a/mm/internal.h b/mm/internal.h index f235aa92e564..3d7dfc8bc471 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -395,12 +395,35 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma, bool write, int *locked); extern int mlock_future_check(struct mm_struct *mm, unsigned long flags, unsigned long len); - /* - * must be called with vma's mmap_lock held for read or write, and page locked. + * mlock_vma_page() and munlock_vma_page(): + * should be called with vma's mmap_lock held for read or write, + * under page table lock for the pte/pmd being added or removed. + * + * mlock is usually called at the end of page_add_*_rmap(), + * munlock at the end of page_remove_rmap(); but new anon + * pages are managed in lru_cache_add_inactive_or_unevictable(). + * + * @compound is used to include pmd mappings of THPs, but filter out + * pte mappings of THPs, which cannot be consistently counted: a pte + * mapping of the THP head cannot be distinguished by the page alone. */ -extern void mlock_vma_page(struct page *page); -extern void munlock_vma_page(struct page *page); +void mlock_page(struct page *page); +static inline void mlock_vma_page(struct page *page, + struct vm_area_struct *vma, bool compound) +{ + if (unlikely(vma->vm_flags & VM_LOCKED) && + (compound || !PageTransCompound(page))) + mlock_page(page); +} +void munlock_page(struct page *page); +static inline void munlock_vma_page(struct page *page, + struct vm_area_struct *vma, bool compound) +{ + if (unlikely(vma->vm_flags & VM_LOCKED) && + (compound || !PageTransCompound(page))) + munlock_page(page); +} /* * Clear the page's PageMlocked(). This can be useful in a situation where @@ -487,7 +510,10 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf, #else /* !CONFIG_MMU */ static inline void unmap_mapping_folio(struct folio *folio) { } static inline void clear_page_mlock(struct page *page) { } -static inline void mlock_vma_page(struct page *page) { } +static inline void mlock_vma_page(struct page *page, + struct vm_area_struct *vma, bool compound) { } +static inline void munlock_vma_page(struct page *page, + struct vm_area_struct *vma, bool compound) { } static inline void vunmap_range_noflush(unsigned long start, unsigned long end) { } diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 131492fd1148..52add1825525 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -774,7 +774,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page, */ spin_lock(ptl); ptep_clear(vma->vm_mm, address, _pte); - page_remove_rmap(src_page, false); + page_remove_rmap(src_page, vma, false); spin_unlock(ptl); free_page_and_swap_cache(src_page); } @@ -1513,7 +1513,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr) if (pte_none(*pte)) continue; page = vm_normal_page(vma, addr, *pte); - page_remove_rmap(page, false); + page_remove_rmap(page, vma, false); } pte_unmap_unlock(start_pte, ptl); diff --git a/mm/ksm.c b/mm/ksm.c index c20bd4d9a0d9..c5a4403b5dc9 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1177,7 +1177,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, ptep_clear_flush(vma, addr, ptep); set_pte_at_notify(mm, addr, ptep, newpte); - page_remove_rmap(page, false); + page_remove_rmap(page, vma, false); if (!page_mapped(page)) try_to_free_swap(page); put_page(page); @@ -1252,16 +1252,6 @@ static int try_to_merge_one_page(struct vm_area_struct *vma, err = replace_page(vma, page, kpage, orig_pte); } - if ((vma->vm_flags & VM_LOCKED) && kpage && !err) { - munlock_vma_page(page); - if (!PageMlocked(kpage)) { - unlock_page(page); - lock_page(kpage); - mlock_vma_page(kpage); - page = kpage; /* for final unlock */ - } - } - out_unlock: unlock_page(page); out: diff --git a/mm/memory.c b/mm/memory.c index c125c4969913..53bd9e5f2e33 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -735,9 +735,6 @@ static void restore_exclusive_pte(struct vm_area_struct *vma, set_pte_at(vma->vm_mm, address, ptep, pte); - if (vma->vm_flags & VM_LOCKED) - mlock_vma_page(page); - /* * No need to invalidate - it was non-present before. However * secondary CPUs may have mappings that need invalidating. @@ -1377,7 +1374,7 @@ again: mark_page_accessed(page); } rss[mm_counter(page)]--; - page_remove_rmap(page, false); + page_remove_rmap(page, vma, false); if (unlikely(page_mapcount(page) < 0)) print_bad_pte(vma, addr, ptent, page); if (unlikely(__tlb_remove_page(tlb, page))) { @@ -1397,10 +1394,8 @@ again: continue; pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); rss[mm_counter(page)]--; - if (is_device_private_entry(entry)) - page_remove_rmap(page, false); - + page_remove_rmap(page, vma, false); put_page(page); continue; } @@ -1753,16 +1748,16 @@ static int validate_page_before_insert(struct page *page) return 0; } -static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte, +static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, unsigned long addr, struct page *page, pgprot_t prot) { if (!pte_none(*pte)) return -EBUSY; /* Ok, finally just insert the thing.. */ get_page(page); - inc_mm_counter_fast(mm, mm_counter_file(page)); - page_add_file_rmap(page, false); - set_pte_at(mm, addr, pte, mk_pte(page, prot)); + inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page)); + page_add_file_rmap(page, vma, false); + set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot)); return 0; } @@ -1776,7 +1771,6 @@ static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte, static int insert_page(struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot) { - struct mm_struct *mm = vma->vm_mm; int retval; pte_t *pte; spinlock_t *ptl; @@ -1785,17 +1779,17 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr, if (retval) goto out; retval = -ENOMEM; - pte = get_locked_pte(mm, addr, &ptl); + pte = get_locked_pte(vma->vm_mm, addr, &ptl); if (!pte) goto out; - retval = insert_page_into_pte_locked(mm, pte, addr, page, prot); + retval = insert_page_into_pte_locked(vma, pte, addr, page, prot); pte_unmap_unlock(pte, ptl); out: return retval; } #ifdef pte_index -static int insert_page_in_batch_locked(struct mm_struct *mm, pte_t *pte, +static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte, unsigned long addr, struct page *page, pgprot_t prot) { int err; @@ -1805,7 +1799,7 @@ static int insert_page_in_batch_locked(struct mm_struct *mm, pte_t *pte, err = validate_page_before_insert(page); if (err) return err; - return insert_page_into_pte_locked(mm, pte, addr, page, prot); + return insert_page_into_pte_locked(vma, pte, addr, page, prot); } /* insert_pages() amortizes the cost of spinlock operations @@ -1842,7 +1836,7 @@ more: start_pte = pte_offset_map_lock(mm, pmd, addr, &pte_lock); for (pte = start_pte; pte_idx < batch_size; ++pte, ++pte_idx) { - int err = insert_page_in_batch_locked(mm, pte, + int err = insert_page_in_batch_locked(vma, pte, addr, pages[curr_page_idx], prot); if (unlikely(err)) { pte_unmap_unlock(start_pte, pte_lock); @@ -3098,7 +3092,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) * mapcount is visible. So transitively, TLBs to * old page will be flushed before it can be reused. */ - page_remove_rmap(old_page, false); + page_remove_rmap(old_page, vma, false); } /* Free the old page.. */ @@ -3118,16 +3112,6 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) */ mmu_notifier_invalidate_range_only_end(&range); if (old_page) { - /* - * Don't let another task, with possibly unlocked vma, - * keep the mlocked page. - */ - if (page_copied && (vma->vm_flags & VM_LOCKED)) { - lock_page(old_page); /* LRU manipulation */ - if (PageMlocked(old_page)) - munlock_vma_page(old_page); - unlock_page(old_page); - } if (page_copied) free_swap_cache(old_page); put_page(old_page); @@ -3947,7 +3931,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR); - page_add_file_rmap(page, true); + page_add_file_rmap(page, vma, true); + /* * deposit and withdraw with pmd lock held */ @@ -3996,7 +3981,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr) lru_cache_add_inactive_or_unevictable(page, vma); } else { inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page)); - page_add_file_rmap(page, false); + page_add_file_rmap(page, vma, false); } set_pte_at(vma->vm_mm, addr, vmf->pte, entry); } diff --git a/mm/migrate.c b/mm/migrate.c index c7da064b4781..7c4223ce2500 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -248,14 +248,9 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, if (PageAnon(new)) page_add_anon_rmap(new, vma, pvmw.address, false); else - page_add_file_rmap(new, false); + page_add_file_rmap(new, vma, false); set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); } - if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new)) - mlock_vma_page(new); - - if (PageTransHuge(page) && PageMlocked(page)) - clear_page_mlock(page); /* No need to invalidate - it was non-present before */ update_mmu_cache(vma, pvmw.address, pvmw.pte); @@ -2331,7 +2326,7 @@ again: * drop page refcount. Page won't be freed, as we took * a reference just above. */ - page_remove_rmap(page, false); + page_remove_rmap(page, vma, false); put_page(page); if (pte_present(pte)) diff --git a/mm/mlock.c b/mm/mlock.c index 5d7ced8303be..92f28258b4ae 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -78,17 +78,13 @@ void clear_page_mlock(struct page *page) } } -/* - * Mark page as mlocked if not already. - * If page on LRU, isolate and putback to move to unevictable list. +/** + * mlock_page - mlock a page + * @page: page to be mlocked, either a normal page or a THP head. */ -void mlock_vma_page(struct page *page) +void mlock_page(struct page *page) { - /* Serialize with page migration */ - BUG_ON(!PageLocked(page)); - VM_BUG_ON_PAGE(PageTail(page), page); - VM_BUG_ON_PAGE(PageCompound(page) && PageDoubleMap(page), page); if (!TestSetPageMlocked(page)) { int nr_pages = thp_nr_pages(page); @@ -101,14 +97,11 @@ void mlock_vma_page(struct page *page) } /** - * munlock_vma_page - munlock a vma page - * @page: page to be unlocked, either a normal page or THP page head + * munlock_page - munlock a page + * @page: page to be munlocked, either a normal page or a THP head. */ -void munlock_vma_page(struct page *page) +void munlock_page(struct page *page) { - /* Serialize with page migration */ - BUG_ON(!PageLocked(page)); - VM_BUG_ON_PAGE(PageTail(page), page); if (TestClearPageMlocked(page)) { diff --git a/mm/rmap.c b/mm/rmap.c index 7ce7f1946cff..6cc8bf129f18 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1181,17 +1181,17 @@ void do_page_add_anon_rmap(struct page *page, __mod_lruvec_page_state(page, NR_ANON_MAPPED, nr); } - if (unlikely(PageKsm(page))) { + if (unlikely(PageKsm(page))) unlock_page_memcg(page); - return; - } /* address might be in next vma when migration races vma_adjust */ - if (first) + else if (first) __page_set_anon_rmap(page, vma, address, flags & RMAP_EXCLUSIVE); else __page_check_anon_rmap(page, vma, address); + + mlock_vma_page(page, vma, compound); } /** @@ -1232,12 +1232,14 @@ void page_add_new_anon_rmap(struct page *page, /** * page_add_file_rmap - add pte mapping to a file page - * @page: the page to add the mapping to - * @compound: charge the page as compound or small page + * @page: the page to add the mapping to + * @vma: the vm area in which the mapping is added + * @compound: charge the page as compound or small page * * The caller needs to hold the pte lock. */ -void page_add_file_rmap(struct page *page, bool compound) +void page_add_file_rmap(struct page *page, + struct vm_area_struct *vma, bool compound) { int i, nr = 1; @@ -1260,13 +1262,8 @@ void page_add_file_rmap(struct page *page, bool compound) nr_pages); } else { if (PageTransCompound(page) && page_mapping(page)) { - struct page *head = compound_head(page); - VM_WARN_ON_ONCE(!PageLocked(page)); - - SetPageDoubleMap(head); - if (PageMlocked(page)) - clear_page_mlock(head); + SetPageDoubleMap(compound_head(page)); } if (!atomic_inc_and_test(&page->_mapcount)) goto out; @@ -1274,6 +1271,8 @@ void page_add_file_rmap(struct page *page, bool compound) __mod_lruvec_page_state(page, NR_FILE_MAPPED, nr); out: unlock_page_memcg(page); + + mlock_vma_page(page, vma, compound); } static void page_remove_file_rmap(struct page *page, bool compound) @@ -1368,11 +1367,13 @@ static void page_remove_anon_compound_rmap(struct page *page) /** * page_remove_rmap - take down pte mapping from a page * @page: page to remove mapping from + * @vma: the vm area from which the mapping is removed * @compound: uncharge the page as compound or small page * * The caller needs to hold the pte lock. */ -void page_remove_rmap(struct page *page, bool compound) +void page_remove_rmap(struct page *page, + struct vm_area_struct *vma, bool compound) { lock_page_memcg(page); @@ -1414,6 +1415,8 @@ void page_remove_rmap(struct page *page, bool compound) */ out: unlock_page_memcg(page); + + munlock_vma_page(page, vma, compound); } /* @@ -1469,28 +1472,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, mmu_notifier_invalidate_range_start(&range); while (page_vma_mapped_walk(&pvmw)) { + /* Unexpected PMD-mapped THP? */ + VM_BUG_ON_PAGE(!pvmw.pte, page); + /* - * If the page is mlock()d, we cannot swap it out. + * If the page is in an mlock()d vma, we must not swap it out. */ if (!(flags & TTU_IGNORE_MLOCK) && (vma->vm_flags & VM_LOCKED)) { - /* - * PTE-mapped THP are never marked as mlocked: so do - * not set it on a DoubleMap THP, nor on an Anon THP - * (which may still be PTE-mapped after DoubleMap was - * cleared). But stop unmapping even in those cases. - */ - if (!PageTransCompound(page) || (PageHead(page) && - !PageDoubleMap(page) && !PageAnon(page))) - mlock_vma_page(page); + /* Restore the mlock which got missed */ + mlock_vma_page(page, vma, false); page_vma_mapped_walk_done(&pvmw); ret = false; break; } - /* Unexpected PMD-mapped THP? */ - VM_BUG_ON_PAGE(!pvmw.pte, page); - subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); address = pvmw.address; @@ -1668,7 +1664,7 @@ discard: * * See Documentation/vm/mmu_notifier.rst */ - page_remove_rmap(subpage, PageHuge(page)); + page_remove_rmap(subpage, vma, PageHuge(page)); put_page(page); } @@ -1942,7 +1938,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, * * See Documentation/vm/mmu_notifier.rst */ - page_remove_rmap(subpage, PageHuge(page)); + page_remove_rmap(subpage, vma, PageHuge(page)); put_page(page); } @@ -2078,7 +2074,7 @@ static bool page_make_device_exclusive_one(struct page *page, * There is a reference on the page for the swap entry which has * been removed, so shouldn't take another. */ - page_remove_rmap(subpage, false); + page_remove_rmap(subpage, vma, false); } mmu_notifier_invalidate_range_end(&range); diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 0780c2a57ff1..15d3e97a6e04 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -95,10 +95,15 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, if (!pte_none(*dst_pte)) goto out_unlock; - if (page_in_cache) - page_add_file_rmap(page, false); - else + if (page_in_cache) { + /* Usually, cache pages are already added to LRU */ + if (newly_allocated) + lru_cache_add(page); + page_add_file_rmap(page, dst_vma, false); + } else { page_add_new_anon_rmap(page, dst_vma, dst_addr, false); + lru_cache_add_inactive_or_unevictable(page, dst_vma); + } /* * Must happen after rmap, as mm_counter() checks mapping (via @@ -106,9 +111,6 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, */ inc_mm_counter(dst_mm, mm_counter(page)); - if (newly_allocated) - lru_cache_add_inactive_or_unevictable(page, dst_vma); - set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); /* No need to invalidate - it was non-present before */ -- cgit v1.2.3 From eed05e54d275b3cfc5d8c79843c5276a5878e94a Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Thu, 3 Feb 2022 09:06:08 -0500 Subject: mm: Add DEFINE_PAGE_VMA_WALK and DEFINE_FOLIO_VMA_WALK Instead of declaring a struct page_vma_mapped_walk directly, use these helpers to allow us to transition to a PFN approach in the following patches. Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/rmap.h | 16 ++++++++++++++++ kernel/events/uprobes.c | 6 +----- mm/damon/paddr.c | 12 ++---------- mm/ksm.c | 5 +---- mm/migrate.c | 7 +------ mm/page_idle.c | 6 +----- mm/rmap.c | 31 +++++-------------------------- 7 files changed, 27 insertions(+), 56 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index ac29b076082b..0d894a2bfaa1 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -214,6 +214,22 @@ struct page_vma_mapped_walk { unsigned int flags; }; +#define DEFINE_PAGE_VMA_WALK(name, _page, _vma, _address, _flags) \ + struct page_vma_mapped_walk name = { \ + .page = _page, \ + .vma = _vma, \ + .address = _address, \ + .flags = _flags, \ + } + +#define DEFINE_FOLIO_VMA_WALK(name, _folio, _vma, _address, _flags) \ + struct page_vma_mapped_walk name = { \ + .page = &_folio->page, \ + .vma = _vma, \ + .address = _address, \ + .flags = _flags, \ + } + static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw) { /* HugeTLB pte is set to the relevant page table entry without pte_mapped. */ diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index eed2f7437d96..6418083901d4 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -155,11 +155,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, struct page *old_page, struct page *new_page) { struct mm_struct *mm = vma->vm_mm; - struct page_vma_mapped_walk pvmw = { - .page = compound_head(old_page), - .vma = vma, - .address = addr, - }; + DEFINE_FOLIO_VMA_WALK(pvmw, page_folio(old_page), vma, addr, 0); int err; struct mmu_notifier_range range; diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c index 5e8244f65a1a..cb45d49c731d 100644 --- a/mm/damon/paddr.c +++ b/mm/damon/paddr.c @@ -19,11 +19,7 @@ static bool __damon_pa_mkold(struct page *page, struct vm_area_struct *vma, unsigned long addr, void *arg) { - struct page_vma_mapped_walk pvmw = { - .page = page, - .vma = vma, - .address = addr, - }; + DEFINE_PAGE_VMA_WALK(pvmw, page, vma, addr, 0); while (page_vma_mapped_walk(&pvmw)) { addr = pvmw.address; @@ -93,11 +89,7 @@ static bool __damon_pa_young(struct page *page, struct vm_area_struct *vma, unsigned long addr, void *arg) { struct damon_pa_access_chk_result *result = arg; - struct page_vma_mapped_walk pvmw = { - .page = page, - .vma = vma, - .address = addr, - }; + DEFINE_PAGE_VMA_WALK(pvmw, page, vma, addr, 0); result->accessed = false; result->page_sz = PAGE_SIZE; diff --git a/mm/ksm.c b/mm/ksm.c index c5a4403b5dc9..ea82fef93a31 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1034,10 +1034,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, pte_t *orig_pte) { struct mm_struct *mm = vma->vm_mm; - struct page_vma_mapped_walk pvmw = { - .page = page, - .vma = vma, - }; + DEFINE_PAGE_VMA_WALK(pvmw, page, vma, 0, 0); int swapped; int err = -EFAULT; struct mmu_notifier_range range; diff --git a/mm/migrate.c b/mm/migrate.c index f4076093c855..71f92e8ed934 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -174,12 +174,7 @@ void putback_movable_pages(struct list_head *l) static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, unsigned long addr, void *old) { - struct page_vma_mapped_walk pvmw = { - .page = old, - .vma = vma, - .address = addr, - .flags = PVMW_SYNC | PVMW_MIGRATION, - }; + DEFINE_PAGE_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION); struct page *new; pte_t pte; swp_entry_t entry; diff --git a/mm/page_idle.c b/mm/page_idle.c index edead6a8a5f9..3e05bf1ce825 100644 --- a/mm/page_idle.c +++ b/mm/page_idle.c @@ -48,11 +48,7 @@ static bool page_idle_clear_pte_refs_one(struct page *page, struct vm_area_struct *vma, unsigned long addr, void *arg) { - struct page_vma_mapped_walk pvmw = { - .page = page, - .vma = vma, - .address = addr, - }; + DEFINE_PAGE_VMA_WALK(pvmw, page, vma, addr, 0); bool referenced = false; while (page_vma_mapped_walk(&pvmw)) { diff --git a/mm/rmap.c b/mm/rmap.c index 1a13d5d6cfc7..a7f06b76b503 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -802,11 +802,7 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma, unsigned long address, void *arg) { struct page_referenced_arg *pra = arg; - struct page_vma_mapped_walk pvmw = { - .page = page, - .vma = vma, - .address = address, - }; + DEFINE_PAGE_VMA_WALK(pvmw, page, vma, address, 0); int referenced = 0; while (page_vma_mapped_walk(&pvmw)) { @@ -934,12 +930,7 @@ int page_referenced(struct page *page, static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, unsigned long address, void *arg) { - struct page_vma_mapped_walk pvmw = { - .page = page, - .vma = vma, - .address = address, - .flags = PVMW_SYNC, - }; + DEFINE_PAGE_VMA_WALK(pvmw, page, vma, address, PVMW_SYNC); struct mmu_notifier_range range; int *cleaned = arg; @@ -1419,11 +1410,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, unsigned long address, void *arg) { struct mm_struct *mm = vma->vm_mm; - struct page_vma_mapped_walk pvmw = { - .page = page, - .vma = vma, - .address = address, - }; + DEFINE_PAGE_VMA_WALK(pvmw, page, vma, address, 0); pte_t pteval; struct page *subpage; bool ret = true; @@ -1714,11 +1701,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, unsigned long address, void *arg) { struct mm_struct *mm = vma->vm_mm; - struct page_vma_mapped_walk pvmw = { - .page = page, - .vma = vma, - .address = address, - }; + DEFINE_PAGE_VMA_WALK(pvmw, page, vma, address, 0); pte_t pteval; struct page *subpage; bool ret = true; @@ -2001,11 +1984,7 @@ static bool page_make_device_exclusive_one(struct page *page, struct vm_area_struct *vma, unsigned long address, void *priv) { struct mm_struct *mm = vma->vm_mm; - struct page_vma_mapped_walk pvmw = { - .page = page, - .vma = vma, - .address = address, - }; + DEFINE_PAGE_VMA_WALK(pvmw, page, vma, address, 0); struct make_exclusive_args *args = priv; pte_t pteval; struct page *subpage; -- cgit v1.2.3 From 2aff7a4755bed2870ee23b75bc88cdc8d76cdd03 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Thu, 3 Feb 2022 11:40:17 -0500 Subject: mm: Convert page_vma_mapped_walk to work on PFNs page_mapped_in_vma() really just wants to walk one page, but as the code stands, if passed the head page of a compound page, it will walk every page in the compound page. Extract pfn/nr_pages/pgoff from the struct page early, so they can be overridden by page_mapped_in_vma(). Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/hugetlb.h | 5 +++++ include/linux/rmap.h | 17 ++++++++++----- mm/internal.h | 15 ++++++++----- mm/migrate.c | 5 +++-- mm/page_vma_mapped.c | 58 ++++++++++++++++++++++--------------------------- mm/rmap.c | 8 +++---- 6 files changed, 58 insertions(+), 50 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index d1897a69c540..6ba2f8e74fbb 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -970,6 +970,11 @@ static inline struct hstate *page_hstate(struct page *page) return NULL; } +static inline struct hstate *size_to_hstate(unsigned long size) +{ + return NULL; +} + static inline unsigned long huge_page_size(struct hstate *h) { return PAGE_SIZE; diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 0d894a2bfaa1..0c838ba1a8ee 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -11,6 +11,7 @@ #include #include #include +#include /* * The anon_vma heads a list of private "related" vmas, to scan if @@ -201,11 +202,13 @@ int make_device_exclusive_range(struct mm_struct *mm, unsigned long start, /* Avoid racy checks */ #define PVMW_SYNC (1 << 0) -/* Look for migarion entries rather than present PTEs */ +/* Look for migration entries rather than present PTEs */ #define PVMW_MIGRATION (1 << 1) struct page_vma_mapped_walk { - struct page *page; + unsigned long pfn; + unsigned long nr_pages; + pgoff_t pgoff; struct vm_area_struct *vma; unsigned long address; pmd_t *pmd; @@ -216,7 +219,9 @@ struct page_vma_mapped_walk { #define DEFINE_PAGE_VMA_WALK(name, _page, _vma, _address, _flags) \ struct page_vma_mapped_walk name = { \ - .page = _page, \ + .pfn = page_to_pfn(_page), \ + .nr_pages = compound_nr(page), \ + .pgoff = page_to_pgoff(page), \ .vma = _vma, \ .address = _address, \ .flags = _flags, \ @@ -224,7 +229,9 @@ struct page_vma_mapped_walk { #define DEFINE_FOLIO_VMA_WALK(name, _folio, _vma, _address, _flags) \ struct page_vma_mapped_walk name = { \ - .page = &_folio->page, \ + .pfn = folio_pfn(_folio), \ + .nr_pages = folio_nr_pages(_folio), \ + .pgoff = folio_pgoff(_folio), \ .vma = _vma, \ .address = _address, \ .flags = _flags, \ @@ -233,7 +240,7 @@ struct page_vma_mapped_walk { static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw) { /* HugeTLB pte is set to the relevant page table entry without pte_mapped. */ - if (pvmw->pte && !PageHuge(pvmw->page)) + if (pvmw->pte && !is_vm_hugetlb_page(pvmw->vma)) pte_unmap(pvmw->pte); if (pvmw->ptl) spin_unlock(pvmw->ptl); diff --git a/mm/internal.h b/mm/internal.h index 6047268076e7..3b652444f070 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -10,6 +10,7 @@ #include #include #include +#include #include struct folio_batch; @@ -475,18 +476,20 @@ vma_address(struct page *page, struct vm_area_struct *vma) } /* - * Then at what user virtual address will none of the page be found in vma? + * Then at what user virtual address will none of the range be found in vma? * Assumes that vma_address() already returned a good starting address. - * If page is a compound head, the entire compound page is considered. */ -static inline unsigned long -vma_address_end(struct page *page, struct vm_area_struct *vma) +static inline unsigned long vma_address_end(struct page_vma_mapped_walk *pvmw) { + struct vm_area_struct *vma = pvmw->vma; pgoff_t pgoff; unsigned long address; - VM_BUG_ON_PAGE(PageKsm(page), page); /* KSM page->index unusable */ - pgoff = page_to_pgoff(page) + compound_nr(page); + /* Common case, plus ->pgoff is invalid for KSM */ + if (pvmw->nr_pages == 1) + return pvmw->address + PAGE_SIZE; + + pgoff = pvmw->pgoff + pvmw->nr_pages; address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); /* Check for address beyond vma (or wrapped through 0?) */ if (address < vma->vm_start || address > vma->vm_end) diff --git a/mm/migrate.c b/mm/migrate.c index 71f92e8ed934..358bc311caaa 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -174,7 +174,8 @@ void putback_movable_pages(struct list_head *l) static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, unsigned long addr, void *old) { - DEFINE_PAGE_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION); + DEFINE_PAGE_VMA_WALK(pvmw, (struct page *)old, vma, addr, + PVMW_SYNC | PVMW_MIGRATION); struct page *new; pte_t pte; swp_entry_t entry; @@ -184,7 +185,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, if (PageKsm(page)) new = page; else - new = page - pvmw.page->index + + new = page - pvmw.pgoff + linear_page_index(vma, pvmw.address); #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index f7b331081791..1187f9c1ec5b 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -53,18 +53,6 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw) return true; } -static inline bool pfn_is_match(struct page *page, unsigned long pfn) -{ - unsigned long page_pfn = page_to_pfn(page); - - /* normal page and hugetlbfs page */ - if (!PageTransCompound(page) || PageHuge(page)) - return page_pfn == pfn; - - /* THP can be referenced by any subpage */ - return pfn >= page_pfn && pfn - page_pfn < thp_nr_pages(page); -} - /** * check_pte - check if @pvmw->page is mapped at the @pvmw->pte * @pvmw: page_vma_mapped_walk struct, includes a pair pte and page for checking @@ -116,7 +104,17 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw) pfn = pte_pfn(*pvmw->pte); } - return pfn_is_match(pvmw->page, pfn); + return (pfn - pvmw->pfn) < pvmw->nr_pages; +} + +/* Returns true if the two ranges overlap. Careful to not overflow. */ +static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw) +{ + if ((pfn + HPAGE_PMD_NR - 1) < pvmw->pfn) + return false; + if (pfn > pvmw->pfn + pvmw->nr_pages - 1) + return false; + return true; } static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size) @@ -127,7 +125,7 @@ static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size) } /** - * page_vma_mapped_walk - check if @pvmw->page is mapped in @pvmw->vma at + * page_vma_mapped_walk - check if @pvmw->pfn is mapped in @pvmw->vma at * @pvmw->address * @pvmw: pointer to struct page_vma_mapped_walk. page, vma, address and flags * must be set. pmd, pte and ptl must be NULL. @@ -152,8 +150,8 @@ static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size) */ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) { - struct mm_struct *mm = pvmw->vma->vm_mm; - struct page *page = pvmw->page; + struct vm_area_struct *vma = pvmw->vma; + struct mm_struct *mm = vma->vm_mm; unsigned long end; pgd_t *pgd; p4d_t *p4d; @@ -164,32 +162,26 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) if (pvmw->pmd && !pvmw->pte) return not_found(pvmw); - if (unlikely(PageHuge(page))) { + if (unlikely(is_vm_hugetlb_page(vma))) { + unsigned long size = pvmw->nr_pages * PAGE_SIZE; /* The only possible mapping was handled on last iteration */ if (pvmw->pte) return not_found(pvmw); /* when pud is not present, pte will be NULL */ - pvmw->pte = huge_pte_offset(mm, pvmw->address, page_size(page)); + pvmw->pte = huge_pte_offset(mm, pvmw->address, size); if (!pvmw->pte) return false; - pvmw->ptl = huge_pte_lockptr(page_hstate(page), mm, pvmw->pte); + pvmw->ptl = huge_pte_lockptr(size_to_hstate(size), mm, + pvmw->pte); spin_lock(pvmw->ptl); if (!check_pte(pvmw)) return not_found(pvmw); return true; } - /* - * Seek to next pte only makes sense for THP. - * But more important than that optimization, is to filter out - * any PageKsm page: whose page->index misleads vma_address() - * and vma_address_end() to disaster. - */ - end = PageTransCompound(page) ? - vma_address_end(page, pvmw->vma) : - pvmw->address + PAGE_SIZE; + end = vma_address_end(pvmw); if (pvmw->pte) goto next_pte; restart: @@ -224,7 +216,7 @@ restart: if (likely(pmd_trans_huge(pmde))) { if (pvmw->flags & PVMW_MIGRATION) return not_found(pvmw); - if (pmd_page(pmde) != page) + if (!check_pmd(pmd_pfn(pmde), pvmw)) return not_found(pvmw); return true; } @@ -236,7 +228,7 @@ restart: return not_found(pvmw); entry = pmd_to_swp_entry(pmde); if (!is_migration_entry(entry) || - pfn_swap_entry_to_page(entry) != page) + !check_pmd(swp_offset(entry), pvmw)) return not_found(pvmw); return true; } @@ -250,7 +242,8 @@ restart: * cleared *pmd but not decremented compound_mapcount(). */ if ((pvmw->flags & PVMW_SYNC) && - PageTransCompound(page)) { + transparent_hugepage_active(vma) && + (pvmw->nr_pages >= HPAGE_PMD_NR)) { spinlock_t *ptl = pmd_lock(mm, pvmw->pmd); spin_unlock(ptl); @@ -307,7 +300,8 @@ next_pte: int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma) { struct page_vma_mapped_walk pvmw = { - .page = page, + .pfn = page_to_pfn(page), + .nr_pages = 1, .vma = vma, .flags = PVMW_SYNC, }; diff --git a/mm/rmap.c b/mm/rmap.c index a7f06b76b503..e27ba4172069 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -940,7 +940,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, */ mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE, 0, vma, vma->vm_mm, address, - vma_address_end(page, vma)); + vma_address_end(&pvmw)); mmu_notifier_invalidate_range_start(&range); while (page_vma_mapped_walk(&pvmw)) { @@ -1437,8 +1437,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * Note that the page can not be free in this function as call of * try_to_unmap() must hold a reference on the page. */ - range.end = PageKsm(page) ? - address + PAGE_SIZE : vma_address_end(page, vma); + range.end = vma_address_end(&pvmw); mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm, address, range.end); if (PageHuge(page)) { @@ -1732,8 +1731,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, * Note that the page can not be free in this function as call of * try_to_unmap() must hold a reference on the page. */ - range.end = PageKsm(page) ? - address + PAGE_SIZE : vma_address_end(page, vma); + range.end = vma_address_end(&pvmw); mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm, address, range.end); if (PageHuge(page)) { -- cgit v1.2.3 From b3ac04132c4b9bc5c9c14608424d410e7ca3b400 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Fri, 21 Jan 2022 11:27:31 -0500 Subject: mm/rmap: Turn page_referenced() into folio_referenced() Both its callers pass a page which was previously on an LRU list, so were passing a folio by definition. Use the type system to enforce that and remove a few calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Christoph Hellwig --- include/linux/rmap.h | 4 +-- mm/page_idle.c | 2 +- mm/rmap.c | 70 ++++++++++++++++++++++++++-------------------------- mm/vmscan.c | 20 +++++++++------ 4 files changed, 50 insertions(+), 46 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 0c838ba1a8ee..69a1664216de 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -190,7 +190,7 @@ static inline void page_dup_rmap(struct page *page, bool compound) /* * Called from mm/vmscan.c to handle paging out */ -int page_referenced(struct page *, int is_locked, +int folio_referenced(struct folio *, int is_locked, struct mem_cgroup *memcg, unsigned long *vm_flags); void try_to_migrate(struct page *page, enum ttu_flags flags); @@ -301,7 +301,7 @@ void rmap_walk_locked(struct page *page, struct rmap_walk_control *rwc); #define anon_vma_prepare(vma) (0) #define anon_vma_link(vma) do {} while (0) -static inline int page_referenced(struct page *page, int is_locked, +static inline int folio_referenced(struct folio *folio, int is_locked, struct mem_cgroup *memcg, unsigned long *vm_flags) { diff --git a/mm/page_idle.c b/mm/page_idle.c index 2427d832f5d6..5c73a9b578da 100644 --- a/mm/page_idle.c +++ b/mm/page_idle.c @@ -77,7 +77,7 @@ static bool page_idle_clear_pte_refs_one(struct page *page, /* * We cleared the referenced bit in a mapping to this page. To * avoid interference with page reclaim, mark it young so that - * page_referenced() will return > 0. + * folio_referenced() will return > 0. */ folio_set_young(folio); } diff --git a/mm/rmap.c b/mm/rmap.c index b1d7f3e7f58c..36eef54d05d8 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -789,29 +789,30 @@ out: return pmd; } -struct page_referenced_arg { +struct folio_referenced_arg { int mapcount; int referenced; unsigned long vm_flags; struct mem_cgroup *memcg; }; /* - * arg: page_referenced_arg will be passed + * arg: folio_referenced_arg will be passed */ -static bool page_referenced_one(struct page *page, struct vm_area_struct *vma, +static bool folio_referenced_one(struct page *page, struct vm_area_struct *vma, unsigned long address, void *arg) { - struct page_referenced_arg *pra = arg; - DEFINE_PAGE_VMA_WALK(pvmw, page, vma, address, 0); + struct folio *folio = page_folio(page); + struct folio_referenced_arg *pra = arg; + DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0); int referenced = 0; while (page_vma_mapped_walk(&pvmw)) { address = pvmw.address; if ((vma->vm_flags & VM_LOCKED) && - (!PageTransCompound(page) || !pvmw.pte)) { + (!folio_test_large(folio) || !pvmw.pte)) { /* Restore the mlock which got missed */ - mlock_vma_page(page, vma, !pvmw.pte); + mlock_vma_folio(folio, vma, !pvmw.pte); page_vma_mapped_walk_done(&pvmw); pra->vm_flags |= VM_LOCKED; return false; /* To break the loop */ @@ -823,10 +824,10 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma, /* * Don't treat a reference through * a sequentially read mapping as such. - * If the page has been used in another mapping, + * If the folio has been used in another mapping, * we will catch it; if this other mapping is * already gone, the unmap path will have set - * PG_referenced or activated the page. + * the referenced flag or activated the folio. */ if (likely(!(vma->vm_flags & VM_SEQ_READ))) referenced++; @@ -836,7 +837,7 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma, pvmw.pmd)) referenced++; } else { - /* unexpected pmd-mapped page? */ + /* unexpected pmd-mapped folio? */ WARN_ON_ONCE(1); } @@ -844,8 +845,8 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma, } if (referenced) - clear_page_idle(page); - if (test_and_clear_page_young(page)) + folio_clear_idle(folio); + if (folio_test_clear_young(folio)) referenced++; if (referenced) { @@ -859,9 +860,9 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma, return true; } -static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg) +static bool invalid_folio_referenced_vma(struct vm_area_struct *vma, void *arg) { - struct page_referenced_arg *pra = arg; + struct folio_referenced_arg *pra = arg; struct mem_cgroup *memcg = pra->memcg; if (!mm_match_cgroup(vma->vm_mm, memcg)) @@ -871,27 +872,26 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg) } /** - * page_referenced - test if the page was referenced - * @page: the page to test - * @is_locked: caller holds lock on the page + * folio_referenced() - Test if the folio was referenced. + * @folio: The folio to test. + * @is_locked: Caller holds lock on the folio. * @memcg: target memory cgroup - * @vm_flags: collect encountered vma->vm_flags who actually referenced the page + * @vm_flags: A combination of all the vma->vm_flags which referenced the folio. * - * Quick test_and_clear_referenced for all mappings to a page, - * returns the number of ptes which referenced the page. + * Quick test_and_clear_referenced for all mappings of a folio, + * + * Return: The number of mappings which referenced the folio. */ -int page_referenced(struct page *page, - int is_locked, - struct mem_cgroup *memcg, - unsigned long *vm_flags) +int folio_referenced(struct folio *folio, int is_locked, + struct mem_cgroup *memcg, unsigned long *vm_flags) { int we_locked = 0; - struct page_referenced_arg pra = { - .mapcount = total_mapcount(page), + struct folio_referenced_arg pra = { + .mapcount = folio_mapcount(folio), .memcg = memcg, }; struct rmap_walk_control rwc = { - .rmap_one = page_referenced_one, + .rmap_one = folio_referenced_one, .arg = (void *)&pra, .anon_lock = page_lock_anon_vma_read, }; @@ -900,11 +900,11 @@ int page_referenced(struct page *page, if (!pra.mapcount) return 0; - if (!page_rmapping(page)) + if (!folio_raw_mapping(folio)) return 0; - if (!is_locked && (!PageAnon(page) || PageKsm(page))) { - we_locked = trylock_page(page); + if (!is_locked && (!folio_test_anon(folio) || folio_test_ksm(folio))) { + we_locked = folio_trylock(folio); if (!we_locked) return 1; } @@ -915,14 +915,14 @@ int page_referenced(struct page *page, * cgroups */ if (memcg) { - rwc.invalid_vma = invalid_page_referenced_vma; + rwc.invalid_vma = invalid_folio_referenced_vma; } - rmap_walk(page, &rwc); + rmap_walk(&folio->page, &rwc); *vm_flags = pra.vm_flags; if (we_locked) - unlock_page(page); + folio_unlock(folio); return pra.referenced; } @@ -1052,8 +1052,8 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma) anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; /* * Ensure that anon_vma and the PAGE_MAPPING_ANON bit are written - * simultaneously, so a concurrent reader (eg page_referenced()'s - * PageAnon()) will not see one without the other. + * simultaneously, so a concurrent reader (eg folio_referenced()'s + * folio_test_anon()) will not see one without the other. */ WRITE_ONCE(page->mapping, (struct address_space *) anon_vma); } diff --git a/mm/vmscan.c b/mm/vmscan.c index 94729d2d1125..38f124c41bcd 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1386,11 +1386,12 @@ enum page_references { static enum page_references page_check_references(struct page *page, struct scan_control *sc) { + struct folio *folio = page_folio(page); int referenced_ptes, referenced_page; unsigned long vm_flags; - referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup, - &vm_flags); + referenced_ptes = folio_referenced(folio, 1, sc->target_mem_cgroup, + &vm_flags); referenced_page = TestClearPageReferenced(page); /* @@ -2490,7 +2491,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, * * If the pages are mostly unmapped, the processing is fast and it is * appropriate to hold lru_lock across the whole operation. But if - * the pages are mapped, the processing is slow (page_referenced()), so + * the pages are mapped, the processing is slow (folio_referenced()), so * we should drop lru_lock around each page. It's impossible to balance * this, so instead we remove the pages from the LRU while processing them. * It is safe to rely on PG_active against the non-LRU pages in here because @@ -2510,7 +2511,6 @@ static void shrink_active_list(unsigned long nr_to_scan, LIST_HEAD(l_hold); /* The pages which were snipped off */ LIST_HEAD(l_active); LIST_HEAD(l_inactive); - struct page *page; unsigned nr_deactivate, nr_activate; unsigned nr_rotated = 0; int file = is_file_lru(lru); @@ -2532,9 +2532,13 @@ static void shrink_active_list(unsigned long nr_to_scan, spin_unlock_irq(&lruvec->lru_lock); while (!list_empty(&l_hold)) { + struct folio *folio; + struct page *page; + cond_resched(); - page = lru_to_page(&l_hold); - list_del(&page->lru); + folio = lru_to_folio(&l_hold); + list_del(&folio->lru); + page = &folio->page; if (unlikely(!page_evictable(page))) { putback_lru_page(page); @@ -2549,8 +2553,8 @@ static void shrink_active_list(unsigned long nr_to_scan, } } - if (page_referenced(page, 0, sc->target_mem_cgroup, - &vm_flags)) { + if (folio_referenced(folio, 0, sc->target_mem_cgroup, + &vm_flags)) { /* * Identify referenced, file-backed active pages and * give them one more trip around the active list. So -- cgit v1.2.3 From 869f7ee6f6477341f859c8b0949ae81caf9ca7f3 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 15 Feb 2022 09:28:49 -0500 Subject: mm/rmap: Convert try_to_unmap() to take a folio Change all three callers and the worker function try_to_unmap_one(). Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/rmap.h | 4 +-- mm/huge_memory.c | 3 +- mm/khugepaged.c | 3 +- mm/memory-failure.c | 7 +++-- mm/memory_hotplug.c | 13 ++++---- mm/rmap.c | 83 +++++++++++++++++++++++++++------------------------- mm/vmscan.c | 2 +- 7 files changed, 62 insertions(+), 53 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 69a1664216de..a0c5c38c733f 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -194,7 +194,7 @@ int folio_referenced(struct folio *, int is_locked, struct mem_cgroup *memcg, unsigned long *vm_flags); void try_to_migrate(struct page *page, enum ttu_flags flags); -void try_to_unmap(struct page *, enum ttu_flags flags); +void try_to_unmap(struct folio *, enum ttu_flags flags); int make_device_exclusive_range(struct mm_struct *mm, unsigned long start, unsigned long end, struct page **pages, @@ -309,7 +309,7 @@ static inline int folio_referenced(struct folio *folio, int is_locked, return 0; } -static inline void try_to_unmap(struct page *page, enum ttu_flags flags) +static inline void try_to_unmap(struct folio *folio, enum ttu_flags flags) { } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 583b735a079b..de684427f79c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2251,6 +2251,7 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma, static void unmap_page(struct page *page) { + struct folio *folio = page_folio(page); enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD | TTU_SYNC; @@ -2264,7 +2265,7 @@ static void unmap_page(struct page *page) if (PageAnon(page)) try_to_migrate(page, ttu_flags); else - try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK); + try_to_unmap(folio, ttu_flags | TTU_IGNORE_MLOCK); VM_WARN_ON_ONCE_PAGE(page_mapped(page), page); } diff --git a/mm/khugepaged.c b/mm/khugepaged.c index fa05e6d39783..1cdf7c38b9e5 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1834,7 +1834,8 @@ static void collapse_file(struct mm_struct *mm, } if (page_mapped(page)) - try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_BATCH_FLUSH); + try_to_unmap(page_folio(page), + TTU_IGNORE_MLOCK | TTU_BATCH_FLUSH); xas_lock_irq(&xas); xas_set(&xas, index); diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 0b72a936b8dd..258913d5e036 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1347,6 +1347,7 @@ static int get_hwpoison_page(struct page *p, unsigned long flags) static bool hwpoison_user_mappings(struct page *p, unsigned long pfn, int flags, struct page *hpage) { + struct folio *folio = page_folio(hpage); enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC; struct address_space *mapping; LIST_HEAD(tokill); @@ -1412,7 +1413,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn, collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED); if (!PageHuge(hpage)) { - try_to_unmap(hpage, ttu); + try_to_unmap(folio, ttu); } else { if (!PageAnon(hpage)) { /* @@ -1424,12 +1425,12 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn, */ mapping = hugetlb_page_mapping_lock_write(hpage); if (mapping) { - try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED); + try_to_unmap(folio, ttu|TTU_RMAP_LOCKED); i_mmap_unlock_write(mapping); } else pr_info("Memory failure: %#lx: could not lock mapping for mapped huge page\n", pfn); } else { - try_to_unmap(hpage, ttu); + try_to_unmap(folio, ttu); } } diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 2a9627dc784c..914057da53c7 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1690,10 +1690,13 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) DEFAULT_RATELIMIT_BURST); for (pfn = start_pfn; pfn < end_pfn; pfn++) { + struct folio *folio; + if (!pfn_valid(pfn)) continue; page = pfn_to_page(pfn); - head = compound_head(page); + folio = page_folio(page); + head = &folio->page; if (PageHuge(page)) { pfn = page_to_pfn(head) + compound_nr(head) - 1; @@ -1710,10 +1713,10 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) * the unmap as the catch all safety net). */ if (PageHWPoison(page)) { - if (WARN_ON(PageLRU(page))) - isolate_lru_page(page); - if (page_mapped(page)) - try_to_unmap(page, TTU_IGNORE_MLOCK); + if (WARN_ON(folio_test_lru(folio))) + folio_isolate_lru(folio); + if (folio_mapped(folio)) + try_to_unmap(folio, TTU_IGNORE_MLOCK); continue; } diff --git a/mm/rmap.c b/mm/rmap.c index 8b3d44e56e30..cf6e3de9d2f7 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1412,7 +1412,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, { struct folio *folio = page_folio(page); struct mm_struct *mm = vma->vm_mm; - DEFINE_PAGE_VMA_WALK(pvmw, page, vma, address, 0); + DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0); pte_t pteval; struct page *subpage; bool ret = true; @@ -1436,13 +1436,13 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * For hugetlb, it could be much worse if we need to do pud * invalidation in the case of pmd sharing. * - * Note that the page can not be free in this function as call of - * try_to_unmap() must hold a reference on the page. + * Note that the folio can not be freed in this function as call of + * try_to_unmap() must hold a reference on the folio. */ range.end = vma_address_end(&pvmw); mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm, address, range.end); - if (PageHuge(page)) { + if (folio_test_hugetlb(folio)) { /* * If sharing is possible, start and end will be adjusted * accordingly. @@ -1454,24 +1454,25 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, while (page_vma_mapped_walk(&pvmw)) { /* Unexpected PMD-mapped THP? */ - VM_BUG_ON_PAGE(!pvmw.pte, page); + VM_BUG_ON_FOLIO(!pvmw.pte, folio); /* - * If the page is in an mlock()d vma, we must not swap it out. + * If the folio is in an mlock()d vma, we must not swap it out. */ if (!(flags & TTU_IGNORE_MLOCK) && (vma->vm_flags & VM_LOCKED)) { /* Restore the mlock which got missed */ - mlock_vma_page(page, vma, false); + mlock_vma_folio(folio, vma, false); page_vma_mapped_walk_done(&pvmw); ret = false; break; } - subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); + subpage = folio_page(folio, + pte_pfn(*pvmw.pte) - folio_pfn(folio)); address = pvmw.address; - if (PageHuge(page) && !PageAnon(page)) { + if (folio_test_hugetlb(folio) && !folio_test_anon(folio)) { /* * To call huge_pmd_unshare, i_mmap_rwsem must be * held in write mode. Caller needs to explicitly @@ -1510,7 +1511,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (should_defer_flush(mm, flags)) { /* * We clear the PTE but do not flush so potentially - * a remote CPU could still be writing to the page. + * a remote CPU could still be writing to the folio. * If the entry was previously clean then the * architecture must guarantee that a clear->dirty * transition on a cached TLB entry is written through @@ -1523,22 +1524,22 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, pteval = ptep_clear_flush(vma, address, pvmw.pte); } - /* Move the dirty bit to the page. Now the pte is gone. */ + /* Set the dirty flag on the folio now the pte is gone. */ if (pte_dirty(pteval)) - set_page_dirty(page); + folio_mark_dirty(folio); /* Update high watermark before we lower rss */ update_hiwater_rss(mm); - if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) { + if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) { pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); - if (PageHuge(page)) { - hugetlb_count_sub(compound_nr(page), mm); + if (folio_test_hugetlb(folio)) { + hugetlb_count_sub(folio_nr_pages(folio), mm); set_huge_swap_pte_at(mm, address, pvmw.pte, pteval, vma_mmu_pagesize(vma)); } else { - dec_mm_counter(mm, mm_counter(page)); + dec_mm_counter(mm, mm_counter(&folio->page)); set_pte_at(mm, address, pvmw.pte, pteval); } @@ -1553,18 +1554,19 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * migration) will not expect userfaults on already * copied pages. */ - dec_mm_counter(mm, mm_counter(page)); + dec_mm_counter(mm, mm_counter(&folio->page)); /* We have to invalidate as we cleared the pte */ mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE); - } else if (PageAnon(page)) { + } else if (folio_test_anon(folio)) { swp_entry_t entry = { .val = page_private(subpage) }; pte_t swp_pte; /* * Store the swap location in the pte. * See handle_pte_fault() ... */ - if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) { + if (unlikely(folio_test_swapbacked(folio) != + folio_test_swapcache(folio))) { WARN_ON_ONCE(1); ret = false; /* We have to invalidate as we cleared the pte */ @@ -1575,8 +1577,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } /* MADV_FREE page check */ - if (!PageSwapBacked(page)) { - if (!PageDirty(page)) { + if (!folio_test_swapbacked(folio)) { + if (!folio_test_dirty(folio)) { /* Invalidate as we cleared the pte */ mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE); @@ -1585,11 +1587,11 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } /* - * If the page was redirtied, it cannot be + * If the folio was redirtied, it cannot be * discarded. Remap the page to page table. */ set_pte_at(mm, address, pvmw.pte, pteval); - SetPageSwapBacked(page); + folio_set_swapbacked(folio); ret = false; page_vma_mapped_walk_done(&pvmw); break; @@ -1626,16 +1628,17 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, address + PAGE_SIZE); } else { /* - * This is a locked file-backed page, thus it cannot - * be removed from the page cache and replaced by a new - * page before mmu_notifier_invalidate_range_end, so no - * concurrent thread might update its page table to - * point at new page while a device still is using this - * page. + * This is a locked file-backed folio, + * so it cannot be removed from the page + * cache and replaced by a new folio before + * mmu_notifier_invalidate_range_end, so no + * concurrent thread might update its page table + * to point at a new folio while a device is + * still using this folio. * * See Documentation/vm/mmu_notifier.rst */ - dec_mm_counter(mm, mm_counter_file(page)); + dec_mm_counter(mm, mm_counter_file(&folio->page)); } discard: /* @@ -1645,10 +1648,10 @@ discard: * * See Documentation/vm/mmu_notifier.rst */ - page_remove_rmap(subpage, vma, PageHuge(page)); + page_remove_rmap(subpage, vma, folio_test_hugetlb(folio)); if (vma->vm_flags & VM_LOCKED) mlock_page_drain(smp_processor_id()); - put_page(page); + folio_put(folio); } mmu_notifier_invalidate_range_end(&range); @@ -1667,17 +1670,17 @@ static int page_not_mapped(struct page *page) } /** - * try_to_unmap - try to remove all page table mappings to a page - * @page: the page to get unmapped + * try_to_unmap - Try to remove all page table mappings to a folio. + * @folio: The folio to unmap. * @flags: action and flags * * Tries to remove all the page table entries which are mapping this - * page, used in the pageout path. Caller must hold the page lock. + * folio. It is the caller's responsibility to check if the folio is + * still mapped if needed (use TTU_SYNC to prevent accounting races). * - * It is the caller's responsibility to check if the page is still - * mapped when needed (use TTU_SYNC to prevent accounting races). + * Context: Caller must hold the folio lock. */ -void try_to_unmap(struct page *page, enum ttu_flags flags) +void try_to_unmap(struct folio *folio, enum ttu_flags flags) { struct rmap_walk_control rwc = { .rmap_one = try_to_unmap_one, @@ -1687,9 +1690,9 @@ void try_to_unmap(struct page *page, enum ttu_flags flags) }; if (flags & TTU_RMAP_LOCKED) - rmap_walk_locked(page, &rwc); + rmap_walk_locked(&folio->page, &rwc); else - rmap_walk(page, &rwc); + rmap_walk(&folio->page, &rwc); } /* diff --git a/mm/vmscan.c b/mm/vmscan.c index 38f124c41bcd..a57eb747f08d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1768,7 +1768,7 @@ retry: if (unlikely(PageTransHuge(page))) flags |= TTU_SPLIT_HUGE_PMD; - try_to_unmap(page, flags); + try_to_unmap(folio, flags); if (page_mapped(page)) { stat->nr_unmap_fail += nr_pages; if (!was_swapbacked && PageSwapBacked(page)) -- cgit v1.2.3 From 4b8554c527f3cfa183f6c06d231a9387873205a0 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Fri, 28 Jan 2022 14:29:43 -0500 Subject: mm/rmap: Convert try_to_migrate() to folios Convert the callers to pass a folio and the try_to_migrate_one() worker to use a folio throughout. Fixes an assumption that a folio must be <= PMD size. Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/rmap.h | 2 +- mm/huge_memory.c | 4 ++-- mm/migrate.c | 6 ++++-- mm/migrate_device.c | 6 ++++-- mm/rmap.c | 59 +++++++++++++++++++++++++++------------------------- 5 files changed, 42 insertions(+), 35 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index a0c5c38c733f..e6e935c81414 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -193,7 +193,7 @@ static inline void page_dup_rmap(struct page *page, bool compound) int folio_referenced(struct folio *, int is_locked, struct mem_cgroup *memcg, unsigned long *vm_flags); -void try_to_migrate(struct page *page, enum ttu_flags flags); +void try_to_migrate(struct folio *folio, enum ttu_flags flags); void try_to_unmap(struct folio *, enum ttu_flags flags); int make_device_exclusive_range(struct mm_struct *mm, unsigned long start, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index de684427f79c..7df1934d6528 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2262,8 +2262,8 @@ static void unmap_page(struct page *page) * pages can simply be left unmapped, then faulted back on demand. * If that is ever changed (perhaps for mlock), update remap_page(). */ - if (PageAnon(page)) - try_to_migrate(page, ttu_flags); + if (folio_test_anon(folio)) + try_to_migrate(folio, ttu_flags); else try_to_unmap(folio, ttu_flags | TTU_IGNORE_MLOCK); diff --git a/mm/migrate.c b/mm/migrate.c index 358bc311caaa..6ed85a5d1be5 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -912,6 +912,7 @@ out: static int __unmap_and_move(struct page *page, struct page *newpage, int force, enum migrate_mode mode) { + struct folio *folio = page_folio(page); int rc = -EAGAIN; bool page_was_mapped = false; struct anon_vma *anon_vma = NULL; @@ -1015,7 +1016,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, /* Establish migration ptes */ VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma, page); - try_to_migrate(page, 0); + try_to_migrate(folio, 0); page_was_mapped = true; } @@ -1165,6 +1166,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, enum migrate_mode mode, int reason, struct list_head *ret) { + struct folio *src = page_folio(hpage); int rc = -EAGAIN; int page_was_mapped = 0; struct page *new_hpage; @@ -1241,7 +1243,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, ttu |= TTU_RMAP_LOCKED; } - try_to_migrate(hpage, ttu); + try_to_migrate(src, ttu); page_was_mapped = 1; if (mapping_locked) diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 0326b901d2fd..b2c611d4bdb2 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -333,6 +333,7 @@ static void migrate_vma_unmap(struct migrate_vma *migrate) for (i = 0; i < npages; i++) { struct page *page = migrate_pfn_to_page(migrate->src[i]); + struct folio *folio; if (!page) continue; @@ -356,8 +357,9 @@ static void migrate_vma_unmap(struct migrate_vma *migrate) put_page(page); } - if (page_mapped(page)) - try_to_migrate(page, 0); + folio = page_folio(page); + if (folio_mapped(folio)) + try_to_migrate(folio, 0); if (page_mapped(page) || !migrate_vma_check_page(page)) { if (!is_zone_device_page(page)) { diff --git a/mm/rmap.c b/mm/rmap.c index cf6e3de9d2f7..8497da29193c 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1706,7 +1706,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, { struct folio *folio = page_folio(page); struct mm_struct *mm = vma->vm_mm; - DEFINE_PAGE_VMA_WALK(pvmw, page, vma, address, 0); + DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0); pte_t pteval; struct page *subpage; bool ret = true; @@ -1740,7 +1740,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, range.end = vma_address_end(&pvmw); mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm, address, range.end); - if (PageHuge(page)) { + if (folio_test_hugetlb(folio)) { /* * If sharing is possible, start and end will be adjusted * accordingly. @@ -1754,21 +1754,24 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION /* PMD-mapped THP migration entry */ if (!pvmw.pte) { - VM_BUG_ON_PAGE(PageHuge(page) || - !PageTransCompound(page), page); + subpage = folio_page(folio, + pmd_pfn(*pvmw.pmd) - folio_pfn(folio)); + VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) || + !folio_test_pmd_mappable(folio), folio); - set_pmd_migration_entry(&pvmw, page); + set_pmd_migration_entry(&pvmw, subpage); continue; } #endif /* Unexpected PMD-mapped THP? */ - VM_BUG_ON_PAGE(!pvmw.pte, page); + VM_BUG_ON_FOLIO(!pvmw.pte, folio); - subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); + subpage = folio_page(folio, + pte_pfn(*pvmw.pte) - folio_pfn(folio)); address = pvmw.address; - if (PageHuge(page) && !PageAnon(page)) { + if (folio_test_hugetlb(folio) && !folio_test_anon(folio)) { /* * To call huge_pmd_unshare, i_mmap_rwsem must be * held in write mode. Caller needs to explicitly @@ -1806,15 +1809,15 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, flush_cache_page(vma, address, pte_pfn(*pvmw.pte)); pteval = ptep_clear_flush(vma, address, pvmw.pte); - /* Move the dirty bit to the page. Now the pte is gone. */ + /* Set the dirty flag on the folio now the pte is gone. */ if (pte_dirty(pteval)) - set_page_dirty(page); + folio_mark_dirty(folio); /* Update high watermark before we lower rss */ update_hiwater_rss(mm); - if (is_zone_device_page(page)) { - unsigned long pfn = page_to_pfn(page); + if (folio_is_zone_device(folio)) { + unsigned long pfn = folio_pfn(folio); swp_entry_t entry; pte_t swp_pte; @@ -1850,16 +1853,16 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, * changed when hugepage migrations to device private * memory are supported. */ - subpage = page; - } else if (PageHWPoison(page)) { + subpage = &folio->page; + } else if (PageHWPoison(subpage)) { pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); - if (PageHuge(page)) { - hugetlb_count_sub(compound_nr(page), mm); + if (folio_test_hugetlb(folio)) { + hugetlb_count_sub(folio_nr_pages(folio), mm); set_huge_swap_pte_at(mm, address, pvmw.pte, pteval, vma_mmu_pagesize(vma)); } else { - dec_mm_counter(mm, mm_counter(page)); + dec_mm_counter(mm, mm_counter(&folio->page)); set_pte_at(mm, address, pvmw.pte, pteval); } @@ -1874,7 +1877,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, * migration) will not expect userfaults on already * copied pages. */ - dec_mm_counter(mm, mm_counter(page)); + dec_mm_counter(mm, mm_counter(&folio->page)); /* We have to invalidate as we cleared the pte */ mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE); @@ -1920,10 +1923,10 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, * * See Documentation/vm/mmu_notifier.rst */ - page_remove_rmap(subpage, vma, PageHuge(page)); + page_remove_rmap(subpage, vma, folio_test_hugetlb(folio)); if (vma->vm_flags & VM_LOCKED) mlock_page_drain(smp_processor_id()); - put_page(page); + folio_put(folio); } mmu_notifier_invalidate_range_end(&range); @@ -1933,13 +1936,13 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, /** * try_to_migrate - try to replace all page table mappings with swap entries - * @page: the page to replace page table entries for + * @folio: the folio to replace page table entries for * @flags: action and flags * - * Tries to remove all the page table entries which are mapping this page and - * replace them with special swap entries. Caller must hold the page lock. + * Tries to remove all the page table entries which are mapping this folio and + * replace them with special swap entries. Caller must hold the folio lock. */ -void try_to_migrate(struct page *page, enum ttu_flags flags) +void try_to_migrate(struct folio *folio, enum ttu_flags flags) { struct rmap_walk_control rwc = { .rmap_one = try_to_migrate_one, @@ -1956,7 +1959,7 @@ void try_to_migrate(struct page *page, enum ttu_flags flags) TTU_SYNC))) return; - if (is_zone_device_page(page) && !is_device_private_page(page)) + if (folio_is_zone_device(folio) && !folio_is_device_private(folio)) return; /* @@ -1967,13 +1970,13 @@ void try_to_migrate(struct page *page, enum ttu_flags flags) * locking requirements of exec(), migration skips * temporary VMAs until after exec() completes. */ - if (!PageKsm(page) && PageAnon(page)) + if (!folio_test_ksm(folio) && folio_test_anon(folio)) rwc.invalid_vma = invalid_migration_vma; if (flags & TTU_RMAP_LOCKED) - rmap_walk_locked(page, &rwc); + rmap_walk_locked(&folio->page, &rwc); else - rmap_walk(page, &rwc); + rmap_walk(&folio->page, &rwc); } #ifdef CONFIG_DEVICE_PRIVATE -- cgit v1.2.3 From 4eecb8b9163df82c87c91764a02fff228ef25f6d Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Fri, 28 Jan 2022 23:32:59 -0500 Subject: mm/migrate: Convert remove_migration_ptes() to folios Convert the implementation and all callers. Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/rmap.h | 2 +- mm/huge_memory.c | 24 ++++++++++++----------- mm/migrate.c | 55 ++++++++++++++++++++++++++++------------------------ mm/migrate_device.c | 15 +++++++++----- 4 files changed, 54 insertions(+), 42 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index e6e935c81414..21af80d5b711 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -261,7 +261,7 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *); */ int folio_mkclean(struct folio *); -void remove_migration_ptes(struct page *old, struct page *new, bool locked); +void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked); /* * Called by memory-failure.c to kill processes. diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 7df1934d6528..d55b25f1ceba 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2270,18 +2270,19 @@ static void unmap_page(struct page *page) VM_WARN_ON_ONCE_PAGE(page_mapped(page), page); } -static void remap_page(struct page *page, unsigned int nr) +static void remap_page(struct folio *folio, unsigned long nr) { - int i; + int i = 0; /* If unmap_page() uses try_to_migrate() on file, remove this check */ - if (!PageAnon(page)) + if (!folio_test_anon(folio)) return; - if (PageTransHuge(page)) { - remove_migration_ptes(page, page, true); - } else { - for (i = 0; i < nr; i++) - remove_migration_ptes(page + i, page + i, true); + for (;;) { + remove_migration_ptes(folio, folio, true); + i += folio_nr_pages(folio); + if (i >= nr) + break; + folio = folio_next(folio); } } @@ -2441,7 +2442,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, } local_irq_enable(); - remap_page(head, nr); + remap_page(folio, nr); if (PageSwapCache(head)) { swp_entry_t entry = { .val = page_private(head) }; @@ -2550,7 +2551,8 @@ bool can_split_huge_page(struct page *page, int *pextra_pins) */ int split_huge_page_to_list(struct page *page, struct list_head *list) { - struct page *head = compound_head(page); + struct folio *folio = page_folio(page); + struct page *head = &folio->page; struct deferred_split *ds_queue = get_deferred_split_queue(head); XA_STATE(xas, &head->mapping->i_pages, head->index); struct anon_vma *anon_vma = NULL; @@ -2667,7 +2669,7 @@ fail: if (mapping) xas_unlock(&xas); local_irq_enable(); - remap_page(head, thp_nr_pages(head)); + remap_page(folio, folio_nr_pages(folio)); ret = -EBUSY; } diff --git a/mm/migrate.c b/mm/migrate.c index 6ed85a5d1be5..eba3cd5376e3 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -174,30 +174,32 @@ void putback_movable_pages(struct list_head *l) static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, unsigned long addr, void *old) { - DEFINE_PAGE_VMA_WALK(pvmw, (struct page *)old, vma, addr, - PVMW_SYNC | PVMW_MIGRATION); - struct page *new; - pte_t pte; - swp_entry_t entry; + struct folio *folio = page_folio(page); + DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION); VM_BUG_ON_PAGE(PageTail(page), page); while (page_vma_mapped_walk(&pvmw)) { - if (PageKsm(page)) - new = page; - else - new = page - pvmw.pgoff + - linear_page_index(vma, pvmw.address); + pte_t pte; + swp_entry_t entry; + struct page *new; + unsigned long idx = 0; + + /* pgoff is invalid for ksm pages, but they are never large */ + if (folio_test_large(folio) && !folio_test_hugetlb(folio)) + idx = linear_page_index(vma, pvmw.address) - pvmw.pgoff; + new = folio_page(folio, idx); #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION /* PMD-mapped THP migration entry */ if (!pvmw.pte) { - VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page); + VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) || + !folio_test_pmd_mappable(folio), folio); remove_migration_pmd(&pvmw, new); continue; } #endif - get_page(new); + folio_get(folio); pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot))); if (pte_swp_soft_dirty(*pvmw.pte)) pte = pte_mksoft_dirty(pte); @@ -226,12 +228,12 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, } #ifdef CONFIG_HUGETLB_PAGE - if (PageHuge(new)) { + if (folio_test_hugetlb(folio)) { unsigned int shift = huge_page_shift(hstate_vma(vma)); pte = pte_mkhuge(pte); pte = arch_make_huge_pte(pte, shift, vma->vm_flags); - if (PageAnon(new)) + if (folio_test_anon(folio)) hugepage_add_anon_rmap(new, vma, pvmw.address); else page_dup_rmap(new, true); @@ -239,7 +241,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, } else #endif { - if (PageAnon(new)) + if (folio_test_anon(folio)) page_add_anon_rmap(new, vma, pvmw.address, false); else page_add_file_rmap(new, vma, false); @@ -259,17 +261,17 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, * Get rid of all migration entries and replace them by * references to the indicated page. */ -void remove_migration_ptes(struct page *old, struct page *new, bool locked) +void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked) { struct rmap_walk_control rwc = { .rmap_one = remove_migration_pte, - .arg = old, + .arg = src, }; if (locked) - rmap_walk_locked(new, &rwc); + rmap_walk_locked(&dst->page, &rwc); else - rmap_walk(new, &rwc); + rmap_walk(&dst->page, &rwc); } /* @@ -756,6 +758,7 @@ int buffer_migrate_page_norefs(struct address_space *mapping, */ static int writeout(struct address_space *mapping, struct page *page) { + struct folio *folio = page_folio(page); struct writeback_control wbc = { .sync_mode = WB_SYNC_NONE, .nr_to_write = 1, @@ -781,7 +784,7 @@ static int writeout(struct address_space *mapping, struct page *page) * At this point we know that the migration attempt cannot * be successful. */ - remove_migration_ptes(page, page, false); + remove_migration_ptes(folio, folio, false); rc = mapping->a_ops->writepage(page, &wbc); @@ -913,6 +916,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, int force, enum migrate_mode mode) { struct folio *folio = page_folio(page); + struct folio *dst = page_folio(newpage); int rc = -EAGAIN; bool page_was_mapped = false; struct anon_vma *anon_vma = NULL; @@ -1039,8 +1043,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage, } if (page_was_mapped) - remove_migration_ptes(page, - rc == MIGRATEPAGE_SUCCESS ? newpage : page, false); + remove_migration_ptes(folio, + rc == MIGRATEPAGE_SUCCESS ? dst : folio, false); out_unlock_both: unlock_page(newpage); @@ -1166,7 +1170,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, enum migrate_mode mode, int reason, struct list_head *ret) { - struct folio *src = page_folio(hpage); + struct folio *dst, *src = page_folio(hpage); int rc = -EAGAIN; int page_was_mapped = 0; struct page *new_hpage; @@ -1194,6 +1198,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, new_hpage = get_new_page(hpage, private); if (!new_hpage) return -ENOMEM; + dst = page_folio(new_hpage); if (!trylock_page(hpage)) { if (!force) @@ -1254,8 +1259,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, rc = move_to_new_page(new_hpage, hpage, mode); if (page_was_mapped) - remove_migration_ptes(hpage, - rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage, false); + remove_migration_ptes(src, + rc == MIGRATEPAGE_SUCCESS ? dst : src, false); unlock_put_anon: unlock_page(new_hpage); diff --git a/mm/migrate_device.c b/mm/migrate_device.c index b2c611d4bdb2..70c7dc05bbfc 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -376,15 +376,17 @@ static void migrate_vma_unmap(struct migrate_vma *migrate) for (i = 0; i < npages && restore; i++) { struct page *page = migrate_pfn_to_page(migrate->src[i]); + struct folio *folio; if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE)) continue; - remove_migration_ptes(page, page, false); + folio = page_folio(page); + remove_migration_ptes(folio, folio, false); migrate->src[i] = 0; - unlock_page(page); - put_page(page); + folio_unlock(folio); + folio_put(folio); restore--; } } @@ -729,6 +731,7 @@ void migrate_vma_finalize(struct migrate_vma *migrate) unsigned long i; for (i = 0; i < npages; i++) { + struct folio *dst, *src; struct page *newpage = migrate_pfn_to_page(migrate->dst[i]); struct page *page = migrate_pfn_to_page(migrate->src[i]); @@ -748,8 +751,10 @@ void migrate_vma_finalize(struct migrate_vma *migrate) newpage = page; } - remove_migration_ptes(page, newpage, false); - unlock_page(page); + src = page_folio(page); + dst = page_folio(newpage); + remove_migration_ptes(src, dst, false); + folio_unlock(src); if (is_zone_device_page(page)) put_page(page); -- cgit v1.2.3 From 9595d76942b8714627d670a7e7ae543812c731ae Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Tue, 1 Feb 2022 23:33:08 -0500 Subject: mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read() Add back page_lock_anon_vma_read() as a wrapper. This saves a few calls to compound_head(). If any callers were passing a tail page before, this would have failed to lock the anon VMA as page->mapping is not valid for tail pages. Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/rmap.h | 1 + mm/folio-compat.c | 7 +++++++ mm/memory-failure.c | 3 ++- mm/rmap.c | 12 ++++++------ 4 files changed, 16 insertions(+), 7 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 21af80d5b711..be020d38b0a5 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -267,6 +267,7 @@ void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked); * Called by memory-failure.c to kill processes. */ struct anon_vma *page_lock_anon_vma_read(struct page *page); +struct anon_vma *folio_lock_anon_vma_read(struct folio *folio); void page_unlock_anon_vma_read(struct anon_vma *anon_vma); int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma); diff --git a/mm/folio-compat.c b/mm/folio-compat.c index 46fa179e32fb..968ad97bbffa 100644 --- a/mm/folio-compat.c +++ b/mm/folio-compat.c @@ -164,3 +164,10 @@ void putback_lru_page(struct page *page) { folio_putback_lru(page_folio(page)); } + +#ifdef CONFIG_MMU +struct anon_vma *page_lock_anon_vma_read(struct page *page) +{ + return folio_lock_anon_vma_read(page_folio(page)); +} +#endif diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 258913d5e036..aa8236848949 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -487,12 +487,13 @@ static struct task_struct *task_early_kill(struct task_struct *tsk, static void collect_procs_anon(struct page *page, struct list_head *to_kill, int force_early) { + struct folio *folio = page_folio(page); struct vm_area_struct *vma; struct task_struct *tsk; struct anon_vma *av; pgoff_t pgoff; - av = page_lock_anon_vma_read(page); + av = folio_lock_anon_vma_read(folio); if (av == NULL) /* Not actually mapped anymore */ return; diff --git a/mm/rmap.c b/mm/rmap.c index c74de8af7eec..64655d345234 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -526,28 +526,28 @@ out: * atomic op -- the trylock. If we fail the trylock, we fall back to getting a * reference like with page_get_anon_vma() and then block on the mutex. */ -struct anon_vma *page_lock_anon_vma_read(struct page *page) +struct anon_vma *folio_lock_anon_vma_read(struct folio *folio) { struct anon_vma *anon_vma = NULL; struct anon_vma *root_anon_vma; unsigned long anon_mapping; rcu_read_lock(); - anon_mapping = (unsigned long)READ_ONCE(page->mapping); + anon_mapping = (unsigned long)READ_ONCE(folio->mapping); if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON) goto out; - if (!page_mapped(page)) + if (!folio_mapped(folio)) goto out; anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON); root_anon_vma = READ_ONCE(anon_vma->root); if (down_read_trylock(&root_anon_vma->rwsem)) { /* - * If the page is still mapped, then this anon_vma is still + * If the folio is still mapped, then this anon_vma is still * its anon_vma, and holding the mutex ensures that it will * not go away, see anon_vma_free(). */ - if (!page_mapped(page)) { + if (!folio_mapped(folio)) { up_read(&root_anon_vma->rwsem); anon_vma = NULL; } @@ -560,7 +560,7 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page) goto out; } - if (!page_mapped(page)) { + if (!folio_mapped(folio)) { rcu_read_unlock(); put_anon_vma(anon_vma); return NULL; -- cgit v1.2.3 From 2f031c6f042cb8a9b221a8b6b80e69de5170f830 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Sat, 29 Jan 2022 16:06:53 -0500 Subject: mm/rmap: Convert rmap_walk() to take a folio This ripples all the way through to every calling and called function from rmap. Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/ksm.h | 4 +- include/linux/rmap.h | 11 +++-- mm/damon/paddr.c | 15 ++++--- mm/folio-compat.c | 7 ---- mm/huge_memory.c | 2 +- mm/ksm.c | 12 +++--- mm/migrate.c | 10 ++--- mm/page_idle.c | 7 ++-- mm/rmap.c | 111 ++++++++++++++++++++++++--------------------------- 9 files changed, 80 insertions(+), 99 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/ksm.h b/include/linux/ksm.h index a38a5bca1ba5..0b4f17418f64 100644 --- a/include/linux/ksm.h +++ b/include/linux/ksm.h @@ -51,7 +51,7 @@ static inline void ksm_exit(struct mm_struct *mm) struct page *ksm_might_need_to_copy(struct page *page, struct vm_area_struct *vma, unsigned long address); -void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc); +void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc); void folio_migrate_ksm(struct folio *newfolio, struct folio *folio); #else /* !CONFIG_KSM */ @@ -78,7 +78,7 @@ static inline struct page *ksm_might_need_to_copy(struct page *page, return page; } -static inline void rmap_walk_ksm(struct page *page, +static inline void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc) { } diff --git a/include/linux/rmap.h b/include/linux/rmap.h index be020d38b0a5..a74d811c5b3f 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -266,7 +266,6 @@ void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked); /* * Called by memory-failure.c to kill processes. */ -struct anon_vma *page_lock_anon_vma_read(struct page *page); struct anon_vma *folio_lock_anon_vma_read(struct folio *folio); void page_unlock_anon_vma_read(struct anon_vma *anon_vma); int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma); @@ -286,15 +285,15 @@ struct rmap_walk_control { * Return false if page table scanning in rmap_walk should be stopped. * Otherwise, return true. */ - bool (*rmap_one)(struct page *page, struct vm_area_struct *vma, + bool (*rmap_one)(struct folio *folio, struct vm_area_struct *vma, unsigned long addr, void *arg); - int (*done)(struct page *page); - struct anon_vma *(*anon_lock)(struct page *page); + int (*done)(struct folio *folio); + struct anon_vma *(*anon_lock)(struct folio *folio); bool (*invalid_vma)(struct vm_area_struct *vma, void *arg); }; -void rmap_walk(struct page *page, struct rmap_walk_control *rwc); -void rmap_walk_locked(struct page *page, struct rmap_walk_control *rwc); +void rmap_walk(struct folio *folio, struct rmap_walk_control *rwc); +void rmap_walk_locked(struct folio *folio, struct rmap_walk_control *rwc); #else /* !CONFIG_MMU */ diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c index eee8a3395d71..ae24549921e2 100644 --- a/mm/damon/paddr.c +++ b/mm/damon/paddr.c @@ -16,10 +16,10 @@ #include "../internal.h" #include "prmtv-common.h" -static bool __damon_pa_mkold(struct page *page, struct vm_area_struct *vma, +static bool __damon_pa_mkold(struct folio *folio, struct vm_area_struct *vma, unsigned long addr, void *arg) { - DEFINE_PAGE_VMA_WALK(pvmw, page, vma, addr, 0); + DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, addr, 0); while (page_vma_mapped_walk(&pvmw)) { addr = pvmw.address; @@ -37,7 +37,7 @@ static void damon_pa_mkold(unsigned long paddr) struct page *page = damon_get_page(PHYS_PFN(paddr)); struct rmap_walk_control rwc = { .rmap_one = __damon_pa_mkold, - .anon_lock = page_lock_anon_vma_read, + .anon_lock = folio_lock_anon_vma_read, }; bool need_lock; @@ -54,7 +54,7 @@ static void damon_pa_mkold(unsigned long paddr) if (need_lock && !folio_trylock(folio)) goto out; - rmap_walk(&folio->page, &rwc); + rmap_walk(folio, &rwc); if (need_lock) folio_unlock(folio); @@ -87,10 +87,9 @@ struct damon_pa_access_chk_result { bool accessed; }; -static bool __damon_pa_young(struct page *page, struct vm_area_struct *vma, +static bool __damon_pa_young(struct folio *folio, struct vm_area_struct *vma, unsigned long addr, void *arg) { - struct folio *folio = page_folio(page); struct damon_pa_access_chk_result *result = arg; DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, addr, 0); @@ -133,7 +132,7 @@ static bool damon_pa_young(unsigned long paddr, unsigned long *page_sz) struct rmap_walk_control rwc = { .arg = &result, .rmap_one = __damon_pa_young, - .anon_lock = page_lock_anon_vma_read, + .anon_lock = folio_lock_anon_vma_read, }; bool need_lock; @@ -156,7 +155,7 @@ static bool damon_pa_young(unsigned long paddr, unsigned long *page_sz) return NULL; } - rmap_walk(&folio->page, &rwc); + rmap_walk(folio, &rwc); if (need_lock) folio_unlock(folio); diff --git a/mm/folio-compat.c b/mm/folio-compat.c index 968ad97bbffa..46fa179e32fb 100644 --- a/mm/folio-compat.c +++ b/mm/folio-compat.c @@ -164,10 +164,3 @@ void putback_lru_page(struct page *page) { folio_putback_lru(page_folio(page)); } - -#ifdef CONFIG_MMU -struct anon_vma *page_lock_anon_vma_read(struct page *page) -{ - return folio_lock_anon_vma_read(page_folio(page)); -} -#endif diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d55b25f1ceba..d874d50e703b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2572,7 +2572,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) * The caller does not necessarily hold an mmap_lock that would * prevent the anon_vma disappearing so we first we take a * reference to it and then lock the anon_vma for write. This - * is similar to page_lock_anon_vma_read except the write lock + * is similar to folio_lock_anon_vma_read except the write lock * is taken to serialise against parallel split or collapse * operations. */ diff --git a/mm/ksm.c b/mm/ksm.c index b25d545e0cd1..dd737f925c04 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -2588,21 +2588,21 @@ struct page *ksm_might_need_to_copy(struct page *page, return new_page; } -void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc) +void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc) { struct stable_node *stable_node; struct rmap_item *rmap_item; int search_new_forks = 0; - VM_BUG_ON_PAGE(!PageKsm(page), page); + VM_BUG_ON_FOLIO(!folio_test_ksm(folio), folio); /* * Rely on the page lock to protect against concurrent modifications * to that page's node of the stable tree. */ - VM_BUG_ON_PAGE(!PageLocked(page), page); + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); - stable_node = page_stable_node(page); + stable_node = folio_stable_node(folio); if (!stable_node) return; again: @@ -2637,11 +2637,11 @@ again: if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg)) continue; - if (!rwc->rmap_one(page, vma, addr, rwc->arg)) { + if (!rwc->rmap_one(folio, vma, addr, rwc->arg)) { anon_vma_unlock_read(anon_vma); return; } - if (rwc->done && rwc->done(page)) { + if (rwc->done && rwc->done(folio)) { anon_vma_unlock_read(anon_vma); return; } diff --git a/mm/migrate.c b/mm/migrate.c index eba3cd5376e3..2defe9aa4d0e 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -171,13 +171,11 @@ void putback_movable_pages(struct list_head *l) /* * Restore a potential migration pte to a working pte entry */ -static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, - unsigned long addr, void *old) +static bool remove_migration_pte(struct folio *folio, + struct vm_area_struct *vma, unsigned long addr, void *old) { - struct folio *folio = page_folio(page); DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION); - VM_BUG_ON_PAGE(PageTail(page), page); while (page_vma_mapped_walk(&pvmw)) { pte_t pte; swp_entry_t entry; @@ -269,9 +267,9 @@ void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked) }; if (locked) - rmap_walk_locked(&dst->page, &rwc); + rmap_walk_locked(dst, &rwc); else - rmap_walk(&dst->page, &rwc); + rmap_walk(dst, &rwc); } /* diff --git a/mm/page_idle.c b/mm/page_idle.c index 5c73a9b578da..e34ba04e22e2 100644 --- a/mm/page_idle.c +++ b/mm/page_idle.c @@ -46,11 +46,10 @@ static struct page *page_idle_get_page(unsigned long pfn) return page; } -static bool page_idle_clear_pte_refs_one(struct page *page, +static bool page_idle_clear_pte_refs_one(struct folio *folio, struct vm_area_struct *vma, unsigned long addr, void *arg) { - struct folio *folio = page_folio(page); DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, addr, 0); bool referenced = false; @@ -93,7 +92,7 @@ static void page_idle_clear_pte_refs(struct page *page) */ static const struct rmap_walk_control rwc = { .rmap_one = page_idle_clear_pte_refs_one, - .anon_lock = page_lock_anon_vma_read, + .anon_lock = folio_lock_anon_vma_read, }; bool need_lock; @@ -104,7 +103,7 @@ static void page_idle_clear_pte_refs(struct page *page) if (need_lock && !folio_trylock(folio)) return; - rmap_walk(&folio->page, (struct rmap_walk_control *)&rwc); + rmap_walk(folio, (struct rmap_walk_control *)&rwc); if (need_lock) folio_unlock(folio); diff --git a/mm/rmap.c b/mm/rmap.c index 09301aecf2fc..2cc97c64901e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -107,15 +107,15 @@ static inline void anon_vma_free(struct anon_vma *anon_vma) VM_BUG_ON(atomic_read(&anon_vma->refcount)); /* - * Synchronize against page_lock_anon_vma_read() such that + * Synchronize against folio_lock_anon_vma_read() such that * we can safely hold the lock without the anon_vma getting * freed. * * Relies on the full mb implied by the atomic_dec_and_test() from * put_anon_vma() against the acquire barrier implied by - * down_read_trylock() from page_lock_anon_vma_read(). This orders: + * down_read_trylock() from folio_lock_anon_vma_read(). This orders: * - * page_lock_anon_vma_read() VS put_anon_vma() + * folio_lock_anon_vma_read() VS put_anon_vma() * down_read_trylock() atomic_dec_and_test() * LOCK MB * atomic_read() rwsem_is_locked() @@ -168,7 +168,7 @@ static void anon_vma_chain_link(struct vm_area_struct *vma, * allocate a new one. * * Anon-vma allocations are very subtle, because we may have - * optimistically looked up an anon_vma in page_lock_anon_vma_read() + * optimistically looked up an anon_vma in folio_lock_anon_vma_read() * and that may actually touch the rwsem even in the newly * allocated vma (it depends on RCU to make sure that the * anon_vma isn't actually destroyed). @@ -799,10 +799,9 @@ struct folio_referenced_arg { /* * arg: folio_referenced_arg will be passed */ -static bool folio_referenced_one(struct page *page, struct vm_area_struct *vma, - unsigned long address, void *arg) +static bool folio_referenced_one(struct folio *folio, + struct vm_area_struct *vma, unsigned long address, void *arg) { - struct folio *folio = page_folio(page); struct folio_referenced_arg *pra = arg; DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0); int referenced = 0; @@ -894,7 +893,7 @@ int folio_referenced(struct folio *folio, int is_locked, struct rmap_walk_control rwc = { .rmap_one = folio_referenced_one, .arg = (void *)&pra, - .anon_lock = page_lock_anon_vma_read, + .anon_lock = folio_lock_anon_vma_read, }; *vm_flags = 0; @@ -919,7 +918,7 @@ int folio_referenced(struct folio *folio, int is_locked, rwc.invalid_vma = invalid_folio_referenced_vma; } - rmap_walk(&folio->page, &rwc); + rmap_walk(folio, &rwc); *vm_flags = pra.vm_flags; if (we_locked) @@ -928,10 +927,9 @@ int folio_referenced(struct folio *folio, int is_locked, return pra.referenced; } -static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, +static bool page_mkclean_one(struct folio *folio, struct vm_area_struct *vma, unsigned long address, void *arg) { - struct folio *folio = page_folio(page); DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_SYNC); struct mmu_notifier_range range; int *cleaned = arg; @@ -1025,7 +1023,7 @@ int folio_mkclean(struct folio *folio) if (!mapping) return 0; - rmap_walk(&folio->page, &rwc); + rmap_walk(folio, &rwc); return cleaned; } @@ -1410,10 +1408,9 @@ out: /* * @arg: enum ttu_flags will be passed to this argument */ -static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, +static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, unsigned long address, void *arg) { - struct folio *folio = page_folio(page); struct mm_struct *mm = vma->vm_mm; DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0); pte_t pteval; @@ -1667,9 +1664,9 @@ static bool invalid_migration_vma(struct vm_area_struct *vma, void *arg) return vma_is_temporary_stack(vma); } -static int page_not_mapped(struct page *page) +static int page_not_mapped(struct folio *folio) { - return !page_mapped(page); + return !folio_mapped(folio); } /** @@ -1689,13 +1686,13 @@ void try_to_unmap(struct folio *folio, enum ttu_flags flags) .rmap_one = try_to_unmap_one, .arg = (void *)flags, .done = page_not_mapped, - .anon_lock = page_lock_anon_vma_read, + .anon_lock = folio_lock_anon_vma_read, }; if (flags & TTU_RMAP_LOCKED) - rmap_walk_locked(&folio->page, &rwc); + rmap_walk_locked(folio, &rwc); else - rmap_walk(&folio->page, &rwc); + rmap_walk(folio, &rwc); } /* @@ -1704,10 +1701,9 @@ void try_to_unmap(struct folio *folio, enum ttu_flags flags) * If TTU_SPLIT_HUGE_PMD is specified any PMD mappings will be split into PTEs * containing migration entries. */ -static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, +static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, unsigned long address, void *arg) { - struct folio *folio = page_folio(page); struct mm_struct *mm = vma->vm_mm; DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0); pte_t pteval; @@ -1951,7 +1947,7 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags) .rmap_one = try_to_migrate_one, .arg = (void *)flags, .done = page_not_mapped, - .anon_lock = page_lock_anon_vma_read, + .anon_lock = folio_lock_anon_vma_read, }; /* @@ -1977,9 +1973,9 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags) rwc.invalid_vma = invalid_migration_vma; if (flags & TTU_RMAP_LOCKED) - rmap_walk_locked(&folio->page, &rwc); + rmap_walk_locked(folio, &rwc); else - rmap_walk(&folio->page, &rwc); + rmap_walk(folio, &rwc); } #ifdef CONFIG_DEVICE_PRIVATE @@ -1990,10 +1986,9 @@ struct make_exclusive_args { bool valid; }; -static bool page_make_device_exclusive_one(struct page *page, +static bool page_make_device_exclusive_one(struct folio *folio, struct vm_area_struct *vma, unsigned long address, void *priv) { - struct folio *folio = page_folio(page); struct mm_struct *mm = vma->vm_mm; DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0); struct make_exclusive_args *args = priv; @@ -2098,7 +2093,7 @@ static bool folio_make_device_exclusive(struct folio *folio, struct rmap_walk_control rwc = { .rmap_one = page_make_device_exclusive_one, .done = page_not_mapped, - .anon_lock = page_lock_anon_vma_read, + .anon_lock = folio_lock_anon_vma_read, .arg = &args, }; @@ -2109,7 +2104,7 @@ static bool folio_make_device_exclusive(struct folio *folio, if (!folio_test_anon(folio)) return false; - rmap_walk(&folio->page, &rwc); + rmap_walk(folio, &rwc); return args.valid && !folio_mapcount(folio); } @@ -2177,17 +2172,16 @@ void __put_anon_vma(struct anon_vma *anon_vma) anon_vma_free(root); } -static struct anon_vma *rmap_walk_anon_lock(struct page *page, +static struct anon_vma *rmap_walk_anon_lock(struct folio *folio, struct rmap_walk_control *rwc) { - struct folio *folio = page_folio(page); struct anon_vma *anon_vma; if (rwc->anon_lock) - return rwc->anon_lock(page); + return rwc->anon_lock(folio); /* - * Note: remove_migration_ptes() cannot use page_lock_anon_vma_read() + * Note: remove_migration_ptes() cannot use folio_lock_anon_vma_read() * because that depends on page_mapped(); but not all its usages * are holding mmap_lock. Users without mmap_lock are required to * take a reference count to prevent the anon_vma disappearing @@ -2209,10 +2203,9 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page, * Find all the mappings of a page using the mapping pointer and the vma chains * contained in the anon_vma struct it points to. */ -static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc, +static void rmap_walk_anon(struct folio *folio, struct rmap_walk_control *rwc, bool locked) { - struct folio *folio = page_folio(page); struct anon_vma *anon_vma; pgoff_t pgoff_start, pgoff_end; struct anon_vma_chain *avc; @@ -2222,17 +2215,17 @@ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc, /* anon_vma disappear under us? */ VM_BUG_ON_FOLIO(!anon_vma, folio); } else { - anon_vma = rmap_walk_anon_lock(page, rwc); + anon_vma = rmap_walk_anon_lock(folio, rwc); } if (!anon_vma) return; - pgoff_start = page_to_pgoff(page); - pgoff_end = pgoff_start + thp_nr_pages(page) - 1; + pgoff_start = folio_pgoff(folio); + pgoff_end = pgoff_start + folio_nr_pages(folio) - 1; anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff_start, pgoff_end) { struct vm_area_struct *vma = avc->vma; - unsigned long address = vma_address(page, vma); + unsigned long address = vma_address(&folio->page, vma); VM_BUG_ON_VMA(address == -EFAULT, vma); cond_resched(); @@ -2240,9 +2233,9 @@ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc, if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg)) continue; - if (!rwc->rmap_one(page, vma, address, rwc->arg)) + if (!rwc->rmap_one(folio, vma, address, rwc->arg)) break; - if (rwc->done && rwc->done(page)) + if (rwc->done && rwc->done(folio)) break; } @@ -2258,10 +2251,10 @@ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc, * Find all the mappings of a page using the mapping pointer and the vma chains * contained in the address_space struct it points to. */ -static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc, +static void rmap_walk_file(struct folio *folio, struct rmap_walk_control *rwc, bool locked) { - struct address_space *mapping = page_mapping(page); + struct address_space *mapping = folio_mapping(folio); pgoff_t pgoff_start, pgoff_end; struct vm_area_struct *vma; @@ -2271,18 +2264,18 @@ static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc, * structure at mapping cannot be freed and reused yet, * so we can safely take mapping->i_mmap_rwsem. */ - VM_BUG_ON_PAGE(!PageLocked(page), page); + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); if (!mapping) return; - pgoff_start = page_to_pgoff(page); - pgoff_end = pgoff_start + thp_nr_pages(page) - 1; + pgoff_start = folio_pgoff(folio); + pgoff_end = pgoff_start + folio_nr_pages(folio) - 1; if (!locked) i_mmap_lock_read(mapping); vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff_start, pgoff_end) { - unsigned long address = vma_address(page, vma); + unsigned long address = vma_address(&folio->page, vma); VM_BUG_ON_VMA(address == -EFAULT, vma); cond_resched(); @@ -2290,9 +2283,9 @@ static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc, if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg)) continue; - if (!rwc->rmap_one(page, vma, address, rwc->arg)) + if (!rwc->rmap_one(folio, vma, address, rwc->arg)) goto done; - if (rwc->done && rwc->done(page)) + if (rwc->done && rwc->done(folio)) goto done; } @@ -2301,25 +2294,25 @@ done: i_mmap_unlock_read(mapping); } -void rmap_walk(struct page *page, struct rmap_walk_control *rwc) +void rmap_walk(struct folio *folio, struct rmap_walk_control *rwc) { - if (unlikely(PageKsm(page))) - rmap_walk_ksm(page, rwc); - else if (PageAnon(page)) - rmap_walk_anon(page, rwc, false); + if (unlikely(folio_test_ksm(folio))) + rmap_walk_ksm(folio, rwc); + else if (folio_test_anon(folio)) + rmap_walk_anon(folio, rwc, false); else - rmap_walk_file(page, rwc, false); + rmap_walk_file(folio, rwc, false); } /* Like rmap_walk, but caller holds relevant rmap lock */ -void rmap_walk_locked(struct page *page, struct rmap_walk_control *rwc) +void rmap_walk_locked(struct folio *folio, struct rmap_walk_control *rwc) { /* no ksm support for now */ - VM_BUG_ON_PAGE(PageKsm(page), page); - if (PageAnon(page)) - rmap_walk_anon(page, rwc, true); + VM_BUG_ON_FOLIO(folio_test_ksm(folio), folio); + if (folio_test_anon(folio)) + rmap_walk_anon(folio, rwc, true); else - rmap_walk_file(page, rwc, true); + rmap_walk_file(folio, rwc, true); } #ifdef CONFIG_HUGETLB_PAGE -- cgit v1.2.3 From 84fbbe21894bb9be8e16df408cbfbb9466fe396d Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Sat, 29 Jan 2022 16:16:54 -0500 Subject: mm/rmap: Constify the rmap_walk_control argument The rmap walking functions do not modify the rmap_walk_control, and page_idle_clear_pte_refs() takes advantage of that to move construction of the rmap_walk_control to compile time. This lets us remove an unclean cast. Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/ksm.h | 4 ++-- include/linux/rmap.h | 4 ++-- mm/ksm.c | 2 +- mm/page_idle.c | 2 +- mm/rmap.c | 14 +++++++------- 5 files changed, 13 insertions(+), 13 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/ksm.h b/include/linux/ksm.h index 0b4f17418f64..0630e545f4cb 100644 --- a/include/linux/ksm.h +++ b/include/linux/ksm.h @@ -51,7 +51,7 @@ static inline void ksm_exit(struct mm_struct *mm) struct page *ksm_might_need_to_copy(struct page *page, struct vm_area_struct *vma, unsigned long address); -void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc); +void rmap_walk_ksm(struct folio *folio, const struct rmap_walk_control *rwc); void folio_migrate_ksm(struct folio *newfolio, struct folio *folio); #else /* !CONFIG_KSM */ @@ -79,7 +79,7 @@ static inline struct page *ksm_might_need_to_copy(struct page *page, } static inline void rmap_walk_ksm(struct folio *folio, - struct rmap_walk_control *rwc) + const struct rmap_walk_control *rwc) { } diff --git a/include/linux/rmap.h b/include/linux/rmap.h index a74d811c5b3f..17230c458341 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -292,8 +292,8 @@ struct rmap_walk_control { bool (*invalid_vma)(struct vm_area_struct *vma, void *arg); }; -void rmap_walk(struct folio *folio, struct rmap_walk_control *rwc); -void rmap_walk_locked(struct folio *folio, struct rmap_walk_control *rwc); +void rmap_walk(struct folio *folio, const struct rmap_walk_control *rwc); +void rmap_walk_locked(struct folio *folio, const struct rmap_walk_control *rwc); #else /* !CONFIG_MMU */ diff --git a/mm/ksm.c b/mm/ksm.c index dd737f925c04..eed2ff25a2fb 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -2588,7 +2588,7 @@ struct page *ksm_might_need_to_copy(struct page *page, return new_page; } -void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc) +void rmap_walk_ksm(struct folio *folio, const struct rmap_walk_control *rwc) { struct stable_node *stable_node; struct rmap_item *rmap_item; diff --git a/mm/page_idle.c b/mm/page_idle.c index e34ba04e22e2..fc0435abf909 100644 --- a/mm/page_idle.c +++ b/mm/page_idle.c @@ -103,7 +103,7 @@ static void page_idle_clear_pte_refs(struct page *page) if (need_lock && !folio_trylock(folio)) return; - rmap_walk(folio, (struct rmap_walk_control *)&rwc); + rmap_walk(folio, &rwc); if (need_lock) folio_unlock(folio); diff --git a/mm/rmap.c b/mm/rmap.c index 2cc97c64901e..8192cb5809bc 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -2173,7 +2173,7 @@ void __put_anon_vma(struct anon_vma *anon_vma) } static struct anon_vma *rmap_walk_anon_lock(struct folio *folio, - struct rmap_walk_control *rwc) + const struct rmap_walk_control *rwc) { struct anon_vma *anon_vma; @@ -2203,8 +2203,8 @@ static struct anon_vma *rmap_walk_anon_lock(struct folio *folio, * Find all the mappings of a page using the mapping pointer and the vma chains * contained in the anon_vma struct it points to. */ -static void rmap_walk_anon(struct folio *folio, struct rmap_walk_control *rwc, - bool locked) +static void rmap_walk_anon(struct folio *folio, + const struct rmap_walk_control *rwc, bool locked) { struct anon_vma *anon_vma; pgoff_t pgoff_start, pgoff_end; @@ -2251,8 +2251,8 @@ static void rmap_walk_anon(struct folio *folio, struct rmap_walk_control *rwc, * Find all the mappings of a page using the mapping pointer and the vma chains * contained in the address_space struct it points to. */ -static void rmap_walk_file(struct folio *folio, struct rmap_walk_control *rwc, - bool locked) +static void rmap_walk_file(struct folio *folio, + const struct rmap_walk_control *rwc, bool locked) { struct address_space *mapping = folio_mapping(folio); pgoff_t pgoff_start, pgoff_end; @@ -2294,7 +2294,7 @@ done: i_mmap_unlock_read(mapping); } -void rmap_walk(struct folio *folio, struct rmap_walk_control *rwc) +void rmap_walk(struct folio *folio, const struct rmap_walk_control *rwc) { if (unlikely(folio_test_ksm(folio))) rmap_walk_ksm(folio, rwc); @@ -2305,7 +2305,7 @@ void rmap_walk(struct folio *folio, struct rmap_walk_control *rwc) } /* Like rmap_walk, but caller holds relevant rmap lock */ -void rmap_walk_locked(struct folio *folio, struct rmap_walk_control *rwc) +void rmap_walk_locked(struct folio *folio, const struct rmap_walk_control *rwc) { /* no ksm support for now */ VM_BUG_ON_FOLIO(folio_test_ksm(folio), folio); -- cgit v1.2.3 From 6a8e0596f00469c15ec556b9f3624acd2e9a04f9 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Thu, 28 Apr 2022 23:16:10 -0700 Subject: mm: rmap: introduce pfn_mkclean_range() to cleans PTEs The page_mkclean_one() is supposed to be used with the pfn that has a associated struct page, but not all the pfns (e.g. DAX) have a struct page. Introduce a new function pfn_mkclean_range() to cleans the PTEs (including PMDs) mapped with range of pfns which has no struct page associated with them. This helper will be used by DAX device in the next patch to make pfns clean. Link: https://lkml.kernel.org/r/20220403053957.10770-4-songmuchun@bytedance.com Signed-off-by: Muchun Song Cc: Alistair Popple Cc: Al Viro Cc: Christoph Hellwig Cc: Dan Williams Cc: Hugh Dickins Cc: Jan Kara Cc: "Kirill A. Shutemov" Cc: Matthew Wilcox Cc: Ralph Campbell Cc: Ross Zwisler Cc: Xiongchun Duan Cc: Xiyu Yang Cc: Yang Shi Signed-off-by: Andrew Morton --- include/linux/rmap.h | 3 +++ mm/internal.h | 26 +++++++++++++-------- mm/rmap.c | 65 +++++++++++++++++++++++++++++++++++++++++++--------- 3 files changed, 74 insertions(+), 20 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 17230c458341..8573aae50d96 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -261,6 +261,9 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *); */ int folio_mkclean(struct folio *); +int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff, + struct vm_area_struct *vma); + void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked); /* diff --git a/mm/internal.h b/mm/internal.h index 6293468fd7b7..69a5eabe0943 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -525,26 +525,22 @@ void mlock_page_drain_remote(int cpu); extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma); /* - * At what user virtual address is page expected in vma? - * Returns -EFAULT if all of the page is outside the range of vma. - * If page is a compound head, the entire compound page is considered. + * Return the start of user virtual address at the specific offset within + * a vma. */ static inline unsigned long -vma_address(struct page *page, struct vm_area_struct *vma) +vma_pgoff_address(pgoff_t pgoff, unsigned long nr_pages, + struct vm_area_struct *vma) { - pgoff_t pgoff; unsigned long address; - VM_BUG_ON_PAGE(PageKsm(page), page); /* KSM page->index unusable */ - pgoff = page_to_pgoff(page); if (pgoff >= vma->vm_pgoff) { address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); /* Check for address beyond vma (or wrapped through 0?) */ if (address < vma->vm_start || address >= vma->vm_end) address = -EFAULT; - } else if (PageHead(page) && - pgoff + compound_nr(page) - 1 >= vma->vm_pgoff) { + } else if (pgoff + nr_pages - 1 >= vma->vm_pgoff) { /* Test above avoids possibility of wrap to 0 on 32-bit */ address = vma->vm_start; } else { @@ -553,6 +549,18 @@ vma_address(struct page *page, struct vm_area_struct *vma) return address; } +/* + * Return the start of user virtual address of a page within a vma. + * Returns -EFAULT if all of the page is outside the range of vma. + * If page is a compound head, the entire compound page is considered. + */ +static inline unsigned long +vma_address(struct page *page, struct vm_area_struct *vma) +{ + VM_BUG_ON_PAGE(PageKsm(page), page); /* KSM page->index unusable */ + return vma_pgoff_address(page_to_pgoff(page), compound_nr(page), vma); +} + /* * Then at what user virtual address will none of the range be found in vma? * Assumes that vma_address() already returned a good starting address. diff --git a/mm/rmap.c b/mm/rmap.c index c194ec913c68..ad509b3a89ee 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -929,12 +929,12 @@ int folio_referenced(struct folio *folio, int is_locked, return pra.referenced; } -static bool page_mkclean_one(struct folio *folio, struct vm_area_struct *vma, - unsigned long address, void *arg) +static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw) { - DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_SYNC); + int cleaned = 0; + struct vm_area_struct *vma = pvmw->vma; struct mmu_notifier_range range; - int *cleaned = arg; + unsigned long address = pvmw->address; /* * We have to assume the worse case ie pmd for invalidation. Note that @@ -942,16 +942,16 @@ static bool page_mkclean_one(struct folio *folio, struct vm_area_struct *vma, */ mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE, 0, vma, vma->vm_mm, address, - vma_address_end(&pvmw)); + vma_address_end(pvmw)); mmu_notifier_invalidate_range_start(&range); - while (page_vma_mapped_walk(&pvmw)) { + while (page_vma_mapped_walk(pvmw)) { int ret = 0; - address = pvmw.address; - if (pvmw.pte) { + address = pvmw->address; + if (pvmw->pte) { pte_t entry; - pte_t *pte = pvmw.pte; + pte_t *pte = pvmw->pte; if (!pte_dirty(*pte) && !pte_write(*pte)) continue; @@ -964,7 +964,7 @@ static bool page_mkclean_one(struct folio *folio, struct vm_area_struct *vma, ret = 1; } else { #ifdef CONFIG_TRANSPARENT_HUGEPAGE - pmd_t *pmd = pvmw.pmd; + pmd_t *pmd = pvmw->pmd; pmd_t entry; if (!pmd_dirty(*pmd) && !pmd_write(*pmd)) @@ -991,11 +991,22 @@ static bool page_mkclean_one(struct folio *folio, struct vm_area_struct *vma, * See Documentation/vm/mmu_notifier.rst */ if (ret) - (*cleaned)++; + cleaned++; } mmu_notifier_invalidate_range_end(&range); + return cleaned; +} + +static bool page_mkclean_one(struct folio *folio, struct vm_area_struct *vma, + unsigned long address, void *arg) +{ + DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_SYNC); + int *cleaned = arg; + + *cleaned += page_vma_mkclean_one(&pvmw); + return true; } @@ -1032,6 +1043,38 @@ int folio_mkclean(struct folio *folio) } EXPORT_SYMBOL_GPL(folio_mkclean); +/** + * pfn_mkclean_range - Cleans the PTEs (including PMDs) mapped with range of + * [@pfn, @pfn + @nr_pages) at the specific offset (@pgoff) + * within the @vma of shared mappings. And since clean PTEs + * should also be readonly, write protects them too. + * @pfn: start pfn. + * @nr_pages: number of physically contiguous pages srarting with @pfn. + * @pgoff: page offset that the @pfn mapped with. + * @vma: vma that @pfn mapped within. + * + * Returns the number of cleaned PTEs (including PMDs). + */ +int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff, + struct vm_area_struct *vma) +{ + struct page_vma_mapped_walk pvmw = { + .pfn = pfn, + .nr_pages = nr_pages, + .pgoff = pgoff, + .vma = vma, + .flags = PVMW_SYNC, + }; + + if (invalid_mkclean_vma(vma, NULL)) + return 0; + + pvmw.address = vma_pgoff_address(pgoff, nr_pages, vma); + VM_BUG_ON_VMA(pvmw.address == -EFAULT, vma); + + return page_vma_mkclean_one(&pvmw); +} + /** * page_move_anon_rmap - move a page to our anon_vma * @page: the page to move to our anon_vma -- cgit v1.2.3 From fb3d824d1a46c5bb0584ea88f32dc2495544aebf Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 9 May 2022 18:20:43 -0700 Subject: mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap() ... and move the special check for pinned pages into page_try_dup_anon_rmap() to prepare for tracking exclusive anonymous pages via a new pageflag, clearing it only after making sure that there are no GUP pins on the anonymous page. We really only care about pins on anonymous pages, because they are prone to getting replaced in the COW handler once mapped R/O. For !anon pages in cow-mappings (!VM_SHARED && VM_MAYWRITE) we shouldn't really care about that, at least not that I could come up with an example. Let's drop the is_cow_mapping() check from page_needs_cow_for_dma(), as we know we're dealing with anonymous pages. Also, drop the handling of pinned pages from copy_huge_pud() and add a comment if ever supporting anonymous pages on the PUD level. This is a preparation for tracking exclusivity of anonymous pages in the rmap code, and disallowing marking a page shared (-> failing to duplicate) if there are GUP pins on a page. Link: https://lkml.kernel.org/r/20220428083441.37290-5-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Vlastimil Babka Cc: Andrea Arcangeli Cc: Christoph Hellwig Cc: David Rientjes Cc: Don Dutile Cc: Hugh Dickins Cc: Jan Kara Cc: Jann Horn Cc: Jason Gunthorpe Cc: John Hubbard Cc: Khalid Aziz Cc: "Kirill A. Shutemov" Cc: Liang Zhang Cc: "Matthew Wilcox (Oracle)" Cc: Michal Hocko Cc: Mike Kravetz Cc: Mike Rapoport Cc: Nadav Amit Cc: Oded Gabbay Cc: Oleg Nesterov Cc: Pedro Demarchi Gomes Cc: Peter Xu Cc: Rik van Riel Cc: Roman Gushchin Cc: Shakeel Butt Cc: Yang Shi Signed-off-by: Andrew Morton --- include/linux/mm.h | 5 +---- include/linux/rmap.h | 49 ++++++++++++++++++++++++++++++++++++++++++++++++- mm/huge_memory.c | 27 ++++++++------------------- mm/hugetlb.c | 16 +++++++++------- mm/memory.c | 17 ++++++++++++----- mm/migrate.c | 2 +- 6 files changed, 79 insertions(+), 37 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/mm.h b/include/linux/mm.h index a02812178562..9ee3ae51e8e3 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1575,16 +1575,13 @@ static inline bool page_maybe_dma_pinned(struct page *page) /* * This should most likely only be called during fork() to see whether we - * should break the cow immediately for a page on the src mm. + * should break the cow immediately for an anon page on the src mm. * * The caller has to hold the PT lock and the vma->vm_mm->->write_protect_seq. */ static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma, struct page *page) { - if (!is_cow_mapping(vma->vm_flags)) - return false; - VM_BUG_ON(!(raw_read_seqcount(&vma->vm_mm->write_protect_seq) & 1)); if (!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags)) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 8573aae50d96..73f41544084f 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -12,6 +12,7 @@ #include #include #include +#include /* * The anon_vma heads a list of private "related" vmas, to scan if @@ -182,11 +183,57 @@ void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *, void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long address); -static inline void page_dup_rmap(struct page *page, bool compound) +static inline void __page_dup_rmap(struct page *page, bool compound) { atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount); } +static inline void page_dup_file_rmap(struct page *page, bool compound) +{ + __page_dup_rmap(page, compound); +} + +/** + * page_try_dup_anon_rmap - try duplicating a mapping of an already mapped + * anonymous page + * @page: the page to duplicate the mapping for + * @compound: the page is mapped as compound or as a small page + * @vma: the source vma + * + * The caller needs to hold the PT lock and the vma->vma_mm->write_protect_seq. + * + * Duplicating the mapping can only fail if the page may be pinned; device + * private pages cannot get pinned and consequently this function cannot fail. + * + * If duplicating the mapping succeeds, the page has to be mapped R/O into + * the parent and the child. It must *not* get mapped writable after this call. + * + * Returns 0 if duplicating the mapping succeeded. Returns -EBUSY otherwise. + */ +static inline int page_try_dup_anon_rmap(struct page *page, bool compound, + struct vm_area_struct *vma) +{ + VM_BUG_ON_PAGE(!PageAnon(page), page); + + /* + * If this page may have been pinned by the parent process, + * don't allow to duplicate the mapping but instead require to e.g., + * copy the page immediately for the child so that we'll always + * guarantee the pinned page won't be randomly replaced in the + * future on write faults. + */ + if (likely(!is_device_private_page(page) && + unlikely(page_needs_cow_for_dma(vma, page)))) + return -EBUSY; + + /* + * It's okay to share the anon page between both processes, mapping + * the page R/O into both processes. + */ + __page_dup_rmap(page, compound); + return 0; +} + /* * Called from mm/vmscan.c to handle paging out */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c468fee595ff..baf4ea6d8e1a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1097,23 +1097,16 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, src_page = pmd_page(pmd); VM_BUG_ON_PAGE(!PageHead(src_page), src_page); - /* - * If this page is a potentially pinned page, split and retry the fault - * with smaller page size. Normally this should not happen because the - * userspace should use MADV_DONTFORK upon pinned regions. This is a - * best effort that the pinned pages won't be replaced by another - * random page during the coming copy-on-write. - */ - if (unlikely(page_needs_cow_for_dma(src_vma, src_page))) { + get_page(src_page); + if (unlikely(page_try_dup_anon_rmap(src_page, true, src_vma))) { + /* Page maybe pinned: split and retry the fault on PTEs. */ + put_page(src_page); pte_free(dst_mm, pgtable); spin_unlock(src_ptl); spin_unlock(dst_ptl); __split_huge_pmd(src_vma, src_pmd, addr, false, NULL); return -EAGAIN; } - - get_page(src_page); - page_dup_rmap(src_page, true); add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); out_zero_page: mm_inc_nr_ptes(dst_mm); @@ -1217,14 +1210,10 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm, /* No huge zero pud yet */ } - /* Please refer to comments in copy_huge_pmd() */ - if (unlikely(page_needs_cow_for_dma(vma, pud_page(pud)))) { - spin_unlock(src_ptl); - spin_unlock(dst_ptl); - __split_huge_pud(vma, src_pud, addr); - return -EAGAIN; - } - + /* + * TODO: once we support anonymous pages, use page_try_dup_anon_rmap() + * and split if duplicating fails. + */ pudp_set_wrprotect(src_mm, addr, src_pud); pud = pud_mkold(pud_wrprotect(pud)); set_pud_at(dst_mm, addr, dst_pud, pud); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 7389cd8a9a87..7a6052a984e9 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4806,15 +4806,18 @@ again: get_page(ptepage); /* - * This is a rare case where we see pinned hugetlb - * pages while they're prone to COW. We need to do the - * COW earlier during fork. + * Failing to duplicate the anon rmap is a rare case + * where we see pinned hugetlb pages while they're + * prone to COW. We need to do the COW earlier during + * fork. * * When pre-allocating the page or copying data, we * need to be without the pgtable locks since we could * sleep during the process. */ - if (unlikely(page_needs_cow_for_dma(vma, ptepage))) { + if (!PageAnon(ptepage)) { + page_dup_file_rmap(ptepage, true); + } else if (page_try_dup_anon_rmap(ptepage, true, vma)) { pte_t src_pte_old = entry; struct page *new; @@ -4861,7 +4864,6 @@ again: entry = huge_pte_wrprotect(entry); } - page_dup_rmap(ptepage, true); set_huge_pte_at(dst, addr, dst_pte, entry); hugetlb_count_add(npages, dst); } @@ -5541,7 +5543,7 @@ retry: ClearHPageRestoreReserve(page); hugepage_add_new_anon_rmap(page, vma, haddr); } else - page_dup_rmap(page, true); + page_dup_file_rmap(page, true); new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE) && (vma->vm_flags & VM_SHARED))); set_huge_pte_at(mm, haddr, ptep, new_pte); @@ -5902,7 +5904,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, goto out_release_unlock; if (vm_shared) { - page_dup_rmap(page, true); + page_dup_file_rmap(page, true); } else { ClearHPageRestoreReserve(page); hugepage_add_new_anon_rmap(page, dst_vma, dst_addr); diff --git a/mm/memory.c b/mm/memory.c index 9bdfce15e98f..9f5f829a1b1f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -825,7 +825,8 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, */ get_page(page); rss[mm_counter(page)]++; - page_dup_rmap(page, false); + /* Cannot fail as these pages cannot get pinned. */ + BUG_ON(page_try_dup_anon_rmap(page, false, src_vma)); /* * We do not preserve soft-dirty information, because so @@ -921,18 +922,24 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, struct page *page; page = vm_normal_page(src_vma, addr, pte); - if (page && unlikely(page_needs_cow_for_dma(src_vma, page))) { + if (page && PageAnon(page)) { /* * If this page may have been pinned by the parent process, * copy the page immediately for the child so that we'll always * guarantee the pinned page won't be randomly replaced in the * future. */ - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte, - addr, rss, prealloc, page); + get_page(page); + if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) { + /* Page maybe pinned, we have to copy. */ + put_page(page); + return copy_present_page(dst_vma, src_vma, dst_pte, src_pte, + addr, rss, prealloc, page); + } + rss[mm_counter(page)]++; } else if (page) { get_page(page); - page_dup_rmap(page, false); + page_dup_file_rmap(page, false); rss[mm_counter(page)]++; } diff --git a/mm/migrate.c b/mm/migrate.c index 77592288fbc0..6ca6e89ded94 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -234,7 +234,7 @@ static bool remove_migration_pte(struct folio *folio, if (folio_test_anon(folio)) hugepage_add_anon_rmap(new, vma, pvmw.address); else - page_dup_rmap(new, true); + page_dup_file_rmap(new, true); set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); } else #endif -- cgit v1.2.3 From 14f9135d547060d1d0c182501201f8da19895fe3 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 9 May 2022 18:20:43 -0700 Subject: mm/rmap: convert RMAP flags to a proper distinct rmap_t type We want to pass the flags to more than one anon rmap function, getting rid of special "do_page_add_anon_rmap()". So let's pass around a distinct __bitwise type and refine documentation. Link: https://lkml.kernel.org/r/20220428083441.37290-6-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Vlastimil Babka Cc: Andrea Arcangeli Cc: Christoph Hellwig Cc: David Rientjes Cc: Don Dutile Cc: Hugh Dickins Cc: Jan Kara Cc: Jann Horn Cc: Jason Gunthorpe Cc: John Hubbard Cc: Khalid Aziz Cc: "Kirill A. Shutemov" Cc: Liang Zhang Cc: "Matthew Wilcox (Oracle)" Cc: Michal Hocko Cc: Mike Kravetz Cc: Mike Rapoport Cc: Nadav Amit Cc: Oded Gabbay Cc: Oleg Nesterov Cc: Pedro Demarchi Gomes Cc: Peter Xu Cc: Rik van Riel Cc: Roman Gushchin Cc: Shakeel Butt Cc: Yang Shi Signed-off-by: Andrew Morton --- include/linux/rmap.h | 22 ++++++++++++++++++---- mm/memory.c | 6 +++--- mm/rmap.c | 7 ++++--- 3 files changed, 25 insertions(+), 10 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 73f41544084f..c53a38151089 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -160,9 +160,23 @@ static inline void anon_vma_merge(struct vm_area_struct *vma, struct anon_vma *page_get_anon_vma(struct page *page); -/* bitflags for do_page_add_anon_rmap() */ -#define RMAP_EXCLUSIVE 0x01 -#define RMAP_COMPOUND 0x02 +/* RMAP flags, currently only relevant for some anon rmap operations. */ +typedef int __bitwise rmap_t; + +/* + * No special request: if the page is a subpage of a compound page, it is + * mapped via a PTE. The mapped (sub)page is possibly shared between processes. + */ +#define RMAP_NONE ((__force rmap_t)0) + +/* The (sub)page is exclusive to a single process. */ +#define RMAP_EXCLUSIVE ((__force rmap_t)BIT(0)) + +/* + * The compound page is not mapped via PTEs, but instead via a single PMD and + * should be accounted accordingly. + */ +#define RMAP_COMPOUND ((__force rmap_t)BIT(1)) /* * rmap interfaces called when adding or removing pte of page @@ -171,7 +185,7 @@ void page_move_anon_rmap(struct page *, struct vm_area_struct *); void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long address, bool compound); void do_page_add_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long address, int flags); + unsigned long address, rmap_t flags); void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long address, bool compound); void page_add_file_rmap(struct page *, struct vm_area_struct *, diff --git a/mm/memory.c b/mm/memory.c index 9f5f829a1b1f..4cd8cadf1268 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3511,10 +3511,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) struct vm_area_struct *vma = vmf->vma; struct page *page = NULL, *swapcache; struct swap_info_struct *si = NULL; + rmap_t rmap_flags = RMAP_NONE; swp_entry_t entry; pte_t pte; int locked; - int exclusive = 0; vm_fault_t ret = 0; void *shadow = NULL; @@ -3689,7 +3689,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) pte = maybe_mkwrite(pte_mkdirty(pte), vma); vmf->flags &= ~FAULT_FLAG_WRITE; ret |= VM_FAULT_WRITE; - exclusive = RMAP_EXCLUSIVE; + rmap_flags |= RMAP_EXCLUSIVE; } flush_icache_page(vma, page); if (pte_swp_soft_dirty(vmf->orig_pte)) @@ -3705,7 +3705,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) page_add_new_anon_rmap(page, vma, vmf->address, false); lru_cache_add_inactive_or_unevictable(page, vma); } else { - do_page_add_anon_rmap(page, vma, vmf->address, exclusive); + do_page_add_anon_rmap(page, vma, vmf->address, rmap_flags); } set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); diff --git a/mm/rmap.c b/mm/rmap.c index 91a63dc636ad..23a41132995e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1181,7 +1181,8 @@ static void __page_check_anon_rmap(struct page *page, void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address, bool compound) { - do_page_add_anon_rmap(page, vma, address, compound ? RMAP_COMPOUND : 0); + do_page_add_anon_rmap(page, vma, address, + compound ? RMAP_COMPOUND : RMAP_NONE); } /* @@ -1190,7 +1191,7 @@ void page_add_anon_rmap(struct page *page, * Everybody else should continue to use page_add_anon_rmap above. */ void do_page_add_anon_rmap(struct page *page, - struct vm_area_struct *vma, unsigned long address, int flags) + struct vm_area_struct *vma, unsigned long address, rmap_t flags) { bool compound = flags & RMAP_COMPOUND; bool first; @@ -1229,7 +1230,7 @@ void do_page_add_anon_rmap(struct page *page, /* address might be in next vma when migration races vma_adjust */ else if (first) __page_set_anon_rmap(page, vma, address, - flags & RMAP_EXCLUSIVE); + !!(flags & RMAP_EXCLUSIVE)); else __page_check_anon_rmap(page, vma, address); -- cgit v1.2.3 From f1e2db12e45baaa2d366f87c885968096c2ff5aa Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 9 May 2022 18:20:43 -0700 Subject: mm/rmap: remove do_page_add_anon_rmap() ... and instead convert page_add_anon_rmap() to accept flags. Passing flags instead of bools is usually nicer either way, and we want to more often also pass RMAP_EXCLUSIVE in follow up patches when detecting that an anonymous page is exclusive: for example, when restoring an anonymous page from a writable migration entry. This is a preparation for marking an anonymous page inside page_add_anon_rmap() as exclusive when RMAP_EXCLUSIVE is passed. Link: https://lkml.kernel.org/r/20220428083441.37290-7-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Vlastimil Babka Cc: Andrea Arcangeli Cc: Christoph Hellwig Cc: David Rientjes Cc: Don Dutile Cc: Hugh Dickins Cc: Jan Kara Cc: Jann Horn Cc: Jason Gunthorpe Cc: John Hubbard Cc: Khalid Aziz Cc: "Kirill A. Shutemov" Cc: Liang Zhang Cc: "Matthew Wilcox (Oracle)" Cc: Michal Hocko Cc: Mike Kravetz Cc: Mike Rapoport Cc: Nadav Amit Cc: Oded Gabbay Cc: Oleg Nesterov Cc: Pedro Demarchi Gomes Cc: Peter Xu Cc: Rik van Riel Cc: Roman Gushchin Cc: Shakeel Butt Cc: Yang Shi Signed-off-by: Andrew Morton --- include/linux/rmap.h | 2 -- mm/huge_memory.c | 2 +- mm/ksm.c | 2 +- mm/memory.c | 4 ++-- mm/migrate.c | 3 ++- mm/rmap.c | 14 +------------- mm/swapfile.c | 2 +- 7 files changed, 8 insertions(+), 21 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index c53a38151089..643801a937f3 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -183,8 +183,6 @@ typedef int __bitwise rmap_t; */ void page_move_anon_rmap(struct page *, struct vm_area_struct *); void page_add_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long address, bool compound); -void do_page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long address, rmap_t flags); void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long address, bool compound); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index baf4ea6d8e1a..6232b6817fab 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3072,7 +3072,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new) pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde)); if (PageAnon(new)) - page_add_anon_rmap(new, vma, mmun_start, true); + page_add_anon_rmap(new, vma, mmun_start, RMAP_COMPOUND); else page_add_file_rmap(new, vma, true); set_pmd_at(mm, mmun_start, pvmw->pmd, pmde); diff --git a/mm/ksm.c b/mm/ksm.c index 94bb0f049806..2b6692a7df6a 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1156,7 +1156,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, */ if (!is_zero_pfn(page_to_pfn(kpage))) { get_page(kpage); - page_add_anon_rmap(kpage, vma, addr, false); + page_add_anon_rmap(kpage, vma, addr, RMAP_NONE); newpte = mk_pte(kpage, vma->vm_page_prot); } else { newpte = pte_mkspecial(pfn_pte(page_to_pfn(kpage), diff --git a/mm/memory.c b/mm/memory.c index 4cd8cadf1268..8e92010f3d89 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -725,7 +725,7 @@ static void restore_exclusive_pte(struct vm_area_struct *vma, * created when the swap entry was made. */ if (PageAnon(page)) - page_add_anon_rmap(page, vma, address, false); + page_add_anon_rmap(page, vma, address, RMAP_NONE); else /* * Currently device exclusive access only supports anonymous @@ -3705,7 +3705,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) page_add_new_anon_rmap(page, vma, vmf->address, false); lru_cache_add_inactive_or_unevictable(page, vma); } else { - do_page_add_anon_rmap(page, vma, vmf->address, rmap_flags); + page_add_anon_rmap(page, vma, vmf->address, rmap_flags); } set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); diff --git a/mm/migrate.c b/mm/migrate.c index 6ca6e89ded94..381c7e05e8d4 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -240,7 +240,8 @@ static bool remove_migration_pte(struct folio *folio, #endif { if (folio_test_anon(folio)) - page_add_anon_rmap(new, vma, pvmw.address, false); + page_add_anon_rmap(new, vma, pvmw.address, + RMAP_NONE); else page_add_file_rmap(new, vma, false); set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); diff --git a/mm/rmap.c b/mm/rmap.c index 23a41132995e..a0cbbf201c98 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1171,7 +1171,7 @@ static void __page_check_anon_rmap(struct page *page, * @page: the page to add the mapping to * @vma: the vm area in which the mapping is added * @address: the user virtual address mapped - * @compound: charge the page as compound or small page + * @flags: the rmap flags * * The caller needs to hold the pte lock, and the page must be locked in * the anon_vma case: to serialize mapping,index checking after setting, @@ -1179,18 +1179,6 @@ static void __page_check_anon_rmap(struct page *page, * (but PageKsm is never downgraded to PageAnon). */ void page_add_anon_rmap(struct page *page, - struct vm_area_struct *vma, unsigned long address, bool compound) -{ - do_page_add_anon_rmap(page, vma, address, - compound ? RMAP_COMPOUND : RMAP_NONE); -} - -/* - * Special version of the above for do_swap_page, which often runs - * into pages that are exclusively owned by the current process. - * Everybody else should continue to use page_add_anon_rmap above. - */ -void do_page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address, rmap_t flags) { bool compound = flags & RMAP_COMPOUND; diff --git a/mm/swapfile.c b/mm/swapfile.c index 63c61f8b2611..1ba525a2179d 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1800,7 +1800,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, inc_mm_counter(vma->vm_mm, MM_ANONPAGES); get_page(page); if (page == swapcache) { - page_add_anon_rmap(page, vma, addr, false); + page_add_anon_rmap(page, vma, addr, RMAP_NONE); } else { /* ksm created a completely new copy */ page_add_new_anon_rmap(page, vma, addr, false); lru_cache_add_inactive_or_unevictable(page, vma); -- cgit v1.2.3 From 28c5209dfd5f86f4398ce01bfac8508b2c4d4050 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 9 May 2022 18:20:43 -0700 Subject: mm/rmap: pass rmap flags to hugepage_add_anon_rmap() Let's prepare for passing RMAP_EXCLUSIVE, similarly as we do for page_add_anon_rmap() now. RMAP_COMPOUND is implicit for hugetlb pages and ignored. Link: https://lkml.kernel.org/r/20220428083441.37290-8-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Vlastimil Babka Cc: Andrea Arcangeli Cc: Christoph Hellwig Cc: David Rientjes Cc: Don Dutile Cc: Hugh Dickins Cc: Jan Kara Cc: Jann Horn Cc: Jason Gunthorpe Cc: John Hubbard Cc: Khalid Aziz Cc: "Kirill A. Shutemov" Cc: Liang Zhang Cc: "Matthew Wilcox (Oracle)" Cc: Michal Hocko Cc: Mike Kravetz Cc: Mike Rapoport Cc: Nadav Amit Cc: Oded Gabbay Cc: Oleg Nesterov Cc: Pedro Demarchi Gomes Cc: Peter Xu Cc: Rik van Riel Cc: Roman Gushchin Cc: Shakeel Butt Cc: Yang Shi Signed-off-by: Andrew Morton --- include/linux/rmap.h | 2 +- mm/migrate.c | 3 ++- mm/rmap.c | 9 ++++++--- 3 files changed, 9 insertions(+), 5 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 643801a937f3..90e1e6925789 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -191,7 +191,7 @@ void page_add_file_rmap(struct page *, struct vm_area_struct *, void page_remove_rmap(struct page *, struct vm_area_struct *, bool compound); void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long address); + unsigned long address, rmap_t flags); void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long address); diff --git a/mm/migrate.c b/mm/migrate.c index 381c7e05e8d4..92e932f08be5 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -232,7 +232,8 @@ static bool remove_migration_pte(struct folio *folio, pte = pte_mkhuge(pte); pte = arch_make_huge_pte(pte, shift, vma->vm_flags); if (folio_test_anon(folio)) - hugepage_add_anon_rmap(new, vma, pvmw.address); + hugepage_add_anon_rmap(new, vma, pvmw.address, + RMAP_NONE); else page_dup_file_rmap(new, true); set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); diff --git a/mm/rmap.c b/mm/rmap.c index a0cbbf201c98..32630f1b1ee1 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -2391,9 +2391,11 @@ void rmap_walk_locked(struct folio *folio, const struct rmap_walk_control *rwc) * The following two functions are for anonymous (private mapped) hugepages. * Unlike common anonymous pages, anonymous hugepages have no accounting code * and no lru code, because we handle hugepages differently from common pages. + * + * RMAP_COMPOUND is ignored. */ -void hugepage_add_anon_rmap(struct page *page, - struct vm_area_struct *vma, unsigned long address) +void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma, + unsigned long address, rmap_t flags) { struct anon_vma *anon_vma = vma->anon_vma; int first; @@ -2403,7 +2405,8 @@ void hugepage_add_anon_rmap(struct page *page, /* address might be in next vma when migration races vma_adjust */ first = atomic_inc_and_test(compound_mapcount_ptr(page)); if (first) - __page_set_anon_rmap(page, vma, address, 0); + __page_set_anon_rmap(page, vma, address, + !!(flags & RMAP_EXCLUSIVE)); } void hugepage_add_new_anon_rmap(struct page *page, -- cgit v1.2.3 From 40f2bbf71161fa9195c7869004290003af152375 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 9 May 2022 18:20:43 -0700 Subject: mm/rmap: drop "compound" parameter from page_add_new_anon_rmap() New anonymous pages are always mapped natively: only THP/khugepaged code maps a new compound anonymous page and passes "true". Otherwise, we're just dealing with simple, non-compound pages. Let's give the interface clearer semantics and document these. Remove the PageTransCompound() sanity check from page_add_new_anon_rmap(). Link: https://lkml.kernel.org/r/20220428083441.37290-9-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Vlastimil Babka Cc: Andrea Arcangeli Cc: Christoph Hellwig Cc: David Rientjes Cc: Don Dutile Cc: Hugh Dickins Cc: Jan Kara Cc: Jann Horn Cc: Jason Gunthorpe Cc: John Hubbard Cc: Khalid Aziz Cc: "Kirill A. Shutemov" Cc: Liang Zhang Cc: "Matthew Wilcox (Oracle)" Cc: Michal Hocko Cc: Mike Kravetz Cc: Mike Rapoport Cc: Nadav Amit Cc: Oded Gabbay Cc: Oleg Nesterov Cc: Pedro Demarchi Gomes Cc: Peter Xu Cc: Rik van Riel Cc: Roman Gushchin Cc: Shakeel Butt Cc: Yang Shi Signed-off-by: Andrew Morton --- include/linux/rmap.h | 3 ++- kernel/events/uprobes.c | 2 +- mm/huge_memory.c | 2 +- mm/khugepaged.c | 2 +- mm/memory.c | 10 +++++----- mm/migrate_device.c | 2 +- mm/rmap.c | 11 ++++++----- mm/swapfile.c | 2 +- mm/userfaultfd.c | 2 +- 9 files changed, 19 insertions(+), 17 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 90e1e6925789..e4156921eea9 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -185,11 +185,12 @@ void page_move_anon_rmap(struct page *, struct vm_area_struct *); void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long address, rmap_t flags); void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, - unsigned long address, bool compound); + unsigned long address); void page_add_file_rmap(struct page *, struct vm_area_struct *, bool compound); void page_remove_rmap(struct page *, struct vm_area_struct *, bool compound); + void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long address, rmap_t flags); void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *, diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 6418083901d4..4ef5385815d3 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -180,7 +180,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, if (new_page) { get_page(new_page); - page_add_new_anon_rmap(new_page, vma, addr, false); + page_add_new_anon_rmap(new_page, vma, addr); lru_cache_add_inactive_or_unevictable(new_page, vma); } else /* no new page, just dec_mm_counter for old_page */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 6232b6817fab..c0365280b481 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -647,7 +647,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf, entry = mk_huge_pmd(page, vma->vm_page_prot); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); - page_add_new_anon_rmap(page, vma, haddr, true); + page_add_new_anon_rmap(page, vma, haddr); lru_cache_add_inactive_or_unevictable(page, vma); pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry); diff --git a/mm/khugepaged.c b/mm/khugepaged.c index ac53ad2c9bb1..a2560f970881 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1183,7 +1183,7 @@ static void collapse_huge_page(struct mm_struct *mm, spin_lock(pmd_ptl); BUG_ON(!pmd_none(*pmd)); - page_add_new_anon_rmap(new_page, vma, address, true); + page_add_new_anon_rmap(new_page, vma, address); lru_cache_add_inactive_or_unevictable(new_page, vma); pgtable_trans_huge_deposit(mm, pmd, pgtable); set_pmd_at(mm, address, pmd, _pmd); diff --git a/mm/memory.c b/mm/memory.c index 8e92010f3d89..3dedb575baef 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -893,7 +893,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma *prealloc = NULL; copy_user_highpage(new_page, page, addr, src_vma); __SetPageUptodate(new_page); - page_add_new_anon_rmap(new_page, dst_vma, addr, false); + page_add_new_anon_rmap(new_page, dst_vma, addr); lru_cache_add_inactive_or_unevictable(new_page, dst_vma); rss[mm_counter(new_page)]++; @@ -3058,7 +3058,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) * some TLBs while the old PTE remains in others. */ ptep_clear_flush_notify(vma, vmf->address, vmf->pte); - page_add_new_anon_rmap(new_page, vma, vmf->address, false); + page_add_new_anon_rmap(new_page, vma, vmf->address); lru_cache_add_inactive_or_unevictable(new_page, vma); /* * We call the notify macro here because, when using secondary @@ -3702,7 +3702,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) /* ksm created a completely new copy */ if (unlikely(page != swapcache && swapcache)) { - page_add_new_anon_rmap(page, vma, vmf->address, false); + page_add_new_anon_rmap(page, vma, vmf->address); lru_cache_add_inactive_or_unevictable(page, vma); } else { page_add_anon_rmap(page, vma, vmf->address, rmap_flags); @@ -3852,7 +3852,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) } inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); - page_add_new_anon_rmap(page, vma, vmf->address, false); + page_add_new_anon_rmap(page, vma, vmf->address); lru_cache_add_inactive_or_unevictable(page, vma); setpte: set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); @@ -4039,7 +4039,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr) /* copy-on-write page */ if (write && !(vma->vm_flags & VM_SHARED)) { inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); - page_add_new_anon_rmap(page, vma, addr, false); + page_add_new_anon_rmap(page, vma, addr); lru_cache_add_inactive_or_unevictable(page, vma); } else { inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page)); diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 70c7dc05bbfc..fb6d7d5499f5 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -610,7 +610,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, goto unlock_abort; inc_mm_counter(mm, MM_ANONPAGES); - page_add_new_anon_rmap(page, vma, addr, false); + page_add_new_anon_rmap(page, vma, addr); if (!is_zone_device_page(page)) lru_cache_add_inactive_or_unevictable(page, vma); get_page(page); diff --git a/mm/rmap.c b/mm/rmap.c index 32630f1b1ee1..90f92c53476f 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1226,19 +1226,22 @@ void page_add_anon_rmap(struct page *page, } /** - * page_add_new_anon_rmap - add pte mapping to a new anonymous page + * page_add_new_anon_rmap - add mapping to a new anonymous page * @page: the page to add the mapping to * @vma: the vm area in which the mapping is added * @address: the user virtual address mapped - * @compound: charge the page as compound or small page + * + * If it's a compound page, it is accounted as a compound page. As the page + * is new, it's assume to get mapped exclusively by a single process. * * Same as page_add_anon_rmap but must only be called on *new* pages. * This means the inc-and-test can be bypassed. * Page does not have to be locked. */ void page_add_new_anon_rmap(struct page *page, - struct vm_area_struct *vma, unsigned long address, bool compound) + struct vm_area_struct *vma, unsigned long address) { + const bool compound = PageCompound(page); int nr = compound ? thp_nr_pages(page) : 1; VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); @@ -1251,8 +1254,6 @@ void page_add_new_anon_rmap(struct page *page, __mod_lruvec_page_state(page, NR_ANON_THPS, nr); } else { - /* Anon THP always mapped first with PMD */ - VM_BUG_ON_PAGE(PageTransCompound(page), page); /* increment count (starts at -1) */ atomic_set(&page->_mapcount, 0); } diff --git a/mm/swapfile.c b/mm/swapfile.c index 1ba525a2179d..0ad7ed7ded21 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1802,7 +1802,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, if (page == swapcache) { page_add_anon_rmap(page, vma, addr, RMAP_NONE); } else { /* ksm created a completely new copy */ - page_add_new_anon_rmap(page, vma, addr, false); + page_add_new_anon_rmap(page, vma, addr); lru_cache_add_inactive_or_unevictable(page, vma); } set_pte_at(vma->vm_mm, addr, pte, diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index e9bb6db002aa..dae25d985d15 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -104,7 +104,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, lru_cache_add(page); page_add_file_rmap(page, dst_vma, false); } else { - page_add_new_anon_rmap(page, dst_vma, dst_addr, false); + page_add_new_anon_rmap(page, dst_vma, dst_addr); lru_cache_add_inactive_or_unevictable(page, dst_vma); } -- cgit v1.2.3 From 6c287605fd56466e645693eff3ae7c08fba56e0a Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 9 May 2022 18:20:44 -0700 Subject: mm: remember exclusively mapped anonymous pages with PG_anon_exclusive Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as exclusive, and use that information to make GUP pins reliable and stay consistent with the page mapped into the page table even if the page table entry gets write-protected. With that information at hand, we can extend our COW logic to always reuse anonymous pages that are exclusive. For anonymous pages that might be shared, the existing logic applies. As already documented, PG_anon_exclusive is usually only expressive in combination with a page table entry. Especially PTE vs. PMD-mapped anonymous pages require more thought, some examples: due to mremap() we can easily have a single compound page PTE-mapped into multiple page tables exclusively in a single process -- multiple page table locks apply. Further, due to MADV_WIPEONFORK we might not necessarily write-protect all PTEs, and only some subpages might be pinned. Long story short: once PTE-mapped, we have to track information about exclusivity per sub-page, but until then, we can just track it for the compound page in the head page and not having to update a whole bunch of subpages all of the time for a simple PMD mapping of a THP. For simplicity, this commit mostly talks about "anonymous pages", while it's for THP actually "the part of an anonymous folio referenced via a page table entry". To not spill PG_anon_exclusive code all over the mm code-base, we let the anon rmap code to handle all PG_anon_exclusive logic it can easily handle. If a writable, present page table entry points at an anonymous (sub)page, that (sub)page must be PG_anon_exclusive. If GUP wants to take a reliably pin (FOLL_PIN) on an anonymous page references via a present page table entry, it must only pin if PG_anon_exclusive is set for the mapped (sub)page. This commit doesn't adjust GUP, so this is only implicitly handled for FOLL_WRITE, follow-up commits will teach GUP to also respect it for FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages fully reliable. Whenever an anonymous page is to be shared (fork(), KSM), or when temporarily unmapping an anonymous page (swap, migration), the relevant PG_anon_exclusive bit has to be cleared to mark the anonymous page possibly shared. Clearing will fail if there are GUP pins on the page: * For fork(), this means having to copy the page and not being able to share it. fork() protects against concurrent GUP using the PT lock and the src_mm->write_protect_seq. * For KSM, this means sharing will fail. For swap this means, unmapping will fail, For migration this means, migration will fail early. All three cases protect against concurrent GUP using the PT lock and a proper clear/invalidate+flush of the relevant page table entry. This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a pinned page gets mapped R/O and the successive write fault ends up replacing the page instead of reusing it. It improves the situation for O_DIRECT/vmsplice/... that still use FOLL_GET instead of FOLL_PIN, if fork() is *not* involved, however swapout and fork() are still problematic. Properly using FOLL_PIN instead of FOLL_GET for these GUP users will fix the issue for them. I. Details about basic handling I.1. Fresh anonymous pages page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the given page exclusive via __page_set_anon_rmap(exclusive=1). As that is the mechanism fresh anonymous pages come into life (besides migration code where we copy the page->mapping), all fresh anonymous pages will start out as exclusive. I.2. COW reuse handling of anonymous pages When a COW handler stumbles over a (sub)page that's marked exclusive, it simply reuses it. Otherwise, the handler tries harder under page lock to detect if the (sub)page is exclusive and can be reused. If exclusive, page_move_anon_rmap() will mark the given (sub)page exclusive. Note that hugetlb code does not yet check for PageAnonExclusive(), as it still uses the old COW logic that is prone to the COW security issue because hugetlb code cannot really tolerate unnecessary/wrong COW as huge pages are a scarce resource. I.3. Migration handling try_to_migrate() has to try marking an exclusive anonymous page shared via page_try_share_anon_rmap(). If it fails because there are GUP pins on the page, unmap fails. migrate_vma_collect_pmd() and __split_huge_pmd_locked() are handled similarly. Writable migration entries implicitly point at shared anonymous pages. For readable migration entries that information is stored via a new "readable-exclusive" migration entry, specific to anonymous pages. When restoring a migration entry in remove_migration_pte(), information about exlusivity is detected via the migration entry type, and RMAP_EXCLUSIVE is set accordingly for page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that information. I.4. Swapout handling try_to_unmap() has to try marking the mapped page possibly shared via page_try_share_anon_rmap(). If it fails because there are GUP pins on the page, unmap fails. For now, information about exclusivity is lost. In the future, we might want to remember that information in the swap entry in some cases, however, it requires more thought, care, and a way to store that information in swap entries. I.5. Swapin handling do_swap_page() will never stumble over exclusive anonymous pages in the swap cache, as try_to_migrate() prohibits that. do_swap_page() always has to detect manually if an anonymous page is exclusive and has to set RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly. I.6. THP handling __split_huge_pmd_locked() has to move the information about exclusivity from the PMD to the PTEs. a) In case we have a readable-exclusive PMD migration entry, simply insert readable-exclusive PTE migration entries. b) In case we have a present PMD entry and we don't want to freeze ("convert to migration entries"), simply forward PG_anon_exclusive to all sub-pages, no need to temporarily clear the bit. c) In case we have a present PMD entry and want to freeze, handle it similar to try_to_migrate(): try marking the page shared first. In case we fail, we ignore the "freeze" instruction and simply split ordinarily. try_to_migrate() will properly fail because the THP is still mapped via PTEs. When splitting a compound anonymous folio (THP), the information about exclusivity is implicitly handled via the migration entries: no need to replicate PG_anon_exclusive manually. I.7. fork() handling fork() handling is relatively easy, because PG_anon_exclusive is only expressive for some page table entry types. a) Present anonymous pages page_try_dup_anon_rmap() will mark the given subpage shared -- which will fail if the page is pinned. If it failed, we have to copy (or PTE-map a PMD to handle it on the PTE level). Note that device exclusive entries are just a pointer at a PageAnon() page. fork() will first convert a device exclusive entry to a present page table and handle it just like present anonymous pages. b) Device private entry Device private entries point at PageAnon() pages that cannot be mapped directly and, therefore, cannot get pinned. page_try_dup_anon_rmap() will mark the given subpage shared, which cannot fail because they cannot get pinned. c) HW poison entries PG_anon_exclusive will remain untouched and is stale -- the page table entry is just a placeholder after all. d) Migration entries Writable and readable-exclusive entries are converted to readable entries: possibly shared. I.8. mprotect() handling mprotect() only has to properly handle the new readable-exclusive migration entry: When write-protecting a migration entry that points at an anonymous page, remember the information about exclusivity via the "readable-exclusive" migration entry type. II. Migration and GUP-fast Whenever replacing a present page table entry that maps an exclusive anonymous page by a migration entry, we have to mark the page possibly shared and synchronize against GUP-fast by a proper clear/invalidate+flush to make the following scenario impossible: 1. try_to_migrate() places a migration entry after checking for GUP pins and marks the page possibly shared. 2. GUP-fast pins the page due to lack of synchronization 3. fork() converts the "writable/readable-exclusive" migration entry into a readable migration entry 4. Migration fails due to the GUP pin (failing to freeze the refcount) 5. Migration entries are restored. PG_anon_exclusive is lost -> We have a pinned page that is not marked exclusive anymore. Note that we move information about exclusivity from the page to the migration entry as it otherwise highly overcomplicates fork() and PTE-mapping a THP. III. Swapout and GUP-fast Whenever replacing a present page table entry that maps an exclusive anonymous page by a swap entry, we have to mark the page possibly shared and synchronize against GUP-fast by a proper clear/invalidate+flush to make the following scenario impossible: 1. try_to_unmap() places a swap entry after checking for GUP pins and clears exclusivity information on the page. 2. GUP-fast pins the page due to lack of synchronization. -> We have a pinned page that is not marked exclusive anymore. If we'd ever store information about exclusivity in the swap entry, similar to migration handling, the same considerations as in II would apply. This is future work. Link: https://lkml.kernel.org/r/20220428083441.37290-13-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Vlastimil Babka Cc: Andrea Arcangeli Cc: Christoph Hellwig Cc: David Rientjes Cc: Don Dutile Cc: Hugh Dickins Cc: Jan Kara Cc: Jann Horn Cc: Jason Gunthorpe Cc: John Hubbard Cc: Khalid Aziz Cc: "Kirill A. Shutemov" Cc: Liang Zhang Cc: "Matthew Wilcox (Oracle)" Cc: Michal Hocko Cc: Mike Kravetz Cc: Mike Rapoport Cc: Nadav Amit Cc: Oded Gabbay Cc: Oleg Nesterov Cc: Pedro Demarchi Gomes Cc: Peter Xu Cc: Rik van Riel Cc: Roman Gushchin Cc: Shakeel Butt Cc: Yang Shi Signed-off-by: Andrew Morton --- include/linux/rmap.h | 40 +++++++++++++++++++++++++ include/linux/swap.h | 15 +++++++--- include/linux/swapops.h | 25 ++++++++++++++++ mm/huge_memory.c | 78 ++++++++++++++++++++++++++++++++++++++++++++----- mm/hugetlb.c | 15 +++++++--- mm/ksm.c | 13 ++++++++- mm/memory.c | 33 ++++++++++++++++----- mm/migrate.c | 14 +++++++-- mm/migrate_device.c | 21 ++++++++++++- mm/mprotect.c | 8 +++-- mm/rmap.c | 61 ++++++++++++++++++++++++++++++++++---- 11 files changed, 289 insertions(+), 34 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/rmap.h b/include/linux/rmap.h index e4156921eea9..cbe279a6f0de 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -228,6 +228,13 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound, { VM_BUG_ON_PAGE(!PageAnon(page), page); + /* + * No need to check+clear for already shared pages, including KSM + * pages. + */ + if (!PageAnonExclusive(page)) + goto dup; + /* * If this page may have been pinned by the parent process, * don't allow to duplicate the mapping but instead require to e.g., @@ -239,14 +246,47 @@ static inline int page_try_dup_anon_rmap(struct page *page, bool compound, unlikely(page_needs_cow_for_dma(vma, page)))) return -EBUSY; + ClearPageAnonExclusive(page); /* * It's okay to share the anon page between both processes, mapping * the page R/O into both processes. */ +dup: __page_dup_rmap(page, compound); return 0; } +/** + * page_try_share_anon_rmap - try marking an exclusive anonymous page possibly + * shared to prepare for KSM or temporary unmapping + * @page: the exclusive anonymous page to try marking possibly shared + * + * The caller needs to hold the PT lock and has to have the page table entry + * cleared/invalidated+flushed, to properly sync against GUP-fast. + * + * This is similar to page_try_dup_anon_rmap(), however, not used during fork() + * to duplicate a mapping, but instead to prepare for KSM or temporarily + * unmapping a page (swap, migration) via page_remove_rmap(). + * + * Marking the page shared can only fail if the page may be pinned; device + * private pages cannot get pinned and consequently this function cannot fail. + * + * Returns 0 if marking the page possibly shared succeeded. Returns -EBUSY + * otherwise. + */ +static inline int page_try_share_anon_rmap(struct page *page) +{ + VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page); + + /* See page_try_dup_anon_rmap(). */ + if (likely(!is_device_private_page(page) && + unlikely(page_maybe_dma_pinned(page)))) + return -EBUSY; + + ClearPageAnonExclusive(page); + return 0; +} + /* * Called from mm/vmscan.c to handle paging out */ diff --git a/include/linux/swap.h b/include/linux/swap.h index 27093b477c5f..e6d70a4156e8 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -78,12 +78,19 @@ static inline int current_is_kswapd(void) #endif /* - * NUMA node memory migration support + * Page migration support. + * + * SWP_MIGRATION_READ_EXCLUSIVE is only applicable to anonymous pages and + * indicates that the referenced (part of) an anonymous page is exclusive to + * a single process. For SWP_MIGRATION_WRITE, that information is implicit: + * (part of) an anonymous page that are mapped writable are exclusive to a + * single process. */ #ifdef CONFIG_MIGRATION -#define SWP_MIGRATION_NUM 2 -#define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM) -#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1) +#define SWP_MIGRATION_NUM 3 +#define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM) +#define SWP_MIGRATION_READ_EXCLUSIVE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1) +#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 2) #else #define SWP_MIGRATION_NUM 0 #endif diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 5af852b68805..6648b97244e7 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -194,6 +194,7 @@ static inline bool is_writable_device_exclusive_entry(swp_entry_t entry) static inline int is_migration_entry(swp_entry_t entry) { return unlikely(swp_type(entry) == SWP_MIGRATION_READ || + swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE || swp_type(entry) == SWP_MIGRATION_WRITE); } @@ -202,11 +203,26 @@ static inline int is_writable_migration_entry(swp_entry_t entry) return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE); } +static inline int is_readable_migration_entry(swp_entry_t entry) +{ + return unlikely(swp_type(entry) == SWP_MIGRATION_READ); +} + +static inline int is_readable_exclusive_migration_entry(swp_entry_t entry) +{ + return unlikely(swp_type(entry) == SWP_MIGRATION_READ_EXCLUSIVE); +} + static inline swp_entry_t make_readable_migration_entry(pgoff_t offset) { return swp_entry(SWP_MIGRATION_READ, offset); } +static inline swp_entry_t make_readable_exclusive_migration_entry(pgoff_t offset) +{ + return swp_entry(SWP_MIGRATION_READ_EXCLUSIVE, offset); +} + static inline swp_entry_t make_writable_migration_entry(pgoff_t offset) { return swp_entry(SWP_MIGRATION_WRITE, offset); @@ -224,6 +240,11 @@ static inline swp_entry_t make_readable_migration_entry(pgoff_t offset) return swp_entry(0, 0); } +static inline swp_entry_t make_readable_exclusive_migration_entry(pgoff_t offset) +{ + return swp_entry(0, 0); +} + static inline swp_entry_t make_writable_migration_entry(pgoff_t offset) { return swp_entry(0, 0); @@ -244,6 +265,10 @@ static inline int is_writable_migration_entry(swp_entry_t entry) { return 0; } +static inline int is_readable_migration_entry(swp_entry_t entry) +{ + return 0; +} #endif diff --git a/mm/huge_memory.c b/mm/huge_memory.c index cb5bcd833d9e..231911b7bf9d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1054,7 +1054,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, swp_entry_t entry = pmd_to_swp_entry(pmd); VM_BUG_ON(!is_pmd_migration_entry(pmd)); - if (is_writable_migration_entry(entry)) { + if (!is_readable_migration_entry(entry)) { entry = make_readable_migration_entry( swp_offset(entry)); pmd = swp_entry_to_pmd(entry); @@ -1292,6 +1292,10 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) page = pmd_page(orig_pmd); VM_BUG_ON_PAGE(!PageHead(page), page); + /* Early check when only holding the PT lock. */ + if (PageAnonExclusive(page)) + goto reuse; + if (!trylock_page(page)) { get_page(page); spin_unlock(vmf->ptl); @@ -1306,6 +1310,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) put_page(page); } + /* Recheck after temporarily dropping the PT lock. */ + if (PageAnonExclusive(page)) { + unlock_page(page); + goto reuse; + } + /* * See do_wp_page(): we can only map the page writable if there are * no additional references. Note that we always drain the LRU @@ -1319,11 +1329,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) pmd_t entry; page_move_anon_rmap(page, vma); + unlock_page(page); +reuse: entry = pmd_mkyoung(orig_pmd); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1)) update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); - unlock_page(page); spin_unlock(vmf->ptl); return VM_FAULT_WRITE; } @@ -1708,6 +1719,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION if (is_swap_pmd(*pmd)) { swp_entry_t entry = pmd_to_swp_entry(*pmd); + struct page *page = pfn_swap_entry_to_page(entry); VM_BUG_ON(!is_pmd_migration_entry(*pmd)); if (is_writable_migration_entry(entry)) { @@ -1716,8 +1728,10 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, * A protection check is difficult so * just be safe and disable write */ - entry = make_readable_migration_entry( - swp_offset(entry)); + if (PageAnon(page)) + entry = make_readable_exclusive_migration_entry(swp_offset(entry)); + else + entry = make_readable_migration_entry(swp_offset(entry)); newpmd = swp_entry_to_pmd(entry); if (pmd_swp_soft_dirty(*pmd)) newpmd = pmd_swp_mksoft_dirty(newpmd); @@ -1937,6 +1951,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, pgtable_t pgtable; pmd_t old_pmd, _pmd; bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false; + bool anon_exclusive = false; unsigned long addr; int i; @@ -2018,6 +2033,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, entry = pmd_to_swp_entry(old_pmd); page = pfn_swap_entry_to_page(entry); write = is_writable_migration_entry(entry); + if (PageAnon(page)) + anon_exclusive = is_readable_exclusive_migration_entry(entry); young = false; soft_dirty = pmd_swp_soft_dirty(old_pmd); uffd_wp = pmd_swp_uffd_wp(old_pmd); @@ -2029,8 +2046,26 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, young = pmd_young(old_pmd); soft_dirty = pmd_soft_dirty(old_pmd); uffd_wp = pmd_uffd_wp(old_pmd); + VM_BUG_ON_PAGE(!page_count(page), page); page_ref_add(page, HPAGE_PMD_NR - 1); + + /* + * Without "freeze", we'll simply split the PMD, propagating the + * PageAnonExclusive() flag for each PTE by setting it for + * each subpage -- no need to (temporarily) clear. + * + * With "freeze" we want to replace mapped pages by + * migration entries right away. This is only possible if we + * managed to clear PageAnonExclusive() -- see + * set_pmd_migration_entry(). + * + * In case we cannot clear PageAnonExclusive(), split the PMD + * only and let try_to_migrate_one() fail later. + */ + anon_exclusive = PageAnon(page) && PageAnonExclusive(page); + if (freeze && anon_exclusive && page_try_share_anon_rmap(page)) + freeze = false; } /* @@ -2052,6 +2087,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, if (write) swp_entry = make_writable_migration_entry( page_to_pfn(page + i)); + else if (anon_exclusive) + swp_entry = make_readable_exclusive_migration_entry( + page_to_pfn(page + i)); else swp_entry = make_readable_migration_entry( page_to_pfn(page + i)); @@ -2063,6 +2101,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, } else { entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot)); entry = maybe_mkwrite(entry, vma); + if (anon_exclusive) + SetPageAnonExclusive(page + i); if (!write) entry = pte_wrprotect(entry); if (!young) @@ -2294,6 +2334,13 @@ static void __split_huge_page_tail(struct page *head, int tail, * * After successful get_page_unless_zero() might follow flags change, * for example lock_page() which set PG_waiters. + * + * Note that for mapped sub-pages of an anonymous THP, + * PG_anon_exclusive has been cleared in unmap_page() and is stored in + * the migration entry instead from where remap_page() will restore it. + * We can still have PG_anon_exclusive set on effectively unmapped and + * unreferenced sub-pages of an anonymous THP: we can simply drop + * PG_anon_exclusive (-> PG_mappedtodisk) for these here. */ page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; page_tail->flags |= (head->flags & @@ -3025,6 +3072,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw, struct vm_area_struct *vma = pvmw->vma; struct mm_struct *mm = vma->vm_mm; unsigned long address = pvmw->address; + bool anon_exclusive; pmd_t pmdval; swp_entry_t entry; pmd_t pmdswp; @@ -3034,10 +3082,19 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw, flush_cache_range(vma, address, address + HPAGE_PMD_SIZE); pmdval = pmdp_invalidate(vma, address, pvmw->pmd); + + anon_exclusive = PageAnon(page) && PageAnonExclusive(page); + if (anon_exclusive && page_try_share_anon_rmap(page)) { + set_pmd_at(mm, address, pvmw->pmd, pmdval); + return; + } + if (pmd_dirty(pmdval)) set_page_dirty(page); if (pmd_write(pmdval)) entry = make_writable_migration_entry(page_to_pfn(page)); + else if (anon_exclusive) + entry = make_readable_exclusive_migration_entry(page_to_pfn(page)); else entry = make_readable_migration_entry(page_to_pfn(page)); pmdswp = swp_entry_to_pmd(entry); @@ -3071,10 +3128,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new) if (pmd_swp_uffd_wp(*pvmw->pmd)) pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde)); - if (PageAnon(new)) - page_add_anon_rmap(new, vma, mmun_start, RMAP_COMPOUND); - else + if (PageAnon(new)) { + rmap_t rmap_flags = RMAP_COMPOUND; + + if (!is_readable_migration_entry(entry)) + rmap_flags |= RMAP_EXCLUSIVE; + + page_add_anon_rmap(new, vma, mmun_start, rmap_flags); + } else { page_add_file_rmap(new, vma, true); + } + VM_BUG_ON(pmd_write(pmde) && PageAnon(new) && !PageAnonExclusive(new)); set_pmd_at(mm, mmun_start, pvmw->pmd, pmde); /* No need to invalidate - it was non-present before */ diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 03cbb75bcb54..c0012b4a8eec 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4790,7 +4790,7 @@ again: is_hugetlb_entry_hwpoisoned(entry))) { swp_entry_t swp_entry = pte_to_swp_entry(entry); - if (is_writable_migration_entry(swp_entry) && cow) { + if (!is_readable_migration_entry(swp_entry) && cow) { /* * COW mappings require pages in both * parent and child to be set to read. @@ -5190,6 +5190,8 @@ retry_avoidcopy: set_huge_ptep_writable(vma, haddr, ptep); return 0; } + VM_BUG_ON_PAGE(PageAnon(old_page) && PageAnonExclusive(old_page), + old_page); /* * If the process that created a MAP_PRIVATE mapping is about to @@ -6187,12 +6189,17 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, } if (unlikely(is_hugetlb_entry_migration(pte))) { swp_entry_t entry = pte_to_swp_entry(pte); + struct page *page = pfn_swap_entry_to_page(entry); - if (is_writable_migration_entry(entry)) { + if (!is_readable_migration_entry(entry)) { pte_t newpte; - entry = make_readable_migration_entry( - swp_offset(entry)); + if (PageAnon(page)) + entry = make_readable_exclusive_migration_entry( + swp_offset(entry)); + else + entry = make_readable_migration_entry( + swp_offset(entry)); newpte = swp_entry_to_pte(entry); set_huge_swap_pte_at(mm, address, ptep, newpte, huge_page_size(h)); diff --git a/mm/ksm.c b/mm/ksm.c index 2b6692a7df6a..38360285497a 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -872,6 +872,7 @@ static inline struct stable_node *page_stable_node(struct page *page) static inline void set_page_stable_node(struct page *page, struct stable_node *stable_node) { + VM_BUG_ON_PAGE(PageAnon(page) && PageAnonExclusive(page), page); page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM); } @@ -1044,6 +1045,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, int swapped; int err = -EFAULT; struct mmu_notifier_range range; + bool anon_exclusive; pvmw.address = page_address_in_vma(page, vma); if (pvmw.address == -EFAULT) @@ -1061,9 +1063,10 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, if (WARN_ONCE(!pvmw.pte, "Unexpected PMD mapping?")) goto out_unlock; + anon_exclusive = PageAnonExclusive(page); if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) || (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) || - mm_tlb_flush_pending(mm)) { + anon_exclusive || mm_tlb_flush_pending(mm)) { pte_t entry; swapped = PageSwapCache(page); @@ -1091,6 +1094,12 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, set_pte_at(mm, pvmw.address, pvmw.pte, entry); goto out_unlock; } + + if (anon_exclusive && page_try_share_anon_rmap(page)) { + set_pte_at(mm, pvmw.address, pvmw.pte, entry); + goto out_unlock; + } + if (pte_dirty(entry)) set_page_dirty(page); @@ -1149,6 +1158,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, pte_unmap_unlock(ptep, ptl); goto out_mn; } + VM_BUG_ON_PAGE(PageAnonExclusive(page), page); + VM_BUG_ON_PAGE(PageAnon(kpage) && PageAnonExclusive(kpage), kpage); /* * No need to check ksm_use_zero_pages here: we can only have a diff --git a/mm/memory.c b/mm/memory.c index 0b0727758c86..454ecc05ad85 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -720,6 +720,8 @@ static void restore_exclusive_pte(struct vm_area_struct *vma, else if (is_writable_device_exclusive_entry(entry)) pte = maybe_mkwrite(pte_mkdirty(pte), vma); + VM_BUG_ON(pte_write(pte) && !(PageAnon(page) && PageAnonExclusive(page))); + /* * No need to take a page reference as one was already * created when the swap entry was made. @@ -796,11 +798,12 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, rss[mm_counter(page)]++; - if (is_writable_migration_entry(entry) && + if (!is_readable_migration_entry(entry) && is_cow_mapping(vm_flags)) { /* - * COW mappings require pages in both - * parent and child to be set to read. + * COW mappings require pages in both parent and child + * to be set to read. A previously exclusive entry is + * now shared. */ entry = make_readable_migration_entry( swp_offset(entry)); @@ -951,6 +954,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, ptep_set_wrprotect(src_mm, addr, src_pte); pte = pte_wrprotect(pte); } + VM_BUG_ON(page && PageAnon(page) && PageAnonExclusive(page)); /* * If it's a shared mapping, mark it clean in @@ -2949,6 +2953,9 @@ static inline void wp_page_reuse(struct vm_fault *vmf) struct vm_area_struct *vma = vmf->vma; struct page *page = vmf->page; pte_t entry; + + VM_BUG_ON(PageAnon(page) && !PageAnonExclusive(page)); + /* * Clear the pages cpupid information as the existing * information potentially belongs to a now completely @@ -3273,6 +3280,13 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) if (PageAnon(vmf->page)) { struct page *page = vmf->page; + /* + * If the page is exclusive to this process we must reuse the + * page without further checks. + */ + if (PageAnonExclusive(page)) + goto reuse; + /* * We have to verify under page lock: these early checks are * just an optimization to avoid locking the page and freeing @@ -3305,6 +3319,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) */ page_move_anon_rmap(page, vma); unlock_page(page); +reuse: wp_page_reuse(vmf); return VM_FAULT_WRITE; } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == @@ -3696,11 +3711,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * that are certainly not shared because we just allocated them without * exposing them to the swapcache. */ - if ((vmf->flags & FAULT_FLAG_WRITE) && !PageKsm(page) && - (page != swapcache || page_count(page) == 1)) { - pte = maybe_mkwrite(pte_mkdirty(pte), vma); - vmf->flags &= ~FAULT_FLAG_WRITE; - ret |= VM_FAULT_WRITE; + if (!PageKsm(page) && (page != swapcache || page_count(page) == 1)) { + if (vmf->flags & FAULT_FLAG_WRITE) { + pte = maybe_mkwrite(pte_mkdirty(pte), vma); + vmf->flags &= ~FAULT_FLAG_WRITE; + ret |= VM_FAULT_WRITE; + } rmap_flags |= RMAP_EXCLUSIVE; } flush_icache_page(vma, page); @@ -3720,6 +3736,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) page_add_anon_rmap(page, vma, vmf->address, rmap_flags); } + VM_BUG_ON(!PageAnon(page) || (pte_write(pte) && !PageAnonExclusive(page))); set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); diff --git a/mm/migrate.c b/mm/migrate.c index 92e932f08be5..b2678279eb43 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -177,6 +177,7 @@ static bool remove_migration_pte(struct folio *folio, DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION); while (page_vma_mapped_walk(&pvmw)) { + rmap_t rmap_flags = RMAP_NONE; pte_t pte; swp_entry_t entry; struct page *new; @@ -211,6 +212,9 @@ static bool remove_migration_pte(struct folio *folio, else if (pte_swp_uffd_wp(*pvmw.pte)) pte = pte_mkuffd_wp(pte); + if (folio_test_anon(folio) && !is_readable_migration_entry(entry)) + rmap_flags |= RMAP_EXCLUSIVE; + if (unlikely(is_device_private_page(new))) { if (pte_write(pte)) entry = make_writable_device_private_entry( @@ -233,7 +237,7 @@ static bool remove_migration_pte(struct folio *folio, pte = arch_make_huge_pte(pte, shift, vma->vm_flags); if (folio_test_anon(folio)) hugepage_add_anon_rmap(new, vma, pvmw.address, - RMAP_NONE); + rmap_flags); else page_dup_file_rmap(new, true); set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); @@ -242,7 +246,7 @@ static bool remove_migration_pte(struct folio *folio, { if (folio_test_anon(folio)) page_add_anon_rmap(new, vma, pvmw.address, - RMAP_NONE); + rmap_flags); else page_add_file_rmap(new, vma, false); set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); @@ -514,6 +518,12 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio) folio_set_workingset(newfolio); if (folio_test_checked(folio)) folio_set_checked(newfolio); + /* + * PG_anon_exclusive (-> PG_mappedtodisk) is always migrated via + * migration entries. We can still have PG_anon_exclusive set on an + * effectively unmapped and unreferenced first sub-pages of an + * anonymous THP: we can simply copy it here via PG_mappedtodisk. + */ if (folio_test_mappedtodisk(folio)) folio_set_mappedtodisk(newfolio); diff --git a/mm/migrate_device.c b/mm/migrate_device.c index fb6d7d5499f5..5052093d0262 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -184,15 +184,34 @@ again: * set up a special migration page table entry now. */ if (trylock_page(page)) { + bool anon_exclusive; pte_t swp_pte; + anon_exclusive = PageAnon(page) && PageAnonExclusive(page); + if (anon_exclusive) { + flush_cache_page(vma, addr, pte_pfn(*ptep)); + ptep_clear_flush(vma, addr, ptep); + + if (page_try_share_anon_rmap(page)) { + set_pte_at(mm, addr, ptep, pte); + unlock_page(page); + put_page(page); + mpfn = 0; + goto next; + } + } else { + ptep_get_and_clear(mm, addr, ptep); + } + migrate->cpages++; - ptep_get_and_clear(mm, addr, ptep); /* Setup special migration page table entry */ if (mpfn & MIGRATE_PFN_WRITE) entry = make_writable_migration_entry( page_to_pfn(page)); + else if (anon_exclusive) + entry = make_readable_exclusive_migration_entry( + page_to_pfn(page)); else entry = make_readable_migration_entry( page_to_pfn(page)); diff --git a/mm/mprotect.c b/mm/mprotect.c index b69ce7a7b2b7..56060acdabd3 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -152,6 +152,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, pages++; } else if (is_swap_pte(oldpte)) { swp_entry_t entry = pte_to_swp_entry(oldpte); + struct page *page = pfn_swap_entry_to_page(entry); pte_t newpte; if (is_writable_migration_entry(entry)) { @@ -159,8 +160,11 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, * A protection check is difficult so * just be safe and disable write */ - entry = make_readable_migration_entry( - swp_offset(entry)); + if (PageAnon(page)) + entry = make_readable_exclusive_migration_entry( + swp_offset(entry)); + else + entry = make_readable_migration_entry(swp_offset(entry)); newpte = swp_entry_to_pte(entry); if (pte_swp_soft_dirty(oldpte)) newpte = pte_swp_mksoft_dirty(newpte); diff --git a/mm/rmap.c b/mm/rmap.c index 90f92c53476f..0d63e7ce35cc 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1088,6 +1088,7 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff, void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma) { struct anon_vma *anon_vma = vma->anon_vma; + struct page *subpage = page; page = compound_head(page); @@ -1101,6 +1102,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma) * folio_test_anon()) will not see one without the other. */ WRITE_ONCE(page->mapping, (struct address_space *) anon_vma); + SetPageAnonExclusive(subpage); } /** @@ -1118,7 +1120,7 @@ static void __page_set_anon_rmap(struct page *page, BUG_ON(!anon_vma); if (PageAnon(page)) - return; + goto out; /* * If the page isn't exclusively mapped into this vma, @@ -1137,6 +1139,9 @@ static void __page_set_anon_rmap(struct page *page, anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; WRITE_ONCE(page->mapping, (struct address_space *) anon_vma); page->index = linear_page_index(vma, address); +out: + if (exclusive) + SetPageAnonExclusive(page); } /** @@ -1198,6 +1203,8 @@ void page_add_anon_rmap(struct page *page, } else { first = atomic_inc_and_test(&page->_mapcount); } + VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page); + VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page); if (first) { int nr = compound ? thp_nr_pages(page) : 1; @@ -1459,7 +1466,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0); pte_t pteval; struct page *subpage; - bool ret = true; + bool anon_exclusive, ret = true; struct mmu_notifier_range range; enum ttu_flags flags = (enum ttu_flags)(long)arg; @@ -1515,6 +1522,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, subpage = folio_page(folio, pte_pfn(*pvmw.pte) - folio_pfn(folio)); address = pvmw.address; + anon_exclusive = folio_test_anon(folio) && + PageAnonExclusive(subpage); if (folio_test_hugetlb(folio) && !folio_test_anon(folio)) { /* @@ -1550,9 +1559,12 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, } } - /* Nuke the page table entry. */ + /* + * Nuke the page table entry. When having to clear + * PageAnonExclusive(), we always have to flush. + */ flush_cache_page(vma, address, pte_pfn(*pvmw.pte)); - if (should_defer_flush(mm, flags)) { + if (should_defer_flush(mm, flags) && !anon_exclusive) { /* * We clear the PTE but do not flush so potentially * a remote CPU could still be writing to the folio. @@ -1677,6 +1689,24 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, page_vma_mapped_walk_done(&pvmw); break; } + if (anon_exclusive && + page_try_share_anon_rmap(subpage)) { + swap_free(entry); + set_pte_at(mm, address, pvmw.pte, pteval); + ret = false; + page_vma_mapped_walk_done(&pvmw); + break; + } + /* + * Note: We *don't* remember yet if the page was mapped + * exclusively in the swap entry, so swapin code has + * to re-determine that manually and might detect the + * page as possibly shared, for example, if there are + * other references on the page or if the page is under + * writeback. We made sure that there are no GUP pins + * on the page that would rely on it, so for GUP pins + * this is fine. + */ if (list_empty(&mm->mmlist)) { spin_lock(&mmlist_lock); if (list_empty(&mm->mmlist)) @@ -1776,7 +1806,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0); pte_t pteval; struct page *subpage; - bool ret = true; + bool anon_exclusive, ret = true; struct mmu_notifier_range range; enum ttu_flags flags = (enum ttu_flags)(long)arg; @@ -1837,6 +1867,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, subpage = folio_page(folio, pte_pfn(*pvmw.pte) - folio_pfn(folio)); address = pvmw.address; + anon_exclusive = folio_test_anon(folio) && + PageAnonExclusive(subpage); if (folio_test_hugetlb(folio) && !folio_test_anon(folio)) { /* @@ -1888,6 +1920,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, swp_entry_t entry; pte_t swp_pte; + if (anon_exclusive) + BUG_ON(page_try_share_anon_rmap(subpage)); + /* * Store the pfn of the page in a special migration * pte. do_swap_page() will wait until the migration @@ -1896,6 +1931,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, entry = pte_to_swp_entry(pteval); if (is_writable_device_private_entry(entry)) entry = make_writable_migration_entry(pfn); + else if (anon_exclusive) + entry = make_readable_exclusive_migration_entry(pfn); else entry = make_readable_migration_entry(pfn); swp_pte = swp_entry_to_pte(entry); @@ -1960,6 +1997,15 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, page_vma_mapped_walk_done(&pvmw); break; } + VM_BUG_ON_PAGE(pte_write(pteval) && folio_test_anon(folio) && + !anon_exclusive, subpage); + if (anon_exclusive && + page_try_share_anon_rmap(subpage)) { + set_pte_at(mm, address, pvmw.pte, pteval); + ret = false; + page_vma_mapped_walk_done(&pvmw); + break; + } /* * Store the pfn of the page in a special migration @@ -1969,6 +2015,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, if (pte_write(pteval)) entry = make_writable_migration_entry( page_to_pfn(subpage)); + else if (anon_exclusive) + entry = make_readable_exclusive_migration_entry( + page_to_pfn(subpage)); else entry = make_readable_migration_entry( page_to_pfn(subpage)); @@ -2405,6 +2454,8 @@ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma, BUG_ON(!anon_vma); /* address might be in next vma when migration races vma_adjust */ first = atomic_inc_and_test(compound_mapcount_ptr(page)); + VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page); + VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page); if (first) __page_set_anon_rmap(page, vma, address, !!(flags & RMAP_EXCLUSIVE)); -- cgit v1.2.3 From 6d4675e601357834dadd2ba1d803f6484596015c Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Thu, 19 May 2022 14:08:54 -0700 Subject: mm: don't be stuck to rmap lock on reclaim path The rmap locks(i_mmap_rwsem and anon_vma->root->rwsem) could be contended under memory pressure if processes keep working on their vmas(e.g., fork, mmap, munmap). It makes reclaim path stuck. In our real workload traces, we see kswapd is waiting the lock for 300ms+(worst case, a sec) and it makes other processes entering direct reclaim, which were also stuck on the lock. This patch makes lru aging path try_lock mode like shink_page_list so the reclaim context will keep working with next lru pages without being stuck. if it found the rmap lock contended, it rotates the page back to head of lru in both active/inactive lrus to make them consistent behavior, which is basic starting point rather than adding more heristic. Since this patch introduces a new "contended" field as out-param along with try_lock in-param in rmap_walk_control, it's not immutable any longer if the try_lock is set so remove const keywords on rmap related functions. Since rmap walking is already expensive operation, I doubt the const would help sizable benefit( And we didn't have it until 5.17). In a heavy app workload in Android, trace shows following statistics. It almost removes rmap lock contention from reclaim path. Martin Liu reported: Before: max_dur(ms) min_dur(ms) max-min(dur)ms avg_dur(ms) sum_dur(ms) count blocked_function 1632 0 1631 151.542173 31672 209 page_lock_anon_vma_read 601 0 601 145.544681 28817 198 rmap_walk_file After: max_dur(ms) min_dur(ms) max-min(dur)ms avg_dur(ms) sum_dur(ms) count blocked_function NaN NaN NaN NaN NaN 0.0 NaN 0 0 0 0.127645 1 12 rmap_walk_file [minchan@kernel.org: add comment, per Matthew] Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org Signed-off-by: Minchan Kim Acked-by: Johannes Weiner Cc: Suren Baghdasaryan Cc: Michal Hocko Cc: John Dias Cc: Tim Murray Cc: Matthew Wilcox Cc: Vladimir Davydov Cc: Martin Liu Cc: Minchan Kim Cc: Matthew Wilcox Signed-off-by: Andrew Morton --- include/linux/fs.h | 5 +++++ include/linux/ksm.h | 4 ++-- include/linux/rmap.h | 28 ++++++++++++++++++++-------- mm/ksm.c | 10 ++++++++-- mm/memory-failure.c | 2 +- mm/page_idle.c | 7 ++++--- mm/rmap.c | 52 +++++++++++++++++++++++++++++++++++++++++----------- mm/vmscan.c | 7 ++++++- 8 files changed, 87 insertions(+), 28 deletions(-) (limited to 'include/linux/rmap.h') diff --git a/include/linux/fs.h b/include/linux/fs.h index b81cacc51d2f..044b67f8d861 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -477,6 +477,11 @@ static inline void i_mmap_unlock_write(struct address_space *mapping) up_write(&mapping->i_mmap_rwsem); } +static inline int i_mmap_trylock_read(struct address_space *mapping) +{ + return down_read_trylock(&mapping->i_mmap_rwsem); +} + static inline void i_mmap_lock_read(struct address_space *mapping) { down_read(&mapping->i_mmap_rwsem); diff --git a/include/linux/ksm.h b/include/linux/ksm.h index 0630e545f4cb..0b4f17418f64 100644 --- a/include/linux/ksm.h +++ b/include/linux/ksm.h @@ -51,7 +51,7 @@ static inline void ksm_exit(struct mm_struct *mm) struct page *ksm_might_need_to_copy(struct page *page, struct vm_area_struct *vma, unsigned long address); -void rmap_walk_ksm(struct folio *folio, const struct rmap_walk_control *rwc); +void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc); void folio_migrate_ksm(struct folio *newfolio, struct folio *folio); #else /* !CONFIG_KSM */ @@ -79,7 +79,7 @@ static inline struct page *ksm_might_need_to_copy(struct page *page, } static inline void rmap_walk_ksm(struct folio *folio, - const struct rmap_walk_control *rwc) + struct rmap_walk_control *rwc) { } diff --git a/include/linux/rmap.h b/include/linux/rmap.h index cbe279a6f0de..9ec23138e410 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -128,6 +128,11 @@ static inline void anon_vma_lock_read(struct anon_vma *anon_vma) down_read(&anon_vma->root->rwsem); } +static inline int anon_vma_trylock_read(struct anon_vma *anon_vma) +{ + return down_read_trylock(&anon_vma->root->rwsem); +} + static inline void anon_vma_unlock_read(struct anon_vma *anon_vma) { up_read(&anon_vma->root->rwsem); @@ -366,17 +371,14 @@ int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff, void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked); -/* - * Called by memory-failure.c to kill processes. - */ -struct anon_vma *folio_lock_anon_vma_read(struct folio *folio); -void page_unlock_anon_vma_read(struct anon_vma *anon_vma); int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma); /* * rmap_walk_control: To control rmap traversing for specific needs * * arg: passed to rmap_one() and invalid_vma() + * try_lock: bail out if the rmap lock is contended + * contended: indicate the rmap traversal bailed out due to lock contention * rmap_one: executed on each vma where page is mapped * done: for checking traversing termination condition * anon_lock: for getting anon_lock by optimized way rather than default @@ -384,6 +386,8 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma); */ struct rmap_walk_control { void *arg; + bool try_lock; + bool contended; /* * Return false if page table scanning in rmap_walk should be stopped. * Otherwise, return true. @@ -391,12 +395,20 @@ struct rmap_walk_control { bool (*rmap_one)(struct folio *folio, struct vm_area_struct *vma, unsigned long addr, void *arg); int (*done)(struct folio *folio); - struct anon_vma *(*anon_lock)(struct folio *folio); + struct anon_vma *(*anon_lock)(struct folio *folio, + struct rmap_walk_control *rwc); bool (*invalid_vma)(struct vm_area_struct *vma, void *arg); }; -void rmap_walk(struct folio *folio, const struct rmap_walk_control *rwc); -void rmap_walk_locked(struct folio *folio, const struct rmap_walk_control *rwc); +void rmap_walk(struct folio *folio, struct rmap_walk_control *rwc); +void rmap_walk_locked(struct folio *folio, struct rmap_walk_control *rwc); + +/* + * Called by memory-failure.c to kill processes. + */ +struct anon_vma *folio_lock_anon_vma_read(struct folio *folio, + struct rmap_walk_control *rwc); +void page_unlock_anon_vma_read(struct anon_vma *anon_vma); #else /* !CONFIG_MMU */ diff --git a/mm/ksm.c b/mm/ksm.c index 38360285497a..9ee82c9bce94 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -2610,7 +2610,7 @@ struct page *ksm_might_need_to_copy(struct page *page, return new_page; } -void rmap_walk_ksm(struct folio *folio, const struct rmap_walk_control *rwc) +void rmap_walk_ksm(struct folio *folio, struct rmap_walk_control *rwc) { struct stable_node *stable_node; struct rmap_item *rmap_item; @@ -2634,7 +2634,13 @@ again: struct vm_area_struct *vma; cond_resched(); - anon_vma_lock_read(anon_vma); + if (!anon_vma_trylock_read(anon_vma)) { + if (rwc->try_lock) { + rwc->contended = true; + return; + } + anon_vma_lock_read(anon_vma); + } anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root, 0, ULONG_MAX) { unsigned long addr; diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 01f8b63d3621..a934ee8124dd 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -485,7 +485,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill, struct anon_vma *av; pgoff_t pgoff; - av = folio_lock_anon_vma_read(folio); + av = folio_lock_anon_vma_read(folio, NULL); if (av == NULL) /* Not actually mapped anymore */ return; diff --git a/mm/page_idle.c b/mm/page_idle.c index fc0435abf909..bc08332a609c 100644 --- a/mm/page_idle.c +++ b/mm/page_idle.c @@ -86,11 +86,12 @@ static bool page_idle_clear_pte_refs_one(struct folio *folio, static void page_idle_clear_pte_refs(struct page *page) { struct folio *folio = page_folio(page); + /* - * Since rwc.arg is unused, rwc is effectively immutable, so we - * can make it static const to save some cycles and stack. + * Since rwc.try_lock is unused, rwc is effectively immutable, so we + * can make it static to save some cycles and stack. */ - static const struct rmap_walk_control rwc = { + static struct rmap_walk_control rwc = { .rmap_one = page_idle_clear_pte_refs_one, .anon_lock = folio_lock_anon_vma_read, }; diff --git a/mm/rmap.c b/mm/rmap.c index 219e287a83d2..5bcb334cd6f2 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -527,9 +527,11 @@ out: * * Its a little more complex as it tries to keep the fast path to a single * atomic op -- the trylock. If we fail the trylock, we fall back to getting a - * reference like with page_get_anon_vma() and then block on the mutex. + * reference like with page_get_anon_vma() and then block on the mutex + * on !rwc->try_lock case. */ -struct anon_vma *folio_lock_anon_vma_read(struct folio *folio) +struct anon_vma *folio_lock_anon_vma_read(struct folio *folio, + struct rmap_walk_control *rwc) { struct anon_vma *anon_vma = NULL; struct anon_vma *root_anon_vma; @@ -557,6 +559,12 @@ struct anon_vma *folio_lock_anon_vma_read(struct folio *folio) goto out; } + if (rwc && rwc->try_lock) { + anon_vma = NULL; + rwc->contended = true; + goto out; + } + /* trylock failed, we got to sleep */ if (!atomic_inc_not_zero(&anon_vma->refcount)) { anon_vma = NULL; @@ -883,7 +891,8 @@ static bool invalid_folio_referenced_vma(struct vm_area_struct *vma, void *arg) * * Quick test_and_clear_referenced for all mappings of a folio, * - * Return: The number of mappings which referenced the folio. + * Return: The number of mappings which referenced the folio. Return -1 if + * the function bailed out due to rmap lock contention. */ int folio_referenced(struct folio *folio, int is_locked, struct mem_cgroup *memcg, unsigned long *vm_flags) @@ -897,6 +906,7 @@ int folio_referenced(struct folio *folio, int is_locked, .rmap_one = folio_referenced_one, .arg = (void *)&pra, .anon_lock = folio_lock_anon_vma_read, + .try_lock = true, }; *vm_flags = 0; @@ -927,7 +937,7 @@ int folio_referenced(struct folio *folio, int is_locked, if (we_locked) folio_unlock(folio); - return pra.referenced; + return rwc.contended ? -1 : pra.referenced; } static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw) @@ -2336,12 +2346,12 @@ void __put_anon_vma(struct anon_vma *anon_vma) } static struct anon_vma *rmap_walk_anon_lock(struct folio *folio, - const struct rmap_walk_control *rwc) + struct rmap_walk_control *rwc) { struct anon_vma *anon_vma; if (rwc->anon_lock) - return rwc->anon_lock(folio); + return rwc->anon_lock(folio, rwc); /* * Note: remove_migration_ptes() cannot use folio_lock_anon_vma_read() @@ -2353,7 +2363,17 @@ static struct anon_vma *rmap_walk_anon_lock(struct folio *folio, if (!anon_vma) return NULL; + if (anon_vma_trylock_read(anon_vma)) + goto out; + + if (rwc->try_lock) { + anon_vma = NULL; + rwc->contended = true; + goto out; + } + anon_vma_lock_read(anon_vma); +out: return anon_vma; } @@ -2367,7 +2387,7 @@ static struct anon_vma *rmap_walk_anon_lock(struct folio *folio, * contained in the anon_vma struct it points to. */ static void rmap_walk_anon(struct folio *folio, - const struct rmap_walk_control *rwc, bool locked) + struct rmap_walk_control *rwc, bool locked) { struct anon_vma *anon_vma; pgoff_t pgoff_start, pgoff_end; @@ -2415,7 +2435,7 @@ static void rmap_walk_anon(struct folio *folio, * contained in the address_space struct it points to. */ static void rmap_walk_file(struct folio *folio, - const struct rmap_walk_control *rwc, bool locked) + struct rmap_walk_control *rwc, bool locked) { struct address_space *mapping = folio_mapping(folio); pgoff_t pgoff_start, pgoff_end; @@ -2434,8 +2454,18 @@ static void rmap_walk_file(struct folio *folio, pgoff_start = folio_pgoff(folio); pgoff_end = pgoff_start + folio_nr_pages(folio) - 1; - if (!locked) + if (!locked) { + if (i_mmap_trylock_read(mapping)) + goto lookup; + + if (rwc->try_lock) { + rwc->contended = true; + return; + } + i_mmap_lock_read(mapping); + } +lookup: vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff_start, pgoff_end) { unsigned long address = vma_address(&folio->page, vma); @@ -2457,7 +2487,7 @@ done: i_mmap_unlock_read(mapping); } -void rmap_walk(struct folio *folio, const struct rmap_walk_control *rwc) +void rmap_walk(struct folio *folio, struct rmap_walk_control *rwc) { if (unlikely(folio_test_ksm(folio))) rmap_walk_ksm(folio, rwc); @@ -2468,7 +2498,7 @@ void rmap_walk(struct folio *folio, const struct rmap_walk_control *rwc) } /* Like rmap_walk, but caller holds relevant rmap lock */ -void rmap_walk_locked(struct folio *folio, const struct rmap_walk_control *rwc) +void rmap_walk_locked(struct folio *folio, struct rmap_walk_control *rwc) { /* no ksm support for now */ VM_BUG_ON_FOLIO(folio_test_ksm(folio), folio); diff --git a/mm/vmscan.c b/mm/vmscan.c index 24dbe04520cb..887edcd93a40 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1391,6 +1391,10 @@ static enum page_references folio_check_references(struct folio *folio, if (vm_flags & VM_LOCKED) return PAGEREF_ACTIVATE; + /* rmap lock contention: rotate */ + if (referenced_ptes == -1) + return PAGEREF_KEEP; + if (referenced_ptes) { /* * All mapped folios start out with page table @@ -2499,8 +2503,9 @@ static void shrink_active_list(unsigned long nr_to_scan, } } + /* Referenced or rmap lock contention: rotate */ if (folio_referenced(folio, 0, sc->target_mem_cgroup, - &vm_flags)) { + &vm_flags) != 0) { /* * Identify referenced, file-backed active pages and * give them one more trip around the active list. So -- cgit v1.2.3