From 8b28f621bea6f84d44adf7e804b73aff1e09105b Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Fri, 12 Dec 2014 16:54:18 -0800 Subject: mm,fs: introduce helpers around the i_mmap_mutex This series is a continuation of the conversion of the i_mmap_mutex to rwsem, following what we have for the anon memory counterpart. With Hugh's feedback from the first iteration. Ultimately, the most obvious paths that require exclusive ownership of the lock is when we modify the VMA interval tree, via vma_interval_tree_insert() and vma_interval_tree_remove() families. Cases such as unmapping, where the ptes content is changed but the tree remains untouched should make it safe to share the i_mmap_rwsem. As such, the code of course is straightforward, however the devil is very much in the details. While its been tested on a number of workloads without anything exploding, I would not be surprised if there are some less documented/known assumptions about the lock that could suffer from these changes. Or maybe I'm just missing something, but either way I believe its at the point where it could use more eyes and hopefully some time in linux-next. Because the lock type conversion is the heart of this patchset, its worth noting a few comparisons between mutex vs rwsem (xadd): (i) Same size, no extra footprint. (ii) Both have CONFIG_XXX_SPIN_ON_OWNER capabilities for exclusive lock ownership. (iii) Both can be slightly unfair wrt exclusive ownership, with writer lock stealing properties, not necessarily respecting FIFO order for granting the lock when contended. (iv) Mutexes can be slightly faster than rwsems when the lock is non-contended. (v) Both suck at performance for debug (slowpaths), which shouldn't matter anyway. Sharing the lock is obviously beneficial, and sem writer ownership is close enough to mutexes. The biggest winner of these changes is migration. As for concrete numbers, the following performance results are for a 4-socket 60-core IvyBridge-EX with 130Gb of RAM. Both alltests and disk (xfs+ramdisk) workloads of aim7 suite do quite well with this set, with a steady ~60% throughput (jpm) increase for alltests and up to ~30% for disk for high amounts of concurrency. Lower counts of workload users (< 100) does not show much difference at all, so at least no regressions. 3.18-rc1 3.18-rc1-i_mmap_rwsem alltests-100 17918.72 ( 0.00%) 28417.97 ( 58.59%) alltests-200 16529.39 ( 0.00%) 26807.92 ( 62.18%) alltests-300 16591.17 ( 0.00%) 26878.08 ( 62.00%) alltests-400 16490.37 ( 0.00%) 26664.63 ( 61.70%) alltests-500 16593.17 ( 0.00%) 26433.72 ( 59.30%) alltests-600 16508.56 ( 0.00%) 26409.20 ( 59.97%) alltests-700 16508.19 ( 0.00%) 26298.58 ( 59.31%) alltests-800 16437.58 ( 0.00%) 26433.02 ( 60.81%) alltests-900 16418.35 ( 0.00%) 26241.61 ( 59.83%) alltests-1000 16369.00 ( 0.00%) 26195.76 ( 60.03%) alltests-1100 16330.11 ( 0.00%) 26133.46 ( 60.03%) alltests-1200 16341.30 ( 0.00%) 26084.03 ( 59.62%) alltests-1300 16304.75 ( 0.00%) 26024.74 ( 59.61%) alltests-1400 16231.08 ( 0.00%) 25952.35 ( 59.89%) alltests-1500 16168.06 ( 0.00%) 25850.58 ( 59.89%) alltests-1600 16142.56 ( 0.00%) 25767.42 ( 59.62%) alltests-1700 16118.91 ( 0.00%) 25689.58 ( 59.38%) alltests-1800 16068.06 ( 0.00%) 25599.71 ( 59.32%) alltests-1900 16046.94 ( 0.00%) 25525.92 ( 59.07%) alltests-2000 16007.26 ( 0.00%) 25513.07 ( 59.38%) disk-100 7582.14 ( 0.00%) 7257.48 ( -4.28%) disk-200 6962.44 ( 0.00%) 7109.15 ( 2.11%) disk-300 6435.93 ( 0.00%) 6904.75 ( 7.28%) disk-400 6370.84 ( 0.00%) 6861.26 ( 7.70%) disk-500 6353.42 ( 0.00%) 6846.71 ( 7.76%) disk-600 6368.82 ( 0.00%) 6806.75 ( 6.88%) disk-700 6331.37 ( 0.00%) 6796.01 ( 7.34%) disk-800 6324.22 ( 0.00%) 6788.00 ( 7.33%) disk-900 6253.52 ( 0.00%) 6750.43 ( 7.95%) disk-1000 6242.53 ( 0.00%) 6855.11 ( 9.81%) disk-1100 6234.75 ( 0.00%) 6858.47 ( 10.00%) disk-1200 6312.76 ( 0.00%) 6845.13 ( 8.43%) disk-1300 6309.95 ( 0.00%) 6834.51 ( 8.31%) disk-1400 6171.76 ( 0.00%) 6787.09 ( 9.97%) disk-1500 6139.81 ( 0.00%) 6761.09 ( 10.12%) disk-1600 4807.12 ( 0.00%) 6725.33 ( 39.90%) disk-1700 4669.50 ( 0.00%) 5985.38 ( 28.18%) disk-1800 4663.51 ( 0.00%) 5972.99 ( 28.08%) disk-1900 4674.31 ( 0.00%) 5949.94 ( 27.29%) disk-2000 4668.36 ( 0.00%) 5834.93 ( 24.99%) In addition, a 67.5% increase in successfully migrated NUMA pages, thus improving node locality. The patch layout is simple but designed for bisection (in case reversion is needed if the changes break upstream) and easier review: o Patches 1-4 convert the i_mmap lock from mutex to rwsem. o Patches 5-10 share the lock in specific paths, each patch details the rationale behind why it should be safe. This patchset has been tested with: postgres 9.4 (with brand new hugetlb support), hugetlbfs test suite (all tests pass, in fact more tests pass with these changes than with an upstream kernel), ltp, aim7 benchmarks, memcached and iozone with the -B option for mmap'ing. *Untested* paths are nommu, memory-failure, uprobes and xip. This patch (of 8): Various parts of the kernel acquire and release this mutex, so add i_mmap_lock_write() and immap_unlock_write() helper functions that will encapsulate this logic. The next patch will make use of these. Signed-off-by: Davidlohr Bueso Reviewed-by: Rik van Riel Acked-by: "Kirill A. Shutemov" Acked-by: Hugh Dickins Cc: Oleg Nesterov Acked-by: Peter Zijlstra (Intel) Cc: Srikar Dronamraju Acked-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/fs.h | 10 ++++++++++ 1 file changed, 10 insertions(+) (limited to 'include/linux') diff --git a/include/linux/fs.h b/include/linux/fs.h index bb29b02d9bb6..bd0a1b2f3c02 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -467,6 +467,16 @@ struct block_device { int mapping_tagged(struct address_space *mapping, int tag); +static inline void i_mmap_lock_write(struct address_space *mapping) +{ + mutex_lock(&mapping->i_mmap_mutex); +} + +static inline void i_mmap_unlock_write(struct address_space *mapping) +{ + mutex_unlock(&mapping->i_mmap_mutex); +} + /* * Might pages of this file be mapped into userspace? */ -- cgit v1.2.3 From c8c06efa8b552608493b7066c234cfa82c47fcea Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Fri, 12 Dec 2014 16:54:24 -0800 Subject: mm: convert i_mmap_mutex to rwsem The i_mmap_mutex is a close cousin of the anon vma lock, both protecting similar data, one for file backed pages and the other for anon memory. To this end, this lock can also be a rwsem. In addition, there are some important opportunities to share the lock when there are no tree modifications. This conversion is straightforward. For now, all users take the write lock. [sfr@canb.auug.org.au: update fremap.c] Signed-off-by: Davidlohr Bueso Reviewed-by: Rik van Riel Acked-by: "Kirill A. Shutemov" Acked-by: Hugh Dickins Cc: Oleg Nesterov Acked-by: Peter Zijlstra (Intel) Cc: Srikar Dronamraju Acked-by: Mel Gorman Signed-off-by: Stephen Rothwell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/hugetlbfs/inode.c | 10 +++++----- fs/inode.c | 2 +- include/linux/fs.h | 7 ++++--- include/linux/mmu_notifier.h | 2 +- kernel/events/uprobes.c | 2 +- mm/filemap.c | 10 +++++----- mm/hugetlb.c | 10 +++++----- mm/mmap.c | 8 ++++---- mm/mremap.c | 2 +- mm/rmap.c | 6 +++--- 10 files changed, 30 insertions(+), 29 deletions(-) (limited to 'include/linux') diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index a082709aa427..5eba47f593f8 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -472,12 +472,12 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb, } /* - * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never + * Hugetlbfs is not reclaimable; therefore its i_mmap_rwsem will never * be taken from reclaim -- unlike regular filesystems. This needs an * annotation because huge_pmd_share() does an allocation under - * i_mmap_mutex. + * i_mmap_rwsem. */ -static struct lock_class_key hugetlbfs_i_mmap_mutex_key; +static struct lock_class_key hugetlbfs_i_mmap_rwsem_key; static struct inode *hugetlbfs_get_inode(struct super_block *sb, struct inode *dir, @@ -495,8 +495,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, struct hugetlbfs_inode_info *info; inode->i_ino = get_next_ino(); inode_init_owner(inode, dir, mode); - lockdep_set_class(&inode->i_mapping->i_mmap_mutex, - &hugetlbfs_i_mmap_mutex_key); + lockdep_set_class(&inode->i_mapping->i_mmap_rwsem, + &hugetlbfs_i_mmap_rwsem_key); inode->i_mapping->a_ops = &hugetlbfs_aops; inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info; inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; diff --git a/fs/inode.c b/fs/inode.c index 2ed95f7caa4f..ad60555b4768 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -346,7 +346,7 @@ void address_space_init_once(struct address_space *mapping) memset(mapping, 0, sizeof(*mapping)); INIT_RADIX_TREE(&mapping->page_tree, GFP_ATOMIC); spin_lock_init(&mapping->tree_lock); - mutex_init(&mapping->i_mmap_mutex); + init_rwsem(&mapping->i_mmap_rwsem); INIT_LIST_HEAD(&mapping->private_list); spin_lock_init(&mapping->private_lock); mapping->i_mmap = RB_ROOT; diff --git a/include/linux/fs.h b/include/linux/fs.h index bd0a1b2f3c02..6abcd0b72ae0 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include @@ -401,7 +402,7 @@ struct address_space { atomic_t i_mmap_writable;/* count VM_SHARED mappings */ struct rb_root i_mmap; /* tree of private and shared mappings */ struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ - struct mutex i_mmap_mutex; /* protect tree, count, list */ + struct rw_semaphore i_mmap_rwsem; /* protect tree, count, list */ /* Protected by tree_lock together with the radix tree */ unsigned long nrpages; /* number of total pages */ unsigned long nrshadows; /* number of shadow entries */ @@ -469,12 +470,12 @@ int mapping_tagged(struct address_space *mapping, int tag); static inline void i_mmap_lock_write(struct address_space *mapping) { - mutex_lock(&mapping->i_mmap_mutex); + down_write(&mapping->i_mmap_rwsem); } static inline void i_mmap_unlock_write(struct address_space *mapping) { - mutex_unlock(&mapping->i_mmap_mutex); + up_write(&mapping->i_mmap_rwsem); } /* diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 88787bb4b3b9..ab8564b03468 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -154,7 +154,7 @@ struct mmu_notifier_ops { * Therefore notifier chains can only be traversed when either * * 1. mmap_sem is held. - * 2. One of the reverse map locks is held (i_mmap_mutex or anon_vma->rwsem). + * 2. One of the reverse map locks is held (i_mmap_rwsem or anon_vma->rwsem). * 3. No other concurrent thread can access the list (release) */ struct mmu_notifier { diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index aac81bf9df09..1901dbfa7ce0 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -731,7 +731,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register) if (!prev && !more) { /* - * Needs GFP_NOWAIT to avoid i_mmap_mutex recursion through + * Needs GFP_NOWAIT to avoid i_mmap_rwsem recursion through * reclaim. This is optimistic, no harm done if it fails. */ prev = kmalloc(sizeof(struct map_info), diff --git a/mm/filemap.c b/mm/filemap.c index 14b4642279f1..e8905bc3cbd7 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -62,16 +62,16 @@ /* * Lock ordering: * - * ->i_mmap_mutex (truncate_pagecache) + * ->i_mmap_rwsem (truncate_pagecache) * ->private_lock (__free_pte->__set_page_dirty_buffers) * ->swap_lock (exclusive_swap_page, others) * ->mapping->tree_lock * * ->i_mutex - * ->i_mmap_mutex (truncate->unmap_mapping_range) + * ->i_mmap_rwsem (truncate->unmap_mapping_range) * * ->mmap_sem - * ->i_mmap_mutex + * ->i_mmap_rwsem * ->page_table_lock or pte_lock (various, mainly in memory.c) * ->mapping->tree_lock (arch-dependent flush_dcache_mmap_lock) * @@ -85,7 +85,7 @@ * sb_lock (fs/fs-writeback.c) * ->mapping->tree_lock (__sync_single_inode) * - * ->i_mmap_mutex + * ->i_mmap_rwsem * ->anon_vma.lock (vma_adjust) * * ->anon_vma.lock @@ -105,7 +105,7 @@ * ->inode->i_lock (zap_pte_range->set_page_dirty) * ->private_lock (zap_pte_range->__set_page_dirty_buffers) * - * ->i_mmap_mutex + * ->i_mmap_rwsem * ->tasklist_lock (memory_failure, collect_procs_ao) */ diff --git a/mm/hugetlb.c b/mm/hugetlb.c index ffe19304cc09..989cb032eaf5 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2726,9 +2726,9 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb, * on its way out. We're lucky that the flag has such an appropriate * name, and can in fact be safely cleared here. We could clear it * before the __unmap_hugepage_range above, but all that's necessary - * is to clear it before releasing the i_mmap_mutex. This works + * is to clear it before releasing the i_mmap_rwsem. This works * because in the context this is called, the VMA is about to be - * destroyed and the i_mmap_mutex is held. + * destroyed and the i_mmap_rwsem is held. */ vma->vm_flags &= ~VM_MAYSHARE; } @@ -3370,9 +3370,9 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, spin_unlock(ptl); } /* - * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare + * Must flush TLB before releasing i_mmap_rwsem: x86's huge_pmd_unshare * may have cleared our pud entry and done put_page on the page table: - * once we release i_mmap_mutex, another task can do the final put_page + * once we release i_mmap_rwsem, another task can do the final put_page * and that page table be reused and filled with junk. */ flush_tlb_range(vma, start, end); @@ -3525,7 +3525,7 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr) * and returns the corresponding pte. While this is not necessary for the * !shared pmd case because we can allocate the pmd later as well, it makes the * code much cleaner. pmd allocation is essential for the shared case because - * pud has to be populated inside the same i_mmap_mutex section - otherwise + * pud has to be populated inside the same i_mmap_rwsem section - otherwise * racing tasks could either miss the sharing (see huge_pte_offset) or select a * bad pmd for sharing. */ diff --git a/mm/mmap.c b/mm/mmap.c index ecd6ecf48778..0d84b2f86f3b 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -232,7 +232,7 @@ error: } /* - * Requires inode->i_mapping->i_mmap_mutex + * Requires inode->i_mapping->i_mmap_rwsem */ static void __remove_shared_vm_struct(struct vm_area_struct *vma, struct file *file, struct address_space *mapping) @@ -2791,7 +2791,7 @@ void exit_mmap(struct mm_struct *mm) /* Insert vm structure into process list sorted by address * and into the inode's i_mmap tree. If vm_file is non-NULL - * then i_mmap_mutex is taken here. + * then i_mmap_rwsem is taken here. */ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) { @@ -3086,7 +3086,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping) */ if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags)) BUG(); - mutex_lock_nest_lock(&mapping->i_mmap_mutex, &mm->mmap_sem); + down_write_nest_lock(&mapping->i_mmap_rwsem, &mm->mmap_sem); } } @@ -3113,7 +3113,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping) * vma in this mm is backed by the same anon_vma or address_space. * * We can take all the locks in random order because the VM code - * taking i_mmap_mutex or anon_vma->rwsem outside the mmap_sem never + * taking i_mmap_rwsem or anon_vma->rwsem outside the mmap_sem never * takes more than one of them in a row. Secondly we're protected * against a concurrent mm_take_all_locks() by the mm_all_locks_mutex. * diff --git a/mm/mremap.c b/mm/mremap.c index 426b448d6447..84aa36f9f308 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -99,7 +99,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, spinlock_t *old_ptl, *new_ptl; /* - * When need_rmap_locks is true, we take the i_mmap_mutex and anon_vma + * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma * locks to ensure that rmap will always observe either the old or the * new ptes. This is the easiest way to avoid races with * truncate_pagecache(), page migration, etc... diff --git a/mm/rmap.c b/mm/rmap.c index bea03f6bec61..18247f89f1a8 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -23,7 +23,7 @@ * inode->i_mutex (while writing or truncating, not reading or faulting) * mm->mmap_sem * page->flags PG_locked (lock_page) - * mapping->i_mmap_mutex + * mapping->i_mmap_rwsem * anon_vma->rwsem * mm->page_table_lock or pte_lock * zone->lru_lock (in mark_page_accessed, isolate_lru_page) @@ -1260,7 +1260,7 @@ out_mlock: /* * We need mmap_sem locking, Otherwise VM_LOCKED check makes * unstable result and race. Plus, We can't wait here because - * we now hold anon_vma->rwsem or mapping->i_mmap_mutex. + * we now hold anon_vma->rwsem or mapping->i_mmap_rwsem. * if trylock failed, the page remain in evictable lru and later * vmscan could retry to move the page to unevictable lru if the * page is actually mlocked. @@ -1684,7 +1684,7 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc) * The page lock not only makes sure that page->mapping cannot * suddenly be NULLified by truncation, it makes sure that the * structure at mapping cannot be freed and reused yet, - * so we can safely take mapping->i_mmap_mutex. + * so we can safely take mapping->i_mmap_rwsem. */ VM_BUG_ON_PAGE(!PageLocked(page), page); -- cgit v1.2.3 From 3dec0ba0be6a532cac949e02b853021bf6d57dad Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Fri, 12 Dec 2014 16:54:27 -0800 Subject: mm/rmap: share the i_mmap_rwsem Similarly to the anon memory counterpart, we can share the mapping's lock ownership as the interval tree is not modified when doing doing the walk, only the file page. Signed-off-by: Davidlohr Bueso Acked-by: Rik van Riel Acked-by: "Kirill A. Shutemov" Acked-by: Hugh Dickins Cc: Oleg Nesterov Acked-by: Peter Zijlstra (Intel) Cc: Srikar Dronamraju Acked-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/fs.h | 10 ++++++++++ mm/rmap.c | 6 +++--- 2 files changed, 13 insertions(+), 3 deletions(-) (limited to 'include/linux') diff --git a/include/linux/fs.h b/include/linux/fs.h index 6abcd0b72ae0..1d1838de6882 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -478,6 +478,16 @@ static inline void i_mmap_unlock_write(struct address_space *mapping) up_write(&mapping->i_mmap_rwsem); } +static inline void i_mmap_lock_read(struct address_space *mapping) +{ + down_read(&mapping->i_mmap_rwsem); +} + +static inline void i_mmap_unlock_read(struct address_space *mapping) +{ + up_read(&mapping->i_mmap_rwsem); +} + /* * Might pages of this file be mapped into userspace? */ diff --git a/mm/rmap.c b/mm/rmap.c index 18247f89f1a8..14ad2b3b0f54 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1690,7 +1690,8 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc) if (!mapping) return ret; - i_mmap_lock_write(mapping); + + i_mmap_lock_read(mapping); vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { unsigned long address = vma_address(page, vma); @@ -1711,9 +1712,8 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc) goto done; ret = rwc->file_nonlinear(page, mapping, rwc->arg); - done: - i_mmap_unlock_write(mapping); + i_mmap_unlock_read(mapping); return ret; } -- cgit v1.2.3 From 5e19b013f55a884c59a14391b22138899d1cc4cc Mon Sep 17 00:00:00 2001 From: Michal Nazarewicz Date: Fri, 12 Dec 2014 16:54:45 -0800 Subject: lib: bitmap: add alignment offset for bitmap_find_next_zero_area() Add a bitmap_find_next_zero_area_off() function which works like bitmap_find_next_zero_area() function except it allows an offset to be specified when alignment is checked. This lets caller request a bit such that its number plus the offset is aligned according to the mask. [gregory.0xf0@gmail.com: Retrieved from https://patchwork.linuxtv.org/patch/6254/ and updated documentation] Signed-off-by: Michal Nazarewicz Signed-off-by: Kyungmin Park Signed-off-by: Marek Szyprowski Signed-off-by: Gregory Fong Cc: Joonsoo Kim Cc: Kukjin Kim Cc: Laurent Pinchart Cc: Laura Abbott Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/bitmap.h | 36 +++++++++++++++++++++++++++++++----- lib/bitmap.c | 24 +++++++++++++----------- 2 files changed, 44 insertions(+), 16 deletions(-) (limited to 'include/linux') diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h index e1c8d080c427..34e020c23644 100644 --- a/include/linux/bitmap.h +++ b/include/linux/bitmap.h @@ -45,6 +45,7 @@ * bitmap_set(dst, pos, nbits) Set specified bit area * bitmap_clear(dst, pos, nbits) Clear specified bit area * bitmap_find_next_zero_area(buf, len, pos, n, mask) Find bit free area + * bitmap_find_next_zero_area_off(buf, len, pos, n, mask) as above * bitmap_shift_right(dst, src, n, nbits) *dst = *src >> n * bitmap_shift_left(dst, src, n, nbits) *dst = *src << n * bitmap_remap(dst, src, old, new, nbits) *dst = map(old, new)(src) @@ -114,11 +115,36 @@ extern int __bitmap_weight(const unsigned long *bitmap, unsigned int nbits); extern void bitmap_set(unsigned long *map, unsigned int start, int len); extern void bitmap_clear(unsigned long *map, unsigned int start, int len); -extern unsigned long bitmap_find_next_zero_area(unsigned long *map, - unsigned long size, - unsigned long start, - unsigned int nr, - unsigned long align_mask); + +extern unsigned long bitmap_find_next_zero_area_off(unsigned long *map, + unsigned long size, + unsigned long start, + unsigned int nr, + unsigned long align_mask, + unsigned long align_offset); + +/** + * bitmap_find_next_zero_area - find a contiguous aligned zero area + * @map: The address to base the search on + * @size: The bitmap size in bits + * @start: The bitnumber to start searching at + * @nr: The number of zeroed bits we're looking for + * @align_mask: Alignment mask for zero area + * + * The @align_mask should be one less than a power of 2; the effect is that + * the bit offset of all zero areas this function finds is multiples of that + * power of 2. A @align_mask of 0 means no alignment is required. + */ +static inline unsigned long +bitmap_find_next_zero_area(unsigned long *map, + unsigned long size, + unsigned long start, + unsigned int nr, + unsigned long align_mask) +{ + return bitmap_find_next_zero_area_off(map, size, start, nr, + align_mask, 0); +} extern int bitmap_scnprintf(char *buf, unsigned int len, const unsigned long *src, int nbits); diff --git a/lib/bitmap.c b/lib/bitmap.c index b499ab6ada29..969ae8fbc85b 100644 --- a/lib/bitmap.c +++ b/lib/bitmap.c @@ -326,30 +326,32 @@ void bitmap_clear(unsigned long *map, unsigned int start, int len) } EXPORT_SYMBOL(bitmap_clear); -/* - * bitmap_find_next_zero_area - find a contiguous aligned zero area +/** + * bitmap_find_next_zero_area_off - find a contiguous aligned zero area * @map: The address to base the search on * @size: The bitmap size in bits * @start: The bitnumber to start searching at * @nr: The number of zeroed bits we're looking for * @align_mask: Alignment mask for zero area + * @align_offset: Alignment offset for zero area. * * The @align_mask should be one less than a power of 2; the effect is that - * the bit offset of all zero areas this function finds is multiples of that - * power of 2. A @align_mask of 0 means no alignment is required. + * the bit offset of all zero areas this function finds plus @align_offset + * is multiple of that power of 2. */ -unsigned long bitmap_find_next_zero_area(unsigned long *map, - unsigned long size, - unsigned long start, - unsigned int nr, - unsigned long align_mask) +unsigned long bitmap_find_next_zero_area_off(unsigned long *map, + unsigned long size, + unsigned long start, + unsigned int nr, + unsigned long align_mask, + unsigned long align_offset) { unsigned long index, end, i; again: index = find_next_zero_bit(map, size, start); /* Align allocation */ - index = __ALIGN_MASK(index, align_mask); + index = __ALIGN_MASK(index + align_offset, align_mask) - align_offset; end = index + nr; if (end > size) @@ -361,7 +363,7 @@ again: } return index; } -EXPORT_SYMBOL(bitmap_find_next_zero_area); +EXPORT_SYMBOL(bitmap_find_next_zero_area_off); /* * Bitmap printing & parsing functions: first version by Nadia Yvette Chambers, -- cgit v1.2.3 From 6f185c290edec576a2cccd6670e5b8e02e6f04db Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Fri, 12 Dec 2014 16:55:15 -0800 Subject: memcg: turn memcg_kmem_skip_account into a bit field It isn't supposed to stack, so turn it into a bit-field to save 4 bytes on the task_struct. Also, remove the memcg_stop/resume_kmem_account helpers - it is clearer to set/clear the flag inline. Regarding the overwhelming comment to the helpers, which is removed by this patch too, we already have a compact yet accurate explanation in memcg_schedule_cache_create, no need in yet another one. Signed-off-by: Vladimir Davydov Cc: Johannes Weiner Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/sched.h | 7 +++++-- mm/memcontrol.c | 35 ++--------------------------------- 2 files changed, 7 insertions(+), 35 deletions(-) (limited to 'include/linux') diff --git a/include/linux/sched.h b/include/linux/sched.h index 55f5ee7cc3d3..4cfdbcf8cf56 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1364,6 +1364,10 @@ struct task_struct { unsigned sched_reset_on_fork:1; unsigned sched_contributes_to_load:1; +#ifdef CONFIG_MEMCG_KMEM + unsigned memcg_kmem_skip_account:1; +#endif + unsigned long atomic_flags; /* Flags needing atomic access. */ pid_t pid; @@ -1679,8 +1683,7 @@ struct task_struct { /* bitmask and counter of trace recursion */ unsigned long trace_recursion; #endif /* CONFIG_TRACING */ -#ifdef CONFIG_MEMCG /* memcg uses this to do batch job */ - unsigned int memcg_kmem_skip_account; +#ifdef CONFIG_MEMCG struct memcg_oom_info { struct mem_cgroup *memcg; gfp_t gfp_mask; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d9fab72da52e..11cbfde4dc6d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2673,37 +2673,6 @@ static void memcg_unregister_cache(struct kmem_cache *cachep) css_put(&memcg->css); } -/* - * During the creation a new cache, we need to disable our accounting mechanism - * altogether. This is true even if we are not creating, but rather just - * enqueing new caches to be created. - * - * This is because that process will trigger allocations; some visible, like - * explicit kmallocs to auxiliary data structures, name strings and internal - * cache structures; some well concealed, like INIT_WORK() that can allocate - * objects during debug. - * - * If any allocation happens during memcg_kmem_get_cache, we will recurse back - * to it. This may not be a bounded recursion: since the first cache creation - * failed to complete (waiting on the allocation), we'll just try to create the - * cache again, failing at the same point. - * - * memcg_kmem_get_cache is prepared to abort after seeing a positive count of - * memcg_kmem_skip_account. So we enclose anything that might allocate memory - * inside the following two functions. - */ -static inline void memcg_stop_kmem_account(void) -{ - VM_BUG_ON(!current->mm); - current->memcg_kmem_skip_account++; -} - -static inline void memcg_resume_kmem_account(void) -{ - VM_BUG_ON(!current->mm); - current->memcg_kmem_skip_account--; -} - int __memcg_cleanup_cache_params(struct kmem_cache *s) { struct kmem_cache *c; @@ -2798,9 +2767,9 @@ static void memcg_schedule_register_cache(struct mem_cgroup *memcg, * this point we can't allow ourselves back into memcg_kmem_get_cache, * the safest choice is to do it like this, wrapping the whole function. */ - memcg_stop_kmem_account(); + current->memcg_kmem_skip_account = 1; __memcg_schedule_register_cache(memcg, cachep); - memcg_resume_kmem_account(); + current->memcg_kmem_skip_account = 0; } int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order) -- cgit v1.2.3 From bd6dace78b9e4595d29892f806514518449c7489 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 12 Dec 2014 16:55:35 -0800 Subject: mm: move swp_entry_t definition to include/linux/mm_types.h swp_entry_t being defined in include/linux/swap.h instead of include/linux/mm_types.h causes cyclic include dependency later when include/linux/page_cgroup.h is included from writeback path. Move the definition to include/linux/mm_types.h. While at it, reformat the comment above it. Signed-off-by: Tejun Heo Cc: Mel Gorman Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm_types.h | 8 ++++++++ include/linux/swap.h | 8 -------- 2 files changed, 8 insertions(+), 8 deletions(-) (limited to 'include/linux') diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index bf9f57529dcf..fc2daffa9db1 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -534,4 +534,12 @@ enum tlb_flush_reason { NR_TLB_FLUSH_REASONS, }; + /* + * A swap entry has to fit into a "unsigned long", as the entry is hidden + * in the "index" field of the swapper address space. + */ +typedef struct { + unsigned long val; +} swp_entry_t; + #endif /* _LINUX_MM_TYPES_H */ diff --git a/include/linux/swap.h b/include/linux/swap.h index 37a585beef5c..34e8b60ab973 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -102,14 +102,6 @@ union swap_header { } info; }; - /* A swap entry has to fit into a "unsigned long", as - * the entry is hidden in the "index" field of the - * swapper address space. - */ -typedef struct { - unsigned long val; -} swp_entry_t; - /* * current->reclaim_state points to one of these when a task is running * memory reclaim -- cgit v1.2.3 From 056b7ccef4bc670b1ed77181159c8228de0926ab Mon Sep 17 00:00:00 2001 From: Zhang Zhen Date: Fri, 12 Dec 2014 16:55:38 -0800 Subject: mm/memcontrol.c: remove the unused arg in __memcg_kmem_get_cache() The gfp was passed in but never used in this function. Signed-off-by: Zhang Zhen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 4 ++-- mm/memcontrol.c | 3 +-- 2 files changed, 3 insertions(+), 4 deletions(-) (limited to 'include/linux') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 6ea9f919e888..b74942a9e22f 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -401,7 +401,7 @@ int memcg_cache_id(struct mem_cgroup *memcg); void memcg_update_array_size(int num_groups); struct kmem_cache * -__memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp); +__memcg_kmem_get_cache(struct kmem_cache *cachep); int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order); void __memcg_uncharge_slab(struct kmem_cache *cachep, int order); @@ -492,7 +492,7 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp) if (unlikely(fatal_signal_pending(current))) return cachep; - return __memcg_kmem_get_cache(cachep, gfp); + return __memcg_kmem_get_cache(cachep); } #else #define for_each_memcg_cache_index(_idx) \ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 11cbfde4dc6d..c6ac50e7d1c2 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2804,8 +2804,7 @@ void __memcg_uncharge_slab(struct kmem_cache *cachep, int order) * Can't be called in interrupt context or from kernel threads. * This function needs to be called with rcu_read_lock() held. */ -struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep, - gfp_t gfp) +struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep) { struct mem_cgroup *memcg; struct kmem_cache *memcg_cachep; -- cgit v1.2.3 From 66f2ca7e3f59312888131546176b42d6e248558a Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Fri, 12 Dec 2014 16:55:41 -0800 Subject: include/linux/kmemleak.h: needs slab.h include/linux/kmemleak.h: In function 'kmemleak_alloc_recursive': include/linux/kmemleak.h:43: error: 'SLAB_NOLEAKTRACE' undeclared (first use in this function) Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/kmemleak.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include/linux') diff --git a/include/linux/kmemleak.h b/include/linux/kmemleak.h index 057e95971014..e705467ddb47 100644 --- a/include/linux/kmemleak.h +++ b/include/linux/kmemleak.h @@ -21,6 +21,8 @@ #ifndef __KMEMLEAK_H #define __KMEMLEAK_H +#include + #ifdef CONFIG_DEBUG_KMEMLEAK extern void kmemleak_init(void) __ref; -- cgit v1.2.3 From 2d48366b3ff745729815c15077508f8d7722ec5f Mon Sep 17 00:00:00 2001 From: Jianyu Zhan Date: Fri, 12 Dec 2014 16:55:43 -0800 Subject: mm, gfp: escalatedly define GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE GFP_USER, GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE are escalatedly confined defined, also implied by their names: GFP_USER = GFP_USER GFP_USER + __GFP_HIGHMEM = GFP_HIGHUSER GFP_USER + __GFP_HIGHMEM + __GFP_MOVABLE = GFP_HIGHUSER_MOVABLE So just make GFP_HIGHUSER and GFP_HIGHUSER_MOVABLE escalatedly defined to reflect this fact. It also makes the definition clear and texturally warn on any furture break-up of this escalated relastionship. Signed-off-by: Jianyu Zhan Acked-by: Johannes Weiner Acked-by: Kirill A. Shutemov Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/gfp.h | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) (limited to 'include/linux') diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 07d2699cdb51..b840e3b2770d 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -110,11 +110,8 @@ struct vm_area_struct; #define GFP_TEMPORARY (__GFP_WAIT | __GFP_IO | __GFP_FS | \ __GFP_RECLAIMABLE) #define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL) -#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \ - __GFP_HIGHMEM) -#define GFP_HIGHUSER_MOVABLE (__GFP_WAIT | __GFP_IO | __GFP_FS | \ - __GFP_HARDWALL | __GFP_HIGHMEM | \ - __GFP_MOVABLE) +#define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM) +#define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE) #define GFP_IOFS (__GFP_IO | __GFP_FS) #define GFP_TRANSHUGE (GFP_HIGHUSER_MOVABLE | __GFP_COMP | \ __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \ -- cgit v1.2.3 From eefa864b701d78dc9753c70a3540a2e9ae192595 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Fri, 12 Dec 2014 16:55:46 -0800 Subject: mm/page_ext: resurrect struct page extending code for debugging When we debug something, we'd like to insert some information to every page. For this purpose, we sometimes modify struct page itself. But, this has drawbacks. First, it requires re-compile. This makes us hesitate to use the powerful debug feature so development process is slowed down. And, second, sometimes it is impossible to rebuild the kernel due to third party module dependency. At third, system behaviour would be largely different after re-compile, because it changes size of struct page greatly and this structure is accessed by every part of kernel. Keeping this as it is would be better to reproduce errornous situation. This feature is intended to overcome above mentioned problems. This feature allocates memory for extended data per page in certain place rather than the struct page itself. This memory can be accessed by the accessor functions provided by this code. During the boot process, it checks whether allocation of huge chunk of memory is needed or not. If not, it avoids allocating memory at all. With this advantage, we can include this feature into the kernel in default and can avoid rebuild and solve related problems. Until now, memcg uses this technique. But, now, memcg decides to embed their variable to struct page itself and it's code to extend struct page has been removed. I'd like to use this code to develop debug feature, so this patch resurrect it. To help these things to work well, this patch introduces two callbacks for clients. One is the need callback which is mandatory if user wants to avoid useless memory allocation at boot-time. The other is optional, init callback, which is used to do proper initialization after memory is allocated. Detailed explanation about purpose of these functions is in code comment. Please refer it. Others are completely same with previous extension code in memcg. Signed-off-by: Joonsoo Kim Cc: Mel Gorman Cc: Johannes Weiner Cc: Minchan Kim Cc: Dave Hansen Cc: Michal Nazarewicz Cc: Jungsoo Son Cc: Ingo Molnar Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 12 ++ include/linux/page_ext.h | 59 +++++++ init/main.c | 7 + mm/Kconfig.debug | 9 ++ mm/Makefile | 1 + mm/page_alloc.c | 2 + mm/page_ext.c | 395 +++++++++++++++++++++++++++++++++++++++++++++++ 7 files changed, 485 insertions(+) create mode 100644 include/linux/page_ext.h create mode 100644 mm/page_ext.c (limited to 'include/linux') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 3879d7664dfc..2f0856d14b21 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -722,6 +722,9 @@ typedef struct pglist_data { int nr_zones; #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */ struct page *node_mem_map; +#ifdef CONFIG_PAGE_EXTENSION + struct page_ext *node_page_ext; +#endif #endif #ifndef CONFIG_NO_BOOTMEM struct bootmem_data *bdata; @@ -1075,6 +1078,7 @@ static inline unsigned long early_pfn_to_nid(unsigned long pfn) #define SECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SECTION_MASK) struct page; +struct page_ext; struct mem_section { /* * This is, logically, a pointer to an array of struct @@ -1092,6 +1096,14 @@ struct mem_section { /* See declaration of similar field in struct zone */ unsigned long *pageblock_flags; +#ifdef CONFIG_PAGE_EXTENSION + /* + * If !SPARSEMEM, pgdat doesn't have page_ext pointer. We use + * section. (see page_ext.h about this.) + */ + struct page_ext *page_ext; + unsigned long pad; +#endif /* * WARNING: mem_section must be a power-of-2 in size for the * calculation and use of SECTION_ROOT_MASK to make sense. diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h new file mode 100644 index 000000000000..2ccc8b414e5c --- /dev/null +++ b/include/linux/page_ext.h @@ -0,0 +1,59 @@ +#ifndef __LINUX_PAGE_EXT_H +#define __LINUX_PAGE_EXT_H + +struct pglist_data; +struct page_ext_operations { + bool (*need)(void); + void (*init)(void); +}; + +#ifdef CONFIG_PAGE_EXTENSION + +/* + * Page Extension can be considered as an extended mem_map. + * A page_ext page is associated with every page descriptor. The + * page_ext helps us add more information about the page. + * All page_ext are allocated at boot or memory hotplug event, + * then the page_ext for pfn always exists. + */ +struct page_ext { + unsigned long flags; +}; + +extern void pgdat_page_ext_init(struct pglist_data *pgdat); + +#ifdef CONFIG_SPARSEMEM +static inline void page_ext_init_flatmem(void) +{ +} +extern void page_ext_init(void); +#else +extern void page_ext_init_flatmem(void); +static inline void page_ext_init(void) +{ +} +#endif + +struct page_ext *lookup_page_ext(struct page *page); + +#else /* !CONFIG_PAGE_EXTENSION */ +struct page_ext; + +static inline void pgdat_page_ext_init(struct pglist_data *pgdat) +{ +} + +static inline struct page_ext *lookup_page_ext(struct page *page) +{ + return NULL; +} + +static inline void page_ext_init(void) +{ +} + +static inline void page_ext_init_flatmem(void) +{ +} +#endif /* CONFIG_PAGE_EXTENSION */ +#endif /* __LINUX_PAGE_EXT_H */ diff --git a/init/main.c b/init/main.c index ca380ec685de..ed7e7ad5fee0 100644 --- a/init/main.c +++ b/init/main.c @@ -51,6 +51,7 @@ #include #include #include +#include #include #include #include @@ -484,6 +485,11 @@ void __init __weak thread_info_cache_init(void) */ static void __init mm_init(void) { + /* + * page_ext requires contiguous pages, + * bigger than MAX_ORDER unless SPARSEMEM. + */ + page_ext_init_flatmem(); mem_init(); kmem_cache_init(); percpu_init_late(); @@ -621,6 +627,7 @@ asmlinkage __visible void __init start_kernel(void) initrd_start = 0; } #endif + page_ext_init(); debug_objects_mem_init(); kmemleak_init(); setup_per_cpu_pageset(); diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug index 4b2443254de2..1ba81c7769f7 100644 --- a/mm/Kconfig.debug +++ b/mm/Kconfig.debug @@ -1,3 +1,12 @@ +config PAGE_EXTENSION + bool "Extend memmap on extra space for more information on page" + ---help--- + Extend memmap on extra space for more information on page. This + could be used for debugging features that need to insert extra + field for every page. This extension enables us to save memory + by not allocating this extra memory according to boottime + configuration. + config DEBUG_PAGEALLOC bool "Debug page memory allocations" depends on DEBUG_KERNEL diff --git a/mm/Makefile b/mm/Makefile index b3c6ce932c64..580cd3f392af 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -71,3 +71,4 @@ obj-$(CONFIG_ZSMALLOC) += zsmalloc.o obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o obj-$(CONFIG_CMA) += cma.o obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o +obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2e8b7f39605a..b64666cf5865 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -48,6 +48,7 @@ #include #include #include +#include #include #include #include @@ -4856,6 +4857,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat, #endif init_waitqueue_head(&pgdat->kswapd_wait); init_waitqueue_head(&pgdat->pfmemalloc_wait); + pgdat_page_ext_init(pgdat); for (j = 0; j < MAX_NR_ZONES; j++) { struct zone *zone = pgdat->node_zones + j; diff --git a/mm/page_ext.c b/mm/page_ext.c new file mode 100644 index 000000000000..514a3bccd63f --- /dev/null +++ b/mm/page_ext.c @@ -0,0 +1,395 @@ +#include +#include +#include +#include +#include +#include +#include + +/* + * struct page extension + * + * This is the feature to manage memory for extended data per page. + * + * Until now, we must modify struct page itself to store extra data per page. + * This requires rebuilding the kernel and it is really time consuming process. + * And, sometimes, rebuild is impossible due to third party module dependency. + * At last, enlarging struct page could cause un-wanted system behaviour change. + * + * This feature is intended to overcome above mentioned problems. This feature + * allocates memory for extended data per page in certain place rather than + * the struct page itself. This memory can be accessed by the accessor + * functions provided by this code. During the boot process, it checks whether + * allocation of huge chunk of memory is needed or not. If not, it avoids + * allocating memory at all. With this advantage, we can include this feature + * into the kernel in default and can avoid rebuild and solve related problems. + * + * To help these things to work well, there are two callbacks for clients. One + * is the need callback which is mandatory if user wants to avoid useless + * memory allocation at boot-time. The other is optional, init callback, which + * is used to do proper initialization after memory is allocated. + * + * The need callback is used to decide whether extended memory allocation is + * needed or not. Sometimes users want to deactivate some features in this + * boot and extra memory would be unneccessary. In this case, to avoid + * allocating huge chunk of memory, each clients represent their need of + * extra memory through the need callback. If one of the need callbacks + * returns true, it means that someone needs extra memory so that + * page extension core should allocates memory for page extension. If + * none of need callbacks return true, memory isn't needed at all in this boot + * and page extension core can skip to allocate memory. As result, + * none of memory is wasted. + * + * The init callback is used to do proper initialization after page extension + * is completely initialized. In sparse memory system, extra memory is + * allocated some time later than memmap is allocated. In other words, lifetime + * of memory for page extension isn't same with memmap for struct page. + * Therefore, clients can't store extra data until page extension is + * initialized, even if pages are allocated and used freely. This could + * cause inadequate state of extra data per page, so, to prevent it, client + * can utilize this callback to initialize the state of it correctly. + */ + +static struct page_ext_operations *page_ext_ops[] = { +}; + +static unsigned long total_usage; + +static bool __init invoke_need_callbacks(void) +{ + int i; + int entries = ARRAY_SIZE(page_ext_ops); + + for (i = 0; i < entries; i++) { + if (page_ext_ops[i]->need && page_ext_ops[i]->need()) + return true; + } + + return false; +} + +static void __init invoke_init_callbacks(void) +{ + int i; + int entries = ARRAY_SIZE(page_ext_ops); + + for (i = 0; i < entries; i++) { + if (page_ext_ops[i]->init) + page_ext_ops[i]->init(); + } +} + +#if !defined(CONFIG_SPARSEMEM) + + +void __meminit pgdat_page_ext_init(struct pglist_data *pgdat) +{ + pgdat->node_page_ext = NULL; +} + +struct page_ext *lookup_page_ext(struct page *page) +{ + unsigned long pfn = page_to_pfn(page); + unsigned long offset; + struct page_ext *base; + + base = NODE_DATA(page_to_nid(page))->node_page_ext; +#ifdef CONFIG_DEBUG_VM + /* + * The sanity checks the page allocator does upon freeing a + * page can reach here before the page_ext arrays are + * allocated when feeding a range of pages to the allocator + * for the first time during bootup or memory hotplug. + */ + if (unlikely(!base)) + return NULL; +#endif + offset = pfn - round_down(node_start_pfn(page_to_nid(page)), + MAX_ORDER_NR_PAGES); + return base + offset; +} + +static int __init alloc_node_page_ext(int nid) +{ + struct page_ext *base; + unsigned long table_size; + unsigned long nr_pages; + + nr_pages = NODE_DATA(nid)->node_spanned_pages; + if (!nr_pages) + return 0; + + /* + * Need extra space if node range is not aligned with + * MAX_ORDER_NR_PAGES. When page allocator's buddy algorithm + * checks buddy's status, range could be out of exact node range. + */ + if (!IS_ALIGNED(node_start_pfn(nid), MAX_ORDER_NR_PAGES) || + !IS_ALIGNED(node_end_pfn(nid), MAX_ORDER_NR_PAGES)) + nr_pages += MAX_ORDER_NR_PAGES; + + table_size = sizeof(struct page_ext) * nr_pages; + + base = memblock_virt_alloc_try_nid_nopanic( + table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS), + BOOTMEM_ALLOC_ACCESSIBLE, nid); + if (!base) + return -ENOMEM; + NODE_DATA(nid)->node_page_ext = base; + total_usage += table_size; + return 0; +} + +void __init page_ext_init_flatmem(void) +{ + + int nid, fail; + + if (!invoke_need_callbacks()) + return; + + for_each_online_node(nid) { + fail = alloc_node_page_ext(nid); + if (fail) + goto fail; + } + pr_info("allocated %ld bytes of page_ext\n", total_usage); + invoke_init_callbacks(); + return; + +fail: + pr_crit("allocation of page_ext failed.\n"); + panic("Out of memory"); +} + +#else /* CONFIG_FLAT_NODE_MEM_MAP */ + +struct page_ext *lookup_page_ext(struct page *page) +{ + unsigned long pfn = page_to_pfn(page); + struct mem_section *section = __pfn_to_section(pfn); +#ifdef CONFIG_DEBUG_VM + /* + * The sanity checks the page allocator does upon freeing a + * page can reach here before the page_ext arrays are + * allocated when feeding a range of pages to the allocator + * for the first time during bootup or memory hotplug. + */ + if (!section->page_ext) + return NULL; +#endif + return section->page_ext + pfn; +} + +static void *__meminit alloc_page_ext(size_t size, int nid) +{ + gfp_t flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN; + void *addr = NULL; + + addr = alloc_pages_exact_nid(nid, size, flags); + if (addr) { + kmemleak_alloc(addr, size, 1, flags); + return addr; + } + + if (node_state(nid, N_HIGH_MEMORY)) + addr = vzalloc_node(size, nid); + else + addr = vzalloc(size); + + return addr; +} + +static int __meminit init_section_page_ext(unsigned long pfn, int nid) +{ + struct mem_section *section; + struct page_ext *base; + unsigned long table_size; + + section = __pfn_to_section(pfn); + + if (section->page_ext) + return 0; + + table_size = sizeof(struct page_ext) * PAGES_PER_SECTION; + base = alloc_page_ext(table_size, nid); + + /* + * The value stored in section->page_ext is (base - pfn) + * and it does not point to the memory block allocated above, + * causing kmemleak false positives. + */ + kmemleak_not_leak(base); + + if (!base) { + pr_err("page ext allocation failure\n"); + return -ENOMEM; + } + + /* + * The passed "pfn" may not be aligned to SECTION. For the calculation + * we need to apply a mask. + */ + pfn &= PAGE_SECTION_MASK; + section->page_ext = base - pfn; + total_usage += table_size; + return 0; +} +#ifdef CONFIG_MEMORY_HOTPLUG +static void free_page_ext(void *addr) +{ + if (is_vmalloc_addr(addr)) { + vfree(addr); + } else { + struct page *page = virt_to_page(addr); + size_t table_size; + + table_size = sizeof(struct page_ext) * PAGES_PER_SECTION; + + BUG_ON(PageReserved(page)); + free_pages_exact(addr, table_size); + } +} + +static void __free_page_ext(unsigned long pfn) +{ + struct mem_section *ms; + struct page_ext *base; + + ms = __pfn_to_section(pfn); + if (!ms || !ms->page_ext) + return; + base = ms->page_ext + pfn; + free_page_ext(base); + ms->page_ext = NULL; +} + +static int __meminit online_page_ext(unsigned long start_pfn, + unsigned long nr_pages, + int nid) +{ + unsigned long start, end, pfn; + int fail = 0; + + start = SECTION_ALIGN_DOWN(start_pfn); + end = SECTION_ALIGN_UP(start_pfn + nr_pages); + + if (nid == -1) { + /* + * In this case, "nid" already exists and contains valid memory. + * "start_pfn" passed to us is a pfn which is an arg for + * online__pages(), and start_pfn should exist. + */ + nid = pfn_to_nid(start_pfn); + VM_BUG_ON(!node_state(nid, N_ONLINE)); + } + + for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) { + if (!pfn_present(pfn)) + continue; + fail = init_section_page_ext(pfn, nid); + } + if (!fail) + return 0; + + /* rollback */ + for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) + __free_page_ext(pfn); + + return -ENOMEM; +} + +static int __meminit offline_page_ext(unsigned long start_pfn, + unsigned long nr_pages, int nid) +{ + unsigned long start, end, pfn; + + start = SECTION_ALIGN_DOWN(start_pfn); + end = SECTION_ALIGN_UP(start_pfn + nr_pages); + + for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) + __free_page_ext(pfn); + return 0; + +} + +static int __meminit page_ext_callback(struct notifier_block *self, + unsigned long action, void *arg) +{ + struct memory_notify *mn = arg; + int ret = 0; + + switch (action) { + case MEM_GOING_ONLINE: + ret = online_page_ext(mn->start_pfn, + mn->nr_pages, mn->status_change_nid); + break; + case MEM_OFFLINE: + offline_page_ext(mn->start_pfn, + mn->nr_pages, mn->status_change_nid); + break; + case MEM_CANCEL_ONLINE: + offline_page_ext(mn->start_pfn, + mn->nr_pages, mn->status_change_nid); + break; + case MEM_GOING_OFFLINE: + break; + case MEM_ONLINE: + case MEM_CANCEL_OFFLINE: + break; + } + + return notifier_from_errno(ret); +} + +#endif + +void __init page_ext_init(void) +{ + unsigned long pfn; + int nid; + + if (!invoke_need_callbacks()) + return; + + for_each_node_state(nid, N_MEMORY) { + unsigned long start_pfn, end_pfn; + + start_pfn = node_start_pfn(nid); + end_pfn = node_end_pfn(nid); + /* + * start_pfn and end_pfn may not be aligned to SECTION and the + * page->flags of out of node pages are not initialized. So we + * scan [start_pfn, the biggest section's pfn < end_pfn) here. + */ + for (pfn = start_pfn; pfn < end_pfn; + pfn = ALIGN(pfn + 1, PAGES_PER_SECTION)) { + + if (!pfn_valid(pfn)) + continue; + /* + * Nodes's pfns can be overlapping. + * We know some arch can have a nodes layout such as + * -------------pfn--------------> + * N0 | N1 | N2 | N0 | N1 | N2|.... + */ + if (pfn_to_nid(pfn) != nid) + continue; + if (init_section_page_ext(pfn, nid)) + goto oom; + } + } + hotplug_memory_notifier(page_ext_callback, 0); + pr_info("allocated %ld bytes of page_ext\n", total_usage); + invoke_init_callbacks(); + return; + +oom: + panic("Out of memory"); +} + +void __meminit pgdat_page_ext_init(struct pglist_data *pgdat) +{ +} + +#endif -- cgit v1.2.3 From e30825f1869a75b29a69dc8e0aaaaccc492092cf Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Fri, 12 Dec 2014 16:55:49 -0800 Subject: mm/debug-pagealloc: prepare boottime configurable on/off Until now, debug-pagealloc needs extra flags in struct page, so we need to recompile whole source code when we decide to use it. This is really painful, because it takes some time to recompile and sometimes rebuild is not possible due to third party module depending on struct page. So, we can't use this good feature in many cases. Now, we have the page extension feature that allows us to insert extra flags to outside of struct page. This gets rid of third party module issue mentioned above. And, this allows us to determine if we need extra memory for this page extension in boottime. With these property, we can avoid using debug-pagealloc in boottime with low computational overhead in the kernel built with CONFIG_DEBUG_PAGEALLOC. This will help our development process greatly. This patch is the preparation step to achive above goal. debug-pagealloc originally uses extra field of struct page, but, after this patch, it will use field of struct page_ext. Because memory for page_ext is allocated later than initialization of page allocator in CONFIG_SPARSEMEM, we should disable debug-pagealloc feature temporarily until initialization of page_ext. This patch implements this. Signed-off-by: Joonsoo Kim Cc: Mel Gorman Cc: Johannes Weiner Cc: Minchan Kim Cc: Dave Hansen Cc: Michal Nazarewicz Cc: Jungsoo Son Cc: Ingo Molnar Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 19 ++++++++++++++++++- include/linux/mm_types.h | 4 ---- include/linux/page-debug-flags.h | 32 -------------------------------- include/linux/page_ext.h | 15 +++++++++++++++ mm/Kconfig.debug | 1 + mm/debug-pagealloc.c | 37 +++++++++++++++++++++++++++++++++---- mm/page_alloc.c | 38 +++++++++++++++++++++++++++++++++++--- mm/page_ext.c | 4 ++++ 8 files changed, 106 insertions(+), 44 deletions(-) delete mode 100644 include/linux/page-debug-flags.h (limited to 'include/linux') diff --git a/include/linux/mm.h b/include/linux/mm.h index 3b337efbe533..66560f1a0564 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -19,6 +19,7 @@ #include #include #include +#include struct mempolicy; struct anon_vma; @@ -2155,20 +2156,36 @@ extern void copy_user_huge_page(struct page *dst, struct page *src, unsigned int pages_per_huge_page); #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */ +extern struct page_ext_operations debug_guardpage_ops; +extern struct page_ext_operations page_poisoning_ops; + #ifdef CONFIG_DEBUG_PAGEALLOC extern unsigned int _debug_guardpage_minorder; +extern bool _debug_guardpage_enabled; static inline unsigned int debug_guardpage_minorder(void) { return _debug_guardpage_minorder; } +static inline bool debug_guardpage_enabled(void) +{ + return _debug_guardpage_enabled; +} + static inline bool page_is_guard(struct page *page) { - return test_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags); + struct page_ext *page_ext; + + if (!debug_guardpage_enabled()) + return false; + + page_ext = lookup_page_ext(page); + return test_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags); } #else static inline unsigned int debug_guardpage_minorder(void) { return 0; } +static inline bool debug_guardpage_enabled(void) { return false; } static inline bool page_is_guard(struct page *page) { return false; } #endif /* CONFIG_DEBUG_PAGEALLOC */ diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index fc2daffa9db1..6d34aa266a8c 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -10,7 +10,6 @@ #include #include #include -#include #include #include #include @@ -186,9 +185,6 @@ struct page { void *virtual; /* Kernel virtual address (NULL if not kmapped, ie. highmem) */ #endif /* WANT_PAGE_VIRTUAL */ -#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS - unsigned long debug_flags; /* Use atomic bitops on this */ -#endif #ifdef CONFIG_KMEMCHECK /* diff --git a/include/linux/page-debug-flags.h b/include/linux/page-debug-flags.h deleted file mode 100644 index 22691f614043..000000000000 --- a/include/linux/page-debug-flags.h +++ /dev/null @@ -1,32 +0,0 @@ -#ifndef LINUX_PAGE_DEBUG_FLAGS_H -#define LINUX_PAGE_DEBUG_FLAGS_H - -/* - * page->debug_flags bits: - * - * PAGE_DEBUG_FLAG_POISON is set for poisoned pages. This is used to - * implement generic debug pagealloc feature. The pages are filled with - * poison patterns and set this flag after free_pages(). The poisoned - * pages are verified whether the patterns are not corrupted and clear - * the flag before alloc_pages(). - */ - -enum page_debug_flags { - PAGE_DEBUG_FLAG_POISON, /* Page is poisoned */ - PAGE_DEBUG_FLAG_GUARD, -}; - -/* - * Ensure that CONFIG_WANT_PAGE_DEBUG_FLAGS reliably - * gets turned off when no debug features are enabling it! - */ - -#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS -#if !defined(CONFIG_PAGE_POISONING) && \ - !defined(CONFIG_PAGE_GUARD) \ -/* && !defined(CONFIG_PAGE_DEBUG_SOMETHING_ELSE) && ... */ -#error WANT_PAGE_DEBUG_FLAGS is turned on with no debug features! -#endif -#endif /* CONFIG_WANT_PAGE_DEBUG_FLAGS */ - -#endif /* LINUX_PAGE_DEBUG_FLAGS_H */ diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h index 2ccc8b414e5c..61c0f05f9069 100644 --- a/include/linux/page_ext.h +++ b/include/linux/page_ext.h @@ -9,6 +9,21 @@ struct page_ext_operations { #ifdef CONFIG_PAGE_EXTENSION +/* + * page_ext->flags bits: + * + * PAGE_EXT_DEBUG_POISON is set for poisoned pages. This is used to + * implement generic debug pagealloc feature. The pages are filled with + * poison patterns and set this flag after free_pages(). The poisoned + * pages are verified whether the patterns are not corrupted and clear + * the flag before alloc_pages(). + */ + +enum page_ext_flags { + PAGE_EXT_DEBUG_POISON, /* Page is poisoned */ + PAGE_EXT_DEBUG_GUARD, +}; + /* * Page Extension can be considered as an extended mem_map. * A page_ext page is associated with every page descriptor. The diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug index 1ba81c7769f7..56badfc4810a 100644 --- a/mm/Kconfig.debug +++ b/mm/Kconfig.debug @@ -12,6 +12,7 @@ config DEBUG_PAGEALLOC depends on DEBUG_KERNEL depends on !HIBERNATION || ARCH_SUPPORTS_DEBUG_PAGEALLOC && !PPC && !SPARC depends on !KMEMCHECK + select PAGE_EXTENSION select PAGE_POISONING if !ARCH_SUPPORTS_DEBUG_PAGEALLOC select PAGE_GUARD if ARCH_SUPPORTS_DEBUG_PAGEALLOC ---help--- diff --git a/mm/debug-pagealloc.c b/mm/debug-pagealloc.c index 789ff70c8a4a..0072f2c53331 100644 --- a/mm/debug-pagealloc.c +++ b/mm/debug-pagealloc.c @@ -2,23 +2,49 @@ #include #include #include -#include +#include #include #include +static bool page_poisoning_enabled __read_mostly; + +static bool need_page_poisoning(void) +{ + return true; +} + +static void init_page_poisoning(void) +{ + page_poisoning_enabled = true; +} + +struct page_ext_operations page_poisoning_ops = { + .need = need_page_poisoning, + .init = init_page_poisoning, +}; + static inline void set_page_poison(struct page *page) { - __set_bit(PAGE_DEBUG_FLAG_POISON, &page->debug_flags); + struct page_ext *page_ext; + + page_ext = lookup_page_ext(page); + __set_bit(PAGE_EXT_DEBUG_POISON, &page_ext->flags); } static inline void clear_page_poison(struct page *page) { - __clear_bit(PAGE_DEBUG_FLAG_POISON, &page->debug_flags); + struct page_ext *page_ext; + + page_ext = lookup_page_ext(page); + __clear_bit(PAGE_EXT_DEBUG_POISON, &page_ext->flags); } static inline bool page_poison(struct page *page) { - return test_bit(PAGE_DEBUG_FLAG_POISON, &page->debug_flags); + struct page_ext *page_ext; + + page_ext = lookup_page_ext(page); + return test_bit(PAGE_EXT_DEBUG_POISON, &page_ext->flags); } static void poison_page(struct page *page) @@ -95,6 +121,9 @@ static void unpoison_pages(struct page *page, int n) void kernel_map_pages(struct page *page, int numpages, int enable) { + if (!page_poisoning_enabled) + return; + if (enable) unpoison_pages(page, numpages); else diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b64666cf5865..e0a39d328ca1 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -56,7 +56,7 @@ #include #include #include -#include +#include #include #include @@ -425,6 +425,22 @@ static inline void prep_zero_page(struct page *page, unsigned int order, #ifdef CONFIG_DEBUG_PAGEALLOC unsigned int _debug_guardpage_minorder; +bool _debug_guardpage_enabled __read_mostly; + +static bool need_debug_guardpage(void) +{ + return true; +} + +static void init_debug_guardpage(void) +{ + _debug_guardpage_enabled = true; +} + +struct page_ext_operations debug_guardpage_ops = { + .need = need_debug_guardpage, + .init = init_debug_guardpage, +}; static int __init debug_guardpage_minorder_setup(char *buf) { @@ -443,7 +459,14 @@ __setup("debug_guardpage_minorder=", debug_guardpage_minorder_setup); static inline void set_page_guard(struct zone *zone, struct page *page, unsigned int order, int migratetype) { - __set_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags); + struct page_ext *page_ext; + + if (!debug_guardpage_enabled()) + return; + + page_ext = lookup_page_ext(page); + __set_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags); + INIT_LIST_HEAD(&page->lru); set_page_private(page, order); /* Guard pages are not available for any usage */ @@ -453,12 +476,20 @@ static inline void set_page_guard(struct zone *zone, struct page *page, static inline void clear_page_guard(struct zone *zone, struct page *page, unsigned int order, int migratetype) { - __clear_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags); + struct page_ext *page_ext; + + if (!debug_guardpage_enabled()) + return; + + page_ext = lookup_page_ext(page); + __clear_bit(PAGE_EXT_DEBUG_GUARD, &page_ext->flags); + set_page_private(page, 0); if (!is_migrate_isolate(migratetype)) __mod_zone_freepage_state(zone, (1 << order), migratetype); } #else +struct page_ext_operations debug_guardpage_ops = { NULL, }; static inline void set_page_guard(struct zone *zone, struct page *page, unsigned int order, int migratetype) {} static inline void clear_page_guard(struct zone *zone, struct page *page, @@ -869,6 +900,7 @@ static inline void expand(struct zone *zone, struct page *page, VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]); if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC) && + debug_guardpage_enabled() && high < debug_guardpage_minorder()) { /* * Mark as guard pages (or page), that will allow to diff --git a/mm/page_ext.c b/mm/page_ext.c index 514a3bccd63f..c2cd7b15f0de 100644 --- a/mm/page_ext.c +++ b/mm/page_ext.c @@ -51,6 +51,10 @@ */ static struct page_ext_operations *page_ext_ops[] = { + &debug_guardpage_ops, +#ifdef CONFIG_PAGE_POISONING + &page_poisoning_ops, +#endif }; static unsigned long total_usage; -- cgit v1.2.3 From 031bc5743f158b2d5498294f489e534a31251626 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Fri, 12 Dec 2014 16:55:52 -0800 Subject: mm/debug-pagealloc: make debug-pagealloc boottime configurable Now, we have prepared to avoid using debug-pagealloc in boottime. So introduce new kernel-parameter to disable debug-pagealloc in boottime, and makes related functions to be disabled in this case. Only non-intuitive part is change of guard page functions. Because guard page is effective only if debug-pagealloc is enabled, turning off according to debug-pagealloc is reasonable thing to do. Signed-off-by: Joonsoo Kim Cc: Mel Gorman Cc: Johannes Weiner Cc: Minchan Kim Cc: Dave Hansen Cc: Michal Nazarewicz Cc: Jungsoo Son Cc: Ingo Molnar Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kernel-parameters.txt | 9 +++++++++ arch/powerpc/mm/hash_utils_64.c | 2 +- arch/powerpc/mm/pgtable_32.c | 2 +- arch/s390/mm/pageattr.c | 2 +- arch/sparc/mm/init_64.c | 2 +- arch/x86/mm/pageattr.c | 2 +- include/linux/mm.h | 17 ++++++++++++++++- mm/debug-pagealloc.c | 8 +++++++- mm/page_alloc.c | 20 ++++++++++++++++++++ 9 files changed, 57 insertions(+), 7 deletions(-) (limited to 'include/linux') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 24539d1c7d25..6f067954675b 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -829,6 +829,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted. CONFIG_DEBUG_PAGEALLOC, hence this option will not help tracking down these problems. + debug_pagealloc= + [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this + parameter enables the feature at boot time. In + default, it is disabled. We can avoid allocating huge + chunk of memory for debug pagealloc if we don't enable + it at boot time and the system will work mostly same + with the kernel built without CONFIG_DEBUG_PAGEALLOC. + on: enable the feature + debugpat [X86] Enable PAT debugging decnet.addr= [HW,NET] diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index e56a307bc676..2c2022d16059 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -1514,7 +1514,7 @@ static void kernel_unmap_linear_page(unsigned long vaddr, unsigned long lmi) mmu_kernel_ssize, 0); } -void kernel_map_pages(struct page *page, int numpages, int enable) +void __kernel_map_pages(struct page *page, int numpages, int enable) { unsigned long flags, vaddr, lmi; int i; diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c index d545b1231594..50fad3801f30 100644 --- a/arch/powerpc/mm/pgtable_32.c +++ b/arch/powerpc/mm/pgtable_32.c @@ -429,7 +429,7 @@ static int change_page_attr(struct page *page, int numpages, pgprot_t prot) } -void kernel_map_pages(struct page *page, int numpages, int enable) +void __kernel_map_pages(struct page *page, int numpages, int enable) { if (PageHighMem(page)) return; diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c index 3fef3b299665..426c9d462d1c 100644 --- a/arch/s390/mm/pageattr.c +++ b/arch/s390/mm/pageattr.c @@ -120,7 +120,7 @@ static void ipte_range(pte_t *pte, unsigned long address, int nr) } } -void kernel_map_pages(struct page *page, int numpages, int enable) +void __kernel_map_pages(struct page *page, int numpages, int enable) { unsigned long address; int nr, i, j; diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c index 2d91c62f7f5f..3ea267c53320 100644 --- a/arch/sparc/mm/init_64.c +++ b/arch/sparc/mm/init_64.c @@ -1621,7 +1621,7 @@ static void __init kernel_physical_mapping_init(void) } #ifdef CONFIG_DEBUG_PAGEALLOC -void kernel_map_pages(struct page *page, int numpages, int enable) +void __kernel_map_pages(struct page *page, int numpages, int enable) { unsigned long phys_start = page_to_pfn(page) << PAGE_SHIFT; unsigned long phys_end = phys_start + (numpages * PAGE_SIZE); diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index a3a5d46605d2..dfaf2e0f5f8f 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -1817,7 +1817,7 @@ static int __set_pages_np(struct page *page, int numpages) return __change_page_attr_set_clr(&cpa, 0); } -void kernel_map_pages(struct page *page, int numpages, int enable) +void __kernel_map_pages(struct page *page, int numpages, int enable) { if (PageHighMem(page)) return; diff --git a/include/linux/mm.h b/include/linux/mm.h index 66560f1a0564..8b8d77a1532f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2061,7 +2061,22 @@ static inline void vm_stat_account(struct mm_struct *mm, #endif /* CONFIG_PROC_FS */ #ifdef CONFIG_DEBUG_PAGEALLOC -extern void kernel_map_pages(struct page *page, int numpages, int enable); +extern bool _debug_pagealloc_enabled; +extern void __kernel_map_pages(struct page *page, int numpages, int enable); + +static inline bool debug_pagealloc_enabled(void) +{ + return _debug_pagealloc_enabled; +} + +static inline void +kernel_map_pages(struct page *page, int numpages, int enable) +{ + if (!debug_pagealloc_enabled()) + return; + + __kernel_map_pages(page, numpages, enable); +} #ifdef CONFIG_HIBERNATION extern bool kernel_page_present(struct page *page); #endif /* CONFIG_HIBERNATION */ diff --git a/mm/debug-pagealloc.c b/mm/debug-pagealloc.c index 0072f2c53331..5bf5906ce13b 100644 --- a/mm/debug-pagealloc.c +++ b/mm/debug-pagealloc.c @@ -10,11 +10,17 @@ static bool page_poisoning_enabled __read_mostly; static bool need_page_poisoning(void) { + if (!debug_pagealloc_enabled()) + return false; + return true; } static void init_page_poisoning(void) { + if (!debug_pagealloc_enabled()) + return; + page_poisoning_enabled = true; } @@ -119,7 +125,7 @@ static void unpoison_pages(struct page *page, int n) unpoison_page(page + i); } -void kernel_map_pages(struct page *page, int numpages, int enable) +void __kernel_map_pages(struct page *page, int numpages, int enable) { if (!page_poisoning_enabled) return; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e0a39d328ca1..303d38516807 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -425,15 +425,35 @@ static inline void prep_zero_page(struct page *page, unsigned int order, #ifdef CONFIG_DEBUG_PAGEALLOC unsigned int _debug_guardpage_minorder; +bool _debug_pagealloc_enabled __read_mostly; bool _debug_guardpage_enabled __read_mostly; +static int __init early_debug_pagealloc(char *buf) +{ + if (!buf) + return -EINVAL; + + if (strcmp(buf, "on") == 0) + _debug_pagealloc_enabled = true; + + return 0; +} +early_param("debug_pagealloc", early_debug_pagealloc); + static bool need_debug_guardpage(void) { + /* If we don't use debug_pagealloc, we don't need guard page */ + if (!debug_pagealloc_enabled()) + return false; + return true; } static void init_debug_guardpage(void) { + if (!debug_pagealloc_enabled()) + return; + _debug_guardpage_enabled = true; } -- cgit v1.2.3 From 9a92a6ce6f842713ccd0025c5228fe8bea61234c Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Fri, 12 Dec 2014 16:55:58 -0800 Subject: stacktrace: introduce snprint_stack_trace for buffer output Current stacktrace only have the function for console output. page_owner that will be introduced in following patch needs to print the output of stacktrace into the buffer for our own output format so so new function, snprint_stack_trace(), is needed. Signed-off-by: Joonsoo Kim Cc: Mel Gorman Cc: Johannes Weiner Cc: Minchan Kim Cc: Dave Hansen Cc: Michal Nazarewicz Cc: Jungsoo Son Cc: Ingo Molnar Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/stacktrace.h | 5 +++++ kernel/stacktrace.c | 32 ++++++++++++++++++++++++++++++++ 2 files changed, 37 insertions(+) (limited to 'include/linux') diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h index 115b570e3bff..669045ab73f3 100644 --- a/include/linux/stacktrace.h +++ b/include/linux/stacktrace.h @@ -1,6 +1,8 @@ #ifndef __LINUX_STACKTRACE_H #define __LINUX_STACKTRACE_H +#include + struct task_struct; struct pt_regs; @@ -20,6 +22,8 @@ extern void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace); extern void print_stack_trace(struct stack_trace *trace, int spaces); +extern int snprint_stack_trace(char *buf, size_t size, + struct stack_trace *trace, int spaces); #ifdef CONFIG_USER_STACKTRACE_SUPPORT extern void save_stack_trace_user(struct stack_trace *trace); @@ -32,6 +36,7 @@ extern void save_stack_trace_user(struct stack_trace *trace); # define save_stack_trace_tsk(tsk, trace) do { } while (0) # define save_stack_trace_user(trace) do { } while (0) # define print_stack_trace(trace, spaces) do { } while (0) +# define snprint_stack_trace(buf, size, trace, spaces) do { } while (0) #endif #endif diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c index 00fe55cc5a82..b6e4c16377c7 100644 --- a/kernel/stacktrace.c +++ b/kernel/stacktrace.c @@ -25,6 +25,38 @@ void print_stack_trace(struct stack_trace *trace, int spaces) } EXPORT_SYMBOL_GPL(print_stack_trace); +int snprint_stack_trace(char *buf, size_t size, + struct stack_trace *trace, int spaces) +{ + int i; + unsigned long ip; + int generated; + int total = 0; + + if (WARN_ON(!trace->entries)) + return 0; + + for (i = 0; i < trace->nr_entries; i++) { + ip = trace->entries[i]; + generated = snprintf(buf, size, "%*c[<%p>] %pS\n", + 1 + spaces, ' ', (void *) ip, (void *) ip); + + total += generated; + + /* Assume that generated isn't a negative number */ + if (generated >= size) { + buf += size; + size = 0; + } else { + buf += generated; + size -= generated; + } + } + + return total; +} +EXPORT_SYMBOL_GPL(snprint_stack_trace); + /* * Architectures that do not implement save_stack_trace_tsk or * save_stack_trace_regs get this weak alias and a once-per-bootup warning -- cgit v1.2.3 From 48c96a3685795e52903e60c7ee115e5e22e7d640 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim Date: Fri, 12 Dec 2014 16:56:01 -0800 Subject: mm/page_owner: keep track of page owners This is the page owner tracking code which is introduced so far ago. It is resident on Andrew's tree, though, nobody tried to upstream so it remain as is. Our company uses this feature actively to debug memory leak or to find a memory hogger so I decide to upstream this feature. This functionality help us to know who allocates the page. When allocating a page, we store some information about allocation in extra memory. Later, if we need to know status of all pages, we can get and analyze it from this stored information. In previous version of this feature, extra memory is statically defined in struct page, but, in this version, extra memory is allocated outside of struct page. It enables us to turn on/off this feature at boottime without considerable memory waste. Although we already have tracepoint for tracing page allocation/free, using it to analyze page owner is rather complex. We need to enlarge the trace buffer for preventing overlapping until userspace program launched. And, launched program continually dump out the trace buffer for later analysis and it would change system behaviour with more possibility rather than just keeping it in memory, so bad for debug. Moreover, we can use page_owner feature further for various purposes. For example, we can use it for fragmentation statistics implemented in this patch. And, I also plan to implement some CMA failure debugging feature using this interface. I'd like to give the credit for all developers contributed this feature, but, it's not easy because I don't know exact history. Sorry about that. Below is people who has "Signed-off-by" in the patches in Andrew's tree. Contributor: Alexander Nyberg Mel Gorman Dave Hansen Minchan Kim Michal Nazarewicz Andrew Morton Jungsoo Son Signed-off-by: Joonsoo Kim Cc: Mel Gorman Cc: Johannes Weiner Cc: Minchan Kim Cc: Dave Hansen Cc: Michal Nazarewicz Cc: Jungsoo Son Cc: Ingo Molnar Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kernel-parameters.txt | 6 + include/linux/page_ext.h | 10 ++ include/linux/page_owner.h | 38 ++++++ lib/Kconfig.debug | 16 +++ mm/Makefile | 1 + mm/page_alloc.c | 11 +- mm/page_ext.c | 4 + mm/page_owner.c | 222 ++++++++++++++++++++++++++++++++++++ mm/vmstat.c | 101 ++++++++++++++++ tools/vm/Makefile | 4 +- tools/vm/page_owner_sort.c | 144 +++++++++++++++++++++++ 11 files changed, 554 insertions(+), 3 deletions(-) create mode 100644 include/linux/page_owner.h create mode 100644 mm/page_owner.c create mode 100644 tools/vm/page_owner_sort.c (limited to 'include/linux') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 6f067954675b..68153642c44e 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2513,6 +2513,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted. OSS [HW,OSS] See Documentation/sound/oss/oss-parameters.txt + page_owner= [KNL] Boot-time page_owner enabling option. + Storage of the information about who allocated + each page is disabled in default. With this switch, + we can turn it on. + on: enable the feature + panic= [KNL] Kernel behaviour on panic: delay timeout > 0: seconds before rebooting timeout = 0: wait forever diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h index 61c0f05f9069..d2a2c84c72d0 100644 --- a/include/linux/page_ext.h +++ b/include/linux/page_ext.h @@ -1,6 +1,9 @@ #ifndef __LINUX_PAGE_EXT_H #define __LINUX_PAGE_EXT_H +#include +#include + struct pglist_data; struct page_ext_operations { bool (*need)(void); @@ -22,6 +25,7 @@ struct page_ext_operations { enum page_ext_flags { PAGE_EXT_DEBUG_POISON, /* Page is poisoned */ PAGE_EXT_DEBUG_GUARD, + PAGE_EXT_OWNER, }; /* @@ -33,6 +37,12 @@ enum page_ext_flags { */ struct page_ext { unsigned long flags; +#ifdef CONFIG_PAGE_OWNER + unsigned int order; + gfp_t gfp_mask; + struct stack_trace trace; + unsigned long trace_entries[8]; +#endif }; extern void pgdat_page_ext_init(struct pglist_data *pgdat); diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h new file mode 100644 index 000000000000..b48c3471c254 --- /dev/null +++ b/include/linux/page_owner.h @@ -0,0 +1,38 @@ +#ifndef __LINUX_PAGE_OWNER_H +#define __LINUX_PAGE_OWNER_H + +#ifdef CONFIG_PAGE_OWNER +extern bool page_owner_inited; +extern struct page_ext_operations page_owner_ops; + +extern void __reset_page_owner(struct page *page, unsigned int order); +extern void __set_page_owner(struct page *page, + unsigned int order, gfp_t gfp_mask); + +static inline void reset_page_owner(struct page *page, unsigned int order) +{ + if (likely(!page_owner_inited)) + return; + + __reset_page_owner(page, order); +} + +static inline void set_page_owner(struct page *page, + unsigned int order, gfp_t gfp_mask) +{ + if (likely(!page_owner_inited)) + return; + + __set_page_owner(page, order, gfp_mask); +} +#else +static inline void reset_page_owner(struct page *page, unsigned int order) +{ +} +static inline void set_page_owner(struct page *page, + unsigned int order, gfp_t gfp_mask) +{ +} + +#endif /* CONFIG_PAGE_OWNER */ +#endif /* __LINUX_PAGE_OWNER_H */ diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index d780351835e9..5f2ce616c046 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -227,6 +227,22 @@ config UNUSED_SYMBOLS you really need it, and what the merge plan to the mainline kernel for your module is. +config PAGE_OWNER + bool "Track page owner" + depends on DEBUG_KERNEL && STACKTRACE_SUPPORT + select DEBUG_FS + select STACKTRACE + select PAGE_EXTENSION + help + This keeps track of what call chain is the owner of a page, may + help to find bare alloc_page(s) leaks. Even if you include this + feature on your build, it is disabled in default. You should pass + "page_owner=on" to boot parameter in order to enable it. Eats + a fair amount of memory if enabled. See tools/vm/page_owner_sort.c + for user-space helper. + + If unsure, say N. + config DEBUG_FS bool "Debug Filesystem" help diff --git a/mm/Makefile b/mm/Makefile index 580cd3f392af..4bf586e66378 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -63,6 +63,7 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o +obj-$(CONFIG_PAGE_OWNER) += page_owner.o obj-$(CONFIG_CLEANCACHE) += cleancache.o obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o obj-$(CONFIG_ZPOOL) += zpool.o diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 303d38516807..c13b6b29add2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -59,6 +59,7 @@ #include #include #include +#include #include #include @@ -813,6 +814,8 @@ static bool free_pages_prepare(struct page *page, unsigned int order) if (bad) return false; + reset_page_owner(page, order); + if (!PageHighMem(page)) { debug_check_no_locks_freed(page_address(page), PAGE_SIZE << order); @@ -988,6 +991,8 @@ static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags) if (order && (gfp_flags & __GFP_COMP)) prep_compound_page(page, order); + set_page_owner(page, order, gfp_flags); + return 0; } @@ -1560,8 +1565,11 @@ void split_page(struct page *page, unsigned int order) split_page(virt_to_page(page[0].shadow), order); #endif - for (i = 1; i < (1 << order); i++) + set_page_owner(page, 0, 0); + for (i = 1; i < (1 << order); i++) { set_page_refcounted(page + i); + set_page_owner(page + i, 0, 0); + } } EXPORT_SYMBOL_GPL(split_page); @@ -1601,6 +1609,7 @@ int __isolate_free_page(struct page *page, unsigned int order) } } + set_page_owner(page, order, 0); return 1UL << order; } diff --git a/mm/page_ext.c b/mm/page_ext.c index c2cd7b15f0de..d86fd2f5353f 100644 --- a/mm/page_ext.c +++ b/mm/page_ext.c @@ -5,6 +5,7 @@ #include #include #include +#include /* * struct page extension @@ -55,6 +56,9 @@ static struct page_ext_operations *page_ext_ops[] = { #ifdef CONFIG_PAGE_POISONING &page_poisoning_ops, #endif +#ifdef CONFIG_PAGE_OWNER + &page_owner_ops, +#endif }; static unsigned long total_usage; diff --git a/mm/page_owner.c b/mm/page_owner.c new file mode 100644 index 000000000000..85eec7ea6735 --- /dev/null +++ b/mm/page_owner.c @@ -0,0 +1,222 @@ +#include +#include +#include +#include +#include +#include +#include +#include "internal.h" + +static bool page_owner_disabled = true; +bool page_owner_inited __read_mostly; + +static int early_page_owner_param(char *buf) +{ + if (!buf) + return -EINVAL; + + if (strcmp(buf, "on") == 0) + page_owner_disabled = false; + + return 0; +} +early_param("page_owner", early_page_owner_param); + +static bool need_page_owner(void) +{ + if (page_owner_disabled) + return false; + + return true; +} + +static void init_page_owner(void) +{ + if (page_owner_disabled) + return; + + page_owner_inited = true; +} + +struct page_ext_operations page_owner_ops = { + .need = need_page_owner, + .init = init_page_owner, +}; + +void __reset_page_owner(struct page *page, unsigned int order) +{ + int i; + struct page_ext *page_ext; + + for (i = 0; i < (1 << order); i++) { + page_ext = lookup_page_ext(page + i); + __clear_bit(PAGE_EXT_OWNER, &page_ext->flags); + } +} + +void __set_page_owner(struct page *page, unsigned int order, gfp_t gfp_mask) +{ + struct page_ext *page_ext; + struct stack_trace *trace; + + page_ext = lookup_page_ext(page); + + trace = &page_ext->trace; + trace->nr_entries = 0; + trace->max_entries = ARRAY_SIZE(page_ext->trace_entries); + trace->entries = &page_ext->trace_entries[0]; + trace->skip = 3; + save_stack_trace(&page_ext->trace); + + page_ext->order = order; + page_ext->gfp_mask = gfp_mask; + + __set_bit(PAGE_EXT_OWNER, &page_ext->flags); +} + +static ssize_t +print_page_owner(char __user *buf, size_t count, unsigned long pfn, + struct page *page, struct page_ext *page_ext) +{ + int ret; + int pageblock_mt, page_mt; + char *kbuf; + + kbuf = kmalloc(count, GFP_KERNEL); + if (!kbuf) + return -ENOMEM; + + ret = snprintf(kbuf, count, + "Page allocated via order %u, mask 0x%x\n", + page_ext->order, page_ext->gfp_mask); + + if (ret >= count) + goto err; + + /* Print information relevant to grouping pages by mobility */ + pageblock_mt = get_pfnblock_migratetype(page, pfn); + page_mt = gfpflags_to_migratetype(page_ext->gfp_mask); + ret += snprintf(kbuf + ret, count - ret, + "PFN %lu Block %lu type %d %s Flags %s%s%s%s%s%s%s%s%s%s%s%s\n", + pfn, + pfn >> pageblock_order, + pageblock_mt, + pageblock_mt != page_mt ? "Fallback" : " ", + PageLocked(page) ? "K" : " ", + PageError(page) ? "E" : " ", + PageReferenced(page) ? "R" : " ", + PageUptodate(page) ? "U" : " ", + PageDirty(page) ? "D" : " ", + PageLRU(page) ? "L" : " ", + PageActive(page) ? "A" : " ", + PageSlab(page) ? "S" : " ", + PageWriteback(page) ? "W" : " ", + PageCompound(page) ? "C" : " ", + PageSwapCache(page) ? "B" : " ", + PageMappedToDisk(page) ? "M" : " "); + + if (ret >= count) + goto err; + + ret += snprint_stack_trace(kbuf + ret, count - ret, + &page_ext->trace, 0); + if (ret >= count) + goto err; + + ret += snprintf(kbuf + ret, count - ret, "\n"); + if (ret >= count) + goto err; + + if (copy_to_user(buf, kbuf, ret)) + ret = -EFAULT; + + kfree(kbuf); + return ret; + +err: + kfree(kbuf); + return -ENOMEM; +} + +static ssize_t +read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos) +{ + unsigned long pfn; + struct page *page; + struct page_ext *page_ext; + + if (!page_owner_inited) + return -EINVAL; + + page = NULL; + pfn = min_low_pfn + *ppos; + + /* Find a valid PFN or the start of a MAX_ORDER_NR_PAGES area */ + while (!pfn_valid(pfn) && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0) + pfn++; + + drain_all_pages(NULL); + + /* Find an allocated page */ + for (; pfn < max_pfn; pfn++) { + /* + * If the new page is in a new MAX_ORDER_NR_PAGES area, + * validate the area as existing, skip it if not + */ + if ((pfn & (MAX_ORDER_NR_PAGES - 1)) == 0 && !pfn_valid(pfn)) { + pfn += MAX_ORDER_NR_PAGES - 1; + continue; + } + + /* Check for holes within a MAX_ORDER area */ + if (!pfn_valid_within(pfn)) + continue; + + page = pfn_to_page(pfn); + if (PageBuddy(page)) { + unsigned long freepage_order = page_order_unsafe(page); + + if (freepage_order < MAX_ORDER) + pfn += (1UL << freepage_order) - 1; + continue; + } + + page_ext = lookup_page_ext(page); + + /* + * Pages allocated before initialization of page_owner are + * non-buddy and have no page_owner info. + */ + if (!test_bit(PAGE_EXT_OWNER, &page_ext->flags)) + continue; + + /* Record the next PFN to read in the file offset */ + *ppos = (pfn - min_low_pfn) + 1; + + return print_page_owner(buf, count, pfn, page, page_ext); + } + + return 0; +} + +static const struct file_operations proc_page_owner_operations = { + .read = read_page_owner, +}; + +static int __init pageowner_init(void) +{ + struct dentry *dentry; + + if (!page_owner_inited) { + pr_info("page_owner is disabled\n"); + return 0; + } + + dentry = debugfs_create_file("page_owner", S_IRUSR, NULL, + NULL, &proc_page_owner_operations); + if (IS_ERR(dentry)) + return PTR_ERR(dentry); + + return 0; +} +module_init(pageowner_init) diff --git a/mm/vmstat.c b/mm/vmstat.c index 1b12d390dc68..b090e9e3d626 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -22,6 +22,8 @@ #include #include #include +#include +#include #include "internal.h" @@ -1017,6 +1019,104 @@ static int pagetypeinfo_showblockcount(struct seq_file *m, void *arg) return 0; } +#ifdef CONFIG_PAGE_OWNER +static void pagetypeinfo_showmixedcount_print(struct seq_file *m, + pg_data_t *pgdat, + struct zone *zone) +{ + struct page *page; + struct page_ext *page_ext; + unsigned long pfn = zone->zone_start_pfn, block_end_pfn; + unsigned long end_pfn = pfn + zone->spanned_pages; + unsigned long count[MIGRATE_TYPES] = { 0, }; + int pageblock_mt, page_mt; + int i; + + /* Scan block by block. First and last block may be incomplete */ + pfn = zone->zone_start_pfn; + + /* + * Walk the zone in pageblock_nr_pages steps. If a page block spans + * a zone boundary, it will be double counted between zones. This does + * not matter as the mixed block count will still be correct + */ + for (; pfn < end_pfn; ) { + if (!pfn_valid(pfn)) { + pfn = ALIGN(pfn + 1, MAX_ORDER_NR_PAGES); + continue; + } + + block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages); + block_end_pfn = min(block_end_pfn, end_pfn); + + page = pfn_to_page(pfn); + pageblock_mt = get_pfnblock_migratetype(page, pfn); + + for (; pfn < block_end_pfn; pfn++) { + if (!pfn_valid_within(pfn)) + continue; + + page = pfn_to_page(pfn); + if (PageBuddy(page)) { + pfn += (1UL << page_order(page)) - 1; + continue; + } + + if (PageReserved(page)) + continue; + + page_ext = lookup_page_ext(page); + + if (!test_bit(PAGE_EXT_OWNER, &page_ext->flags)) + continue; + + page_mt = gfpflags_to_migratetype(page_ext->gfp_mask); + if (pageblock_mt != page_mt) { + if (is_migrate_cma(pageblock_mt)) + count[MIGRATE_MOVABLE]++; + else + count[pageblock_mt]++; + + pfn = block_end_pfn; + break; + } + pfn += (1UL << page_ext->order) - 1; + } + } + + /* Print counts */ + seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name); + for (i = 0; i < MIGRATE_TYPES; i++) + seq_printf(m, "%12lu ", count[i]); + seq_putc(m, '\n'); +} +#endif /* CONFIG_PAGE_OWNER */ + +/* + * Print out the number of pageblocks for each migratetype that contain pages + * of other types. This gives an indication of how well fallbacks are being + * contained by rmqueue_fallback(). It requires information from PAGE_OWNER + * to determine what is going on + */ +static void pagetypeinfo_showmixedcount(struct seq_file *m, pg_data_t *pgdat) +{ +#ifdef CONFIG_PAGE_OWNER + int mtype; + + if (!page_owner_inited) + return; + + drain_all_pages(NULL); + + seq_printf(m, "\n%-23s", "Number of mixed blocks "); + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) + seq_printf(m, "%12s ", migratetype_names[mtype]); + seq_putc(m, '\n'); + + walk_zones_in_node(m, pgdat, pagetypeinfo_showmixedcount_print); +#endif /* CONFIG_PAGE_OWNER */ +} + /* * This prints out statistics in relation to grouping pages by mobility. * It is expensive to collect so do not constantly read the file. @@ -1034,6 +1134,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg) seq_putc(m, '\n'); pagetypeinfo_showfree(m, pgdat); pagetypeinfo_showblockcount(m, pgdat); + pagetypeinfo_showmixedcount(m, pgdat); return 0; } diff --git a/tools/vm/Makefile b/tools/vm/Makefile index 3d907dacf2ac..ac884b65a072 100644 --- a/tools/vm/Makefile +++ b/tools/vm/Makefile @@ -1,6 +1,6 @@ # Makefile for vm tools # -TARGETS=page-types slabinfo +TARGETS=page-types slabinfo page_owner_sort LIB_DIR = ../lib/api LIBS = $(LIB_DIR)/libapikfs.a @@ -18,5 +18,5 @@ $(LIBS): $(CC) $(CFLAGS) -o $@ $< $(LDFLAGS) clean: - $(RM) page-types slabinfo + $(RM) page-types slabinfo page_owner_sort make -C $(LIB_DIR) clean diff --git a/tools/vm/page_owner_sort.c b/tools/vm/page_owner_sort.c new file mode 100644 index 000000000000..77147b42d598 --- /dev/null +++ b/tools/vm/page_owner_sort.c @@ -0,0 +1,144 @@ +/* + * User-space helper to sort the output of /sys/kernel/debug/page_owner + * + * Example use: + * cat /sys/kernel/debug/page_owner > page_owner_full.txt + * grep -v ^PFN page_owner_full.txt > page_owner.txt + * ./sort page_owner.txt sorted_page_owner.txt +*/ + +#include +#include +#include +#include +#include +#include +#include + +struct block_list { + char *txt; + int len; + int num; +}; + + +static struct block_list *list; +static int list_size; +static int max_size; + +struct block_list *block_head; + +int read_block(char *buf, int buf_size, FILE *fin) +{ + char *curr = buf, *const buf_end = buf + buf_size; + + while (buf_end - curr > 1 && fgets(curr, buf_end - curr, fin)) { + if (*curr == '\n') /* empty line */ + return curr - buf; + curr += strlen(curr); + } + + return -1; /* EOF or no space left in buf. */ +} + +static int compare_txt(const void *p1, const void *p2) +{ + const struct block_list *l1 = p1, *l2 = p2; + + return strcmp(l1->txt, l2->txt); +} + +static int compare_num(const void *p1, const void *p2) +{ + const struct block_list *l1 = p1, *l2 = p2; + + return l2->num - l1->num; +} + +static void add_list(char *buf, int len) +{ + if (list_size != 0 && + len == list[list_size-1].len && + memcmp(buf, list[list_size-1].txt, len) == 0) { + list[list_size-1].num++; + return; + } + if (list_size == max_size) { + printf("max_size too small??\n"); + exit(1); + } + list[list_size].txt = malloc(len+1); + list[list_size].len = len; + list[list_size].num = 1; + memcpy(list[list_size].txt, buf, len); + list[list_size].txt[len] = 0; + list_size++; + if (list_size % 1000 == 0) { + printf("loaded %d\r", list_size); + fflush(stdout); + } +} + +#define BUF_SIZE 1024 + +int main(int argc, char **argv) +{ + FILE *fin, *fout; + char buf[BUF_SIZE]; + int ret, i, count; + struct block_list *list2; + struct stat st; + + if (argc < 3) { + printf("Usage: ./program \n"); + perror("open: "); + exit(1); + } + + fin = fopen(argv[1], "r"); + fout = fopen(argv[2], "w"); + if (!fin || !fout) { + printf("Usage: ./program \n"); + perror("open: "); + exit(1); + } + + fstat(fileno(fin), &st); + max_size = st.st_size / 100; /* hack ... */ + + list = malloc(max_size * sizeof(*list)); + + for ( ; ; ) { + ret = read_block(buf, BUF_SIZE, fin); + if (ret < 0) + break; + + add_list(buf, ret); + } + + printf("loaded %d\n", list_size); + + printf("sorting ....\n"); + + qsort(list, list_size, sizeof(list[0]), compare_txt); + + list2 = malloc(sizeof(*list) * list_size); + + printf("culling\n"); + + for (i = count = 0; i < list_size; i++) { + if (count == 0 || + strcmp(list2[count-1].txt, list[i].txt) != 0) { + list2[count++] = list[i]; + } else { + list2[count-1].num += list[i].num; + } + } + + qsort(list2, count, sizeof(list[0]), compare_num); + + for (i = 0; i < count; i++) + fprintf(fout, "%d times:\n%s\n", list2[i].num, list2[i].txt); + + return 0; +} -- cgit v1.2.3 From f5f302e21257ebb0c074bbafc37606c26d28cc3d Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Fri, 12 Dec 2014 16:56:10 -0800 Subject: mm,vmacache: count number of system-wide flushes These flushes deal with sequence number overflows, such as for long lived threads. These are rare, but interesting from a debugging PoV. As such, display the number of flushes when vmacache debugging is enabled. Signed-off-by: Davidlohr Bueso Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/vm_event_item.h | 1 + mm/vmacache.c | 2 ++ mm/vmstat.c | 1 + 3 files changed, 4 insertions(+) (limited to 'include/linux') diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 730334cdf037..9246d32dc973 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -90,6 +90,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, #ifdef CONFIG_DEBUG_VM_VMACACHE VMACACHE_FIND_CALLS, VMACACHE_FIND_HITS, + VMACACHE_FULL_FLUSHES, #endif NR_VM_EVENT_ITEMS }; diff --git a/mm/vmacache.c b/mm/vmacache.c index 9f25af825dec..b6e3662fe339 100644 --- a/mm/vmacache.c +++ b/mm/vmacache.c @@ -17,6 +17,8 @@ void vmacache_flush_all(struct mm_struct *mm) { struct task_struct *g, *p; + count_vm_vmacache_event(VMACACHE_FULL_FLUSHES); + /* * Single threaded tasks need not iterate the entire * list of process. We can avoid the flushing as well diff --git a/mm/vmstat.c b/mm/vmstat.c index b090e9e3d626..1284f89fca08 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -900,6 +900,7 @@ const char * const vmstat_text[] = { #ifdef CONFIG_DEBUG_VM_VMACACHE "vmacache_find_calls", "vmacache_find_hits", + "vmacache_full_flushes", #endif #endif /* CONFIG_VM_EVENTS_COUNTERS */ }; -- cgit v1.2.3 From 6b4f7799c6a5703ac6b8c0649f4c22f00fa07513 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Fri, 12 Dec 2014 16:56:13 -0800 Subject: mm: vmscan: invoke slab shrinkers from shrink_zone() The slab shrinkers are currently invoked from the zonelist walkers in kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the eligible LRU pages and assemble a nodemask to pass to NUMA-aware shrinkers, which then again have to walk over the nodemask. This is redundant code, extra runtime work, and fairly inaccurate when it comes to the estimation of actually scannable LRU pages. The code duplication will only get worse when making the shrinkers cgroup-aware and requiring them to have out-of-band cgroup hierarchy walks as well. Instead, invoke the shrinkers from shrink_zone(), which is where all reclaimers end up, to avoid this duplication. Take the count for eligible LRU pages out of get_scan_count(), which considers many more factors than just the availability of swap space, like zone_reclaimable_pages() currently does. Accumulate the number over all visited lruvecs to get the per-zone value. Some nodes have multiple zones due to memory addressing restrictions. To avoid putting too much pressure on the shrinkers, only invoke them once for each such node, using the class zone of the allocation as the pivot zone. For now, this integrates the slab shrinking better into the reclaim logic and gets rid of duplicative invocations from kswapd, direct reclaim, and zone reclaim. It also prepares for cgroup-awareness, allowing memcg-capable shrinkers to be added at the lruvec level without much duplication of both code and runtime work. This changes kswapd behavior, which used to invoke the shrinkers for each zone, but with scan ratios gathered from the entire node, resulting in meaningless pressure quantities on multi-zone nodes. Zone reclaim behavior also changes. It used to shrink slabs until the same amount of pages were shrunk as were reclaimed from the LRUs. Now it merely invokes the shrinkers once with the zone's scan ratio, which makes the shrinkers go easier on caches that implement aging and would prefer feeding back pressure from recently used slab objects to unused LRU pages. [vdavydov@parallels.com: assure class zone is populated] Signed-off-by: Johannes Weiner Cc: Dave Chinner Signed-off-by: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/staging/android/ashmem.c | 3 +- fs/drop_caches.c | 11 +- include/linux/mm.h | 6 +- include/linux/shrinker.h | 2 - mm/memory-failure.c | 11 +- mm/page_alloc.c | 6 +- mm/vmscan.c | 216 ++++++++++++++++----------------------- 7 files changed, 106 insertions(+), 149 deletions(-) (limited to 'include/linux') diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c index ad4f5790a76f..46f8ef42559e 100644 --- a/drivers/staging/android/ashmem.c +++ b/drivers/staging/android/ashmem.c @@ -418,7 +418,7 @@ out: } /* - * ashmem_shrink - our cache shrinker, called from mm/vmscan.c :: shrink_slab + * ashmem_shrink - our cache shrinker, called from mm/vmscan.c * * 'nr_to_scan' is the number of objects to scan for freeing. * @@ -785,7 +785,6 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg) .nr_to_scan = LONG_MAX, }; ret = ashmem_shrink_count(&ashmem_shrinker, &sc); - nodes_setall(sc.nodes_to_scan); ashmem_shrink_scan(&ashmem_shrinker, &sc); } break; diff --git a/fs/drop_caches.c b/fs/drop_caches.c index 1de7294aad20..2bc2c87f35e7 100644 --- a/fs/drop_caches.c +++ b/fs/drop_caches.c @@ -40,13 +40,14 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused) static void drop_slab(void) { int nr_objects; - struct shrink_control shrink = { - .gfp_mask = GFP_KERNEL, - }; - nodes_setall(shrink.nodes_to_scan); do { - nr_objects = shrink_slab(&shrink, 1000, 1000); + int nid; + + nr_objects = 0; + for_each_online_node(nid) + nr_objects += shrink_node_slabs(GFP_KERNEL, nid, + 1000, 1000); } while (nr_objects > 10); } diff --git a/include/linux/mm.h b/include/linux/mm.h index 8b8d77a1532f..c0a67b894c4c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2110,9 +2110,9 @@ int drop_caches_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); #endif -unsigned long shrink_slab(struct shrink_control *shrink, - unsigned long nr_pages_scanned, - unsigned long lru_pages); +unsigned long shrink_node_slabs(gfp_t gfp_mask, int nid, + unsigned long nr_scanned, + unsigned long nr_eligible); #ifndef CONFIG_MMU #define randomize_va_space 0 diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h index 68c097077ef0..f4aee75f00b1 100644 --- a/include/linux/shrinker.h +++ b/include/linux/shrinker.h @@ -18,8 +18,6 @@ struct shrink_control { */ unsigned long nr_to_scan; - /* shrink from these nodes */ - nodemask_t nodes_to_scan; /* current node being shrunk (for NUMA aware shrinkers) */ int nid; }; diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 6b94969d91c5..feb803bf3443 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -239,19 +239,14 @@ void shake_page(struct page *p, int access) } /* - * Only call shrink_slab here (which would also shrink other caches) if - * access is not potentially fatal. + * Only call shrink_node_slabs here (which would also shrink + * other caches) if access is not potentially fatal. */ if (access) { int nr; int nid = page_to_nid(p); do { - struct shrink_control shrink = { - .gfp_mask = GFP_KERNEL, - }; - node_set(nid, shrink.nodes_to_scan); - - nr = shrink_slab(&shrink, 1000, 1000); + nr = shrink_node_slabs(GFP_KERNEL, nid, 1000, 1000); if (page_count(p) == 1) break; } while (nr > 10); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c13b6b29add2..9d3870862293 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6284,9 +6284,9 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count, if (!PageLRU(page)) found++; /* - * If there are RECLAIMABLE pages, we need to check it. - * But now, memory offline itself doesn't call shrink_slab() - * and it still to be fixed. + * If there are RECLAIMABLE pages, we need to check + * it. But now, memory offline itself doesn't call + * shrink_node_slabs() and it still to be fixed. */ /* * If the page is not RAM, page_count()should be 0. diff --git a/mm/vmscan.c b/mm/vmscan.c index a384339bf718..bd9a72bc4a1b 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -229,9 +229,10 @@ EXPORT_SYMBOL(unregister_shrinker); #define SHRINK_BATCH 128 -static unsigned long -shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker, - unsigned long nr_pages_scanned, unsigned long lru_pages) +static unsigned long shrink_slabs(struct shrink_control *shrinkctl, + struct shrinker *shrinker, + unsigned long nr_scanned, + unsigned long nr_eligible) { unsigned long freed = 0; unsigned long long delta; @@ -255,9 +256,9 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker, nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0); total_scan = nr; - delta = (4 * nr_pages_scanned) / shrinker->seeks; + delta = (4 * nr_scanned) / shrinker->seeks; delta *= freeable; - do_div(delta, lru_pages + 1); + do_div(delta, nr_eligible + 1); total_scan += delta; if (total_scan < 0) { pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n", @@ -289,8 +290,8 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker, total_scan = freeable * 2; trace_mm_shrink_slab_start(shrinker, shrinkctl, nr, - nr_pages_scanned, lru_pages, - freeable, delta, total_scan); + nr_scanned, nr_eligible, + freeable, delta, total_scan); /* * Normally, we should not scan less than batch_size objects in one @@ -339,34 +340,37 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker, return freed; } -/* - * Call the shrink functions to age shrinkable caches - * - * Here we assume it costs one seek to replace a lru page and that it also - * takes a seek to recreate a cache object. With this in mind we age equal - * percentages of the lru and ageable caches. This should balance the seeks - * generated by these structures. +/** + * shrink_node_slabs - shrink slab caches of a given node + * @gfp_mask: allocation context + * @nid: node whose slab caches to target + * @nr_scanned: pressure numerator + * @nr_eligible: pressure denominator * - * If the vm encountered mapped pages on the LRU it increase the pressure on - * slab to avoid swapping. + * Call the shrink functions to age shrinkable caches. * - * We do weird things to avoid (scanned*seeks*entries) overflowing 32 bits. + * @nid is passed along to shrinkers with SHRINKER_NUMA_AWARE set, + * unaware shrinkers will receive a node id of 0 instead. * - * `lru_pages' represents the number of on-LRU pages in all the zones which - * are eligible for the caller's allocation attempt. It is used for balancing - * slab reclaim versus page reclaim. + * @nr_scanned and @nr_eligible form a ratio that indicate how much of + * the available objects should be scanned. Page reclaim for example + * passes the number of pages scanned and the number of pages on the + * LRU lists that it considered on @nid, plus a bias in @nr_scanned + * when it encountered mapped pages. The ratio is further biased by + * the ->seeks setting of the shrink function, which indicates the + * cost to recreate an object relative to that of an LRU page. * - * Returns the number of slab objects which we shrunk. + * Returns the number of reclaimed slab objects. */ -unsigned long shrink_slab(struct shrink_control *shrinkctl, - unsigned long nr_pages_scanned, - unsigned long lru_pages) +unsigned long shrink_node_slabs(gfp_t gfp_mask, int nid, + unsigned long nr_scanned, + unsigned long nr_eligible) { struct shrinker *shrinker; unsigned long freed = 0; - if (nr_pages_scanned == 0) - nr_pages_scanned = SWAP_CLUSTER_MAX; + if (nr_scanned == 0) + nr_scanned = SWAP_CLUSTER_MAX; if (!down_read_trylock(&shrinker_rwsem)) { /* @@ -380,20 +384,17 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl, } list_for_each_entry(shrinker, &shrinker_list, list) { - if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) { - shrinkctl->nid = 0; - freed += shrink_slab_node(shrinkctl, shrinker, - nr_pages_scanned, lru_pages); - continue; - } + struct shrink_control sc = { + .gfp_mask = gfp_mask, + .nid = nid, + }; - for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) { - if (node_online(shrinkctl->nid)) - freed += shrink_slab_node(shrinkctl, shrinker, - nr_pages_scanned, lru_pages); + if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) + sc.nid = 0; - } + freed += shrink_slabs(&sc, shrinker, nr_scanned, nr_eligible); } + up_read(&shrinker_rwsem); out: cond_resched(); @@ -1876,7 +1877,8 @@ enum scan_balance { * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan */ static void get_scan_count(struct lruvec *lruvec, int swappiness, - struct scan_control *sc, unsigned long *nr) + struct scan_control *sc, unsigned long *nr, + unsigned long *lru_pages) { struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; u64 fraction[2]; @@ -2022,6 +2024,7 @@ out: some_scanned = false; /* Only use force_scan on second pass. */ for (pass = 0; !some_scanned && pass < 2; pass++) { + *lru_pages = 0; for_each_evictable_lru(lru) { int file = is_file_lru(lru); unsigned long size; @@ -2048,14 +2051,19 @@ out: case SCAN_FILE: case SCAN_ANON: /* Scan one type exclusively */ - if ((scan_balance == SCAN_FILE) != file) + if ((scan_balance == SCAN_FILE) != file) { + size = 0; scan = 0; + } break; default: /* Look ma, no brain */ BUG(); } + + *lru_pages += size; nr[lru] = scan; + /* * Skip the second pass and don't force_scan, * if we found something to scan. @@ -2069,7 +2077,7 @@ out: * This is a basic per-zone page freer. Used by both kswapd and direct reclaim. */ static void shrink_lruvec(struct lruvec *lruvec, int swappiness, - struct scan_control *sc) + struct scan_control *sc, unsigned long *lru_pages) { unsigned long nr[NR_LRU_LISTS]; unsigned long targets[NR_LRU_LISTS]; @@ -2080,7 +2088,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness, struct blk_plug plug; bool scan_adjusted; - get_scan_count(lruvec, swappiness, sc, nr); + get_scan_count(lruvec, swappiness, sc, nr, lru_pages); /* Record the original scan target for proportional adjustments later */ memcpy(targets, nr, sizeof(nr)); @@ -2258,7 +2266,8 @@ static inline bool should_continue_reclaim(struct zone *zone, } } -static bool shrink_zone(struct zone *zone, struct scan_control *sc) +static bool shrink_zone(struct zone *zone, struct scan_control *sc, + bool is_classzone) { unsigned long nr_reclaimed, nr_scanned; bool reclaimable = false; @@ -2269,6 +2278,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc) .zone = zone, .priority = sc->priority, }; + unsigned long zone_lru_pages = 0; struct mem_cgroup *memcg; nr_reclaimed = sc->nr_reclaimed; @@ -2276,13 +2286,15 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc) memcg = mem_cgroup_iter(root, NULL, &reclaim); do { + unsigned long lru_pages; struct lruvec *lruvec; int swappiness; lruvec = mem_cgroup_zone_lruvec(zone, memcg); swappiness = mem_cgroup_swappiness(memcg); - shrink_lruvec(lruvec, swappiness, sc); + shrink_lruvec(lruvec, swappiness, sc, &lru_pages); + zone_lru_pages += lru_pages; /* * Direct reclaim and kswapd have to scan all memory @@ -2302,6 +2314,25 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc) memcg = mem_cgroup_iter(root, memcg, &reclaim); } while (memcg); + /* + * Shrink the slab caches in the same proportion that + * the eligible LRU pages were scanned. + */ + if (global_reclaim(sc) && is_classzone) { + struct reclaim_state *reclaim_state; + + shrink_node_slabs(sc->gfp_mask, zone_to_nid(zone), + sc->nr_scanned - nr_scanned, + zone_lru_pages); + + reclaim_state = current->reclaim_state; + if (reclaim_state) { + sc->nr_reclaimed += + reclaim_state->reclaimed_slab; + reclaim_state->reclaimed_slab = 0; + } + } + vmpressure(sc->gfp_mask, sc->target_mem_cgroup, sc->nr_scanned - nr_scanned, sc->nr_reclaimed - nr_reclaimed); @@ -2376,12 +2407,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) struct zone *zone; unsigned long nr_soft_reclaimed; unsigned long nr_soft_scanned; - unsigned long lru_pages = 0; - struct reclaim_state *reclaim_state = current->reclaim_state; gfp_t orig_mask; - struct shrink_control shrink = { - .gfp_mask = sc->gfp_mask, - }; enum zone_type requested_highidx = gfp_zone(sc->gfp_mask); bool reclaimable = false; @@ -2394,12 +2420,18 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) if (buffer_heads_over_limit) sc->gfp_mask |= __GFP_HIGHMEM; - nodes_clear(shrink.nodes_to_scan); - for_each_zone_zonelist_nodemask(zone, z, zonelist, - gfp_zone(sc->gfp_mask), sc->nodemask) { + requested_highidx, sc->nodemask) { + enum zone_type classzone_idx; + if (!populated_zone(zone)) continue; + + classzone_idx = requested_highidx; + while (!populated_zone(zone->zone_pgdat->node_zones + + classzone_idx)) + classzone_idx--; + /* * Take care memory controller reclaiming has small influence * to global LRU. @@ -2409,9 +2441,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) GFP_KERNEL | __GFP_HARDWALL)) continue; - lru_pages += zone_reclaimable_pages(zone); - node_set(zone_to_nid(zone), shrink.nodes_to_scan); - if (sc->priority != DEF_PRIORITY && !zone_reclaimable(zone)) continue; /* Let kswapd poll it */ @@ -2450,7 +2479,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) /* need some check for avoid more shrink_zone() */ } - if (shrink_zone(zone, sc)) + if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx)) reclaimable = true; if (global_reclaim(sc) && @@ -2458,20 +2487,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc) reclaimable = true; } - /* - * Don't shrink slabs when reclaiming memory from over limit cgroups - * but do shrink slab at least once when aborting reclaim for - * compaction to avoid unevenly scanning file/anon LRU pages over slab - * pages. - */ - if (global_reclaim(sc)) { - shrink_slab(&shrink, sc->nr_scanned, lru_pages); - if (reclaim_state) { - sc->nr_reclaimed += reclaim_state->reclaimed_slab; - reclaim_state->reclaimed_slab = 0; - } - } - /* * Restore to original mask to avoid the impact on the caller if we * promoted it to __GFP_HIGHMEM. @@ -2736,6 +2751,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg, }; struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg); int swappiness = mem_cgroup_swappiness(memcg); + unsigned long lru_pages; sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) | (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK); @@ -2751,7 +2767,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg, * will pick up pages from other mem cgroup's as well. We hack * the priority and make it zero. */ - shrink_lruvec(lruvec, swappiness, &sc); + shrink_lruvec(lruvec, swappiness, &sc, &lru_pages); trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed); @@ -2932,15 +2948,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, static bool kswapd_shrink_zone(struct zone *zone, int classzone_idx, struct scan_control *sc, - unsigned long lru_pages, unsigned long *nr_attempted) { int testorder = sc->order; unsigned long balance_gap; - struct reclaim_state *reclaim_state = current->reclaim_state; - struct shrink_control shrink = { - .gfp_mask = sc->gfp_mask, - }; bool lowmem_pressure; /* Reclaim above the high watermark. */ @@ -2975,13 +2986,7 @@ static bool kswapd_shrink_zone(struct zone *zone, balance_gap, classzone_idx)) return true; - shrink_zone(zone, sc); - nodes_clear(shrink.nodes_to_scan); - node_set(zone_to_nid(zone), shrink.nodes_to_scan); - - reclaim_state->reclaimed_slab = 0; - shrink_slab(&shrink, sc->nr_scanned, lru_pages); - sc->nr_reclaimed += reclaim_state->reclaimed_slab; + shrink_zone(zone, sc, zone_idx(zone) == classzone_idx); /* Account for the number of pages attempted to reclaim */ *nr_attempted += sc->nr_to_reclaim; @@ -3042,7 +3047,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, count_vm_event(PAGEOUTRUN); do { - unsigned long lru_pages = 0; unsigned long nr_attempted = 0; bool raise_priority = true; bool pgdat_needs_compaction = (order > 0); @@ -3102,8 +3106,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, if (!populated_zone(zone)) continue; - lru_pages += zone_reclaimable_pages(zone); - /* * If any zone is currently balanced then kswapd will * not call compaction as it is expected that the @@ -3159,8 +3161,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, * that that high watermark would be met at 100% * efficiency. */ - if (kswapd_shrink_zone(zone, end_zone, &sc, - lru_pages, &nr_attempted)) + if (kswapd_shrink_zone(zone, end_zone, + &sc, &nr_attempted)) raise_priority = false; } @@ -3612,10 +3614,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP), .may_swap = 1, }; - struct shrink_control shrink = { - .gfp_mask = sc.gfp_mask, - }; - unsigned long nr_slab_pages0, nr_slab_pages1; cond_resched(); /* @@ -3634,44 +3632,10 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) * priorities until we have enough memory freed. */ do { - shrink_zone(zone, &sc); + shrink_zone(zone, &sc, true); } while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0); } - nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE); - if (nr_slab_pages0 > zone->min_slab_pages) { - /* - * shrink_slab() does not currently allow us to determine how - * many pages were freed in this zone. So we take the current - * number of slab pages and shake the slab until it is reduced - * by the same nr_pages that we used for reclaiming unmapped - * pages. - */ - nodes_clear(shrink.nodes_to_scan); - node_set(zone_to_nid(zone), shrink.nodes_to_scan); - for (;;) { - unsigned long lru_pages = zone_reclaimable_pages(zone); - - /* No reclaimable slab or very low memory pressure */ - if (!shrink_slab(&shrink, sc.nr_scanned, lru_pages)) - break; - - /* Freed enough memory */ - nr_slab_pages1 = zone_page_state(zone, - NR_SLAB_RECLAIMABLE); - if (nr_slab_pages1 + nr_pages <= nr_slab_pages0) - break; - } - - /* - * Update nr_reclaimed by the number of slab pages we - * reclaimed from this zone. - */ - nr_slab_pages1 = zone_page_state(zone, NR_SLAB_RECLAIMABLE); - if (nr_slab_pages1 < nr_slab_pages0) - sc.nr_reclaimed += nr_slab_pages0 - nr_slab_pages1; - } - p->reclaim_state = NULL; current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE); lockdep_clear_current_reclaim_state(); -- cgit v1.2.3 From d003f371b27016354c392464819530d47a915765 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Fri, 12 Dec 2014 16:56:24 -0800 Subject: oom: don't assume that a coredumping thread will exit soon oom_kill.c assumes that PF_EXITING task should exit and free the memory soon. This is wrong in many ways and one important case is the coredump. A task can sleep in exit_mm() "forever" while the coredumping sub-thread can need more memory. Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account, we add the new trivial helper for that. Note: this is only the first step, this patch doesn't try to solve other problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can participate in coredump after it was already observed in PF_EXITING state, so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set. fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it. And even the name/usage of the new helper is confusing, an exiting thread can only free its ->mm if it is the only/last task in thread group. [akpm@linux-foundation.org: add comment] Signed-off-by: Oleg Nesterov Cc: Cong Wang Acked-by: David Rientjes Acked-by: Michal Hocko Cc: "Rafael J. Wysocki" Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/oom.h | 11 +++++++++++ mm/memcontrol.c | 2 +- mm/oom_kill.c | 6 +++--- 3 files changed, 15 insertions(+), 4 deletions(-) (limited to 'include/linux') diff --git a/include/linux/oom.h b/include/linux/oom.h index e8d6e1058723..853698c721f7 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -92,6 +92,17 @@ static inline bool oom_gfp_allowed(gfp_t gfp_mask) extern struct task_struct *find_lock_task_mm(struct task_struct *p); +static inline bool task_will_free_mem(struct task_struct *task) +{ + /* + * A coredumping process may sleep for an extended period in exit_mm(), + * so the oom killer cannot assume that the process will promptly exit + * and release memory. + */ + return (task->flags & PF_EXITING) && + !(task->signal->flags & SIGNAL_GROUP_COREDUMP); +} + /* sysctls */ extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c6ac50e7d1c2..998fb1756d43 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1559,7 +1559,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, * select it. The goal is to allow it to allocate so that it may * quickly exit and free its memory. */ - if (fatal_signal_pending(current) || current->flags & PF_EXITING) { + if (fatal_signal_pending(current) || task_will_free_mem(current)) { set_thread_flag(TIF_MEMDIE); return; } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 864bba992735..f694ef0d9f9a 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -281,7 +281,7 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task, if (oom_task_origin(task)) return OOM_SCAN_SELECT; - if (task->flags & PF_EXITING && !force_kill) { + if (task_will_free_mem(task) && !force_kill) { /* * If this task is not being ptraced on exit, then wait for it * to finish before killing some other task unnecessarily. @@ -443,7 +443,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly */ - if (p->flags & PF_EXITING) { + if (task_will_free_mem(p)) { set_tsk_thread_flag(p, TIF_MEMDIE); put_task_struct(p); return; @@ -649,7 +649,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, * select it. The goal is to allow it to allocate so that it may * quickly exit and free its memory. */ - if (fatal_signal_pending(current) || current->flags & PF_EXITING) { + if (fatal_signal_pending(current) || task_will_free_mem(current)) { set_thread_flag(TIF_MEMDIE); return; } -- cgit v1.2.3 From 8135be5a8012f4c7e95218563855e16c09a8271b Mon Sep 17 00:00:00 2001 From: Vladimir Davydov Date: Fri, 12 Dec 2014 16:56:38 -0800 Subject: memcg: fix possible use-after-free in memcg_kmem_get_cache() Suppose task @t that belongs to a memory cgroup @memcg is going to allocate an object from a kmem cache @c. The copy of @c corresponding to @memcg, @mc, is empty. Then if kmem_cache_alloc races with the memory cgroup destruction we can access the memory cgroup's copy of the cache after it was destroyed: CPU0 CPU1 ---- ---- [ current=@t @mc->memcg_params->nr_pages=0 ] kmem_cache_alloc(@c): call memcg_kmem_get_cache(@c); proceed to allocation from @mc: alloc a page for @mc: ... move @t from @memcg destroy @memcg: mem_cgroup_css_offline(@memcg): memcg_unregister_all_caches(@memcg): kmem_cache_destroy(@mc) add page to @mc We could fix this issue by taking a reference to a per-memcg cache, but that would require adding a per-cpu reference counter to per-memcg caches, which would look cumbersome. Instead, let's take a reference to a memory cgroup, which already has a per-cpu reference counter, in the beginning of kmem_cache_alloc to be dropped in the end, and move per memcg caches destruction from css offline to css free. As a side effect, per-memcg caches will be destroyed not one by one, but all at once when the last page accounted to the memory cgroup is freed. This doesn't sound as a high price for code readability though. Note, this patch does add some overhead to the kmem_cache_alloc hot path, but it is pretty negligible - it's just a function call plus a per cpu counter decrement, which is comparable to what we already have in memcg_kmem_get_cache. Besides, it's only relevant if there are memory cgroups with kmem accounting enabled. I don't think we can find a way to handle this race w/o it, because alloc_page called from kmem_cache_alloc may sleep so we can't flush all pending kmallocs w/o reference counting. Signed-off-by: Vladimir Davydov Acked-by: Christoph Lameter Cc: Johannes Weiner Cc: Michal Hocko Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 14 +++++++++++-- include/linux/slab.h | 2 -- mm/memcontrol.c | 51 +++++++++++++++------------------------------- mm/slab.c | 2 ++ mm/slub.c | 14 ++++++++----- 5 files changed, 39 insertions(+), 44 deletions(-) (limited to 'include/linux') diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b74942a9e22f..7c95af8d552c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -400,8 +400,8 @@ int memcg_cache_id(struct mem_cgroup *memcg); void memcg_update_array_size(int num_groups); -struct kmem_cache * -__memcg_kmem_get_cache(struct kmem_cache *cachep); +struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep); +void __memcg_kmem_put_cache(struct kmem_cache *cachep); int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order); void __memcg_uncharge_slab(struct kmem_cache *cachep, int order); @@ -494,6 +494,12 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp) return __memcg_kmem_get_cache(cachep); } + +static __always_inline void memcg_kmem_put_cache(struct kmem_cache *cachep) +{ + if (memcg_kmem_enabled()) + __memcg_kmem_put_cache(cachep); +} #else #define for_each_memcg_cache_index(_idx) \ for (; NULL; ) @@ -528,6 +534,10 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp) { return cachep; } + +static inline void memcg_kmem_put_cache(struct kmem_cache *cachep) +{ +} #endif /* CONFIG_MEMCG_KMEM */ #endif /* _LINUX_MEMCONTROL_H */ diff --git a/include/linux/slab.h b/include/linux/slab.h index 8a2457d42fc8..9a139b637069 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -493,7 +493,6 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node) * @memcg: pointer to the memcg this cache belongs to * @list: list_head for the list of all caches in this memcg * @root_cache: pointer to the global, root cache, this cache was derived from - * @nr_pages: number of pages that belongs to this cache. */ struct memcg_cache_params { bool is_root_cache; @@ -506,7 +505,6 @@ struct memcg_cache_params { struct mem_cgroup *memcg; struct list_head list; struct kmem_cache *root_cache; - atomic_t nr_pages; }; }; }; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index dac81b975996..05e1584750ac 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2635,7 +2635,6 @@ static void memcg_register_cache(struct mem_cgroup *memcg, if (!cachep) return; - css_get(&memcg->css); list_add(&cachep->memcg_params->list, &memcg->memcg_slab_caches); /* @@ -2669,9 +2668,6 @@ static void memcg_unregister_cache(struct kmem_cache *cachep) list_del(&cachep->memcg_params->list); kmem_cache_destroy(cachep); - - /* drop the reference taken in memcg_register_cache */ - css_put(&memcg->css); } int __memcg_cleanup_cache_params(struct kmem_cache *s) @@ -2705,9 +2701,7 @@ static void memcg_unregister_all_caches(struct mem_cgroup *memcg) mutex_lock(&memcg_slab_mutex); list_for_each_entry_safe(params, tmp, &memcg->memcg_slab_caches, list) { cachep = memcg_params_to_cache(params); - kmem_cache_shrink(cachep); - if (atomic_read(&cachep->memcg_params->nr_pages) == 0) - memcg_unregister_cache(cachep); + memcg_unregister_cache(cachep); } mutex_unlock(&memcg_slab_mutex); } @@ -2742,10 +2736,10 @@ static void __memcg_schedule_register_cache(struct mem_cgroup *memcg, struct memcg_register_cache_work *cw; cw = kmalloc(sizeof(*cw), GFP_NOWAIT); - if (cw == NULL) { - css_put(&memcg->css); + if (!cw) return; - } + + css_get(&memcg->css); cw->memcg = memcg; cw->cachep = cachep; @@ -2776,12 +2770,8 @@ static void memcg_schedule_register_cache(struct mem_cgroup *memcg, int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order) { unsigned int nr_pages = 1 << order; - int res; - res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp, nr_pages); - if (!res) - atomic_add(nr_pages, &cachep->memcg_params->nr_pages); - return res; + return memcg_charge_kmem(cachep->memcg_params->memcg, gfp, nr_pages); } void __memcg_uncharge_slab(struct kmem_cache *cachep, int order) @@ -2789,7 +2779,6 @@ void __memcg_uncharge_slab(struct kmem_cache *cachep, int order) unsigned int nr_pages = 1 << order; memcg_uncharge_kmem(cachep->memcg_params->memcg, nr_pages); - atomic_sub(nr_pages, &cachep->memcg_params->nr_pages); } /* @@ -2816,22 +2805,13 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep) if (current->memcg_kmem_skip_account) return cachep; - rcu_read_lock(); - memcg = mem_cgroup_from_task(rcu_dereference(current->mm->owner)); - + memcg = get_mem_cgroup_from_mm(current->mm); if (!memcg_kmem_is_active(memcg)) goto out; memcg_cachep = cache_from_memcg_idx(cachep, memcg_cache_id(memcg)); - if (likely(memcg_cachep)) { - cachep = memcg_cachep; - goto out; - } - - /* The corresponding put will be done in the workqueue. */ - if (!css_tryget_online(&memcg->css)) - goto out; - rcu_read_unlock(); + if (likely(memcg_cachep)) + return memcg_cachep; /* * If we are in a safe context (can wait, and not in interrupt @@ -2846,12 +2826,17 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep) * defer everything. */ memcg_schedule_register_cache(memcg, cachep); - return cachep; out: - rcu_read_unlock(); + css_put(&memcg->css); return cachep; } +void __memcg_kmem_put_cache(struct kmem_cache *cachep) +{ + if (!is_root_cache(cachep)) + css_put(&cachep->memcg_params->memcg->css); +} + /* * We need to verify if the allocation against current->mm->owner's memcg is * possible for the given order. But the page is not allocated yet, so we'll @@ -2914,10 +2899,6 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order) memcg_uncharge_kmem(memcg, 1 << order); page->mem_cgroup = NULL; } -#else -static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg) -{ -} #endif /* CONFIG_MEMCG_KMEM */ #ifdef CONFIG_TRANSPARENT_HUGEPAGE @@ -4188,6 +4169,7 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss) static void memcg_destroy_kmem(struct mem_cgroup *memcg) { + memcg_unregister_all_caches(memcg); mem_cgroup_sockets_destroy(memcg); } #else @@ -4797,7 +4779,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css) } spin_unlock(&memcg->event_list_lock); - memcg_unregister_all_caches(memcg); vmpressure_cleanup(&memcg->vmpressure); } diff --git a/mm/slab.c b/mm/slab.c index fee275b5b6b7..6042fe57cc60 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3182,6 +3182,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid, memset(ptr, 0, cachep->object_size); } + memcg_kmem_put_cache(cachep); return ptr; } @@ -3247,6 +3248,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller) memset(objp, 0, cachep->object_size); } + memcg_kmem_put_cache(cachep); return objp; } diff --git a/mm/slub.c b/mm/slub.c index 765c5884d03d..fe4db9c17238 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1233,13 +1233,17 @@ static inline void kfree_hook(const void *x) kmemleak_free(x); } -static inline int slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags) +static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, + gfp_t flags) { flags &= gfp_allowed_mask; lockdep_trace_alloc(flags); might_sleep_if(flags & __GFP_WAIT); - return should_failslab(s->object_size, flags, s->flags); + if (should_failslab(s->object_size, flags, s->flags)) + return NULL; + + return memcg_kmem_get_cache(s, flags); } static inline void slab_post_alloc_hook(struct kmem_cache *s, @@ -1248,6 +1252,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s, flags &= gfp_allowed_mask; kmemcheck_slab_alloc(s, flags, object, slab_ksize(s)); kmemleak_alloc_recursive(object, s->object_size, 1, s->flags, flags); + memcg_kmem_put_cache(s); } static inline void slab_free_hook(struct kmem_cache *s, void *x) @@ -2384,10 +2389,9 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s, struct page *page; unsigned long tid; - if (slab_pre_alloc_hook(s, gfpflags)) + s = slab_pre_alloc_hook(s, gfpflags); + if (!s) return NULL; - - s = memcg_kmem_get_cache(s, gfpflags); redo: /* * Must read kmem_cache cpu data via this cpu ptr. Preemption is -- cgit v1.2.3 From 51f39a1f0cea1cacf8c787f652f26dfee9611874 Mon Sep 17 00:00:00 2001 From: David Drysdale Date: Fri, 12 Dec 2014 16:57:29 -0800 Subject: syscalls: implement execveat() system call This patchset adds execveat(2) for x86, and is derived from Meredydd Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528). The primary aim of adding an execveat syscall is to allow an implementation of fexecve(3) that does not rely on the /proc filesystem, at least for executables (rather than scripts). The current glibc version of fexecve(3) is implemented via /proc, which causes problems in sandboxed or otherwise restricted environments. Given the desire for a /proc-free fexecve() implementation, HPA suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be an appropriate generalization. Also, having a new syscall means that it can take a flags argument without back-compatibility concerns. The current implementation just defines the AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be added in future -- for example, flags for new namespaces (as suggested at https://lkml.org/lkml/2006/7/11/474). Related history: - https://lkml.org/lkml/2006/12/27/123 is an example of someone realizing that fexecve() is likely to fail in a chroot environment. - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered documenting the /proc requirement of fexecve(3) in its manpage, to "prevent other people from wasting their time". - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a problem where a process that did setuid() could not fexecve() because it no longer had access to /proc/self/fd; this has since been fixed. This patch (of 4): Add a new execveat(2) system call. execveat() is to execve() as openat() is to open(): it takes a file descriptor that refers to a directory, and resolves the filename relative to that. In addition, if the filename is empty and AT_EMPTY_PATH is specified, execveat() executes the file to which the file descriptor refers. This replicates the functionality of fexecve(), which is a system call in other UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/" (and so relies on /proc being mounted). The filename fed to the executed program as argv[0] (or the name of the script fed to a script interpreter) will be of the form "/dev/fd/" (for an empty filename) or "/dev/fd//", effectively reflecting how the executable was found. This does however mean that execution of a script in a /proc-less environment won't work; also, script execution via an O_CLOEXEC file descriptor fails (as the file will not be accessible after exec). Based on patches by Meredydd Luff. Signed-off-by: David Drysdale Cc: Meredydd Luff Cc: Shuah Khan Cc: "Eric W. Biederman" Cc: Andy Lutomirski Cc: Alexander Viro Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Kees Cook Cc: Arnd Bergmann Cc: Rich Felker Cc: Christoph Hellwig Cc: Michael Kerrisk Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/binfmt_em86.c | 4 ++ fs/binfmt_misc.c | 4 ++ fs/binfmt_script.c | 10 ++++ fs/exec.c | 113 +++++++++++++++++++++++++++++++++----- fs/namei.c | 2 +- include/linux/binfmts.h | 4 ++ include/linux/compat.h | 3 + include/linux/fs.h | 1 + include/linux/sched.h | 4 ++ include/linux/syscalls.h | 5 ++ include/uapi/asm-generic/unistd.h | 4 +- kernel/sys_ni.c | 3 + lib/audit.c | 3 + 13 files changed, 145 insertions(+), 15 deletions(-) (limited to 'include/linux') diff --git a/fs/binfmt_em86.c b/fs/binfmt_em86.c index f37b08cea1f7..490538536cb4 100644 --- a/fs/binfmt_em86.c +++ b/fs/binfmt_em86.c @@ -42,6 +42,10 @@ static int load_em86(struct linux_binprm *bprm) return -ENOEXEC; } + /* Need to be able to load the file after exec */ + if (bprm->interp_flags & BINPRM_FLAGS_PATH_INACCESSIBLE) + return -ENOENT; + allow_write_access(bprm->file); fput(bprm->file); bprm->file = NULL; diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c index 70789e198dea..c04ef1d4f18a 100644 --- a/fs/binfmt_misc.c +++ b/fs/binfmt_misc.c @@ -144,6 +144,10 @@ static int load_misc_binary(struct linux_binprm *bprm) if (!fmt) goto ret; + /* Need to be able to load the file after exec */ + if (bprm->interp_flags & BINPRM_FLAGS_PATH_INACCESSIBLE) + return -ENOENT; + if (!(fmt->flags & MISC_FMT_PRESERVE_ARGV0)) { retval = remove_arg_zero(bprm); if (retval) diff --git a/fs/binfmt_script.c b/fs/binfmt_script.c index 5027a3e14922..afdf4e3cafc2 100644 --- a/fs/binfmt_script.c +++ b/fs/binfmt_script.c @@ -24,6 +24,16 @@ static int load_script(struct linux_binprm *bprm) if ((bprm->buf[0] != '#') || (bprm->buf[1] != '!')) return -ENOEXEC; + + /* + * If the script filename will be inaccessible after exec, typically + * because it is a "/dev/fd//.." path against an O_CLOEXEC fd, give + * up now (on the assumption that the interpreter will want to load + * this file). + */ + if (bprm->interp_flags & BINPRM_FLAGS_PATH_INACCESSIBLE) + return -ENOENT; + /* * This section does the #! interpretation. * Sorta complicated, but hopefully it will work. -TYT diff --git a/fs/exec.c b/fs/exec.c index 01aebe300200..ad8798e26be9 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -748,18 +748,25 @@ EXPORT_SYMBOL(setup_arg_pages); #endif /* CONFIG_MMU */ -static struct file *do_open_exec(struct filename *name) +static struct file *do_open_execat(int fd, struct filename *name, int flags) { struct file *file; int err; - static const struct open_flags open_exec_flags = { + struct open_flags open_exec_flags = { .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC, .acc_mode = MAY_EXEC | MAY_OPEN, .intent = LOOKUP_OPEN, .lookup_flags = LOOKUP_FOLLOW, }; - file = do_filp_open(AT_FDCWD, name, &open_exec_flags); + if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0) + return ERR_PTR(-EINVAL); + if (flags & AT_SYMLINK_NOFOLLOW) + open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW; + if (flags & AT_EMPTY_PATH) + open_exec_flags.lookup_flags |= LOOKUP_EMPTY; + + file = do_filp_open(fd, name, &open_exec_flags); if (IS_ERR(file)) goto out; @@ -770,12 +777,13 @@ static struct file *do_open_exec(struct filename *name) if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) goto exit; - fsnotify_open(file); - err = deny_write_access(file); if (err) goto exit; + if (name->name[0] != '\0') + fsnotify_open(file); + out: return file; @@ -787,7 +795,7 @@ exit: struct file *open_exec(const char *name) { struct filename tmp = { .name = name }; - return do_open_exec(&tmp); + return do_open_execat(AT_FDCWD, &tmp, 0); } EXPORT_SYMBOL(open_exec); @@ -1428,10 +1436,12 @@ static int exec_binprm(struct linux_binprm *bprm) /* * sys_execve() executes a new program. */ -static int do_execve_common(struct filename *filename, - struct user_arg_ptr argv, - struct user_arg_ptr envp) +static int do_execveat_common(int fd, struct filename *filename, + struct user_arg_ptr argv, + struct user_arg_ptr envp, + int flags) { + char *pathbuf = NULL; struct linux_binprm *bprm; struct file *file; struct files_struct *displaced; @@ -1472,7 +1482,7 @@ static int do_execve_common(struct filename *filename, check_unsafe_exec(bprm); current->in_execve = 1; - file = do_open_exec(filename); + file = do_open_execat(fd, filename, flags); retval = PTR_ERR(file); if (IS_ERR(file)) goto out_unmark; @@ -1480,7 +1490,28 @@ static int do_execve_common(struct filename *filename, sched_exec(); bprm->file = file; - bprm->filename = bprm->interp = filename->name; + if (fd == AT_FDCWD || filename->name[0] == '/') { + bprm->filename = filename->name; + } else { + if (filename->name[0] == '\0') + pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d", fd); + else + pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d/%s", + fd, filename->name); + if (!pathbuf) { + retval = -ENOMEM; + goto out_unmark; + } + /* + * Record that a name derived from an O_CLOEXEC fd will be + * inaccessible after exec. Relies on having exclusive access to + * current->files (due to unshare_files above). + */ + if (close_on_exec(fd, rcu_dereference_raw(current->files->fdt))) + bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE; + bprm->filename = pathbuf; + } + bprm->interp = bprm->filename; retval = bprm_mm_init(bprm); if (retval) @@ -1521,6 +1552,7 @@ static int do_execve_common(struct filename *filename, acct_update_integrals(current); task_numa_free(current); free_bprm(bprm); + kfree(pathbuf); putname(filename); if (displaced) put_files_struct(displaced); @@ -1538,6 +1570,7 @@ out_unmark: out_free: free_bprm(bprm); + kfree(pathbuf); out_files: if (displaced) @@ -1553,7 +1586,18 @@ int do_execve(struct filename *filename, { struct user_arg_ptr argv = { .ptr.native = __argv }; struct user_arg_ptr envp = { .ptr.native = __envp }; - return do_execve_common(filename, argv, envp); + return do_execveat_common(AT_FDCWD, filename, argv, envp, 0); +} + +int do_execveat(int fd, struct filename *filename, + const char __user *const __user *__argv, + const char __user *const __user *__envp, + int flags) +{ + struct user_arg_ptr argv = { .ptr.native = __argv }; + struct user_arg_ptr envp = { .ptr.native = __envp }; + + return do_execveat_common(fd, filename, argv, envp, flags); } #ifdef CONFIG_COMPAT @@ -1569,7 +1613,23 @@ static int compat_do_execve(struct filename *filename, .is_compat = true, .ptr.compat = __envp, }; - return do_execve_common(filename, argv, envp); + return do_execveat_common(AT_FDCWD, filename, argv, envp, 0); +} + +static int compat_do_execveat(int fd, struct filename *filename, + const compat_uptr_t __user *__argv, + const compat_uptr_t __user *__envp, + int flags) +{ + struct user_arg_ptr argv = { + .is_compat = true, + .ptr.compat = __argv, + }; + struct user_arg_ptr envp = { + .is_compat = true, + .ptr.compat = __envp, + }; + return do_execveat_common(fd, filename, argv, envp, flags); } #endif @@ -1609,6 +1669,20 @@ SYSCALL_DEFINE3(execve, { return do_execve(getname(filename), argv, envp); } + +SYSCALL_DEFINE5(execveat, + int, fd, const char __user *, filename, + const char __user *const __user *, argv, + const char __user *const __user *, envp, + int, flags) +{ + int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0; + + return do_execveat(fd, + getname_flags(filename, lookup_flags, NULL), + argv, envp, flags); +} + #ifdef CONFIG_COMPAT COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename, const compat_uptr_t __user *, argv, @@ -1616,4 +1690,17 @@ COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename, { return compat_do_execve(getname(filename), argv, envp); } + +COMPAT_SYSCALL_DEFINE5(execveat, int, fd, + const char __user *, filename, + const compat_uptr_t __user *, argv, + const compat_uptr_t __user *, envp, + int, flags) +{ + int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0; + + return compat_do_execveat(fd, + getname_flags(filename, lookup_flags, NULL), + argv, envp, flags); +} #endif diff --git a/fs/namei.c b/fs/namei.c index db5fe86319e6..ca814165d84c 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -130,7 +130,7 @@ void final_putname(struct filename *name) #define EMBEDDED_NAME_MAX (PATH_MAX - sizeof(struct filename)) -static struct filename * +struct filename * getname_flags(const char __user *filename, int flags, int *empty) { struct filename *result, *err; diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h index 61f29e5ea840..576e4639ca60 100644 --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -53,6 +53,10 @@ struct linux_binprm { #define BINPRM_FLAGS_EXECFD_BIT 1 #define BINPRM_FLAGS_EXECFD (1 << BINPRM_FLAGS_EXECFD_BIT) +/* filename of the binary will be inaccessible after exec */ +#define BINPRM_FLAGS_PATH_INACCESSIBLE_BIT 2 +#define BINPRM_FLAGS_PATH_INACCESSIBLE (1 << BINPRM_FLAGS_PATH_INACCESSIBLE_BIT) + /* Function parameter for binfmt->coredump */ struct coredump_params { const siginfo_t *siginfo; diff --git a/include/linux/compat.h b/include/linux/compat.h index e6494261eaff..7450ca2ac1fc 100644 --- a/include/linux/compat.h +++ b/include/linux/compat.h @@ -357,6 +357,9 @@ asmlinkage long compat_sys_lseek(unsigned int, compat_off_t, unsigned int); asmlinkage long compat_sys_execve(const char __user *filename, const compat_uptr_t __user *argv, const compat_uptr_t __user *envp); +asmlinkage long compat_sys_execveat(int dfd, const char __user *filename, + const compat_uptr_t __user *argv, + const compat_uptr_t __user *envp, int flags); asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp, compat_ulong_t __user *outp, compat_ulong_t __user *exp, diff --git a/include/linux/fs.h b/include/linux/fs.h index 1d1838de6882..4193a0bd99b0 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2096,6 +2096,7 @@ extern int vfs_open(const struct path *, struct file *, const struct cred *); extern struct file * dentry_open(const struct path *, int, const struct cred *); extern int filp_close(struct file *, fl_owner_t id); +extern struct filename *getname_flags(const char __user *, int, int *); extern struct filename *getname(const char __user *); extern struct filename *getname_kernel(const char *); diff --git a/include/linux/sched.h b/include/linux/sched.h index 4cfdbcf8cf56..8db31ef98d2f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2485,6 +2485,10 @@ extern void do_group_exit(int); extern int do_execve(struct filename *, const char __user * const __user *, const char __user * const __user *); +extern int do_execveat(int, struct filename *, + const char __user * const __user *, + const char __user * const __user *, + int); extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *); struct task_struct *fork_idle(int); extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index c9afdc7a7f84..85893d744901 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -877,4 +877,9 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags, asmlinkage long sys_getrandom(char __user *buf, size_t count, unsigned int flags); asmlinkage long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size); + +asmlinkage long sys_execveat(int dfd, const char __user *filename, + const char __user *const __user *argv, + const char __user *const __user *envp, int flags); + #endif diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 22749c134117..e016bd9b1a04 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -707,9 +707,11 @@ __SYSCALL(__NR_getrandom, sys_getrandom) __SYSCALL(__NR_memfd_create, sys_memfd_create) #define __NR_bpf 280 __SYSCALL(__NR_bpf, sys_bpf) +#define __NR_execveat 281 +__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat) #undef __NR_syscalls -#define __NR_syscalls 281 +#define __NR_syscalls 282 /* * All syscalls below here should go away really, diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 61eea02b53f5..5adcb0ae3a58 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -226,3 +226,6 @@ cond_syscall(sys_seccomp); /* access BPF programs and maps */ cond_syscall(sys_bpf); + +/* execveat */ +cond_syscall(sys_execveat); diff --git a/lib/audit.c b/lib/audit.c index 1d726a22565b..b8fb5ee81e26 100644 --- a/lib/audit.c +++ b/lib/audit.c @@ -53,6 +53,9 @@ int audit_classify_syscall(int abi, unsigned syscall) #ifdef __NR_socketcall case __NR_socketcall: return 4; +#endif +#ifdef __NR_execveat + case __NR_execveat: #endif case __NR_execve: return 5; -- cgit v1.2.3 From 89e3f90995b370fa46922eece62ea23f039a202d Mon Sep 17 00:00:00 2001 From: Dmitry Monakhov Date: Fri, 12 Dec 2014 16:57:57 -0800 Subject: ratelimit: add initialization macro Signed-off-by: Dmitry Monakhov Cc: Akinobu Mita Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/ratelimit.h | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) (limited to 'include/linux') diff --git a/include/linux/ratelimit.h b/include/linux/ratelimit.h index 0a260d8a18bf..18102529254e 100644 --- a/include/linux/ratelimit.h +++ b/include/linux/ratelimit.h @@ -17,14 +17,20 @@ struct ratelimit_state { unsigned long begin; }; -#define DEFINE_RATELIMIT_STATE(name, interval_init, burst_init) \ - \ - struct ratelimit_state name = { \ +#define RATELIMIT_STATE_INIT(name, interval_init, burst_init) { \ .lock = __RAW_SPIN_LOCK_UNLOCKED(name.lock), \ .interval = interval_init, \ .burst = burst_init, \ } +#define RATELIMIT_STATE_INIT_DISABLED \ + RATELIMIT_STATE_INIT(ratelimit_state, 0, DEFAULT_RATELIMIT_BURST) + +#define DEFINE_RATELIMIT_STATE(name, interval_init, burst_init) \ + \ + struct ratelimit_state name = \ + RATELIMIT_STATE_INIT(name, interval_init, burst_init) \ + static inline void ratelimit_state_init(struct ratelimit_state *rs, int interval, int burst) { -- cgit v1.2.3 From 6adc4a22f20bbf3bbc754f1ec8c82be5dfb5c71a Mon Sep 17 00:00:00 2001 From: Dmitry Monakhov Date: Fri, 12 Dec 2014 16:58:00 -0800 Subject: fault-inject: add ratelimit option Current debug levels are not optimal. Especially if one want to provoke big numbers of faults(broken device simulator) then any verbose level will produce giant numbers of identical logging messages. Let's add ratelimit parameter for that purpose. Signed-off-by: Dmitry Monakhov Acked-by: Akinobu Mita Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/fault-inject.h | 17 +++++++++++------ lib/fault-inject.c | 21 +++++++++++++++++---- 2 files changed, 28 insertions(+), 10 deletions(-) (limited to 'include/linux') diff --git a/include/linux/fault-inject.h b/include/linux/fault-inject.h index c6f996f2abb6..798fad9e420d 100644 --- a/include/linux/fault-inject.h +++ b/include/linux/fault-inject.h @@ -5,6 +5,7 @@ #include #include +#include #include /* @@ -25,14 +26,18 @@ struct fault_attr { unsigned long reject_end; unsigned long count; + struct ratelimit_state ratelimit_state; + struct dentry *dname; }; -#define FAULT_ATTR_INITIALIZER { \ - .interval = 1, \ - .times = ATOMIC_INIT(1), \ - .require_end = ULONG_MAX, \ - .stacktrace_depth = 32, \ - .verbose = 2, \ +#define FAULT_ATTR_INITIALIZER { \ + .interval = 1, \ + .times = ATOMIC_INIT(1), \ + .require_end = ULONG_MAX, \ + .stacktrace_depth = 32, \ + .ratelimit_state = RATELIMIT_STATE_INIT_DISABLED, \ + .verbose = 2, \ + .dname = NULL, \ } #define DECLARE_FAULT_ATTR(name) struct fault_attr name = FAULT_ATTR_INITIALIZER diff --git a/lib/fault-inject.c b/lib/fault-inject.c index d7d501ea856d..f1cdeb024d17 100644 --- a/lib/fault-inject.c +++ b/lib/fault-inject.c @@ -40,10 +40,16 @@ EXPORT_SYMBOL_GPL(setup_fault_attr); static void fail_dump(struct fault_attr *attr) { - if (attr->verbose > 0) - printk(KERN_NOTICE "FAULT_INJECTION: forcing a failure\n"); - if (attr->verbose > 1) - dump_stack(); + if (attr->verbose > 0 && __ratelimit(&attr->ratelimit_state)) { + printk(KERN_NOTICE "FAULT_INJECTION: forcing a failure.\n" + "name %pd, interval %lu, probability %lu, " + "space %d, times %d\n", attr->dname, + attr->probability, attr->interval, + atomic_read(&attr->space), + atomic_read(&attr->times)); + if (attr->verbose > 1) + dump_stack(); + } } #define atomic_dec_not_zero(v) atomic_add_unless((v), -1, 0) @@ -202,6 +208,12 @@ struct dentry *fault_create_debugfs_attr(const char *name, goto fail; if (!debugfs_create_ul("verbose", mode, dir, &attr->verbose)) goto fail; + if (!debugfs_create_u32("verbose_ratelimit_interval_ms", mode, dir, + &attr->ratelimit_state.interval)) + goto fail; + if (!debugfs_create_u32("verbose_ratelimit_burst", mode, dir, + &attr->ratelimit_state.burst)) + goto fail; if (!debugfs_create_bool("task-filter", mode, dir, &attr->task_filter)) goto fail; @@ -222,6 +234,7 @@ struct dentry *fault_create_debugfs_attr(const char *name, #endif /* CONFIG_FAULT_INJECTION_STACKTRACE_FILTER */ + attr->dname = dget(dir); return dir; fail: debugfs_remove_recursive(dir); -- cgit v1.2.3 From 0050ee059f7fc86b1df2527aaa14ed5dc72f9973 Mon Sep 17 00:00:00 2001 From: Manfred Spraul Date: Fri, 12 Dec 2014 16:58:17 -0800 Subject: ipc/msg: increase MSGMNI, remove scaling SysV can be abused to allocate locked kernel memory. For most systems, a small limit doesn't make sense, see the discussion with regards to SHMMAX. Therefore: increase MSGMNI to the maximum supported. And: If we ignore the risk of locking too much memory, then an automatic scaling of MSGMNI doesn't make sense. Therefore the logic can be removed. The code preserves auto_msgmni to avoid breaking any user space applications that expect that the value exists. Notes: 1) If an administrator must limit the memory allocations, then he can set MSGMNI as necessary. Or he can disable sysv entirely (as e.g. done by Android). 2) MSGMAX and MSGMNB are intentionally not increased, as these values are used to control latency vs. throughput: If MSGMNB is large, then msgsnd() just returns and more messages can be queued before a task switch to a task that calls msgrcv() is forced. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Manfred Spraul Cc: Davidlohr Bueso Cc: Rafael Aquini Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/kernel.txt | 10 +++-- include/linux/ipc_namespace.h | 20 --------- include/uapi/linux/msg.h | 28 +++++++++---- ipc/Makefile | 2 +- ipc/ipc_sysctl.c | 93 ++++++++--------------------------------- ipc/ipcns_notifier.c | 92 ---------------------------------------- ipc/msg.c | 36 +--------------- ipc/namespace.c | 22 ---------- ipc/util.c | 40 ------------------ 9 files changed, 45 insertions(+), 298 deletions(-) delete mode 100644 ipc/ipcns_notifier.c (limited to 'include/linux') diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index b5d0c8501a18..75511efefc64 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -116,10 +116,12 @@ set during run time. auto_msgmni: -Enables/Disables automatic recomputing of msgmni upon memory add/remove -or upon ipc namespace creation/removal (see the msgmni description -above). Echoing "1" into this file enables msgmni automatic recomputing. -Echoing "0" turns it off. auto_msgmni default value is 1. +This variable has no effect and may be removed in future kernel +releases. Reading it always returns 0. +Up to Linux 3.17, it enabled/disabled automatic recomputing of msgmni +upon memory add/remove or upon ipc namespace creation/removal. +Echoing "1" into this file enabled msgmni automatic recomputing. +Echoing "0" turned it off. auto_msgmni default value was 1. ============================================================== diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h index 35e7eca4e33b..e365d5ec69cb 100644 --- a/include/linux/ipc_namespace.h +++ b/include/linux/ipc_namespace.h @@ -7,15 +7,6 @@ #include #include -/* - * ipc namespace events - */ -#define IPCNS_MEMCHANGED 0x00000001 /* Notify lowmem size changed */ -#define IPCNS_CREATED 0x00000002 /* Notify new ipc namespace created */ -#define IPCNS_REMOVED 0x00000003 /* Notify ipc namespace removed */ - -#define IPCNS_CALLBACK_PRI 0 - struct user_namespace; struct ipc_ids { @@ -38,7 +29,6 @@ struct ipc_namespace { unsigned int msg_ctlmni; atomic_t msg_bytes; atomic_t msg_hdrs; - int auto_msgmni; size_t shm_ctlmax; size_t shm_ctlall; @@ -77,18 +67,8 @@ extern atomic_t nr_ipc_ns; extern spinlock_t mq_lock; #ifdef CONFIG_SYSVIPC -extern int register_ipcns_notifier(struct ipc_namespace *); -extern int cond_register_ipcns_notifier(struct ipc_namespace *); -extern void unregister_ipcns_notifier(struct ipc_namespace *); -extern int ipcns_notify(unsigned long); extern void shm_destroy_orphaned(struct ipc_namespace *ns); #else /* CONFIG_SYSVIPC */ -static inline int register_ipcns_notifier(struct ipc_namespace *ns) -{ return 0; } -static inline int cond_register_ipcns_notifier(struct ipc_namespace *ns) -{ return 0; } -static inline void unregister_ipcns_notifier(struct ipc_namespace *ns) { } -static inline int ipcns_notify(unsigned long l) { return 0; } static inline void shm_destroy_orphaned(struct ipc_namespace *ns) {} #endif /* CONFIG_SYSVIPC */ diff --git a/include/uapi/linux/msg.h b/include/uapi/linux/msg.h index a70375526578..f51c8001dbe5 100644 --- a/include/uapi/linux/msg.h +++ b/include/uapi/linux/msg.h @@ -51,16 +51,28 @@ struct msginfo { }; /* - * Scaling factor to compute msgmni: - * the memory dedicated to msg queues (msgmni * msgmnb) should occupy - * at most 1/MSG_MEM_SCALE of the lowmem (see the formula in ipc/msg.c): - * up to 8MB : msgmni = 16 (MSGMNI) - * 4 GB : msgmni = 8K - * more than 16 GB : msgmni = 32K (IPCMNI) + * MSGMNI, MSGMAX and MSGMNB are default values which can be + * modified by sysctl. + * + * MSGMNI is the upper limit for the number of messages queues per + * namespace. + * It has been chosen to be as large possible without facilitating + * scenarios where userspace causes overflows when adjusting the limits via + * operations of the form retrieve current limit; add X; update limit". + * + * MSGMNB is the default size of a new message queue. Non-root tasks can + * decrease the size with msgctl(IPC_SET), root tasks + * (actually: CAP_SYS_RESOURCE) can both increase and decrease the queue + * size. The optimal value is application dependent. + * 16384 is used because it was always used (since 0.99.10) + * + * MAXMAX is the maximum size of an individual message, it's a global + * (per-namespace) limit that applies for all message queues. + * It's set to 1/2 of MSGMNB, to ensure that at least two messages fit into + * the queue. This is also an arbitrary choice (since 2.6.0). */ -#define MSG_MEM_SCALE 32 -#define MSGMNI 16 /* <= IPCMNI */ /* max # of msg queue identifiers */ +#define MSGMNI 32000 /* <= IPCMNI */ /* max # of msg queue identifiers */ #define MSGMAX 8192 /* <= INT_MAX */ /* max size of message (bytes) */ #define MSGMNB 16384 /* <= INT_MAX */ /* default max size of a message queue */ diff --git a/ipc/Makefile b/ipc/Makefile index 9075e172e52c..86c7300ecdf5 100644 --- a/ipc/Makefile +++ b/ipc/Makefile @@ -3,7 +3,7 @@ # obj-$(CONFIG_SYSVIPC_COMPAT) += compat.o -obj-$(CONFIG_SYSVIPC) += util.o msgutil.o msg.o sem.o shm.o ipcns_notifier.o syscall.o +obj-$(CONFIG_SYSVIPC) += util.o msgutil.o msg.o sem.o shm.o syscall.o obj-$(CONFIG_SYSVIPC_SYSCTL) += ipc_sysctl.o obj_mq-$(CONFIG_COMPAT) += compat_mq.o obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y) diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c index e8075b247497..8ad93c29f511 100644 --- a/ipc/ipc_sysctl.c +++ b/ipc/ipc_sysctl.c @@ -62,29 +62,6 @@ static int proc_ipc_dointvec_minmax_orphans(struct ctl_table *table, int write, return err; } -static int proc_ipc_callback_dointvec_minmax(struct ctl_table *table, int write, - void __user *buffer, size_t *lenp, loff_t *ppos) -{ - struct ctl_table ipc_table; - size_t lenp_bef = *lenp; - int rc; - - memcpy(&ipc_table, table, sizeof(ipc_table)); - ipc_table.data = get_ipc(table); - - rc = proc_dointvec_minmax(&ipc_table, write, buffer, lenp, ppos); - - if (write && !rc && lenp_bef == *lenp) - /* - * Tunable has successfully been changed by hand. Disable its - * automatic adjustment. This simply requires unregistering - * the notifiers that trigger recalculation. - */ - unregister_ipcns_notifier(current->nsproxy->ipc_ns); - - return rc; -} - static int proc_ipc_doulongvec_minmax(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos) { @@ -96,54 +73,19 @@ static int proc_ipc_doulongvec_minmax(struct ctl_table *table, int write, lenp, ppos); } -/* - * Routine that is called when the file "auto_msgmni" has successfully been - * written. - * Two values are allowed: - * 0: unregister msgmni's callback routine from the ipc namespace notifier - * chain. This means that msgmni won't be recomputed anymore upon memory - * add/remove or ipc namespace creation/removal. - * 1: register back the callback routine. - */ -static void ipc_auto_callback(int val) -{ - if (!val) - unregister_ipcns_notifier(current->nsproxy->ipc_ns); - else { - /* - * Re-enable automatic recomputing only if not already - * enabled. - */ - recompute_msgmni(current->nsproxy->ipc_ns); - cond_register_ipcns_notifier(current->nsproxy->ipc_ns); - } -} - -static int proc_ipcauto_dointvec_minmax(struct ctl_table *table, int write, +static int proc_ipc_auto_msgmni(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos) { struct ctl_table ipc_table; - int oldval; - int rc; + int dummy = 0; memcpy(&ipc_table, table, sizeof(ipc_table)); - ipc_table.data = get_ipc(table); - oldval = *((int *)(ipc_table.data)); + ipc_table.data = &dummy; - rc = proc_dointvec_minmax(&ipc_table, write, buffer, lenp, ppos); + if (write) + pr_info_once("writing to auto_msgmni has no effect"); - if (write && !rc) { - int newval = *((int *)(ipc_table.data)); - /* - * The file "auto_msgmni" has correctly been set. - * React by (un)registering the corresponding tunable, if the - * value has changed. - */ - if (newval != oldval) - ipc_auto_callback(newval); - } - - return rc; + return proc_dointvec_minmax(&ipc_table, write, buffer, lenp, ppos); } #else @@ -151,8 +93,7 @@ static int proc_ipcauto_dointvec_minmax(struct ctl_table *table, int write, #define proc_ipc_dointvec NULL #define proc_ipc_dointvec_minmax NULL #define proc_ipc_dointvec_minmax_orphans NULL -#define proc_ipc_callback_dointvec_minmax NULL -#define proc_ipcauto_dointvec_minmax NULL +#define proc_ipc_auto_msgmni NULL #endif static int zero; @@ -204,10 +145,19 @@ static struct ctl_table ipc_kern_table[] = { .data = &init_ipc_ns.msg_ctlmni, .maxlen = sizeof(init_ipc_ns.msg_ctlmni), .mode = 0644, - .proc_handler = proc_ipc_callback_dointvec_minmax, + .proc_handler = proc_ipc_dointvec_minmax, .extra1 = &zero, .extra2 = &int_max, }, + { + .procname = "auto_msgmni", + .data = NULL, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_ipc_auto_msgmni, + .extra1 = &zero, + .extra2 = &one, + }, { .procname = "msgmnb", .data = &init_ipc_ns.msg_ctlmnb, @@ -224,15 +174,6 @@ static struct ctl_table ipc_kern_table[] = { .mode = 0644, .proc_handler = proc_ipc_dointvec, }, - { - .procname = "auto_msgmni", - .data = &init_ipc_ns.auto_msgmni, - .maxlen = sizeof(int), - .mode = 0644, - .proc_handler = proc_ipcauto_dointvec_minmax, - .extra1 = &zero, - .extra2 = &one, - }, #ifdef CONFIG_CHECKPOINT_RESTORE { .procname = "sem_next_id", diff --git a/ipc/ipcns_notifier.c b/ipc/ipcns_notifier.c deleted file mode 100644 index b9b31a4f77e1..000000000000 --- a/ipc/ipcns_notifier.c +++ /dev/null @@ -1,92 +0,0 @@ -/* - * linux/ipc/ipcns_notifier.c - * Copyright (C) 2007 BULL SA. Nadia Derbey - * - * Notification mechanism for ipc namespaces: - * The callback routine registered in the memory chain invokes the ipcns - * notifier chain with the IPCNS_MEMCHANGED event. - * Each callback routine registered in the ipcns namespace recomputes msgmni - * for the owning namespace. - */ - -#include -#include -#include -#include -#include - -#include "util.h" - - - -static BLOCKING_NOTIFIER_HEAD(ipcns_chain); - - -static int ipcns_callback(struct notifier_block *self, - unsigned long action, void *arg) -{ - struct ipc_namespace *ns; - - switch (action) { - case IPCNS_MEMCHANGED: /* amount of lowmem has changed */ - case IPCNS_CREATED: - case IPCNS_REMOVED: - /* - * It's time to recompute msgmni - */ - ns = container_of(self, struct ipc_namespace, ipcns_nb); - /* - * No need to get a reference on the ns: the 1st job of - * free_ipc_ns() is to unregister the callback routine. - * blocking_notifier_chain_unregister takes the wr lock to do - * it. - * When this callback routine is called the rd lock is held by - * blocking_notifier_call_chain. - * So the ipc ns cannot be freed while we are here. - */ - recompute_msgmni(ns); - break; - default: - break; - } - - return NOTIFY_OK; -} - -int register_ipcns_notifier(struct ipc_namespace *ns) -{ - int rc; - - memset(&ns->ipcns_nb, 0, sizeof(ns->ipcns_nb)); - ns->ipcns_nb.notifier_call = ipcns_callback; - ns->ipcns_nb.priority = IPCNS_CALLBACK_PRI; - rc = blocking_notifier_chain_register(&ipcns_chain, &ns->ipcns_nb); - if (!rc) - ns->auto_msgmni = 1; - return rc; -} - -int cond_register_ipcns_notifier(struct ipc_namespace *ns) -{ - int rc; - - memset(&ns->ipcns_nb, 0, sizeof(ns->ipcns_nb)); - ns->ipcns_nb.notifier_call = ipcns_callback; - ns->ipcns_nb.priority = IPCNS_CALLBACK_PRI; - rc = blocking_notifier_chain_cond_register(&ipcns_chain, - &ns->ipcns_nb); - if (!rc) - ns->auto_msgmni = 1; - return rc; -} - -void unregister_ipcns_notifier(struct ipc_namespace *ns) -{ - blocking_notifier_chain_unregister(&ipcns_chain, &ns->ipcns_nb); - ns->auto_msgmni = 0; -} - -int ipcns_notify(unsigned long val) -{ - return blocking_notifier_call_chain(&ipcns_chain, val, NULL); -} diff --git a/ipc/msg.c b/ipc/msg.c index c5d8e3749985..a7261d5cbc89 100644 --- a/ipc/msg.c +++ b/ipc/msg.c @@ -989,43 +989,12 @@ SYSCALL_DEFINE5(msgrcv, int, msqid, struct msgbuf __user *, msgp, size_t, msgsz, return do_msgrcv(msqid, msgp, msgsz, msgtyp, msgflg, do_msg_fill); } -/* - * Scale msgmni with the available lowmem size: the memory dedicated to msg - * queues should occupy at most 1/MSG_MEM_SCALE of lowmem. - * Also take into account the number of nsproxies created so far. - * This should be done staying within the (MSGMNI , IPCMNI/nr_ipc_ns) range. - */ -void recompute_msgmni(struct ipc_namespace *ns) -{ - struct sysinfo i; - unsigned long allowed; - int nb_ns; - - si_meminfo(&i); - allowed = (((i.totalram - i.totalhigh) / MSG_MEM_SCALE) * i.mem_unit) - / MSGMNB; - nb_ns = atomic_read(&nr_ipc_ns); - allowed /= nb_ns; - - if (allowed < MSGMNI) { - ns->msg_ctlmni = MSGMNI; - return; - } - - if (allowed > IPCMNI / nb_ns) { - ns->msg_ctlmni = IPCMNI / nb_ns; - return; - } - - ns->msg_ctlmni = allowed; -} void msg_init_ns(struct ipc_namespace *ns) { ns->msg_ctlmax = MSGMAX; ns->msg_ctlmnb = MSGMNB; - - recompute_msgmni(ns); + ns->msg_ctlmni = MSGMNI; atomic_set(&ns->msg_bytes, 0); atomic_set(&ns->msg_hdrs, 0); @@ -1069,9 +1038,6 @@ void __init msg_init(void) { msg_init_ns(&init_ipc_ns); - printk(KERN_INFO "msgmni has been set to %d\n", - init_ipc_ns.msg_ctlmni); - ipc_init_proc_interface("sysvipc/msg", " key msqid perms cbytes qnum lspid lrpid uid gid cuid cgid stime rtime ctime\n", IPC_MSG_IDS, sysvipc_msg_proc_show); diff --git a/ipc/namespace.c b/ipc/namespace.c index b54468e48e32..1a3ffd40356e 100644 --- a/ipc/namespace.c +++ b/ipc/namespace.c @@ -45,14 +45,6 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns, msg_init_ns(ns); shm_init_ns(ns); - /* - * msgmni has already been computed for the new ipc ns. - * Thus, do the ipcns creation notification before registering that - * new ipcns in the chain. - */ - ipcns_notify(IPCNS_CREATED); - register_ipcns_notifier(ns); - ns->user_ns = get_user_ns(user_ns); return ns; @@ -99,25 +91,11 @@ void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids, static void free_ipc_ns(struct ipc_namespace *ns) { - /* - * Unregistering the hotplug notifier at the beginning guarantees - * that the ipc namespace won't be freed while we are inside the - * callback routine. Since the blocking_notifier_chain_XXX routines - * hold a rw lock on the notifier list, unregister_ipcns_notifier() - * won't take the rw lock before blocking_notifier_call_chain() has - * released the rd lock. - */ - unregister_ipcns_notifier(ns); sem_exit_ns(ns); msg_exit_ns(ns); shm_exit_ns(ns); atomic_dec(&nr_ipc_ns); - /* - * Do the ipcns removal notification after decrementing nr_ipc_ns in - * order to have a correct value when recomputing msgmni. - */ - ipcns_notify(IPCNS_REMOVED); put_user_ns(ns->user_ns); proc_free_inum(ns->proc_inum); kfree(ns); diff --git a/ipc/util.c b/ipc/util.c index 88adc329888c..106bed0378ab 100644 --- a/ipc/util.c +++ b/ipc/util.c @@ -71,44 +71,6 @@ struct ipc_proc_iface { int (*show)(struct seq_file *, void *); }; -static void ipc_memory_notifier(struct work_struct *work) -{ - ipcns_notify(IPCNS_MEMCHANGED); -} - -static int ipc_memory_callback(struct notifier_block *self, - unsigned long action, void *arg) -{ - static DECLARE_WORK(ipc_memory_wq, ipc_memory_notifier); - - switch (action) { - case MEM_ONLINE: /* memory successfully brought online */ - case MEM_OFFLINE: /* or offline: it's time to recompute msgmni */ - /* - * This is done by invoking the ipcns notifier chain with the - * IPC_MEMCHANGED event. - * In order not to keep the lock on the hotplug memory chain - * for too long, queue a work item that will, when waken up, - * activate the ipcns notification chain. - */ - schedule_work(&ipc_memory_wq); - break; - case MEM_GOING_ONLINE: - case MEM_GOING_OFFLINE: - case MEM_CANCEL_ONLINE: - case MEM_CANCEL_OFFLINE: - default: - break; - } - - return NOTIFY_OK; -} - -static struct notifier_block ipc_memory_nb = { - .notifier_call = ipc_memory_callback, - .priority = IPC_CALLBACK_PRI, -}; - /** * ipc_init - initialise ipc subsystem * @@ -124,8 +86,6 @@ static int __init ipc_init(void) sem_init(); msg_init(); shm_init(); - register_hotmemory_notifier(&ipc_memory_nb); - register_ipcns_notifier(&init_ipc_ns); return 0; } device_initcall(ipc_init); -- cgit v1.2.3 From 0809ab69a2782afac8c4d7f3d35cd123050aab9a Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Fri, 12 Dec 2014 16:58:36 -0800 Subject: fsnotify: unify inode and mount marks handling There's a lot of common code in inode and mount marks handling. Factor it out to a common helper function. Signed-off-by: Jan Kara Cc: Eric Paris Cc: Heinrich Schuchardt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/notify/dnotify/dnotify.c | 4 +- fs/notify/fdinfo.c | 6 +- fs/notify/fsnotify.c | 4 +- fs/notify/fsnotify.h | 12 ++++ fs/notify/inode_mark.c | 113 ++++++----------------------------- fs/notify/inotify/inotify_fsnotify.c | 2 +- fs/notify/inotify/inotify_user.c | 10 ++-- fs/notify/mark.c | 89 ++++++++++++++++++++++++++- fs/notify/vfsmount_mark.c | 109 ++++++--------------------------- include/linux/fsnotify_backend.h | 24 ++------ kernel/audit_tree.c | 16 ++--- 11 files changed, 160 insertions(+), 229 deletions(-) (limited to 'include/linux') diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c index caaaf9dfe353..44523f4a6084 100644 --- a/fs/notify/dnotify/dnotify.c +++ b/fs/notify/dnotify/dnotify.c @@ -69,8 +69,8 @@ static void dnotify_recalc_inode_mask(struct fsnotify_mark *fsn_mark) if (old_mask == new_mask) return; - if (fsn_mark->i.inode) - fsnotify_recalc_inode_mask(fsn_mark->i.inode); + if (fsn_mark->inode) + fsnotify_recalc_inode_mask(fsn_mark->inode); } /* diff --git a/fs/notify/fdinfo.c b/fs/notify/fdinfo.c index 6ffd220eb14d..58b7cdb63da9 100644 --- a/fs/notify/fdinfo.c +++ b/fs/notify/fdinfo.c @@ -80,7 +80,7 @@ static void inotify_fdinfo(struct seq_file *m, struct fsnotify_mark *mark) return; inode_mark = container_of(mark, struct inotify_inode_mark, fsn_mark); - inode = igrab(mark->i.inode); + inode = igrab(mark->inode); if (inode) { seq_printf(m, "inotify wd:%x ino:%lx sdev:%x mask:%x ignored_mask:%x ", inode_mark->wd, inode->i_ino, inode->i_sb->s_dev, @@ -112,7 +112,7 @@ static void fanotify_fdinfo(struct seq_file *m, struct fsnotify_mark *mark) mflags |= FAN_MARK_IGNORED_SURV_MODIFY; if (mark->flags & FSNOTIFY_MARK_FLAG_INODE) { - inode = igrab(mark->i.inode); + inode = igrab(mark->inode); if (!inode) return; seq_printf(m, "fanotify ino:%lx sdev:%x mflags:%x mask:%x ignored_mask:%x ", @@ -122,7 +122,7 @@ static void fanotify_fdinfo(struct seq_file *m, struct fsnotify_mark *mark) seq_putc(m, '\n'); iput(inode); } else if (mark->flags & FSNOTIFY_MARK_FLAG_VFSMOUNT) { - struct mount *mnt = real_mount(mark->m.mnt); + struct mount *mnt = real_mount(mark->mnt); seq_printf(m, "fanotify mnt_id:%x mflags:%x mask:%x ignored_mask:%x\n", mnt->mnt_id, mflags, mark->mask, mark->ignored_mask); diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c index 41e39102743a..dd3fb0b17be7 100644 --- a/fs/notify/fsnotify.c +++ b/fs/notify/fsnotify.c @@ -242,13 +242,13 @@ int fsnotify(struct inode *to_tell, __u32 mask, void *data, int data_is, if (inode_node) { inode_mark = hlist_entry(srcu_dereference(inode_node, &fsnotify_mark_srcu), - struct fsnotify_mark, i.i_list); + struct fsnotify_mark, obj_list); inode_group = inode_mark->group; } if (vfsmount_node) { vfsmount_mark = hlist_entry(srcu_dereference(vfsmount_node, &fsnotify_mark_srcu), - struct fsnotify_mark, m.m_list); + struct fsnotify_mark, obj_list); vfsmount_group = vfsmount_mark->group; } diff --git a/fs/notify/fsnotify.h b/fs/notify/fsnotify.h index 3b68b0ae0a97..13a00be516d2 100644 --- a/fs/notify/fsnotify.h +++ b/fs/notify/fsnotify.h @@ -12,12 +12,19 @@ extern void fsnotify_flush_notify(struct fsnotify_group *group); /* protects reads of inode and vfsmount marks list */ extern struct srcu_struct fsnotify_mark_srcu; +/* Calculate mask of events for a list of marks */ +extern u32 fsnotify_recalc_mask(struct hlist_head *head); + /* compare two groups for sorting of marks lists */ extern int fsnotify_compare_groups(struct fsnotify_group *a, struct fsnotify_group *b); extern void fsnotify_set_inode_mark_mask_locked(struct fsnotify_mark *fsn_mark, __u32 mask); +/* Add mark to a proper place in mark list */ +extern int fsnotify_add_mark_list(struct hlist_head *head, + struct fsnotify_mark *mark, + int allow_dups); /* add a mark to an inode */ extern int fsnotify_add_inode_mark(struct fsnotify_mark *mark, struct fsnotify_group *group, struct inode *inode, @@ -31,6 +38,11 @@ extern int fsnotify_add_vfsmount_mark(struct fsnotify_mark *mark, extern void fsnotify_destroy_vfsmount_mark(struct fsnotify_mark *mark); /* inode specific destruction of a mark */ extern void fsnotify_destroy_inode_mark(struct fsnotify_mark *mark); +/* Destroy all marks in the given list */ +extern void fsnotify_destroy_marks(struct list_head *to_free); +/* Find mark belonging to given group in the list of marks */ +extern struct fsnotify_mark *fsnotify_find_mark(struct hlist_head *head, + struct fsnotify_group *group); /* run the list of all marks associated with inode and flag them to be freed */ extern void fsnotify_clear_marks_by_inode(struct inode *inode); /* run the list of all marks associated with vfsmount and flag them to be freed */ diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c index dfbf5447eea4..3daf513ee99e 100644 --- a/fs/notify/inode_mark.c +++ b/fs/notify/inode_mark.c @@ -30,21 +30,6 @@ #include "../internal.h" -/* - * Recalculate the mask of events relevant to a given inode locked. - */ -static void fsnotify_recalc_inode_mask_locked(struct inode *inode) -{ - struct fsnotify_mark *mark; - __u32 new_mask = 0; - - assert_spin_locked(&inode->i_lock); - - hlist_for_each_entry(mark, &inode->i_fsnotify_marks, i.i_list) - new_mask |= mark->mask; - inode->i_fsnotify_mask = new_mask; -} - /* * Recalculate the inode->i_fsnotify_mask, or the mask of all FS_* event types * any notifier is interested in hearing for this inode. @@ -52,7 +37,7 @@ static void fsnotify_recalc_inode_mask_locked(struct inode *inode) void fsnotify_recalc_inode_mask(struct inode *inode) { spin_lock(&inode->i_lock); - fsnotify_recalc_inode_mask_locked(inode); + inode->i_fsnotify_mask = fsnotify_recalc_mask(&inode->i_fsnotify_marks); spin_unlock(&inode->i_lock); __fsnotify_update_child_dentry_flags(inode); @@ -60,23 +45,22 @@ void fsnotify_recalc_inode_mask(struct inode *inode) void fsnotify_destroy_inode_mark(struct fsnotify_mark *mark) { - struct inode *inode = mark->i.inode; + struct inode *inode = mark->inode; BUG_ON(!mutex_is_locked(&mark->group->mark_mutex)); assert_spin_locked(&mark->lock); spin_lock(&inode->i_lock); - hlist_del_init_rcu(&mark->i.i_list); - mark->i.inode = NULL; + hlist_del_init_rcu(&mark->obj_list); + mark->inode = NULL; /* * this mark is now off the inode->i_fsnotify_marks list and we * hold the inode->i_lock, so this is the perfect time to update the * inode->i_fsnotify_mask */ - fsnotify_recalc_inode_mask_locked(inode); - + inode->i_fsnotify_mask = fsnotify_recalc_mask(&inode->i_fsnotify_marks); spin_unlock(&inode->i_lock); } @@ -85,30 +69,19 @@ void fsnotify_destroy_inode_mark(struct fsnotify_mark *mark) */ void fsnotify_clear_marks_by_inode(struct inode *inode) { - struct fsnotify_mark *mark, *lmark; + struct fsnotify_mark *mark; struct hlist_node *n; LIST_HEAD(free_list); spin_lock(&inode->i_lock); - hlist_for_each_entry_safe(mark, n, &inode->i_fsnotify_marks, i.i_list) { - list_add(&mark->i.free_i_list, &free_list); - hlist_del_init_rcu(&mark->i.i_list); + hlist_for_each_entry_safe(mark, n, &inode->i_fsnotify_marks, obj_list) { + list_add(&mark->free_list, &free_list); + hlist_del_init_rcu(&mark->obj_list); fsnotify_get_mark(mark); } spin_unlock(&inode->i_lock); - list_for_each_entry_safe(mark, lmark, &free_list, i.free_i_list) { - struct fsnotify_group *group; - - spin_lock(&mark->lock); - fsnotify_get_group(mark->group); - group = mark->group; - spin_unlock(&mark->lock); - - fsnotify_destroy_mark(mark, group); - fsnotify_put_mark(mark); - fsnotify_put_group(group); - } + fsnotify_destroy_marks(&free_list); } /* @@ -119,27 +92,6 @@ void fsnotify_clear_inode_marks_by_group(struct fsnotify_group *group) fsnotify_clear_marks_by_group_flags(group, FSNOTIFY_MARK_FLAG_INODE); } -/* - * given a group and inode, find the mark associated with that combination. - * if found take a reference to that mark and return it, else return NULL - */ -static struct fsnotify_mark *fsnotify_find_inode_mark_locked( - struct fsnotify_group *group, - struct inode *inode) -{ - struct fsnotify_mark *mark; - - assert_spin_locked(&inode->i_lock); - - hlist_for_each_entry(mark, &inode->i_fsnotify_marks, i.i_list) { - if (mark->group == group) { - fsnotify_get_mark(mark); - return mark; - } - } - return NULL; -} - /* * given a group and inode, find the mark associated with that combination. * if found take a reference to that mark and return it, else return NULL @@ -150,7 +102,7 @@ struct fsnotify_mark *fsnotify_find_inode_mark(struct fsnotify_group *group, struct fsnotify_mark *mark; spin_lock(&inode->i_lock); - mark = fsnotify_find_inode_mark_locked(group, inode); + mark = fsnotify_find_mark(&inode->i_fsnotify_marks, group); spin_unlock(&inode->i_lock); return mark; @@ -168,10 +120,10 @@ void fsnotify_set_inode_mark_mask_locked(struct fsnotify_mark *mark, assert_spin_locked(&mark->lock); if (mask && - mark->i.inode && + mark->inode && !(mark->flags & FSNOTIFY_MARK_FLAG_OBJECT_PINNED)) { mark->flags |= FSNOTIFY_MARK_FLAG_OBJECT_PINNED; - inode = igrab(mark->i.inode); + inode = igrab(mark->inode); /* * we shouldn't be able to get here if the inode wasn't * already safely held in memory. But bug in case it @@ -192,9 +144,7 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark, struct fsnotify_group *group, struct inode *inode, int allow_dups) { - struct fsnotify_mark *lmark, *last = NULL; - int ret = 0; - int cmp; + int ret; mark->flags |= FSNOTIFY_MARK_FLAG_INODE; @@ -202,37 +152,10 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark, assert_spin_locked(&mark->lock); spin_lock(&inode->i_lock); - - mark->i.inode = inode; - - /* is mark the first mark? */ - if (hlist_empty(&inode->i_fsnotify_marks)) { - hlist_add_head_rcu(&mark->i.i_list, &inode->i_fsnotify_marks); - goto out; - } - - /* should mark be in the middle of the current list? */ - hlist_for_each_entry(lmark, &inode->i_fsnotify_marks, i.i_list) { - last = lmark; - - if ((lmark->group == group) && !allow_dups) { - ret = -EEXIST; - goto out; - } - - cmp = fsnotify_compare_groups(lmark->group, mark->group); - if (cmp < 0) - continue; - - hlist_add_before_rcu(&mark->i.i_list, &lmark->i.i_list); - goto out; - } - - BUG_ON(last == NULL); - /* mark should be the last entry. last is the current last entry */ - hlist_add_behind_rcu(&mark->i.i_list, &last->i.i_list); -out: - fsnotify_recalc_inode_mask_locked(inode); + mark->inode = inode; + ret = fsnotify_add_mark_list(&inode->i_fsnotify_marks, mark, + allow_dups); + inode->i_fsnotify_mask = fsnotify_recalc_mask(&inode->i_fsnotify_marks); spin_unlock(&inode->i_lock); return ret; diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c index 7d888d77d59a..2cd900c2c737 100644 --- a/fs/notify/inotify/inotify_fsnotify.c +++ b/fs/notify/inotify/inotify_fsnotify.c @@ -156,7 +156,7 @@ static int idr_callback(int id, void *p, void *data) */ if (fsn_mark) printk(KERN_WARNING "fsn_mark->group=%p inode=%p wd=%d\n", - fsn_mark->group, fsn_mark->i.inode, i_mark->wd); + fsn_mark->group, fsn_mark->inode, i_mark->wd); return 0; } diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c index 283aa312d745..450648697433 100644 --- a/fs/notify/inotify/inotify_user.c +++ b/fs/notify/inotify/inotify_user.c @@ -433,7 +433,7 @@ static void inotify_remove_from_idr(struct fsnotify_group *group, if (wd == -1) { WARN_ONCE(1, "%s: i_mark=%p i_mark->wd=%d i_mark->group=%p" " i_mark->inode=%p\n", __func__, i_mark, i_mark->wd, - i_mark->fsn_mark.group, i_mark->fsn_mark.i.inode); + i_mark->fsn_mark.group, i_mark->fsn_mark.inode); goto out; } @@ -442,7 +442,7 @@ static void inotify_remove_from_idr(struct fsnotify_group *group, if (unlikely(!found_i_mark)) { WARN_ONCE(1, "%s: i_mark=%p i_mark->wd=%d i_mark->group=%p" " i_mark->inode=%p\n", __func__, i_mark, i_mark->wd, - i_mark->fsn_mark.group, i_mark->fsn_mark.i.inode); + i_mark->fsn_mark.group, i_mark->fsn_mark.inode); goto out; } @@ -456,9 +456,9 @@ static void inotify_remove_from_idr(struct fsnotify_group *group, "mark->inode=%p found_i_mark=%p found_i_mark->wd=%d " "found_i_mark->group=%p found_i_mark->inode=%p\n", __func__, i_mark, i_mark->wd, i_mark->fsn_mark.group, - i_mark->fsn_mark.i.inode, found_i_mark, found_i_mark->wd, + i_mark->fsn_mark.inode, found_i_mark, found_i_mark->wd, found_i_mark->fsn_mark.group, - found_i_mark->fsn_mark.i.inode); + found_i_mark->fsn_mark.inode); goto out; } @@ -470,7 +470,7 @@ static void inotify_remove_from_idr(struct fsnotify_group *group, if (unlikely(atomic_read(&i_mark->fsn_mark.refcnt) < 3)) { printk(KERN_ERR "%s: i_mark=%p i_mark->wd=%d i_mark->group=%p" " i_mark->inode=%p\n", __func__, i_mark, i_mark->wd, - i_mark->fsn_mark.group, i_mark->fsn_mark.i.inode); + i_mark->fsn_mark.group, i_mark->fsn_mark.inode); /* we can't really recover with bad ref cnting.. */ BUG(); } diff --git a/fs/notify/mark.c b/fs/notify/mark.c index 34c38fabf514..3942d5c9eb8d 100644 --- a/fs/notify/mark.c +++ b/fs/notify/mark.c @@ -110,6 +110,17 @@ void fsnotify_put_mark(struct fsnotify_mark *mark) } } +/* Calculate mask of events for a list of marks */ +u32 fsnotify_recalc_mask(struct hlist_head *head) +{ + u32 new_mask = 0; + struct fsnotify_mark *mark; + + hlist_for_each_entry(mark, head, obj_list) + new_mask |= mark->mask; + return new_mask; +} + /* * Any time a mark is getting freed we end up here. * The caller had better be holding a reference to this mark so we don't actually @@ -133,7 +144,7 @@ void fsnotify_destroy_mark_locked(struct fsnotify_mark *mark, mark->flags &= ~FSNOTIFY_MARK_FLAG_ALIVE; if (mark->flags & FSNOTIFY_MARK_FLAG_INODE) { - inode = mark->i.inode; + inode = mark->inode; fsnotify_destroy_inode_mark(mark); } else if (mark->flags & FSNOTIFY_MARK_FLAG_VFSMOUNT) fsnotify_destroy_vfsmount_mark(mark); @@ -192,6 +203,27 @@ void fsnotify_destroy_mark(struct fsnotify_mark *mark, mutex_unlock(&group->mark_mutex); } +/* + * Destroy all marks in the given list. The marks must be already detached from + * the original inode / vfsmount. + */ +void fsnotify_destroy_marks(struct list_head *to_free) +{ + struct fsnotify_mark *mark, *lmark; + struct fsnotify_group *group; + + list_for_each_entry_safe(mark, lmark, to_free, free_list) { + spin_lock(&mark->lock); + fsnotify_get_group(mark->group); + group = mark->group; + spin_unlock(&mark->lock); + + fsnotify_destroy_mark(mark, group); + fsnotify_put_mark(mark); + fsnotify_put_group(group); + } +} + void fsnotify_set_mark_mask_locked(struct fsnotify_mark *mark, __u32 mask) { assert_spin_locked(&mark->lock); @@ -245,6 +277,39 @@ int fsnotify_compare_groups(struct fsnotify_group *a, struct fsnotify_group *b) return -1; } +/* Add mark into proper place in given list of marks */ +int fsnotify_add_mark_list(struct hlist_head *head, struct fsnotify_mark *mark, + int allow_dups) +{ + struct fsnotify_mark *lmark, *last = NULL; + int cmp; + + /* is mark the first mark? */ + if (hlist_empty(head)) { + hlist_add_head_rcu(&mark->obj_list, head); + return 0; + } + + /* should mark be in the middle of the current list? */ + hlist_for_each_entry(lmark, head, obj_list) { + last = lmark; + + if ((lmark->group == mark->group) && !allow_dups) + return -EEXIST; + + cmp = fsnotify_compare_groups(lmark->group, mark->group); + if (cmp >= 0) { + hlist_add_before_rcu(&mark->obj_list, &lmark->obj_list); + return 0; + } + } + + BUG_ON(last == NULL); + /* mark should be the last entry. last is the current last entry */ + hlist_add_behind_rcu(&mark->obj_list, &last->obj_list); + return 0; +} + /* * Attach an initialized mark to a given group and fs object. * These marks may be used for the fsnotify backend to determine which @@ -322,6 +387,24 @@ int fsnotify_add_mark(struct fsnotify_mark *mark, struct fsnotify_group *group, return ret; } +/* + * Given a list of marks, find the mark associated with given group. If found + * take a reference to that mark and return it, else return NULL. + */ +struct fsnotify_mark *fsnotify_find_mark(struct hlist_head *head, + struct fsnotify_group *group) +{ + struct fsnotify_mark *mark; + + hlist_for_each_entry(mark, head, obj_list) { + if (mark->group == group) { + fsnotify_get_mark(mark); + return mark; + } + } + return NULL; +} + /* * clear any marks in a group in which mark->flags & flags is true */ @@ -352,8 +435,8 @@ void fsnotify_clear_marks_by_group(struct fsnotify_group *group) void fsnotify_duplicate_mark(struct fsnotify_mark *new, struct fsnotify_mark *old) { assert_spin_locked(&old->lock); - new->i.inode = old->i.inode; - new->m.mnt = old->m.mnt; + new->inode = old->inode; + new->mnt = old->mnt; if (old->group) fsnotify_get_group(old->group); new->group = old->group; diff --git a/fs/notify/vfsmount_mark.c b/fs/notify/vfsmount_mark.c index faefa72a11eb..326b148e623c 100644 --- a/fs/notify/vfsmount_mark.c +++ b/fs/notify/vfsmount_mark.c @@ -32,31 +32,20 @@ void fsnotify_clear_marks_by_mount(struct vfsmount *mnt) { - struct fsnotify_mark *mark, *lmark; + struct fsnotify_mark *mark; struct hlist_node *n; struct mount *m = real_mount(mnt); LIST_HEAD(free_list); spin_lock(&mnt->mnt_root->d_lock); - hlist_for_each_entry_safe(mark, n, &m->mnt_fsnotify_marks, m.m_list) { - list_add(&mark->m.free_m_list, &free_list); - hlist_del_init_rcu(&mark->m.m_list); + hlist_for_each_entry_safe(mark, n, &m->mnt_fsnotify_marks, obj_list) { + list_add(&mark->free_list, &free_list); + hlist_del_init_rcu(&mark->obj_list); fsnotify_get_mark(mark); } spin_unlock(&mnt->mnt_root->d_lock); - list_for_each_entry_safe(mark, lmark, &free_list, m.free_m_list) { - struct fsnotify_group *group; - - spin_lock(&mark->lock); - fsnotify_get_group(mark->group); - group = mark->group; - spin_unlock(&mark->lock); - - fsnotify_destroy_mark(mark, group); - fsnotify_put_mark(mark); - fsnotify_put_group(group); - } + fsnotify_destroy_marks(&free_list); } void fsnotify_clear_vfsmount_marks_by_group(struct fsnotify_group *group) @@ -64,67 +53,36 @@ void fsnotify_clear_vfsmount_marks_by_group(struct fsnotify_group *group) fsnotify_clear_marks_by_group_flags(group, FSNOTIFY_MARK_FLAG_VFSMOUNT); } -/* - * Recalculate the mask of events relevant to a given vfsmount locked. - */ -static void fsnotify_recalc_vfsmount_mask_locked(struct vfsmount *mnt) -{ - struct mount *m = real_mount(mnt); - struct fsnotify_mark *mark; - __u32 new_mask = 0; - - assert_spin_locked(&mnt->mnt_root->d_lock); - - hlist_for_each_entry(mark, &m->mnt_fsnotify_marks, m.m_list) - new_mask |= mark->mask; - m->mnt_fsnotify_mask = new_mask; -} - /* * Recalculate the mnt->mnt_fsnotify_mask, or the mask of all FS_* event types * any notifier is interested in hearing for this mount point */ void fsnotify_recalc_vfsmount_mask(struct vfsmount *mnt) { + struct mount *m = real_mount(mnt); + spin_lock(&mnt->mnt_root->d_lock); - fsnotify_recalc_vfsmount_mask_locked(mnt); + m->mnt_fsnotify_mask = fsnotify_recalc_mask(&m->mnt_fsnotify_marks); spin_unlock(&mnt->mnt_root->d_lock); } void fsnotify_destroy_vfsmount_mark(struct fsnotify_mark *mark) { - struct vfsmount *mnt = mark->m.mnt; + struct vfsmount *mnt = mark->mnt; + struct mount *m = real_mount(mnt); BUG_ON(!mutex_is_locked(&mark->group->mark_mutex)); assert_spin_locked(&mark->lock); spin_lock(&mnt->mnt_root->d_lock); - hlist_del_init_rcu(&mark->m.m_list); - mark->m.mnt = NULL; - - fsnotify_recalc_vfsmount_mask_locked(mnt); + hlist_del_init_rcu(&mark->obj_list); + mark->mnt = NULL; + m->mnt_fsnotify_mask = fsnotify_recalc_mask(&m->mnt_fsnotify_marks); spin_unlock(&mnt->mnt_root->d_lock); } -static struct fsnotify_mark *fsnotify_find_vfsmount_mark_locked(struct fsnotify_group *group, - struct vfsmount *mnt) -{ - struct mount *m = real_mount(mnt); - struct fsnotify_mark *mark; - - assert_spin_locked(&mnt->mnt_root->d_lock); - - hlist_for_each_entry(mark, &m->mnt_fsnotify_marks, m.m_list) { - if (mark->group == group) { - fsnotify_get_mark(mark); - return mark; - } - } - return NULL; -} - /* * given a group and vfsmount, find the mark associated with that combination. * if found take a reference to that mark and return it, else return NULL @@ -132,10 +90,11 @@ static struct fsnotify_mark *fsnotify_find_vfsmount_mark_locked(struct fsnotify_ struct fsnotify_mark *fsnotify_find_vfsmount_mark(struct fsnotify_group *group, struct vfsmount *mnt) { + struct mount *m = real_mount(mnt); struct fsnotify_mark *mark; spin_lock(&mnt->mnt_root->d_lock); - mark = fsnotify_find_vfsmount_mark_locked(group, mnt); + mark = fsnotify_find_mark(&m->mnt_fsnotify_marks, group); spin_unlock(&mnt->mnt_root->d_lock); return mark; @@ -151,9 +110,7 @@ int fsnotify_add_vfsmount_mark(struct fsnotify_mark *mark, int allow_dups) { struct mount *m = real_mount(mnt); - struct fsnotify_mark *lmark, *last = NULL; - int ret = 0; - int cmp; + int ret; mark->flags |= FSNOTIFY_MARK_FLAG_VFSMOUNT; @@ -161,37 +118,9 @@ int fsnotify_add_vfsmount_mark(struct fsnotify_mark *mark, assert_spin_locked(&mark->lock); spin_lock(&mnt->mnt_root->d_lock); - - mark->m.mnt = mnt; - - /* is mark the first mark? */ - if (hlist_empty(&m->mnt_fsnotify_marks)) { - hlist_add_head_rcu(&mark->m.m_list, &m->mnt_fsnotify_marks); - goto out; - } - - /* should mark be in the middle of the current list? */ - hlist_for_each_entry(lmark, &m->mnt_fsnotify_marks, m.m_list) { - last = lmark; - - if ((lmark->group == group) && !allow_dups) { - ret = -EEXIST; - goto out; - } - - cmp = fsnotify_compare_groups(lmark->group, mark->group); - if (cmp < 0) - continue; - - hlist_add_before_rcu(&mark->m.m_list, &lmark->m.m_list); - goto out; - } - - BUG_ON(last == NULL); - /* mark should be the last entry. last is the current last entry */ - hlist_add_behind_rcu(&mark->m.m_list, &last->m.m_list); -out: - fsnotify_recalc_vfsmount_mask_locked(mnt); + mark->mnt = mnt; + ret = fsnotify_add_mark_list(&m->mnt_fsnotify_marks, mark, allow_dups); + m->mnt_fsnotify_mask = fsnotify_recalc_mask(&m->mnt_fsnotify_marks); spin_unlock(&mnt->mnt_root->d_lock); return ret; diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h index ca060d7c4fa6..442847a02b8f 100644 --- a/include/linux/fsnotify_backend.h +++ b/include/linux/fsnotify_backend.h @@ -196,24 +196,6 @@ struct fsnotify_group { #define FSNOTIFY_EVENT_PATH 1 #define FSNOTIFY_EVENT_INODE 2 -/* - * Inode specific fields in an fsnotify_mark - */ -struct fsnotify_inode_mark { - struct inode *inode; /* inode this mark is associated with */ - struct hlist_node i_list; /* list of marks by inode->i_fsnotify_marks */ - struct list_head free_i_list; /* tmp list used when freeing this mark */ -}; - -/* - * Mount point specific fields in an fsnotify_mark - */ -struct fsnotify_vfsmount_mark { - struct vfsmount *mnt; /* vfsmount this mark is associated with */ - struct hlist_node m_list; /* list of marks by inode->i_fsnotify_marks */ - struct list_head free_m_list; /* tmp list used when freeing this mark */ -}; - /* * a mark is simply an object attached to an in core inode which allows an * fsnotify listener to indicate they are either no longer interested in events @@ -232,9 +214,11 @@ struct fsnotify_mark { struct fsnotify_group *group; /* group this mark is for */ struct list_head g_list; /* list of marks by group->i_fsnotify_marks */ spinlock_t lock; /* protect group and inode */ + struct hlist_node obj_list; /* list of marks for inode / vfsmount */ + struct list_head free_list; /* tmp list used when freeing this mark */ union { - struct fsnotify_inode_mark i; - struct fsnotify_vfsmount_mark m; + struct inode *inode; /* inode this mark is associated with */ + struct vfsmount *mnt; /* vfsmount this mark is associated with */ }; __u32 ignored_mask; /* events types to ignore */ #define FSNOTIFY_MARK_FLAG_INODE 0x01 diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index 80f29e015570..2e0c97427b33 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -174,9 +174,9 @@ static void insert_hash(struct audit_chunk *chunk) struct fsnotify_mark *entry = &chunk->mark; struct list_head *list; - if (!entry->i.inode) + if (!entry->inode) return; - list = chunk_hash(entry->i.inode); + list = chunk_hash(entry->inode); list_add_rcu(&chunk->hash, list); } @@ -188,7 +188,7 @@ struct audit_chunk *audit_tree_lookup(const struct inode *inode) list_for_each_entry_rcu(p, list, hash) { /* mark.inode may have gone NULL, but who cares? */ - if (p->mark.i.inode == inode) { + if (p->mark.inode == inode) { atomic_long_inc(&p->refs); return p; } @@ -231,7 +231,7 @@ static void untag_chunk(struct node *p) new = alloc_chunk(size); spin_lock(&entry->lock); - if (chunk->dead || !entry->i.inode) { + if (chunk->dead || !entry->inode) { spin_unlock(&entry->lock); if (new) free_chunk(new); @@ -258,7 +258,7 @@ static void untag_chunk(struct node *p) goto Fallback; fsnotify_duplicate_mark(&new->mark, entry); - if (fsnotify_add_mark(&new->mark, new->mark.group, new->mark.i.inode, NULL, 1)) { + if (fsnotify_add_mark(&new->mark, new->mark.group, new->mark.inode, NULL, 1)) { fsnotify_put_mark(&new->mark); goto Fallback; } @@ -386,7 +386,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) chunk_entry = &chunk->mark; spin_lock(&old_entry->lock); - if (!old_entry->i.inode) { + if (!old_entry->inode) { /* old_entry is being shot, lets just lie */ spin_unlock(&old_entry->lock); fsnotify_put_mark(old_entry); @@ -395,7 +395,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree) } fsnotify_duplicate_mark(chunk_entry, old_entry); - if (fsnotify_add_mark(chunk_entry, chunk_entry->group, chunk_entry->i.inode, NULL, 1)) { + if (fsnotify_add_mark(chunk_entry, chunk_entry->group, chunk_entry->inode, NULL, 1)) { spin_unlock(&old_entry->lock); fsnotify_put_mark(chunk_entry); fsnotify_put_mark(old_entry); @@ -611,7 +611,7 @@ void audit_trim_trees(void) list_for_each_entry(node, &tree->chunks, list) { struct audit_chunk *chunk = find_chunk(node); /* this could be NULL if the watch is dying else where... */ - struct inode *inode = chunk->mark.i.inode; + struct inode *inode = chunk->mark.inode; node->index |= 1U<<31; if (iterate_mounts(compare_root, inode, root_mnt)) node->index &= ~(1U<<31); -- cgit v1.2.3 From 37d469e7673a663cbf38360beb1eaa3224c9d272 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Fri, 12 Dec 2014 16:58:39 -0800 Subject: fsnotify: remove destroy_list from fsnotify_mark destroy_list is used to track marks which still need waiting for srcu period end before they can be freed. However by the time mark is added to destroy_list it isn't in group's list of marks anymore and thus we can reuse fsnotify_mark->g_list for queueing into destroy_list. This saves two pointers for each fsnotify_mark. Signed-off-by: Jan Kara Cc: Eric Paris Cc: Heinrich Schuchardt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/notify/mark.c | 8 ++++---- include/linux/fsnotify_backend.h | 7 +++++-- 2 files changed, 9 insertions(+), 6 deletions(-) (limited to 'include/linux') diff --git a/fs/notify/mark.c b/fs/notify/mark.c index 3942d5c9eb8d..92e48c70f0f0 100644 --- a/fs/notify/mark.c +++ b/fs/notify/mark.c @@ -161,7 +161,7 @@ void fsnotify_destroy_mark_locked(struct fsnotify_mark *mark, mutex_unlock(&group->mark_mutex); spin_lock(&destroy_lock); - list_add(&mark->destroy_list, &destroy_list); + list_add(&mark->g_list, &destroy_list); spin_unlock(&destroy_lock); wake_up(&destroy_waitq); /* @@ -370,7 +370,7 @@ err: spin_unlock(&mark->lock); spin_lock(&destroy_lock); - list_add(&mark->destroy_list, &destroy_list); + list_add(&mark->g_list, &destroy_list); spin_unlock(&destroy_lock); wake_up(&destroy_waitq); @@ -469,8 +469,8 @@ static int fsnotify_mark_destroy(void *ignored) synchronize_srcu(&fsnotify_mark_srcu); - list_for_each_entry_safe(mark, next, &private_destroy_list, destroy_list) { - list_del_init(&mark->destroy_list); + list_for_each_entry_safe(mark, next, &private_destroy_list, g_list) { + list_del_init(&mark->g_list); fsnotify_put_mark(mark); } diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h index 442847a02b8f..0f313f93c586 100644 --- a/include/linux/fsnotify_backend.h +++ b/include/linux/fsnotify_backend.h @@ -212,7 +212,11 @@ struct fsnotify_mark { * in kernel that found and may be using this mark. */ atomic_t refcnt; /* active things looking at this mark */ struct fsnotify_group *group; /* group this mark is for */ - struct list_head g_list; /* list of marks by group->i_fsnotify_marks */ + struct list_head g_list; /* list of marks by group->i_fsnotify_marks + * Also reused for queueing mark into + * destroy_list when it's waiting for + * the end of SRCU period before it can + * be freed */ spinlock_t lock; /* protect group and inode */ struct hlist_node obj_list; /* list of marks for inode / vfsmount */ struct list_head free_list; /* tmp list used when freeing this mark */ @@ -227,7 +231,6 @@ struct fsnotify_mark { #define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY 0x08 #define FSNOTIFY_MARK_FLAG_ALIVE 0x10 unsigned int flags; /* vfsmount or inode mark? */ - struct list_head destroy_list; void (*free_mark)(struct fsnotify_mark *mark); /* called on final put+free */ }; -- cgit v1.2.3 From 6c51ec4d18d24b2ffa69de5d60bebaeb4f8e2398 Mon Sep 17 00:00:00 2001 From: Christoph Lameter Date: Fri, 12 Dec 2014 16:58:42 -0800 Subject: percpu: remove __get_cpu_var and __raw_get_cpu_var macros No user is left in the kernel source tree. Therefore we can drop the definitions. This is the final merge of the transition away from __get_cpu_var. After this patch the kernel will not build if anyone uses __get_cpu_var. Signed-off-by: Christoph Lameter Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/percpu-defs.h | 2 -- 1 file changed, 2 deletions(-) (limited to 'include/linux') diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h index 420032d41d27..57f3a1c550dc 100644 --- a/include/linux/percpu-defs.h +++ b/include/linux/percpu-defs.h @@ -254,8 +254,6 @@ do { \ #endif /* CONFIG_SMP */ #define per_cpu(var, cpu) (*per_cpu_ptr(&(var), cpu)) -#define __raw_get_cpu_var(var) (*raw_cpu_ptr(&(var))) -#define __get_cpu_var(var) (*this_cpu_ptr(&(var))) /* * Must be an lvalue. Since @var must be a simple identifier, -- cgit v1.2.3