user/sven/linux.git/ipc/shm.c, branch v3.8

mm: support more pagesizes for MAP_HUGETLB/SHM_HUGETLB

2012-12-12T01:22:25Z

There was some desire in large applications using MAP_HUGETLB or SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on others. This is useful together with NUMA policy: use 2MB interleaving on some mappings, but 1GB on local mappings. This patch extends the IPC/SHM syscall interfaces slightly to allow specifying the page size. It borrows some upper bits in the existing flag arguments and allows encoding the log of the desired page size in addition to the *_HUGETLB flag. When 0 is specified the default size is used, this makes the change fully compatible. Extending the internal hugetlb code to handle this is straight forward. Instead of a single mount it just keeps an array of them and selects the right mount based on the specified page size. When no page size is specified it uses the mount of the default page size. The change is not visible in /proc/mounts because internal mounts don't appear there. It also has very little overhead: the additional mounts just consume a super block, but not more memory when not used. I also exported the new flags to the user headers (they were previously under __KERNEL__). Right now only symbols for x86 and some other architecture for 1GB and 2MB are defined. The interface should already work for all other architectures though. Only architectures that define multiple hugetlb sizes actually need it (that is currently x86, tile, powerpc). However tile and powerpc have user configurable hugetlb sizes, so it's not easy to add defines. A program on those architectures would need to query sysfs and use the appropiate log2. [akpm@linux-foundation.org: cleanups] [rientjes@google.com: fix build] [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Andi Kleen Cc: Michael Kerrisk Acked-by: Rik van Riel Acked-by: KAMEZAWA Hiroyuki Cc: Hillf Danton Signed-off-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

userns: Convert ipc to use kuid and kgid where appropriate

2012-09-07T05:17:20Z

- Store the ipc owner and creator with a kuid - Store the ipc group and the crators group with a kgid. - Add error handling to ipc_update_perms, allowing it to fail if the uids and gids can not be converted to kuids or kgids. - Modify the proc files to display the ipc creator and owner in the user namespace of the opener of the proc file. Signed-off-by: Eric W. Biederman

ipc: add COMPAT_SHMLBA support

2012-07-31T00:25:20Z

If the SHMLBA definition for a native task differs from the definition for a compat task, the do_shmat() function would need to handle both. This patch introduces COMPAT_SHMLBA, which is used by the compat shmat syscall when calling the ipc code and allows architectures such as AArch64 (where the native SHMLBA is 64k but the compat (AArch32) definition is 16k) to provide the correct semantics for compat IPC system calls. Cc: David S. Miller Cc: Chris Zankel Cc: Arnd Bergmann Acked-by: Catalin Marinas Signed-off-by: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

ipc: shm: restore MADV_REMOVE functionality on shared memory segments

2012-06-07T21:43:55Z

Commit 17cf28afea2a ("mm/fs: remove truncate_range") removed the truncate_range inode operation in favour of the fallocate file operation. When using SYSV IPC shared memory segments, calling madvise with the MADV_REMOVE advice on an area of shared memory will attempt to invoke the .fallocate function for the shm_file_operations, which is NULL and therefore returns -EOPNOTSUPP to userspace. The previous behaviour would inherit the inode_operations from the underlying tmpfs file and invoke truncate_range there. This patch restores the previous behaviour by wrapping the underlying fallocate function in shm_fallocate, as we do for fsync. [hughd@google.com: use -ENOTSUPP in shm_fallocate()] Signed-off-by: Will Deacon Acked-by: Hugh Dickins Signed-off-by: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

switch aio and shm to do_mmap_pgoff(), make do_mmap() static

2012-06-01T14:37:17Z

after all, 0 bytes and 0 pages is the same thing... Signed-off-by: Al Viro

take security_mmap_file() outside of ->mmap_sem

2012-06-01T14:37:01Z

Signed-off-by: Al Viro

hugetlbfs: fix alignment of huge page requests

2012-03-22T00:54:59Z

When calling shmget() with SHM_HUGETLB, shmget aligns the request size to PAGE_SIZE, but this is not sufficient. Modify hugetlb_file_setup() to align requests to the huge page size, and to accept an address argument so that all alignment checks can be performed in hugetlb_file_setup(), rather than in its callers. Change newseg() and mmap_pgoff() to match the new prototype and eliminate a now redundant alignment check. [akpm@linux-foundation.org: fix build] Signed-off-by: Steven Truelove Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

SHM_UNLOCK: fix Unevictable pages stranded after swap

2012-01-23T16:38:48Z

Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in find_get_pages") correctly fixed an infinite loop; but left a problem that find_get_pages() on shmem would return 0 (appearing to callers to mean end of tree) when it meets a run of nr_pages swap entries. The only uses of find_get_pages() on shmem are via pagevec_lookup(), called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's scan_mapping_unevictable_pages(). The first is already commented, and not worth worrying about; but the second can leave pages on the Unevictable list after an unusual sequence of swapping and locking. Fix that by using shmem_find_get_pages_and_swap() (then ignoring the swap) instead of pagevec_lookup(). But I don't want to contaminate vmscan.c with shmem internals, nor shmem.c with LRU locking. So move scan_mapping_unevictable_pages() into shmem.c, renaming it shmem_unlock_mapping(); and rename check_move_unevictable_page() to check_move_unevictable_pages(), looping down an array of pages, oftentimes under the same lock. Leave out the "rotate unevictable list" block: that's a leftover from when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed handling involved looking at pages at tail of LRU. Was there significance to the sequence first ClearPageUnevictable, then test page_evictable, then SetPageUnevictable here? I think not, we're under LRU lock, and have no barriers between those. Signed-off-by: Hugh Dickins Reviewed-by: KOSAKI Motohiro Cc: Minchan Kim Cc: Rik van Riel Cc: Shaohua Li Cc: Eric Dumazet Cc: Johannes Weiner Cc: Michel Lespinasse Cc: [back to 3.1 but will need respins] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

SHM_UNLOCK: fix long unpreemptible section

2012-01-23T16:38:48Z

scan_mapping_unevictable_pages() is used to make SysV SHM_LOCKed pages evictable again once the shared memory is unlocked. It does this with pagevec_lookup()s across the whole object (which might occupy most of memory), and takes 300ms to unlock 7GB here. A cond_resched() every PAGEVEC_SIZE pages would be good. However, KOSAKI-san points out that this is called under shmem.c's info->lock, and it's also under shm.c's shm_lock(), both spinlocks. There is no strong reason for that: we need to take these pages off the unevictable list soonish, but those locks are not required for it. So move the call to scan_mapping_unevictable_pages() from shmem.c's unlock handling up to shm.c's unlock handling. Remove the recently added barrier, not needed now we have spin_unlock() before the scan. Use get_file(), with subsequent fput(), to make sure we have a reference to mapping throughout scan_mapping_unevictable_pages(): that's something that was previously guaranteed by the shm_lock(). Remove shmctl's lru_add_drain_all(): we don't fault in pages at SHM_LOCK time, and we lazily discover them to be Unevictable later, so it serves no purpose for SHM_LOCK; and serves no purpose for SHM_UNLOCK, since pages still on pagevec are not marked Unevictable. The original code avoided redundant rescans by checking VM_LOCKED flag at its level: now avoid them by checking shp's SHM_LOCKED. The original code called scan_mapping_unevictable_pages() on a locked area at shm_destroy() time: perhaps we once had accounting cross-checks which required that, but not now, so skip the overhead and just let inode eviction deal with them. Put check_move_unevictable_page() and scan_mapping_unevictable_pages() under CONFIG_SHMEM (with stub for the TINY case when ramfs is used), more as comment than to save space; comment them used for SHM_UNLOCK. Signed-off-by: Hugh Dickins Reviewed-by: KOSAKI Motohiro Cc: Minchan Kim Cc: Rik van Riel Cc: Shaohua Li Cc: Eric Dumazet Cc: Johannes Weiner Cc: Michel Lespinasse Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

Do 'shm_init_ns()' in an early pure_initcall

2011-08-05T05:35:59Z

This isn't really critical any more, since other patches (commit 298507d4d2cf: "shm: optimize exit_shm()") have caused us to not actually need to touch the rw_mutex unless there are actual shm segments associated with the namespace, but we really should do tne shm_init_ns() earlier than we do now. This, together with commit 288d5abec831 ("Boot up with usermodehelper disabled") will mean that we really do initialize the initial ipc namespace data structure before we run any tasks. Tested-by: Marc Zyngier Signed-off-by: Linus Torvalds