user/sven/linux.git - Linux Kernel

Age	Commit message	Author
2022-09-26	mm/msync: use vma_find() instead of vma linked list	Liam R. Howlett
	Remove a single use of the vma linked list in preparation for the removal of the linked list. Uses find_vma() to get the next element. Link: https://lkml.kernel.org/r/20220906194824.2110408-61-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2021-04-30	mm/msync: exit early when the flags is an MS_ASYNC and start < vm_start	Nikita Ermakov
	If an unmapped region was found and the flag is MS_ASYNC (without MS_INVALIDATE) there is nothing to do and the result would be always -ENOMEM, so return immediately. Link: https://lkml.kernel.org/r/20201025092901.56399-1-sh1r4s3@mail.si-head.nl Signed-off-by: Nikita Ermakov <sh1r4s3@mail.si-head.nl> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09	mmap locking API: use coccinelle to convert mmap_sem rwsem call sites	Michel Lespinasse
	This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule: // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir . @@ expression mm; @@ ( -init_rwsem +mmap_init_lock \| -down_write +mmap_write_lock \| -down_write_killable +mmap_write_lock_killable \| -down_write_trylock +mmap_write_trylock \| -up_write +mmap_write_unlock \| -downgrade_write +mmap_write_downgrade \| -down_read +mmap_read_lock \| -down_read_killable +mmap_read_lock_killable \| -down_read_trylock +mmap_read_trylock \| -up_read +mmap_read_unlock ) -(&mm->mmap_sem) +(mm) Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25	mm: untag user pointers passed to memory syscalls	Andrey Konovalov
	This patch is a part of a series that extends kernel ABI to allow to pass tagged user pointers (with the top byte set to something else other than 0x00) as syscall arguments. This patch allows tagged pointers to be passed to the following memory syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect, mremap, msync, munlock, move_pages. The mmap and mremap syscalls do not currently accept tagged addresses. Architectures may interpret the tag as a background colour for the corresponding vma. Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Eric Auger <eric.auger@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jens Wiklander <jens.wiklander@linaro.org> Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
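A minimal sketch of what untagging a user pointer amounts to, assuming the tag occupies the top byte (bits 63..56) of a 64-bit address. The helper name `untag_user_addr` and the `TAG_SHIFT` constant are hypothetical; the kernel's real helper is architecture-specific and may be implemented differently.

```c
#include <stdint.h>

/* Assumed layout: the tag lives in bits 63..56 of a 64-bit user address. */
#define TAG_SHIFT 56

/*
 * Hypothetical helper: clear the tag byte so the address can be compared
 * against VMA boundaries. For user addresses the remaining top bits are 0.
 */
uint64_t untag_user_addr(uint64_t addr)
{
	return addr & ((UINT64_C(1) << TAG_SHIFT) - 1);
}
```

A memory syscall such as msync would apply this kind of step to its address argument once, on entry, before any range checks.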
2017-11-02	License cleanup: add SPDX GPL-2.0 license identifier to files with no license	Greg Kroah-Hartman
	Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a /uapi/ one with no licensing information in it, - file was a /uapi/ one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. 
- Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non /uapi/ files that summary was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a /uapi/ path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the /uapi/ ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------\|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). 
- when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. 
Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-11-05	mm/msync: use offset_in_page macro	Alexander Kuleshov
	linux/mm.h provides offset_in_page() macro. Let's use already predefined macro instead of (addr & ~PAGE_MASK). Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
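The equivalence the message relies on can be sketched in userspace as follows. The 4 KiB page size is an assumption for illustration (the real value is per-architecture), and `start_is_page_aligned` is a hypothetical helper, not a kernel function.

```c
/* Assumed values for illustration; the real ones are per-architecture. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Same result as the open-coded (addr & ~PAGE_MASK). */
#define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK)

/* Hypothetical check: msync() rejects a start that is not page-aligned. */
int start_is_page_aligned(unsigned long start)
{
	return offset_in_page(start) == 0;
}
```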
2015-02-10	mm: remove rest usage of VM_NONLINEAR and pte_file()	Kirill A. Shutemov
	One bit in ->vm_flags is unused now! Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-03	msync: fix incorrect fstart calculation	Namjae Jeon
	Fix a regression caused by 7fc34a62ca44 ("mm/msync.c: sync only the requested range in msync()"). xfstests generic/075 failed on ext4 in data=journal mode because the intended range was not synced due to a wrong fstart calculation. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Reported-by: Eric Whitney <enwlinux@gmail.com> Tested-by: Eric Whitney <enwlinux@gmail.com> Acked-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Tested-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04	mm/msync.c: sync only the requested range in msync()	Matthew Wilcox
	msync() currently syncs more than POSIX requires or BSD or Solaris implement. It is supposed to be equivalent to fdatasync(), not fsync(), and it is only supposed to sync the portion of the file that overlaps the range passed to msync. If the VMA is non-linear, fall back to syncing the entire file, but we still optimise to only fdatasync() the entire file, not the full fsync(). akpm: there are obvious concerns with back-compatibility: is anyone relying on the undocumented side-effect for their data integrity? And how would they ever know if this change broke their data integrity? We think the risk is reasonably low, and this patch brings the kernel into line with other OS's and with what the manpage has always said... Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Chris Mason <clm@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
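One way to picture "the portion of the file that overlaps the range" is the mapping from an address range to a byte range in the backing file. The sketch below is a userspace model under stated assumptions: `struct vma_sketch`, `struct file_range` and `msync_file_range` are hypothetical names, pages are assumed to be 4 KiB, and the start offset is taken relative to the start of the mapping.

```c
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumed 4 KiB pages */

/* Hypothetical, trimmed-down stand-in for struct vm_area_struct. */
struct vma_sketch {
	unsigned long vm_start;	/* first mapped address */
	unsigned long vm_end;	/* one past the last mapped address */
	unsigned long vm_pgoff;	/* file offset of vm_start, in pages */
};

struct file_range {
	int64_t fstart;	/* first byte to sync */
	int64_t fend;	/* last byte to sync, inclusive */
};

/*
 * Map the part of [start, end) that lies inside the VMA to a byte range
 * in the backing file. The caller guarantees vm_start <= start < vm_end.
 */
struct file_range msync_file_range(const struct vma_sketch *vma,
				   unsigned long start, unsigned long end)
{
	unsigned long stop = end < vma->vm_end ? end : vma->vm_end;
	struct file_range r;

	r.fstart = (int64_t)(start - vma->vm_start) +
		   ((int64_t)vma->vm_pgoff << PAGE_SHIFT);
	r.fend = r.fstart + (int64_t)(stop - start) - 1;
	return r;
}
```

The resulting pair is the kind of argument a ranged, data-only sync would take, instead of flushing the whole file.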
2010-05-21	sanitize vfs_fsync calling conventions	Christoph Hellwig
	Now that the last user passing a NULL file pointer is gone we can remove the redundant dentry argument and associated hacks inside vfs_fsync_range. The next step will be removing the dentry argument from ->fsync, but given the luck with the last round of method prototype changes I'd rather defer this until after the main merge window. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-01-14	[CVE-2009-0029] System call wrappers part 13	Heiko Carstens
	Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-05	add a vfs_fsync helper	Christoph Hellwig
	Fsync currently has a fdatawrite/fdatawait pair around the method call, and a mutex_lock/unlock of the inode mutex. All callers of fsync have to duplicate this, but we have a few and most of them don't quite get it right. This patch adds a new vfs_fsync that takes care of this. It's a little more complicated than usual as ->fsync might get a NULL file pointer and just a dentry from nfsd, but otherwise gets a file and we want to take the mapping and file operations from it when it is there. Notes on the fsync callers: - ecryptfs wasn't calling filemap_fdatawrite / filemap_fdatawait on the lower file - coda wasn't calling filemap_fdatawrite / filemap_fdatawait on the host file, and returning 0 when ->fsync was missing - shm wasn't calling either filemap_fdatawrite / filemap_fdatawait nor taking i_mutex. Now given that shared memory doesn't have disk backing not doing anything in fsync seems fine and I left it out of the vfs_fsync conversion for now, but in that case we might just not pass it through to the lower file at all but just call the no-op simple_sync_file directly. [and now actually export vfs_fsync] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2007-05-21	Detach sched.h from mm.h	Alexey Dobriyan
	First thing mm.h does is including sched.h solely for can_do_mlock() inline function which has "current" dereference inside. By dealing with can_do_mlock() mm.h can be detached from sched.h which is good. See below, why. This patch a) removes unconditional inclusion of sched.h from mm.h b) makes can_do_mlock() normal function in mm/mlock.c c) exports can_do_mlock() to not break compilation d) adds sched.h inclusions back to files that were getting it indirectly. e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were getting them indirectly Net result is: a) mm.h users would get less code to open, read, preprocess, parse, ... if they don't need sched.h b) sched.h stops being dependency for significant number of files: on x86_64 allmodconfig touching sched.h results in recompile of 4083 files, after patch it's only 3744 (-8.3%). Cross-compile tested on all arm defconfigs, all mips defconfigs, all powerpc defconfigs, alpha alpha-up arm i386 i386-up i386-defconfig i386-allnoconfig ia64 ia64-up m68k mips parisc parisc-up powerpc powerpc-up s390 s390-up sparc sparc-up sparc64 sparc64-up um-x86_64 x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig as well as my two usual configs. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2006-09-26	[PATCH] mm: msync() cleanup	Peter Zijlstra
	With the tracking of dirty pages properly done now, msync doesn't need to scan the PTEs anymore to determine the dirty status. From: Hugh Dickins <hugh@veritas.com> In looking to do that, I made some other tidyups: can remove several #includes, and sys_msync loop termination not quite right. Most of those points are criticisms of the existing sys_msync, not of your patch. In particular, the loop termination errors were introduced in 2.6.17: I did notice this shortly before it came out, but decided I was more likely to get it wrong myself, and make matters worse if I tried to rush a last-minute fix in. And it's not terribly likely to go wrong, nor disastrous if it does go wrong (may miss reporting an unmapped area; may also fsync file of a following vma). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23	[PATCH] Kill PF_SYNCWRITE flag	Jens Axboe
	A process flag to indicate whether we are doing sync io is incredibly ugly. It also causes performance problems when one does a lot of async io and then proceeds to sync it. Part of the io will go out as async, and the other part as sync. This causes a disconnect between the previously submitted io and the synced io. For io schedulers such as CFQ, this will cause us lost merges and suboptimal behaviour in scheduling. Remove PF_SYNCWRITE completely from the fsync/msync paths, and let the O_DIRECT path just directly indicate that the writes are sync by using WRITE_SYNC instead. Signed-off-by: Jens Axboe <axboe@suse.de>
2006-03-24	The comment describing how MS_ASYNC works in msync.c is confusing	Amos Waterland
	because of a typo. This patch just changes "my" to "by", which I believe was the original intent. Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-03-24	[PATCH] msync(): use do_fsync()	Andrew Morton
	No need to duplicate all that code. Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24	[PATCH] msync: fix return value	Andrew Morton
	msync() does a strange thing. Essentially: vma = find_vma(); for ( ; ; ) { if (!vma) return -ENOMEM; ... vma = vma->vm_next; } so an msync() request which starts within or before a valid VMA and which ends within or beyond the final VMA will incorrectly return -ENOMEM. Fix. Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
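A sketch of a walk that avoids the spurious error, modelled on a sorted array rather than the kernel's own VMA data structure. All names here (`struct vma_sketch`, `find_vma_sketch`, `msync_walk`) are hypothetical. The point is that the "have we covered the range?" test runs before the next lookup, so running out of areas only counts as an error while part of the range is still uncovered.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA: the half-open range [vm_start, vm_end). */
struct vma_sketch {
	unsigned long vm_start;
	unsigned long vm_end;
};

/* First area whose end lies above addr, or NULL (same contract as find_vma). */
const struct vma_sketch *find_vma_sketch(const struct vma_sketch *v, size_t n,
					 unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < v[i].vm_end)
			return &v[i];
	return NULL;
}

/* Returns 0 if [start, end) is fully mapped, -ENOMEM if it contains a hole. */
int msync_walk(const struct vma_sketch *v, size_t n,
	       unsigned long start, unsigned long end)
{
	int unmapped_error = 0;

	while (start < end) {
		const struct vma_sketch *vma = find_vma_sketch(v, n, start);

		if (!vma)
			return -ENOMEM;	/* nothing mapped at or above start */
		if (start < vma->vm_start) {
			/* hole before this area: skip it, remember the error */
			start = vma->vm_start;
			if (start >= end)
				return -ENOMEM;
			unmapped_error = -ENOMEM;
		}
		/* a real implementation would write back the overlap here */
		start = vma->vm_end;
	}
	return unmapped_error;
}
```

With areas at [0x1000, 0x3000) and [0x5000, 0x7000), a request for [0x5000, 0x6000) returns 0 even though the walk has reached the last area, while [0x1000, 0x7000) still reports the hole between the two areas.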
2006-03-24	[PATCH] msync(MS_SYNC): don't hold mmap_sem while syncing	Andrew Morton
	It seems bad to hold mmap_sem while performing synchronous disk I/O. Alter the msync(MS_SYNC) code so that the lock is released while we sync the file. Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24	[PATCH] msync(): perform dirty page levelling	Andrew Morton
	It seems sensible to perform dirty page throttling in msync: as the application dirties pages we can kick off pdflush early, or even force the msync() caller to perform writeout, or even throttle the msync() caller. The main effect of this is to start disk writeback earlier if we've just discovered that a large amount of pagecache has been dirtied. (Otherwise it wouldn't happen for up to five seconds, next time pdflush wakes up). It also will cause the page-dirtying process to get penalised for dirtying those pages rather than whacking someone else with the problem. We should do this for munmap() and possibly even exit(), too. We drop the mmap_sem while performing the dirty page balancing. It doesn't seem right to hold mmap_sem for that long. Note that this patch only affects MS_ASYNC. MS_SYNC will be syncing all the dirty pages anyway. We note that msync(MS_SYNC) does a full-file-sync inside mmap_sem, and always has. We can fix that up... The patch also tightens up the mmap_sem coverage in sys_msync(): no point in taking it while we perform the incoming arg checking. Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-09	[PATCH] mutex subsystem, semaphore to mutex: VFS, ->i_sem	Jes Sorensen
	This patch converts the inode semaphore to a mutex. I have tested it on XFS and compiled as much as one can consider on an ia64. Anyway your luck with it might be different. Modified-by: Ingo Molnar <mingo@elte.hu> (finished the conversion) Signed-off-by: Jes Sorensen <jes@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2005-11-28	mm: re-architect the VM_UNPAGED logic	Linus Torvalds
	This replaces the (in my opinion horrible) VM_UNPAGED logic with very explicit support for a "remapped page range" aka VM_PFNMAP. It allows a VM area to contain an arbitrary range of page table entries that the VM never touches, and never considers to be normal pages. Any user of "remap_pfn_range()" automatically gets this new functionality, and doesn't even have to mark the pages reserved or indeed mark them any other way. It just works. As a side effect, doing mmap() on /dev/mem works for arbitrary ranges. Sparc update from David in the next commit. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-22	[PATCH] unpaged: VM_UNPAGED	Hugh Dickins
	Although we tend to associate VM_RESERVED with remap_pfn_range, quite a few drivers set VM_RESERVED on areas which are then populated by nopage. The PageReserved removal in 2.6.15-rc1 changed VM_RESERVED not to free pages in zap_pte_range, without changing those drivers not to set it: so their pages just leak away. Let's not change miscellaneous drivers now: introduce VM_UNPAGED at the core, to flag the special areas where the ptes may have no struct page, or if they have then it's not to be touched. Replace most instances of VM_RESERVED in core mm by VM_UNPAGED. Force it on in remap_pfn_range, and the sparc and sparc64 io_remap_pfn_range. Revert addition of VM_RESERVED to powerpc vdso, it's not needed there. Is it needed anywhere? It still governs the mm->reserved_vm statistic, and special vmas not to be merged, and areas not to be core dumped; but could probably be eliminated later (the drivers are probably specifying it because in 2.4 it kept swapout off the vma, but in 2.6 we work from the LRU, which these pages don't get on). Use the VM_SHM slot for VM_UNPAGED, and define VM_SHM to 0: it serves no purpose whatsoever, and should be removed from drivers when we clean up. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29	[PATCH] mm: pte_offset_map_lock loops	Hugh Dickins
	Convert those common loops using page_table_lock on the outside and pte_offset_map within to use just pte_offset_map_lock within instead. These all hold mmap_sem (some exclusively, some not), so at no level can a page table be whipped away from beneath them. But whereas pte_alloc loops tested with the "atomic" pmd_present, these loops are testing with pmd_none, which on i386 PAE tests both lower and upper halves. That's now unsafe, so add a cast into pmd_none to test only the vital lower half: we lose a little sensitivity to a corrupt middle directory, but not enough to worry about. It appears that i386 and UML were the only architectures vulnerable in this way, and pgd and pud no problem. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29	[PATCH] core remove PageReserved	Nick Piggin
	Remove PageReserved() calls from core code by tightening VM_RESERVED handling in mm/ to cover PageReserved functionality. PageReserved special casing is removed from get_page and put_page. All setting and clearing of PageReserved is retained, and it is now flagged in the page_alloc checks to help ensure we don't introduce any refcount based freeing of Reserved pages. MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being deprecated. We never completely handled it correctly anyway, and it can be reintroduced in future if required (Hugh has a proof of concept). Once PageReserved() calls are removed from kernel/power/swsusp.c, and all arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can be trivially removed. Last real user of PageReserved is swsusp, which uses PageReserved to determine whether a struct page points to valid memory or not. This still needs to be addressed (a generic page_is_ram() should work). A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and thus mapcounted and count towards shared rss). These writes to the struct page could cause excessive cacheline bouncing on big systems. There are a number of ways this could be addressed if it is an issue. Signed-off-by: Nick Piggin <npiggin@suse.de> Refcount bug fix for filemap_xip.c Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29	[PATCH] mm: msync_pte_range progress	Hugh Dickins
	Use latency breaking in msync_pte_range like that in copy_pte_range, instead of the ugly CONFIG_PREEMPT filemap_msync alternatives. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29	[PATCH] mm/msync.c cleanup	OGAWA Hirofumi
	This is not actually a problem, but the name sync_page_range() is already used by a function exported to filesystems. The msync_xxx names are more readable, at least to me. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21	[PATCH] msync: check pte dirty earlier	Abhijit Karmarkar
	It's common practice to msync a large address range regularly, in which often only a few ptes have actually been dirtied since the previous pass. sync_pte_range then goes much faster if it tests whether pte is dirty before locating and accessing each struct page cacheline; and it is hardly slowed by ptep_clear_flush_dirty repeating that test in the opposite case, when every pte actually is dirty. But beware, s390's pte_dirty always says false, since its dirty bit is kept in the storage key, located via the struct page address. So skip this optimization in its case: use a pte_maybe_dirty macro which just says true if page_test_and_clear_dirty is implemented. Signed-off-by: Abhijit Karmarkar <abhijitk@veritas.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-13	[PATCH] ptwalk: inline pmd_range and pud_range	Hugh Dickins
	As a general rule, ask the compiler to inline action_on_pmd_range and action_on_pud_range: they're none very interesting, and it has a better chance of eliding them that way. But conversely, it helps debug traces if action_on_pte_range and top action_on_page_range remain uninlined. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-13	[PATCH] ptwalk: move p?d_none_or_clear_bad	Hugh Dickins
	To handle large sparse areas a little more efficiently, follow Nick and move the p?d_none_or_clear_bad tests up from the start of each function to its callsite. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-13	[PATCH] ptwalk: sync_page_range	Hugh Dickins
	Convert filemap_sync pagetable walkers to loops using p?d_addr_end; use similar loop to split filemap_sync into chunks. Merge filemap_sync_pte into sync_pte_range, cut filemap_ off the longer names, vma arg first. There is no error from filemap_sync, nor is any use made of the flags: if it should do something else for MS_INVALIDATE, reinstate it when that is implemented. Remove the redundant flush_tlb_range from afterwards: as its comment noted, each dirty pte has already been flushed. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
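The loop shape that a p?d_addr_end-style helper enables can be sketched as below. The 2 MiB chunk size and the names `pmd_addr_end_sketch` and `count_chunks` are assumptions for illustration; the kernel's own macros are defined per page-table level and per architecture.

```c
#define PMD_SHIFT 21	/* assumed: each chunk covers 2 MiB */
#define PMD_SIZE  (1UL << PMD_SHIFT)
#define PMD_MASK  (~(PMD_SIZE - 1))

/* End of the chunk containing addr, clamped to end. */
unsigned long pmd_addr_end_sketch(unsigned long addr, unsigned long end)
{
	unsigned long boundary = (addr + PMD_SIZE) & PMD_MASK;

	/* the "- 1" keeps the comparison right if boundary wraps to 0 */
	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Count the chunks a walk over [addr, end) visits; requires addr < end. */
unsigned int count_chunks(unsigned long addr, unsigned long end)
{
	unsigned int chunks = 0;
	unsigned long next;

	do {
		next = pmd_addr_end_sketch(addr, end);
		chunks++;	/* a real walker would process [addr, next) here */
	} while (addr = next, addr != end);

	return chunks;
}
```

Each iteration handles one aligned chunk or the partial chunk at either edge of the range, which is what lets every level of a walker share the same do/while form.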
2005-03-13	[PATCH] ptwalk: p?d_none_or_clear_bad	Hugh Dickins
	Replace the repetitive p?d_none, p?d_bad, p?d_ERROR, p?d_clear clauses by pgd_none_or_clear_bad, pud_none_or_clear_bad, pmd_none_or_clear_bad inlines throughout common and i386 - avoids a sprinkling of "unlikely"s. Tests inline, but unlikely error handling in mm/memory.c - so the ERROR file and line won't tell much; but it comes too late anyway, and hardly ever seen outside development. Let mremap use them in get_one_pte_map, as it already did in _nested; but leave follow_page and untouched_anonymous page just skipping _bad as before - they don't have quite the same ownership of the mm. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-07	[PATCH] sched: mm: fix scheduling latencies in filemap_sync()	Ingo Molnar
	The attached patch, written by Andrew Morton, fixes long scheduling latencies in filemap_sync(). Has been tested as part of the -VP patchset. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] msync(): set PF_SYNCWRITE	Andrew Morton
	Pass the "we are doing synchronous writes" hint down from msync(). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-01-04	[PATCH] Reduce i_sem usage during file sync operations	Andrew Morton
	We hold i_sem during the various sync() operations to prevent livelocks: if another thread is dirtying the file, a sync() may never return. Or at least, that used to be true when we were using the per-address_space page lists. Since writeback has used radix tree traversal it is not possible to livelock the sync() operations, because they only visit each page a single time. sync_page_range() (used by O_SYNC writes) has not been holding i_sem for quite some time, for the above reasons. The patch converts fsync(), fdatasync() and msync() to also not hold i_sem during the radix-tree-based writeback. Now, we _do_ still need to hold i_sem across the file->f_op->fsync() call, because that is still based on a list_head walk, and is still livelockable. But in the case of msync() I deliberately left i_sem untaken. This is because we're currently deadlockable in msync, because mmap_sem is already held, and mmap_sem nests inside i_sem, due to direct-io.c. And yes, the ranking of down_read() versus down() does matter. Consider this sequence: Task A: down_read(rwsem); Task B: down(sem); Task C: down_write(rwsem); Task A: down(sem); Task B: down_read(rwsem). C's down_write() will cause B's down_read to block. B holds `sem', so A will never release `rwsem'. So the patch fixes a hard-to-hit triple-task deadlock, but adds a possible livelock in msync(). It is possible to fix sys_msync() so that it takes i_sem outside i_mmap_sem. Later. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-12-31	[PATCH] convert Linux to 4-level page tables	Andi Kleen
	Extend the Linux MM to 4-level page tables. This is the core patch for mm/, fs/, include/linux/*. It breaks all architectures, which will be fixed in separate patches. The conversion is quite straightforward. All the functions walking the page table hierarchy have been changed to deal with another level at the top. The additional level is called pml4. mm/memory.c has changed a lot because it did most of the heavy lifting here. Most of the changes here are extensions of the previous code. Signed-off-by: Andi Kleen <ak@suse.de> Converted by Nick Piggin to use the pud_t 'page upper' level between pgd and pmd instead of Andi's pml4 level above pgd. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18	[PATCH] add missing linux/syscalls.h includes	Arnd Bergmann
	I found that the prototypes for sys_waitid and sys_fcntl in <linux/syscalls.h> don't match the implementation. In order to keep all prototypes in sync in the future, now include the header from each file implementing any syscall. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-17	[PATCH] Clean up asm/pgalloc.h include	Russell King
	This patch cleans up needless includes of asm/pgalloc.h from the fs/ kernel/ and mm/ subtrees. Compile tested on multiple ARM platforms, and x86, this patch appears safe. This patch is part of a larger patch aiming towards getting the include of asm/pgtable.h out of linux/mm.h, so that asm/pgtable.h can sanely get at things like mm_struct and friends. I suggest testing in -mm for a while to ensure there aren't any hidden arch issues. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-06-02	[PATCH] hugetlbpage msync() fix	Andrew Morton
	From: David Gibson <david@gibson.dropbear.id.au> Currently, calling msync() on a hugepage area will cause the kernel to blow up with a bad_page() (at least on ppc64, but I think the problem will exist on other archs too). The msync path attempts to walk pagetables which may not be there, or may have an unusual layout for hugepages. Luckily we shouldn't need to do anything for an msync on hugetlbfs beyond flushing the cache, so this patch should be sufficient to fix the problem. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-01-18	[PATCH] s390: superflous flush_tlb_range calls.	Andrew Morton
	From: Martin Schwidefsky <schwidefsky@de.ibm.com> While searching for an s390 tlb flush problem I noticed some superfluous tlb flushes: one in zeromap_page_range, one in remap_page_range, and another one in filemap_sync. The patch just adds comments but I think these three flush_tlb_range calls can be removed.
2004-01-18	[PATCH] s390: physical dirty/referenced bits.	Andrew Morton
	From: Martin Schwidefsky <schwidefsky@de.ibm.com> This is another s/390 related mm patch. It introduces the concept of physical dirty and referenced bits into the common mm code. I always had the nagging feeling that the pte functions for setting/clearing the dirty and referenced bits are not appropriate for s/390. It works but it is a bit of a hack. In the wake of rmap it is now possible to put a much better solution into place. The idea is simple: since there are no dirty/referenced bits in the pte, make these functions nops on s/390 and add operations on the physical page to the appropriate places. For the referenced bit this is the page_referenced() function. For the dirty bit there are two relevant spots: in page_remove_rmap after the last user of the page removed its reverse mapping, and in try_to_unmap after the last user was unmapped. There are two new functions to accomplish this:
	* page_test_and_clear_dirty: Test and clear the dirty bit of a physical page. This function is analogous to ptep_test_and_clear_dirty but gets a struct page as argument instead of a pte_t pointer.
	* page_test_and_clear_young: Test and clear the referenced bit of a physical page. This function is analogous to ptep_test_and_clear_young but gets a struct page as argument instead of a pte_t pointer.
	It's pretty straightforward and with it the s/390 mm makes much more sense. You'll need the tlb flush optimization patch for the patch. Comments?
2004-01-18	[PATCH] s390: tlb flush optimization.	Andrew Morton
	From: Martin Schwidefsky <schwidefsky@de.ibm.com> On the s/390 architecture we still have the issue with tlb flushing and the ipte instruction. We can optimize the tlb flushing a lot with some minor interface changes between the arch backend and the memory management core. In the end the whole thing is about the Invalidate Page Table Entry (ipte) instruction. The instruction sets the invalid bit in the pte and removes the tlb for the page on all cpus for the virtual to physical mapping of the page in a particular address space. The nice thing is that only the tlb for this page gets removed, all the other tlbs stay valid. The reason we can't use ipte to implement flush_tlb_page() is one of the requirements of the instruction: the pte that should get flushed needs to be valid. I'd like to add the following four functions to the mm interface:
	* ptep_establish: Establish a new mapping. This sets a pte entry to a page table and flushes the tlb of the old entry on all cpus if it exists. This is more or less what establish_pte in mm/memory.c does right now but without the update_mmu_cache call.
	* ptep_test_and_clear_and_flush_young: Do what ptep_test_and_clear_young does and flush the tlb.
	* ptep_test_and_clear_and_flush_dirty: Do what ptep_test_and_clear_dirty does and flush the tlb.
	* ptep_get_and_clear_and_flush: Do what ptep_get_and_clear does and flush the tlb.
	The s/390 specific functions in include/pgtable.h define their own optimized version of these four functions by use of the ipte. To avoid defining these functions for every architecture, I added them to include/asm-generic/pgtable.h. Since i386/x86 and others don't include this header yet and define their own version of the functions found there, I #ifdef'd all functions in include/asm-generic/pgtable.h to be able to pick the ones that are needed for each architecture (see patch for details). With the new functions in place it is easy to do the optimization, e.g. the sequence
		ptep_get_and_clear(ptep);
		flush_tlb_page(vma, address);
	gets replaced by
		ptep_get_and_clear_and_flush(vma, address, ptep);
	The old sequence still works but it is suboptimal on s/390.
2004-01-18	[PATCH] bdev: use correct mapping's i_sem	Andrew Morton
	From: viro@parcelfarce.linux.theplanet.co.uk <viro@parcelfarce.linux.theplanet.co.uk> In a bunch of places we used file->f_dentry->d_inode->i_sem to protect fdatasync et al. Replaced with the correct file->f_mapping->host->i_sem - the object we are protecting is address_space, so we want an exclusion that would work for redirected ->i_mapping. For normal files (not coda, not bdev) it's all the same, of course - there we have file->f_mapping->host == file->f_dentry->d_inode and the change above is an equivalent transformation.
2003-04-08	[PATCH] Make msync(MS_ASYNC) no longer start the I/O	Andrew Morton
	MS_ASYNC will currently wait on previously-submitted I/O, then start new I/O and not wait on it. This can cause undesirable blocking if msync is called rapidly against the same memory. So instead, change msync(MS_ASYNC) to not start any IO at all. Just flush the pte dirty bits into the pageframe and leave it at that. The IO _will_ happen within a kupdate period. And the application can use fsync() or fadvise(FADV_DONTNEED) if it actually wants to schedule the IO immediately. (This has triggered an ext3 bug - the page's buffers get dirtied so fast that kjournald keeps writing the buffers over and over for 10-20 seconds before deciding to give up for some reason)
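	The pattern suggested in the message above, msync(MS_ASYNC) followed by an explicit fsync() when the application wants the I/O scheduled right away, might look roughly like the following user-space sketch. The helper name async_then_fsync() and the /tmp scratch path are assumptions made for illustration; they do not come from the patch.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: dirty one page of a shared file mapping, call
 * msync(MS_ASYNC), then fsync() to request the writeback explicitly.
 * Returns 0 on success, -1 on any failure.
 */
static int async_then_fsync(void)
{
	char path[] = "/tmp/msync-demo-XXXXXX";	/* assumed scratch location */
	const char msg[] = "hello";
	char back[sizeof(msg)] = { 0 };
	char *map;
	int fd;
	int rc = -1;

	long page = sysconf(_SC_PAGESIZE);
	assert(page > 0);

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);			/* file disappears once fd is closed */

	if (ftruncate(fd, page) != 0)
		goto out_close;

	map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	/* Dirty the page through the mapping. */
	memcpy(map, msg, sizeof(msg));

	/* MS_ASYNC does not wait for I/O; fsync() is what waits for it. */
	if (msync(map, (size_t)page, MS_ASYNC) == 0 &&
	    fsync(fd) == 0 &&
	    pread(fd, back, sizeof(back), 0) == (ssize_t)sizeof(back) &&
	    memcmp(back, msg, sizeof(msg)) == 0)
		rc = 0;

	munmap(map, (size_t)page);
out_close:
	close(fd);
	return rc;
}
```

	Calling async_then_fsync() from a test program should return 0 on a system with a writable /tmp. The read-back through pread() only checks that the mapping and the file agree; it does not prove the data reached stable storage.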
2002-10-31	[PATCH] uninline some things in mm/*.c	Andrew Morton
	Tuned for gcc-2.95.3:
	filemap.c: 10815 -> 10046
	highmem.c: 3392 -> 3104
	mmap.c: 5998 -> 5854
	mremap.c: 3058 -> 2802
	msync.c: 1521 -> 1489
	page_alloc.c: 8487 -> 8167
2002-10-15	[PATCH] make filemap_sync static	Andrew Morton
	From Christoph Hellwig. Make filemap_sync() static, and not exported to modules.
2002-10-13	[PATCH] msync correctness fixes	Andrew Morton
	From Anton Blanchard. This fixes a couple of Linux Test Project failures.
	- Returns EBUSY if the caller is trying to invalidate memory which is covered by a locked vma. The open group say:
	  [EBUSY] Some or all of the addresses in the range starting at addr and continuing for len bytes are locked, and MS_INVALIDATE is specified.
	- Returns EINVAL if the caller specified both MS_SYNC and MS_ASYNC:
	  [EINVAL] The value of flags is invalid.
	  and: "Either MS_ASYNC or MS_SYNC is specified, but not both."
2002-08-30	[PATCH] writeback correctness and efficiency changes	Andrew Morton
	This is a performance and correctness fix against the writeback paths. The writeback code has competing requirements. Sometimes it is used for "memory cleansing": kupdate, bdflush, writer throttling, page allocator writeback, etc. And sometimes this same code is used for data integrity purposes: fsync, msync, fdatasync, sync, umount, various other kernel-internal uses. The problem is: how to handle a dirty buffer or page which is currently under writeback. For memory cleansing, we just want to skip that buffer/page and go onto the next one. But for sync, we must wait on the old writeback and then start new writeback. mpage_writepages() is currently correct for cleansing, but incorrect for sync. block_write_full_page() is currently correct for sync, but inefficient for cleansing. The fix is fairly simple.
	- In mpage_writepages(), don't skip the page if it's a sync operation.
	- In block_write_full_page(), skip the buffer if it is not a sync operation. And return -EAGAIN to tell the caller that the writeout didn't work out. The caller must then set the page dirty again and move it onto mapping->dirty_pages. This is an extension of the writepage API: writepage can now return EAGAIN. There are only three callers, and they have been updated. fail_writepage() and ext3_writepage() were actually doing this by hand. They have been changed to return -EAGAIN. NTFS will want to be able to return -EAGAIN from its writepage as well.
	- A sticky question is: how to tell the writeout code which mode it is operating in? Cleansing or sync? It's such a tiny code change that I didn't have the heart to go and propagate a `mode' argument down every instance of writepages() and writepage() in the kernel. So I passed it in via current->flags. Incidentally, the occurrence of a locked-and-dirty buffer in block_write_full_page() is fairly rare: normally the collision avoidance happens at the address_space level, via PageWriteback. But some mappings (blockdevs, ext3 files, etc) have their dirty buffers written out via submit_bh(). It is these buffers which can stall block_write_full_page(). This wart will be pretty intrusive to fix. ext3 needs to become fully page-based (ugh. It's a block-based journalling filesystem, and pages are unnatural). blockdev mappings are still written out by buffers because that's how filesystems use them. Putting _all_ metadata (indirects, inodes, superblocks, etc) into standalone address_spaces would fix that up.
	- filemap_fdatawrite() sets PF_SYNC. So filemap_fdatawrite() is the kernel function which will start writeback against a mapping for "data integrity" purposes, whereas the unexported, internal-only do_writepages() is the writeback function which is used for memory cleansing. This difference is the reason why I didn't consolidate those functions ages ago...
	- Lots of code paths had a bogus extra call to filemap_fdatawait(), which I previously added in a moment of weak-headedness. They have all been removed.
2002-07-14	[PATCH] error code for msync()	Hirofumi Ogawa
	SUSv3 says: "The msync() function shall fail if:
	[EBUSY] Some or all of the addresses in the range starting at addr and continuing for len bytes are locked, and MS_INVALIDATE is specified.
	[EINVAL] The value of flags is invalid.
	[EINVAL] The value of addr is not a multiple of the page size {PAGESIZE}.
	[ENOMEM] The addresses in the range starting at addr and continuing for len bytes are outside the range allowed for the address space of a process or specify one or more pages that are not mapped."
	This fixes the error code of msync() in the EINVAL case.
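	The two EINVAL cases quoted above can be observed from user space. The following is a minimal sketch, assuming current Linux behaviour; the helper names msync_errno() and check_msync_einval() are made up for this example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: run msync() and return 0 or the resulting errno. */
static int msync_errno(void *addr, size_t len, int flags)
{
	errno = 0;
	return msync(addr, len, flags) == 0 ? 0 : errno;
}

/*
 * Hypothetical check: returns 0 if msync() reports EINVAL both for
 * MS_SYNC|MS_ASYNC and for an unaligned addr, and succeeds for a valid
 * call. Returns -1 if the mapping cannot be set up, 1 on a mismatch.
 */
static int check_msync_einval(void)
{
	long page = sysconf(_SC_PAGESIZE);
	assert(page > 0);

	char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* "Either MS_ASYNC or MS_SYNC is specified, but not both." */
	int both = msync_errno(p, (size_t)page, MS_SYNC | MS_ASYNC);
	/* addr is not a multiple of the page size. */
	int unaligned = msync_errno(p + 1, (size_t)page - 1, MS_SYNC);
	/* A well-formed call on the same mapping for comparison. */
	int valid = msync_errno(p, (size_t)page, MS_SYNC);

	munmap(p, (size_t)page);
	return (both == EINVAL && unaligned == EINVAL && valid == 0) ? 0 : 1;
}
```

	check_msync_einval() should return 0 on a kernel that follows the quoted text. The EBUSY case is left out because it needs mlock(), which may be refused by resource limits.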
2002-06-17	[PATCH] msync(bad address) should return -ENOMEM	Andrew Morton
	Heaven knows why, but that's what the opengroup say, and returning -EFAULT causes 2.5 to fail one of the Linux Test Project tests. [ENOMEM] The addresses in the range starting at addr and continuing for len bytes are outside the range allowed for the address space of a process or specify one or more pages that are not mapped. 2.4 has it right, but 2.5 doesn't.
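	A small user-space sketch of the ENOMEM case, assuming current Linux behaviour: map two pages, unmap the second, then msync() the whole range so that it covers a hole. The helper name msync_hole_errno() is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the errno that msync() reports for a
 * range whose second page is unmapped, 0 if msync() unexpectedly
 * succeeds, or -1 if the mappings cannot be set up.
 */
static int msync_hole_errno(void)
{
	long page = sysconf(_SC_PAGESIZE);
	assert(page > 0);

	size_t len = 2 * (size_t)page;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Punch a hole: the second page is no longer mapped. */
	if (munmap(p + page, (size_t)page) != 0) {
		munmap(p, len);
		return -1;
	}

	/* The range now includes a page that is not mapped. */
	errno = 0;
	int err = msync(p, len, MS_SYNC) == 0 ? 0 : errno;

	munmap(p, (size_t)page);
	return err;
}
```

	On a kernel with this fix, msync_hole_errno() is expected to return ENOMEM rather than EFAULT. The sketch assumes a single-threaded caller, so nothing else can map the freed page between munmap() and msync().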