| Age | Commit message | Author |
|
Signed-off-by: Nicolas Kaiser <nikai@nikai.net>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The page_clear_dirty primitive always sets the default storage key
which resets the access control bits and the fetch protection bit.
That will surprise a KVM guest that sets non-zero access control
bits or the fetch protection bit. Merge page_test_dirty and
page_clear_dirty back to a single function and only clear the
dirty bit from the storage key.
In addition move the functions page_test_and_clear_dirty and
page_test_and_clear_young to page.h where they belong. This
requires changing the parameter from a struct page * to a page
frame number.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
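A minimal sketch of the behavior described above, modeling a storage key as a
plain byte. The masks follow the s390 storage key layout (access control bits,
fetch protection, referenced, changed); the function name and the in-memory
byte are hypothetical stand-ins for the real key instructions and are not
taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Storage key layout on s390 (one byte per page frame). */
#define SKEY_ACC_BITS   0xF0u  /* access control bits */
#define SKEY_FP_BIT     0x08u  /* fetch protection bit */
#define SKEY_REFERENCED 0x04u
#define SKEY_CHANGED    0x02u  /* the "dirty" bit */

/*
 * Hypothetical model: test the changed bit and clear only that bit,
 * leaving the access control, fetch protection and referenced bits
 * exactly as the guest set them.
 */
static bool skey_test_and_clear_dirty(uint8_t *skey)
{
    bool dirty;

    assert(skey);
    dirty = (*skey & SKEY_CHANGED) != 0;
    if (dirty)
        *skey &= (uint8_t)~SKEY_CHANGED;
    return dirty;
}
```

Writing back a default key instead of masking out one bit is what would wipe
the guest's access control and fetch protection settings.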
|
|
Commit e2cda3226481 ("thp: add pmd mangling generic functions") replaced
some macros in <asm-generic/pgtable.h> with inline functions.
If the functions are to be defined (not all architectures need them)
then struct vm_area_struct must be defined first. So include
<linux/mm_types.h>.
Fixes a build failure seen in Debian:
CC [M] drivers/media/dvb/mantis/mantis_pci.o
In file included from arch/arm/include/asm/pgtable.h:460,
from drivers/media/dvb/mantis/mantis_pci.c:25:
include/asm-generic/pgtable.h: In function 'ptep_test_and_clear_young':
include/asm-generic/pgtable.h:29: error: dereferencing pointer to incomplete type
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
pmdp_get_and_clear/pmdp_clear_flush/pmdp_splitting_flush were trapped as
BUG() and they were defined only to diminish the risk of build issues on
not-x86 archs and to be consistent with the generic pte methods previously
defined in include/asm-generic/pgtable.h.
But they are causing more trouble than they were supposed to solve, so
it's simpler not to define them when THP is off.
This is also correcting the export of pmdp_splitting_flush which is
currently unused (x86 isn't using the generic implementation in
mm/pgtable-generic.c and no other arch needs that [yet]).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Sam Ravnborg <sam@ravnborg.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Some are needed to build but not actually used on archs not supporting
transparent hugepages. Others like pmdp_clear_flush are used by x86 too.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
These return 0 at compile time when the config option is disabled, to
allow gcc to eliminate the transparent hugepage function calls at compile
time without additional #ifdefs (only the export of those functions has
to be visible to gcc but they won't be required at link time and
huge_memory.o need not be built at all).
_PAGE_BIT_UNUSED1 is never used for pmd, only on pte.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
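A minimal sketch of the compile-time-zero idea, under stated assumptions: all
demo_* names and DEMO_TRANSPARENT_HUGEPAGE are hypothetical, and the predicate
is written as a macro here so the dead branch is visible to the compiler as a
literal constant.

```c
#include <assert.h>

/* Hypothetical stand-in for the real config option; left undefined. */
/* #define DEMO_TRANSPARENT_HUGEPAGE 1 */

typedef struct { unsigned long val; } demo_pmd_t;

static int demo_split_calls;    /* counts calls into the "huge page" code */

/*
 * In a real build this would live in an object that is not built at all
 * when the option is off; it is defined here only so that the sketch
 * links at any optimization level.
 */
static int demo_split_huge_page(demo_pmd_t *pmd)
{
    assert(pmd);
    demo_split_calls++;
    pmd->val = 0;
    return 1;
}

#ifdef DEMO_TRANSPARENT_HUGEPAGE
#define demo_pmd_trans_huge(pmd) (((pmd).val & 0x80ul) != 0) /* illustrative bit */
#else
#define demo_pmd_trans_huge(pmd) 0   /* constant: the branch below is dead */
#endif

/* Caller needs no #ifdef: the huge page path disappears when disabled. */
static int demo_walk(demo_pmd_t *pmd)
{
    if (demo_pmd_trans_huge(*pmd))
        return demo_split_huge_page(pmd);
    return 0;
}
```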
|
|
Improve performance of the sske operation by using the nonquiescing
variant if the affected page has no mappings established. On machines
with no support for the new sske variant the mask bit will be ignored.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
On x86, the accessed and dirty bits are set automatically by the CPU when it
accesses memory. By the time we reach the flush_tlb_fix_spurious_fault() call
below, the dirty bit is already set in the pte and no TLB flush is needed. The
TLB entry on some CPUs may still lack the dirty bit, but this doesn't matter:
when those CPUs write to the page, they check the bit automatically with no
software involved.
On the other hand, flushing the TLB at this point is harmful. The test creates
one thread per CPU; each thread writes to the same (randomly chosen) address
in the same vma range, and we measure the total time. On a 4-socket system the
original time is 1.96s, while with the patch it is 0.8s. On a 2-socket system
the time is cut by 20% too. perf shows a lot of time is spent sending and
handling IPIs for the TLB flush.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
LKML-Reference: <20100816011655.GA362@sli10-desk.sh.intel.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
|
|
Most architectures now provide a pgprot_noncached(), the
remaining ones can simply use a dummy default implementation,
except for cris and xtensa, which should override the
default appropriately.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Magnus Damm <magnus.damm@gmail.com>
|
|
Impact: fix lazy context switch API
Pass the previous and next tasks into the context switch start and
end calls, so that the called functions can properly access the
task state (esp in end_context_switch, in which the next task
is not yet completely current).
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
|
|
Impact: simplification, prepare for later changes
Make lazy cpu mode more specific to context switching, so that
it makes sense to do more context-switch specific things in
the callbacks.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
|
|
Impact: cleanup
Change the protection parameter for track_pfn_vma_new() into a pgprot_t pointer.
A subsequent patch changes the x86 PAT handling to return a compatible
memtype in pgprot_t, if what was requested cannot be allowed due to conflicts.
No functionality change in this patch.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Impact: Cleanup and branch hints only.
Move the track and untrack pfn stub routines from memory.c to asm-generic.
Also add unlikely to pfnmap related calls in fork and exit path.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
|
|
Impact: New mm functionality.
Add pgprot_writecombine. pgprot_writecombine will be aliased to
pgprot_noncached when not supported by the architecture.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
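The fallback chain can be sketched with two guarded macros. This is a
simplified illustration: the demo_ prefix and the helper function are
hypothetical, and an architecture that supports either mode would define its
own macro before these defaults are seen.

```c
#include <assert.h>

typedef struct { unsigned long pgprot; } demo_pgprot_t;

/* Dummy default: with no non-cached mode, hand the protection back as is. */
#ifndef demo_pgprot_noncached
#define demo_pgprot_noncached(prot) (prot)
#endif

/* Alias write-combining to non-cached when the arch does not provide it. */
#ifndef demo_pgprot_writecombine
#define demo_pgprot_writecombine demo_pgprot_noncached
#endif

/* Example caller: pick a write-combining protection for a mapping. */
static demo_pgprot_t demo_wc_prot(demo_pgprot_t prot)
{
    return demo_pgprot_writecombine(prot);
}
```

Callers can then request write-combining unconditionally and get the closest
mode the architecture offers.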
|
|
Commit 1ea0704e0d aka "mm: add a ptep_modify_prot transaction abstraction"
caused:
| CC init/main.o
|In file included from include2/asm/pgtable.h:68,
| from /home/bigeasy/git/linux-2.6-m68k/include/linux/mm.h:39,
| from include2/asm/uaccess.h:8,
| from /home/bigeasy/git/linux-2.6-m68k/include/linux/poll.h:13,
| from /home/bigeasy/git/linux-2.6-m68k/include/linux/rtc.h:113,
| from /home/bigeasy/git/linux-2.6-m68k/include/linux/efi.h:19,
| from /home/bigeasy/git/linux-2.6-m68k/init/main.c:43:
|/linux-2.6/include/asm-generic/pgtable.h: In function '__ptep_modify_prot_start':
|/linux-2.6/include/asm-generic/pgtable.h:209: error: implicit declaration of function 'ptep_get_and_clear'
|/linux-2.6/include/asm-generic/pgtable.h:209: error: incompatible types in return
|/linux-2.6/include/asm-generic/pgtable.h: In function '__ptep_modify_prot_commit':
|/linux-2.6/include/asm-generic/pgtable.h:220: error: implicit declaration of function 'set_pte_at'
|make[2]: *** [init/main.o] Error 1
|make[1]: *** [init] Error 2
|make: *** [sub-make] Error 2
on my m68knommu box.
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch adds an API for doing read-modify-write updates to a pte's
protection bits which may race against hardware updates to the pte.
After reading the pte, the hardware may asynchronously set the accessed
or dirty bits on a pte, which would be lost when writing back the
modified pte value.
The existing technique to handle this race is to use
ptep_get_and_clear() to atomically fetch the old pte value and clear it
in memory. This has the effect of marking the pte as non-present,
which will prevent the hardware from updating its state. When the new
value is written back, the pte will be present again, and the hardware
can resume updating the access/dirty flags.
When running in a virtualized environment, pagetable updates are
relatively expensive, since they generally involve some trap into the
hypervisor. To mitigate the cost of these updates, we tend to batch
them.
However, because of the atomic nature of ptep_get_and_clear(), it is
inherently non-batchable. This new interface allows batching by
giving the underlying implementation enough information to open a
transaction between the read and write phases:
ptep_modify_prot_start() returns the current pte value, and puts the
pte entry into a state where either the hardware will not update the
pte, or if it does, the updates will be preserved on commit.
ptep_modify_prot_commit() writes back the updated pte and makes sure that
any hardware updates made since ptep_modify_prot_start() are
preserved.
ptep_modify_prot_start() and _commit() must be exactly paired, and
used while holding the appropriate pte lock. They do not protect
against other software updates of the pte in any way.
The current implementations of ptep_modify_prot_start and _commit are
functionally unchanged from before: _start() uses ptep_get_and_clear() to
fetch the pte and zero the entry, preventing any hardware updates.
_commit() simply writes the new pte value back knowing that the
hardware has not updated the pte in the meantime.
The only current user of this interface is mprotect.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
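The start/commit pairing described in this message can be sketched on a toy
pte. This is not the kernel implementation: the demo_* names and the bit value
are hypothetical, the pte is a single word, and the fetch-and-clear is not
atomic here as it would be in the kernel.

```c
#include <assert.h>

typedef unsigned long demo_pte_t;      /* toy pte: one machine word */
#define DEMO_PTE_WRITE 0x2ul           /* illustrative protection bit */

/* Start: fetch the old value and leave the entry non-present (zero). */
static demo_pte_t demo_modify_prot_start(demo_pte_t *ptep)
{
    demo_pte_t old = *ptep;            /* the kernel would do this atomically */

    *ptep = 0;
    return old;
}

/* Commit: write the new value; the entry becomes present again. */
static void demo_modify_prot_commit(demo_pte_t *ptep, demo_pte_t pte)
{
    *ptep = pte;
}

/* Caller pattern, as an mprotect-style user would run it under the pte lock. */
static void demo_write_protect(demo_pte_t *ptep)
{
    demo_pte_t pte = demo_modify_prot_start(ptep);

    assert(*ptep == 0);                /* non-present: no hardware updates */
    pte &= ~DEMO_PTE_WRITE;
    demo_modify_prot_commit(ptep, pte);
}
```

A batching backend could keep the same caller pattern and change only what
start and commit do internally.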
|
|
The current ia64 kernel flushes the icache via lazy_mmu_prot_update() *after*
set_pte(). This is too late. This patch removes lazy_mmu_prot_update() and
adds a modified set_pte() that flushes if necessary.
This patch flushes the icache of a page when
the new pte has the exec bit
&& the new pte has the present bit
&& the new pte is a user page
&& (the old *ptep is not present
|| the new pte's pfn differs from the old *ptep's pfn)
&& the new pte's page does not have the PG_arch_1 bit set.
PG_arch_1 is set when a page is cache consistent.
I think these condition checks are much easier to understand than considering
"Where should sync_icache_dcache() be inserted?".
pte_user() for ia64 was removed by http://lkml.org/lkml/2007/6/12/67 as a
clean-up, so I added it again.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
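The flush condition listed in this message can be written as a single
predicate. A sketch only: the pte layout, bit values and names are
hypothetical, and page_coherent stands in for the page flag that marks a page
as cache consistent.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy pte: flag bits in the low byte, pfn above. Values are illustrative. */
typedef unsigned long demo_pte_t;
#define DEMO_PRESENT 0x1ul
#define DEMO_EXEC    0x2ul
#define DEMO_USER    0x4ul
#define DEMO_PFN(p)  ((p) >> 8)

/* Decide whether installing new_pte over old requires an icache flush. */
static bool demo_needs_icache_flush(demo_pte_t old, demo_pte_t new_pte,
                                    bool page_coherent)
{
    if (!(new_pte & DEMO_EXEC) || !(new_pte & DEMO_PRESENT) ||
        !(new_pte & DEMO_USER))
        return false;
    if ((old & DEMO_PRESENT) && DEMO_PFN(old) == DEMO_PFN(new_pte))
        return false;          /* same page was already mapped */
    return !page_coherent;     /* skip pages already marked consistent */
}
```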
|
|
There are some parts of include/asm-generic/pgtable.h that are relevant to
the non-mmu architectures. To make it easier to include this from them I
would like to ifdef the relevant parts.
Without this there is a handful of functions that are referenced in here
that are not defined on many non-mmu architectures. They could be defined
out of course, as an alternative approach.
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Nobody is using ptep_test_and_clear_dirty and ptep_clear_flush_dirty. Remove
the functions from all architectures.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The last user of ptep_establish in mm/ is long gone. Remove the architecture
primitive as well.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Some changes done a while ago to avoid pounding on ptep_set_access_flags and
update_mmu_cache in some race situations break sun4c which requires
update_mmu_cache() to always be called on minor faults.
This patch reworks ptep_set_access_flags() semantics, implementations and
callers so that it's now responsible for returning whether an update is
necessary or not (basically whether the PTE actually changed). This allows
fixing the sparc implementation to always return 1 on sun4c.
[akpm@linux-foundation.org: fixes, cleanups]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: David Miller <davem@davemloft.net>
Cc: Mark Fortescue <mark@mtfhpc.demon.co.uk>
Acked-by: William Lee Irwin III <wli@holomorphy.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
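A sketch of the reworked return-value semantics on a toy pte. The demo_* names
are hypothetical, and the second function only illustrates the "always return
1" behavior the message mentions for sun4c; it is not the sparc code.

```c
#include <assert.h>

typedef unsigned long demo_pte_t;

/*
 * Returns 1 when the PTE actually changed, so the caller knows whether
 * follow-up work such as update_mmu_cache() is needed.
 */
static int demo_set_access_flags(demo_pte_t *ptep, demo_pte_t entry)
{
    int changed;

    assert(ptep);
    changed = (*ptep != entry);
    if (changed)
        *ptep = entry;
    return changed;
}

/* Variant for a platform that always needs the follow-up call. */
static int demo_set_access_flags_always(demo_pte_t *ptep, demo_pte_t entry)
{
    assert(ptep);
    *ptep = entry;
    return 1;
}
```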
|
|
The page_test_and_clear_dirty primitive really consists of two
operations, page_test_dirty and the page_clear_dirty. The combination
of the two is not an atomic operation, so it makes more sense to have
two separate operations instead of one.
In addition to the improved readability of the s390 version of
SetPageUptodate, it now avoids the page_test_dirty operation which is
an insert-storage-key-extended (iske) instruction which is an expensive
operation.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
Since lazy MMU batching mode still allows interrupts to enter, it is
possible for interrupt handlers to try to use kmap_atomic, which fails when
lazy mode is active, since the PTE update to highmem will be delayed. The
best workaround is to issue an explicit flush in kmap_atomic_functions
case; this is the only way nested PTE updates can happen in the interrupt
handler.
Thanks to Jeremy Fitzhardinge for noting the bug and suggestions on a fix.
This patch gets reverted again when we start 2.6.22 and the bug gets fixed
differently.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The VMI ROM has a mode where hypercalls can be queued and batched. This turns
out to be a significant win during context switch, but must be done at a
specific point before side effects to CPU state are visible to subsequent
instructions. This is similar to the MMU batching hooks already provided.
The same hooks could be used by the Xen backend to implement a context switch
multicall.
To explain a bit more about lazy modes in the paravirt patches, basically, the
idea is that only one of lazy CPU or MMU mode can be active at any given time.
Lazy MMU mode is similar to this lazy CPU mode, and allows for batching of
multiple PTE updates (say, inside a remap loop), but to avoid keeping some
kind of state machine about when to flush cpu or mmu updates, we just allow
one or the other to be active. Although there is no real reason a more
comprehensive scheme could not be implemented, there is also no demonstrated
need for this extra complexity.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
|
|
Now that ptep_establish has a definition in PAE i386 3-level paging code, the
only paging model which is insane enough to have multi-word hardware PTEs
which are not efficient to set atomically, we can remove the ghost of
set_pte_atomic from other architectures which falsely duplicated it, and
remove all knowledge of it from the generic pgtable code.
set_pte_atomic is now a private pte operator which is specific to i386.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Implement lazy MMU update hooks which are SMP safe for both direct and shadow
page tables. The idea is that PTE updates and page invalidations while in
lazy mode can be batched into a single hypercall. We use this in VMI for
shadow page table synchronization, and it is a win. It also can be used by
PPC and for direct page tables on Xen.
For SMP, the enter / leave must happen under protection of the page table
locks for page tables which are being modified. This is because otherwise,
you end up with stale state in the batched hypercall, which other CPUs can
race ahead of. Doing this under the protection of the locks guarantees the
synchronization is correct, and also means that spurious faults which are
generated during this window by remote CPUs are properly handled, as the page
fault handler must re-check the PTE under protection of the same lock.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
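The batching idea can be sketched with a small queue that is drained when lazy
mode is left. Everything here is a hypothetical single-threaded model (names,
queue size, the hypercall counter); it shows the enter/leave pairing only, not
the locking rules discussed above.

```c
#include <assert.h>

#define DEMO_BATCH_MAX 8

struct demo_update { unsigned long *ptep; unsigned long val; };

static struct demo_update demo_queue[DEMO_BATCH_MAX];
static int demo_queued;
static int demo_lazy;            /* non-zero while in lazy MMU mode */
static int demo_hypercalls;      /* counts simulated hypercalls */

/* Apply everything queued so far with one simulated hypercall. */
static void demo_flush(void)
{
    int i;

    if (!demo_queued)
        return;
    for (i = 0; i < demo_queued; i++)
        *demo_queue[i].ptep = demo_queue[i].val;
    demo_queued = 0;
    demo_hypercalls++;
}

static void demo_enter_lazy_mmu_mode(void) { demo_lazy = 1; }
static void demo_leave_lazy_mmu_mode(void) { demo_flush(); demo_lazy = 0; }

/* Queue the update in lazy mode, otherwise issue it immediately. */
static void demo_set_pte(unsigned long *ptep, unsigned long val)
{
    if (demo_queued == DEMO_BATCH_MAX)
        demo_flush();
    demo_queue[demo_queued].ptep = ptep;
    demo_queue[demo_queued].val = val;
    demo_queued++;
    if (!demo_lazy)
        demo_flush();
}
```

In this model four updates made between enter and leave cost one simulated
hypercall instead of four.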
|
|
Change pte_clear_full to a more appropriately named pte_clear_not_present,
allowing optimizations when not-present mapping changes need not be reflected
in the hardware TLB for protected page table modes. There is also another
case that can use it in the fremap code.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Parsing generic pgtable.h in assembler is simply crazy. None of this file is
needed in assembler code, and C inline functions and structures routinely break
one or more different compiles.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
If we move a mapping from one virtual address to another,
and this changes the virtual color of the mapping to those
pages, we can see corrupt data due to D-cache aliasing.
Check for and deal with this by overriding the move_pte()
macro. Set things up so that other platforms can cleanly
override the move_pte() macro too.
Signed-off-by: David S. Miller <davem@davemloft.net>
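A sketch of what an overriding move_pte() might check. The page size, the
single color bit and all demo_* names are assumptions made for illustration;
the flush is only counted, not performed.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_PAGE_SHIFT 13                       /* hypothetical 8 KiB pages */
#define DEMO_COLOR_MASK (1ul << DEMO_PAGE_SHIFT) /* hypothetical: two colors */

static int demo_dcache_flushes;                  /* counts simulated flushes */

/* True when moving a mapping changes its virtual cache color. */
static bool demo_color_changes(unsigned long old_addr, unsigned long new_addr)
{
    return ((old_addr ^ new_addr) & DEMO_COLOR_MASK) != 0;
}

/* Arch override: flush before the page is remapped at a different color. */
static unsigned long demo_move_pte(unsigned long pte, unsigned long old_addr,
                                   unsigned long new_addr)
{
    if (pte != 0 && demo_color_changes(old_addr, new_addr))
        demo_dcache_flushes++;                   /* stands in for the flush */
    return pte;
}
```

A platform without aliasing D-caches would keep a default that simply returns
the pte unchanged.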
|
|
Fix more include file problems that surfaced since I submitted the previous
fix-missing-includes.patch. This should now allow not to include sched.h
from module.h, which is done by a followup patch.
Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Updated several references to page_table_lock in common code comments.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Move the ZERO_PAGE remapping complexity to the move_pte macro in
asm-generic, have it conditionally depend on
__HAVE_ARCH_MULTIPLE_ZERO_PAGE, which gets defined for MIPS.
For architectures without __HAVE_ARCH_MULTIPLE_ZERO_PAGE, move_pte becomes
a noop.
From: Hugh Dickins <hugh@veritas.com>
Fix nasty little bug we've missed in Nick's mremap move ZERO_PAGE patch.
The "pte" at that point may be a swap entry or a pte_file entry: we must
check pte_present before perhaps corrupting such an entry.
Patch below against 2.6.14-rc2-mm1, but the same bug is in 2.6.14-rc2's
mm/mremap.c, and more dangerous there since it's affecting all arches: I
think the safest course is to send Nick's patch and Yoichi's build fix and
this fix (build tested) on to Linus - so only MIPS can be affected.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add a new accessor for PTEs, which passes the full hint from the mmu_gather
struct; this allows architectures with hardware pagetables to optimize away
atomic PTE operations when destroying an address space. Removing the
locked operation should allow better pipelining of memory access in this
loop. I measured an average savings of 30-35 cycles per zap_pte_range on
the first 500 destructions on Pentium-M, but I believe the optimization
would win more on older processors which still assert the bus lock on xchg
for an exclusive cacheline.
Update: I made some new measurements, and this saves exactly 26 cycles over
ptep_get_and_clear on Pentium M. On P4, with a PAE kernel, this saves 180
cycles per ptep_get_and_clear, for a whopping 92160 cycles savings for a
full address space destruction.
pte_clear_full is not yet used, but is provided for future optimizations
(in particular, when running inside of a hypervisor that queues page table
updates, the full hint allows us to avoid queueing unnecessary page table
updates for an address space in the process of being destroyed).
This is not a huge win, but it does help a bit, and sets the stage for
further hypervisor optimization of the mm layer on all architectures.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Cc: Christoph Lameter <christoph@lameter.com>
Cc: <linux-mm@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
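A sketch of how a "full" hint can skip the locked operation, written with C11
atomics on a toy one-word pte. The demo_* names are hypothetical, and the
relaxed load/store pair merely models a plain read followed by a plain write.

```c
#include <assert.h>
#include <stdatomic.h>

typedef atomic_ulong demo_pte_t;

/* Atomic exchange: safe against concurrent updates to the entry. */
static unsigned long demo_get_and_clear(demo_pte_t *ptep)
{
    return atomic_exchange(ptep, 0ul);
}

/*
 * With full != 0 the whole address space is being torn down, so a plain
 * read followed by a plain write is enough; otherwise fall back to the
 * atomic version.
 */
static unsigned long demo_get_and_clear_full(demo_pte_t *ptep, int full)
{
    unsigned long pte;

    assert(ptep);
    if (!full)
        return demo_get_and_clear(ptep);
    pte = atomic_load_explicit(ptep, memory_order_relaxed);
    atomic_store_explicit(ptep, 0ul, memory_order_relaxed);
    return pte;
}
```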
|
|
It's common practice to msync a large address range regularly, in which
often only a few ptes have actually been dirtied since the previous pass.
sync_pte_range then goes much faster if it tests whether pte is dirty
before locating and accessing each struct page cacheline; and it is hardly
slowed by ptep_clear_flush_dirty repeating that test in the opposite case,
when every pte actually is dirty.
But beware, s390's pte_dirty always says false, since its dirty bit is kept
in the storage key, located via the struct page address. So skip this
optimization in its case: use a pte_maybe_dirty macro which just says true
if page_test_and_clear_dirty is implemented.
Signed-off-by: Abhijit Karmarkar <abhijitk@veritas.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
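The pte_maybe_dirty idea can be sketched as a macro that degrades to "always
true" when the dirty state lives outside the pte. The demo_* names, the bit
value and DEMO_HAVE_PAGE_TEST_AND_CLEAR_DIRTY are hypothetical.

```c
#include <assert.h>

typedef unsigned long demo_pte_t;
#define DEMO_PTE_DIRTY 0x40ul                 /* illustrative bit */
#define demo_pte_dirty(pte) (((pte) & DEMO_PTE_DIRTY) != 0)

/* Define DEMO_HAVE_PAGE_TEST_AND_CLEAR_DIRTY to model an s390-like arch. */
#ifndef DEMO_HAVE_PAGE_TEST_AND_CLEAR_DIRTY
#define demo_pte_maybe_dirty(pte) demo_pte_dirty(pte)
#else
#define demo_pte_maybe_dirty(pte) (1)         /* pte carries no dirty bit */
#endif

/* Count the ptes that need the expensive struct page lookup. */
static int demo_count_sync_candidates(const demo_pte_t *ptes, int n)
{
    int i, hits = 0;

    assert(ptes);
    for (i = 0; i < n; i++)
        if (demo_pte_maybe_dirty(ptes[i]))
            hits++;
    return hits;
}
```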
|
|
ia64 and sparc64 hurriedly had to introduce their own variants of
pgd_addr_end, to leapfrog over the holes in their virtual address spaces which
the final clear_page_range suddenly presented when converted from pgd_index to
pgd_addr_end. But now that free_pgtables respects the vma list, those holes
are never presented, and the arch variants can go.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Recently on IA-64, we found an issue where old data could be used by
apps. The sequence of operations, which includes a few mprotects from user
space (glibc), goes like this:
1- The text region of an executable is mmaped using
PROT_READ|PROT_EXEC. As a result, a shared page is allocated to the user.
2- The user then requests the text region to be mprotected with
PROT_READ|PROT_WRITE. The kernel removes the execute permission and leaves
the read permission on the text region.
3- A subsequent write operation by the user results in a page fault,
eventually resulting in a COW break. The user gets a new private copy of the
page. At this point the kernel marks the new page for deferred flush.
4- The user then requests the text region to be mprotected back with
PROT_READ|PROT_EXEC. The mprotect support code in the kernel flushes the
caches, updates the PTEs and then flushes the TLBs. However, after
updating the PTEs with new permissions, we don't let the arch specific
code know about the new mappings (through an update_mmu_cache-like
routine). IA-64 typically uses update_mmu_cache to check for the
deferred flush flag (that got set in step 3) to maintain cache coherency
lazily (the local I and D caches on IA-64 are incoherent).
DavidM suggested that we would need to add a hook in the function
change_pte_range in mm/mprotect.c. This would let the architecture specific
code look at the new ptes to decide if it needs to update any other
architectural/kernel state based on the updated (new permissions) PTE
values.
We have added a new hook lazy_mmu_prot_update(pte_t) that gets called when
protection bits in PTEs change. This hook gives arch specific code an
opportunity to do whatever is needed. On IA-64 this will be used for lazily
making the I and D caches coherent.
Signed-off-by: David Mosberger <davidm@hpl.hp.com>
Signed-off-by: Rohit Seth <rohit.seth@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Nick Piggin's patch to fold away most of the pud and pmd levels when not
required. Adjusted to define minimal pud_addr_end (in the 4LEVEL_HACK
case too) and pmd_addr_end. Responsible for half of the savings.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Begin the pagetable walker cleanup with a straightforward example,
mprotect's change_protection. Started out from Nick Piggin's for_each
proposal, but I prefer less hidden; and these are all do while loops,
which degrade slightly when converted to for loops.
Firmly agree with Andi and Nick that addr,end is the way to go: size is
good at the user interface level, but unhelpful down in the loops. And
the habit of an "address" which is actually an offset from some base has
bitten us several times: use proper address at each level, whyever not?
Don't apply each mask at two levels: all we need is a set of macros
pgd_addr_end, pud_addr_end, pmd_addr_end to give the address of the end
of each range. Which need to take the min of two addresses, with 0 as
the greatest. Started out with a different macro, assumed end never 0;
but clear_page_range (alone) might be passed end 0 by some out-of-tree
memory layouts: could special case it, but this macro compiles smaller.
Check "addr != end" instead of "addr < end" to work on that end 0 case.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
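The "min of two addresses, with 0 as the greatest" trick can be sketched as
follows. The shift value and the demo_* names are hypothetical; the comparison
after subtracting 1 is the technique the message describes.

```c
#include <assert.h>

#define DEMO_PGDIR_SHIFT 22                       /* hypothetical: 4 MiB per entry */
#define DEMO_PGDIR_SIZE  (1ul << DEMO_PGDIR_SHIFT)
#define DEMO_PGDIR_MASK  (~(DEMO_PGDIR_SIZE - 1))

/*
 * End of the range covered at this level: min(next boundary, end),
 * where an address of 0 counts as the greatest value.
 */
static unsigned long demo_pgd_addr_end(unsigned long addr, unsigned long end)
{
    unsigned long boundary = (addr + DEMO_PGDIR_SIZE) & DEMO_PGDIR_MASK;

    /* Subtracting 1 wraps 0 around to the maximum, so 0 sorts last. */
    return (boundary - 1 < end - 1) ? boundary : end;
}

/* Walk [addr, end) chunk by chunk; "addr != end" also handles end == 0. */
static int demo_count_chunks(unsigned long addr, unsigned long end)
{
    int chunks = 0;

    assert(addr != end);
    do {
        addr = demo_pgd_addr_end(addr, end);
        chunks++;
    } while (addr != end);
    return chunks;
}
```

Because both operands are shifted by the same amount, ordinary addresses keep
their order and only 0 moves to the top.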
|
|
Replace the repetitive p?d_none, p?d_bad, p?d_ERROR, p?d_clear clauses
by pgd_none_or_clear_bad, pud_none_or_clear_bad, pmd_none_or_clear_bad
inlines throughout common and i386 - avoids a sprinkling of "unlikely"s.
Tests inline, but unlikely error handling in mm/memory.c - so the ERROR
file and line won't tell much; but it comes too late anyway, and hardly
ever seen outside development.
Let mremap use them in get_one_pte_map, as it already did in _nested;
but leave follow_page and untouched_anonymous page just skipping _bad
as before - they don't have quite the same ownership of the mm.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
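A sketch of the none_or_clear_bad pattern at one level: the test is inline and
the error handling sits in a separate function. The entry layout, the "valid
table" pattern and the demo_* names are hypothetical, and a counter stands in
for the error printout.

```c
#include <assert.h>

typedef struct { unsigned long val; } demo_pmd_t;
#define DEMO_TABLE_BITS 0x67ul   /* illustrative "valid table" flag pattern */

static int demo_bad_reports;     /* stands in for the error printout */

static int demo_pmd_none(demo_pmd_t pmd) { return pmd.val == 0; }

static int demo_pmd_bad(demo_pmd_t pmd)
{
    return (pmd.val & 0xffful) != DEMO_TABLE_BITS;
}

/* Slow path kept out of line: report and clear the corrupt entry. */
static void demo_pmd_clear_bad(demo_pmd_t *pmd)
{
    demo_bad_reports++;
    pmd->val = 0;
}

/* Returns 1 if the entry should be skipped (empty or bad), 0 to descend. */
static inline int demo_pmd_none_or_clear_bad(demo_pmd_t *pmd)
{
    assert(pmd);
    if (demo_pmd_none(*pmd))
        return 1;
    if (demo_pmd_bad(*pmd)) {
        demo_pmd_clear_bad(pmd);
        return 1;
    }
    return 0;
}
```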
|
|
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I'm taking a slightly different approach this time around so things
are easier to integrate. Here is the first patch which builds the
infrastructure. Basically:
1) Add set_pte_at() which is set_pte() with 'mm' and 'addr' arguments
added. All generic code uses set_pte_at().
Most platforms simply get this define:
#define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval)
I chose this method over simply changing all set_pte() call sites
because many platforms implement this in assembler and it would
take forever to preserve the build and stabilize things if modifying
that was necessary.
Soon, with platform maintainers' help, we can kill off set_pte() entirely.
To be honest, there are only a handful of set_pte() call sites in the
arch specific code.
Actually, in this patch ppc64 is completely set_pte() free and does not
define it.
2) pte_clear() gets 'mm' and 'addr' arguments now.
This had a cascading effect on many ptep_test_and_*() routines. Specifically:
a) ptep_test_and_clear_{young,dirty}() now take 'vma' and 'address' args.
b) ptep_get_and_clear now takes 'mm' and 'address' args.
c) ptep_mkdirty was deleted, unused by any code.
d) ptep_set_wrprotect now takes 'mm' and 'address' args.
I've tested this patch as follows:
1) compile and run tested on sparc64/SMP
2) compile tested on:
a) ppc64/SMP
b) i386 both with and without PAE enabled
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This avoids userspace mm corruption during COWs with threads (i.e.
malloc;fork;clone) on x86 PAE with >4G of ram
Signed-Off-By: Andrea Arcangeli <andrea@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Changeset
roland@redhat.com[torvalds]|ChangeSet|20040624165002|30880
inadvertently broke ia64 because the patch assumed that pgd_offset_k() is
just an optimization of pgd_offset(), which it is not. This patch fixes
the problem by introducing pgd_offset_gate(). On architectures on which
the gate area lives in the user's address-space, this should be aliased to
pgd_offset() and on architectures on which the gate area lives in the
kernel-mapped segment, this should be aliased to pgd_offset_k().
This bug was found and tracked down by Peter Chubb.
Signed-off-by: <davidm@hpl.hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
ptep_establish() is used to establish a new mapping at COW time,
and it always replaces a non-writable page mapping with a totally
new page mapping that is dirty (and likely writable, although ptrace
may cause a non-writable new mapping). Because it was nonwritable,
we don't have to worry about losing concurrent dirty page bit updates.
ptep_update_access_flags() leaves the same page mapping, but updates
the accessed/dirty/writable bits (it only ever sets them, and never
removes any permissions). Often easier, but it may race with a dirty
bit update on another CPU.
Booted on x86 and ppc64.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
helper function to write-back the dirty and accessed bits from
ptep_establish().
Right now this defaults to the same old "set_pte()" that we've
always done, except for x86 where we now fix the (unlikely)
race in updating accessed bits and dropping a concurrent dirty
bit.
|
|
preparation for pte update race fix.
This does not actually use the information yet, but
the next few patches will start to put it to some
good use.
|
|
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
this is another s/390 related mm patch. It introduces the concept of
physical dirty and referenced bits into the common mm code. I always
had the nagging feeling that the pte functions for setting/clearing
the dirty and referenced bits are not appropriate for s/390. It works
but it is a bit of a hack.
In the wake of rmap it is now possible to put a much better solution
into place. The idea is simple: since there are no dirty/referenced
bits in the pte, make these functions nops on s/390 and add operations
on the physical page to the appropriate places. For the referenced bit
this is the page_referenced() function. For the dirty bit there are
two relevant spots: in page_remove_rmap after the last user of the
page removed its reverse mapping and in try_to_unmap after the last
user was unmapped. There are two new functions to accomplish this:
* page_test_and_clear_dirty: Test and clear the dirty bit of a
physical page. This function is analogous to ptep_test_and_clear_dirty
but gets a struct page as argument instead of a pte_t pointer.
* page_test_and_clear_young: Test and clear the referenced bit
of a physical page. This function is analogous to ptep_test_and_clear_young
but gets a struct page as argument instead of a pte_t pointer.
It's pretty straightforward and with it the s/390 mm makes much more
sense. You'll need the tlb flush optimization patch for the patch.
Comments ?
|
|
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
On the s/390 architecture we still have the issue with tlb flushing and the
ipte instruction. We can optimize the tlb flushing a lot with some minor
interface changes between the arch backend and the memory management core.
In the end the whole thing is about the Invalidate Page Table Entry (ipte)
instruction. The instruction sets the invalid bit in the pte and removes the
tlb for the page on all cpus for the virtual to physical mapping of the page
in a particular address space. The nice thing is that only the tlb for this
page gets removed, all the other tlbs stay valid. The reason we can't use
ipte to implement flush_tlb_page() is one of the requirements of the
instruction: the pte that should get flushed needs to be *valid*.
I'd like to add the following four functions to the mm interface:
* ptep_establish: Establish a new mapping. This sets a pte entry to a
page table and flushes the tlb of the old entry on all cpus if it
exists. This is more or less what establish_pte in mm/memory.c does
right now but without the update_mmu_cache call.
* ptep_test_and_clear_and_flush_young. Do what ptep_test_and_clear_young
does and flush the tlb.
* ptep_test_and_clear_and_flush_dirty. Do what ptep_test_and_clear_dirty
does and flush the tlb.
* ptep_get_and_clear_and_flush: Do what ptep_get_and_clear does and
flush the tlb.
The s/390 specific functions in include/pgtable.h define their own optimized
version of these four functions by use of the ipte.
To avoid defining these functions for every architecture I added them
to include/asm-generic/pgtable.h. Since i386/x86 and others don't include
this header yet and define their own version of the functions found there I
#ifdef'd all functions in include/asm-generic/pgtable.h to be able to pick
the ones that are needed for each architecture (see patch for details).
With the new functions in place it is easy to do the optimization, e.g. the
sequence
ptep_get_and_clear(ptep);
flush_tlb_page(vma, address);
gets replaced by
ptep_get_and_clear_and_flush(vma, address, ptep);
The old sequence still works but it is suboptimal on s/390.
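A sketch of what a generic fallback for the combined operation could look
like, with an #ifdef guard so an architecture can supply its own version. The
demo_* names and the guard macro are hypothetical, the vma argument is
omitted, and the flush is only counted.

```c
#include <assert.h>

typedef unsigned long demo_pte_t;

static int demo_tlb_flushes;     /* counts simulated flush_tlb_page() calls */

/* Stand-in for flush_tlb_page(); only counts calls in this sketch. */
static void demo_flush_tlb_page(unsigned long address)
{
    (void)address;
    demo_tlb_flushes++;
}

static demo_pte_t demo_ptep_get_and_clear(demo_pte_t *ptep)
{
    demo_pte_t pte = *ptep;

    *ptep = 0;
    return pte;
}

/*
 * Generic fallback: clear, then flush. An s/390-like arch would define
 * DEMO_HAVE_ARCH_GET_AND_CLEAR_AND_FLUSH and provide a version that does
 * both steps with one instruction while the pte is still valid.
 */
#ifndef DEMO_HAVE_ARCH_GET_AND_CLEAR_AND_FLUSH
static demo_pte_t demo_ptep_get_and_clear_and_flush(unsigned long address,
                                                    demo_pte_t *ptep)
{
    demo_pte_t pte;

    assert(ptep);
    pte = demo_ptep_get_and_clear(ptep);
    demo_flush_tlb_page(address);
    return pte;
}
#endif
```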
|
|
|