| Age | Commit message | Author |
|
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
Add support for a different number of page table levels depending
on the highest address used by a process. This causes a 31 bit
process to use a two level page table instead of the four level page
table that is the default after the pud has been introduced. Likewise
a normal 64 bit process will use three levels instead of four. Only
if a process runs out of the 4 terabytes that can be addressed with
a three level page table is the fourth level dynamically added. Then
the process can use up to 8 petabytes.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
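
The level arithmetic described in this message can be checked with a small sketch. It assumes 4K pages, 256-entry page tables and 2048-entry segment/region tables, which is consistent with the 2 GB / 4 TB / 8 PB limits quoted above. The helper names `limit_for_levels` and `levels_needed` are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Address bits translated at each stage (assumed table geometry). */
enum {
    PAGE_BITS = 12, /* 4K page offset */
    PTE_BITS  = 8,  /* 256 entries per page table */
    CRST_BITS = 11  /* 2048 entries per segment/region table */
};

/* First address that can no longer be reached with 'levels' levels (2..4). */
static uint64_t limit_for_levels(int levels)
{
    int bits = PAGE_BITS + PTE_BITS + (levels - 1) * CRST_BITS;
    return (uint64_t)1 << bits;
}

/* Smallest number of levels (2..4) that still covers 'highest'. */
static int levels_needed(uint64_t highest)
{
    int levels = 2;

    while (levels < 4 && highest >= limit_for_levels(levels))
        levels++;
    return levels;
}
```

With these assumptions two levels reach 2^31 bytes (2 GB), three levels reach 2^42 bytes (4 TB) and four levels reach 2^53 bytes (8 PB).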
|
|
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
This patch implements 1K/2K page table pages for s390.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
Background: I've implemented 1K/2K page tables for s390. These sub-page
page tables are required to properly support the s390 virtualization
instruction with KVM. The SIE instruction requires that the page tables
have 256 page table entries (pte) followed by 256 page status table entries
(pgste). The pgstes are only required if the process is using the SIE
instruction. The pgstes are updated by the hardware and by the hypervisor
for a number of reasons, one of them is dirty and reference bit tracking.
To avoid wasting memory the standard pte table allocation should return
1K/2K (31/64 bit) and 2K/4K if the process is using SIE.
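
The sizes quoted in the background follow from 256 entries per table. The sketch below assumes 4-byte ptes for 31 bit and 8-byte ptes for 64 bit, and that the pgste area has the same size as the pte area. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PTRS_PER_PTE    256u   /* entries per page table, per the message */
#define PAGE_SIZE_BYTES 4096u  /* s390 page size, per the message */

/* Bytes needed for one page table; doubled when pgstes follow the ptes. */
static size_t pgtable_bytes(size_t pte_size, int uses_sie)
{
    size_t bytes = PTRS_PER_PTE * pte_size;

    return uses_sie ? 2 * bytes : bytes;
}

/* How many such page tables fit into one 4K page. */
static size_t pgtables_per_page(size_t pte_size, int uses_sie)
{
    return PAGE_SIZE_BYTES / pgtable_bytes(pte_size, uses_sie);
}
```

Under these assumptions a 31 bit process without SIE fits four page tables into one page, while a 64 bit process using SIE needs a full page per table.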
Problem: Page size on s390 is 4K, page table size is 1K or 2K. That means
the s390 version for pte_alloc_one cannot return a pointer to a struct
page. Trouble is that with the CONFIG_HIGHPTE feature on x86 pte_alloc_one
cannot return a pointer to a pte either, since that would require more than
32 bit for the return value of pte_alloc_one (and the pte * would not be
accessible since it's not kmapped).
Solution: The only solution I found to this dilemma is a new typedef: a
pgtable_t. For s390 pgtable_t will be a (pte *) - to be introduced with a
later patch. For everybody else it will be a (struct page *). The
additional problem with the initialization of the ptl lock and the
NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and
a destructor pgtable_page_dtor. The page table allocation and free
functions need to call these two whenever a page table page is allocated or
freed. pmd_populate will get a pgtable_t instead of a struct page pointer.
To get the pgtable_t back from a pmd entry that has been installed with
pmd_populate a new function pmd_pgtable is added. It replaces the pmd_page
call in free_pte_range and apply_to_pte_range.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
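
A minimal user-space sketch of the constructor/destructor pairing described above, for the non-s390 case where pgtable_t is a struct page pointer. The `struct page` stand-in, the `nr_pagetable` counter and the `_sketch` functions are illustrative assumptions, not the kernel's definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the kernel's struct page; only what the sketch needs. */
struct page {
    int ptl_initialized;   /* models the page table lock being set up */
};

typedef struct page *pgtable_t;   /* "everybody else" per the message */

static long nr_pagetable;         /* models the NR_PAGETABLE accounting */

/* Called whenever a page table page is allocated. */
static void pgtable_page_ctor(struct page *page)
{
    page->ptl_initialized = 1;
    nr_pagetable++;
}

/* Called whenever a page table page is freed. */
static void pgtable_page_dtor(struct page *page)
{
    page->ptl_initialized = 0;
    nr_pagetable--;
}

/* Hypothetical allocation path: allocate, then run the constructor. */
static pgtable_t pte_alloc_one_sketch(void)
{
    struct page *page = calloc(1, sizeof(*page));

    if (!page)
        return NULL;
    pgtable_page_ctor(page);
    return page;
}

/* Hypothetical free path: run the destructor, then release the page. */
static void pte_free_sketch(pgtable_t pte)
{
    pgtable_page_dtor(pte);
    free(pte);
}
```

Keeping lock setup and accounting in the ctor/dtor pair means the allocation functions do not need to know whether pgtable_t is a page pointer or a pte pointer.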
|
|
(with Martin Schwidefsky <schwidefsky@de.ibm.com>)
The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as
first argument. The free functions do not get the mm_struct argument. This
is 1) asymmetrical and 2) to do mm related page table allocations the mm
argument is needed on the free function as well.
[kamalesh@linux.vnet.ibm.com: i386 fix]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Get independent from asm-generic/4level-fixup.h
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
- De-confuse the defines for the address-space-control-elements
and the segment/region table entries.
- Create out of line functions for page table allocation / freeing.
- Simplify get_shadow_xxx functions.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
The current tlb flushing code for page table entries violates the
s390 architecture in a small detail. The relevant section from the
principles of operation (SA22-7832-02 page 3-47):
"A valid table entry must not be changed while it is attached
to any CPU and may be used for translation by that CPU except to
(1) invalidate the entry by using INVALIDATE PAGE TABLE ENTRY or
INVALIDATE DAT TABLE ENTRY, (2) alter bits 56-63 of a page-table
entry, or (3) make a change by means of a COMPARE AND SWAP AND
PURGE instruction that purges the TLB."
That means if one thread of a multithreaded application uses a vma
while another thread does an unmap on it, the page table entries of
that vma need to get removed with IPTE, IDTE or CSP. In some strange
and rare situations a cpu could check-stop (die) because an entry has
been pushed out of the TLB that is still needed to complete a
(milli-coded) instruction. I've never seen it happen with the current
code on any of the supported machines, so right now this is a
theoretical problem. But I want to fix it nevertheless, to avoid
headaches in the future.
To get this implemented correctly without changing common code the
primitives ptep_get_and_clear, ptep_get_and_clear_full and
ptep_set_wrprotect need to use the IPTE instruction to invalidate the
pte before the new pte value gets stored. If IPTE is always used for
the three primitives three important operations will have a performance
hit: fork, mprotect and exit_mmap. Time for some workarounds:
* 1: ptep_get_and_clear_full is used in unmap_vmas to remove page
table entries in a batched tlb gather operation. If the mmu_gather
context passed to unmap_vmas has been started with full_mm_flush==1
or if only one cpu is online or if the only user of a mm_struct is the
current process then the fullmm indication in the mmu_gather context is
set to one. All TLBs for mm_struct are flushed by the tlb_gather_mmu
call. No new TLBs can be created while the unmap is in progress. In
this case ptep_get_and_clear_full clears the ptes with a simple store.
* 2: ptep_get_and_clear is used in change_protection to clear the
ptes from the page tables before they are reentered with the new
access flags. At the end of the update flush_tlb_range clears the
remaining TLBs. In general the ptep_get_and_clear has to issue IPTE
for each pte and flush_tlb_range is a nop. But if there is only one
user of the mm_struct then ptep_get_and_clear uses simple stores
to do the update and flush_tlb_range will flush the TLBs.
* 3: Similar to 2, ptep_set_wrprotect is used in copy_page_range
for a fork to make all ptes of a cow mapping read-only. At the end of
copy_page_range dup_mmap will flush the TLBs with a call to
flush_tlb_mm. Check for mm->mm_users and if there is only one user
avoid using IPTE in ptep_set_wrprotect and let flush_tlb_mm clear the
TLBs.
Overall for single threaded programs the tlb flush code now performs
better, for multi threaded programs it is slightly worse. In particular
exit_mmap() now does a single IDTE for the mm and then just frees every
page cache reference and every page table page directly without a delay
over the mmu_gather structure.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
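
The three workarounds boil down to a decision between a simple store and IPTE. The sketch below models only that decision. The types `mm_sketch` and `clear_method` and the function names are hypothetical, and no real TLB instruction is issued.

```c
#include <assert.h>
#include <stdbool.h>

struct mm_sketch {
    int mm_users;   /* models mm->mm_users */
};

enum clear_method {
    CLEAR_SIMPLE_STORE,   /* plain store, TLB flushed later in bulk */
    CLEAR_IPTE            /* invalidate with IPTE before the update */
};

/* Workaround 1: when the gather context is treated as a full-mm flush. */
static bool gather_fullmm(bool full_mm_flush, int online_cpus,
                          bool only_user_is_current)
{
    return full_mm_flush || online_cpus == 1 || only_user_is_current;
}

/* ptep_get_and_clear_full: simple store only in the fullmm case. */
static enum clear_method clear_full_method(bool fullmm)
{
    return fullmm ? CLEAR_SIMPLE_STORE : CLEAR_IPTE;
}

/* Workarounds 2 and 3: a single user of the mm allows simple stores. */
static enum clear_method pte_update_method(const struct mm_sketch *mm)
{
    return mm->mm_users == 1 ? CLEAR_SIMPLE_STORE : CLEAR_IPTE;
}
```

This matches the summary in the message: single threaded programs avoid IPTE almost everywhere, while multi threaded programs pay for it on each pte.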
|
|
There are several s390 diagnose calls, which must be executed below the
2GB memory boundary. In order to enforce this, those diagnoses must be
compiled into the kernel. Currently diag 14 can be called within the
vmur kernel module from addresses above 2GB. This leads to specification
exceptions. This patch moves diag10, diag14 and diag210 into the new
diag.c file.
Signed-off-by: Michael Holzheu <holzheu@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
This provides a noexec protection on s390 hardware. Our hardware does
not have any bits left in the pte for a hw noexec bit, so this is a
different approach using shadow page tables and a special addressing
mode that allows separate address spaces for code and data.
As a special feature of our "secondary-space" addressing mode, separate
page tables can be specified for the translation of data addresses
(storage operands) and instruction addresses. The shadow page table is
used for the instruction addresses and the standard page table for the
data addresses.
The shadow page table is linked to the standard page table by a pointer
in page->lru.next of the struct page corresponding to the page that
contains the standard page table (since page->private is not really
private with the pte_lock and the page table pages are not in the LRU
list).
Depending on the software bits of a pte, it is either inserted into
both page tables or just into the standard (data) page table. Pages of
a vma that does not have the VM_EXEC bit set get mapped only in the
data address space. Any attempt to execute code on such a page will cause a
page translation exception. The standard reaction to this is a SIGSEGV
with two exceptions: the two system call opcodes 0x0a77 (sys_sigreturn)
and 0x0aad (sys_rt_sigreturn) are allowed. They are stored by the
kernel to the signal stack frame. Unfortunately, the signal return
mechanism cannot be modified to use an SA_RESTORER because the
exception unwinding code depends on the system call opcode stored
behind the signal stack frame.
This feature requires that user space is executed in secondary-space
mode and the kernel in home-space mode, which means that the addressing
modes need to be switched and that the noexec protection only works
for user space.
After switching the addressing modes, we cannot use the mvcp/mvcs
instructions anymore to copy between kernel and user space. A new
mvcos instruction has been added to the z9 EC/BC hardware which allows
copying between arbitrary address spaces, but on older hardware the
page tables need to be walked manually.
Signed-off-by: Gerald Schaefer <geraldsc@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
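
The dual-table idea can be modelled in a few lines: data accesses consult the standard table, instruction fetches consult the shadow table, and a page lands in the shadow table only when its vma is executable. Everything here (`pgtable_pair`, `map_page`, the invalid marker value) is a hypothetical user-space model, not the s390 implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define NPTES       256
#define PTE_INVALID 0UL   /* assumed marker for "no translation" */

struct pgtable_pair {
    unsigned long data[NPTES];     /* standard table: data addresses */
    unsigned long shadow[NPTES];   /* shadow table: instruction addresses */
};

/* Insert into both tables, or only the data table without VM_EXEC. */
static void map_page(struct pgtable_pair *pt, unsigned int idx,
                     unsigned long pte, bool vm_exec)
{
    pt->data[idx] = pte;
    pt->shadow[idx] = vm_exec ? pte : PTE_INVALID;
}

static bool can_access_data(const struct pgtable_pair *pt, unsigned int idx)
{
    return pt->data[idx] != PTE_INVALID;
}

/* A miss here corresponds to the page translation exception. */
static bool can_fetch_instruction(const struct pgtable_pair *pt,
                                  unsigned int idx)
{
    return pt->shadow[idx] != PTE_INVALID;
}
```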
|
|
Virtual memmap support for s390. Inspired by the ia64 implementation.
Unlike ia64 we need a mechanism which allows us to dynamically attach
shared memory regions.
These memory regions are accessed via the dcss device driver. dcss
implements the 'direct_access' operation, which requires struct pages
for every single shared page.
Therefore this implementation provides an interface to attach/detach
shared memory:
int add_shared_memory(unsigned long start, unsigned long size);
int remove_shared_memory(unsigned long start, unsigned long size);
The purpose of the add_shared_memory function is to add the given
memory range to the 1:1 mapping and to make sure that the
corresponding range in the vmemmap is backed with physical pages.
It also initialises the new struct pages.
remove_shared_memory in turn only invalidates the page table
entries in the 1:1 mapping. The page tables and the memory used for
struct pages in the vmemmap are currently not freed. They will be
reused when the next segment will be attached.
Given that the maximum size of a shared memory region is 2GB and
in addition all regions must reside below 2GB this is not too much of
a restriction, but there is room for improvement.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
Use page_to_phys and pfn_to_page to avoid open-coded mem_map usage.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
|
Changed and simplified some page table related #defines and code.
Signed-off-by: Gerald Schaefer <geraldsc@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
|
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
|
|
set_pgdir hasn't been needed for a very long time. Remove the leftover
implementation on sh64 and the stub on s390.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Richard Curnow <rc@rc0.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch cleans up asm-*/pgalloc.h by removing the generous includes
which are obsoleted (duplicated) by including linux/mm.h (and friends).
They are double checked and verified by the PLM cross compiling service
(the patched kernel gives the same warnings/errors as the unpatched)
http://osdl.org/plm-cgi/plm?module=patch_info&patch_id=4313
Signed-off-by: Herbert Pötzl <herbert@13thfloor.at>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I'm taking a slightly different approach this time around so things
are easier to integrate. Here is the first patch which builds the
infrastructure. Basically:
1) Add set_pte_at() which is set_pte() with 'mm' and 'addr' arguments
added. All generic code uses set_pte_at().
Most platforms simply get this define:
#define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval)
I chose this method over simply changing all set_pte() call sites
because many platforms implement this in assembler and it would
take forever to preserve the build and stabilize things if modifying
that was necessary.
Soon, with platform maintainers' help, we can kill off set_pte() entirely.
To be honest, there are only a handful of set_pte() call sites in the
arch specific code.
Actually, in this patch ppc64 is completely set_pte() free and does not
define it.
2) pte_clear() gets 'mm' and 'addr' arguments now.
This had a cascading effect on many ptep_test_and_*() routines. Specifically:
a) ptep_test_and_clear_{young,dirty}() now take 'vma' and 'address' args.
b) ptep_get_and_clear now takes 'mm' and 'address' args.
c) ptep_mkdirty was deleted, unused by any code.
d) ptep_set_wrprotect now takes 'mm' and 'address' args.
I've tested this patch as follows:
1) compile and run tested on sparc64/SMP
2) compile tested on:
a) ppc64/SMP
b) i386 both with and without PAE enabled
Signed-off-by: David S. Miller <davem@davemloft.net>
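
A small sketch of how the compatibility define behaves on a platform that ignores the new arguments. The `pte_t` layout and the trivial `set_pte` body are simplified stand-ins. Only the macro itself is taken from the message.

```c
#include <assert.h>

typedef struct { unsigned long val; } pte_t;   /* simplified stand-in */
struct mm_struct;                              /* opaque in this sketch */

/* Existing platform primitive: store the new pte value. */
static void set_pte(pte_t *ptep, pte_t pteval)
{
    *ptep = pteval;
}

/* The define most platforms get: 'mm' and 'addr' are simply dropped. */
#define set_pte_at(mm, addr, ptep, pteval) set_pte(ptep, pteval)
```

Because the macro discards `mm` and `addr`, generic code can pass them everywhere without any cost on platforms that do not need them.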
|
|
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
Just found a small bug in pgalloc for s390*. Comparing notes with other
architectures I found that pte_alloc_one is sick for alpha and sparc64 as
well.
|
|
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add collaborative memory management interface.
|
|
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
On the s/390 architecture we still have the issue with tlb flushing and the
ipte instruction. We can optimize the tlb flushing a lot with some minor
interface changes between the arch backend and the memory management core.
In the end the whole thing is about the Invalidate Page Table Entry (ipte)
instruction. The instruction sets the invalid bit in the pte and removes the
tlb for the page on all cpus for the virtual to physical mapping of the page
in a particular address space. The nice thing is that only the tlb for this
page gets removed, all the other tlbs stay valid. The reason we can't use
ipte to implement flush_tlb_page() is one of the requirements of the
instruction: the pte that should get flushed needs to be *valid*.
I'd like to add the following four functions to the mm interface:
* ptep_establish: Establish a new mapping. This sets a pte entry to a
page table and flushes the tlb of the old entry on all cpus if it
exists. This is more or less what establish_pte in mm/memory.c does
right now but without the update_mmu_cache call.
* ptep_test_and_clear_and_flush_young. Do what ptep_test_and_clear_young
does and flush the tlb.
* ptep_test_and_clear_and_flush_dirty. Do what ptep_test_and_clear_dirty
does and flush the tlb.
* ptep_get_and_clear_and_flush: Do what ptep_get_and_clear does and
flush the tlb.
The s/390 specific functions in include/pgtable.h define their own optimized
version of these four functions by use of the ipte.
To avoid defining these functions for every architecture, I added them
to include/asm-generic/pgtable.h. Since i386/x86 and others don't include
this header yet and define their own version of the functions found there I
#ifdef'd all functions in include/asm-generic/pgtable.h to be able to pick
the ones that are needed for each architecture (see patch for details).
With the new functions in place it is easy to do the optimization, e.g. the
sequence
ptep_get_and_clear(ptep);
flush_tlb_page(vma, address);
gets replaced by
ptep_get_and_clear_and_flush(vma, address, ptep);
The old sequence still works but it is suboptimal on s/390.
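
A generic fallback for the combined primitive can be sketched as the old two-step sequence wrapped in one function. The types and the stubbed `flush_tlb_page` below are user-space stand-ins chosen for illustration. An s390 version would use ipte instead, as the message explains.

```c
#include <assert.h>

typedef struct { unsigned long val; } pte_t;   /* simplified stand-in */

struct vma_sketch {
    int flushes;   /* counts flush_tlb_page calls in this model */
};

/* Return the old pte and clear the slot. */
static pte_t ptep_get_and_clear(pte_t *ptep)
{
    pte_t old = *ptep;

    ptep->val = 0;
    return old;
}

/* Stub: a real implementation would flush the TLB entry for 'address'. */
static void flush_tlb_page(struct vma_sketch *vma, unsigned long address)
{
    (void)address;
    vma->flushes++;
}

/* Generic fallback: clear first, then flush, and hand back the old pte. */
static pte_t ptep_get_and_clear_and_flush(struct vma_sketch *vma,
                                          unsigned long address, pte_t *ptep)
{
    pte_t old = ptep_get_and_clear(ptep);

    flush_tlb_page(vma, address);
    return old;
}
```

Folding both steps into one call gives an architecture room to implement them as a single operation on a pte that is still valid.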
|
|
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
- Add console_unblank in machine_{restart,halt,power_off} to get
all messages on the screen.
- Set console_irq to -1 if condev= parameter is present.
- Fix write_trylock for 64 bit.
- Fix svc restarting.
- System call number on 64 bit is an int. Fix compare in entry64.S.
- Fix tlb flush problem.
- Use the idte instruction to flush tlbs of a particular mm.
- Fix ptrace.
- Add fadvise64_64 system call wrapper.
- Fix pfault handling.
- Do not clobber _PAGE_INVALID_NONE pages in pte_wrprotect.
- Fix siginfo_t size problem (needs to be 128 for s390x, not 136).
- Avoid direct assignment to tsk->state, use __set_task_state.
- Always make any pending restarted system call return -EINTR.
- Add panic_on_oops.
- Display symbol for psw address in show_trace.
- Don't discard sections .exit.text, .exit.data and .eh_frame,
otherwise stabs information for kerntypes will get lost.
- Add memory clobber to assembler inline in ip_fast_checksum for gcc 3.3.
- Fix softirq_pending calls for the current cpu (cpu == smp_processor_id()).
- Remove BUG_ON in irq_enter. Two irq_enters are possible.
|
|
Remove all the open-coded retry loops in various architectures, use
__GFP_REPEAT.
It could be that at some time in the future we change __GFP_REPEAT to give up
after ten seconds or so, so all the checks for failed allocations are
retained.
|
|
Merge s390x and s390 to one architecture.
|
|
s390 include file changes for 2.5.39.
|
|
It has been noticed that across a kernel build many calls to
tlb_flush_mmu() do not have anything to flush, apparently because glibc
is mmapping a file over a previously-mapped region which has no
faulted-in ptes.
This patch detects this case and optimises away a little over one third
of the tlb invalidations.
The functions which potentially cause an invalidate are
tlb_remove_tlb_entry(), pte_free_tlb() and pmd_free_tlb(). These have
been front-ended in asm-generic/tlb.h and the per-arch versions now
have leading double-underscores. The generic versions tag the
mmu_gather_t as needing a flush and then call the arch-specific
version.
tlb_flush_mmu() looks at tlb->need_flush and if it sees that no real
activity has happened, the invalidation is avoided.
The success rate is displayed in /proc/meminfo for the time being. This
should be removed later.
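
The front-ending described here can be sketched as a flag set by the generic wrapper and checked at flush time. The struct and the `arch_` prefixed stub are hypothetical. The message says the real per-arch versions carry leading double underscores, which this sketch avoids because such names are reserved in user-space C.

```c
#include <assert.h>

struct mmu_gather_sketch {
    int need_flush;     /* set when anything was actually removed */
    int arch_flushes;   /* invalidations really performed */
    int avoided;        /* invalidations skipped */
};

/* Stand-in for the arch-specific version; nothing to do in this model. */
static void arch_tlb_remove_tlb_entry(struct mmu_gather_sketch *tlb,
                                      unsigned long addr)
{
    (void)tlb;
    (void)addr;
}

/* Generic front end: tag the gather, then call the arch version. */
static void tlb_remove_tlb_entry(struct mmu_gather_sketch *tlb,
                                 unsigned long addr)
{
    tlb->need_flush = 1;
    arch_tlb_remove_tlb_entry(tlb, addr);
}

/* Skip the invalidation when no real activity has happened. */
static void tlb_flush_mmu(struct mmu_gather_sketch *tlb)
{
    if (!tlb->need_flush) {
        tlb->avoided++;
        return;
    }
    tlb->need_flush = 0;
    tlb->arch_flushes++;
}
```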
|
|
Second patch of the s/390 update. Contains all the include file changes in
include/asm-{s390,s390x}.
|
|
- Al Viro: fix up silly problem in swapfile filp cleanups in 2.5.2
- Tachino Nobuhiro: fix another error return for swapfile filp code
- Robert Love: merge some of Ingo's scheduler fixes
- David Miller: networking, sparc and some scsi driver fixes
- Tim Waugh: parport update
- OGAWA Hirofumi: fatfs cleanups and bugfixes
- Roland Dreier: fix vsscanf buglets.
- Ben LaHaise: include file cleanup
- Andre Hedrick: IDE taskfile update
|
|
- Trond Myklebust: deadlock checking in lockd server
- Tim Waugh: fix up parport wrong #define
- Christoph Hellwig: i2c update, ext2 cleanup
- Al Viro: fix partition handling sanity check.
- Trond Myklebust: make NFS use SLAB_NOFS, and not play games with PF_MEMALLOC
- Ben Fennema: UDF update
- Alan Cox: continued merging
- Chris Mason: get /proc buffer memory sizes right after buf-in-page-cache
|
|
- Johannes Erdfelt: USB updates
- David Howells: more rw-sem stuff
- David Miller: network callback cleanups and fixes
- Jan Harkes: make Coda use the proper VFS layer interfaces, so that it can use
"non-traditional-unix" filesystems without inode numbers for backing store.
|
|
- Ingo Molnar/Al Viro: don't use bforget() on ext2 (and minix) metadata
where we may not be the only owner of the buffer! FS corruption.
- Andi Kleen: IPv6 packet re-assembly fix.
- David Howells: fix up rwsem implementation
- Alan Cox: more merging (S/390 down, ARM to go).
- Jens Axboe: LVM and loop fixes
|
|
- big S/390x 64-bit merge
- typos and license name fixes. doc updates.
- more include file cleanups (phase out "malloc.h")
- even more elevator corner cases.. When not merging, find the best insertion point.
- pmac ide update
- network fixes (netif_wake_queue on tx timeout)
- USB printer select() fix
- NFS client missed initialization, daemon fixed client address check
|
|
|