|
pagevec_swap_free() is now unused.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The speculative page references patch (commit:
e286781d5f2e9c846e012a39653a166e9d31777d) removed the last
pagevec_release_nonlru() caller, so this function can now be removed.
This patch has no functional change.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages. Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
This patch provides the infrastructure to manage pages excluded from
reclaim, i.e., hidden from vmscan. It is based on a patch by Larry Woodman
of Red Hat, reworked to maintain "unevictable" pages on a separate per-zone
LRU list, in order to "hide" them from vmscan.
KOSAKI Motohiro added support for the memory controller's unevictable
LRU list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable. Subsequent patches will add the various
!evictable tests. We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference. If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list. This way, we avoid "stranding" evictable pages on the
unevictable list.
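A simplified sketch of that ordering, using the helper names this series
introduces (statistics, PG_active handling, memcg and locking are elided;
treat it as pseudocode for the re-check, not the exact implementation):

    void putback_lru_page(struct page *page)
    {
    redo:
            if (page_evictable(page, NULL)) {
                    ClearPageUnevictable(page);
                    lru_cache_add_lru(page, page_lru(page)); /* normal LRUs */
            } else {
                    /* sets PG_unevictable and links onto the hidden list */
                    add_page_to_unevictable_list(page);
            }

            /*
             * The page's evictability may have changed while we were
             * adding it.  If an evictable page ended up on the
             * unevictable list, pull it back off and redo the putback,
             * so that evictable pages are never stranded there.
             */
            if (PageUnevictable(page) && page_evictable(page, NULL) &&
                !isolate_lru_page(page)) {
                    put_page(page);         /* drop the ref isolate took */
                    goto redo;
            }

            put_page(page);         /* drop the isolate_lru_page() ref */
    }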
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Split the LRU lists in two, one set for pages that are backed by real file
systems ("file") and one for pages that are backed by memory and swap
("anon"). The latter includes tmpfs.
The advantage of doing this is that the VM will not have to scan over lots
of anonymous pages (which we generally do not want to swap out), just to
find the page cache pages that it should evict.
This patch has the infrastructure and a basic policy to balance how much
we scan the anon lists and how much we scan the file lists. The big
policy changes are in separate patches.
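Roughly, each zone then carries a small array of LRU lists instead of the
old active/inactive pair. A sketch of the shape (the real definitions in
mmzone.h use offset arithmetic, and the unevictable list from the patch
above is omitted here):

    enum lru_list {
            LRU_INACTIVE_ANON,      /* anon, tmpfs, shmem: swap backed  */
            LRU_ACTIVE_ANON,
            LRU_INACTIVE_FILE,      /* pagecache backed by real files   */
            LRU_ACTIVE_FILE,
            NR_LRU_LISTS,
    };

    /* Swap-backed pages go on the anon lists, everything else on file. */
    static inline int page_is_file_cache(struct page *page)
    {
            return !PageSwapBacked(page);
    }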
[lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
[kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
[kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
[hugh@veritas.com: memcg swapbacked pages active]
[hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
[akpm@linux-foundation.org: fix /proc/vmstat units]
[nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
[kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
[kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If vm_swap_full() is true (swap space is more than 50% full), the system
frees swap space at swapin time. With this patch, the system will also free
swap space in the pageout code, when we decide that the page is not a
candidate for swapout (and is just wasting swap space).
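For illustration, the sort of check this adds on the pageout side, as a
hedged sketch (pagevec_swap_free() is the batched helper this work adds; it
is the one removed again in the first entry of this log):

    /*
     * When swap is more than half full, pages we have just decided to
     * keep in memory can give their swap slots back before the batch
     * is put back on the LRU.
     */
    if (unlikely(vm_swap_full()))
            pagevec_swap_free(&pvec);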
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Turn the pagevecs into an array just like the LRUs. This significantly
cleans up the source code and reduces the size of the kernel by about 13kB
after all the LRU lists have been created further down in the split VM
patch series.
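A sketch of the resulting shape (the per-CPU plumbing and the exact helper
spellings are approximate):

    /* One deferred-addition pagevec per LRU list, per CPU. */
    static DEFINE_PER_CPU(struct pagevec, lru_add_pvecs[NR_LRU_LISTS]);

    void __lru_cache_add(struct page *page, enum lru_list lru)
    {
            struct pagevec *pvec = &get_cpu_var(lru_add_pvecs)[lru];

            page_cache_get(page);
            if (!pagevec_add(pvec, page)) {
                    /* batch is full: spill it onto that LRU */
                    ____pagevec_lru_add(pvec, lru);
            }
            put_cpu_var(lru_add_pvecs);
    }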
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Make it possible to include linux/pagevec.h multiple times without
incurring errors due to duplicate definitions.
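That is, the usual guard pattern (the macro name here is illustrative):

    #ifndef _LINUX_PAGEVEC_H
    #define _LINUX_PAGEVEC_H

    /* ... declarations ... */

    #endif /* _LINUX_PAGEVEC_H */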
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Change pagevec "nr" and "cold" back to "unsigned long", because sub-4-byte
accesses can be slow on x86 CPUs older than the Pentium III (they need an
additional "data16" operand-size prefix on the instruction).
This still honours the cacheline alignment, making the size of "pagevec"
structure a power of two (either 64 or 128 bytes).
Haven't been able to see any significant change on performance on my
limited testing.
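For reference, the layout this gives (14 page pointers is the count that
makes the arithmetic come out to exactly 64 or 128 bytes):

    #define PAGEVEC_SIZE    14

    struct pagevec {
            unsigned long nr;
            unsigned long cold;
            struct page *pages[PAGEVEC_SIZE];
    };
    /* 32-bit: 2*4 + 14*4 = 64 bytes;  64-bit: 2*8 + 14*8 = 128 bytes */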
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
We can shrink the pagevec structure to cacheline-align it. It is used all
over the VM reclaim and mpage pagecache read code.
Right now it is 140 bytes on 64-bit and 72 bytes on 32-bit. That's just a
little more than a power of 2 (which would cacheline-align it), so shrink
it to be aligned: 64 bytes on 32-bit and 124 bytes on 64-bit.
It now occupies two cachelines most of the time instead of three.
I changed nr and cold to "unsigned short" because they'll never reach 2^16.
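The layout being described is roughly the following (15 page pointers
matches the 64-byte and 124-byte figures above; on 64-bit the compiler may
pad the two-short header, rounding the actual sizeof up):

    struct pagevec {
            unsigned short nr;
            unsigned short cold;
            struct page *pages[15];
    };
    /* 32-bit: 2*2 + 15*4 = 64 bytes;  64-bit: 2*2 + 15*8 = 124 bytes
     * before any padding of the header. */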
I did some reaim benchmarking on a 4-way PIII (32-byte cacheline) with 512MB RAM:
#### stock 2.6.9-rc1-mm4 ####
Peak load Test: Maximum Jobs per Minute 4144.44 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 4007.86 (average of 3 runs)
Peak load Test: Maximum Jobs per Minute 4207.48 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 3999.28 (average of 3 runs)
#### shrink-pagevec #####
Peak load Test: Maximum Jobs per Minute 4717.88 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 4360.59 (average of 3 runs)
Peak load Test: Maximum Jobs per Minute 4493.18 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 4327.77 (average of 3 runs)
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Move everything over to walking the radix tree via the PAGECACHE_TAG_DIRTY
tag. Remove address_space.dirty_pages.
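The writeback walk then has roughly this shape, sketched with the pagevec
tag-lookup helper as it later appeared (locking, tag clearing and the
->writepage() call are elided):

    struct pagevec pvec;
    pgoff_t index = 0;
    unsigned int i, nr;

    pagevec_init(&pvec, 0);
    while ((nr = pagevec_lookup_tag(&pvec, mapping, &index,
                                    PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE))) {
            for (i = 0; i < nr; i++) {
                    /* lock pvec.pages[i], clear its dirty state and
                     * push it to ->writepage() */
            }
            pagevec_release(&pvec);
    }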
|
|
The vm_writeback address_space operation was designed to provide the VM
with a "clustered writeout" capability. It allowed the filesystem to
perform more intelligent writearound decisions when the VM was trying
to clean a particular page.
I can't say I ever saw any real benefit from this: not much writeout
actually happens on that path, and quite a lot of work has gone into
minimising it.
The default ->vm_writeback a_op which I provided wrote back the pages
in ->dirty_pages order. But there is one scenario in which this causes
problems - writing a single 4G file with mem=4G. We end up with all of
ZONE_NORMAL full of dirty pages, but all writeback effort is against
highmem pages. (Because there is about 1.5G of dirty memory total).
Net effect: the machine stalls ZONE_NORMAL allocation attempts until
the ->dirty_pages writeback advances onto ZONE_NORMAL pages.
This can be fixed most sweetly with additional radix-tree
infrastructure which will be quite complex. Later.
So this patch dumps it all, and goes back to using writepage
against individual pages as they come off the LRU.
|
|
If we're about to return to userspace after performing some swap
readahead, the pages in the deferred-addition LRU queues could stay
there for some time. So drain them after performing readahead.
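In other words, at the end of the readahead path (a sketch; the readahead
loop itself is elided):

            /*
             * The readahead pages were queued with lru_cache_add(), so
             * they may still be sitting in this CPU's deferred-addition
             * pagevec.  Flush it so they reach the LRU promptly.
             */
            lru_add_drain();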
|
|
This is the first in a series of patches which tune up the 2.5
performance under heavy swap loads.
Throughput on stupid swapstormy tests is increased by 1.5x to 3x.
Still about 20% behind 2.4 with multithreaded tests. That is not
easily fixable - the virtual scan tends to apply a form of load
control: particular processes are heavily swapped out so the others can
get ahead. With 2.5 all processes make very even progress and much
more swapping is needed. It's on par with 2.4 for single-process
swapstorms.
In this patch:
The code which tries to start mapped pages out on the active list
doesn't work very well. It uses an "is it mapped into pagetables"
test, which doesn't work for, say, swap readahead pages: they are not
mapped into pagetables when they are spilled onto the LRU.
So create a new `lru_cache_add_active()' function for deferred addition
of pages to their active list.
Also move mark_page_accessed() from filemap.c to swap.c where all
similar functions live. And teach it to not try to move pages which
are in the deferred-addition list onto the active list. That won't
work, and it's bogusly clearing PageReferenced in that case.
The deferred-addition lists are a pest. But lru_cache_add used to be
really expensive in some workloads on some machines, so they must persist.
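After this change mark_page_accessed() looks roughly like this; the
PageLRU() test is what keeps it away from pages still sitting in a
deferred-addition pagevec:

    void mark_page_accessed(struct page *page)
    {
            /*
             * Only promote pages that are actually on an LRU list.
             * Deferred-addition pages have PG_lru clear, so we neither
             * try to activate them nor clear PG_referenced for them.
             */
            if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
                    activate_page(page);
                    ClearPageReferenced(page);
            } else if (!PageReferenced(page)) {
                    SetPageReferenced(page);
            }
    }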
|
|
Add a `cold' hint to struct pagevec, and teach truncate and page
reclaim to use it.
Empirical testing showed that truncate's pages tend to be hot. And page
reclaim's are certainly cold.
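Concretely, the hint is just an extra field, set when the pagevec is
initialised (a sketch):

    struct pagevec {
            unsigned nr;
            unsigned cold;          /* hint: these pages are cache-cold */
            struct page *pages[PAGEVEC_SIZE];
    };

    static inline void pagevec_init(struct pagevec *pvec, int cold)
    {
            pvec->nr = 0;
            pvec->cold = cold;      /* reclaim passes 1, truncate passes 0 */
    }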
|
|
Rewrite these functions to use gang lookup.
- This probably has similar performance to the old code in the common case.
- It will be vastly quicker than current code for the worst case
(single-page truncate).
- invalidate_inode_pages() has been changed. It used to use
page_count(page) as the "is it mapped into pagetables" heuristic. It
now uses the (page->pte.direct != 0) heuristic.
- Removes the worst cause of scheduling latency in the kernel.
- It's a big code cleanup.
- invalidate_inode_pages() has been changed to take an address_space
*, not an inode *.
- the maximum hold times for mapping->page_lock are enormously reduced,
making it quite feasible to turn this into an irq-safe lock. Which, it
seems, is a requirement for sane AIO<->direct-io integration, as well
as possibly other AIO things.
(Thanks Hugh for fixing a bug in this one as well).
(Christoph added some stuff too)
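The resulting loop has this shape (a sketch using the pagevec helpers;
mapping and start are the truncate_inode_pages() arguments, and the
partial-page and "page changed under us" cases are elided):

    struct pagevec pvec;
    pgoff_t next = start;
    int i;

    pagevec_init(&pvec, 0);
    while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
            for (i = 0; i < pagevec_count(&pvec); i++) {
                    struct page *page = pvec.pages[i];

                    next = page->index + 1;         /* resume point */
                    lock_page(page);
                    truncate_complete_page(mapping, page);
                    unlock_page(page);
            }
            pagevec_release(&pvec);         /* drop the lookup references */
            cond_resched();                 /* the big latency win */
    }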
|
|
This patch addresses the excessive consumption of ZONE_NORMAL by
buffer_heads on highmem machines. The algorithms which decide which
buffers to shoot down are fairly dumb, but they only cut in on machines
with large highmem:lowmem ratios and the code footprint is tiny.
The buffer.c change implements the buffer_head accounting - it sets the
upper limit on buffer_head memory occupancy to 10% of ZONE_NORMAL.
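A sketch of that accounting (fs/buffer.c): the buffer_head alloc/free paths
keep a count and set buffer_heads_over_limit when it is exceeded, which page
reclaim and mpage_writepage() then consult.

    static unsigned long max_buffer_heads;
    int buffer_heads_over_limit;

    void __init buffer_init(void)
    {
            unsigned long nrpages;

            /* limit buffer_head occupancy to 10% of ZONE_NORMAL */
            nrpages = (nr_free_buffer_pages() * 10) / 100;
            max_buffer_heads = nrpages *
                    (PAGE_SIZE / sizeof(struct buffer_head));
    }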
A possible side-effect of this change is that the kernel will perform
more calls to get_block() to map pages to disk. This will only be
observed when a file is being repeatedly overwritten - that is the only
case in which the "cached get_block result" in the buffers is useful.
I did quite some testing of this back in the delalloc ext2 days, and
was not able to come up with a test in which the cached get_block
result was measurably useful. That's for ext2, which has a fast
get_block().
A desirable side effect of this patch is that the kernel will be able
to cache much more blockdev pagecache in ZONE_NORMAL, so there are more
ext2/3 indirect blocks in cache, so with some workloads, less I/O will
be performed.
In mpage_writepage(): if the number of buffer_heads is excessive then
buffers are stripped from pages as they are submitted for writeback.
This change is only useful for filesystems which are using the mpage
code. That's ext2 and ext3-writeback and JFS. An mpage patch for
reiserfs was floating about but seems to have got lost.
There is no need to strip buffers for reads because the mpage code does
not attach buffers for reads.
These are perhaps not the most appropriate buffer_heads to toss away.
Perhaps something smarter should be done to detect file overwriting, or
to toss the 'oldest' buffer_heads first.
In refill_inactive(): if the number of buffer_heads is excessive then
strip buffers from pages as they move onto the inactive list. This
change is useful for all filesystems. This approach is good because
pages which are being repeatedly overwritten will remain on the active
list and will retain their buffers, whereas pages which are not being
overwritten will be stripped.
|
|
It was only being used in invalidate_inode_pages(), and there
pagevec_release() does the same thing.
|
|
The remaining source of page-at-a-time activity against
pagemap_lru_lock is the anonymous pagefault path, which cannot be
changed to operate against multiple pages at a time.
But what we can do is to batch up just its adding of pages to the LRU,
via buffering and deferral.
This patch is based on work from Bill Irwin.
The patch changes lru_cache_add to put the pages into a per-CPU
pagevec. They are added to the LRU 16-at-a-time.
And in the page reclaim code, purge the local CPU's buffer before
starting. This is mainly to decrease the chances of pages staying off
the LRU for very long periods: if the machine is under memory pressure,
CPUs will spill their pages onto the LRU promptly.
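A sketch of the batching (the per-CPU plumbing is simplified here):

    static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);

    void lru_cache_add(struct page *page)
    {
            struct pagevec *pvec = &get_cpu_var(lru_add_pvec);

            page_cache_get(page);
            if (!pagevec_add(pvec, page)) {
                    /* 16th page: take the LRU lock once, spill the batch */
                    __pagevec_lru_add(pvec);
            }
            put_cpu_var(lru_add_pvec);
    }

    /* Called at the start of page reclaim so this CPU's deferred pages
     * don't linger off the LRU. */
    void lru_add_drain(void)
    {
            struct pagevec *pvec = &get_cpu_var(lru_add_pvec);

            if (pagevec_count(pvec))
                    __pagevec_lru_add(pvec);
            put_cpu_var(lru_add_pvec);
    }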
A consequence of this change is that we can have up to 15*num_cpus
pages which are not on the LRU. That could have a slight effect on VM
accuracy, but I find it doubtful. If the system is under memory
pressure the pages will be added to the LRU promptly, and these pages
are the most-recently-touched ones - the VM isn't very interested in
them anyway.
This optimisation could be made SMP-specific, but I felt it best to
turn it on for UP as well for consistency and better testing coverage.
|
|
This is the first patch in a series of eight which address
pagemap_lru_lock contention, and which simplify the VM locking
hierarchy.
Most testing has been done with all eight patches applied, so it would
be best not to cherrypick, please.
The workload which was optimised was: 4x500MHz PIII CPUs, mem=512m, six
disks, six filesystems, six processes each flat-out writing a large
file onto one of the disks. ie: heavy page replacement load.
The frequency with which pagemap_lru_lock is taken is reduced by 90%.
Lockmeter claims that pagemap_lru_lock contention on the 4-way has been
reduced by 98%. Total amount of system time lost to lock spinning went
from 2.5% to 0.85%.
Anton ran a similar test on 8-way PPC, the reduction in system time was
around 25%, and the reduction in time spent playing with
pagemap_lru_lock was 80%.
http://samba.org/~anton/linux/2.5.30/standard/
versus
http://samba.org/~anton/linux/2.5.30/akpm/
Throughput changes on uniprocessor are modest: a 1% speedup with this
workload due to shortened code paths and improved cache locality.
The patches do two main things:
1: In almost all places where the kernel was doing something with
lots of pages one-at-a-time, convert the code to do the same thing
sixteen-pages-at-a-time. Take the lock once rather than sixteen
times. Take the lock for the minimum possible time.
2: Multithread the pagecache reclaim function: don't hold
pagemap_lru_lock while reclaiming pagecache pages. That function
was massively expensive.
One fallout from this work is that we never take any other locks while
holding pagemap_lru_lock. So this lock conceptually disappears from
the VM locking hierarchy.
So. This is all basically a code tweak to improve kernel scalability.
It does it by optimising the existing design, rather than by redesign.
There is little conceptual change to how the VM works.
This is as far as I can tweak it. It seems that the results are now
acceptable on SMP. But things are still bad on NUMA. It is expected
that the per-zone LRU and per-zone LRU lock patches will fix NUMA as
well, but that has yet to be tested.
This first patch introduces `struct pagevec', which is the basic unit
of batched work. It is simply:
    struct pagevec {
            unsigned nr;
            struct page *pages[16];
    };
pagevecs are used in the following patches to get the VM away from
page-at-a-time operations.
This patch includes all the pagevec library functions which are used in
later patches.
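For orientation, the basic helpers look roughly like this (a sketch; the
'cold' hint and the LRU-specific variants arrive in later patches):

    static inline void pagevec_init(struct pagevec *pvec)
    {
            pvec->nr = 0;
    }

    static inline unsigned pagevec_count(struct pagevec *pvec)
    {
            return pvec->nr;
    }

    /* Returns the space remaining; 0 means the caller must flush. */
    static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page)
    {
            pvec->pages[pvec->nr++] = page;
            return 16 - pvec->nr;
    }

    static inline void pagevec_release(struct pagevec *pvec)
    {
            if (pagevec_count(pvec))
                    __pagevec_release(pvec); /* drop refs, free unused pages */
    }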
|