This patch addresses the excessive consumption of ZONE_NORMAL by
buffer_heads on highmem machines. The algorithms which decide which
buffers to shoot down are fairly dumb, but they only cut in on machines
with large highmem:lowmem ratios and the code footprint is tiny.
The buffer.c change implements the buffer_head accounting - it sets the
upper limit on buffer_head memory occupancy to 10% of ZONE_NORMAL.
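As a rough sketch of that calculation (the helper name bh_limit_init()
is invented for illustration; nr_free_buffer_pages() and
max_buffer_heads follow the usual fs/buffer.c naming):

	#include <linux/mm.h>
	#include <linux/buffer_head.h>

	static unsigned long max_buffer_heads;

	/* Cap buffer_head occupancy at roughly 10% of lowmem (ZONE_NORMAL). */
	static void __init bh_limit_init(void)
	{
		unsigned long nrpages;

		nrpages = (nr_free_buffer_pages() * 10) / 100;
		max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
	}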
A possible side-effect of this change is that the kernel will perform
more calls to get_block() to map pages to disk. This will only be
observed when a file is being repeatedly overwritten - this is the only
case in which the "cached get_block result" in the buffers is useful.
I did quite some testing of this back in the delalloc ext2 days, and
was not able to come up with a test in which the cached get_block
result was measurably useful. That's for ext2, which has a fast
get_block().
A desirable side effect of this patch is that the kernel will be able
to cache much more blockdev pagecache in ZONE_NORMAL, so there are more
ext2/3 indirect blocks in cache, so with some workloads, less I/O will
be performed.
In mpage_writepage(): if the number of buffer_heads is excessive then
buffers are stripped from pages as they are submitted for writeback.
This change is only useful for filesystems which are using the mpage
code. That's ext2 and ext3-writeback and JFS. An mpage patch for
reiserfs was floating about but seems to have got lost.
There is no need to strip buffers for reads because the mpage code does
not attach buffers for reads.
These are perhaps not the most appropriate buffer_heads to toss away.
Perhaps something smarter should be done to detect file overwriting, or
to toss the 'oldest' buffer_heads first.
In refill_inactive(): if the number of buffer_heads is excessive then
strip buffers from pages as they move onto the inactive list. This
change is useful for all filesystems. This approach is good because
pages which are being repeatedly overwritten will remain on the active
list and will retain their buffers, whereas pages which are not being
overwritten will be stripped.
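Both stripping sites reduce to the same test; a minimal sketch
(maybe_strip_buffers() is a made-up name for illustration, while
buffer_heads_over_limit, page_has_buffers() and try_to_release_page()
follow the mainline names):

	#include <linux/mm.h>
	#include <linux/pagemap.h>
	#include <linux/buffer_head.h>

	/*
	 * Drop a page's buffer_heads once the system-wide limit has been
	 * exceeded.  The pagecache page itself is untouched; only the
	 * cached get_block() results are lost.  The real call sites hold
	 * the page lock here.
	 */
	static void maybe_strip_buffers(struct page *page)
	{
		if (buffer_heads_over_limit && page_has_buffers(page))
			try_to_release_page(page, GFP_NOIO);
	}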
|
It was only being used in invalidate_inode_pages(), and from there,
pagevec_release() does the same thing.
|
The remaining source of page-at-a-time activity against
pagemap_lru_lock is the anonymous pagefault path, which cannot be
changed to operate against multiple pages at a time.
But what we can do is to batch up just its adding of pages to the LRU,
via buffering and deferral.
This patch is based on work from Bill Irwin.
The patch changes lru_cache_add to put the pages into a per-CPU
pagevec. They are added to the LRU 16-at-a-time.
And in the page reclaim code, purge the local CPU's buffer before
starting. This is mainly to decrease the chances of pages staying off
the LRU for very long periods: if the machine is under memory pressure,
CPUs will spill their pages onto the LRU promptly.
A consequence of this change is that we can have up to 15*num_cpus
pages which are not on the LRU. This could have a slight effect on VM
accuracy, but I find that doubtful. If the system is under memory
pressure the pages will be added to the LRU promptly, and these pages
are the most-recently-touched ones - the VM isn't very interested in
them anyway.
This optimisation could be made SMP-specific, but I felt it best to
turn it on for UP as well for consistency and better testing coverage.
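A sketch of that batching, loosely following what ended up in mm/swap.c
(the per-CPU plumbing shown here, DEFINE_PER_CPU and get_cpu()/put_cpu(),
is assumed infrastructure rather than text from this patch):

	#include <linux/pagevec.h>
	#include <linux/pagemap.h>
	#include <linux/percpu.h>

	static DEFINE_PER_CPU(struct pagevec, lru_add_pvecs);

	void lru_cache_add(struct page *page)
	{
		struct pagevec *pvec = &per_cpu(lru_add_pvecs, get_cpu());

		page_cache_get(page);
		/* pagevec_add() returns 0 once all 16 slots are in use. */
		if (!pagevec_add(pvec, page))
			__pagevec_lru_add(pvec);	/* one lock round trip for 16 pages */
		put_cpu();
	}

	/* Called before page reclaim starts so queued pages reach the LRU. */
	void lru_add_drain(void)
	{
		struct pagevec *pvec = &per_cpu(lru_add_pvecs, get_cpu());

		if (pagevec_count(pvec))
			__pagevec_lru_add(pvec);
		put_cpu();
	}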
|
This is the first patch in a series of eight which address
pagemap_lru_lock contention, and which simplify the VM locking
hierarchy.
Most testing has been done with all eight patches applied, so it would
be best not to cherrypick, please.
The workload which was optimised was: 4x500MHz PIII CPUs, mem=512m, six
disks, six filesystems, six processes each flat-out writing a large
file onto one of the disks. ie: heavy page replacement load.
The frequency with which pagemap_lru_lock is taken is reduced by 90%.
Lockmeter claims that pagemap_lru_lock contention on the 4-way has been
reduced by 98%. Total amount of system time lost to lock spinning went
from 2.5% to 0.85%.
Anton ran a similar test on 8-way PPC, the reduction in system time was
around 25%, and the reduction in time spent playing with
pagemap_lru_lock was 80%.
http://samba.org/~anton/linux/2.5.30/standard/
versus
http://samba.org/~anton/linux/2.5.30/akpm/
Throughput changes on uniprocessor are modest: a 1% speedup with this
workload due to shortened code paths and improved cache locality.
The patches do two main things:
1: In almost all places where the kernel was doing something with
lots of pages one-at-a-time, convert the code to do the same thing
sixteen-pages-at-a-time. Take the lock once rather than sixteen
times. Take the lock for the minimum possible time.
2: Multithread the pagecache reclaim function: don't hold
pagemap_lru_lock while reclaiming pagecache pages. That function
was massively expensive.
One fallout from this work is that we never take any other locks while
holding pagemap_lru_lock. So this lock conceptually disappears from
the VM locking hierarchy.
So. This is all basically a code tweak to improve kernel scalability.
It does it by optimising the existing design, rather than by redesign.
There is little conceptual change to how the VM works.
This is as far as I can tweak it. It seems that the results are now
acceptable on SMP. But things are still bad on NUMA. It is expected
that the per-zone LRU and per-zone LRU lock patches will fix NUMA as
well, but that has yet to be tested.
This first patch introduces `struct pagevec', which is the basic unit
of batched work. It is simply:
	struct pagevec {
		unsigned nr;
		struct page *pages[16];
	};
pagevecs are used in the following patches to get the VM away from
page-at-a-time operations.
This patch includes all the pagevec library functions which are used in
later patches.
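For illustration, a typical consumer in the later patches looks roughly
like this (the loop and the function name are assumptions;
pagevec_add() and pagevec_release() are among the library helpers this
patch provides):

	#include <linux/pagevec.h>

	/* Release references on an array of pages, sixteen at a time. */
	static void release_pages_in_batches(struct page **pages, int nr)
	{
		struct pagevec pvec = { .nr = 0 };
		int i;

		for (i = 0; i < nr; i++) {
			/* pagevec_add() returns 0 once all 16 slots are in use. */
			if (!pagevec_add(&pvec, pages[i]))
				pagevec_release(&pvec);	/* batched put of 16 pages */
		}
		pagevec_release(&pvec);		/* flush whatever is left over */
	}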
|