| Age | Commit message | Author |
|
- added typedef unsigned int __nocast gfp_t;
- replaced __nocast uses for gfp flags with gfp_t - it gives exactly
the same warnings as far as sparse is concerned, doesn't change
generated code (from gcc point of view we replaced unsigned int with
typedef) and documents what's going on far better.
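A minimal sketch of the result (the prototypes are shown only for illustration):

typedef unsigned int __nocast gfp_t;

/* allocation-flag arguments are now self-documenting: */
void *kmalloc(size_t size, gfp_t flags);
struct buffer_head *alloc_buffer_head(gfp_t gfp_flags);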
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Use a bit spin lock in the first buffer of the page to synchronise async
IO buffer completions, instead of the global page_uptodate_lock, which is
showing some scalability problems.
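A minimal sketch of the idea, assuming the new lock is a buffer-state bit
(called BH_Uptodate_Lock here) taken on the page's first buffer_head in the
async completion handler:

struct buffer_head *first = page_buffers(page);
unsigned long flags;

local_irq_save(flags);
bit_spin_lock(BH_Uptodate_Lock, &first->b_state);
/* mark this buffer uptodate, check whether all buffers on the
   page are now done, and only then SetPageUptodate() */
bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
local_irq_restore(flags);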
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This makes it hard(er) to mix argument orders by mistake for things like
kmalloc() and friends, since silent integer promotion is now caught by
sparse.
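For example (a hypothetical call site): with the flags argument typed as
gfp_t, swapping the arguments no longer slips through as a silent integer
conversion - sparse warns about it:

buf = kmalloc(128, GFP_ATOMIC);		/* correct */
buf = kmalloc(GFP_ATOMIC, 128);		/* sparse warns: a plain integer
					   lands in the gfp_t slot */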
|
|
Add nobh_writepage() support for the filesystems which use
nobh_prepare_write/nobh_commit_write().
The idea here is to reduce unnecessary bufferhead creation/attachment to the
page through pageout()->block_write_full_page(). nobh_writepage() tries to
operate by directly creating bios, but it falls back to
__block_write_full_page() if it can't make progress.
Note that this is not really a generic routine and can't be used for
filesystems which use page->private for anything other than buffer heads.
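A sketch of how a filesystem might hook this up (the wrapper and
example_get_block are made up; the nobh_writepage() prototype is an
assumption based on the description):

static int example_nobh_writepage(struct page *page,
				  struct writeback_control *wbc)
{
	/* builds bios directly; falls back to __block_write_full_page()
	   internally if it cannot make progress */
	return nobh_writepage(page, example_get_block, wbc);
}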
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I renamed the functions to more descriptive names:
create_buffers -> alloc_page_buffers
__set_page_buffers -> attach_page_buffers
And I added an EXPORT_SYMBOL_GPL for alloc_page_buffers and made
attach_page_buffers static inline and moved it to <linux/buffer_head.h>.
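A sketch of the new static inline (the body is an assumption about what
attaching buffers involves in this era - take a page reference, mark the
page as having private data, and point page->private at the buffer ring):

static inline void attach_page_buffers(struct page *page,
				       struct buffer_head *head)
{
	page_cache_get(page);
	SetPagePrivate(page);
	page->private = (unsigned long)head;
}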
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Allow the buffer_foo() predicates to take a (const struct buffer_head *).
I've checked that the argument of test_bit is indeed "const" on all
architectures.
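Sketch of the affected pattern, assuming the predicates come from the usual
BUFFER_FNS()-style macro; only the read-only test side takes the
const-qualified pointer (the set/clear side obviously cannot):

#define BUFFER_FNS(bit, name)						\
static inline void set_buffer_##name(struct buffer_head *bh)		\
{									\
	set_bit(BH_##bit, &(bh)->b_state);				\
}									\
static inline int buffer_##name(const struct buffer_head *bh)		\
{									\
	return test_bit(BH_##bit, &(bh)->b_state);			\
}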
Signed-off-by: Werner Almesberger <werner@almesberger.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Processes can sleep in do_get_write_access(), waiting for buffers to be
removed from the BJ_Shadow state. We did this by doing a wake_up_buffer() in
the commit path and sleeping on the buffer in do_get_write_access().
With the filtered bit-level wakeup code this doesn't work properly any more -
the wake_up_buffer() accidentally wakes up tasks which are sleeping in
lock_buffer() as well. Those tasks now implicitly assume that the buffer came
unlocked. Net effect: Bogus I/O errors when reading journal blocks, because
the buffer isn't up to date yet. Hence the recent spate of journal_bmap()
failure reports.
The patch creates a new jbd-private BH flag purely for this wakeup function.
So a wake_up_bit(..., BH_Unshadow) doesn't wake up someone who is waiting for
a wake_up_bit(BH_Lock).
JBD was the only user of wake_up_buffer(), so remove it altogether.
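A sketch of the resulting pattern, using the plain bit-waitqueue primitives
of that era (the wait loop below is illustrative, not the exact JBD code):

/* commit path: buffer leaves BJ_Shadow - wake only the JBD waiters */
wake_up_bit(&bh->b_state, BH_Unshadow);

/* do_get_write_access() side */
wait_queue_head_t *wqh = bit_waitqueue(&bh->b_state, BH_Unshadow);
DEFINE_WAIT(wait);

while (jh->j_list == BJ_Shadow) {
	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
	if (jh->j_list == BJ_Shadow)
		schedule();
	finish_wait(wqh, &wait);
}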
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The exports were for reiserfs in 2.4, but reiserfs doesn't need them
anymore.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add a whole bunch more might_sleep() checks. We also enable might_sleep()
checking in copy_*_user(). This was non-trivial because the "copy_*_user()
in atomic regions" trick would generate false positives. Fix that up by
adding a new __copy_*_user_inatomic(), which avoids the might_sleep() check.
Only i386 is supported in this patch.
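Sketch of the distinction on i386 (bodies simplified; the __copy_to_user_ll
name is an assumption):

static inline unsigned long
copy_to_user(void __user *to, const void *from, unsigned long n)
{
	might_sleep();			/* may fault and block */
	return __copy_to_user(to, from, n);
}

/* for the "copy in atomic regions" trick: same copy, no might_sleep(),
   so no false positives from the new checks */
static inline unsigned long
__copy_to_user_inatomic(void __user *to, const void *from, unsigned long n)
{
	return __copy_to_user_ll(to, from, n);
}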
With: Arjan van de Ven <arjanv@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Reduce size of buffer_head from 96 to 88 bytes on 64bit architectures by
putting b_count and b_size together. b_count will still be in the first 16
bytes on 32bit architectures, so 16 byte cacheline machines shouldn't be
affected.
With this change the number of objects per 4kB slab goes up from 40 to 44
on ppc64.
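Roughly the relevant corner of the struct after the change (types and order
from memory, so treat as illustrative):

struct buffer_head {
	/* ... */
	atomic_t b_count;	/* users of this buffer_head */
	u32 b_size;		/* block size - now sharing one 8-byte
				   slot with b_count on 64-bit instead
				   of each being padded out */
	/* ... */
};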
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
In databases it is common to have multiple threads or processes performing
O_SYNC writes against different parts of the same file.
Our performance at this is poor, because each writer blocks access to the
file by waiting on I/O completion while holding i_sem: everything is
serialised.
The patch improves things by moving the writing and waiting outside i_sem.
So other threads can get in and submit their I/O and permit the disk
scheduler to optimise the IO patterns better.
Also, the O_SYNC writer only writes and waits on the pages which he wrote,
rather than writing and waiting on all dirty pages in the file.
The reason we haven't been able to do this before is that the required walk
of the address_space page lists is easily livelockable without the i_sem
serialisation. But in this patch we perform the waiting via a radix-tree
walk of the affected pages. This cannot be livelocked.
The sync of the inode's metadata is still performed inside i_sem. This is
because it is list-based and is hence still livelockable. However it is
usually the case that databases are overwriting existing file blocks and
there will be no dirty buffers attached to the address_space anyway.
The code is careful to ensure that the IO for the pages and the IO for the
metadata are nonblockingly scheduled at the same time. This is an improvement
over the current code, which will issue two separate write-and-wait cycles:
one for metadata, one for pages.
Note from Suparna:
Reworked to use the tagged radix-tree based writeback infrastructure.
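In outline (the sync_page_range() helper name and the generic_file_write
split are assumptions; the structure follows the description):

down(&inode->i_sem);
written = generic_file_write_nolock(file, iov, nr_segs, &pos);
up(&inode->i_sem);

if (written > 0 && (file->f_flags & O_SYNC)) {
	/* schedules the page I/O and the metadata I/O together, takes
	   i_sem only around the (list-based) inode metadata sync, and
	   waits via a tagged radix-tree walk of just the written range */
	long err = sync_page_range(inode, file->f_mapping,
				   pos - written, written);
	if (err < 0)
		written = err;
}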
Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
In order for filesystems to detect asynchronous ordered write failures for
buffers sent via submit_bh, they need a bit they can test for in the buffer
head. This adds BH_Eopnotsupp and the related buffer operations.
end_buffer_write_sync is changed to avoid a printk for BH_Eopnotsupp
related failures, since the FS is responsible for a retry.
sync_dirty_buffer is changed to test for BH_Eopnotsupp and return
-EOPNOTSUPP to the caller.
Some of this came from Jens Axboe.
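A sketch of the intended filesystem-side usage, assuming the new bit gets
the usual buffer_foo()/clear_buffer_foo() accessors and that the barrier
request is flagged with set_buffer_ordered():

set_buffer_ordered(bh);			/* ask for a barrier write */
ret = sync_dirty_buffer(bh);
if (buffer_eopnotsupp(bh)) {
	clear_buffer_eopnotsupp(bh);
	/* the device cannot do barriers: retry as a plain write */
	clear_buffer_ordered(bh);
	mark_buffer_dirty(bh);
	ret = sync_dirty_buffer(bh);
}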
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Make sync_dirty_buffer() return the result of its syncing.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
IDE disk barrier core.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: William Lee Irwin III <wli@holomorphy.com>
This patch implements wake-one semantics for buffer_head wakeups in a single
step. The buffer_head being waited on is passed to the waiter's wakeup
function by the waker, and the wakeup function compares that to the pointer
stored in its on-stack structure, also checking the readiness of the bit
there. Wake-one semantics are achieved by using WQ_FLAG_EXCLUSIVE in the
codepaths waiting to acquire the bit for mutual exclusion.
|
|
From: Christoph Hellwig <hch@lst.de>
These are the generic lockfs bits. Basically it takes the XFS freezing
state machine into the VFS. It's all behind the kernel-doc documented
freeze_bdev and thaw_bdev interfaces.
Based on an older patch from Chris Mason.
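Typical usage, to the extent the interfaces are described here (the exact
signatures are assumptions):

struct super_block *sb;

sb = freeze_bdev(bdev);		/* quiesce the fs, flush dirty data */
/* ... take the snapshot, do the device-mapper/LVM work ... */
thaw_bdev(bdev, sb);		/* unfreeze, let writes continue */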
|
|
From: Jeff Garzik <jgarzik@pobox.com>
It was debug code, no longer required.
|
|
From: Jeff Garzik <jgarzik@pobox.com>
Nobody ever checks the return value of submit_bh(), and submit_bh() is the
only caller that checks the submit_bio() return value.
This changes the kernel I/O submission path -- a fast path -- so this
cleanup is also a microoptimization.
|
|
From: Adrian Bunk <bunk@fs.tum.de>
Four months ago, Rolf Eike Beer <eike-kernel@sf-tec.de> sent a patch
against 2.6.0-test5-bk1 that converted several if ... BUG() to BUG_ON().
This might in some cases result in slightly faster code because BUG_ON()
uses unlikely().
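That is, conversions of the form below (condition made up for illustration):

if (err)
	BUG();

/* becomes */
BUG_ON(err);		/* BUG_ON() wraps its test in unlikely() */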
|
|
From: viro@parcelfarce.linux.theplanet.co.uk <viro@parcelfarce.linux.theplanet.co.uk>
generic_osync_inode() got an extra argument - mapping - and doesn't calculate
inode->i_mapping anymore. Callers were updated and switched to using
->f_mapping.
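A caller now looks roughly like this (the OSYNC_* flags are quoted from
memory and only indicate the shape of the call):

err = generic_osync_inode(inode, file->f_mapping,
			  OSYNC_METADATA | OSYNC_DATA);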
|
|
From: Oliver Xymoron <oxymoron@waste.org>
These patches add the infrastructure for reporting asynchronous write errors
to block devices to userspace. Errors which are detected due to pdflush or VM
writeout are reported at the next fsync, fdatasync, or msync on the given
file, and on close if the error occurs in time.
We do this by propagating any errors into page->mapping->error when they are
detected. In fsync(), msync(), fdatasync() and close() we return that error
and zero it out.
The Open Group say close() _may_ fail if an I/O error occurred while reading
from or writing to the file system. Well, in this implementation close() can
return -EIO or -ENOSPC. And in that case it will succeed, not fail - perhaps
that is what they meant.
There are three patches in this series and testing has only been performed
with all three applied.
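In outline, per the description above (the function names below are made
up; only the ->error field and the report-then-clear behaviour come from
the text):

/* async writeback completion path */
static void note_async_write_error(struct page *page, int err)
{
	if (page->mapping)
		page->mapping->error = err;	/* e.g. -EIO or -ENOSPC */
}

/* fsync()/fdatasync()/msync()/close() path */
static int collect_async_write_error(struct address_space *mapping)
{
	int err = mapping->error;

	mapping->error = 0;	/* report each error exactly once */
	return err;
}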
|
|
Nothing uses it for now, but I needed it for some tuning tests,
and it is potentially useful for others.
|
|
* bogus calls of invalidate_buffers() gone from floppy_open()
* invalidate_buffers() killed.
* new helper - __invalidate_device(bdev, do_sync). invalidate_device()
now calls it (a usage sketch follows this list).
* fixed races between floppy_open()/floppy_open() and
floppy_open()/set_geometry():
a) floppy_open()/floppy_release() is done under a semaphore. That
closes the races between simultaneous open() on /dev/fd0foo and /dev/fd0bar.
b) pointer to struct block_device is kept as long as floppy is
opened (per-drive, non-NULL when number of openers is non-zero, does not
contribute to block_device refcount).
c) set_geometry() grabs the same semaphore and invalidates the
devices directly instead of messing with setting fake "it had changed"
and calling __check_disk_change().
* __check_disk_change() killed - no remaining callers
* full_check_disk_change() killed - ditto.
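Usage of the new helper, per the list above:

/* optionally sync first, then drop cached buffers/pages for the device */
__invalidate_device(bdev, 1 /* do_sync */);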
|
|
- alloc_buffer_head() should take the allocation mode as an arg, and not
assume.
- Use __GFP_NOFAIL in JBD's call to alloc_buffer_head().
- Remove all the retry code from jbd_kmalloc() - do it via page allocator
controls.
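So the JBD call site becomes, in sketch form:

struct buffer_head *new_bh;

/* the allocation mode is explicit, and the page allocator is told to
   retry forever instead of jbd_kmalloc() open-coding a retry loop */
new_bh = alloc_buffer_head(GFP_NOFS | __GFP_NOFAIL);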
|
|
It is generally illegal to wait on an unpinned buffer - another CPU could
free it up even before __wait_on_buffer() has taken a ref against the buffer.
Maybe external locking rules will prevent this in specific cases, but that is
really subtle and fragile as locking rules evolve.
The patch detects people calling wait_on_buffer() against an unpinned buffer
and issues a diagnostic.
Also remove the get_bh() from __wait_on_buffer(). It is too late.
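A sketch of the check (the diagnostic helper name is a placeholder):

/* at the top of __wait_on_buffer(): complain if the caller holds no
   reference - the buffer could be freed while we sleep */
if (atomic_read(&bh->b_count) == 0)
	buffer_error();		/* placeholder for the diagnostic */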
|
|
SGI Modid: 2.5.x-xfs:slinx:141507a
|
|
Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> points out a bug in
ll_rw_block() usage.
Typical usage is:
mark_buffer_dirty(bh);
ll_rw_block(WRITE, 1, &bh);
wait_on_buffer(bh);
the problem is that if the buffer was locked on entry to this code sequence
(due to in-progress I/O), ll_rw_block() will not wait, and will not start new
I/O. So this code will wait on the _old_ I/O, and will then continue execution,
leaving the buffer dirty.
It turns out that all callers were only writing one buffer, and they were all
waiting on that writeout. So I added a new sync_dirty_buffer() function:
void sync_dirty_buffer(struct buffer_head *bh)
{
	lock_buffer(bh);
	if (test_clear_buffer_dirty(bh)) {
		get_bh(bh);
		bh->b_end_io = end_buffer_io_sync;
		submit_bh(WRITE, bh);
	} else {
		unlock_buffer(bh);
	}
}
which allowed a fair amount of code to be removed, while adding the desired
data-integrity guarantees.
UFS has its own wrappers around ll_rw_block() which got in the way, so this
operation was open-coded in that case.
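In the common case, the original sequence above therefore becomes simply
(error check illustrative):

mark_buffer_dirty(bh);
sync_dirty_buffer(bh);
if (!buffer_uptodate(bh))
	goto write_error;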
|
|
There have been sporadic sightings of ext3 causing little blips of 100,000
context switches per second when under load.
At the start of do_get_write_access() we have this logic:
repeat:
	lock_buffer(jh->bh);
	...
	unlock_buffer(jh->bh);
	...
	if (jh->j_list == BJ_Shadow) {
		sleep_on_buffer(jh->bh);
		goto repeat;
	}
The problem is that the unlock_buffer() will wake up anyone who is sleeping
in the sleep_on_buffer().
So if task A is asleep in sleep_on_buffer() and task B now runs
do_get_write_access(), task B will wake task A by accident. Task B will then
sleep on the buffer and task A will loop, will run unlock_buffer() and then
wake task B.
This state will continue until I/O completes against the buffer and kjournald
changes jh->j_list.
Unless task A and task B happen to both have realtime scheduling policy - if
they do then kjournald will never run. The state is never cleared and your
box locks up.
The fix is to not do the `goto repeat;' until the buffer has been taken off
the shadow list. So we don't go and wake up the other waiter(s) until they
can actually proceed to use the buffer.
The patch removes the exported sleep_on_buffer() function and simply exports
an existing function which provides access to a buffer_head's waitqueue
pointer. Which is a better interface anyway, because it permits the use of
wait_event().
This bug was introduced into 2.4.20-pre5 and was faithfully ported
up.
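Sketch of what a waiter can do with the exported waitqueue accessor (the
bh_waitq_head() name is an assumption):

wait_queue_head_t *wqh = bh_waitq_head(jh2bh(jh));

/* no more bogus wakeups bouncing between the two sleepers: just wait
   until kjournald has moved the buffer off the shadow list */
wait_event(*wqh, jh->j_list != BJ_Shadow);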
|
|
major changes to actually fit.
SGI Modid: 2.5.x-xfs:slinx:132210a
|
|
current->flags:PF_SYNC was a hack I added because I didn't want to
change all ->writepage implementations.
It's foul. And it means that if someone happens to run direct page
reclaim within the context of (say) sys_sync, the writepage invocations
from the VM will be treated as "data integrity" operations, not "memory
cleansing" operations, which would cause latency.
So the patch removes PF_SYNC and adds an extra arg to a_ops->writepage.
It is the `writeback_control' structure which contains the full context
information about why writepage was called.
The initial version of this patch just passed in a bare `int sync', but
the XFS team need more info so they can perform writearound from within
page reclaim.
The patch also adds writeback_control.for_reclaim, so writepage
implementations can inspect that to work out the call context rather
than peeking at current->flags:PF_MEMALLOC.
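A ->writepage implementation now looks along these lines (filesystem names
made up):

static int example_writepage(struct page *page,
			     struct writeback_control *wbc)
{
	if (wbc->for_reclaim) {
		/* called from page reclaim: memory cleansing, so keep
		   latency low rather than guaranteeing integrity */
	}
	if (wbc->sync_mode == WB_SYNC_ALL) {
		/* data-integrity writeback (fsync, sys_sync, ...) */
	}
	return block_write_full_page(page, example_get_block, wbc);
}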
|
|
Implements a new set of block address_space_operations which will never
attach buffer_heads to file pagecache. These can be turned on for ext2
with the `nobh' mount option.
During write-intensive testing on a 7G machine, total buffer_head
storage remained below 0.3 megabytes. And those buffer_heads are
against ZONE_NORMAL pagecache and will be reclaimed by ZONE_NORMAL
memory pressure.
This work is, of course, a special case for the huge highmem machines.
Possibly it obsoletes the buffer_heads_over_limit stuff (which doesn't
work terribly well), but that code is simple, and will provide relief
for other filesystems.
It should be noted that the nobh_prepare_write() function and the
PageMappedToDisk() infrastructure are what is needed to solve the
problem of user data corruption when the filesystem which backs a
sparse MAP_SHARED mapping runs out of space. We can use this code in
filemap_nopage() to ensure that all mapped pages have space allocated
on-disk. Deliver SIGBUS on ENOSPC.
This will require a new address_space op, I expect.
|
|
Stephen Tweedie reports a 2.4.7 problem in which kswapd is chewing lots
of CPU trying to reclaim inodes which are pinned by buffer_heads at
i_dirty_buffers.
This can only happen when there's memory pressure on ZONE_HIGHMEM - the
2.4 kernel runs shrink_icache_memory in that case as well. But there's
no reclaim pressure on ZONE_NORMAL so the VM is never running
try_to_free_buffers() against the ZONE_NORMAL buffers which are pinning
the inodes.
The 2.5 kernel also runs the slab shrinkers in response to ZONE_HIGHMEM
pressure. This may be wrong - still thinking about that.
This patch arranges for prune_icache to try to remove the inode's buffers
when the inode is to be reclaimed.
It also changes inode_has_buffers() and the other inode-buffer-list
functions to look at inode->i_data, not inode->i_mapping. The latter
was wrong.
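In sketch form, the prune_icache() side of the change amounts to (helper
name assumed):

if (inode_has_buffers(inode))
	invalidate_inode_buffers(inode);	/* drop the buffers pinning
						   the inode so it can be
						   reclaimed */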
|
|
This patch from Christoph Hellwig removes the kiobuf/kiovec
infrastructure.
This affects three subsystems:
video-buf.c:
This patch includes an earlier diff from Gerd which converts
video-buf.c to use get_user_pages() directly.
Gerd has acked this patch.
LVM1:
Is now even more broken.
drivers/mtd/devices/blkmtd.c:
blkmtd is broken by this change. I contacted Simon Evans, who
said "I had done a rewrite of blkmtd anyway and just need to convert
it to BIO. Feel free to break it in the 2.5 tree, it will force me
to finish my code."
Neither EVMS nor LVM2 use kiobufs. The only remaining breakage
of which I am aware is a proprietary MPEG2 streaming module. It
could use get_user_pages().
|
|
This is the replacement for write_mapping_buffers().
Whenever the mpage code sees that it has just written a block which had
buffer_boundary() set, it assumes that the next block is dirty
filesystem metadata. (This is a good assumption - that's what
buffer_boundary is for).
So we do a lookup in the blockdev mapping for the next block and it if
is present and dirty, then schedule it for IO.
So the indirect blocks in the blockdev mapping get merged with the data
blocks in the file mapping.
This is a bit more general than the write_mapping_buffers() approach.
write_mapping_buffers() required that the fs carefully maintain the
correct buffers on the mapping->private_list, and that the fs call
write_mapping_buffers(), and the implementation was generally rather
yuk.
This version will "just work" for filesystems which implement
buffer_boundary correctly. Currently this is ext2, ext3 and some
not-yet-merged reiserfs patches. JFS implements buffer_boundary() but
does not use ext2-like layouts - so there will be no change there.
Works nicely.
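Sketch of the mechanism (simplified; the real code wraps this lookup in a
small helper):

if (buffer_boundary(bh)) {
	/* the next block is probably dirty indirect-block metadata in
	   the blockdev mapping - push it out now so it merges with the
	   data I/O we just issued */
	struct buffer_head *nbh;

	nbh = __find_get_block(bh->b_bdev, bh->b_blocknr + 1, bh->b_size);
	if (nbh) {
		if (buffer_dirty(nbh))
			ll_rw_block(WRITE, 1, &nbh);
		put_bh(nbh);
	}
}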
|
|
When the global buffer LRU was present, dirty ext2 indirect blocks were
automatically scheduled for writeback alongside their data.
I added write_mapping_buffers() to replace this - the idea was to
schedule the indirects close in time to the scheduling of their data.
It works OK for small-to-medium sized files but for large, linear writes
it doesn't work: the request queue is completely full of file data and
when we later come to scheduling the indirects, their neighbouring data
has already been written.
So writeback of really huge files tends to be a bit seeky.
So. Kill it. Will fix this problem by other means.
|
|
Convert the VM to not wait on other people's dirty data.
- If we find a dirty page and its queue is not congested, do some writeback.
- If we find a dirty page and its queue _is_ congested then just
refile the page.
- If we find a PageWriteback page then just refile the page.
- There is additional throttling for write(2) callers. Within
generic_file_write(), record their backing queue in ->current.
Within page reclaim, if this task encounters a page which is dirty
or under writeback on this queue, block on it. This gives some more
writer throttling and reduces the page refiling frequency.
It's somewhat CPU expensive - under really heavy load we only get a 50%
reclaim rate in pages coming off the tail of the LRU. This can be
fixed by splitting the inactive list into reclaimable and
non-reclaimable lists. But the CPU load isn't too bad, and latency is
much, much more important in these situations.
Example: with `mem=512m', running 4 instances of `dbench 100', 2.5.34
took 35 minutes to compile a kernel. With this patch, it took three
minutes, 45 seconds.
I haven't done swapcache or MAP_SHARED pages yet. If there's tons of
dirty swapcache or mmap data around we still stall heavily in page
reclaim. That's less important.
This patch also has a tweak for swapless machines: don't even bother
bringing anon pages onto the inactive list if there is no swap online.
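The per-page decision above, in sketch form (helper names approximate for
that era, details of the shrink-list loop omitted):

if (PageDirty(page)) {
	struct backing_dev_info *bdi = mapping->backing_dev_info;

	if (bdi_write_congested(bdi))
		goto keep;		/* queue congested: just refile */
	/* queue has room: kick off async writeback, don't wait */
	mapping->a_ops->writepage(page);
}
if (PageWriteback(page))
	goto keep;			/* never wait on it: refile */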
|
|
Patch from Christoph Hellwig.
Move the buffer_head-based IO functions out of ll_rw_blk.c and into
fs/buffer.c. So the buffer IO functions are all in buffer.c, and
ll_rw_blk.c knows nothing about buffer_heads.
This patch has been acked by Jens.
|
|
This patch addresses the excessive consumption of ZONE_NORMAL by
buffer_heads on highmem machines. The algorithms which decide which
buffers to shoot down are fairly dumb, but they only cut in on machines
with large highmem:lowmem ratios and the code footprint is tiny.
The buffer.c change implements the buffer_head accounting - it sets the
upper limit on buffer_head memory occupancy to 10% of ZONE_NORMAL.
A possible side-effect of this change is that the kernel will perform
more calls to get_block() to map pages to disk. This will only be
observed when a file is being repeatedly overwritten - this is the only
case in which the "cached get_block result" in the buffers is useful.
I did quite some testing of this back in the delalloc ext2 days, and
was not able to come up with a test in which the cached get_block
result was measurably useful. That's for ext2, which has a fast
get_block().
A desirable side effect of this patch is that the kernel will be able
to cache much more blockdev pagecache in ZONE_NORMAL, so there are more
ext2/3 indirect blocks in cache, so with some workloads, less I/O will
be performed.
In mpage_writepage(): if the number of buffer_heads is excessive then
buffers are stripped from pages as they are submitted for writeback.
This change is only useful for filesystems which are using the mpage
code. That's ext2 and ext3-writeback and JFS. An mpage patch for
reiserfs was floating about but seems to have got lost.
There is no need to strip buffers for reads because the mpage code does
not attach buffers for reads.
These are perhaps not the most appropriate buffer_heads to toss away.
Perhaps something smarter should be done to detect file overwriting, or
to toss the 'oldest' buffer_heads first.
In refill_inactive(): if the number of buffer_heads is excessive then
strip buffers from pages as they move onto the inactive list. This
change is useful for all filesystems. This approach is good because
pages which are being repeatedly overwritten will remain on the active
list and will retain their buffers, whereas pages which are not being
overwritten will be stripped.
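A sketch of the sizing rule described above (names and arithmetic
illustrative):

/* let buffer_heads occupy at most ~10% of ZONE_NORMAL */
max_buffer_heads = (nr_free_buffer_pages() * 10 / 100) *
		   (PAGE_SIZE / sizeof(struct buffer_head));

/* checked from mpage writeback and from refill_inactive() */
buffer_heads_over_limit = (nr_buffer_heads > max_buffer_heads);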
|
|
This patch changes cont_prepare_write(), in order to support a 4G-1
file for FAT32.
int cont_prepare_write(struct page *page, unsigned offset,
-		unsigned to, get_block_t *get_block, unsigned long *bytes)
+		unsigned to, get_block_t *get_block, loff_t *bytes)
And it fixes broken adfs/affs/fat/hfs/hpfs/qnx4 by this
cont_prepare_write() change.
|
|
Here's a patch which converts O_DIRECT to go direct-to-BIO, bypassing
the kiovec layer. It's followed by a patch which converts the raw
driver to use the O_DIRECT engine.
CPU utilisation is about the same as the kiovec-based implementation.
Read and write bandwidth are the same too, for 128k chunks. But with
one megabyte chunks, this implementation is 20% faster at writing.
I assume this is because the kiobuf-based implementation has to stop
and wait for each 128k chunk, whereas this code streams the entire
request, regardless of its size.
This is with a single (oldish) scsi disk on aic7xxx. I'd expect the
margin to widen on higher-end hardware which likes to have more
requests in flight.
Question is: what do we want to do with this sucker? These are the
remaining users of kiovecs:
drivers/md/lvm-snap.c
drivers/media/video/video-buf.c
drivers/mtd/devices/blkmtd.c
drivers/scsi/sg.c
The video and mtd drivers seem to be fairly easy to de-kiobufize.
I'm aware of one proprietary driver which uses kiobufs. XFS uses
kiobufs a little bit - just to map the pages.
So with a bit of effort and maintainer-irritation, we can extract
the kiobuf layer from the kernel.
|
|
into home.transmeta.com:/home/torvalds/v2.5/linux
|
|
* since the last caller of is_read_only() is gone, the function
itself is removed.
* destroy_buffers() is not used anymore; gone.
* fsync_dev() is gone; the only user is (broken) lvm.c and first
step in fixing lvm.c will consist of propagating struct block_device *
anyway; at that point we'll just use fsync_bdev() in there.
* prototype of bio_ioctl() removed - function doesn't exist
anymore.
|
|
ext2 and ext3 implement a custom LRU cache of buffer_heads - the eight
most-recently-used inode bitmap buffers and the eight MRU block bitmap
buffers.
I don't like them, for a number of reasons:
- The code is duplicated between filesystems
- The functionality is unavailable to other filesystems
- The LRU only applies to bitmap buffers. And not, say, indirects.
- The LRUs are subtly dependent upon lock_super() for protection:
without lock_super protection a bitmap could be evicted and freed
while in use.
And removing this dependence on lock_super() gets us one step on
the way toward getting that semaphore out of the ext2 block allocator -
it causes significant contention under some loads and should be a
spinlock.
- The LRUs pin 64 kbytes per mounted filesystem.
Now, we could just delete those LRUs and rely on the VM to manage the
memory. But that would introduce significant lock contention in
__find_get_block - the blockdev mapping's private_lock and page_lock
are heavily used.
So this patch introduces a transparent per-CPU bh lru which is hidden
inside __find_get_block(), __getblk() and __bread(). It is designed to
shorten code paths and to reduce lock contention. It uses a seven-slot
LRU. It achieves a 99% hit rate in `dbench 64'. It provides benefit
to all filesystems.
The next patches remove the open-coded LRUs from ext2 and ext3.
Taken together, these patches are a code cleanup (300-400 lines gone),
and they reduce lock contention. Anton tested these patches on the
32-way and demonstrated a throughput improvement of up to 15% on
RAM-only dbench runs. See http://samba.org/~anton/linux/2.5.24/dbench/
Most of this benefit is from avoiding find_get_page() on the blockdev
mapping. Because the generic LRU copes with indirect blocks as well as
bitmaps.
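A sketch of the per-CPU structure and the fast-path lookup (function name
made up; the real lookup hides inside __find_get_block()):

#define BH_LRU_SIZE	7

struct bh_lru {
	struct buffer_head *bhs[BH_LRU_SIZE];
};
static DEFINE_PER_CPU(struct bh_lru, bh_lrus);

static struct buffer_head *
bh_lru_lookup(struct block_device *bdev, sector_t block, int size)
{
	struct bh_lru *lru = &get_cpu_var(bh_lrus);
	struct buffer_head *ret = NULL;
	int i;

	for (i = 0; i < BH_LRU_SIZE; i++) {
		struct buffer_head *bh = lru->bhs[i];

		if (bh && bh->b_bdev == bdev &&
		    bh->b_blocknr == block && bh->b_size == size) {
			get_bh(bh);
			ret = bh;
			break;
		}
	}
	put_cpu_var(bh_lrus);
	return ret;	/* NULL: fall back to the find_get_page() path */
}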
|
|
Renames the buffer_head lookup function `get_hash_table' to
`find_get_block'.
get_hash_table() is too generic a name. Plus it doesn't even use a hash
any more.
|
|
The set_page_buffers() and clear_page_buffers() macros are each used in
only one place. Fold them into their callers.
|
|
highmem.h includes bio.h, so just about every compilation unit in the
kernel gets to process bio.h.
The patch moves the BIO-related functions out of highmem.h and into
bio-related headers. The nested include is removed and all files which
need to include bio.h now do so.
|
|
alloc_buffer_head() does not need the additional argument - GFP_NOFS is
always correct.
|
|
This patch changes the swap I/O handling. The objectives are:
- Remove swap special-casing
- Stop using buffer_heads -> direct-to-BIO
- Make S_ISREG swapfiles more robust.
I've spent quite some time with swap. The first patches converted swap to
use block_read/write_full_page(). These were discarded because they are
still using buffer_heads, and a reasonable amount of otherwise unnecessary
infrastructure had to be added to the swap code just to make it look like a
regular fs. So this code just has a custom direct-to-BIO path for swap,
which seems to be the most comfortable approach.
A significant thing here is the introduction of "swap extents". A swap
extent is a simple data structure which maps a range of swap pages onto a
range of disk sectors. It is simply:
struct swap_extent {
	struct list_head list;
	pgoff_t start_page;
	pgoff_t nr_pages;
	sector_t start_block;
};
At swapon time (for an S_ISREG swapfile), each block in the file is bmapped()
and the block numbers are parsed to generate the device's swap extent list.
This extent list is quite compact - a 512 megabyte swapfile generates about
130 nodes in the list. That's about 4 kbytes of storage. The conversion
from filesystem blocksize blocks into PAGE_SIZE blocks is performed at swapon
time.
At swapon time (for an S_ISBLK swapfile), we install a single swap extent
which describes the entire device.
The advantages of the swap extents are:
1: We never have to run bmap() (ie: read from disk) at swapout time. So
S_ISREG swapfiles are now just as robust as S_ISBLK swapfiles.
2: All the differences between S_ISBLK swapfiles and S_ISREG swapfiles are
handled at swapon time. During normal operation, we just don't care.
Both types of swapfiles are handled the same way.
3: The extent lists always operate in PAGE_SIZE units. So the problems of
going from fs blocksize to PAGE_SIZE are handled at swapon time and normal
operating code doesn't need to care.
4: Because we don't have to fiddle with different blocksizes, we can go
direct-to-BIO for swap_readpage() and swap_writepage(). This introduces
the kernel-wide invariant "anonymous pages never have buffers attached",
which cleans some things up nicely. All those block_flushpage() calls in
the swap code simply go away.
5: The kernel no longer has to allocate both buffer_heads and BIOs to
perform swapout. Just a BIO.
6: It permits us to perform swapcache writeout and throttling for
GFP_NOFS allocations (a later patch).
(Well, there is one sort of anon page which can have buffers: the pages which
are cast adrift in truncate_complete_page() because do_invalidatepage()
failed. But these pages are never added to swapcache, and nobody except the
VM LRU has to deal with them).
The swapfile parser in setup_swap_extents() will attempt to extract the
largest possible number of PAGE_SIZE-sized and PAGE_SIZE-aligned chunks of
disk from the S_ISREG swapfile. Any stray blocks (due to file
discontiguities) are simply discarded - we never swap to those.
If an S_ISREG swapfile is found to have any unmapped blocks (file holes) then
the swapon attempt will fail.
The extent list can be quite large (hundreds of nodes for a gigabyte S_ISREG
swapfile). It needs to be consulted once for each page within
swap_readpage() and swap_writepage(). Hence there is a risk that we could
blow significant amounts of CPU walking that list. However I have
implemented a "where we found the last block" cache, which is used as the
starting point for the next search. Empirical testing indicates that this is
wildly effective - the average length of the list walk in map_swap_page() is
0.3 iterations per page, with a 130-element list.
It _could_ be that some workloads do start suffering long walks in that code,
and perhaps a tree would be needed there. But I doubt that, and if this is
happening then it means that we're seeking all over the disk for swap I/O,
and the list walk is the least of our problems.
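Sketch of the offset-to-sector lookup over that extent list (field names
from the struct above; the extent_list head inside swap_info_struct and
the function name are assumptions - and the real code starts the walk at
the cached "where we found the last block" extent rather than the head):

sector_t swap_page_to_sector(struct swap_info_struct *sis, pgoff_t offset)
{
	struct swap_extent *se;

	list_for_each_entry(se, &sis->extent_list, list) {
		if (offset >= se->start_page &&
		    offset < se->start_page + se->nr_pages)
			return se->start_block + (offset - se->start_page);
	}
	BUG();	/* every valid swap offset is covered by some extent */
	return 0;
}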
rw_swap_page_nolock() now takes a page*, not a kernel virtual address. It
has been renamed to rw_swap_page_sync() and it takes care of locking and
unlocking the page itself. Which is all a much better interface.
Support for type 0 swap has been removed. Current versions of mkswap(8) seem
to never produce v0 swap unless you explicitly ask for it, so I doubt if this
will affect anyone. If you _do_ have a type 0 swapfile, swapon will fail and
the message
version 0 swap is no longer supported. Use mkswap -v1 /dev/sdb3
is printed. We can remove that code for real later on. Really, all that
swapfile header parsing should be pushed out to userspace.
This code always uses single-page BIOs for swapin and swapout. I have an
additional patch which converts swap to use mpage_writepages(), so we swap
out in 16-page BIOs. It works fine, but I don't intend to submit that.
There just doesn't seem to be any significant advantage to it.
I can't see anything in sys_swapon()/sys_swapoff() which needs the
lock_kernel() calls, so I deleted them.
If you ftruncate an S_ISREG swapfile to a shorter size while it is in use,
subsequent swapout will destroy the filesystem. It was always thus, but it
is much, much easier to do now. Not really a kernel problem, but swapon(8)
should not be allowing the kernel to use swapfiles which are modifiable by
unprivileged users.
|
|
Fixes a pet peeve: the identifier "flushpage" implies "flush the page
to disk". Which is very much not what the flushpage functions actually
do.
The patch renames block_flushpage and the flushpage
address_space_operation to "invalidatepage".
It also fixes a buglet in invalidate_this_page2(), which was calling
block_flushpage() directly - it needs to call do_flushpage() (now
do_invalidatepage()) so that the filesystem's ->flushpage (now
->invalidatepage) a_op gets a chance to relinquish any interest which
it has in the page's buffers.
|
|
block_symlink() is not a "block" function at all. It is a pure
pagecache/address_space function. Seeing driverfs calling it was
the last straw.
The patch renames it to `page_symlink()' and moves it into fs/namei.c
|