| author | Andrew Morton <akpm@zip.com.au> | 2002-06-17 20:19:13 -0700 |
|---|---|---|
| committer | Linus Torvalds <torvalds@home.transmeta.com> | 2002-06-17 20:19:13 -0700 |
| commit | 88c4650a9ece8fef2be042fbbec2dde2d0afa1a4 | |
| tree | 980e9aab05bbbe63602c0b01a8d81e6d727f246d | |
| parent | 3ab86fb0d43ce886f67521f4f0bb959901fa12c8 | |
[PATCH] direct-to-BIO I/O for swapcache pages
This patch changes the swap I/O handling. The objectives are:
- Remove swap special-casing
- Stop using buffer_heads -> direct-to-BIO
- Make S_ISREG swapfiles more robust.
I've spent quite some time with swap. The first patches converted swap to
use block_read/write_full_page(). These were discarded because they still
used buffer_heads, and a reasonable amount of otherwise-unnecessary
infrastructure had to be added to the swap code just to make it look like a
regular fs. So this code just has a custom direct-to-BIO path for swap,
which seems to be the most comfortable approach.
A significant thing here is the introduction of "swap extents". A swap
extent is a simple data structure which maps a range of swap pages onto a
range of disk sectors. It is simply:
struct swap_extent {
	struct list_head list;		/* node in the per-device extent list */
	pgoff_t start_page;		/* first swap page this extent covers */
	pgoff_t nr_pages;		/* number of pages covered */
	sector_t start_block;		/* first disk block, in PAGE_SIZE units */
};
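Within one extent, the translation from swap page offset to disk location
is purely linear. Illustratively (this is not the kernel's lookup code,
which also has to find the right extent first):

/* Illustrative only: translate an offset known to fall inside this
 * extent.  The result is in PAGE_SIZE-sized blocks, since extents
 * are converted to PAGE_SIZE units at swapon time (see below). */
static sector_t extent_to_block(const struct swap_extent *se, pgoff_t offset)
{
	return se->start_block + (offset - se->start_page);
}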
At swapon time (for an S_ISREG swapfile), each block in the file is bmapped()
and the block numbers are parsed to generate the device's swap extent list.
This extent list is quite compact - a 512 megabyte swapfile generates about
130 nodes in the list. That's about 4 kbytes of storage. The conversion
from filesystem blocksize blocks into PAGE_SIZE blocks is performed at swapon
time.
At swapon time (for an S_ISBLK swapfile), we install a single swap extent
which describes the entire device.
The advantages of the swap extents are:
1: We never have to run bmap() (ie: read from disk) at swapout time. So
S_ISREG swapfiles are now just as robust as S_ISBLK swapfiles.
2: All the differences between S_ISBLK swapfiles and S_ISREG swapfiles are
handled at swapon time. During normal operation, we just don't care.
Both types of swapfiles are handled the same way.
3: The extent lists always operate in PAGE_SIZE units. So the problems of
going from fs blocksize to PAGE_SIZE are handled at swapon time and normal
operating code doesn't need to care.
4: Because we don't have to fiddle with different blocksizes, we can go
direct-to-BIO for swap_readpage() and swap_writepage() (sketched below).
This introduces the kernel-wide invariant "anonymous pages never have
buffers attached", which cleans some things up nicely. All those
block_flushpage() calls in the swap code simply go away.
5: The kernel no longer has to allocate both buffer_heads and BIOs to
perform swapout. Just a BIO.
6: It permits us to perform swapcache writeout and throttling for
GFP_NOFS allocations (a later patch).
(Well, there is one sort of anon page which can have buffers: the pages which
are cast adrift in truncate_complete_page() because do_invalidatepage()
failed. But these pages are never added to swapcache, and nobody except the
VM LRU has to deal with them).
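To make point 4 concrete, here is a hedged sketch of the single-page
direct-to-BIO writeout path. It is schematic only: the block-layer
signatures shown (bio_alloc(), bio_add_page(), submit_bio()) have varied
across kernel versions, swap_block_device() is a made-up stand-in for the
swap device lookup, and map_swap_page() is assumed here to return the
PAGE_SIZE-sized block for the entry.

/*
 * Hedged sketch of point 4: one page, one BIO, no buffer_heads.
 */
static int swap_writepage_sketch(struct page *page)
{
	swp_entry_t entry = { .val = page->index };
	struct bio *bio = bio_alloc(GFP_NOIO, 1);

	if (!bio)
		return -ENOMEM;

	/* extents are in PAGE_SIZE blocks; BIOs want 512-byte sectors */
	bio->bi_sector = map_swap_page(entry) * (PAGE_SIZE >> 9);
	bio->bi_bdev = swap_block_device(entry);	/* hypothetical helper */
	bio_add_page(bio, page, PAGE_SIZE, 0);
	bio->bi_end_io = end_swap_bio_write;	/* ends writeback, unlocks */

	set_page_writeback(page);
	unlock_page(page);
	submit_bio(WRITE, bio);			/* straight to the block layer */
	return 0;
}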
The swapfile parser in setup_swap_extents() will attempt to extract the
largest possible number of PAGE_SIZE-sized and PAGE_SIZE-aligned chunks of
disk from the S_ISREG swapfile. Any stray blocks (due to file
discontiguities) are simply discarded - we never swap to those.
If an S_ISREG swapfile is found to have any unmapped blocks (file holes) then
the swapon attempt will fail.
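In outline, the parser's per-page decision looks something like the sketch
below. bmap_block() and add_swap_extent() are stand-ins for the real
filesystem bmap call and the (coalescing) extent-list insertion; the real
setup_swap_extents() differs in detail.

/*
 * Sketch of per-page extent construction at swapon time for an
 * S_ISREG swapfile.  add_swap_extent() is assumed to merge with the
 * previous extent when the new chunk continues it.  (For S_ISBLK,
 * a single extent covering the whole device is installed instead.)
 */
static int build_extents_sketch(struct inode *inode)
{
	unsigned blocks_per_page = PAGE_SIZE >> inode->i_blkbits;
	pgoff_t nr = inode->i_size >> PAGE_SHIFT;
	pgoff_t page;

	for (page = 0; page < nr; page++) {
		sector_t first = bmap_block(inode, page * blocks_per_page);
		unsigned i;
		int contig = 1;

		if (first == 0)
			return -EINVAL;		/* hole: the swapon fails */

		for (i = 1; i < blocks_per_page; i++) {
			sector_t b = bmap_block(inode,
					page * blocks_per_page + i);
			if (b == 0)
				return -EINVAL;	/* hole: the swapon fails */
			if (b != first + i)
				contig = 0;	/* discontiguity inside the page */
		}

		/* stray (discontiguous or unaligned) pages are discarded */
		if (!contig || (first & (blocks_per_page - 1)))
			continue;

		/* record in PAGE_SIZE units: the blocksize conversion */
		add_swap_extent(page, 1,
				first >> (PAGE_SHIFT - inode->i_blkbits));
	}
	return 0;
}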
The extent list can be quite large (hundreds of nodes for a gigabyte S_ISREG
swapfile). It needs to be consulted once for each page within
swap_readpage() and swap_writepage(). Hence there is a risk that we could
blow significant amounts of CPU walking that list. However I have
implemented a "where we found the last block" cache, which is used as the
starting point for the next search. Empirical testing indicates that this is
wildly effective - the average length of the list walk in map_swap_page() is
0.3 iterations per page, with a 130-element list.
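In sketch form, the cached walk looks like this, where *cursor is the
"where we found the last block" cache (the real map_swap_page() differs in
detail):

/*
 * Sketch of the cached extent lookup.  *cursor remembers the extent
 * which satisfied the previous lookup; swap I/O has enough locality
 * that the next offset usually hits the same extent.  Every valid
 * offset is covered by some extent, so the loop terminates.
 */
static sector_t map_swap_page_sketch(struct list_head *extents,
				     struct swap_extent **cursor,
				     pgoff_t offset)
{
	struct swap_extent *se = *cursor;

	for (;;) {
		if (offset >= se->start_page &&
		    offset < se->start_page + se->nr_pages) {
			*cursor = se;		/* start here next time */
			return se->start_block + (offset - se->start_page);
		}
		se = list_entry(se->list.next, struct swap_extent, list);
		if (&se->list == extents)	/* skip over the list head */
			se = list_entry(extents->next,
					struct swap_extent, list);
	}
}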
It _could_ be that some workloads do start suffering long walks in that code,
and perhaps a tree would be needed there. But I doubt that, and if this is
happening then it means that we're seeking all over the disk for swap I/O,
and the list walk is the least of our problems.
rw_swap_page_nolock() now takes a page*, not a kernel virtual address. It
has been renamed to rw_swap_page_sync() and it takes care of locking and
unlocking the page itself, which is altogether a better interface.
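For reference, the shape of the new interface (signature approximate,
treat the details as an assumption rather than a quote from the patch):

/* Old: rw_swap_page_nolock(int rw, swp_entry_t entry, char *buf)
 *      - took a kernel virtual address; the caller handled locking.
 * New: takes the page itself, locks it, does the I/O and waits. */
int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page);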
Support for type 0 swap has been removed. Current versions of mkswap(8) seem
to never produce v0 swap unless you explicitly ask for it, so I doubt if this
will affect anyone. If you _do_ have a type 0 swapfile, swapon will fail and
the message
	version 0 swap is no longer supported. Use mkswap -v1 /dev/sdb3
is printed. We can remove that code for real later on. Really, all that
swapfile header parsing should be pushed out to userspace.
This code always uses single-page BIOs for swapin and swapout. I have an
additional patch which converts swap to use mpage_writepages(), so we swap
out in 16-page BIOs. It works fine, but I don't intend to submit that.
There just doesn't seem to be any significant advantage to it.
I can't see anything in sys_swapon()/sys_swapoff() which needs the
lock_kernel() calls, so I deleted them.
If you ftruncate an S_ISREG swapfile to a shorter size while it is in use,
subsequent swapout will destroy the filesystem. It was always thus, but it
is much, much easier to do now. Not really a kernel problem, but swapon(8)
should not be allowing the kernel to use swapfiles which are modifiable by
unprivileged users.