| | | |
|---|---|---|
| author | Andrew Morton <akpm@zip.com.au> | 2002-06-17 20:19:13 -0700 |
| committer | Linus Torvalds <torvalds@home.transmeta.com> | 2002-06-17 20:19:13 -0700 |
| commit | 88c4650a9ece8fef2be042fbbec2dde2d0afa1a4 | |
| tree | 980e9aab05bbbe63602c0b01a8d81e6d727f246d /include | |
| parent | 3ab86fb0d43ce886f67521f4f0bb959901fa12c8 | |
[PATCH] direct-to-BIO I/O for swapcache pages
This patch changes the swap I/O handling. The objectives are:
- Remove swap special-casing
- Stop using buffer_heads -> direct-to-BIO
- Make S_ISREG swapfiles more robust.
I've spent quite some time with swap. The first patches converted swap to
use block_read/write_full_page(). These were discarded because they are
still using buffer_heads, and a reasonable amount of otherwise unnecessary
infrastructure had to be added to the swap code just to make it look like a
regular fs. So this code just has a custom direct-to-BIO path for swap,
which seems to be the most comfortable approach.
A significant thing here is the introduction of "swap extents". A swap
extent is a simple data structure which maps a range of swap pages onto a
range of disk sectors. It is simply:
```c
struct swap_extent {
	struct list_head list;
	pgoff_t start_page;
	pgoff_t nr_pages;
	sector_t start_block;
};
```
At swapon time (for an S_ISREG swapfile), each block in the file is bmapped()
and the block numbers are parsed to generate the device's swap extent list.
This extent list is quite compact - a 512 megabyte swapfile generates about
130 nodes in the list. That's about 4 kbytes of storage. The conversion
from filesystem blocksize blocks into PAGE_SIZE blocks is performed at swapon
time.
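That coalescing pass can be sketched in userspace C. This is only an illustration of the idea, not the kernel code: `setup_extents`, the `blocks[]` array standing in for `bmap()`, and the plain `next` pointer standing in for `list_head` are all hypothetical names, and it assumes the filesystem blocksize equals PAGE_SIZE.

```c
#include <stdlib.h>

/* Illustrative stand-in for the kernel's swap_extent (list_head
 * replaced by a plain next pointer). */
struct swap_extent {
	struct swap_extent *next;
	unsigned long start_page;
	unsigned long nr_pages;
	unsigned long long start_block;
};

/*
 * Build an extent list from the per-page disk blocks of a swapfile,
 * coalescing runs of contiguous blocks into single extents.
 * blocks[i] < 0 denotes an unmapped block (a file hole), which fails
 * the whole operation.  Returns the number of extents, or -1 on a hole.
 */
static int setup_extents(const long long *blocks, int npages,
			 struct swap_extent **head)
{
	struct swap_extent *tail = NULL;
	int nr_extents = 0;

	for (int page = 0; page < npages; page++) {
		long long blk = blocks[page];

		if (blk < 0)
			return -1;	/* hole: refuse to swap here */
		if (tail && (unsigned long long)blk ==
				tail->start_block + tail->nr_pages) {
			tail->nr_pages++;	/* contiguous: grow extent */
			continue;
		}
		struct swap_extent *se = calloc(1, sizeof(*se));
		if (!se)
			return -1;
		se->start_page = page;
		se->nr_pages = 1;
		se->start_block = (unsigned long long)blk;
		if (tail)
			tail->next = se;
		else
			*head = se;
		tail = se;
		nr_extents++;
	}
	return nr_extents;
}
```

Five pages mapped to blocks 100, 101, 102, 200, 201 coalesce into two extents; a negative (unmapped) block aborts the build, mirroring the swapon failure for files with holes described below.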
At swapon time (for an S_ISBLK swapfile), we install a single swap extent
which describes the entire device.
The advantages of the swap extents are:
1: We never have to run bmap() (ie: read from disk) at swapout time. So
S_ISREG swapfiles are now just as robust as S_ISBLK swapfiles.
2: All the differences between S_ISBLK swapfiles and S_ISREG swapfiles are
handled at swapon time. During normal operation, we just don't care.
Both types of swapfiles are handled the same way.
3: The extent lists always operate in PAGE_SIZE units. So the problems of
going from fs blocksize to PAGE_SIZE are handled at swapon time and normal
operating code doesn't need to care.
4: Because we don't have to fiddle with different blocksizes, we can go
direct-to-BIO for swap_readpage() and swap_writepage(). This introduces
the kernel-wide invariant "anonymous pages never have buffers attached",
which cleans some things up nicely. All those block_flushpage() calls in
the swap code simply go away.
5: The kernel no longer has to allocate both buffer_heads and BIOs to
perform swapout. Just a BIO.
6: It permits us to perform swapcache writeout and throttling for
GFP_NOFS allocations (a later patch).
(Well, there is one sort of anon page which can have buffers: the pages which
are cast adrift in truncate_complete_page() because do_invalidatepage()
failed. But these pages are never added to swapcache, and nobody except the
VM LRU has to deal with them).
The swapfile parser in setup_swap_extents() will attempt to extract the
largest possible number of PAGE_SIZE-sized and PAGE_SIZE-aligned chunks of
disk from the S_ISREG swapfile. Any stray blocks (due to file
discontiguities) are simply discarded - we never swap to those.
If an S_ISREG swapfile is found to have any unmapped blocks (file holes) then
the swapon attempt will fail.
The extent list can be quite large (hundreds of nodes for a gigabyte S_ISREG
swapfile). It needs to be consulted once for each page within
swap_readpage() and swap_writepage(). Hence there is a risk that we could
blow significant amounts of CPU walking that list. However I have
implemented a "where we found the last block" cache, which is used as the
starting point for the next search. Empirical testing indicates that this is
wildly effective - the average length of the list walk in map_swap_page() is
0.3 iterations per page, with a 130-element list.
It _could_ be that some workloads do start suffering long walks in that code,
and perhaps a tree would be needed there. But I doubt that, and if this is
happening then it means that we're seeking all over the disk for swap I/O,
and the list walk is the least of our problems.
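The cached walk can be sketched as follows. Again a userspace illustration under stated assumptions: `map_page` is a hypothetical name, the `next` pointer emulates the kernel's circular `list_head`, and the caller guarantees the offset lies within some extent (as swap_readpage()/swap_writepage() do), so the loop always terminates.

```c
/* Illustrative extent node: next forms a circular list. */
struct swap_extent {
	struct swap_extent *next;
	unsigned long start_page;
	unsigned long nr_pages;
	unsigned long long start_block;
};

/*
 * Map a swapfile page offset to a disk block.  The search starts at
 * the extent which satisfied the previous lookup (*cache), so runs of
 * nearby offsets terminate almost immediately.
 */
static unsigned long long map_page(struct swap_extent **cache,
				   unsigned long offset)
{
	struct swap_extent *se = *cache;

	for (;;) {
		if (offset >= se->start_page &&
		    offset < se->start_page + se->nr_pages) {
			*cache = se;	/* remember for the next call */
			return se->start_block +
				(offset - se->start_page);
		}
		se = se->next;		/* wraps: the list is circular */
	}
}
```

Because swap I/O tends to hit neighbouring offsets back to back, the cached extent usually matches on the first test, which is where the 0.3-iterations-per-page average comes from.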
rw_swap_page_nolock() now takes a page*, not a kernel virtual address. It
has been renamed to rw_swap_page_sync() and it takes care of locking and
unlocking the page itself, which is altogether a much better interface.
Support for type 0 swap has been removed. Current versions of mkswap(8) seem
to never produce v0 swap unless you explicitly ask for it, so I doubt this
will affect anyone. If you _do_ have a type 0 swapfile, swapon will fail and
the message

    version 0 swap is no longer supported. Use mkswap -v1 /dev/sdb3

is printed. We can remove that code for real later on. Really, all that
swapfile header parsing should be pushed out to userspace.
This code always uses single-page BIOs for swapin and swapout. I have an
additional patch which converts swap to use mpage_writepages(), so we swap
out in 16-page BIOs. It works fine, but I don't intend to submit that.
There just doesn't seem to be any significant advantage to it.
I can't see anything in sys_swapon()/sys_swapoff() which needs the
lock_kernel() calls, so I deleted them.
If you ftruncate an S_ISREG swapfile to a shorter size while it is in use,
subsequent swapout will destroy the filesystem. It was always thus, but it
is much, much easier to do now. Not really a kernel problem, but swapon(8)
should not be allowing the kernel to use swapfiles which are modifiable by
unprivileged users.
Diffstat (limited to 'include')
| | | |
|---|---|---|
| -rw-r--r-- | include/linux/buffer_head.h | 1 |
| -rw-r--r-- | include/linux/swap.h | 32 |

2 files changed, 28 insertions, 5 deletions
```diff
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 90767fc78617..fda967ab9358 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -183,7 +183,6 @@ struct buffer_head * __bread(struct block_device *, int, int);
 void wakeup_bdflush(void);
 struct buffer_head *alloc_buffer_head(int async);
 void free_buffer_head(struct buffer_head * bh);
-int brw_page(int, struct page *, struct block_device *, sector_t [], int);
 void FASTCALL(unlock_buffer(struct buffer_head *bh));
 
 /*
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d0160265e3c5..0b448a811a39 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -5,6 +5,7 @@
 #include <linux/kdev_t.h>
 #include <linux/linkage.h>
 #include <linux/mmzone.h>
+#include <linux/list.h>
 #include <asm/page.h>
 
 #define SWAP_FLAG_PREFER 0x8000 /* set if swap priority specified */
@@ -62,6 +63,21 @@ typedef struct {
 #ifdef __KERNEL__
 
 /*
+ * A swap extent maps a range of a swapfile's PAGE_SIZE pages onto a range of
+ * disk blocks.  A list of swap extents maps the entire swapfile.  (Where the
+ * term `swapfile' refers to either a blockdevice or an IS_REG file.  Apart
+ * from setup, they're handled identically.
+ *
+ * We always assume that blocks are of size PAGE_SIZE.
+ */
+struct swap_extent {
+	struct list_head list;
+	pgoff_t start_page;
+	pgoff_t nr_pages;
+	sector_t start_block;
+};
+
+/*
  * Max bad pages in the new format..
  */
 #define __swapoffset(x) ((unsigned long)&((union swap_header *)0)->x)
@@ -83,11 +99,17 @@ enum {
 
 /*
  * The in-memory structure used to track swap areas.
+ * extent_list.prev points at the lowest-index extent.  That list is
+ * sorted.
  */
 struct swap_info_struct {
 	unsigned int flags;
 	spinlock_t sdev_lock;
 	struct file *swap_file;
+	struct block_device *bdev;
+	struct list_head extent_list;
+	int nr_extents;
+	struct swap_extent *curr_swap_extent;
 	unsigned old_block_size;
 	unsigned short * swap_map;
 	unsigned int lowest_bit;
@@ -134,8 +156,9 @@ extern wait_queue_head_t kswapd_wait;
 extern int FASTCALL(try_to_free_pages(zone_t *, unsigned int, unsigned int));
 
 /* linux/mm/page_io.c */
-extern void rw_swap_page(int, struct page *);
-extern void rw_swap_page_nolock(int, swp_entry_t, char *);
+int swap_readpage(struct file *file, struct page *page);
+int swap_writepage(struct page *page);
+int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page);
 
 /* linux/mm/page_alloc.c */
@@ -163,12 +186,13 @@ extern unsigned int nr_swapfiles;
 extern struct swap_info_struct swap_info[];
 extern void si_swapinfo(struct sysinfo *);
 extern swp_entry_t get_swap_page(void);
-extern void get_swaphandle_info(swp_entry_t, unsigned long *, struct inode **);
 extern int swap_duplicate(swp_entry_t);
-extern int swap_count(struct page *);
 extern int valid_swaphandles(swp_entry_t, unsigned long *);
 extern void swap_free(swp_entry_t);
 extern void free_swap_and_cache(swp_entry_t);
+sector_t map_swap_page(struct swap_info_struct *p, pgoff_t offset);
+struct swap_info_struct *get_swap_info_struct(unsigned type);
+
 struct swap_list_t {
 	int head;	/* head of priority-ordered swapfile list */
 	int next;	/* swapfile to be used next */
```