| | | |
|---|---|---|
| author | Andrew Morton <akpm@zip.com.au> | 2002-06-17 20:19:13 -0700 |
| committer | Linus Torvalds <torvalds@home.transmeta.com> | 2002-06-17 20:19:13 -0700 |
| commit | 88c4650a9ece8fef2be042fbbec2dde2d0afa1a4 | |
| tree | 980e9aab05bbbe63602c0b01a8d81e6d727f246d /include | |
| parent | 3ab86fb0d43ce886f67521f4f0bb959901fa12c8 | |
[PATCH] direct-to-BIO I/O for swapcache pages
This patch changes the swap I/O handling. The objectives are:
- Remove swap special-casing
- Stop using buffer_heads -> direct-to-BIO
- Make S_ISREG swapfiles more robust.
I've spent quite some time with swap. The first patches converted swap to
use block_read/write_full_page(). These were discarded because they are
still using buffer_heads, and a reasonable amount of otherwise unnecessary
infrastructure had to be added to the swap code just to make it look like a
regular fs. So this code just has a custom direct-to-BIO path for swap,
which seems to be the most comfortable approach.
A significant thing here is the introduction of "swap extents". A swap
extent is a simple data structure which maps a range of swap pages onto a
range of disk sectors. It is simply:
```c
struct swap_extent {
	struct list_head list;
	pgoff_t start_page;
	pgoff_t nr_pages;
	sector_t start_block;
};
```
At swapon time (for an S_ISREG swapfile), each block in the file is bmapped()
and the block numbers are parsed to generate the device's swap extent list.
This extent list is quite compact - a 512 megabyte swapfile generates about
130 nodes in the list. That's about 4 kbytes of storage. The conversion
from filesystem blocksize blocks into PAGE_SIZE blocks is performed at swapon
time.
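That coalescing pass can be sketched in userspace C. This is only an illustration of the idea, not the kernel code: `setup_extents`, the `blocks[]` array standing in for `bmap()`, and the plain `next` pointer standing in for `list_head` are all hypothetical names, and it assumes the filesystem blocksize equals PAGE_SIZE.

```c
#include <stdlib.h>

/* Illustrative stand-in for the kernel's swap_extent (list_head
 * replaced by a plain next pointer). */
struct swap_extent {
	struct swap_extent *next;
	unsigned long start_page;
	unsigned long nr_pages;
	unsigned long long start_block;
};

/*
 * Build an extent list from the per-page disk blocks of a swapfile,
 * coalescing runs of contiguous blocks into single extents.
 * blocks[i] < 0 denotes an unmapped block (a file hole), which fails
 * the whole operation.  Returns the number of extents, or -1 on a hole.
 */
static int setup_extents(const long long *blocks, int npages,
			 struct swap_extent **head)
{
	struct swap_extent *tail = NULL;
	int nr_extents = 0;

	for (int page = 0; page < npages; page++) {
		long long blk = blocks[page];

		if (blk < 0)
			return -1;	/* hole: refuse to swap here */
		if (tail && (unsigned long long)blk ==
				tail->start_block + tail->nr_pages) {
			tail->nr_pages++;	/* contiguous: grow extent */
			continue;
		}
		struct swap_extent *se = calloc(1, sizeof(*se));
		if (!se)
			return -1;
		se->start_page = page;
		se->nr_pages = 1;
		se->start_block = (unsigned long long)blk;
		if (tail)
			tail->next = se;
		else
			*head = se;
		tail = se;
		nr_extents++;
	}
	return nr_extents;
}
```

Five pages mapped to blocks 100, 101, 102, 200, 201 coalesce into two extents; a negative (unmapped) block aborts the build, mirroring the swapon failure for files with holes described below.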
At swapon time (for an S_ISBLK swapfile), we install a single swap extent
which describes the entire device.
The advantages of the swap extents are:
1: We never have to run bmap() (ie: read from disk) at swapout time. So
S_ISREG swapfiles are now just as robust as S_ISBLK swapfiles.
2: All the differences between S_ISBLK swapfiles and S_ISREG swapfiles are
handled at swapon time. During normal operation, we just don't care.
Both types of swapfiles are handled the same way.
3: The extent lists always operate in PAGE_SIZE units. So the problems of
going from fs blocksize to PAGE_SIZE are handled at swapon time and normal
operating code doesn't need to care.
4: Because we don't have to fiddle with different blocksizes, we can go
direct-to-BIO for swap_readpage() and swap_writepage(). This introduces
the kernel-wide invariant "anonymous pages never have buffers attached",
which cleans some things up nicely. All those block_flushpage() calls in
the swap code simply go away.
5: The kernel no longer has to allocate both buffer_heads and BIOs to
perform swapout. Just a BIO.
6: It permits us to perform swapcache writeout and throttling for
GFP_NOFS allocations (a later patch).
(Well, there is one sort of anon page which can have buffers: the pages which
are cast adrift in truncate_complete_page() because do_invalidatepage()
failed. But these pages are never added to swapcache, and nobody except the
VM LRU has to deal with them).
The swapfile parser in setup_swap_extents() will attempt to extract the
largest possible number of PAGE_SIZE-sized and PAGE_SIZE-aligned chunks of
disk from the S_ISREG swapfile. Any stray blocks (due to file
discontiguities) are simply discarded - we never swap to those.
If an S_ISREG swapfile is found to have any unmapped blocks (file holes) then
the swapon attempt will fail.
The extent list can be quite large (hundreds of nodes for a gigabyte S_ISREG
swapfile). It needs to be consulted once for each page within
swap_readpage() and swap_writepage(). Hence there is a risk that we could
blow significant amounts of CPU walking that list. However I have
implemented a "where we found the last block" cache, which is used as the
starting point for the next search. Empirical testing indicates that this is
wildly effective - the average length of the list walk in map_swap_page() is
0.3 iterations per page, with a 130-element list.
It _could_ be that some workloads do start suffering long walks in that code,
and perhaps a tree would be needed there. But I doubt that, and if this is
happening then it means that we're seeking all over the disk for swap I/O,
and the list walk is the least of our problems.
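The cached walk can be sketched as follows. Again a userspace illustration under stated assumptions: `map_page` is a hypothetical name, the `next` pointer emulates the kernel's circular `list_head`, and the caller guarantees the offset lies within some extent (as swap_readpage()/swap_writepage() do), so the loop always terminates.

```c
/* Illustrative extent node: next forms a circular list. */
struct swap_extent {
	struct swap_extent *next;
	unsigned long start_page;
	unsigned long nr_pages;
	unsigned long long start_block;
};

/*
 * Map a swapfile page offset to a disk block.  The search starts at
 * the extent which satisfied the previous lookup (*cache), so runs of
 * nearby offsets terminate almost immediately.
 */
static unsigned long long map_page(struct swap_extent **cache,
				   unsigned long offset)
{
	struct swap_extent *se = *cache;

	for (;;) {
		if (offset >= se->start_page &&
		    offset < se->start_page + se->nr_pages) {
			*cache = se;	/* remember for the next call */
			return se->start_block +
				(offset - se->start_page);
		}
		se = se->next;		/* wraps: the list is circular */
	}
}
```

Because swap I/O tends to hit neighbouring offsets back to back, the cached extent usually matches on the first test, which is where the 0.3-iterations-per-page average comes from.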
rw_swap_page_nolock() now takes a page*, not a kernel virtual address. It
has been renamed to rw_swap_page_sync() and it takes care of locking and
unlocking the page itself, which is altogether a much better interface.
Support for type 0 swap has been removed. Current versions of mkswap(8) seem
to never produce v0 swap unless you explicitly ask for it, so I doubt this
will affect anyone. If you _do_ have a type 0 swapfile, swapon will fail and
the message

    version 0 swap is no longer supported. Use mkswap -v1 /dev/sdb3

is printed. We can remove that code for real later on. Really, all that
swapfile header parsing should be pushed out to userspace.
This code always uses single-page BIOs for swapin and swapout. I have an
additional patch which converts swap to use mpage_writepages(), so we swap
out in 16-page BIOs. It works fine, but I don't intend to submit that.
There just doesn't seem to be any significant advantage to it.
I can't see anything in sys_swapon()/sys_swapoff() which needs the
lock_kernel() calls, so I deleted them.
If you ftruncate an S_ISREG swapfile to a shorter size while it is in use,
subsequent swapout will destroy the filesystem. It was always thus, but it
is much, much easier to do now. Not really a kernel problem, but swapon(8)
should not be allowing the kernel to use swapfiles which are modifiable by
unprivileged users.
Diffstat (limited to 'include')
| | | |
|---|---|---|
| -rw-r--r-- | include/linux/buffer_head.h | 1 |
| -rw-r--r-- | include/linux/swap.h | 32 |

2 files changed, 28 insertions, 5 deletions
```diff
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 90767fc78617..fda967ab9358 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -183,7 +183,6 @@ struct buffer_head * __bread(struct block_device *, int, int);
 void wakeup_bdflush(void);
 struct buffer_head *alloc_buffer_head(int async);
 void free_buffer_head(struct buffer_head * bh);
-int brw_page(int, struct page *, struct block_device *, sector_t [], int);
 void FASTCALL(unlock_buffer(struct buffer_head *bh));
 
 /*
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d0160265e3c5..0b448a811a39 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -5,6 +5,7 @@
 #include <linux/kdev_t.h>
 #include <linux/linkage.h>
 #include <linux/mmzone.h>
+#include <linux/list.h>
 #include <asm/page.h>
 
 #define SWAP_FLAG_PREFER 0x8000 /* set if swap priority specified */
@@ -62,6 +63,21 @@ typedef struct {
 #ifdef __KERNEL__
 
 /*
+ * A swap extent maps a range of a swapfile's PAGE_SIZE pages onto a range of
+ * disk blocks.  A list of swap extents maps the entire swapfile.  (Where the
+ * term `swapfile' refers to either a blockdevice or an IS_REG file.  Apart
+ * from setup, they're handled identically.
+ *
+ * We always assume that blocks are of size PAGE_SIZE.
+ */
+struct swap_extent {
+	struct list_head list;
+	pgoff_t start_page;
+	pgoff_t nr_pages;
+	sector_t start_block;
+};
+
+/*
  * Max bad pages in the new format..
  */
 #define __swapoffset(x) ((unsigned long)&((union swap_header *)0)->x)
@@ -83,11 +99,17 @@ enum {
 
 /*
  * The in-memory structure used to track swap areas.
+ * extent_list.prev points at the lowest-index extent.  That list is
+ * sorted.
  */
 struct swap_info_struct {
 	unsigned int flags;
 	spinlock_t sdev_lock;
 	struct file *swap_file;
+	struct block_device *bdev;
+	struct list_head extent_list;
+	int nr_extents;
+	struct swap_extent *curr_swap_extent;
 	unsigned old_block_size;
 	unsigned short * swap_map;
 	unsigned int lowest_bit;
@@ -134,8 +156,9 @@ extern wait_queue_head_t kswapd_wait;
 extern int FASTCALL(try_to_free_pages(zone_t *, unsigned int, unsigned int));
 
 /* linux/mm/page_io.c */
-extern void rw_swap_page(int, struct page *);
-extern void rw_swap_page_nolock(int, swp_entry_t, char *);
+int swap_readpage(struct file *file, struct page *page);
+int swap_writepage(struct page *page);
+int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page);
 
 /* linux/mm/page_alloc.c */
@@ -163,12 +186,13 @@ extern unsigned int nr_swapfiles;
 extern struct swap_info_struct swap_info[];
 extern void si_swapinfo(struct sysinfo *);
 extern swp_entry_t get_swap_page(void);
-extern void get_swaphandle_info(swp_entry_t, unsigned long *, struct inode **);
 extern int swap_duplicate(swp_entry_t);
-extern int swap_count(struct page *);
 extern int valid_swaphandles(swp_entry_t, unsigned long *);
 extern void swap_free(swp_entry_t);
 extern void free_swap_and_cache(swp_entry_t);
+sector_t map_swap_page(struct swap_info_struct *p, pgoff_t offset);
+struct swap_info_struct *get_swap_info_struct(unsigned type);
+
 struct swap_list_t {
 	int head;	/* head of priority-ordered swapfile list */
 	int next;	/* swapfile to be used next */
```