author    Andrew Morton <akpm@zip.com.au>    2002-05-27 05:12:36 -0700
committer Linus Torvalds <torvalds@home.transmeta.com>    2002-05-27 05:12:36 -0700
commit    bc67de559fb1ceda7a34ad67f03b643399dfa284 (patch)
tree      c20d6a75483dc9a1391890aeb7aa3f396fbadbc5 /include
parent    47279570a1fd14b1fbd75f020ae151d5ef20a7c4 (diff)
[PATCH] direct-to-BIO readahead
Implements BIO-based multipage reads into the pagecache, and turns this
on for ext2.

CPU load for `cat large_file > /dev/null' is reduced by approximately
15%.  Similar reductions for tiobench with a single thread.  (Earlier
claims of 25% were exaggerated - they were measured with slab debug
enabled.  But 15% isn't bad for a load which is dominated by
copy_*_user costs.)

With 2, 4 and 8 tiobench threads, throughput is increased as well,
which was unexpected.  It's due to request queue weirdness.  (Generally
the request queueing is doing bad things under certain workloads -
that's a separate issue.)

BIOs of up to 64 kbytes are assembled and submitted for readahead and
for single-page reads.  So the work involved in reading 32 pages has
gone from:

- allocate and attach 32 buffer_heads
- submit 32 buffer_heads
- allocate 32 bios
- submit 32 bios

to:

- allocate 2 bios
- submit 2 bios

These pages never have buffers attached.  Buffers will be attached
later if the application writes to these pages (file overwrite).

The first version of this code (in the "delayed allocation" patches)
tried to handle everything - BIOs which start mid-page, BIOs which end
mid-page and pages which are covered by multiple BIOs.  It was very
complex code and in fact appears to be incorrect: out-of-order BIO
completion could cause a page to come unlocked at the wrong time.

This implementation is much simpler: if things get complex, it just
falls back to the buffer-based block_read_full_page(), which isn't
going away, and which understands all that complexity.  There's no
point in doing this in two places.

This code will bypass the buffer layer for:

- fully-mapped pages which are on-disk contiguous
- fully unmapped pages (holes)
- partially unmapped pages, where the unmappedness is at the end of
  the page (end-of-file)

and everything else falls back to buffers.

This means that with blocksize == PAGE_CACHE_SIZE, 100% of pages are
handed direct to BIO.  With a heavy 10-minute dbench run on 4k
PAGE_CACHE_SIZE and 1k blocks, 95% of pages were handed direct to BIO.
Almost all of the other 5% were passed to block_read_full_page()
because they were already partially uptodate from an earlier sub-page
write().

This ratio will fall if PAGE_CACHE_SIZE/blocksize is greater than
four.  But if that's the case, CPU efficiency is far from the main
concern - there are significant seek and bandwidth problems at just
4 blocks per page.

This code will stress out the block layer somewhat - RAID0 doesn't
like multipage BIOs, and there are probably others.  RAID0 seems to
struggle along - readahead fails, but reads fall back to single-page
requests, which succeed.  Such problems may be worked around by
setting MPAGE_BIO_MAX_SIZE to PAGE_CACHE_SIZE in fs/mpage.c.

It is trivial to enable multipage reads for many other filesystems.
We can do that after completion of external testing of ext2.
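The per-page decision described above can be sketched as follows.  This
is not the patch's actual fs/mpage.c (which lies outside this 'include'
diffstat); the function name, the elided BIO assembly and the
era-approximate types are all illustrative.  Only get_block_t, the
on-disk contiguity rule and the block_read_full_page() fallback come
from the description above.

/*
 * Illustrative sketch: probe every block in the page via get_block();
 * if all blocks are mapped and on-disk contiguous, the page can join
 * the BIO being assembled (up to 64k).  Anything else falls back to
 * the buffer-based path.  Hole and end-of-file handling are elided.
 */
static struct bio *
mpage_try_readpage(struct bio *bio, struct page *page, get_block_t get_block)
{
	struct inode *inode = page->mapping->host;
	const unsigned blkbits = inode->i_blkbits;
	const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
	sector_t block_in_file = page->index << (PAGE_CACHE_SHIFT - blkbits);
	sector_t first_block = 0;
	struct buffer_head bh;
	unsigned i;

	for (i = 0; i < blocks_per_page; i++) {
		bh.b_state = 0;
		if (get_block(inode, block_in_file + i, &bh, 0))
			goto confused;
		if (!buffer_mapped(&bh))
			goto confused;		/* hole handling elided */
		if (i == 0)
			first_block = bh.b_blocknr;
		else if (bh.b_blocknr != first_block + i)
			goto confused;		/* not on-disk contiguous */
	}

	/* ... add the page to 'bio', allocating a new BIO if needed ... */
	return bio;

confused:
	/* submit any partially-built BIO here, then take the old road */
	block_read_full_page(page, get_block);
	return NULL;
}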
Diffstat (limited to 'include')
-rw-r--r--  include/linux/buffer_head.h  |  2
-rw-r--r--  include/linux/fs.h           |  3
-rw-r--r--  include/linux/mpage.h        | 15
3 files changed, 20 insertions(+), 0 deletions(-)
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index b52eb321e898..7550c3bfb7c0 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -24,6 +24,7 @@ enum bh_state_bits {
 	BH_Async_Write,	/* Is under end_buffer_async_write I/O */
 	BH_JBD,		/* Has an attached ext3 journal_head */
+	BH_Boundary,	/* Block is followed by a discontiguity */
 	BH_PrivateStart,/* not a state bit, but the first bit available
 			 * for private allocation by other entities
 			 */
@@ -106,6 +107,7 @@ BUFFER_FNS(Mapped, mapped)
 BUFFER_FNS(New, new)
 BUFFER_FNS(Async_Read, async_read)
 BUFFER_FNS(Async_Write, async_write)
+BUFFER_FNS(Boundary, boundary)
 
 /*
  * FIXME: this is used only by bh_kmap, which is used only by RAID5.
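The BUFFER_FNS(Boundary, boundary) entry generates buffer_boundary(),
set_buffer_boundary() and clear_buffer_boundary(), following the
pattern of the surrounding lines.  Going by the BH_Boundary comment, a
get_block implementation can raise the bit on the last block before an
on-disk discontinuity so that a multipage reader knows to submit the
BIO it has built so far.  A sketch of the filesystem side - the
example_* helpers are hypothetical, and how fs/mpage.c consumes the
hint is not part of this diff:

static int example_get_block(struct inode *inode, sector_t iblock,
			     struct buffer_head *bh_result, int create)
{
	/* example_map_block() is a hypothetical logical-to-physical map */
	sector_t phys = example_map_block(inode, iblock);

	if (phys == 0)
		return 0;		/* hole: leave bh unmapped */

	bh_result->b_bdev = inode->i_sb->s_bdev;
	bh_result->b_blocknr = phys;
	set_buffer_mapped(bh_result);

	/*
	 * If the next logical block cannot follow this one on disk (say
	 * an indirect block sits between them), tell the caller about
	 * the discontiguity.
	 */
	if (example_block_ends_extent(inode, iblock))	/* hypothetical */
		set_buffer_boundary(bh_result);
	return 0;
}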
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0c7ec5ff6b9c..862c641c1819 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -290,6 +290,9 @@ struct address_space_operations {
 	/* Set a page dirty */
 	int (*set_page_dirty)(struct page *page);
 
+	int (*readpages)(struct address_space *mapping,
+			struct list_head *pages, unsigned nr_pages);
+
 	/*
 	 * ext3 requires that a successful prepare_write() call be followed
 	 * by a commit_write() call - they must be balanced
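On the consuming side, a readahead path can prefer the new batched
operation and keep per-page ->readpage() as the fallback.  The dispatch
below is only a sketch - the VM plumbing that actually does this is
outside this 'include' diff, and example_read_pages() is an
illustrative name:

static void example_read_pages(struct address_space *mapping,
		struct list_head *pages, unsigned nr_pages)
{
	if (mapping->a_ops->readpages != NULL) {
		/* one call; the aop may build 64k BIOs spanning pages */
		mapping->a_ops->readpages(mapping, pages, nr_pages);
		return;
	}
	/*
	 * ... otherwise add each page to the pagecache and hand it to
	 * mapping->a_ops->readpage(), one page at a time ...
	 */
}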
diff --git a/include/linux/mpage.h b/include/linux/mpage.h
new file mode 100644
index 000000000000..f41ee607416f
--- /dev/null
+++ b/include/linux/mpage.h
@@ -0,0 +1,15 @@
+/*
+ * include/linux/mpage.h
+ *
+ * Contains declarations related to preparing and submitting BIOS which contain
+ * multiple pagecache pages.
+ */
+
+/*
+ * (And no, it doesn't do the #ifdef __MPAGE_H thing, and it doesn't do
+ * nested includes. Get it right in the .c file).
+ */
+
+int mpage_readpages(struct address_space *mapping, struct list_head *pages,
+ unsigned nr_pages, get_block_t get_block);
+int mpage_readpage(struct page *page, get_block_t get_block);
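Wiring a filesystem to these entry points is the "trivial" part the
commit message mentions.  A minimal sketch modelled on the ext2
conversion (the ext2 changes themselves fall outside this
'include'-limited diffstat), assuming ext2_get_block as the
filesystem's get_block_t:

static int ext2_readpage(struct file *file, struct page *page)
{
	return mpage_readpage(page, ext2_get_block);
}

static int ext2_readpages(struct address_space *mapping,
		struct list_head *pages, unsigned nr_pages)
{
	return mpage_readpages(mapping, pages, nr_pages, ext2_get_block);
}

struct address_space_operations ext2_aops = {
	.readpage	= ext2_readpage,
	.readpages	= ext2_readpages,
	/* ... writepage, prepare_write, commit_write, ... */
};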