user/sven/linux.git - Linux Kernel

Age	Commit message (Collapse)	Author
2003-04-24	[PATCH] invalidate_device()/check_disk_change() fixes	Alexander Viro
	* bogus calls of invalidate_buffers() gone from floppy_open() * invalidate_buffers() killed. * new helper - __invalidate_device(bdev, do_sync). invalidate_device() is calling it. * fixed races between floppy_open()/floppy_open and floppy_open()/set_geometry(): a) floppy_open()/floppy_release() is done under a semaphore. That closes the races between simultaneous open() on /dev/fd0foo and /dev/fd0bar. b) pointer to struct block_device is kept as long as floppy is opened (per-drive, non-NULL when number of openers is non-zero, does not contribute to block_device refcount). c) set_geometry() grabs the same semaphore and invalidates the devices directly instead of messing with setting fake "it had changed" and calling __check_disk_change(). * __check_disk_change() killed - no remaining callers * full_check_disk_change() killed - ditto.
2003-04-20	[PATCH] make alloc_buffer_head take gfp_flags	Andrew Morton
	- alloc_buffer_head() should take the allocation mode as an arg, and not assume. - Use __GFP_NOFAIL in JBD's call to alloc_buffer_head(). - Remove all the retry code from jbd_kmalloc() - do it via page allocator controls.
2003-03-28	[PATCH] wait_on_buffer refcounting checks	Andrew Morton
	It is generally illegal to wait on an unpinned buffer - another CPU could free it up even before __wait_on_buffer() has taken a ref against the buffer. Maybe external locking rules will prevent this in specific cases, but that is really subtle and fragile as locking rules are evolved. The patch detects people calling wait_on_buffer() against an unpinned buffer and issues a diagnostic. Also remove the get_bh() from __wait_on_buffer(). It is too late.
2003-03-18	[XFS] Export end_buffer_async_write, needed for unwritten extent support in XFS.	Nathan Scott
	SGI Modid: 2.5.x-xfs:slinx:141507a
2003-02-10	[PATCH] Fix synchronous writers to wait properly for the result	Andrew Morton
	Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> points out a bug in ll_rw_block() usage. Typical usage is: mark_buffer_dirty(bh); ll_rw_block(WRITE, 1, &bh); wait_on_buffer(bh); the problem is that if the buffer was locked on entry to this code sequence (due to in-progress I/O), ll_rw_block() will not wait, and start new I/O. So this code will wait on the _old_ I/O, and will then continue execution, leaving the buffer dirty. It turns out that all callers were only writing one buffer, and they were all waiting on that writeout. So I added a new sync_dirty_buffer() function: void sync_dirty_buffer(struct buffer_head *bh) { lock_buffer(bh); if (test_clear_buffer_dirty(bh)) { get_bh(bh); bh->b_end_io = end_buffer_io_sync; submit_bh(WRITE, bh); } else { unlock_buffer(bh); } } which allowed a fair amount of code to be removed, while adding the desired data-integrity guarantees. UFS has its own wrappers around ll_rw_block() which got in the way, so this operation was open-coded in that case.
2003-02-02	[PATCH] ext3: fix scheduling storm and lockups	Andrew Morton
	There have been sporadic sightings of ext3 causing little blips of 100,000 context switches per second when under load. At the start of do_get_write_access() we have this logic: repeat: lock_buffer(jh->bh); ... unlock_buffer(jh->bh); ... if (jh->j_list == BJ_Shadow) { sleep_on_buffer(jh->bh); goto repeat; } The problem is that the unlock_buffer() will wake up anyone who is sleeping in the sleep_on_buffer(). So if task A is asleep in sleep_on_buffer() and task B now runs do_get_write_access(), task B will wake task A by accident. Task B will then sleep on the buffer and task A will loop, will run unlock_buffer() and then wake task B. This state will continue until I/O completes against the buffer and kjournal changes jh->j_list. Unless task A and task B happen to both have realtime scheduling policy - if they do then kjournald will never run. The state is never cleared and your box locks up. The fix is to not do the `goto repeat;' until the buffer has been taken of the shadow list. So we don't go and wake up the other waiter(s) until they can actually proceed to use the buffer. The patch removes the exported sleep_on_buffer() function and simply exports an existing function which provides access to a buffer_head's waitqueue pointer. Which is a better interface anyway, because it permits the use of wait_event(). This bug was introduced introduced into 2.4.20-pre5 and was faithfully ported up.
2002-12-20	[XFS] "merge" the 2.4 fsx fix for block size < page size to 2.5. This needed	Russell Cattelan
	major changes to actually fit. SGI Modid: 2.5.x-xfs:slinx:132210a
2002-12-14	[PATCH] remove PF_SYNC	Andrew Morton
	current->flags:PF_SYNC was a hack I added because I didn't want to change all ->writepage implementations. It's foul. And it means that if someone happens to run direct page reclaim within the context of (say) sys_sync, the writepage invokations from the VM will be treated as "data integrity" operations, not "memory cleansing" operations, which would cause latency. So the patch removes PF_SYNC and adds an extra arg to a_ops->writepage. It is the `writeback_control' structure which contains the full context information about why writepage was called. The initial version of this patch just passed in a bare `int sync', but the XFS team need more info so they can perform writearound from within page reclaim. The patch also adds writeback_control.for_reclaim, so writepage implementations can inspect that to work out the call context rather than peeking at current->flags:PF_MEMALLOC.
2002-11-21	[PATCH] no-buffer-head ext2 option	Andrew Morton
	Implements a new set of block address_space_operations which will never attach buffer_heads to file pagecache. These can be turned on for ext2 with the `nobh' mount option. During write-intensive testing on a 7G machine, total buffer_head storage remained below 0.3 megabytes. And those buffer_heads are against ZONE_NORMAL pagecache and will be reclaimed by ZONE_NORMAL memory pressure. This work is, of course, a special for the huge highmem machines. Possibly it obsoletes the buffer_heads_over_limit stuff (which doesn't work terribly well), but that code is simple, and will provide relief for other filesystems. It should be noted that the nobh_prepare_write() function and the PageMappedToDisk() infrastructure is what is needed to solve the problem of user data corruption when the filesystem which backs a sparse MAP_SHARED mapping runs out of space. We can use this code in filemap_nopage() to ensure that all mapped pages have space allocated on-disk. Deliver SIGBUS on ENOSPC. This will require a new address_space op, I expect.
2002-11-15	[PATCH] try to remove buffer_heads from to-be-reaped inodes	Andrew Morton
	Stephen Tweedie reports a 2.4.7 problem in which kswapd is chewing lots of CPU trying to reclaim inodes which are pinned by buffer_heads at i_dirty_buffers. This can only happen when there's memory pressure on ZONE_HIGHMEM - the 2.4 kernel runs shrink_icache_memory in that case as well. But there's no reclaim pressure on ZONE_NORMAL so the VM is never running try_to_free_buffers() against the ZONE_NORMAL buffers which are pinning the inodes. The 2.5 kernel also runs the slab shrinkers in response to ZONE_HIGHMEM pressure. This may be wrong - still thinking about that. This patch arranges for prune_icache to try to remove the inode's buffers when the inode is to be reclaimed. It also changes inode_has_buffers() and the other inode-buffer-list functions to look at inode->i_data, not inode->i_mapping. The latter was wrong.
2002-10-13	[PATCH] remove kiobufs	Andrew Morton
	This patch from Christoph Hellwig removes the kiobuf/kiovec infrastructure. This affects three subsystems: video-buf.c: This patch includes an earlier diff from Gerd which converts video-buf.c to use get_user_pages() directly. Gerd has acked this patch. LVM1: Is now even more broken. drivers/mtd/devices/blkmtd.c: blkmtd is broken by this change. I contacted Simon Evans, who said "I had done a rewrite of blkmtd anyway and just need to convert it to BIO. Feel free to break it in the 2.5 tree, it will force me to finish my code." Neither EVMS nor LVM2 use kiobufs. The only remaining breakage of which I am aware is a proprietary MPEG2 streaming module. It could use get_user_pages().
2002-10-04	[PATCH] use buffer_boundary() for writeback scheduling hints	Andrew Morton
	This is the replacement for write_mapping_buffers(). Whenever the mpage code sees that it has just written a block which had buffer_boundary() set, it assumes that the next block is dirty filesystem metadata. (This is a good assumption - that's what buffer_boundary is for). So we do a lookup in the blockdev mapping for the next block and it if is present and dirty, then schedule it for IO. So the indirect blocks in the blockdev mapping get merged with the data blocks in the file mapping. This is a bit more general than the write_mapping_buffers() approach. write_mapping_buffers() required that the fs carefully maintain the correct buffers on the mapping->private_list, and that the fs call write_mapping_buffers(), and the implementation was generally rather yuk. This version will "just work" for filesystems which implement buffer_boundary correctly. Currently this is ext2, ext3 and some not-yet-merged reiserfs patches. JFS implements buffer_boundary() but does not use ext2-like layouts - so there will be no change there. Works nicely.
2002-10-04	[PATCH] remove write_mapping_buffers()	Andrew Morton
	When the global buffer LRU was present, dirty ext2 indirect blocks were automatically scheduled for writeback alongside their data. I added write_mapping_buffers() to replace this - the idea was to schedule the indirects close in time to the scheduling of their data. It works OK for small-to-medium sized files but for large, linear writes it doesn't work: the request queue is completely full of file data and when we later come to scheduling the indirects, their neighbouring data has already been written. So writeback of really huge files tends to be a bit seeky. So. Kill it. Will fix this problem by other means.
2002-09-22	[PATCH] low-latency page reclaim	Andrew Morton
	Convert the VM to not wait on other people's dirty data. - If we find a dirty page and its queue is not congested, do some writeback. - If we find a dirty page and its queue _is_ congested then just refile the page. - If we find a PageWriteback page then just refile the page. - There is additional throttling for write(2) callers. Within generic_file_write(), record their backing queue in ->current. Within page reclaim, if this tasks encounters a page which is dirty or under writeback onthis queue, block on it. This gives some more writer throttling and reduces the page refiling frequency. It's somewhat CPU expensive - under really heavy load we only get a 50% reclaim rate in pages coming off the tail of the LRU. This can be fixed by splitting the inactive list into reclaimable and non-reclaimable lists. But the CPU load isn't too bad, and latency is much, much more important in these situations. Example: with `mem=512m', running 4 instances of `dbench 100', 2.5.34 took 35 minutes to compile a kernel. With this patch, it took three minutes, 45 seconds. I haven't done swapcache or MAP_SHARED pages yet. If there's tons of dirty swapcache or mmap data around we still stall heavily in page reclaim. That's less important. This patch also has a tweak for swapless machines: don't even bother bringing anon pages onto the inactive list if there is no swap online.
2002-09-17	[PATCH] move the buffer_head IO functions into buffer.c	Andrew Morton
	Patch from Christoph Hellwig. Move the buffer_head-based IO functions out of ll_rw_blk.c and into fs/buffer.c. So the buffer IO functions are all in buffer.c, and ll_rw_blk.c knows nothing about buffer_heads. This patch has been acked by Jens.
2002-09-09	[PATCH] buffer_head takedown for bighighmem machines	Andrew Morton
	This patch addresses the excessive consumption of ZONE_NORMAL by buffer_heads on highmem machines. The algorithms which decide which buffers to shoot down are fairly dumb, but they only cut in on machines with large highmem:lowmem ratios and the code footprint is tiny. The buffer.c change implements the buffer_head accounting - it sets the upper limit on buffer_head memory occupancy to 10% of ZONE_NORMAL. A possible side-effect of this change is that the kernel will perform more calls to get_block() to map pages to disk. This will only be observed when a file is being repeatadly overwritten - this is the only case in which the "cached get_block result" in the buffers is useful. I did quite some testing of this back in the delalloc ext2 days, and was not able to come up with a test in which the cached get_block result was measurably useful. That's for ext2, which has a fast get_block(). A desirable side effect of this patch is that the kernel will be able to cache much more blockdev pagecache in ZONE_NORMAL, so there are more ext2/3 indirect blocks in cache, so with some workloads, less I/O will be performed. In mpage_writepage(): if the number of buffer_heads is excessive then buffers are stripped from pages as they are submitted for writeback. This change is only useful for filesystems which are using the mpage code. That's ext2 and ext3-writeback and JFS. An mpage patch for reiserfs was floating about but seems to have got lost. There is no need to strip buffers for reads because the mpage code does not attach buffers for reads. These are perhaps not the most appropriate buffer_heads to toss away. Perhaps something smarter should be done to detect file overwriting, or to toss the 'oldest' buffer_heads first. In refill_inactive(): if the number of buffer_heads is excessive then strip buffers from pages as they move onto the inactive list. This change is useful for all filesystems. This approach is good because pages which are being repeatedly overwritten will remain on the active list and will retain their buffers, whereas pages which are not being overwritten will be stripped.
2002-07-18	[PATCH] Add 4G-1 file support to FAT32	Hirofumi Ogawa
	This patch changes cont_prepare_write(), in order to support a 4G-1 file for FAT32. int cont_prepare_write(struct page page, unsigned offset, - unsigned to, get_block_t get_block, unsigned long bytes) + unsigned to, get_block_t get_block, loff_t *bytes) And it fixes broken adfs/affs/fat/hfs/hpfs/qnx4 by this cont_prepare_write() change.
2002-07-14	[PATCH] direct-to-BIO for O_DIRECT	Andrew Morton
	Here's a patch which converts O_DIRECT to go direct-to-BIO, bypassing the kiovec layer. It's followed by a patch which converts the raw driver to use the O_DIRECT engine. CPU utilisation is about the same as the kiovec-based implementation. Read and write bandwidth are the same too, for 128k chunks. But with one megabyte chunks, this implementation is 20% faster at writing. I assume this is because the kiobuf-based implementation has to stop and wait for each 128k chunk, whereas this code streams the entire request, regardless of its size. This is with a single (oldish) scsi disk on aic7xxx. I'd expect the margin to widen on higher-end hardware which likes to have more requests in flight. Question is: what do we want to do with this sucker? These are the remaining users of kiovecs: drivers/md/lvm-snap.c drivers/media/video/video-buf.c drivers/mtd/devices/blkmtd.c drivers/scsi/sg.c the video and mtd drivers seems to be fairly easy to de-kiobufize. I'm aware of one proprietary driver which uses kiobufs. XFS uses kiobufs a little bit - just to map the pages. So with a bit of effort and maintainer-irritation, we can extract the kiobuf layer from the kernel.
2002-07-04	Merge home.transmeta.com:/home/torvalds/v2.5/viro	Linus Torvalds
	into home.transmeta.com:/home/torvalds/v2.5/linux
2002-07-04	[PATCH] kdev_t crapectomy	Alexander Viro
	* since the last caller of is_read_only() is gone, the function itself is removed. * destroy_buffers() is not used anymore; gone. * fsync_dev() is gone; the only user is (broken) lvm.c and first step in fixing lvm.c will consist of propagating struct block_device * anyway; at that point we'll just use fsync_bdev() in there. * prototype of bio_ioctl() removed - function doesn't exist anymore.
2002-07-04	[PATCH] per-cpu buffer_head cache	Andrew Morton
	ext2 and ext3 implement a custom LRU cache of buffer_heads - the eight most-recently-used inode bitmap buffers and the eight MRU block bitmap buffers. I don't like them, for a number of reasons: - The code is duplicated between filesystems - The functionality is unavailable to other filesystems - The LRU only applies to bitmap buffers. And not, say, indirects. - The LRUs are subtly dependent upon lock_super() for protection: without lock_super protection a bitmap could be evicted and freed while in use. And removing this dependence on lock_super() gets us one step on the way toward getting that semaphore out of the ext2 block allocator - it causes significant contention under some loads and should be a spinlock. - The LRUs pin 64 kbytes per mounted filesystem. Now, we could just delete those LRUs and rely on the VM to manage the memory. But that would introduce significant lock contention in __find_get_block - the blockdev mapping's private_lock and page_lock are heavily used. So this patch introduces a transparent per-CPU bh lru which is hidden inside __find_get_block(), __getblk() and __bread(). It is designed to shorten code paths and to reduce lock contention. It uses a seven-slot LRU. It achieves a 99% hit rate in `dbench 64'. It provides benefit to all filesystems. The next patches remove the open-coded LRUs from ext2 and ext3. Taken together, these patches are a code cleanup (300-400 lines gone), and they reduce lock contention. Anton tested these patches on the 32-way and demonstrated a throughput improvement of up to 15% on RAM-only dbench runs. See http://samba.org/~anton/linux/2.5.24/dbench/ Most of this benefit is from avoiding find_get_page() on the blockdev mapping. Because the generic LRU copes with indirect blocks as well as bitmaps.
2002-06-17	[PATCH] rename get_hash_table() to find_get_block()	Andrew Morton
	Renames the buffer_head lookup function `get_hash_table' to `find_get_block'. get_hash_table() is too generic a name. Plus it doesn't even use a hash any more.
2002-06-17	[PATCH] remove set_page_buffers() and clear_page_buffers()	Andrew Morton
	The set_page_buffers() and clear_page_buffers() macros are each used in only one place. Fold them into their callers.
2002-06-17	[PATCH] take bio.h out of highmem.h	Andrew Morton
	highmem.h includes bio.h, so just about every compilation unit in the kernel gets to process bio.h. The patch moves the BIO-related functions out of highmem.h and into bio-related headers. The nested include is removed and all files which need to include bio.h now do so.
2002-06-17	[PATCH] clean up alloc_buffer_head()	Andrew Morton
	alloc_bufer_head() does not need the additional argument - GFP_NOFS is always correct.
2002-06-17	[PATCH] direct-to-BIO I/O for swapcache pages	Andrew Morton
	This patch changes the swap I/O handling. The objectives are: - Remove swap special-casing - Stop using buffer_heads -> direct-to-BIO - Make S_ISREG swapfiles more robust. I've spent quite some time with swap. The first patches converted swap to use block_read/write_full_page(). These were discarded because they are still using buffer_heads, and a reasonable amount of otherwise unnecessary infrastructure had to be added to the swap code just to make it look like a regular fs. So this code just has a custom direct-to-BIO path for swap, which seems to be the most comfortable approach. A significant thing here is the introduction of "swap extents". A swap extent is a simple data structure which maps a range of swap pages onto a range of disk sectors. It is simply: struct swap_extent { struct list_head list; pgoff_t start_page; pgoff_t nr_pages; sector_t start_block; }; At swapon time (for an S_ISREG swapfile), each block in the file is bmapped() and the block numbers are parsed to generate the device's swap extent list. This extent list is quite compact - a 512 megabyte swapfile generates about 130 nodes in the list. That's about 4 kbytes of storage. The conversion from filesystem blocksize blocks into PAGE_SIZE blocks is performed at swapon time. At swapon time (for an S_ISBLK swapfile), we install a single swap extent which describes the entire device. The advantages of the swap extents are: 1: We never have to run bmap() (ie: read from disk) at swapout time. So S_ISREG swapfiles are now just as robust as S_ISBLK swapfiles. 2: All the differences between S_ISBLK swapfiles and S_ISREG swapfiles are handled at swapon time. During normal operation, we just don't care. Both types of swapfiles are handled the same way. 3: The extent lists always operate in PAGE_SIZE units. So the problems of going from fs blocksize to PAGE_SIZE are handled at swapon time and normal operating code doesn't need to care. 4: Because we don't have to fiddle with different blocksizes, we can go direct-to-BIO for swap_readpage() and swap_writepage(). This introduces the kernel-wide invariant "anonymous pages never have buffers attached", which cleans some things up nicely. All those block_flushpage() calls in the swap code simply go away. 5: The kernel no longer has to allocate both buffer_heads and BIOs to perform swapout. Just a BIO. 6: It permits us to perform swapcache writeout and throttling for GFP_NOFS allocations (a later patch). (Well, there is one sort of anon page which can have buffers: the pages which are cast adrift in truncate_complete_page() because do_invalidatepage() failed. But these pages are never added to swapcache, and nobody except the VM LRU has to deal with them). The swapfile parser in setup_swap_extents() will attempt to extract the largest possible number of PAGE_SIZE-sized and PAGE_SIZE-aligned chunks of disk from the S_ISREG swapfile. Any stray blocks (due to file discontiguities) are simply discarded - we never swap to those. If an S_ISREG swapfile is found to have any unmapped blocks (file holes) then the swapon attempt will fail. The extent list can be quite large (hundreds of nodes for a gigabyte S_ISREG swapfile). It needs to be consulted once for each page within swap_readpage() and swap_writepage(). Hence there is a risk that we could blow significant amounts of CPU walking that list. However I have implemented a "where we found the last block" cache, which is used as the starting point for the next search. Empirical testing indicates that this is wildly effective - the average length of the list walk in map_swap_page() is 0.3 iterations per page, with a 130-element list. It _could_ be that some workloads do start suffering long walks in that code, and perhaps a tree would be needed there. But I doubt that, and if this is happening then it means that we're seeking all over the disk for swap I/O, and the list walk is the least of our problems. rw_swap_page_nolock() now takes a page*, not a kernel virtual address. It has been renamed to rw_swap_page_sync() and it takes care of locking and unlocking the page itself. Which is all a much better interface. Support for type 0 swap has been removed. Current versions of mkwap(8) seem to never produce v0 swap unless you explicitly ask for it, so I doubt if this will affect anyone. If you _do_ have a type 0 swapfile, swapon will fail and the message version 0 swap is no longer supported. Use mkswap -v1 /dev/sdb3 is printed. We can remove that code for real later on. Really, all that swapfile header parsing should be pushed out to userspace. This code always uses single-page BIOs for swapin and swapout. I have an additional patch which converts swap to use mpage_writepages(), so we swap out in 16-page BIOs. It works fine, but I don't intend to submit that. There just doesn't seem to be any significant advantage to it. I can't see anything in sys_swapon()/sys_swapoff() which needs the lock_kernel() calls, so I deleted them. If you ftruncate an S_ISREG swapfile to a shorter size while it is in use, subsequent swapout will destroy the filesystem. It was always thus, but it is much, much easier to do now. Not really a kernel problem, but swapon(8) should not be allowing the kernel to use swapfiles which are modifiable by unprivileged users.
2002-06-02	[PATCH] rename flushpage to invalidatepage	Andrew Morton
	Fixes a pet peeve: the identifier "flushpage" implies "flush the page to disk". Which is very much not what the flushpage functions actually do. The patch renames block_flushpage and the flushpage address_space_operation to "invalidatepage". It also fixes a buglet in invalidate_this_page2(), which was calling block_flushpage() directly - it needs to call do_flushpage() (now do_invalidatepage()) so that the filesystem's ->flushpage (now ->invalidatepage) a_op gets a chance to relinquish any interest which it has in the page's buffers.
2002-06-02	[PATCH] rename block_symlink() to page_symlink()	Andrew Morton
	block_symlink() is not a "block" function at all. It is a pure pagecache/address_space function. Seeing driverfs calling it was the last straw. The patch renames it to `page_symlink()' and moves it into fs/namei.c
2002-05-27	[PATCH] move BH_JBD out of buffer_head.h	Andrew Morton
	For historical reasons, ext3 has a private BH state bit which has global scope. This patch moves it inside ext3.
2002-05-27	[PATCH] direct-to-BIO writeback	Andrew Morton
	Multipage BIO writeout from the pagecache. It's pretty much the same as multipage reads. It falls back to buffers if things got complex. The write case is a little more complex because it handles pages which have buffers and pages which do not. If the page didn't have buffers this code does not add them.
2002-05-27	[PATCH] direct-to-BIO readahead	Andrew Morton
	Implements BIO-based multipage reads into the pagecache, and turns this on for ext2. CPU load for `cat large_file > /dev/null' is reduced by approximately 15%. Similar reductions for tiobench with a single thread. (Earlier claims of 25% were exaggerated - they were measured with slab debug enabled. But 15% isn't bad for a load which is dominated by copy_*_user costs). With 2, 4 and 8 tiobench threads, throughput is increased as well, which was unexpected. It's due to request queue weirdness. (Generally the request queueing is doing bad things under certain workloads - that's a separate issue.) BIOs of up to 64 kbytes are assembled and submitted for readahead and for single-page reads. So the work involved in reading 32 pages has gone from: - allocate and attach 32 buffer_heads - submit 32 buffer_heads - allocate 32 bios - submit 32 bios to: - allocate 2 bios - submit 2 bios These pages never have buffers attached. Buffers will be attached later if the application writes to these pages (file overwrite). The first version of this code (in the "delayed allocation" patches) tries to handle everything - bios which start mid-page, bios which end mid-page and pages which are covered by multiple bios. It is very complex code and in fact appears to be incorrect: out-of-order BIO completion could cause a page to come unlocked at the wrong time. This implementation is much simpler: if things get complex, it just falls back to the buffer-based block_read_full_page(), which isn't going away, and which understands all that complexity. There's no point in doing this in two places. This code will bypass the buffer layer for - fully-mapped pages which are on-disk contiguous. - fully unmapoped pages (holes) - partially unmapped pages, where the unmappedness is at the end of the page (end-of-file). and everything else falls back to buffers. This means that with blocksize == PAGE_CACHE_SIZE, 100% of pages are handed direct to BIO. With a heavy 10-minute dbench run on 4k PAGE_CACHE_SIZE and 1k blocks, 95% of pages were handed direct to BIO. Almost all of the other 5% were passed to block_read_full_page() because they were already partially uptodate from an earlier sub-page write(). This ratio will fall if PAGE_CACHE_SIZE/blocksize is greater than four. But if that's the case, CPU efficiency is far from the main concern - there are significant seek and bandwidth problems just at 4 blocks per page. This code will stress out the block layer somewhat - RAID0 doesn't like multipage BIOs, and there are probably others. RAID0 seems to struggle along - readahead fails but read falls back to single-page reads, which succeed. Such problems may be worked around by setting MPAGE_BIO_MAX_SIZE to PAGE_CACHE_SIZE in fs/mpage.c. It is trivial to enable multipage reads for many other filesystems. We can do that after completion of external testing of ext2.
2002-05-23	Fix up header file	Linus Torvalds

2002-05-22	[PATCH] include buffer_head.h in actual users instead of fs.h (2/10)	Christoph Hellwig
	Declare buffer_init() extern in init/main.c like the other _init so that it doesn't have to include buffer_head.h. Remove buffer_init() there.
2002-05-22	[PATCH] include buffer_head.h in actual users instead of fs.h (1/10)	Christoph Hellwig
	Now that fs.h grow due to the lock.h removal let's reduce it's overhead again: Instead of penalizing ever user of fs.h with the overhead of the buffer head interface let it's users include it directly. This also shows nicely which parts of the core kernel still depend on the buffer head interface, and allows that to be cleaned up properly. This is the first of ten patches and adds the includes needed by buffer_head.h to it and fixes it's inclusion guard.
2002-05-19	[PATCH] improved I/O scheduling for indirect blocks	Andrew Morton
	Fixes a performance problem with many-small-file writeout. At present, files are written out via their mapping and their indirect blocks are written out via the blockdev mapping. As we know that indirects are disk-adjacent to the data it is better to start I/O against the indirects at the same time as the data. The delalloc pathes have code in ext2_writepage() which recognises when the target page->index was at an indirect boundary and does an explicit hunt-and-write against the neighbouring indirect block. Which is ideal. (Unless the file was dirtied seekily and the page which is next to the indirect was not dirtied). This patch does it the other way: when we start writeback against a mapping, also start writeback against any dirty buffers which are attached to mapping->private_list. Let the elevator take care of the rest. The patch makes a number of tuning changes to the writeback path in fs-writeback.c. This is very fiddly code: getting the throughput tuned, getting the data-integrity "sync" operations right, avoiding most of the livelock opportunities, getting the `kupdate' function working efficiently, keeping it all least somewhat comprehensible. An important intent here is to ensure that metadata blocks for inodes are marked dirty before writeback starts working the blockdev mapping, so all the inode blocks are efficiently written back. The patch removes try_to_writeback_unused_inodes(), which became unreferenced in vm-writeback.patch. The patch has a tweak in ext2_put_inode() to prevent ext2 from incorrectly droppping its preallocation window in response to a random iput(). Generally, many-small-file writeout is a lot faster than 2.5.7 (which is linux-before-I-futzed-with-it). The workload which was optimised was tar xfz /nfs/mountpoint/linux-2.4.18.tar.gz ; sync on mem=128M and mem=2048M. With these patches, 2.5.15 is completing in about 2/3 of the time of 2.5.7. But it is only a shade faster than 2.4.19-pre7. Why is 2.5.7 so much slower than 2.4.19? Not sure yet. Heavy dbench loads (dbench 32 on mem=128M) are slightly faster than 2.5.7 and significantly slower than 2.4.19. It appears that the cause is poor read throughput at the later stages of the run. Because there are background writeback threads operating at the same time. The 2.4.19-pre8 write scheduling manages to stop writeback during the latter stages of the dbench run in a way which I haven't been able to sanely emulate yet. It may not be desirable to do this anyway - it's optimising for the case where the files are about to be deleted. But it would be good to find a way of "pausing" the writeback for a few seconds to allow readers to get an interval of decent bandwidth. tiobench throughput is basically the same across all recent kernels. CPU load on writes is down maybe 30% in 2.5.15.
2002-05-19	[PATCH] larger b_size, and misc fixlets	Andrew Morton
	Miscellany. - make the printk in buffer_io_error() sector_t-aware. - Some buffer.c cleanups from AntonA: remove a couple of !uptodate checks, and set a new buffer's b_blocknr to -1 in a more sensible place. - Make buffer_head.b_size a 32-bit quantity. Needed for 64k pagesize on ia64. Does not increase sizeof(struct buffer_head).
2002-05-19	[PATCH] i_dirty_buffers locking fix	Andrew Morton
	This fixes a race between try_to_free_buffers' call to __remove_inode_queue() and other users of b_inode_buffers (fsync_inode_buffers and mark_buffer_dirty_inode()). They are presently taking different locks. The patch relocates and redefines and clarifies(?) the role of inode.i_dirty_buffers. The 2.4 definition of i_dirty_buffers is "a list of random buffers which is protected by a kernel-wide lock". This definition needs to be narrowed in the 2.5 context. It is now "a list of buffers from a different mapping, protected by a lock within that mapping". This list of buffers is specifically for fsync(). As this is a "data plane" operation, all the structures have been moved out of the inode and into the address_space. So address_space now has: list_head private_list; A list, available to the address_space for any purpose. If that address_space chooses to use the helper functions mark_buffer_dirty_inode and sync_mapping_buffers() then this list will contain buffer_heads, attached via buffer_head.b_assoc_buffers. If the address_space does not call those helper functions then the list is free for other usage. The only requirement is that the list be list_empty() at destroy_inode() time. At least, this is the objective. At present, generic_file_write() will call generic_osync_inode(), which expects that list to contain buffer_heads. So private_list isn't useful for anything else yet. spinlock_t private_lock; A spinlock, available to the address_space. If the address_space is using try_to_free_buffers(), mark_inode_dirty_buffers() and fsync_inode_buffers() then this lock is used to protect the private_list of other mappings which have listed buffers from this mapping onto themselves. That is: for buffer_heads, mapping_A->private_lock does not protect mapping_A->private_list! It protects the b_assoc_buffers list from buffers which are backed by mapping_A and it protects mapping_B->private_list, mapping_C->private_list, ... So what we have here is a cross-mapping association. S_ISREG mappings maintain a list of buffers from the blockdev's address_space which they need to know about for a successful fsync(). The locking follows the buffers: the lock in in the blockdev's mapping, not in the S_ISREG file's mapping. For address_spaces which use try_to_free_buffers, private_lock is also (and quite unrelatedly) used for protection of the buffer ring at page->private. Exclusion between try_to_free_buffers(), __get_hash_table() and __set_page_dirty_buffers(). This is in fact its major use. address_space *assoc_mapping Sigh. This is the address of the mapping which backs the buffers which are attached to private_list. It's here so that generic_osync_inode() can locate the lock which protects this mapping's private_list. Will probably go away. A consequence of all the above is that: a) All the buffers at a mapping_A's ->private_list must come from the same mapping, mapping_B. There is no requirement that mapping_B be a blockdev mapping, but that's how it's used. There is a BUG() check in mark_buffer_dirty_inode() for this. b) blockdev mappings never have any buffers on ->private_list. It just never happens, and doesn't make a lot of sense. reiserfs is using b_inode_buffers for attaching dependent buffers to its journal and that caused a few problems. Fixed in reiserfs_releasepage.patch
2002-05-05	[PATCH] Fix concurrent writepage and readpage	Andrew Morton
	Pages under writeback are not locked. So it is possible (and quite legal) for a page to be under readpage() while it is still under writeback. For a partially uptodate page with blocksize < PAGE_CACHE_SIZE. When this happens, the read and write I/O completion handlers get confused over the shared BH_Async usage and the page ends up not getting PG_writeback cleared. Truncate gets stuck in D state. The patch separates the read and write I/O completion state. It also shuffles the buffer fields around. Putting the commonly-accessed b_state at offset zero shrinks the kernel by a few hundred bytes because it can be accessed with indirect addressing, not indirect+indexed.
2002-04-30	[PATCH] (5/6) blksize_size[] removal	Alexander Viro
	- kill bread()/getblk()/get_hash_table() (kdev_t-using wrappers; struct block_device * counterparts are obviously still alive).
2002-04-29	[PATCH] cleanup sync_buffers()	Andrew Morton
	Renames sync_buffers() to sync_blockdev() and removes its (never used) second argument. Removes fsync_no_super() in favour of direct calls to sync_blockdev().
2002-04-29	[PATCH] page writeback locking update	Andrew Morton
	- Fixes a performance problem - callers of prepare_write/commit_write, etc are locking pages, which synchronises them behind writeback, which also locks these pages. Significant slowdowns for some workloads. - So pages are no longer locked while under writeout. Introduce a new PG_writeback and associated infrastructure to support this design change. - Pages which are under read I/O still use PageLocked. Pages which are under write I/O have PageWriteback() true. I considered creating Page_IO instead of PageWriteback, and marking both readin and writeout pages as PageIO(). So pages are unlocked during both read and write. There just doesn't seem a need to do this - nobody ever needs unblocking access to a page which is under read I/O. - Pages under swapout (brw_page) are PageLocked, not PageWriteback. So their treatment is unchangeded. It's not obvious that pages which are under swapout actually need the more asynchronous behaviour of PageWriteback. I was setting the swapout pages PageWriteback and unlocking them prior to submitting the buffers in brw_page(). This led to deadlocks on the exit_mmap->zap_page_range->free_swap_and_cache path. These functions call block_flushpage under spinlock. If the page is unlocked but has locked buffers, block_flushpage->discard_buffer() sleeps. Under spinlock. So that will need fixing if for some reason we want swapout to use PageWriteback. Kernel has called block_flushpage() under spinlock for a long time. It is assuming that a locked page will never have locked buffers. This appears to be true, but it's ugly. - Adds new function wait_on_page_writeback(). Renames wait_on_page() to wait_on_page_locked() to remind people that they need to call the appropriate one. - Renames filemap_fdatasync() to filemap_fdatawrite(). It's more accurate - "sync" implies, if anything, writeout and wait. (fsync, msync) Or writeout. it's not clear. - Subtly changes the filemap_fdatawrite() internals - this function used to do a lock_page() - it waited for any other user of the page to let go before submitting new I/O against a page. It has been changed to simply skip over any pages which are currently under writeback. This is the right thing to do for memory-cleansing reasons. But it's the wrong thing to do for data consistency operations (eg, fsync()). For those operations we must ensure that all data which was dirty at the time of the system call are tight on disk before the call returns. So all places which care about this have been converted to do: filemap_fdatawait(mapping); /* Wait for current writeback / filemap_fdatawrite(mapping); / Write all dirty pages / filemap_fdatawait(mapping); / Wait for I/O to complete / - Fixes a truncate_inode_pages problem - truncate currently will block when it hits a locked page, so it ends up getting into lockstep behind writeback and all of the file is pointlessly written back. One fix for this is for truncate to simply walk the page list in the opposite direction from writeback. I chose to use a separate cleansing pass. It is more CPU-intensive, but it is surer and clearer. This is because there is no reason why the per-address_space ->vm_writeback and ->writeback_mapping functions have* to perform writeout in ->dirty_pages order. They may choose to do something totally different. (set_page_dirty() is an a_op now, so address_spaces could almost privatise the whole dirty-page handling thing. Except truncate_inode_pages and invalidate_inode_pages assume that the pages are on the address_space lists. hmm. So making truncate_inode_pages and invalidate_inode_pages a_ops would make some sense).
2002-04-29	[PATCH] hashed b_wait	Andrew Morton
	Implements hashed waitqueues for buffer_heads. Drops twelve bytes from struct buffer_head.
2002-04-29	[PATCH] cleanup of bh->flags	Andrew Morton
	Moves all buffer_head-related stuff out of linux/fs.h and into linux/buffer_head.h. buffer_head.h is currently included at the very end of fs.h. So it is possible to include buffer_head directly from all .c files and remove this nested include. Also rationalises all the set_buffer_foo() and mark_buffer_bar() functions. We have: set_buffer_foo(bh) clear_buffer_foo(bh) buffer_foo(bh) and, in some cases, where needed: test_set_buffer_foo(bh) test_clear_buffer_foo(bh) And that's it. BUFFER_FNS() and TAS_BUFFER_FNS() macros generate all the above real inline functions. Normally not a big fan of cpp abuse, but in this case it fits. These function-generating macros are available to filesystems to expand their own b_state functions. JBD uses this in one case.