summaryrefslogtreecommitdiff
path: root/include/linux/buffer_head.h
AgeCommit message (Collapse)Author
2006-03-26[PATCH] pass b_size to ->get_block()Badari Pulavarty
Pass amount of disk needs to be mapped to get_block(). This way one can modify the fs ->get_block() functions to map multiple blocks at the same time. [akpm@osdl.org: performance tweak] [akpm@osdl.org: remove unneeded assignments] Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] change buffer_head.b_size to size_tBadari Pulavarty
Increase the size of the buffer_head b_size field (only) for 64 bit platforms. Update some old and moldy comments in and around the structure as well. The b_size increase allows us to perform larger mappings and allocations for large I/O requests from userspace, which tie in with other changes allowing the get_block_t() interface to map multiple blocks at once. Signed-off-by: Nathan Scott <nathans@sgi.com> Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] Make address_space_operations->invalidatepage return voidNeilBrown
The return value of this function is never used, so let's be honest and declare it as void. Some places where invalidatepage returned 0, I have inserted comments suggesting a BUG_ON. [akpm@osdl.org: JBD BUG fix] [akpm@osdl.org: rework for git-nfs] [akpm@osdl.org: don't go BUG in block_invalidate_page()] Signed-off-by: Neil Brown <neilb@suse.de> Acked-by: Dave Kleikamp <shaggy@austin.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] Make address_space_operations->sync_page return voidNeilBrown
The only user ignores the return value, and the only instanace (block_sync_page) always returns 0... Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] fat: support a truncate() for expanding size (generic_cont_expand)OGAWA Hirofumi
This patch changes generic_cont_expand(), in order to share the code with fatfs. - Use vmtruncate() if ->prepare_write() returns a error. Even if ->prepare_write() returns an error, it may already have added some blocks. So, this truncates blocks outside of ->i_size by vmtruncate(). - Add generic_cont_expand_simple(). The generic_cont_expand_simple() assumes that ->prepare_write() can handle the block boundary. With this, we don't need to care the extra byte. And for expanding a file size by truncate(), fatfs uses the added generic_cont_expand_simple(). Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-30[PATCH] ext3: Fix unmapped buffers in transaction's listsJan Kara
Fix the problem (BUG 4964) with unmapped buffers in transaction's t_sync_data list. The problem is we need to call filesystem's own invalidatepage() from block_write_full_page(). block_write_full_page() must call filesystem's invalidatepage(). Otherwise following nasty race can happen: proc 1 proc 2 ------ ------ - write some new data to 'offset' => bh gets to the transactions data list - starts truncate => i_size set to new size - mpage_writepages() - ext3_ordered_writepage() to 'offset' - block_write_full_page() - page->index > end_index+1 - block_invalidatepage() - discard_buffer() - clear_buffer_mapped() - commit triggers and finds unmapped buffer - BOOM! Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29[PATCH] mm: split page table lockHugh Dickins
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with a many-threaded application which concurrently initializes different parts of a large anonymous area. This patch corrects that, by using a separate spinlock per page table page, to guard the page table entries in that page, instead of using the mm's single page_table_lock. (But even then, page_table_lock is still used to guard page table allocation, and anon_vma allocation.) In this implementation, the spinlock is tucked inside the struct page of the page table page: with a BUILD_BUG_ON in case it overflows - which it would in the case of 32-bit PA-RISC with spinlock debugging enabled. Splitting the lock is not quite for free: another cacheline access. Ideally, I suppose we would use split ptlock only for multi-threaded processes on multi-cpu machines; but deciding that dynamically would have its own costs. So for now enable it by config, at some number of cpus - since the Kconfig language doesn't support inequalities, let preprocessor compare that with NR_CPUS. But I don't think it's worth being user-configurable: for good testing of both split and unsplit configs, split now at 4 cpus, and perhaps change that to 8 later. There is a benefit even for singly threaded processes: kswapd can be attacking one part of the mm while another part is busy faulting. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-28[PATCH] gfp_t: fs/*Al Viro
- ->releasepage() annotated (s/int/gfp_t), instances updated - missing gfp_t in fs/* added - fixed misannotation from the original sweep caught by bitwise checks: XFS used __nocast both for gfp_t and for flags used by XFS allocator. The latter left with unsigned int __nocast; we might want to add a different type for those but for now let's leave them alone. That, BTW, is a case when __nocast use had been actively confusing - it had been used in the same code for two different and similar types, with no way to catch misuses. Switch of gfp_t to bitwise had caught that immediately... One tricky bit is left alone to be dealt with later - mapping->flags is a mix of gfp_t and error indications. Left alone for now. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-08[PATCH] gfp flags annotations - part 1Al Viro
- added typedef unsigned int __nocast gfp_t; - replaced __nocast uses for gfp flags with gfp_t - it gives exactly the same warnings as far as sparse is concerned, doesn't change generated code (from gcc point of view we replaced unsigned int with typedef) and documents what's going on far better. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07[PATCH] page_uptodate locking scalabilityNick Piggin
Use a bit spin lock in the first buffer of the page to synchronise asynch IO buffer completions, instead of the global page_uptodate_lock, which is showing some scalabilty problems. Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-03-28Mark "gfp" masks as "unsigned int" and use __nocast to find violations.Linus Torvalds
This makes it hard(er) to mix argument orders by mistake for things like kmalloc() and friends, since silent integer promotion is now caught by sparse.
2005-03-07[PATCH] Add nobh_writepage() supportBadari Pulavarty
Add nobh_wripage() support for the filesystems which uses nobh_prepare_write/nobh_commit_write(). Idea here is to reduce unnecessary bufferhead creation/attachment to the page through pageout()->block_write_full_page(). nobh_wripage() tries to operate by directly creating bios, but it falls back to __block_write_full_page() if it can't make progress. Note that this is not really generic routine and can't be used for filesystems which uses page->Private for anything other than buffer heads. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-11-01[PATCH] fs/buffer.c exports for NTFSAnton Altaparmakov
I renamed the functions to more descriptive names: create_buffers -> alloc_page_buffers __set_page_buffers -> attach_page_buffers And I added a EXPORT_SYMBOL_GPL for alloc_page_buffers and made attach_page_buffers static inline and moved it to <linux/buffer_head.h>. Signed-off-by: Anton Altaparmakov <aia21@cantab.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-27[PATCH] make buffer head argument of buffer_##name "const"Werner Almesberger
Allow the buffer_foo() predicates to take a (const struct buffer_head *). I've checked that the argument of test_bit is indeed "const" on all architectures. Signed-off-by: Werner Almesberger <werner@almesberger.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-10-18[PATCH] jbd wakeup fixAndrew Morton
Processes can sleep in do_get_write_access(), waiting for buffers to be removed from the BJ_Shadow state. We did this by doing a wake_up_buffer() in the commit path and sleeping on the buffer in do_get_write_access(). With the filtered bit-level wakeup code this doesn't work properly any more - the wake_up_buffer() accidentally wakes up tasks which are sleeping in lock_buffer() as well. Those tasks now implicitly assume that the buffer came unlocked. Net effect: Bogus I/O errors when reading journal blocks, because the buffer isn't up to date yet. Hence the recently spate of journal_bmap() failure reports. The patch creates a new jbd-private BH flag purely for this wakeup function. So a wake_up_bit(..., BH_Unshadow) doesn't wake up someone who is waiting for a wake_up_bit(BH_Lock). JBD was the only user of wake_up_buffer(), so remove it altogether. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-09-07[PATCH] Missing static in buffer.cChristoph Hellwig
The exports were for reiserfs in 2.4, but reiserfs doesn't need them anymore. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-26[PATCH] Add a few might_sleep() checksIngo Molnar
Add a whole bunch more might_sleep() checks. We also enable might_sleep() checking in copy_*_user(). This was non-trivial because of the "copy_*_user() in atomic regions" trick would generate false positives. Fix that up by adding a new __copy_*_user_inatomic(), which avoids the might_sleep() check. Only i386 is supported in this patch. With: Arjan van de Ven <arjanv@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-23[PATCH] reduce size of struct buffer_head on 64bitAnton Blanchard
Reduce size of buffer_head from 96 to 88 bytes on 64bit architectures by putting b_count and b_size together. b_count will still be in the first 16 bytes on 32bit architectures, so 16 byte cacheline machines shouldnt be affected. With this change the number of objects per 4kB slab goes up from 40 to 44 on ppc64. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22[PATCH] Concurrent O_SYNC write supportAndrew Morton
In databases it is common to have multiple threads or processes performing O_SYNC writes against different parts of the same file. Our performance at this is poor, because each writer blocks access to the file by waiting on I/O completion while holding i_sem: everything is serialised. The patch improves things by moving the writing and waiting outside i_sem. So other threads can get in and submit their I/O and permit the disk scheduler to optimise the IO patterns better. Also, the O_SYNC writer only writes and waits on the pages which he wrote, rather than writing and waiting on all dirty pages in the file. The reason we haven't been able to do this before is that the required walk of the address_space page lists is easily livelockable without the i_sem serialisation. But in this patch we perform the waiting via a radix-tree walk of the affected pages. This cannot be livelocked. The sync of the inode's metadata is still performed inside i_sem. This is because it is list-based and is hence still livelockable. However it is usually the case that databases are overwriting existing file blocks and there will be no dirty buffers attached to the address_space anyway. The code is careful to ensure that the IO for the pages and the IO for the metadata are nonblockingly scheduled at the same time. This is am improvemtn over the current code, which will issue two separate write-and-wait cycles: one for metadata, one for pages. Note from Suparna: Reworked to use the tagged radix-tree based writeback infrastructure. Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22[PATCH] add BH_Eopnotsupp for testing async barrier failuresChris Mason
In order for filesystems to detect asynchronous ordered write failures for buffers sent via submit_bh, they need a bit they can test for in the buffer head. This adds BH_Eopnotsupp and the related buffer operations end_buffer_write_sync is changed to avoid a printk for BH_Eoptnotsupp related failures, since the FS is responsible for a retry. sync_dirty_buffer is changed to test for BH_Eopnotsupp and return -EOPNOTSUPP to the caller Some of this came from Jens Axboe Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22[PATCH] make sync_dirty_buffer() return something usefulAndrew Morton
Make sync_dirty_buffer() return the result of its syncing. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-08-22[PATCH] disk barriers: coreJens Axboe
IDE disk barrier core. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2004-05-14[PATCH] filtered wakeups: apply to buffer_head functionsAndrew Morton
From: William Lee Irwin III <wli@holomorphy.com> This patch implements wake-one semantics for buffer_head wakeups in a single step. The buffer_head being waited on is passed to the waiter's wakeup function by the waker, and the wakeup function compares that to the a pointer stored in its on-stack structure and checking the readiness of the bit there also. Wake-one semantics are achieved by using WQ_FLAG_EXCLUSIVE in the codepaths waiting to acquire the bit for mutual exclusion.
2004-04-20[PATCH] lockfs - vfs bitsAndrew Morton
From: Christoph Hellwig <hch@lst.de> These are the generic lockfs bits. Basically it takes the XFS freezing statemachine into the VFS. It's all behind the kernel-doc documented freeze_bdev and thaw_bdev interfaces. Based on an older patch from Chris Mason.
2004-04-17[PATCH] remove buffer_error()Andrew Morton
From: Jeff Garzik <jgarzik@pobox.com> It was debug code, no longer required.
2004-04-17[PATCH] kill submit_{bh,bio} return valueAndrew Morton
From: Jeff Garzik <jgarzik@pobox.com> Nobody ever checks the return value of submit_bh(), and submit_bh() is the only caller that checks the submit_bio() return value. This changes the kernel I/O submission path -- a fast path -- so this cleanup is also a microoptimization.
2004-01-19[PATCH] if ... BUG() -> BUG_ON()Andrew Morton
From: Adrian Bunk <bunk@fs.tum.de> four months ago, Rolf Eike Beer <eike-kernel@sf-tec.de> sent a patch against 2.6.0-test5-bk1 that converted several if ... BUG() to BUG_ON() This might in some cases result in slightly faster code because BUG_ON() uses unlikely().
2004-01-18[PATCH] bdev: generic_osync_inode() conversionAndrew Morton
From: viro@parcelfarce.linux.theplanet.co.uk <viro@parcelfarce.linux.theplanet.co.uk> generic_osync_inode() got an extra argument - mapping and doesn't calculate inode->i_mapping anymore. Callers updated and switched to use of ->f_mapping.
2003-08-18[PATCH] async write errors: report truncate and io errors onAndrew Morton
From: Oliver Xymoron <oxymoron@waste.org> These patches add the infrastructure for reporting asynchronous write errors to block devices to userspace. Error which are detected due to pdflush or VM writeout are reported at the next fsync, fdatasync, or msync on the given file, and on close if the error occurs in time. We do this by propagating any errors into page->mapping->error when they are detected. In fsync(), msync(), fdatasync() and close() we return that error and zero it out. The Open Group say close() _may_ fail if an I/O error occurred while reading from or writing to the file system. Well, in this implementation close() can return -EIO or -ENOSPC. And in that case it will succeed, not fail - perhaps that is what they meant. There are three patches in this series and testing has only been performed with all three applied.
2003-07-03Add an asynchronous buffer read-ahead facility. NobodyLinus Torvalds
uses it for now, but I needed it for some tuning tests, and it is potentially useful for others.
2003-04-24[PATCH] invalidate_device()/check_disk_change() fixesAlexander Viro
* bogus calls of invalidate_buffers() gone from floppy_open() * invalidate_buffers() killed. * new helper - __invalidate_device(bdev, do_sync). invalidate_device() is calling it. * fixed races between floppy_open()/floppy_open and floppy_open()/set_geometry(): a) floppy_open()/floppy_release() is done under a semaphore. That closes the races between simultaneous open() on /dev/fd0foo and /dev/fd0bar. b) pointer to struct block_device is kept as long as floppy is opened (per-drive, non-NULL when number of openers is non-zero, does not contribute to block_device refcount). c) set_geometry() grabs the same semaphore and invalidates the devices directly instead of messing with setting fake "it had changed" and calling __check_disk_change(). * __check_disk_change() killed - no remaining callers * full_check_disk_change() killed - ditto.
2003-04-20[PATCH] make alloc_buffer_head take gfp_flagsAndrew Morton
- alloc_buffer_head() should take the allocation mode as an arg, and not assume. - Use __GFP_NOFAIL in JBD's call to alloc_buffer_head(). - Remove all the retry code from jbd_kmalloc() - do it via page allocator controls.
2003-03-28[PATCH] wait_on_buffer refcounting checksAndrew Morton
It is generally illegal to wait on an unpinned buffer - another CPU could free it up even before __wait_on_buffer() has taken a ref against the buffer. Maybe external locking rules will prevent this in specific cases, but that is really subtle and fragile as locking rules are evolved. The patch detects people calling wait_on_buffer() against an unpinned buffer and issues a diagnostic. Also remove the get_bh() from __wait_on_buffer(). It is too late.
2003-03-18[XFS] Export end_buffer_async_write, needed for unwritten extent support in XFS.Nathan Scott
SGI Modid: 2.5.x-xfs:slinx:141507a
2003-02-10[PATCH] Fix synchronous writers to wait properly for the resultAndrew Morton
Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> points out a bug in ll_rw_block() usage. Typical usage is: mark_buffer_dirty(bh); ll_rw_block(WRITE, 1, &bh); wait_on_buffer(bh); the problem is that if the buffer was locked on entry to this code sequence (due to in-progress I/O), ll_rw_block() will not wait, and start new I/O. So this code will wait on the _old_ I/O, and will then continue execution, leaving the buffer dirty. It turns out that all callers were only writing one buffer, and they were all waiting on that writeout. So I added a new sync_dirty_buffer() function: void sync_dirty_buffer(struct buffer_head *bh) { lock_buffer(bh); if (test_clear_buffer_dirty(bh)) { get_bh(bh); bh->b_end_io = end_buffer_io_sync; submit_bh(WRITE, bh); } else { unlock_buffer(bh); } } which allowed a fair amount of code to be removed, while adding the desired data-integrity guarantees. UFS has its own wrappers around ll_rw_block() which got in the way, so this operation was open-coded in that case.
2003-02-02[PATCH] ext3: fix scheduling storm and lockupsAndrew Morton
There have been sporadic sightings of ext3 causing little blips of 100,000 context switches per second when under load. At the start of do_get_write_access() we have this logic: repeat: lock_buffer(jh->bh); ... unlock_buffer(jh->bh); ... if (jh->j_list == BJ_Shadow) { sleep_on_buffer(jh->bh); goto repeat; } The problem is that the unlock_buffer() will wake up anyone who is sleeping in the sleep_on_buffer(). So if task A is asleep in sleep_on_buffer() and task B now runs do_get_write_access(), task B will wake task A by accident. Task B will then sleep on the buffer and task A will loop, will run unlock_buffer() and then wake task B. This state will continue until I/O completes against the buffer and kjournal changes jh->j_list. Unless task A and task B happen to both have realtime scheduling policy - if they do then kjournald will never run. The state is never cleared and your box locks up. The fix is to not do the `goto repeat;' until the buffer has been taken of the shadow list. So we don't go and wake up the other waiter(s) until they can actually proceed to use the buffer. The patch removes the exported sleep_on_buffer() function and simply exports an existing function which provides access to a buffer_head's waitqueue pointer. Which is a better interface anyway, because it permits the use of wait_event(). This bug was introduced introduced into 2.4.20-pre5 and was faithfully ported up.
2002-12-20[XFS] "merge" the 2.4 fsx fix for block size < page size to 2.5. This neededRussell Cattelan
major changes to actually fit. SGI Modid: 2.5.x-xfs:slinx:132210a
2002-12-14[PATCH] remove PF_SYNCAndrew Morton
current->flags:PF_SYNC was a hack I added because I didn't want to change all ->writepage implementations. It's foul. And it means that if someone happens to run direct page reclaim within the context of (say) sys_sync, the writepage invokations from the VM will be treated as "data integrity" operations, not "memory cleansing" operations, which would cause latency. So the patch removes PF_SYNC and adds an extra arg to a_ops->writepage. It is the `writeback_control' structure which contains the full context information about why writepage was called. The initial version of this patch just passed in a bare `int sync', but the XFS team need more info so they can perform writearound from within page reclaim. The patch also adds writeback_control.for_reclaim, so writepage implementations can inspect that to work out the call context rather than peeking at current->flags:PF_MEMALLOC.
2002-11-21[PATCH] no-buffer-head ext2 optionAndrew Morton
Implements a new set of block address_space_operations which will never attach buffer_heads to file pagecache. These can be turned on for ext2 with the `nobh' mount option. During write-intensive testing on a 7G machine, total buffer_head storage remained below 0.3 megabytes. And those buffer_heads are against ZONE_NORMAL pagecache and will be reclaimed by ZONE_NORMAL memory pressure. This work is, of course, a special for the huge highmem machines. Possibly it obsoletes the buffer_heads_over_limit stuff (which doesn't work terribly well), but that code is simple, and will provide relief for other filesystems. It should be noted that the nobh_prepare_write() function and the PageMappedToDisk() infrastructure is what is needed to solve the problem of user data corruption when the filesystem which backs a sparse MAP_SHARED mapping runs out of space. We can use this code in filemap_nopage() to ensure that all mapped pages have space allocated on-disk. Deliver SIGBUS on ENOSPC. This will require a new address_space op, I expect.
2002-11-15[PATCH] try to remove buffer_heads from to-be-reaped inodesAndrew Morton
Stephen Tweedie reports a 2.4.7 problem in which kswapd is chewing lots of CPU trying to reclaim inodes which are pinned by buffer_heads at i_dirty_buffers. This can only happen when there's memory pressure on ZONE_HIGHMEM - the 2.4 kernel runs shrink_icache_memory in that case as well. But there's no reclaim pressure on ZONE_NORMAL so the VM is never running try_to_free_buffers() against the ZONE_NORMAL buffers which are pinning the inodes. The 2.5 kernel also runs the slab shrinkers in response to ZONE_HIGHMEM pressure. This may be wrong - still thinking about that. This patch arranges for prune_icache to try to remove the inode's buffers when the inode is to be reclaimed. It also changes inode_has_buffers() and the other inode-buffer-list functions to look at inode->i_data, not inode->i_mapping. The latter was wrong.
2002-10-13[PATCH] remove kiobufsAndrew Morton
This patch from Christoph Hellwig removes the kiobuf/kiovec infrastructure. This affects three subsystems: video-buf.c: This patch includes an earlier diff from Gerd which converts video-buf.c to use get_user_pages() directly. Gerd has acked this patch. LVM1: Is now even more broken. drivers/mtd/devices/blkmtd.c: blkmtd is broken by this change. I contacted Simon Evans, who said "I had done a rewrite of blkmtd anyway and just need to convert it to BIO. Feel free to break it in the 2.5 tree, it will force me to finish my code." Neither EVMS nor LVM2 use kiobufs. The only remaining breakage of which I am aware is a proprietary MPEG2 streaming module. It could use get_user_pages().
2002-10-04[PATCH] use buffer_boundary() for writeback scheduling hintsAndrew Morton
This is the replacement for write_mapping_buffers(). Whenever the mpage code sees that it has just written a block which had buffer_boundary() set, it assumes that the next block is dirty filesystem metadata. (This is a good assumption - that's what buffer_boundary is for). So we do a lookup in the blockdev mapping for the next block and it if is present and dirty, then schedule it for IO. So the indirect blocks in the blockdev mapping get merged with the data blocks in the file mapping. This is a bit more general than the write_mapping_buffers() approach. write_mapping_buffers() required that the fs carefully maintain the correct buffers on the mapping->private_list, and that the fs call write_mapping_buffers(), and the implementation was generally rather yuk. This version will "just work" for filesystems which implement buffer_boundary correctly. Currently this is ext2, ext3 and some not-yet-merged reiserfs patches. JFS implements buffer_boundary() but does not use ext2-like layouts - so there will be no change there. Works nicely.
2002-10-04[PATCH] remove write_mapping_buffers()Andrew Morton
When the global buffer LRU was present, dirty ext2 indirect blocks were automatically scheduled for writeback alongside their data. I added write_mapping_buffers() to replace this - the idea was to schedule the indirects close in time to the scheduling of their data. It works OK for small-to-medium sized files but for large, linear writes it doesn't work: the request queue is completely full of file data and when we later come to scheduling the indirects, their neighbouring data has already been written. So writeback of really huge files tends to be a bit seeky. So. Kill it. Will fix this problem by other means.
2002-09-22[PATCH] low-latency page reclaimAndrew Morton
Convert the VM to not wait on other people's dirty data. - If we find a dirty page and its queue is not congested, do some writeback. - If we find a dirty page and its queue _is_ congested then just refile the page. - If we find a PageWriteback page then just refile the page. - There is additional throttling for write(2) callers. Within generic_file_write(), record their backing queue in ->current. Within page reclaim, if this tasks encounters a page which is dirty or under writeback onthis queue, block on it. This gives some more writer throttling and reduces the page refiling frequency. It's somewhat CPU expensive - under really heavy load we only get a 50% reclaim rate in pages coming off the tail of the LRU. This can be fixed by splitting the inactive list into reclaimable and non-reclaimable lists. But the CPU load isn't too bad, and latency is much, much more important in these situations. Example: with `mem=512m', running 4 instances of `dbench 100', 2.5.34 took 35 minutes to compile a kernel. With this patch, it took three minutes, 45 seconds. I haven't done swapcache or MAP_SHARED pages yet. If there's tons of dirty swapcache or mmap data around we still stall heavily in page reclaim. That's less important. This patch also has a tweak for swapless machines: don't even bother bringing anon pages onto the inactive list if there is no swap online.
2002-09-17[PATCH] move the buffer_head IO functions into buffer.cAndrew Morton
Patch from Christoph Hellwig. Move the buffer_head-based IO functions out of ll_rw_blk.c and into fs/buffer.c. So the buffer IO functions are all in buffer.c, and ll_rw_blk.c knows nothing about buffer_heads. This patch has been acked by Jens.
2002-09-09[PATCH] buffer_head takedown for bighighmem machinesAndrew Morton
This patch addresses the excessive consumption of ZONE_NORMAL by buffer_heads on highmem machines. The algorithms which decide which buffers to shoot down are fairly dumb, but they only cut in on machines with large highmem:lowmem ratios and the code footprint is tiny. The buffer.c change implements the buffer_head accounting - it sets the upper limit on buffer_head memory occupancy to 10% of ZONE_NORMAL. A possible side-effect of this change is that the kernel will perform more calls to get_block() to map pages to disk. This will only be observed when a file is being repeatadly overwritten - this is the only case in which the "cached get_block result" in the buffers is useful. I did quite some testing of this back in the delalloc ext2 days, and was not able to come up with a test in which the cached get_block result was measurably useful. That's for ext2, which has a fast get_block(). A desirable side effect of this patch is that the kernel will be able to cache much more blockdev pagecache in ZONE_NORMAL, so there are more ext2/3 indirect blocks in cache, so with some workloads, less I/O will be performed. In mpage_writepage(): if the number of buffer_heads is excessive then buffers are stripped from pages as they are submitted for writeback. This change is only useful for filesystems which are using the mpage code. That's ext2 and ext3-writeback and JFS. An mpage patch for reiserfs was floating about but seems to have got lost. There is no need to strip buffers for reads because the mpage code does not attach buffers for reads. These are perhaps not the most appropriate buffer_heads to toss away. Perhaps something smarter should be done to detect file overwriting, or to toss the 'oldest' buffer_heads first. In refill_inactive(): if the number of buffer_heads is excessive then strip buffers from pages as they move onto the inactive list. This change is useful for all filesystems. This approach is good because pages which are being repeatedly overwritten will remain on the active list and will retain their buffers, whereas pages which are not being overwritten will be stripped.
2002-07-18[PATCH] Add 4G-1 file support to FAT32Hirofumi Ogawa
This patch changes cont_prepare_write(), in order to support a 4G-1 file for FAT32. int cont_prepare_write(struct page *page, unsigned offset, - unsigned to, get_block_t *get_block, unsigned long *bytes) + unsigned to, get_block_t *get_block, loff_t *bytes) And it fixes broken adfs/affs/fat/hfs/hpfs/qnx4 by this cont_prepare_write() change.
2002-07-14[PATCH] direct-to-BIO for O_DIRECTAndrew Morton
Here's a patch which converts O_DIRECT to go direct-to-BIO, bypassing the kiovec layer. It's followed by a patch which converts the raw driver to use the O_DIRECT engine. CPU utilisation is about the same as the kiovec-based implementation. Read and write bandwidth are the same too, for 128k chunks. But with one megabyte chunks, this implementation is 20% faster at writing. I assume this is because the kiobuf-based implementation has to stop and wait for each 128k chunk, whereas this code streams the entire request, regardless of its size. This is with a single (oldish) scsi disk on aic7xxx. I'd expect the margin to widen on higher-end hardware which likes to have more requests in flight. Question is: what do we want to do with this sucker? These are the remaining users of kiovecs: drivers/md/lvm-snap.c drivers/media/video/video-buf.c drivers/mtd/devices/blkmtd.c drivers/scsi/sg.c the video and mtd drivers seems to be fairly easy to de-kiobufize. I'm aware of one proprietary driver which uses kiobufs. XFS uses kiobufs a little bit - just to map the pages. So with a bit of effort and maintainer-irritation, we can extract the kiobuf layer from the kernel.
2002-07-04Merge home.transmeta.com:/home/torvalds/v2.5/viroLinus Torvalds
into home.transmeta.com:/home/torvalds/v2.5/linux
2002-07-04[PATCH] kdev_t crapectomyAlexander Viro
* since the last caller of is_read_only() is gone, the function itself is removed. * destroy_buffers() is not used anymore; gone. * fsync_dev() is gone; the only user is (broken) lvm.c and first step in fixing lvm.c will consist of propagating struct block_device * anyway; at that point we'll just use fsync_bdev() in there. * prototype of bio_ioctl() removed - function doesn't exist anymore.