This is the fs/ part of the big kfree cleanup patch.
Remove pointless checks for NULL prior to calling kfree() in fs/.
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
We must be sure that the current data in the buffer are sent to disk. Hence we
have to call ll_rw_block() with SWRITE.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Fix a race between journal_commit_transaction() and other places, such as
journal_unmap_buffer(), that add buffers to the transaction's t_forget list.
We have to protect against such places by holding j_list_lock even when
traversing the t_forget list. The fact that other places can only add buffers
to the list makes the locking easier. OTOH the lock ranking complicates the
stuff...
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Fix destruction of in-use journal_head
journal_put_journal_head() can destroy a journal_head at any time as
long as the jh's b_jcount is zero and b_transaction is NULL. It has no
locking protection against the rest of the journaling code, as the lock
it uses to protect b_jcount and bh->b_private is not used elsewhere in
jbd.
However, there are small windows where b_transaction is getting set
temporarily to NULL during normal operations; typically this is
happening in
	__journal_unfile_buffer(jh);
	__journal_file_buffer(jh, ...);
call pairs, as __journal_unfile_buffer() will set b_transaction to NULL
and __journal_file_buffer() re-sets it afterwards. A truncate running
in parallel can lead to journal_unmap_buffer() destroying the jh if it
occurs between these two calls.
Fix this by adding a variant of __journal_unfile_buffer() which is only
used for these temporary jh unlinks, and which leaves the b_transaction
field intact so that we never leave a window open where b_transaction is
NULL.
Additionally, trap this error if it does occur, by checking against
jh->b_jlist being non-null when we destroy a jh.
Signed-off-by: Stephen Tweedie <sct@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
journal_commit_transaction() is 720 lines long. This patch pulls about 55
of them out into their own function, removes a goto and cleans up the
control flow a little.
Signed-off-by: Matthew Wilcox <matthew@wil.cx>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Dynamically allocate the holding array for kjournald write batching rather
than allocating it on the stack.
Signed-off-by: Alex Tomas <alex@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Fix a credits leak in journal_release_buffer().
The idea is to charge a buffer in journal_dirty_metadata(), not in
journal_get_*_access(). Each buffer has a flag that
journal_dirty_metadata() sets on the buffer.
Signed-off-by: Alex Tomas <alex@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
I found bugs in the error handling in the functions around the ext3 file
system which cause inadequate completions of synchronous write I/O
operations when disk I/O failures occur. Both 2.4 and 2.6 have this
problem.
I carried out the following experiment:
1. Mount an ext3 file system on a SCSI disk in ordered mode.
2. Open a file on the file system with the O_SYNC|O_RDWR|O_TRUNC|O_CREAT flags.
3. Write 512 bytes of data to the file by calling write() every 5 seconds,
and examine the return values from write().
4. Disconnect the SCSI cable, and examine messages from the kernel.
After the SCSI cable is disconnected, write() must fail. But the result
was different: write() succeeded for a while even though kernel messages
reported SCSI I/O errors.
By applying the following modifications, the above problem was solved.
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The attached patch fixes long scheduling latencies in the ext3 code, and it
also cleans up the existing lock-break functionality to use the new
primitives.
This patch has been in the -VP patchset for quite some time.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
With the demise of intermezzo, the journal callback stuff in jbd is
entirely unused (neither ext3 nor ocfs2 use it), and thus will only bitrot
and bloat the kernel with code and data structure growth. If intermezzo
ever gets resurrected, this will be the least of the problems its
developers will have to face (with both the generic kernel and jbd).
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Processes can sleep in do_get_write_access(), waiting for buffers to be
removed from the BJ_Shadow state. We did this by doing a wake_up_buffer() in
the commit path and sleeping on the buffer in do_get_write_access().
With the filtered bit-level wakeup code this doesn't work properly any more -
the wake_up_buffer() accidentally wakes up tasks which are sleeping in
lock_buffer() as well. Those tasks now implicitly assume that the buffer came
unlocked. Net effect: Bogus I/O errors when reading journal blocks, because
the buffer isn't up to date yet. Hence the recent spate of journal_bmap()
failure reports.
The patch creates a new jbd-private BH flag purely for this wakeup function.
So a wake_up_bit(..., BH_Unshadow) doesn't wake up someone who is waiting for
a wake_up_bit(BH_Lock).
JBD was the only user of wake_up_buffer(), so remove it altogether.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: Al Viro <viro@parcelfarce.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Mount with "mount -o barrier=1" to enable barriers.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
From: Chris Mason <mason@suse.com>
jbd needs to wait for any io to complete on the buffer before changing the
end_io function. Using set_buffer_locked means that it can change the
end_io function while the page is in the middle of writeback, and the
writeback bit on the page will never get cleared.
Since we set the buffer dirty earlier on, if the page was previously dirty,
pdflush or memory pressure might trigger a writepage call, which will race
with jbd's set_buffer_locked.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Fix a problem discovered by Jeff Mahoney <jeffm@suse.com>, based on an initial
patch from Chris Mason <mason@suse.com>.
journal_get_descriptor_buffer() is used to obtain a regular old buffer_head
against the blockdev mapping. The caller will populate that bh by hand and
will then submit it for writing.
But there are problems:
a) The function sets bh->b_state nonatomically. But this buffer is
accessible to other CPUs via pagecache lookup.
b) The function sets the buffer dirty and then the caller populates it and
then it is submitted for I/O. Wrong order: there's a window in which the
VM could write the buffer before it is fully populated.
c) The function fails to set the buffer uptodate after zeroing it. And one
caller forgot to mark it uptodate as well. So if the VM happens to decide
to write the containing page back __block_write_full_page() encounters a
dirty, not uptodate buffer, which is an illegal state. This was generating
buffer_error() warnings before we removed buffer_error().
Leaving the buffer not uptodate also means that a concurrent reader of
/dev/hda1 could cause physical I/O against the buffer, scribbling on what
we just put in it.
So journal_get_descriptor_buffer() is changed to mark the buffer
uptodate, under the buffer lock.
I considered changing journal_get_descriptor_buffer() to return a locked
buffer but there doesn't seem to be a need for this, and both callers end up
using ll_rw_block() anyway, which requires that the buffer be unlocked again.
Note that the journal_get_descriptor_buffer() callers dirty these buffers with
set_buffer_dirty(). That's a bit naughty, because it could create dirty
buffers against a clean page - an illegal state. They really should use
mark_buffer_dirty() to dirty the page and inode as well. But all callers will
immediately write and clean the buffer anyway, so we can safely leave this
optimising cheat in place.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Almost everywhere where JBD removes a buffer from the transaction lists the
caller then nulls out jh->b_transaction. Sometimes, the caller does that
without holding the locks which are defined to protect b_transaction. This
makes me queasy.
So change things so that __journal_unfile_buffer() nulls out b_transaction
inside both j_list_lock and jbd_lock_bh_state().
It cleans things up a bit, too.
|
|
Fix a few buglets spotted by Jeff Mahoney <jeffm@suse.com>. We're currently
only checking for I/O errors against journal buffers if they were locked when
they were first inspected.
We need to check buffer_uptodate() even if the buffers were already unlocked.
|
|
For data=ordered, kjournald at commit time has to write out and wait upon a
long list of buffers. It does this in a rather awkward way with a single
list. It causes complexity and long lock hold times, and makes the addition
of rescheduling points quite hard.
So what we do instead (based on Chris Mason's suggestion) is to add a new
buffer list (t_locked_list) to the journal. It contains buffers which have
been placed under I/O.
So as we walk the t_sync_datalist list we move buffers over to t_locked_list
as they are written out.
When t_sync_datalist is empty we may then walk t_locked_list waiting for the
I/O to complete.
As a side-effect this means that we can remove the nasty synchronous wait in
journal_dirty_data which is there to avoid the kjournald livelock which would
otherwise occur when someone is continuously dirtying a buffer.
|
|
There's some nasty code in commit which deals with a lock ranking problem.
Currently, if it fails to get the lock while the local variable `bufs' is
zero, we forget to write out some ordered-data buffers. So a subsequent
crash+recovery could yield stale data in existing files.
Fix it by correctly restarting the t_sync_datalist search.
|
|
The locking rules say that b_committed_data is covered by
jbd_lock_bh_state(), so implement that during the start of commit, while
throwing away unused shadow buffers.
I don't expect that there is really a race here, but them's the rules.
|
|
Sometimes kjournald has to refile a huge number of buffers, because someone
else wrote them out beforehand - they are all clean.
This happens under a lock and scheduling latencies of 88 milliseconds on a
2.7GHz CPU were observed.
The patch forward-ports a little bit of the 2.4 low-latency patch to fix this
problem.
Worst-case on ext3 is now sub-half-millisecond, except for when the RCU
dentry reaping softirq cuts in :(
|
|
Plug the two-megabyte-per-day memory leak.
|
|
We're getting assertion failures in commit in data=journal mode.
journal_unmap_buffer() has unexpectedly donated this buffer to the committing
transaction, and the commit-time assertion doesn't expect that to happen. It
doesn't happen in 2.4 because both paths are under lock_journal().
Simply remove the assertion: the commit code will uncheckpoint the buffer and
then recheckpoint it if needed.
|
|
From: Alex Tomas <bzzz@tmi.comex.ru>
start_this_handle() takes into account t_outstanding_credits when calculating
log free space, but journal_next_log_block() accounts for blocks being logged
also. Hence, blocks are accounted twice. This effectively reduces the
amount of log space available to transactions and forces more commits.
Fix it by decrementing t_outstanding_credits each time we allocate a new
journal block.
|
|
- remove accidental debug code from ext3 commit.
- /proc/profile documentation fix (Randy Dunlap)
- use sb_breadahead() in ext2_preread_inode()
- unused var in mpage_writepages()
|
|
CPU0                                    CPU1

journal_get_write_access(bh)
  (Add buffer to t_reserved_list)

                                        journal_get_write_access(bh)
                                          (It's already on t_reserved_list:
                                           nothing to do)

(We decide we don't want to
 journal the buffer after all)

journal_release_buffer()
  (It gets pulled off the transaction)

                                        journal_dirty_metadata()
                                          (The buffer isn't on the reserved
                                           list!  The kernel explodes)
Simple fix: just leave the buffer on t_reserved_list in
journal_release_buffer(). If nobody ends up claiming the buffer then it will
get thrown away at start of transaction commit.
|
|
We need to unconditionally brelse() the buffer in there, because
journal_remove_journal_head() leaves a ref behind.
release_buffer_page() does that. Call it all the time because we can usually
strip the buffers and free the page even if it was not marked buffer_freed().
Mainly affects data=journal mode
|
|
ext3 and JBD still have enormous numbers of lines which end in tabs. Fix
them all up.
|
|
With data=ordered it is often the case that a quick write-and-truncate will
leave large numbers of pages on the page LRU with no ->mapping, and attached
buffers, because ext3 was not ready to let the pages go at the time of
truncation.
These pages are trivially reclaimable, but their seeming absence makes the VM
overcommit accounting confused (they don't count as "free", nor as
pagecache). And they make the /proc/meminfo stats look odd.
So what we do here is to try to strip the buffers from these pages as the
buffers exit the journal commit.
|
|
start_this_handle() can decide to add this handle to a transaction, but
kjournald then moves the handle into commit phase.
Extend the coverage of j_state_lock so that start_this_handle()'s
examination of journal->j_state is atomic wrt journal_commit_transaction().
|
|
Plug a conceivable race with the freeing up of transactions, and add some
more debug checks.
|
|
This filesystem-wide sleeping lock is no longer needed. Remove it.
|
|
lock_kernel() is no longer needed in JBD. Remove all the lock_kernel() calls
from fs/jbd/.
Here is where I get to say "ex-parrot".
|
|
Remove the remaining sleep_on() calls from JBD.
|
|
From: Alex Tomas <bzzz@tmi.comex.ru>
We're about to remove lock_journal(), and it is lock_journal which separates
the running and committing transaction's revokes on the single revoke table.
So implement two revoke tables and rotate them at commit time.
|
|
Go through all sites which use j_committing_transaction and ensure that the
designed locking is correctly implemented there.
|
|
Implement the designed locking around journal->j_running_transaction.
A lot more of the new locking scheme falls into place.
|
|
Provide the designed locking around the transaction's t_jcb callback list.
It turns out that this is wholly redundant at present.
|
|
Provide the designed locking for transaction_t.t_updates.
|
|
This was a system-wide spinlock.
Simple transformation: make it a filesystem-wide spinlock, in the JBD
journal.
That's a bit lame, and later it might be nice to make it per-transaction_t.
But there are interesting ranking and ordering problems with that, especially
around __journal_refile_buffer().
|
|
Go through all use of b_transaction and implement the rules.
Fairly straightforward.
|
|
We now start to move across the JBD data structures' fields, from the
"innermost" fields outwards.
Start with journal_head.b_frozen_data, because the locking for this field was
partially implemented in jbd-010-b_committed_data-race-fix.patch.
It is protected by jbd_lock_bh_state(). We keep the lock_journal() and
spin_lock(&journal_datalist_lock) calls in place. Later,
spin_lock(&journal_datalist_lock) is replaced by
spin_lock(&journal->j_list_lock).
Of course, this completion of the locking around b_frozen_data also puts a
lot of the locking for other fields in place.
|
|
journal_unlock_journal_head() is misnamed: what it does is to drop a ref on
the journal_head and free it if that ref fell to zero. It doesn't actually
unlock anything.
Rename it to journal_put_journal_head().
|
|
buffer_heads and journal_heads are joined at the hip. We need a lock to
protect the joint and its refcounts.
JBD is currently using a global spinlock for that. Change it to use one bit
in bh->b_state.
|
|
This was a strange spinlock which was designed to prevent another CPU from
ripping a buffer's journal_head away while this CPU was inspecting its state.
Really, we don't need it - we can inspect that state directly from bh->b_state.
So kill it off, along with a few things which used it which are themselves
not actually used any more.
|
|
From: Alex Tomas <bzzz@tmi.comex.ru>
We have a race wherein the block allocator can decide that
journal_head.b_committed_data is present and then will use it. But kjournald
can concurrently free it and set the pointer to NULL. It goes oops.
We introduce per-buffer_head "spinlocking" based on a bit in b_state. To do
this we abstract out pte_chain_lock() and reuse the implementation.
The bit-based spinlocking is pretty inefficient CPU-wise (hence the warning
in there) and we may move this to a hashed spinlock later.
|
|
From: Hua Zhong <hzhong@cisco.com>
The current ext3 totally ignores I/O errors that happen at
journal_force_commit time, causing user space to falsely believe the
operation succeeded when it actually did not.
This patch checks for I/O errors during journal_commit_transaction() and
aborts the journal when there is an I/O error.
Originally I thought about reporting the error without aborting the
journal, but it probably needs a new flag. Aborting the journal seems to be
the easy way to signal "hey, something is wrong..".
|
|
Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> points out a bug in
ll_rw_block() usage.
Typical usage is:
	mark_buffer_dirty(bh);
	ll_rw_block(WRITE, 1, &bh);
	wait_on_buffer(bh);
The problem is that if the buffer was locked on entry to this code sequence
(due to in-progress I/O), ll_rw_block() will not wait, and start new I/O. So
this code will wait on the _old_ I/O, and will then continue execution,
leaving the buffer dirty.
It turns out that all callers were only writing one buffer, and they were all
waiting on that writeout. So I added a new sync_dirty_buffer() function:
	void sync_dirty_buffer(struct buffer_head *bh)
	{
		lock_buffer(bh);
		if (test_clear_buffer_dirty(bh)) {
			get_bh(bh);
			bh->b_end_io = end_buffer_io_sync;
			submit_bh(WRITE, bh);
		} else {
			unlock_buffer(bh);
		}
	}
which allowed a fair amount of code to be removed, while adding the desired
data-integrity guarantees.
UFS has its own wrappers around ll_rw_block() which got in the way, so this
operation was open-coded in that case.
|
|
This is the leak which Con found. Long story...
- If a dirty page is fed into ext3_writepage() during truncate,
block_write_full_page() will return -EIO (it's outside i_size) and will
leave the buffers dirty. In the expectation that discard_buffer() will
clean them.
- ext3_writepage() then adds the still-dirty buffers to the journal's
"async data list". These are buffers which are known to have had IO
started. All we need to do is to wait on them in commit.
- meanwhile, truncate will chop the pages off the address_space. But
truncate cannot invalidate the buffers (in journal_unmap_buffer()) because
the buffers are attached to the committing transaction. (hm. This
behaviour in journal_unmap_buffer() is bogus. We just never need to write
these buffers.)
- ext3 commit will "wait on writeout" of these writepage buffers (even
though it was never started) and will then release them from the
journalling system.
So we end up with pages which are attached to no mapping, which are clean and
which have dirty buffers. These are unreclaimable.
Aside:
ext3-ordered has two buffer lists: the "sync data list" and the "async
data list".
The sync list consists of probably-dirty buffers which were dirtied in
commit_write(). Transaction commit must write all these out and wait on
them.
The async list supposedly consists of clean buffers which were attached
to the journal in ->writepage. These have had IO started (by writepage) so
commit merely needs to wait on them.
This is all designed for the 2.4 VM really. In 2.5, tons of writeback
goes via writepage (instead of the buffer lru) and these buffers end up
madly hopping between the async and sync lists.
Plus it's arguably incorrect to just wait on the writes in commit - if
the buffers were set dirty again (say, by zap_pte_range()) then perhaps we
should write them again before committing.
So what the patch does is to remove the async list. All ordered-data buffers
are now attached to the single "sync data list". So when we come to commit,
those buffers which are dirty will have IO started and all buffers are waited
upon.
This means that the dirty buffers against a clean page which came about from
block_write_full_page()'s -EIO will be written to disk in commit - this
cleans them, and the page is now reclaimable. No leak.
It seems bogus to write these buffers in commit, and indeed it is. But ext3
will not allow those blocks to be reused until the commit has ended so there
is no corruption risk. And the amount of data involved is low - it only
comes about as a race between truncate and writepage().
|