| Age | Commit message | Author |
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Looks like locking can be optimised quite a lot. Increase lock widths
slightly so lo_lock is taken fewer times per request. Also it was quite
trivial to cover lo_pending with that lock, and remove the atomic
requirement. This also makes memory ordering explicitly correct, which is
nice (not that I particularly saw any mem ordering bugs).
Test was reading 4 250MB files in parallel on ext2-on-tmpfs filesystem (1K
block size, 4K page size). System is 2 socket Xeon with HT (4 thread).
intel:/home/npiggin# umount /dev/loop0 ; mount /dev/loop0 /mnt/loop ; /usr/bin/time ./mtloop.sh
Before:
0.24user 5.51system 0:02.84elapsed 202%CPU (0avgtext+0avgdata 0maxresident)k
0.19user 5.52system 0:02.88elapsed 198%CPU (0avgtext+0avgdata 0maxresident)k
0.19user 5.57system 0:02.89elapsed 198%CPU (0avgtext+0avgdata 0maxresident)k
0.22user 5.51system 0:02.90elapsed 197%CPU (0avgtext+0avgdata 0maxresident)k
0.19user 5.44system 0:02.91elapsed 193%CPU (0avgtext+0avgdata 0maxresident)k
After:
0.07user 2.34system 0:01.68elapsed 143%CPU (0avgtext+0avgdata 0maxresident)k
0.06user 2.37system 0:01.68elapsed 144%CPU (0avgtext+0avgdata 0maxresident)k
0.06user 2.39system 0:01.68elapsed 145%CPU (0avgtext+0avgdata 0maxresident)k
0.06user 2.36system 0:01.68elapsed 144%CPU (0avgtext+0avgdata 0maxresident)k
0.06user 2.42system 0:01.68elapsed 147%CPU (0avgtext+0avgdata 0maxresident)k
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This is a megarollup of ~60 patches which give various things static scope.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Implements fallback to file_operations->write in the case that
aops->{prepare,commit}_write are not present on the backing filesystem.
The fallback happens in two different ways:
- For normal loop devices, i.e. ones which do not do transformation on
the data but simply pass it along, we simply call fops->write. This
should be pretty much just as fast as using aops->{prepare,commit}_write
directly.
- For all other loop devices (e.g. xor and cryptoloop), i.e. all the
ones which may be doing transformations on the data, we allocate and map
a page (once for each bio), then for each bio vec we copy the bio vec
page data to our mapped page, apply the loop transformation, and use
fops->write to write out the transformed data from our page. Once all
bio vecs from the bio are done, we unmap and free the page.
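The double-buffering path for transforming loop devices can be sketched in user-space C roughly as follows. This is only a model under stated assumptions: `struct seg`, `struct sink`, `sink_write`, `xor_transfer` and `send_transformed` are hypothetical names, a heap buffer stands in for the mapped page, and a memory sink stands in for fops->write.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ 4096

/* One segment of a request; stands in for a bio vec (hypothetical). */
struct seg {
    const unsigned char *data;
    size_t len;                 /* must not exceed PAGE_SZ */
};

/* Memory sink standing in for the backing file's fops->write. */
struct sink {
    unsigned char buf[2 * PAGE_SZ];
    size_t pos;
};

static int sink_write(struct sink *s, const unsigned char *p, size_t len)
{
    if (s->pos + len > sizeof(s->buf))
        return -1;
    memcpy(s->buf + s->pos, p, len);
    s->pos += len;
    return 0;
}

/* Toy transformation: XOR every byte with a one-byte key. */
static void xor_transfer(unsigned char *p, size_t len, unsigned char key)
{
    for (size_t i = 0; i < len; i++)
        p[i] ^= key;
}

/*
 * Allocate one bounce page per request, then for each segment:
 * copy it into the bounce page, transform it there, and write the
 * transformed bytes out. The caller's data is never modified.
 */
static int send_transformed(struct sink *s, const struct seg *segs,
                            size_t nsegs, unsigned char key)
{
    unsigned char *page = malloc(PAGE_SZ);
    int ret = 0;

    if (!page)
        return -1;
    for (size_t i = 0; i < nsegs && ret == 0; i++) {
        if (segs[i].len > PAGE_SZ) {
            ret = -1;
            break;
        }
        memcpy(page, segs[i].data, segs[i].len);
        xor_transfer(page, segs[i].len, key);
        ret = sink_write(s, page, segs[i].len);
    }
    free(page);
    return ret;
}
```

The non-transforming case needs none of this: it would hand each segment straight to the write routine.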
This approach has the least overhead I could come up with, and for
performance-hungry people I left the address space operations method in
place for filesystems which implement
aops->{prepare,commit}_write.
I have tested this patch with normal loop devices using
aops->{prepare,commit}_write on the backing filesystem, with normal loop
devices using the fops->write code path and with cryptoloop devices using
the double buffering + fops->write code path.
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
With Andries Brouwer <Andries.Brouwer@cwi.nl>
Fix various recursion scenarios wherein it was possible to mount a loop
device on itself, either directly or via intermediate loop devices.
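One way to picture the check is a walk down the chain of backing devices before a new binding is accepted. The sketch below is a user-space model, not the driver's code: `struct loopdev` and `would_recurse` are hypothetical names, and it assumes the existing chain is already free of cycles.

```c
#include <stddef.h>

/*
 * Hypothetical model: each loop device points at the loop device that
 * backs it, or NULL when it is backed by a regular file or a real
 * block device.
 */
struct loopdev {
    struct loopdev *backing;
};

/*
 * Return 1 if binding 'dev' on top of 'candidate' would make 'dev',
 * directly or through intermediate loop devices, its own backing store.
 */
static int would_recurse(const struct loopdev *dev,
                         const struct loopdev *candidate)
{
    for (const struct loopdev *l = candidate; l != NULL; l = l->backing)
        if (l == dev)
            return 1;
    return 0;
}
```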
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Convert loopback device to new module_param to get rid of warning.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
We have a fun situation with read_descriptor_t - all its instances end
up passed to some actor; these actors use desc->buf as their private
data; there are 5 of them and they expect resp:
struct lo_read_data *
struct svc_rqst *
struct file *
struct rpc_xprt *
char __user *
IOW, there is no type safety whatsoever; the field is essentially untyped,
we rely on the fact that actor is chosen by the same code that sets ->buf
and expect it to put something of the right type there.
Right now desc->buf is declared as char __user *. Moreover, the last
argument of ->sendfile() (what should be stored in ->buf) is void __user *,
even though it's actually _never_ a userland pointer.
If nothing else, ->sendfile() should take void * instead; that alone removes
a bunch of bogus warnings. I went further and replaced desc->buf with a
union of void * and char __user *.
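A rough user-space picture of the resulting descriptor is given below. The field names and the actor are illustrative assumptions (`model_read_desc`, `model_lo_data` and `model_lo_actor` are hypothetical); the point is only that kernel-side actors read their private pointer through the `void *` member while the user-facing actor keeps the `char *` one.

```c
#include <stddef.h>

/* Hypothetical stand-in for read_descriptor_t after the change. */
typedef struct {
    size_t written;
    size_t count;
    union {
        char *buf;   /* would be char __user * in the kernel */
        void *data;  /* actor-private pointer */
    } arg;
    int error;
} model_read_desc;

/* Models the loop driver's private data (struct lo_read_data). */
struct model_lo_data {
    unsigned char *out;
};

/* An actor that copies into its private buffer, found via arg.data. */
static size_t model_lo_actor(model_read_desc *desc,
                             const unsigned char *src, size_t n)
{
    struct model_lo_data *p = desc->arg.data;

    if (n > desc->count)
        n = desc->count;
    for (size_t i = 0; i < n; i++)
        p->out[desc->written + i] = src[i];
    desc->written += n;
    desc->count -= n;
    return n;
}
```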
|
|
From: Russell King <rmk+lkml@arm.linux.org.uk>
It appears the loop driver has had one flush_dcache_page() call added for
the case where it writes to the backing device page cache pages.
However, it seems to be missing the call where it writes to its own page
cache pages.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
|
|
From: Yury Umanets <torque@ukrpost.net>
I have found small inconsistency in loop_set_fd(). It checks if
->sendfile() is implemented for passed block device file. But in fact,
loop back device driver never calls it. It uses ->sendfile() from backing
store file.
|
|
From: Nigel Cunningham <ncunningham@users.sourceforge.net>
A few weeks ago, Pavel and I agreed that PF_IOTHREAD should be renamed to
PF_NOFREEZE. This reflects the fact that some threads so marked aren't
actually used for IO while suspending, but simply shouldn't be frozen.
This patch, against 2.6.5 vanilla, applies that change. In the
refrigerator calls, the actual value doesn't matter (so long as it's
non-zero) and it makes more sense to use PF_FREEZE so I've used that.
|
|
From: Jens Axboe <axboe@suse.de>,
Chris Mason,
me, others.
The global unplug list causes horrid spinlock contention on many-disk
many-CPU setups - throughput is worse than halved.
The other problem with the global unplugging is of course that it will cause
the unplugging of queues which are unrelated to the I/O upon which the caller
is about to wait.
So what we do to solve these problems is to remove the global unplug and set
up the infrastructure under which the VFS can tell the block layer to unplug
only those queues which are relevant to the page or buffer_head which is
about to be waited upon.
We do this via the very appropriate address_space->backing_dev_info structure.
Most of the complexity is in devicemapper, MD and swapper_space, because for
these backing devices, multiple queues may need to be unplugged to complete a
page/buffer I/O. In each case we ensure that data structures are in place to
permit us to identify all the lower-level queues which contribute to the
higher-level backing_dev_info. Each contributing queue is told to unplug in
response to a higher-level unplug.
To simplify things in various places we also introduce the concept of a
"synchronous BIO": it is tagged with BIO_RW_SYNC. The block layer will
perform an immediate unplug when it sees one of these go past.
|
|
From: Chris Mason <mason@suse.com>
I think Andrew and I managed to mismerge the loop setup race fix.
loop_set_fd is using get_capacity() to read the size of the disk and
sending that to bd_set_size.
But, it is doing this before calling set_capacity, so the size being used
is wrong. This should clean things up.
|
|
From: Chris Mason <mason@suse.com>
There's a race in loopback setup, it's easiest to trigger with one or more
procs doing loopback mounts at the same time. The problem is that
fs/block_dev.c:do_open() only calls bdev_set_size on the first open.
Picture two procs:
proc1: mount -o loop file1 mnt1
proc2: mount -o loop file2 mnt2
proc1 proc2
open /dev/loop0 # bd_openers now 1
do_open
bd_set_size(bdev, 0) # loop unbound, so bdev size is 0
open /dev/loop0 # bd_openers now 2
loop_set_fd # disk capacity now correct, but
# bdev not updated
mount /dev/loop0 /mnt
do_open
Because bd_openers != 0 for the last do_open, bd_set_size is not called
again and a size of 0 is used. This eventually leads to an oops when the
loop device is unmounted, because fsync_bdev calls block_write_full_page
which decides every page on the block device is outside i_size and unmaps
them.
When ext2 or reiserfs try to sync a metadata buffer, we get an oops
because the buffers are no longer mapped.
The patch below changes loop_set_fd and loop_clr_fd to also manipulate the
size of the block device, which fixes things for me.
|
|
From: Arjan van de Ven <arjanv@redhat.com>
The patch below (written by Al Viro) solves a nasty chicken-and-egg issue
for operating system installers (well at least anaconda but the problem
domain is not exclusive to that)
The basic problem is this:
- The small first stage installer locates the image file of the second
stage installer (which has X and all the graphical stuff); this image can
be on the same CD, but it can come via NFS, http or ftp or ... as well.
- The first stage installer loop-back mounts this image and gives control
to the second stage installer by calling some binary there.
- The graphical installer then asks the user all those questions and
starts installing packages. Again the packages can come from the CD but
also from NFS or http or ...
Now in case of a CD install, once all requested packages from the first CD
are installed, the installer wants to unmount and eject the CD and prompt
the user to put CD 2 in....... EXCEPT that the unmount can't work since
the installer is actually running from a loopback mount of this cd.
The solution is a "LOOP_CHANGE_FD" ioctl, where basically the installer
copies the image to the harddisk (which can only be done late, since the
target harddisk is only mkfs'd late) and then magically switches the backing
store FD from underneath the loop device to the one on the target harddisk
(and thus unbusying the CD mount).
This is obviously only allowed if the size of the new image is identical
and if the loop image is read-only in the first place. It's the
responsibility of root to make sure the contents are the same (but that's of
the give-root-enough-rope kind)
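From user space, the switch could look something like the sketch below. It assumes a Linux system whose <linux/loop.h> defines LOOP_CHANGE_FD; `switch_backing` is a hypothetical helper name, and error handling is kept to the minimum.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/loop.h>

/*
 * Point an already-bound, read-only loop device at a new image of the
 * same size. Returns 0 on success, -1 on failure with errno set.
 */
static int switch_backing(int loop_fd, const char *new_image)
{
    int new_fd = open(new_image, O_RDONLY);
    int ret, saved;

    if (new_fd < 0)
        return -1;
    ret = ioctl(loop_fd, LOOP_CHANGE_FD, new_fd);
    saved = errno;
    close(new_fd);      /* our descriptor is no longer needed either way */
    errno = saved;
    return ret < 0 ? -1 : 0;
}
```

An installer would call this with the open /dev/loopN descriptor once the copy on the target disk is complete, and only then unmount the CD.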
|
|
From: "Yury V. Umanets" <umka@namesys.com>
This removes a redundant assignment in loop.
|
|
From: BlaisorBlade <blaisorblade_spam@yahoo.it>
loop_init doesn't fail gracefully for two reasons:
1) If initialization of loop driver fails, we have a call to
devfs_add("loop") without any devfs_remove; I add that.
2) On lwn.net 2.6 kernel docs, Jonathan Corbet says: "If you are calling
add_disk() in your driver initialization routine, you should not fail
the initialization process after the first call."
So I make loop.c conform to this request by moving add_disk after all
memory allocations.
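The ordering rule can be illustrated with a small two-phase init in plain C. Everything here is a model: `struct model_dev`, `model_init` and the `fail_at` test hook are hypothetical, and setting `published` stands in for add_disk().

```c
#include <stdlib.h>

#define NDEV 4

struct model_dev {
    void *queue;     /* stands in for any per-device allocation */
    int published;   /* set where add_disk() would be called */
};

/* 'fail_at' forces the allocation for that index to fail (-1: none). */
static int model_init(struct model_dev *devs, int fail_at)
{
    int i;

    for (i = 0; i < NDEV; i++) {
        devs[i].queue = NULL;
        devs[i].published = 0;
    }
    /* Phase 1: do everything that can fail. */
    for (i = 0; i < NDEV; i++) {
        devs[i].queue = (i == fail_at) ? NULL : malloc(16);
        if (!devs[i].queue)
            goto out_free;
    }
    /* Phase 2: publish the devices; nothing below may fail. */
    for (i = 0; i < NDEV; i++)
        devs[i].published = 1;
    return 0;

out_free:
    /* Undo only the allocations made before the failing index. */
    while (--i >= 0) {
        free(devs[i].queue);
        devs[i].queue = NULL;
    }
    return -1;
}
```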
|
|
From: Ben Slusky <sluskyb@paranoiacs.org>
One more patch --- this fixes a minor bio handling bug in the filebacked
code path. I'd fixed it incidentally in the loop-recycle patch.
I don't think you could actually see damage from this bug unless you
run device mapper on top of loop devices, but still this is the correct
behavior.
|
|
From: Ben Slusky <sluskyb@paranoiacs.org>
The attached patch changes the loop device transfer functions (including
cryptoloop transfers) to accept page/offset pairs instead of virtual
addresses, and removes the redundant kmaps in do_lo_send, do_lo_receive,
and loop_transfer_bio. Per Andrew Morton's request a while back.
|
|
This patch removes the loop feature wherein we remap BIOs for block-backed
loop. So file-backed and block-backed loop are handled identically.
It cleans up the code a lot and fixes the low-on-memory lockups which
block-backed loop currently suffers.
What we lose is the journalling ordering guarantees which
exts-on-loop-on-blockdev had. But dm-crypt provides that.
|
|
From: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
This patch fixes the error number when an invalid file is passed (neither
S_ISBLK nor S_ISREG is true). We should return -EINVAL.
|
|
From: Erik van Konijnenburg <ekonijn@xs4all.nl>
There are two issues here:
- absence of a MODULE_ALIAS_BLOCK in loop.c
- mismatch between the patterns used in the MODULE_ALIAS_BLOCK define and
the modprobe invocation in request_module.
(acked by Rusty)
|
|
From: viro@parcelfarce.linux.theplanet.co.uk <viro@parcelfarce.linux.theplanet.co.uk>
For bdevfs inodes (ones created along with struct block_device by
fs/block_dev.c) we have inode->i_bdev equal to &BDEV_I(inode)->bdev (i.e.
it's at the constant offset from inode). New helper added for such inodes
(I_BDEV(inode)). A bunch of places (mostly in block_dev.c) switched to use
of that helper. A bunch of places that used
file->f_dentry->d_inode->i_bdev->bd_inode
switched to
file->f_mapping->host
- those expressions are equal whenever the former is valid.
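The "constant offset" relationship is the usual container_of pattern. The following is a stand-alone sketch with simplified, hypothetical struct layouts (the `model_` names are not the kernel's), showing why the helper can recover the block device without following inode->i_bdev.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures. */
struct model_block_device {
    int bd_openers;
};

struct model_inode {
    unsigned long i_ino;
    struct model_block_device *i_bdev;
};

/* bdevfs allocates the block device and its inode as one object. */
struct model_bdev_inode {
    struct model_block_device bdev;
    struct model_inode vfs_inode;
};

#define model_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static struct model_bdev_inode *model_BDEV_I(struct model_inode *inode)
{
    return model_container_of(inode, struct model_bdev_inode, vfs_inode);
}

/* Valid only for inodes that really live inside a model_bdev_inode. */
static struct model_block_device *model_I_BDEV(struct model_inode *inode)
{
    return &model_BDEV_I(inode)->bdev;
}
```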
|
|
From: viro@parcelfarce.linux.theplanet.co.uk <viro@parcelfarce.linux.theplanet.co.uk>
More uses of ->i_mapping switched to uses of ->f_mapping - stuff that was not
caught by the earlier f_mapping conversion.
|
|
From: viro@parcelfarce.linux.theplanet.co.uk <viro@parcelfarce.linux.theplanet.co.uk>
A lot of places used to use ->f_dentry->d_inode->i_mapping all over the
place. Replaced with use of ->f_mapping. For now - just the places where we
literally could do search-and-replace.
|
|
- Fix an error-path file refcount leak
- Remove unnecessary get_file()/fput() pair.
- Clean up error handling a little
|
|
From: Ben Slusky <sluskyb@paranoiacs.org>
We need to set the hardsect_size of the loop device to that of the real
device.
The loop device advertises a block size of 1024 even when configured over a
cdrom.
When burning a ext2 on a cd, and mounting it directly, I get:
blocksize=2048;
when I losetup /dev/loop0 /dev/cdrom, and then try to mount, I get:
blocksize=1024; and then misaligned transfer; this results in not being able
to read the superblock.
The loop device should be changed to export the same blocksize of the
underlying device
|
|
Real conversion to 32bit dev_t. Expansion to:
* mknod() - 32
* newstat() - 32 on 64bit platforms
* stat64() - 32 on mips, 64 on everything else (mips has weird struct
stat64 and can't get more than 32 bits). Note that right now the difference
is purely theoretical - we don't have internal values above 32 bits, so
huge_... vs. new_... only marks the places where 64bit conversion will need
extra work.
* arch-dependent stat variants - depending on width available.
* ustat et al. - 32
* filesystems that can handle 32 bits right now - 32
* ext2 and ext3 - 32, with large dev_t inodes having 0 in the first
element of i_data[] (where we store dev_t value for small device numbers) and
keeping the value in the second element.
* nfsd - 32; it can be driven to 64, but we'll get several issues with
NFSv2 support.
* RAID - 32
* devmapper - with v1 it's still 16 (nothing to do here), with v4 it's
64.
* loop - 64
* initramfs - 32
* do_mounts code - 32. Parts that scan devfs tree are using newstat()
on 64bit platforms and stat64() on the rest (IOW, the latest stat variant on
given platform).
* old_valid_dev()/new_valid_dev() added where needed (stat variants,
mostly - we fail with -EOVERFLOW if values do not fit).
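As an illustration of the old and new encodings and the validity check, here is a user-space sketch. The 20-bit minor split and the helper bodies follow my recollection of the kernel's kdev_t.h and should be treated as assumptions; the `k_` names are hypothetical.

```c
#include <stdint.h>

#define K_MINORBITS 20
#define K_MINORMASK ((1u << K_MINORBITS) - 1)

static inline unsigned k_major(uint32_t dev) { return dev >> K_MINORBITS; }
static inline unsigned k_minor(uint32_t dev) { return dev & K_MINORMASK; }
static inline uint32_t k_mkdev(unsigned ma, unsigned mi)
{
    return ((uint32_t)ma << K_MINORBITS) | mi;
}

/* Does the device number fit the old 8:8 on-disk/user format? */
static inline int k_old_valid_dev(uint32_t dev)
{
    return k_major(dev) < 256 && k_minor(dev) < 256;
}

/* Old 16-bit encoding: major in the high byte, minor in the low byte. */
static inline uint16_t k_old_encode_dev(uint32_t dev)
{
    return (uint16_t)((k_major(dev) << 8) | k_minor(dev));
}

/*
 * New 32-bit encoding: stays compatible with the old one for small
 * numbers, with the extra minor bits placed above the major.
 */
static inline uint32_t k_new_encode_dev(uint32_t dev)
{
    unsigned ma = k_major(dev);
    unsigned mi = k_minor(dev);

    return (mi & 0xff) | (ma << 8) | ((mi & ~0xffu) << 12);
}
```

A caller that must hand back an old-style value would check k_old_valid_dev() first and fail with -EOVERFLOW when the number does not fit.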
|
|
Added old_encode_dev() in loop.c
|
|
This uses CLONE_KERNEL in place of the individual
flags, only changing the places where it is an exact
match.
I strongly suspect that CLONE_KERNEL ought to be
used in many more places, but they require a more
careful examination.
|
|
From: Peter Osterlund <petero2@telia.com>
It oopses on module unload in the kobject layer due to misordered destruction
of things.
And we need to initialise the unplug timer in blk_alloc_queue(), because we
kill that timer in blk_alloc_queue()'s companion function,
blk_cleanup_queue().
|
|
From: Oliver Xymoron <oxymoron@waste.org>
This patch just saves a few bytes in the inode by turning mapping->gfp_mask
into an unsigned long mapping->flags.
The mapping's gfp mask is placed in the 16 high bits of mapping->flags and
two of the remaining 16 bits are used for tracking EIO and ENOSPC errors.
This leaves 14 bits in the mapping for future use. They should be accessed
with the atomic bitops.
|
|
loop-on-file oopses during unmount. This is because lo_queue is now freed
during lo_ioctl(LOOP_CLR_FD). I think the scenario is:
1: umount(8) opens /dev/loop0
2: umount(8) runs lo_ioctl(LOOP_CLR_FD) (this frees the queue)
3: umount(8) closes the /dev/loop0 handle. The blockdev layer syncs the
blockdev, but its mapping->backing_dev_info now points into la-la-land.
We shouldn't be freeing the queue until all refs to it have gone away. This
patch gives the queue the same lifetime as the controlling loop_device
itself. It also makes the loop driver's queue appear in sysfs again.
It would be better to free the queue when the device is not in use, but I'm
not sure how we can hook into the blockdev layer to do that.
|
|
It was caused by improper IV calculation in loop.c
|
|
To be able to properly keep references to block queues,
we make blk_init_queue() return the queue that it initialized, and
let it be independently allocated and then cleaned up on the last
reference.
I have grepped high and low, and there really shouldn't be any broken
uses of blk_init_queue() in the kernel drivers left. The added bonus
being blk_init_queue() error checking is explicit now, most of the
drivers were broken in this regard (even IDE/SCSI).
No drivers have embedded request queue structures. Drivers that don't
use blk_init_queue() but blk_queue_make_request(), should allocate the
queue with blk_alloc_queue(gfp_mask). I've converted all of them to do
that, too. They can call blk_cleanup_queue() now too, though using the
define blk_put_queue() is probably cleaner.
|
|
From: Daniel McNeil <daniel@osdl.org>
This adds i_seqcount to the inode structure and then uses i_size_read() and
i_size_write() to provide atomic access to i_size. This is a port of
Andrea Arcangeli's i_size atomic access patch from 2.4. This only uses the
generic reader/writer consistent mechanism.
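The sequence-counter idea can be sketched with C11 atomics as below. This is a user-space approximation, not the kernel's seqcount code: `struct sized_inode`, `size_write` and `size_read` are hypothetical names, the 64-bit size is stored as two 32-bit halves to mimic a 32-bit machine, and writers are assumed to be serialised by some outer lock.

```c
#include <stdatomic.h>
#include <stdint.h>

struct sized_inode {
    atomic_uint seq;           /* even: stable, odd: write in progress */
    _Atomic uint32_t lo, hi;   /* 64-bit size kept as two 32-bit halves */
};

/* Single writer at a time (callers hold an outer lock in this model). */
static void size_write(struct sized_inode *in, uint64_t size)
{
    unsigned s = atomic_load_explicit(&in->seq, memory_order_relaxed);

    atomic_store_explicit(&in->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&in->lo, (uint32_t)size, memory_order_relaxed);
    atomic_store_explicit(&in->hi, (uint32_t)(size >> 32),
                          memory_order_relaxed);
    atomic_store_explicit(&in->seq, s + 2, memory_order_release);
}

/* Retry until both halves were read under the same even sequence. */
static uint64_t size_read(struct sized_inode *in)
{
    unsigned s1, s2;
    uint32_t lo, hi;

    do {
        s1 = atomic_load_explicit(&in->seq, memory_order_acquire);
        lo = atomic_load_explicit(&in->lo, memory_order_relaxed);
        hi = atomic_load_explicit(&in->hi, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        s2 = atomic_load_explicit(&in->seq, memory_order_relaxed);
    } while ((s1 & 1) || s1 != s2);

    return ((uint64_t)hi << 32) | lo;
}
```

A reader can never return a value built from the low half of one write and the high half of another, which is the transient value the writepage() concern is about.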
Before:
mnm:/usr/src/25> size vmlinux
text data bss dec hex filename
2229582 1027683 162436 3419701 342e35 vmlinux
After:
mnm:/usr/src/25> size vmlinux
text data bss dec hex filename
2225642 1027655 162436 3415733 341eb5 vmlinux
3.9k more text, a lot of it fastpath :(
It's a very minor bug, and the fix has a fairly non-minor cost. The most
compelling reason for fixing this is that writepage() checks i_size. If it
sees a transient value it may decide that page is outside i_size and will
refuse to write it. Lost user data.
|
|
util-linux is waiting for this: it needs to update "struct loop_info64"
to add the encryption policy name.
|
|
This does the following:
- IV value is current 512-byte sector relative to start of loop
container file. This is what all cryptoloop people have done, if I
am not mistaken. Andi or others - if you can demonstrate the need
for a more flexible setup an additional ioctl field may be needed. I
hope we can do without.
- made some things static
- made lo_offset a loff_t
- added lo_sizelimit
If one wanted a (crypto)loop somewhere inside a container file, the
old code allowed a starting offset, but no size, so that the
cryptoloop always extended to the end of the container file. This
field allows one to select an arbitrary interval. Note that this
changes struct loop_info64.
- improve error handling of loop_init()
- removed the unused typedef transfer_proc_t.
- added a define for LO_CRYPT_CRYPTOAPI
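To make the interval selection by lo_offset and lo_sizelimit concrete, here is a small sketch of how the usable size might be derived. `loop_sectors` is a hypothetical helper, and treating a size limit of 0 as "extend to the end of the container" is an assumption made for this example.

```c
#include <stdint.h>

/*
 * Usable size of the loop device in 512-byte sectors, given the
 * container file size, the starting offset and an optional size limit
 * (all in bytes). A sizelimit of 0 is taken to mean "no limit".
 */
static uint64_t loop_sectors(uint64_t file_size, uint64_t offset,
                             uint64_t sizelimit)
{
    uint64_t size;

    if (offset >= file_size)
        return 0;
    size = file_size - offset;
    if (sizelimit > 0 && sizelimit < size)
        size = sizelimit;
    return size >> 9;
}
```

For a 1 MiB container with a 4096-byte offset and an 8192-byte limit, this yields a 16-sector device that starts 8 sectors into the file.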
|
|
This does the following:
- remove trailing spaces
- make loop.h independent by including bio.h, blk.h, spinlock.h
- replace the lock/unlock functions by module_get/module_put;
in struct loop this is the change
- void (*lock)(struct loop_device *);
- void (*unlock)(struct loop_device *);
+ struct module *owner;
- replace the integer lo_encrypt_type by the pointer lo_encryption;
there was a race with loop_unregister_transfer
- fixed an off-by-one in loop_register_transfer
This is Step 1 of a series of half a dozen or so.
Half of the above is from Jari. Anything that is wrong is mine.
|
|
From: Hugh Dickins <hugh@veritas.com>
loop_get_buffer loses PF_MEMDIE if it's added while in loop_copy_bio: not a
high probability since it's not waiting there, but could happen, and sets a
bad example (compare with add_to_swap fixed a while back).
|
|
From: Hugh Dickins <hugh@veritas.com>
loop_copy_bio uses one gfp_mask for bio_alloc and alloc_page calls. The
bio_alloc obviously can't use highmem, but the alloc_page can. Yes, the
underlying device might be unable to use highmem, and have to use one of
its bounce buffers, with an extra copy: so be it.
(Originally I did propagate the underlying device's bounce needs down to
the loop device, to avoid that possible extra copy; but let's keep this
simple, the low end doesn't have highmem and the high end can I/O it.)
|
|
From: Hugh Dickins <hugh@veritas.com>
What purpose does loop_make_request's blk_queue_bounce serve? None, it's
just a relic from before the kmaps were added to loop's transfers, and ties
up mempooled resources - in the file-backed case, with no guarantee they'll
soon be freed. And what purpose does loop_set_fd's blk_queue_bounce_limit
serve? None, blk_queue_make_request did that.
|
|
From: Hugh Dickins <hugh@veritas.com>
Jonah Sherman <jsherman@stuy.edu> pointed out back in February how
LO_FLAGS_BH_REMAP is never actually set, since loop_init_xfer only calls
the init for non-0 encryption type. Fix that or scrap it? Let's scrap it
for now, that path (hacking values in bio instead of copying data) seems
never to have been tested, and adds to the number of paths through loop:
leave that optimization to some other occasion.
|
|
From: Hugh Dickins <hugh@veritas.com>
Remove unused IV from loop_make_request (loop_transfer_bio does that).
|
|
From: Hugh Dickins <hugh@veritas.com>
Remove copy flag and code from loop_copy_bio: wasn't used when reading, and
waste of time when writing - the loop transfer function does that. And
don't initialize bio fields immediately reinitialized by caller.
|
|
From: Hugh Dickins <hugh@veritas.com>
Now it's in loop not bio, better rename bio_copy to loop_copy_bio: loop
prefers names that way; and bio_transfer better named loop_transfer_bio.
Rename bio,b to rbh,bio to follow call from loop_get_buffer more easily.
|
|
From: Hugh Dickins <hugh@veritas.com>
bio_copy is used only by the loop driver, which already has to walk the bio
segments itself: so it makes sense to change it from bio.c export to loop.c
static, as prelude to working upon it there.
bio_copy itself is unchanged by this patch, with one exception. On oom
failure it must use bio_put, instead of mempool_free to static bio_pool:
which it should have been doing all along - it was leaking the veclist.
(Grudgingly acked by Jens)
|
|
From: Hugh Dickins <hugh@veritas.com>
When loop restricts underlying file's allocation mask to avoid deadlock, it
unintentionally masks out its highmem capability, making failures at the
underlying level much more likely.
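The intended behaviour is to clear only the bits that can recurse into I/O or the filesystem and leave everything else, including the highmem bit, as it was. A sketch with made-up bit values (the `M_GFP_*` constants and `restrict_mask` are illustrative and do not match the kernel's real __GFP_* definitions):

```c
/* Illustrative bit values only; the kernel's real flags differ. */
#define M_GFP_HIGHMEM 0x02u
#define M_GFP_IO      0x40u
#define M_GFP_FS      0x80u

/* Drop the I/O and FS bits, preserving all other capabilities. */
static unsigned restrict_mask(unsigned old_mask)
{
    return old_mask & ~(M_GFP_IO | M_GFP_FS);
}
```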
|
|
The loop thread is getting permanently stuck in balance_dirty_pages()
(nr_writeback is exceeded) because the loop thread itself is responsible for
completing writeback on behalf of higher layers.
So we need to take that out: don't throttle the loop thread. Throttle the
tasks which are generating all the dirty data instead.
|
|
Well, this is it for me and strlcpy. I'll leave the rest of the
non-obvious usages of strncpy to the kernel janitors. Seems like quite a
few uses really wanted memcpy instead, but I don't have time to
investigate them all. It does appear that nearly all strncpy's will be
removable. Obsoleting strncpy will probably at least make the remaining
few think about how they are using it.
This is the patch for my trip through drivers/*.
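For readers unfamiliar with the semantics being relied on, a minimal user-space sketch of an strlcpy-style copy follows. `sketch_strlcpy` is a hypothetical name and this is not the kernel's implementation; it only shows the two properties that matter here: the destination is always NUL-terminated when size is non-zero, and the return value is the full source length so truncation can be detected.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy src into dst (capacity 'size'), always terminating dst when
 * size > 0. Returns strlen(src); a result >= size means truncation.
 */
static size_t sketch_strlcpy(char *dst, const char *src, size_t size)
{
    size_t len = strlen(src);

    if (size) {
        size_t n = (len >= size) ? size - 1 : len;

        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}
```

By contrast, strncpy leaves the destination unterminated when the source does not fit and zero-pads it when the source is short, which is why several callers really wanted memcpy.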
|