| Age | Commit message (Collapse) | Author |
|
Currently, md updates all superblocks (one on each device) in series. It
waits for one write to complete before starting the next. This isn't a big
problem as superblock updates don't happen that often.
However it is neater to do it in parallel, and if the drives in the array have
gone to "sleep" after a period of idleness, then waking them in parallel is
faster (and someone else should be worrying about power drain).
Further, we will need parallel superblock updates for a future patch which
keeps the intent-logging bitmap near the superblock.
Also remove the silly code that retried superblock updates 100 times. This
simply never made sense.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This provides an alternative to storing the bitmap in a separate file. The
bitmap can be stored at a given offset from the superblock. Obviously the
creator of the array must make sure this doesn't intersect with data.
Placing the bitmap after the superblock is good for version-0.90 superblocks.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Before completing a 'write' the md superblock might need to be updated.
This is best done by the md_thread.
The current code schedules this up and queues the write request for later
handling by the md_thread.
However some personalities (Raid5/raid6) will deadlock if the md_thread
tries to submit requests to its own array.
So this patch changes things so the process submitting the request waits
for the superblock to be written and then submits the request itself.
This fixes a recently-created deadlock in raid5/raid6.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
When an array is degraded, bits in the intent-bitmap are never cleared. So if
a recently failed drive is re-added, we only need to reconstruct the blocks
that are still reflected in the bitmap.
This patch adds support for this re-adding.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Currently we don't wait for updates to the bitmap to be flushed to disk
properly. The infrastructure is all there, but it isn't being used.
A separate kernel thread (bitmap_writeback_daemon) is needed to wait for each
page as we cannot get callbacks when a page write completes.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
With this patch, the intent to write to some block in the array can be logged
to a bitmap file. Each bit represents some number of sectors and is set
before any update happens, and only cleared when all writes relating to all
sectors are complete.
After an unclean shutdown, information in this bitmap can be used to optimise
resync - only sectors which could be out-of-sync need to be updated.
Also if a drive is removed and then added back into an array, the recovery can
make use of the bitmap to optimise reconstruction. This is not implemented in
this patch.
Currently the bitmap is stored in a file which must (obviously) be stored on a
separate device.
The patch only provides infrastructure. It does not update any personalities
to use bitmap intent logging.
Md arrays can still be used with no bitmap file. This patch has minimal
impact on such arrays.
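
The set-before-write, clear-when-all-done rule described above can be modelled
in a few lines of userspace C. This is only an illustrative sketch, not the
code added by the patch: the names (intent_map, write_start, write_end), the
chunk size and the per-chunk pending counter are all assumptions made for the
example, and each write is assumed to fall inside a single chunk.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes, chosen for illustration only. */
#define CHUNK_SECTORS 1024u   /* sectors covered by one bit */
#define NUM_CHUNKS    64u     /* chunks tracked by this toy map */

struct intent_map {
	uint8_t  bits[NUM_CHUNKS / 8];  /* one bit per chunk; 1 = may be out of sync */
	uint32_t pending[NUM_CHUNKS];   /* writes currently in flight per chunk */
};

static void map_init(struct intent_map *m)
{
	memset(m, 0, sizeof(*m));
}

static int chunk_is_dirty(const struct intent_map *m, uint64_t sector)
{
	uint64_t c = sector / CHUNK_SECTORS;

	assert(c < NUM_CHUNKS);
	return (m->bits[c / 8] >> (c % 8)) & 1;
}

/* Call before a write is issued. A real implementation would also have to
 * get the bit onto stable storage before any data is updated. */
static void write_start(struct intent_map *m, uint64_t sector)
{
	uint64_t c = sector / CHUNK_SECTORS;

	assert(c < NUM_CHUNKS);
	m->pending[c]++;
	m->bits[c / 8] |= (uint8_t)(1u << (c % 8));
}

/* Call when a write completes. The bit is cleared only once every write
 * touching the chunk has finished. */
static void write_end(struct intent_map *m, uint64_t sector)
{
	uint64_t c = sector / CHUNK_SECTORS;

	assert(c < NUM_CHUNKS && m->pending[c] > 0);
	if (--m->pending[c] == 0)
		m->bits[c / 8] &= (uint8_t)~(1u << (c % 8));
}

/* After an unclean shutdown, only chunks whose bit is still set need resync. */
static unsigned chunks_needing_resync(const struct intent_map *m)
{
	unsigned n = 0;

	for (uint64_t c = 0; c < NUM_CHUNKS; c++)
		n += (m->bits[c / 8] >> (c % 8)) & 1;
	return n;
}
```

With two writes in flight to the same chunk, the bit stays set until the second
write_end(), which is the property the resync optimisation relies on.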
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
1/ change the return value (which is number-of-sectors synced)
from 'int' to 'sector_t'.
The number of sectors is usually easily small enough to fit
in an int, but if resync needs to abort, it may want to return
the total number of remaining sectors, which could be large.
Also errors cannot be returned as negative numbers now, so use
0 instead
2/ Add a 'skipped' return parameter to allow the array to report
that it skipped the sectors. This allows md to take this into account
in the speed calculations.
Currently there is no important skipping, but the bitmap-based-resync
that is coming will use this.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
When md marks the superblock dirty before a write, it calls
generic_make_request (to write the superblock) from within
generic_make_request (to write the first dirty block), which could cause
problems later.
With this patch, the superblock write is always done by the helper thread, and
write requests are delayed until that write completes.
Also, the locking around marking the array dirty and writing the superblock is
improved to avoid possible races.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
If we detect an overlap, we set a flag and wait for a wakeup. When requests
are handled, if the flag was set, we perform the wakeup.
Note that the code currently in -mm is badly broken. With this patch applied,
it passes tests that use O_DIRECT to cause lots of overlapping requests.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The hashtable that linear uses to find the right device stores
two pointers for every entry.
The second is always one of:
The first plus 1
NULL
When NULL, it is never accessed, so any value can be stored.
Thus it could always be "first plus 1", and so we don't need to store
it as it is trivial to calculate.
This patch halves the size of this table, which results in some simpler
code as well.
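
A minimal sketch of the resulting lookup, in userspace C. The structure and
function names here (dev_info, linear_conf, which_dev) are assumptions for the
example and are not taken from the patch. It assumes the devices are stored in
array order and that each hash slot is no larger than the smallest device, so
a slot can span at most two adjacent devices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one component device in a linear array. */
struct dev_info {
	uint64_t offset;  /* first array block stored on this device */
	uint64_t size;    /* number of blocks on this device */
};

struct linear_conf {
	const struct dev_info *devs;   /* devices, in array order */
	size_t ndevs;
	const struct dev_info **hash;  /* per slot: device holding the slot's first block */
	size_t nslots;
	uint64_t spacing;              /* blocks per slot; must not exceed the smallest device */
};

static const struct dev_info *which_dev(const struct linear_conf *c,
					uint64_t block)
{
	size_t slot = (size_t)(block / c->spacing);
	const struct dev_info *d;

	assert(slot < c->nslots);
	d = c->hash[slot];
	/* The old second pointer was always "first plus 1", so compute it. */
	if (block >= d->offset + d->size)
		d++;
	assert((size_t)(d - c->devs) < c->ndevs);
	return d;
}
```

Only one pointer per slot is stored; the neighbour is derived on demand, which
is what lets the table shrink to half its size.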
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The 'faulty' personality provides a layer over any block device in which
errors may be synthesised.
A variety of errors are possible including transient and persistent read
and write errors, and read errors that persist until the next write.
The error mode can be changed on a live array.
Accessing this personality requires mdadm 2.8.0 or later.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Some size fields were "int" instead of "sector_t".
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add some missing data_offset additions and some le_to_cpu conversions and fix
a few other little mistakes.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Both raid1 and multipath have a "retry_list" which is global, so all raid1
arrays (for example) use the same list. This is rather ugly, and it is simple
enough to make it per-array, so this patch does that.
It also changes the multipath code to use list.h lists instead of
roll-your-own.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch adds a 'raid10' module which provides features similar to both
raid0 and raid1 in the one array. Various combinations of layout are
supported.
This code is still "experimental", but appears to work.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
1/ Introduce "mddev->resync_max_sectors" so that an md personality
can ask for resync to cover a different address range than that of a
single drive. raid10 will use this.
2/ fix is_mddev_idle so that if there seems to be a negative number
of events, it doesn't immediately assume activity.
3/ make "sync_io" (the count of IO sectors used for array resync)
an atomic_t to avoid SMP races.
4/ Pass md_sync_acct a "block_device" rather than the containing "rdev",
as the whole rdev isn't needed. Also make this an inline function.
5/ Make sure recovery gets interrupted on any error.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This allows the number of "raid_disks" in a raid1 to be changed.
This requires allocating a new pool of "r1bio" structures with a different
number of bios, suspending IO, and swapping the new pool in place of the old.
(and a few other related changes).
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
It is possible to have raid1/4/5/6 arrays that do not use all the space on the
drive. This can be done explicitly, or can happen if you, one by one,
replace all the drives with larger devices.
This patch extends the "SET_ARRAY_INFO" ioctl (which was previously invalid on
active arrays) to allow some attributes of the array to be changed and implements
changing of the "size" attribute.
"size" is the amount of each device that is actually used. If "size" is
increased, the new space will immediately be "resynced".
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
->nr_pending hits 0
md_check_recovery only locks a device and does stuff when it thinks there is a
real likelihood that something needs doing. So the test at the top must cover
all possibilities.
But it didn't cover the possibility that the last outstanding request on a
failed device had finished and so the device needed to be removed.
As a result, a failed drive might not get removed from the personality's
perspective on the array, and so it could never be removed from the array as a
whole.
With this patch, whenever ->nr_pending hits zero on a faulty device,
MD_RECOVERY_NEEDED is set so that md_check_recovery will do stuff.
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
It no longer exists.
|
|
From: Neil Brown <neilb@cse.unsw.edu.au>
I've made a bunch of changes to the 'md' bits - largely moving the
unplugging into the individual personalities which know more about which
drives are actually in use.
|
|
From: Jens Axboe <axboe@suse.de>,
Chris Mason,
me, others.
The global unplug list causes horrid spinlock contention on many-disk
many-CPU setups - throughput is worse than halved.
The other problem with the global unplugging is of course that it will cause
the unplugging of queues which are unrelated to the I/O upon which the caller
is about to wait.
So what we do to solve these problems is to remove the global unplug and set
up the infrastructure under which the VFS can tell the block layer to unplug
only those queues which are relevant to the page or buffer_head which is
about to be waited upon.
We do this via the very appropriate address_space->backing_dev_info structure.
Most of the complexity is in devicemapper, MD and swapper_space, because for
these backing devices, multiple queues may need to be unplugged to complete a
page/buffer I/O. In each case we ensure that data structures are in place to
permit us to identify all the lower-level queues which contribute to the
higher-level backing_dev_info. Each contributing queue is told to unplug in
response to a higher-level unplug.
To simplify things in various places we also introduce the concept of a
"synchronous BIO": it is tagged with BIO_RW_SYNC. The block layer will
perform an immediate unplug when it sees one of these go past.
|
|
From: NeilBrown <neilb@cse.unsw.edu.au>
This helps raid5 work on at least 1 very large array.
Thanks to Evan Felix <evan.felix@pnl.gov>
|
|
From: NeilBrown <neilb@cse.unsw.edu.au>
With this patch, md uses two major numbers for arrays.
One major is number 9 with name 'md' and has unpartitioned md arrays, one per
minor number.
The other major is allocated dynamically with name 'mdp' and has one array for
every 64 minors, allowing for up to 63 partitions.
The arrays under one major are completely separate from the arrays under the
other.
The preferred names for devices with the new major are of the form:
/dev/md/d1p3 # partition 3 of device 1 - minor 67
When a partitioned md device is assembled, the partitions are not recognised
until after the whole-array device is opened again. A future version of
mdadm will perform this open so that the need will be transparent.
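
The minor numbering under the 'mdp' major follows directly from "one array for
every 64 minors". A small sketch of the arithmetic, with helper names that are
assumptions for this example (they are not functions from the patch), and
assuming the first minor of each block of 64 is the whole-array device:

```c
#include <assert.h>

/* Each partitionable array owns a block of 64 minors. The first minor of
 * the block is taken to be the whole array and the remaining 63 are its
 * partitions. The macro name is an assumption made for this sketch. */
#define MDP_MINORS_PER_ARRAY 64u

/* Minor number for a given array and partition (partition 0 = whole array). */
static unsigned mdp_minor(unsigned array, unsigned partition)
{
	assert(partition < MDP_MINORS_PER_ARRAY);
	return array * MDP_MINORS_PER_ARRAY + partition;
}

/* Inverse mappings: which array and which partition a minor refers to. */
static unsigned mdp_array(unsigned minor)
{
	return minor / MDP_MINORS_PER_ARRAY;
}

static unsigned mdp_partition(unsigned minor)
{
	return minor % MDP_MINORS_PER_ARRAY;
}
```

For /dev/md/d1p3 this gives 1 * 64 + 3 = 67, matching the example above.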
|
|
From: NeilBrown <neilb@cse.unsw.edu.au>
For each resync request, we allocate a "r1_bio" which has a bio "master_bio"
attached that goes largely unused. We also allocate a read_bio which is
used. This patch removes the read_bio and just uses the master_bio instead.
This fixes a bug wherein bi_bdev of the master_bio wasn't being set, but was
being used.
We also introduce a new "sectors" field into the r1_bio as we can no longer
rely on master_bio->bi_sectors.
|
|
From: NeilBrown <neilb@cse.unsw.edu.au>
next_r1 is never used, so it can just go.
read_bio isn't needed as we can easily use one of the pointers in the
write_bios array - write_bios[->read_disk]. So rename "write_bios" to "bios"
and store the pointer to the read bio in there.
|
|
From: NeilBrown <neilb@cse.unsw.edu.au>
The only time it is really needed is to differentiate a retry-on-fail from a
write-after-read-for-resync request to raid1d. So we use a bit in 'state'
for that.
|
|
messages.
From: NeilBrown <neilb@cse.unsw.edu.au>
Instead of using ("md%d", mdidx(mddev)), we now use ("%s", mdname(mddev))
where mdname is the disk_name field in the associated gendisk structure.
This allows future flexibility in naming.
|
|
From: NeilBrown <neilb@cse.unsw.edu.au>
Move the pointers into mddev. This reduces dependence on MAX_MD_DEVS.
|
|
From: NeilBrown <neilb@cse.unsw.edu.au>
Thanks dann frazier <dannf@hp.com>
|
|
From: "H. Peter Anvin" <hpa@zytor.com>
RAID6 implementation. See Kconfig help for usage details.
The next release of `mdadm' has raid6 userspace support.
|
|
Starting the conversion:
* internal dev_t made 32bit.
* new helpers - new_encode_dev(), new_decode_dev(), huge_encode_dev(),
huge_decode_dev(), new_valid_dev(). They do encoding/decoding of 32bit and
64bit values; for now huge_... are aliases for new_... and new_valid_dev()
is always true. We do 12:20 for 32bit; representation is compatible with
16bit one - we have major in bits 19--8 and minor in 31--20,7--0. That's
what the userland sees; internally we have (major << 20)|minor, of course.
* MKDEV(), MAJOR() and MINOR() updated.
* several places used to handle Missed'em'V dev_t (14:18 split)
manually; that stuff had been taken into common helpers.
Now we can start replacing old_... with new_... and huge_..., depending
on the width available. MKDEV() callers should (for now) make sure that major
and minor are within 12:20. That's what the next chunk will do.
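
The two layouts described above (internal (major << 20)|minor, and the
userland form with major in bits 19--8 and minor in bits 31--20,7--0) can be
written out as a pair of conversions. This is a userspace sketch using
hypothetical sk_* names; it illustrates the bit layout stated in this message
and is not a copy of the kernel helpers.

```c
#include <assert.h>
#include <stdint.h>

#define SK_MINORBITS 20u
#define SK_MINORMASK ((1u << SK_MINORBITS) - 1u)

/* Internal representation: (major << 20) | minor, i.e. a 12:20 split. */
static uint32_t sk_mkdev(uint32_t major, uint32_t minor)
{
	assert(major < (1u << 12) && minor <= SK_MINORMASK);
	return (major << SK_MINORBITS) | minor;
}

/* Internal -> userland: the low 8 minor bits stay in bits 7..0, the major
 * goes to bits 19..8, and the high 12 minor bits move up to bits 31..20. */
static uint32_t sk_encode_dev(uint32_t dev)
{
	uint32_t major = dev >> SK_MINORBITS;
	uint32_t minor = dev & SK_MINORMASK;

	return (minor & 0xffu) | (major << 8) | ((minor & ~0xffu) << 12);
}

/* Userland -> internal: the exact inverse of sk_encode_dev(). */
static uint32_t sk_decode_dev(uint32_t user)
{
	uint32_t major = (user & 0xfff00u) >> 8;
	uint32_t minor = (user & 0xffu) | ((user >> 12) & 0xfff00u);

	return sk_mkdev(major, minor);
}
```

When major and minor both fit in 8 bits the encoded value is simply
(major << 8) | minor, which is why the representation stays compatible with
the 16bit one.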
|
|
To be able to properly keep references to block queues,
we make blk_init_queue() return the queue that it initialized, and
let it be independently allocated and then cleaned up on the last
reference.
I have grepped high and low, and there really shouldn't be any broken
uses of blk_init_queue() in the kernel drivers left. The added bonus
being blk_init_queue() error checking is explicit now, most of the
drivers were broken in this regard (even IDE/SCSI).
No drivers have embedded request queue structures. Drivers that don't
use blk_init_queue() but blk_queue_make_request(), should allocate the
queue with blk_alloc_queue(gfp_mask). I've converted all of them to do
that, too. They can call blk_cleanup_queue() now too, using the define
blk_put_queue() is probably cleaner though.
|
|
|
|
Linear uses one array sized by MD_SB_DISKS inside a structure.
We move it to the end of the structure, declare it as size 0,
and arrange for appropriate extra space to be allocated on
structure allocation.
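
The pattern is the usual trailing-array idiom. A userspace sketch, assuming
hypothetical names (dev_slot, linear_private, linear_private_alloc) and using
calloc() in place of the kernel allocator:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-device record, for illustration only. */
struct dev_slot {
	unsigned long long size;
	unsigned long long offset;
};

struct linear_private {
	unsigned nr_disks;
	/* Trailing array with no fixed bound. C99 spells this as a flexible
	 * array member; older code declares it with size 0. */
	struct dev_slot disks[];
};

/* Allocate the header plus room for exactly nr_disks trailing entries. */
static struct linear_private *linear_private_alloc(unsigned nr_disks)
{
	struct linear_private *p;

	assert(nr_disks > 0);
	p = calloc(1, sizeof(*p) + nr_disks * sizeof(p->disks[0]));
	if (p)
		p->nr_disks = nr_disks;
	return p;
}
```

The allocation is sized by the number of devices actually in the array, so
small arrays use less memory and large ones are not capped by a compile-time
constant.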
|
|
raid1 uses MD_SB_DISKS to size two data structures,
but the new version-1 superblock allows for more than
this number of disks (and most actual arrays use many
fewer).
This patch sizes the two arrays dynamically.
One becomes a separate kmalloced array.
The other is moved to the end of the containing structure
and appropriate extra space is allocated.
Also, change r1buf_pool_alloc (which allocates buffers for
a mempool for doing re-sync) to not get r1bio structures
from the r1bio pool (which could exhaust the pool) but instead
to allocate them separately.
|
|
Arrays with type-1 superblock can have more than
MD_SB_DISKS, so we remove the dependency on that number from
raid0, replacing several fixed sized arrays with one
dynamically allocated array.
|
|
One embedded array gets moved to end of structure and
sized dynamically.
|
|
Multipath has a dependency on MD_SB_DISKS which is no
longer authoritative. We change it to use a separately
allocated array.
|
|
To cope with a raid0 array with differing sized devices,
raid0 divides an array into "strip zones".
The first zone covers the start of all devices, up to an offset
equal to the size of the smallest device.
The second strip zone covers the remaining devices up to the size of the
next smallest size, etc.
In order to determine which strip zone a given address is in,
the array is logically divided into slices the size of the smallest
zone, and a 'hash' table is created listing the first and, if relevant,
second zone in each slice.
As the smallest slice can be very small (imagine an array with a
76G drive and a 75.5G drive) this hash table can be rather large.
With this patch, we limit the size of the hash table to one page,
at the possible cost of making several probes into the zone list
before we find the correct zone.
We also cope with the possibility that a zone could be larger than
a 32bit sector address would allow.
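
To make the zone layout concrete, here is a userspace sketch that builds the
strip zones from a sorted list of device sizes and then finds the zone for an
address by probing the zone list. The names (strip_zone, build_zones,
find_zone) are assumptions for the example. The probe here always starts at
the first zone; the bounded hash table described above would only supply a
better starting point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct strip_zone {
	uint64_t dev_start;  /* offset into each member device where the zone begins */
	uint64_t size;       /* total sectors in the zone, across all its devices */
	size_t   nb_dev;     /* number of devices large enough to take part */
};

/* sizes[] must be sorted in ascending order. Returns the number of zones. */
static size_t build_zones(const uint64_t *sizes, size_t ndevs,
			  struct strip_zone *zones, size_t max_zones)
{
	size_t nz = 0;
	uint64_t covered = 0;  /* per-device offset already assigned to a zone */

	for (size_t i = 0; i < ndevs; i++) {
		assert(i == 0 || sizes[i] >= sizes[i - 1]);
		if (sizes[i] == covered)
			continue;  /* same size as the previous device: no new zone */
		assert(nz < max_zones);
		zones[nz].dev_start = covered;
		zones[nz].nb_dev = ndevs - i;
		zones[nz].size = (sizes[i] - covered) * (ndevs - i);
		covered = sizes[i];
		nz++;
	}
	return nz;
}

/* Walk the zone list until the zone containing the array sector is found. */
static size_t find_zone(const struct strip_zone *zones, size_t nz,
			uint64_t sector)
{
	size_t i = 0;

	while (i + 1 < nz && sector >= zones[i].size) {
		sector -= zones[i].size;
		i++;
	}
	return i;
}
```

With devices of 100, 150 and 150 sectors this yields a first zone of 300
sectors across three devices and a second zone of 100 sectors across two.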
|
|
Sometimes raid0 and linear are required to take a single page bio that
spans two devices. We use bio_split to split such a bio into two.
At the same time, bio.h is included by linux/raid/md.h so
we don't include it elsewhere anymore.
We also modify the mergeable_bvec functions to allow a bvec
that doesn't fit if it is the first bvec to be added to
the bio, and are careful never to return a negative length from a
mergeable_bvec function.
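
The arithmetic behind such a split is simple. The sketch below models only the
calculation of where a request must be cut. It does not call bio_split, and
the names (io_range, split_at_boundary) and the use of a fixed boundary
spacing are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* A request described by its start sector and length; hypothetical type. */
struct io_range {
	uint64_t sector;
	uint32_t nr_sectors;
};

/* If req crosses a multiple of boundary_sectors, fill *first and *second
 * with the two halves and return 1. Otherwise return 0 and leave the
 * output arguments untouched. */
static int split_at_boundary(struct io_range req, uint32_t boundary_sectors,
			     struct io_range *first, struct io_range *second)
{
	uint64_t room;

	assert(boundary_sectors > 0 && req.nr_sectors > 0);
	/* Sectors left before the next boundary, counting from req.sector. */
	room = boundary_sectors - (req.sector % boundary_sectors);
	if (req.nr_sectors <= room)
		return 0;

	first->sector = req.sector;
	first->nr_sectors = (uint32_t)room;
	second->sector = req.sector + room;
	second->nr_sectors = req.nr_sectors - (uint32_t)room;
	return 1;
}
```

An 8-sector request starting at sector 124 with a boundary every 128 sectors
is cut into sectors 124..127 and 128..131; the same request at sector 120 fits
and is passed through whole.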
|
|
From: Christoph Hellwig <hch@lst.de>
partition_name() is a variant of __bdevname() that caches results and
returns a pointer to kmalloc()ed data instead of printing into a buffer.
Due to its caching it gets utterly confused when the name for a dev_t
changes (can happen easily now with device mapper and probably in the
future with dynamic dev_t users).
It's only used by the raid code and most calls are through a wrapper,
bdev_partition_name() which takes a struct block_device * that may be
NULL.
The patch below changes the bdev_partition_name() to call bdevname() if
possible and the other calls where we really have nothing more than a dev_t
to __bdevname.
Btw, it would be nice if someone who knows the md code a bit better than me
could remove bdev_partition_name() in favour of direct calls to bdevname()
where possible - that would also get rid of the returns-pointer-to-string-
on-stack issue that this patch can't fix yet.
|
|
|
|
Thanks to Angus Sawyer <angus.sawyer@dsl.pipex.com> and
Daniel McNeil <daniel@osdl.org>
|
|
Superblock format '1' resolves a number of issues with
superblock format '0'.
It is more dense and can support many more sub-devices.
It does not contain un-needed redundancy.
It adds a few new useful fields.
|
|
from start of device.
Normally the data stored on a component of a RAID array is stored
from the start of the device. This patch allows a per-device
data_offset so the data can start elsewhere. This will allow
RAID arrays where the metadata is at the head of the device
rather than the tail.
|
|
From: Angus Sawyer <angus.sawyer@dsl.pipex.com>
If there are no writes for 20 milliseconds, write out superblock
to mark array as clean. Write out superblock with
dirty flag before allowing any further write to succeed.
If an md thread gets signaled with SIGKILL, reduce the
delay to 0.
Also tidy up some printk's and make sure writing the
superblock isn't noisy.
|
|
The mdrecoveryd thread is responsible for initiating and cleaning
up resync threads.
This job can be equally well done by the per-array threads
for those arrays which might need it.
So the mdrecoveryd thread is gone and the core code that
it ran is now run by raid5d, raid1d or multipathd.
We add an MD_RECOVERY_NEEDED flag so those daemons don't have
to bother trying to lock the md array unless it is likely
that something needs to be done.
Also modify the names of all threads to include the number of
the md device.
|