Starting the conversion:
* internal dev_t made 32bit.
* new helpers - new_encode_dev(), new_decode_dev(), huge_encode_dev(),
huge_decode_dev(), new_valid_dev(). They do encoding/decoding of 32bit and
64bit values; for now huge_... are aliases for new_... and new_valid_dev()
is always true. We do 12:20 for 32bit; representation is compatible with
16bit one - we have major in bits 19--8 and minor in 31--20,7--0. That's
what the userland sees; internally we have (major << 20)|minor, of course.
* MKDEV(), MAJOR() and MINOR() updated.
* several places used to handle Missed'em'V dev_t (14:18 split)
manually; that stuff had been taken into common helpers.
Now we can start replacing old_... with new_... and huge_..., depending
on the width available. MKDEV() callers should (for now) make sure that major
and minor are within 12:20. That's what the next chunk will do.
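The described encoding can be sketched in plain C. This is an illustrative userspace model of the helpers named above, with MKDEV()/MAJOR()/MINOR() reduced to the internal (major << 20) | minor layout; it is a sketch, not the kernel source:

```c
#include <assert.h>
#include <stdint.h>

/* internal 32-bit dev_t: (major << 20) | minor */
typedef uint32_t dev32_t;
#define MINORBITS 20
#define MKDEV(ma, mi) (((dev32_t)(ma) << MINORBITS) | (mi))
#define MAJOR(dev)    ((unsigned)((dev) >> MINORBITS))
#define MINOR(dev)    ((unsigned)((dev) & ((1u << MINORBITS) - 1)))

/* userland view: major in bits 19..8, minor in bits 31..20 and 7..0,
 * so values that fit the old 16-bit 8:8 encoding come out unchanged */
static uint32_t new_encode_dev(dev32_t dev)
{
	unsigned major = MAJOR(dev), minor = MINOR(dev);

	return (minor & 0xff) | (major << 8) | ((minor & ~0xffu) << 12);
}

static dev32_t new_decode_dev(uint32_t dev)
{
	unsigned major = (dev & 0xfff00) >> 8;
	unsigned minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);

	return MKDEV(major, minor);
}
```

For small numbers new_encode_dev(MKDEV(3, 5)) comes out as 0x305, identical to the old 16-bit encoding.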
|
|
To be able to properly keep references to block queues,
we make blk_init_queue() return the queue that it initialized, and
let it be independently allocated and then cleaned up on the last
reference.
I have grepped high and low, and there really shouldn't be any broken
uses of blk_init_queue() in the kernel drivers left. The added bonus
being that blk_init_queue() error checking is now explicit; most of the
drivers were broken in this regard (even IDE/SCSI).
No drivers have embedded request queue structures. Drivers that don't
use blk_init_queue() but blk_queue_make_request() should allocate the
queue with blk_alloc_queue(gfp_mask). I've converted all of them to do
that, too. They can now call blk_cleanup_queue() too, though using the
define blk_put_queue() is probably cleaner.
|
|
|
|
Linear uses one array sized by MD_SB_DISKS inside a structure.
We move it to the end of the structure, declare it as size 0,
and arrange for appropriate extra space to be allocated on
structure allocation.
|
|
raid1 uses MD_SB_DISKS to size two data structures,
but the new version-1 superblock allows for more than
this number of disks (and most actual arrays use many
fewer).
This patch sizes the two arrays dynamically.
One becomes a separate kmalloced array.
The other is moved to the end of the containing structure
and appropriate extra space is allocated.
Also, change r1buf_pool_alloc (which allocates buffers for
a mempool for doing re-sync) to not get r1bio structures
from the r1bio pool (which could exhaust the pool) but instead
to allocate them separately.
|
|
Arrays with type-1 superblock can have more than
MD_SB_DISKS devices, so we remove the dependency on that number from
raid0, replacing several fixed sized arrays with one
dynamically allocated array.
|
|
One embedded array gets moved to the end of the structure and
sized dynamically.
|
|
Multipath has a dependency on MD_SB_DISKS which is no
longer authoritative. We change it to use a separately
allocated array.
|
|
To cope with a raid0 array with differing sized devices,
raid0 divides an array into "strip zones".
The first zone covers the start of all devices, up to an offset
equal to the size of the smallest device.
The second strip zone covers the remaining devices up to the size of the
next smallest device, etc.
In order to determine which strip zone a given address is in,
the array is logically divided into slices the size of the smallest
zone, and a 'hash' table is created listing the first and, if relevant,
second zone in each slice.
As the smallest slice can be very small (imagine an array with a
76G drive and a 75.5G drive) this hash table can be rather large.
With this patch, we limit the size of the hash table to one page,
at the possible cost of making several probes into the zone list
before we find the correct zone.
We also cope with the possibility that a zone could be larger than
a 32bit sector address would allow.
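A userspace model of the zone layout and the probe-from-hash lookup described above. The structure and function names here are illustrative, not the actual raid0 code; spacing would normally be the smallest zone size, rounded up as needed so the table fits in one page:

```c
#include <assert.h>

typedef unsigned long long sector_t;

struct strip_zone {
	sector_t start;	/* first array sector covered by this zone */
	sector_t len;	/* total sectors in this zone */
	int ndevs;	/* devices contributing to this zone */
};

/* dev_sizes[] sorted ascending; returns the number of zones built */
static int make_zones(const sector_t *dev_sizes, int n, struct strip_zone *z)
{
	sector_t prev = 0, start = 0;
	int i, nz = 0;

	for (i = 0; i < n; i++) {
		if (dev_sizes[i] == prev)
			continue;	/* same size as previous: no new zone */
		z[nz].start = start;
		z[nz].len = (sector_t)(n - i) * (dev_sizes[i] - prev);
		z[nz].ndevs = n - i;
		start += z[nz].len;
		prev = dev_sizes[i];
		nz++;
	}
	return nz;
}

/* hash[i] names the first zone that may contain sector i * spacing */
static void build_hash(const struct strip_zone *z, int nz,
		       int *hash, int nhash, sector_t spacing)
{
	int i, zi = 0;

	for (i = 0; i < nhash; i++) {
		while (zi + 1 < nz &&
		       (sector_t)i * spacing >= z[zi].start + z[zi].len)
			zi++;
		hash[i] = zi;
	}
}

/* start from the hashed zone and probe forward; with a coarse table
 * this may take several probes before the right zone is found */
static struct strip_zone *find_zone(struct strip_zone *z, int nz,
				    const int *hash, sector_t spacing,
				    sector_t sector)
{
	int i = hash[sector / spacing];

	for (; i < nz; i++)
		if (sector < z[i].start + z[i].len)
			return &z[i];
	return 0;
}
```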
|
|
Sometimes raid0 and linear are required to take a single page bio that
spans two devices. We use bio_split to split such a bio into two.
At the same time, bio.h is included by linux/raid/md.h so
we don't include it elsewhere anymore.
We also modify the mergeable_bvec functions to allow a bvec
that doesn't fit if it is the first bvec to be added to
the bio, and are careful never to return a negative length from a
bvec_mergeable function.
|
|
From: Christoph Hellwig <hch@lst.de>
partition_name() is a variant of __bdevname() that caches results and
returns a pointer to kmalloc()ed data instead of printing into a buffer.
Due to its caching it gets utterly confused when the name for a dev_t
changes (can happen easily now with device mapper and probably in the
future with dynamic dev_t users).
It's only used by the raid code and most calls are through a wrapper,
bdev_partition_name() which takes a struct block_device * that may be
NULL.
The patch below changes bdev_partition_name() to call bdevname() where
possible, and converts the other calls, where we really have nothing more
than a dev_t, to __bdevname().
Btw, it would be nice if someone who knows the md code a bit better than me
could remove bdev_partition_name() in favour of direct calls to bdevname()
where possible - that would also get rid of the returns pointer to string
on stack issue that this patch can't fix yet.
|
|
|
|
Thanks to Angus Sawyer <angus.sawyer@dsl.pipex.com> and
Daniel McNeil <daniel@osdl.org>
|
|
Superblock format '1' resolves a number of issues with
superblock format '0'.
It is more dense and can support many more sub-devices.
It does not contain unneeded redundancy.
It adds a few new useful fields.
|
|
Normally the data stored on a component of a RAID array is stored
from the start of the device. This patch allows a per-device
data_offset so the data can start elsewhere. This will allow
RAID arrays where the metadata is at the head of the device
rather than the tail.
|
|
From: Angus Sawyer <angus.sawyer@dsl.pipex.com>
If there are no writes for 20 milliseconds, write out superblock
to mark array as clean. Write out superblock with
dirty flag before allowing any further write to succeed.
If an md thread gets signaled with SIGKILL, reduce the
delay to 0.
Also tidy up some printk's and make sure writing the
superblock isn't noisy.
|
|
The md_recoveryd thread is responsible for initiating and cleaning
up resync threads.
This job can be equally well done by the per-array threads
for those arrays which might need it.
So the mdrecoveryd thread is gone and the core code that
it ran is now run by raid5d, raid1d or multipathd.
We add an MD_RECOVERY_NEEDED flag so those daemons don't have
to bother trying to lock the md array unless it is likely
that something needs to be done.
Also modify the names of all threads to include the number of the
md device.
|
|
Md uses ->recovery_running and ->recovery_err to keep track of the
status of recovery. This is rather ad hoc and race prone.
This patch changes it to ->recovery which has bit flags for various
states.
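A sketch of what such a flags word looks like; the bit names and the active test here are assumptions for illustration, not necessarily those in the patch:

```c
#include <assert.h>

/* assumed bit assignments for a single ->recovery word */
#define MD_RECOVERY_RUNNING (1u << 0)	/* a recovery thread exists */
#define MD_RECOVERY_ERR     (1u << 1)	/* it hit an error */
#define MD_RECOVERY_INTR    (1u << 2)	/* it was interrupted */
#define MD_RECOVERY_DONE    (1u << 3)	/* finished, needs cleanup */

struct mddev_model {
	unsigned int recovery;
};

/* one word answers what used to need both recovery_running and
 * recovery_err, and can be updated with atomic bit operations */
static int recovery_active(const struct mddev_model *m)
{
	return (m->recovery & MD_RECOVERY_RUNNING) &&
	       !(m->recovery & MD_RECOVERY_DONE);
}
```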
|
|
From: Angus Sawyer <angus.sawyer@dsl.pipex.com>
Mainly a straightforward conversion of sprintf -> seq_printf. seq_start and
seq_next are modelled on /proc/partitions. Locking/ref counting as for
ITERATE_MDDEV.
pos == 0 -> header
pos == n -> nth mddev
pos == 0x10000 -> tail
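The pos convention amounts to a small classification step; a userspace model (the real seq_start/seq_next return mddev pointers, and the enum and helper names here are illustrative):

```c
#include <assert.h>

enum md_seq_item { MD_SEQ_HEADER, MD_SEQ_MDDEV, MD_SEQ_TAIL, MD_SEQ_STOP };

/* n_mddevs: number of arrays currently registered */
static enum md_seq_item md_seq_classify(unsigned long pos, int n_mddevs)
{
	if (pos == 0)
		return MD_SEQ_HEADER;	/* emit the header line(s) */
	if (pos == 0x10000)
		return MD_SEQ_TAIL;	/* emit the trailer */
	if (pos <= (unsigned long)n_mddevs)
		return MD_SEQ_MDDEV;	/* pos == n -> nth mddev */
	return MD_SEQ_STOP;		/* past the list: jump to tail next */
}
```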
|
|
When a raid1 or raid5 array is in 'safe-mode', then the array
is marked clean whenever there are no outstanding write requests,
and is marked dirty again before allowing any write request to
proceed.
This means that an unclean shutdown while no write activity is happening
will NOT cause a resync to be required. However it does mean extra
updates to the superblock.
Currently safe-mode is turned on by sending SIGKILL to the raid thread
as would happen at a normal shutdown. This should mean that the
reboot notifier is no longer needed.
After looking more at performance issues I may make safemode be on
all the time. I will almost certainly make it on when RAID5 is degraded
as an unclean shutdown of a degraded RAID5 means data loss.
This code was provided by Angus Sawyer <angus.sawyer@dsl.pipex.com>
|
|
This allows the thread to be easily identified and signalled.
The point of signalling will appear in the next patch.
|
|
Add a new field to the md superblock, in an unused area, to record where
resync was up to on a clean shutdown while resync is active. Restart from
this point.
The extra field is verified by having a second copy of the event counter.
If the second event counter is wrong, we ignore the extra field.
This patch thanks to Angus Sawyer <angus.sawyer@dsl.pipex.com>
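The validation rule can be sketched as follows; the field names are illustrative, not necessarily those used in the patch:

```c
#include <assert.h>
#include <stdint.h>

struct sb_checkpoint {
	uint64_t events;	/* main superblock event counter */
	uint64_t cp_events;	/* copy written alongside the checkpoint */
	uint64_t recovery_cp;	/* sector resync had reached */
};

/* sector to restart resync from: the saved position when the duplicate
 * event counter matches, else 0 (the extra field is ignored) */
static uint64_t resync_start(const struct sb_checkpoint *sb)
{
	if (sb->cp_events == sb->events)
		return sb->recovery_cp;
	return 0;
}
```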
|
|
Define an interface for interpreting and updating superblocks
so we can more easily define new formats.
With this patch, (almost) all superblock layout information is
located in a small set of routines dedicated to superblock
handling. This will allow us to provide a similar set for
a different format.
The two exceptions are:
1/ autostart_array where the devices listed in the superblock
are searched for.
2/ raid5 'knows' the maximum number of devices for
compute_parity.
These will be addressed in a later patch.
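Such an interface naturally becomes a per-format table of operations; a minimal userspace sketch, where the operation and type names are assumptions rather than the actual md interface:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct mddev_stub;	/* stands in for the real mddev */

struct super_type {
	const char *name;
	int  (*load_super)(struct mddev_stub *m, const void *buf);
	void (*sync_super)(const struct mddev_stub *m, void *buf);
};

/* one stub format; a version-1 format would get a second slot with
 * its own load/sync routines, and the rest of md stays unchanged */
static int  load_super_v0(struct mddev_stub *m, const void *buf)
{ (void)m; (void)buf; return 0; }
static void sync_super_v0(const struct mddev_stub *m, void *buf)
{ (void)m; (void)buf; }

static const struct super_type super_types[] = {
	{ "0.90.0", load_super_v0, sync_super_v0 },
};

static const struct super_type *find_super_type(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(super_types) / sizeof(super_types[0]); i++)
		if (strcmp(super_types[i].name, name) == 0)
			return &super_types[i];
	return NULL;
}
```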
|
|
|
|
From Peter Chubb
Compaq Smart array sector_t cleanup: prepare for possible 64-bit sector_t
Clean up loop device to allow huge backing files.
MD transition to 64-bit sector_t.
- Hold sizes and offsets as sector_t not int;
- use 64-bit arithmetic if necessary to map block-in-raid to zone
and block-in-zone
|
|
partition_name() moved from md.c to partitions/check.c; disk_name() is not
exported anymore; partition_name() takes dev_t instead of kdev_t.
|
|
* we remove partition 0 from ->part[] and put the old
contents of ->part[0] into gendisk itself; indexes are shifted, obviously.
* ->part is allocated at add_gendisk() time and freed at del_gendisk()
according to value of ->minor_shift; static arrays of hd_struct are gone
from drivers, ditto for manual allocations a-la ide. As a matter of fact,
none of the drivers know about struct hd_struct now.
|
|
raid1, raid5 and multipath maintain their own
'operational' flag. This is equivalent to
!rdev->faulty
and so isn't needed.
Similarly raid1 and raid5 maintain a "write_only" flag
that is equivalent to
!rdev->in_sync
so it isn't needed either.
As part of implementing this change, we introduce some extra
flag bits in raid5 that are meaningful only inside 'handle_stripe'.
Some of these replace the "action" array which recorded what
actions were required (and would be performed after the stripe
spinlock was released). This has the advantage of reducing our
dependence on MD_SB_DISKS which personalities shouldn't need
to know about.
|
|
This flag was used by multipath to make sure only
one superblock was written, as there is only one
real device.
The relevant test is now more explicitly dependent on multipath,
and the flag is gone.
|
|
1/ Personalities only know about raid_disks devices.
Some might be not in_sync and so cannot be read from,
but must be written to.
- change MD_SB_DISKS to ->raid_disks
- add tests for .write_only
2/ rdev->raid_disk is now -1 for spares. desc_nr is maintained
by analyse_sbs and sync_sbs.
3/ spare_inactive method is subsumed into hot_remove_disk
spare_writable is subsumed into hot_add_disk.
hot_add_disk decides which slot a new device will hold.
4/ spare_active now finds all non-in_sync devices and marks them
in_sync.
5/ faulty devices are removed by the md recovery thread as soon
as they are idle. Any spares that are available are then added.
|
|
This is equivalent to ->rdev != NULL, so it isn't needed.
|
|
This will allow us to know, in the event of a device failure, when the
device is completely unused and so can be disconnected from the
array. Currently this isn't a problem as drives aren't normally disconnected
until after a replacement has been rebuilt, which is a LONG TIME, but that
will change shortly...
We always increment the count under a spinlock after checking that
it hasn't been disconnected already (rdev != NULL).
We disconnect under the same spinlock after checking that the
count is zero.
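A userspace model of that discipline, with a pthread mutex standing in for the spinlock; the structure and function names are illustrative:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct rdev_model { int nr_pending; };
struct slot { struct rdev_model *rdev; };

static pthread_mutex_t md_lock = PTHREAD_MUTEX_INITIALIZER;

/* take a reference; fails if the device was already disconnected */
static struct rdev_model *rdev_get(struct slot *s)
{
	struct rdev_model *r = NULL;

	pthread_mutex_lock(&md_lock);
	if (s->rdev) {			/* check before incrementing */
		r = s->rdev;
		r->nr_pending++;
	}
	pthread_mutex_unlock(&md_lock);
	return r;
}

static void rdev_put(struct rdev_model *r)
{
	pthread_mutex_lock(&md_lock);
	r->nr_pending--;
	pthread_mutex_unlock(&md_lock);
}

/* disconnect succeeds only when no references remain */
static int rdev_disconnect(struct slot *s)
{
	int ok = 0;

	pthread_mutex_lock(&md_lock);
	if (s->rdev && s->rdev->nr_pending == 0) {
		s->rdev = NULL;
		ok = 1;
	}
	pthread_mutex_unlock(&md_lock);
	return ok;
}
```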
|
|
This simplifies the error handlers slightly, but allows for even more
simplification later.
|
|
Holding the rdev instead of the bdev does cause an extra
de-reference, but it is conceptually cleaner and will allow
lots more tidying up.
|
|
get_spare recently became static and no-one told md_k.h
|
|
Get rid of dev in rdev and use bdev exclusively.
There is an awkwardness here in that userspace sometimes
passed down a dev_t (e.g. hot_add_disk) and sometimes
a major and a minor (e.g. add_new_disk). Should we convert
both to kdev_t as the uniform standard?
That is what was being done, but it seemed very clumsy and
things were getting converted back and forth a lot.
As bdget used a dev_t, I felt safe in staying with dev_t once I
had one rather than converting to kdev_t and back.
|
|
Remove the sb from the mddev
Now that all the important information is in mddev, we don't need
to have an sb off the mddev. We only keep the per-device ones.
Previously we determined if "set_array_info" had been run by checking
mddev->sb. Now we check mddev->raid_disks on the assumption that
any valid array MUST have a non-zero number of devices.
|
|
Remove dependance on superblock
All the remaining fields of interest in the superblock
get duplicated in the mddev structure and this is treated as
authoritative. The superblock gets completely generated at
write time, and all useful information extracted at read time.
This means that we can slot in different superblock formats
without affecting the bulk of the code.
|
|
Move persistent from superblock to mddev
Tidy up calc_dev_sboffset and calc_dev_size on the way.
|
|
Remove number and raid_disk from personality arrays
These are redundant. number is not needed any more;
raid_disk never was, as it is the index into the array.
|
|
nr_disks is gone from multipath/raid1
Never used.
|
|
Remove old_dev field.
We used to monitor the previous device number of a
component device for superblock maintenance. This is
not needed any more.
|
|
Don't maintain disc status in superblock.
The state is now in rdev so we don't maintain it
in superblock any more.
We also no longer test the content of the superblock for
disk status.
mddev->spare is now an rdev and not a superblock fragment.
|
|
Add "degraded" field to md device
This is used to determine if a spare should be added
without relying on the superblock.
|
|
Add in_sync flag to each rdev
This currently mirrors the MD_DISK_SYNC superblock flag,
but soon it will be authoritative and the superblock will
only be consulted at start time.
|
|
Add raid_disk field to rdev
Also change find_rdev_nr to find based on position
in array (raid_disk) not position in superblock (number).
|
|
Improve handling of spares in md
- hot_remove_disk is given the raid_disk rather than the descriptor number
so that it can find the device in the internal array directly, with no search.
- spare_inactive now uses mddev->spare->raid_disk instead of
mddev->spare->number so it can find the device directly without searching
- spare_write does not need number. It can use mddev->spare->raid_disk as above.
- spare_active does not need &mddev->spare. It finds the descriptor directly
and fixes it without this pointer
|
|
Remove concept of 'spare' drive for multipath.
Multipath now treats all working devices as
active and does io to the first working one.
|
|
Move md_update_sb calls
When a change which requires a superblock update happens
at interrupt time, we currently set a flag (sb_dirty) and
wake up the per-array thread (raid1d/raid5d/multipathd) to
do the actual update.
This patch centralises this. The sb_update is now done
by the mdrecoveryd thread. As this is always woken up after
the error handler is called, we don't need the call to wakeup
the local thread any more.
With this, we don't need "md_update_sb" to lock the array
any more and only use __md_update_sb which is local to md.c
So we rename __md_update_sb back to md_update_sb and stop
exporting it.
|
|
Pass the correct bdev to md_error
After a call to generic_make_request, bio->bi_bdev can have changed
(e.g. by a remapping driver like raid0). So we cannot trust it for reporting
the source of an error. This patch takes care to find the correct
bdev.
|