<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/drivers/md/raid1.h, branch v4.9.74</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.9.74</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.9.74'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2015-10-12T06:32:05Z</updated>
<entry>
<title>md-cluster: Use a small window for resync</title>
<updated>2015-10-12T06:32:05Z</updated>
<author>
<name>Goldwyn Rodrigues</name>
<email>rgoldwyn@suse.com</email>
</author>
<published>2015-08-18T22:14:42Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c40f341f1e7fd4eddcfc5881d94cfa8669071ee6'/>
<id>urn:sha1:c40f341f1e7fd4eddcfc5881d94cfa8669071ee6</id>
<content type='text'>
Suspending the entire device for resync could take too long. Resync
in small chunks.

cluster's resync window (32M) is maintained in r1conf as
cluster_sync_low and cluster_sync_high and processed in
raid1's sync_request(). If the current resync is outside the cluster
resync window:

1. Set the cluster_sync_low to curr_resync_completed.
2. Check if the sync will fit in the new window, if not issue a
   wait_barrier() and set cluster_sync_low to sector_nr.
3. Set cluster_sync_high to cluster_sync_low + resync_window.
4. Send a message to all nodes so they may add it in their suspension
   list.

bitmap_cond_end_sync is modified to allow to force a sync inorder
to get the curr_resync_completed uptodate with the sector passed.

Signed-off-by: Goldwyn Rodrigues &lt;rgoldwyn@suse.com&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
</content>
</entry>
<entry>
<title>md/raid1: ensure device failure recorded before write request returns.</title>
<updated>2015-08-31T17:43:23Z</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.com</email>
</author>
<published>2015-08-14T01:11:10Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=55ce74d4bfe1b9444436264c637f39a152d1e5ac'/>
<id>urn:sha1:55ce74d4bfe1b9444436264c637f39a152d1e5ac</id>
<content type='text'>
When a write to one of the legs of a RAID1 fails, the failure is
recorded in the metadata of the other leg(s) so that after a restart
the data on the failed drive wont be trusted even if that drive seems
to be working again  (maybe a cable was unplugged).

Similarly when we record a bad-block in response to a write failure,
we must not let the write complete until the bad-block update is safe.

Currently there is no interlock between the write request completing
and the metadata update.  So it is possible that the write will
complete, the app will confirm success in some way, and then the
machine will crash before the metadata update completes.

This is an extremely small hole for a racy to fit in, but it is
theoretically possible and so should be closed.

So:
 - set MD_CHANGE_PENDING when requesting a metadata update for a
   failed device, so we can know with certainty when it completes
 - queue requests that experienced an error on a new queue which
   is only processed after the metadata update completes
 - call raid_end_bio_io() on bios in that queue when the time comes.

Signed-off-by: NeilBrown &lt;neilb@suse.com&gt;
</content>
</entry>
<entry>
<title>md: make -&gt;congested robust against personality changes.</title>
<updated>2015-02-03T21:35:52Z</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.de</email>
</author>
<published>2014-12-15T01:56:56Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5c675f83c68fbdf9c0e103c1090b06be747fa62c'/>
<id>urn:sha1:5c675f83c68fbdf9c0e103c1090b06be747fa62c</id>
<content type='text'>
There is currently no locking around calls to the 'congested'
bdi function.  If called at an awkward time while an array is
being converted from one level (or personality) to another, there
is a tiny chance of running code in an unreferenced module etc.

So add a 'congested' function to the md_personality operations
structure, and call it with appropriate locking from a central
'mddev_congested'.

When the array personality is changing the array will be 'suspended'
so no IO is processed.
If mddev_congested detects this, it simply reports that the
array is congested, which is a safe guess.
As mddev_suspend calls synchronize_rcu(), mddev_congested can
avoid races by included the whole call inside an rcu_read_lock()
region.
This require that the congested functions for all subordinate devices
can be run under rcu_lock.  Fortunately this is the case.

Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
</content>
</entry>
<entry>
<title>md: remove unwanted white space from md.c</title>
<updated>2014-10-14T02:08:29Z</updated>
<author>
<name>NeilBrown</name>
<email>neilb@suse.de</email>
</author>
<published>2014-09-30T04:23:59Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f72ffdd68616e3697bc782b21c82197aeb480fd5'/>
<id>urn:sha1:f72ffdd68616e3697bc782b21c82197aeb480fd5</id>
<content type='text'>
My editor shows much of this is RED.

Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
</content>
</entry>
<entry>
<title>raid1: Rewrite the implementation of iobarrier.</title>
<updated>2013-11-19T04:19:18Z</updated>
<author>
<name>majianpeng</name>
<email>majianpeng@gmail.com</email>
</author>
<published>2013-11-15T06:55:02Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=79ef3a8aa1cb1523cc231c9a90a278333c21f761'/>
<id>urn:sha1:79ef3a8aa1cb1523cc231c9a90a278333c21f761</id>
<content type='text'>
There is an iobarrier in raid1 because of contention between normal IO and
resync IO.  It suspends all normal IO when resync/recovery happens.

However if normal IO is out side the resync window, there is no contention.
So this patch changes the barrier mechanism to only block IO that
could contend with the resync that is currently happening.

We partition the whole space into five parts.
|---------|-----------|------------|----------------|-------|
        start   next_resync   start_next_window    end_window

start + RESYNC_WINDOW = next_resync
next_resync + NEXT_NORMALIO_DISTANCE = start_next_window
start_next_window + NEXT_NORMALIO_DISTANCE = end_window

Firstly we introduce some concepts:

1 - RESYNC_WINDOW: For resync, there are 32 resync requests at most at the
      same time. A sync request is RESYNC_BLOCK_SIZE(64*1024).
      So the RESYNC_WINDOW is 32 * RESYNC_BLOCK_SIZE, that is 2MB.
2 - NEXT_NORMALIO_DISTANCE: the distance between next_resync
      and start_next_window.  It also indicates the distance between
      start_next_window and end_window.
      It is currently 3 * RESYNC_WINDOW_SIZE but could be tuned if
      this turned out not to be optimal.
3 - next_resync: the next sector at which we will do sync IO.
4 - start: a position which is at most RESYNC_WINDOW before
      next_resync.
5 - start_next_window:  a position which is NEXT_NORMALIO_DISTANCE
      beyond next_resync.  Normal-io after this position doesn't need to
      wait for resync-io to complete.
6 - end_window:  a position which is 2 * NEXT_NORMALIO_DISTANCE beyond
      next_resync.  This also doesn't need to wait, but is counted
      differently.
7 - current_window_requests:  the count of normalIO between
      start_next_window and end_window.
8 - next_window_requests: the count of normalIO after end_window.

NormalIO will be partitioned into four types:

NormIO1:  the end sector of bio is smaller or equal the start
NormIO2:  the start sector of bio larger or equal to end_window
NormIO3:  the start sector of bio larger or equal to
          start_next_window.
NormIO4:  the location between start_next_window and end_window

|--------|-----------|--------------------|----------------|-------------|
    | start   |   next_resync   |  start_next_window   |  end_window |
 NormIO1   NormIO4            NormIO4                NormIO3      NormIO2

For NormIO1, we don't need any io barrier.
For NormIO4, we used a similar approach to the original iobarrier
    mechanism.  The normalIO and resyncIO must be kept separate.
For NormIO2/3, we add two fields to struct r1conf: "current_window_requests"
    and "next_window_requests". They indicate the count of active
    requests in the two window.
    For these, we don't wait for resync io to complete.

For resync action, if there are NormIO4s, we must wait for it.
If not, we can proceed.
But if resync action reaches start_next_window and
current_window_requests &gt; 0 (that is there are NormIO3s), we must
wait until the current_window_requests becomes zero.
When current_window_requests becomes zero,  start_next_window also
moves forward. Then current_window_requests will replaced by
next_window_requests.

There is a problem which when and how to change from NormIO2 to
NormIO3.  Only then can sync action progress.

We add a field in struct r1conf "start_next_window".

A: if start_next_window == MaxSector, it means there are no NormIO2/3.
   So start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
B: if current_window_requests == 0 &amp;&amp; next_window_requests != 0, it
   means start_next_window move to end_window

There is another problem which how to differentiate between
old NormIO2(now it is NormIO3) and NormIO2.
For example, there are many bios which are NormIO2 and a bio which is
NormIO3. NormIO3 firstly completed, so the bios of NormIO2 became NormIO3.

We add a field in struct r1bio "start_next_window".
This is used to record the position conf-&gt;start_next_window when the call
to wait_barrier() is made in make_request().

In allow_barrier(), we check the conf-&gt;start_next_window.
If r1bio-&gt;stat_next_window == conf-&gt;start_next_window, it means
there is no transition between NormIO2 and NormIO3.
If r1bio-&gt;start_next_window != conf-&gt;start_next_window, it mean
there was a transition between NormIO2 and NormIO3.  There can only
have been one transition.  So it only means the bio is old NormIO2.

For one bio, there may be many r1bio's. So we make sure
all the r1bio-&gt;start_next_window are the same value.
If we met blocked_dev in make_request(), it must call allow_barrier
and wait_barrier. So the former and the later value of
conf-&gt;start_next_window will be change.
If there are many r1bio's with differnet start_next_window,
for the relevant bio, it depend on the last value of r1bio.
It will cause error. To avoid this, we must wait for previous r1bios
to complete.

Signed-off-by: Jianpeng Ma &lt;majianpeng@gmail.com&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
</content>
</entry>
<entry>
<title>raid1: Add a field array_frozen to indicate whether raid in freeze state.</title>
<updated>2013-11-19T04:19:18Z</updated>
<author>
<name>majianpeng</name>
<email>majianpeng@gmail.com</email>
</author>
<published>2013-11-14T04:16:18Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b364e3d048e49b1d177eb7ee7853e77aa0560464'/>
<id>urn:sha1:b364e3d048e49b1d177eb7ee7853e77aa0560464</id>
<content type='text'>
Because the following patch will rewrite the content between normal IO
and resync IO. So we used a parameter to indicate whether raid is in freeze
array.

Signed-off-by: Jianpeng Ma &lt;majianpeng@gmail.com&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
</content>
</entry>
<entry>
<title>md/raid1: prevent merging too large request</title>
<updated>2012-07-31T00:03:53Z</updated>
<author>
<name>Shaohua Li</name>
<email>shli@kernel.org</email>
</author>
<published>2012-07-31T00:03:53Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=12cee5a8a29e7263e39953f1d941f723c617ca5f'/>
<id>urn:sha1:12cee5a8a29e7263e39953f1d941f723c617ca5f</id>
<content type='text'>
For SSD, if request size exceeds specific value (optimal io size), request size
isn't important for bandwidth. In such condition, if making request size bigger
will cause some disks idle, the total throughput will actually drop. A good
example is doing a readahead in a two-disk raid1 setup.

So when should we split big requests? We absolutly don't want to split big
request to very small requests. Even in SSD, big request transfer is more
efficient. This patch only considers request with size above optimal io size.

If all disks are busy, is it worth doing a split? Say optimal io size is 16k,
two requests 32k and two disks. We can let each disk run one 32k request, or
split the requests to 4 16k requests and each disk runs two. It's hard to say
which case is better, depending on hardware.

So only consider case where there are idle disks. For readahead, split is
always better in this case. And in my test, below patch can improve &gt; 30%
thoughput. Hmm, not 100%, because disk isn't 100% busy.

Such case can happen not just in readahead, for example, in directio. But I
suppose directio usually will have bigger IO depth and make all disks busy, so
I ignored it.

Note: if the raid uses any hard disk, we don't prevent merging. That will make
performace worse.

Signed-off-by: Shaohua Li &lt;shli@fusionio.com&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
</content>
</entry>
<entry>
<title>md/raid1: make sequential read detection per disk based</title>
<updated>2012-07-31T00:03:53Z</updated>
<author>
<name>Shaohua Li</name>
<email>shli@kernel.org</email>
</author>
<published>2012-07-31T00:03:53Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=be4d3280b17bc51f23ec6ebb345728f302f80a0c'/>
<id>urn:sha1:be4d3280b17bc51f23ec6ebb345728f302f80a0c</id>
<content type='text'>
Currently the sequential read detection is global wide. It's natural to make it
per disk based, which can improve the detection for concurrent multiple
sequential reads. And next patch will make SSD read balance not use distance
based algorithm, where this change help detect truly sequential read for SSD.

Signed-off-by: Shaohua Li &lt;shli@fusionio.com&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
</content>
</entry>
<entry>
<title>MD: Move macros from raid1*.h to raid1*.c</title>
<updated>2012-07-31T00:03:52Z</updated>
<author>
<name>Jonathan Brassow</name>
<email>jbrassow@redhat.com</email>
</author>
<published>2012-07-31T00:03:52Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=473e87ce485ffcac041f7911b33f0b4cd4d6cf2b'/>
<id>urn:sha1:473e87ce485ffcac041f7911b33f0b4cd4d6cf2b</id>
<content type='text'>
MD RAID1/RAID10: Move some macros from .h file to .c file

There are three macros (IO_BLOCKED,IO_MADE_GOOD,BIO_SPECIAL) which are defined
in both raid1.h and raid10.h.  They are only used in there respective .c files.
However, if we wish to make RAID10 accessible to the device-mapper RAID
target (dm-raid.c), then we need to move these macros into the .c files where
they are used so that they do not conflict with each other.

The macros from the two files are identical and could be moved into md.h, but
I chose to leave the duplication and have them remain in the personality
files.

Signed-off-by: Jonathan Brassow &lt;jbrassow@redhat.com&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
</content>
</entry>
<entry>
<title>MD RAID1: rename mirror_info structure</title>
<updated>2012-07-31T00:03:52Z</updated>
<author>
<name>Jonathan Brassow</name>
<email>jbrassow@redhat.com</email>
</author>
<published>2012-07-31T00:03:52Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=0eaf822cb3dfcf2a64b2d27f4f6219186adb2695'/>
<id>urn:sha1:0eaf822cb3dfcf2a64b2d27f4f6219186adb2695</id>
<content type='text'>
MD RAID1: Rename the structure 'mirror_info' to 'raid1_info'

The same structure name ('mirror_info') is used by raid10.  Each of these
structures are defined in there respective header files.  If dm-raid is
to support both RAID1 and RAID10, the header files will be included and
the structure names must not collide.  While only one of these structure
names needs to change, this patch adds consistency to the naming of the
structure.

Signed-off-by: Jonathan Brassow &lt;jbrassow@redhat.com&gt;
Signed-off-by: NeilBrown &lt;neilb@suse.de&gt;
</content>
</entry>
</feed>
