<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/drivers/md/dm-snap.c, branch v5.9.2</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.9.2</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.9.2'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2020-07-01T13:27:24Z</updated>
<entry>
<title>block: rename generic_make_request to submit_bio_noacct</title>
<updated>2020-07-01T13:27:24Z</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2020-07-01T08:59:44Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ed00aabd5eb9fb44d6aff1173234a2e911b9fead'/>
<id>urn:sha1:ed00aabd5eb9fb44d6aff1173234a2e911b9fead</id>
<content type='text'>
generic_make_request has always been very confusingly misnamed, so rename
it to submit_bio_noacct to make it clear that it is submit_bio minus
accounting and a few checks.

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>dm snapshot: use true/false for bool variable</title>
<updated>2020-01-07T17:07:17Z</updated>
<author>
<name>zhengbin</name>
<email>zhengbin13@huawei.com</email>
</author>
<published>2019-12-24T06:38:02Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=1d1dda8ca8ca0586df87d898e59cd69110d3d74b'/>
<id>urn:sha1:1d1dda8ca8ca0586df87d898e59cd69110d3d74b</id>
<content type='text'>
Fixes coccicheck warning:

drivers/md/dm-snap.c:1064:3-18: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-snap.c:1152:1-16: WARNING: Assignment of 0/1 to bool variable
drivers/md/dm-snap.c:1317:1-16: WARNING: Assignment of 0/1 to bool variable

Reported-by: Hulk Robot &lt;hulkci@huawei.com&gt;
Signed-off-by: zhengbin &lt;zhengbin13@huawei.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm snapshot: rework COW throttling to fix deadlock</title>
<updated>2019-10-10T13:46:05Z</updated>
<author>
<name>Mikulas Patocka</name>
<email>mpatocka@redhat.com</email>
</author>
<published>2019-10-02T10:15:53Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b21555786f18cd77f2311ad89074533109ae3ffa'/>
<id>urn:sha1:b21555786f18cd77f2311ad89074533109ae3ffa</id>
<content type='text'>
Commit 721b1d98fb517a ("dm snapshot: Fix excessive memory usage and
workqueue stalls") introduced a semaphore to limit the maximum number of
in-flight kcopyd (COW) jobs.

The implementation of this throttling mechanism is prone to a deadlock:

1. One or more threads write to the origin device causing COW, which is
   performed by kcopyd.

2. At some point some of these threads might reach the s-&gt;cow_count
   semaphore limit and block in down(&amp;s-&gt;cow_count), holding a read lock
   on _origins_lock.

3. Someone tries to acquire a write lock on _origins_lock, e.g.,
   snapshot_ctr(), which blocks because the threads at step (2) already
   hold a read lock on it.

4. A COW operation completes and kcopyd runs dm-snapshot's completion
   callback, which ends up calling pending_complete().
   pending_complete() tries to resubmit any deferred origin bios. This
   requires acquiring a read lock on _origins_lock, which blocks.

   This happens because the read-write semaphore implementation gives
   priority to writers, meaning that as soon as a writer tries to enter
   the critical section, no readers will be allowed in, until all
   writers have completed their work.

   So, pending_complete() waits for the writer at step (3) to acquire
   and release the lock. This writer waits for the readers at step (2)
   to release the read lock and those readers wait for
   pending_complete() (the kcopyd thread) to signal the s-&gt;cow_count
   semaphore: DEADLOCK.

The above was thoroughly analyzed and documented by Nikos Tsironis as
part of his initial proposal for fixing this deadlock, see:
https://www.redhat.com/archives/dm-devel/2019-October/msg00001.html

Fix this deadlock by reworking COW throttling so that it waits without
holding any locks. Add a variable 'in_progress' that counts how many
kcopyd jobs are running. A function wait_for_in_progress() will sleep if
'in_progress' is over the limit. It drops _origins_lock in order to
avoid the deadlock.

Reported-by: Guruswamy Basavaiah &lt;guru2018@gmail.com&gt;
Reported-by: Nikos Tsironis &lt;ntsironis@arrikto.com&gt;
Reviewed-by: Nikos Tsironis &lt;ntsironis@arrikto.com&gt;
Tested-by: Nikos Tsironis &lt;ntsironis@arrikto.com&gt;
Fixes: 721b1d98fb51 ("dm snapshot: Fix excessive memory usage and workqueue stalls")
Cc: stable@vger.kernel.org # v5.0+
Depends-on: 4a3f111a73a8c ("dm snapshot: introduce account_start_copy() and account_end_copy()")
Signed-off-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm snapshot: introduce account_start_copy() and account_end_copy()</title>
<updated>2019-10-10T13:45:48Z</updated>
<author>
<name>Mikulas Patocka</name>
<email>mpatocka@redhat.com</email>
</author>
<published>2019-10-02T10:14:17Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a2f83e8b0c82c9500421a26c49eb198b25fcdea3'/>
<id>urn:sha1:a2f83e8b0c82c9500421a26c49eb198b25fcdea3</id>
<content type='text'>
This simple refactoring moves code for modifying the semaphore cow_count
into separate functions to prepare for changes that will extend these
methods to provide for a more sophisticated mechanism for COW
throttling.

Signed-off-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Reviewed-by: Nikos Tsironis &lt;ntsironis@arrikto.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm snapshot: fix oversights in optional discard support</title>
<updated>2019-07-17T15:12:30Z</updated>
<author>
<name>Mike Snitzer</name>
<email>snitzer@redhat.com</email>
</author>
<published>2019-07-17T15:12:30Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3ee25485ba8e8271fe9401eef5003c20ab648ddf'/>
<id>urn:sha1:3ee25485ba8e8271fe9401eef5003c20ab648ddf</id>
<content type='text'>
__find_snapshots_sharing_cow() should always be used with _origins_lock
held so fix snapshot_io_hints() accordingly.  Also, once a snapshot is
being merged discards must not be allowed -- otherwise incorrect or
duplicate work will be performed.

Fixes: 2e6023850e177d ("dm snapshot: add optional discard support features")
Reported-by: Nikos Tsironis &lt;ntsironis@arrikto.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm snapshot: add optional discard support features</title>
<updated>2019-07-12T13:59:36Z</updated>
<author>
<name>Mike Snitzer</name>
<email>snitzer@redhat.com</email>
</author>
<published>2019-06-19T21:05:54Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=2e6023850e177dbaca21498ada04c5a5ac93f812'/>
<id>urn:sha1:2e6023850e177dbaca21498ada04c5a5ac93f812</id>
<content type='text'>
discard_zeroes_cow - a discard issued to the snapshot device that maps
to entire chunks to will zero the corresponding exception(s) in the
snapshot's exception store.

discard_passdown_origin - a discard to the snapshot device is passed down
to the snapshot-origin's underlying device.  This doesn't cause copy-out
to the snapshot exception store because the snapshot-origin target is
bypassed.

The discard_passdown_origin feature depends on the discard_zeroes_cow
feature being enabled.

When these 2 features are enabled they allow a temporarily read-only
device that has completely exhausted its free space to recover space.
To do so dm-snapshot provides temporary buffer to accommodate writes
that the temporarily read-only device cannot handle yet.  Once the upper
layer frees space (e.g. fstrim to XFS) the discards issued to the
dm-snapshot target will be issued to underlying read-only device whose
free space was exhausted.  In addition those discards will also cause
zeroes to be written to the snapshot exception store if corresponding
exceptions exist.  If the underlying origin device provides
deduplication for zero blocks then if/when the snapshot is merged backed
to the origin those blocks will become unused.  Once the origin has
gained adequate space, merging the snapshot back to the thinly
provisioned device will permit continued use of that device without the
temporary space provided by the snapshot.

Requested-by: John Dorminy &lt;jdorminy@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm snapshot: Use fine-grained locking scheme</title>
<updated>2019-04-18T20:18:30Z</updated>
<author>
<name>Nikos Tsironis</name>
<email>ntsironis@arrikto.com</email>
</author>
<published>2019-03-17T12:22:58Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3f1637f2103822f9cab4c927d929db57ac4fd933'/>
<id>urn:sha1:3f1637f2103822f9cab4c927d929db57ac4fd933</id>
<content type='text'>
Substitute the global locking scheme with a fine grained one, employing
the read-write semaphore and the scalable exception tables with
per-bucket locks introduced by the previous two commits.

Summarizing, we now use a read-write semaphore to protect the mostly
read fields of the snapshot structure, e.g., valid, active, etc., and
per-bucket bit spinlocks to protect accesses to the complete and pending
exception tables.

Finally, we use an extra spinlock (pe_allocation_lock) to serialize the
allocation of new exceptions by the exception store. This allocation is
really fast, so the extra spinlock doesn't hurt the performance.

This scheme allows dm-snapshot to scale better, resulting in increased
IOPS and reduced latency.

Following are some benchmark results using the null_blk device:

  modprobe null_blk gb=1024 bs=512 submit_queues=8 hw_queue_depth=4096 \
   queue_mode=2 irqmode=1 completion_nsec=1 nr_devices=1

* Benchmark fio_origin_randwrite_throughput_N, from the device mapper
  test suite [1] (direct IO, random 4K writes to origin device, IO
  engine libaio):

  +--------------+-------------+------------+
  | # of workers | IOPS Before | IOPS After |
  +--------------+-------------+------------+
  |      1       |    57708    |   66421    |
  |      2       |    63415    |   77589    |
  |      4       |    67276    |   98839    |
  |      8       |    60564    |   109258   |
  +--------------+-------------+------------+

* Benchmark fio_origin_randwrite_latency_N, from the device mapper test
  suite [1] (direct IO, random 4K writes to origin device, IO engine
  psync):

  +--------------+-----------------------+----------------------+
  | # of workers | Latency (usec) Before | Latency (usec) After |
  +--------------+-----------------------+----------------------+
  |      1       |         16.25         |        13.27         |
  |      2       |         31.65         |        25.08         |
  |      4       |         55.28         |        41.08         |
  |      8       |         121.47        |        74.44         |
  +--------------+-----------------------+----------------------+

* Benchmark fio_snapshot_randwrite_throughput_N, from the device mapper
  test suite [1] (direct IO, random 4K writes to snapshot device, IO
  engine libaio):

  +--------------+-------------+------------+
  | # of workers | IOPS Before | IOPS After |
  +--------------+-------------+------------+
  |      1       |    72593    |   84938    |
  |      2       |    97379    |   134973   |
  |      4       |    90610    |   143077   |
  |      8       |    90537    |   180085   |
  +--------------+-------------+------------+

* Benchmark fio_snapshot_randwrite_latency_N, from the device mapper
  test suite [1] (direct IO, random 4K writes to snapshot device, IO
  engine psync):

  +--------------+-----------------------+----------------------+
  | # of workers | Latency (usec) Before | Latency (usec) After |
  +--------------+-----------------------+----------------------+
  |      1       |         12.53         |         10.6         |
  |      2       |         19.78         |        14.89         |
  |      4       |         40.37         |        23.47         |
  |      8       |         89.32         |        48.48         |
  +--------------+-----------------------+----------------------+

[1] https://github.com/jthornber/device-mapper-test-suite

Co-developed-by: Ilias Tsitsimpis &lt;iliastsi@arrikto.com&gt;
Signed-off-by: Nikos Tsironis &lt;ntsironis@arrikto.com&gt;
Acked-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm snapshot: Make exception tables scalable</title>
<updated>2019-04-18T20:18:29Z</updated>
<author>
<name>Nikos Tsironis</name>
<email>ntsironis@arrikto.com</email>
</author>
<published>2019-03-17T12:22:57Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f79ae415b64c35d9ecca159fe796cf98d2ff9e9c'/>
<id>urn:sha1:f79ae415b64c35d9ecca159fe796cf98d2ff9e9c</id>
<content type='text'>
Use list_bl to implement the exception hash tables' buckets. This change
permits concurrent access, to distinct buckets, by multiple threads.

Also, implement helper functions to lock and unlock the exception tables
based on the chunk number of the exception at hand.

We retain the global locking, by means of down_write(), which is
replaced by the next commit.

Still, we must acquire the per-bucket spinlocks when accessing the hash
tables, since list_bl does not allow modification on unlocked lists.

Co-developed-by: Ilias Tsitsimpis &lt;iliastsi@arrikto.com&gt;
Signed-off-by: Nikos Tsironis &lt;ntsironis@arrikto.com&gt;
Acked-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm snapshot: Replace mutex with rw semaphore</title>
<updated>2019-04-18T20:18:28Z</updated>
<author>
<name>Nikos Tsironis</name>
<email>ntsironis@arrikto.com</email>
</author>
<published>2019-03-17T12:22:56Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4ad8d880b6c4497e365fb6fd16bab52e9974a3f6'/>
<id>urn:sha1:4ad8d880b6c4497e365fb6fd16bab52e9974a3f6</id>
<content type='text'>
dm-snapshot uses a single mutex to serialize every access to the
snapshot state. This includes all accesses to the complete and pending
exception tables, which occur at every origin write, every snapshot
read/write and every exception completion.

The lock statistics indicate that this mutex is a bottleneck (average
wait time ~480 usecs for 8 processes doing random 4K writes to the
origin device) preventing dm-snapshot to scale as the number of threads
doing IO increases.

The major contention points are __origin_write()/snapshot_map() and
pending_complete(), i.e., the submission and completion of pending
exceptions.

Replace this mutex with a rw semaphore.

We essentially revert commit ae1093be5a0ef9 ("dm snapshot: use mutex
instead of rw_semaphore") and together with the next two patches we
substitute the single mutex with a fine-grained locking scheme, where we
use a read-write semaphore to protect the mostly read fields of the
snapshot structure, e.g., valid, active, etc., and per-bucket bit
spinlocks to protect accesses to the complete and pending exception
tables.

Co-developed-by: Ilias Tsitsimpis &lt;iliastsi@arrikto.com&gt;
Signed-off-by: Nikos Tsironis &lt;ntsironis@arrikto.com&gt;
Acked-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
<entry>
<title>dm snapshot: Don't sleep holding the snapshot lock</title>
<updated>2019-04-18T20:18:27Z</updated>
<author>
<name>Nikos Tsironis</name>
<email>ntsironis@arrikto.com</email>
</author>
<published>2019-03-17T12:22:55Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=65fc7c37047797a0a7b80d0ad2c063deda569337'/>
<id>urn:sha1:65fc7c37047797a0a7b80d0ad2c063deda569337</id>
<content type='text'>
When completing a pending exception, pending_complete() waits for all
conflicting reads to drain, before inserting the final, completed
exception. Conflicting reads are snapshot reads redirected to the
origin, because the relevant chunk is not remapped to the COW device the
moment we receive the read.

The completed exception must be inserted into the exception table after
all conflicting reads drain to ensure snapshot reads don't return
corrupted data. This is required because inserting the completed
exception into the exception table signals that the relevant chunk is
remapped and both origin writes and snapshot merging will now overwrite
the chunk in origin.

This wait is done holding the snapshot lock to ensure that
pending_complete() doesn't starve if new snapshot reads keep coming for
this chunk.

In preparation for the next commit, where we use a spinlock instead of a
mutex to protect the exception tables, we remove the need for holding
the lock while waiting for conflicting reads to drain.

We achieve this in two steps:

1. pending_complete() inserts the completed exception before waiting for
   conflicting reads to drain and removes the pending exception after
   all conflicting reads drain.

   This ensures that new snapshot reads will be redirected to the COW
   device, instead of the origin, and thus pending_complete() will not
   starve. Moreover, we use the existence of both a completed and
   a pending exception to signify that the COW is done but there are
   conflicting reads in flight.

2. In __origin_write() we check first if there is a pending exception
   and then if there is a completed exception. If there is a pending
   exception any submitted BIO is delayed on the pe-&gt;origin_bios list and
   DM_MAPIO_SUBMITTED is returned. This ensures that neither writes to the
   origin nor snapshot merging can overwrite the origin chunk, until all
   conflicting reads drain, and thus snapshot reads will not return
   corrupted data.

Summarizing, we now have the following possible combinations of pending
and completed exceptions for a chunk, along with their meaning:

A. No exceptions exist: The chunk has not been remapped yet.
B. Only a pending exception exists: The chunk is currently being copied
   to the COW device.
C. Both a pending and a completed exception exist: COW for this chunk
   has completed but there are snapshot reads in flight which had been
   redirected to the origin before the chunk was remapped.
D. Only the completed exception exists: COW has been completed and there
   are no conflicting reads in flight.

Co-developed-by: Ilias Tsitsimpis &lt;iliastsi@arrikto.com&gt;
Signed-off-by: Nikos Tsironis &lt;ntsironis@arrikto.com&gt;
Acked-by: Mikulas Patocka &lt;mpatocka@redhat.com&gt;
Signed-off-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
</content>
</entry>
</feed>
