<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/drivers/virtio/virtio_ring.c, branch v4.4</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.4</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v4.4'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2015-12-07T15:28:11Z</updated>
<entry>
<title>virtio_ring: shadow available ring flags &amp; index</title>
<updated>2015-12-07T15:28:11Z</updated>
<author>
<name>Venkatesh Srinivas</name>
<email>venkateshs@google.com</email>
</author>
<published>2015-11-11T00:21:07Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f277ec42f38f406ed073569b24b04f1c291fec0f'/>
<id>urn:sha1:f277ec42f38f406ed073569b24b04f1c291fec0f</id>
<content type='text'>
Improves cacheline transfer flow of available ring header.

Virtqueues are implemented as a pair of rings, one producer-&gt;consumer
avail ring and one consumer-&gt;producer used ring; preceding the
avail ring in memory are two contiguous u16 fields -- avail-&gt;flags
and avail-&gt;idx. A producer posts work by writing to avail-&gt;idx and
a consumer reads avail-&gt;idx.

The flags and idx fields only need to be written by a producer CPU
and only read by a consumer CPU; when the producer and consumer are
running on different CPUs and the virtio_ring code is structured to
only have source writes/sink reads, we can continuously transfer the
avail header cacheline between 'M' states between cores. This flow
optimizes core -&gt; core bandwidth on certain CPUs.

(see: "Software Optimization Guide for AMD Family 15h Processors",
Section 11.6; similar language appears in the 10h guide and should
apply to CPUs w/ exclusive caches, using LLC as a transfer cache)

Unfortunately the existing virtio_ring code issued reads to the
avail-&gt;idx and read-modify-writes to avail-&gt;flags on the producer.

This change shadows the flags and index fields in producer memory;
the vring code now reads from the shadows and only ever writes to
avail-&gt;flags and avail-&gt;idx, allowing the cacheline to transfer
core -&gt; core optimally.

In a concurrent version of vring_bench, the time required for
10,000,000 buffer checkout/returns was reduced by ~2% (average
across many runs) on an AMD Piledriver (15h) CPU:

(w/o shadowing):
 Performance counter stats for './vring_bench':
     5,451,082,016      L1-dcache-loads
     ...
       2.221477739 seconds time elapsed

(w/ shadowing):
 Performance counter stats for './vring_bench':
     5,405,701,361      L1-dcache-loads
     ...
       2.168405376 seconds time elapsed

The further away (in a NUMA sense) virtio producers and consumers are
from each other, the more we expect to benefit. Physical implementations
of virtio devices and implementations of virtio where the consumer polls
vring avail indexes (vhost) should also benefit.

Signed-off-by: Venkatesh Srinivas &lt;venkateshs@google.com&gt;
Signed-off-by: Michael S. Tsirkin &lt;mst@redhat.com&gt;
</content>
</entry>
<entry>
<title>virtio: Do not drop __GFP_HIGH in alloc_indirect</title>
<updated>2015-12-07T15:28:11Z</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.com</email>
</author>
<published>2015-12-01T14:32:49Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=82107539bbb9db303fb6676c78c836add5680bb0'/>
<id>urn:sha1:82107539bbb9db303fb6676c78c836add5680bb0</id>
<content type='text'>
b92b1b89a33c ("virtio: force vring descriptors to be allocated from
lowmem") tried to exclude highmem pages for descriptors so it cleared
__GFP_HIGHMEM from a given gfp mask. The patch also cleared __GFP_HIGH
which doesn't make much sense for this fix because __GFP_HIGH only
controls access to memory reserves and it doesn't have any influence
on the zone selection. Some of the call paths use GFP_ATOMIC and
dropping __GFP_HIGH will reduce their changes for success because the
lack of access to memory reserves.

Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Signed-off-by: Michael S. Tsirkin &lt;mst@redhat.com&gt;
Acked-by: Will Deacon &lt;will.deacon@arm.com&gt;
Reviewed-by: Mel Gorman &lt;mgorman@techsingularity.net&gt;
</content>
</entry>
<entry>
<title>virtio: Avoid possible kernel panic if DEBUG is enabled.</title>
<updated>2015-02-11T04:33:14Z</updated>
<author>
<name>Tetsuo Handa</name>
<email>penguin-kernel@I-love.SAKURA.ne.jp</email>
</author>
<published>2015-02-11T04:31:13Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5e05bf5833eb3dd97b6b6a52301d81e033714cb3'/>
<id>urn:sha1:5e05bf5833eb3dd97b6b6a52301d81e033714cb3</id>
<content type='text'>
The virtqueue_add() calls START_USE() upon entry. The virtqueue_kick() is
called if vq-&gt;num_added == (1 &lt;&lt; 16) - 1 before calling END_USE().
The virtqueue_kick_prepare() called via virtqueue_kick() calls START_USE()
upon entry, and will call panic() if DEBUG is enabled.
Move this virtqueue_kick() call to after END_USE() call.

Signed-off-by: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Signed-off-by: Rusty Russell &lt;rusty@rustcorp.com.au&gt;
</content>
</entry>
<entry>
<title>virtio_ring: coding style fix</title>
<updated>2015-01-21T05:58:57Z</updated>
<author>
<name>Michael S. Tsirkin</name>
<email>mst@redhat.com</email>
</author>
<published>2015-01-15T11:33:31Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=43b4f721ce6d497648a8d4a21c1d53483090bcf9'/>
<id>urn:sha1:43b4f721ce6d497648a8d4a21c1d53483090bcf9</id>
<content type='text'>
Most of our code has
struct foo {
}

Fix one instances where ring is inconsistent.

Signed-off-by: Michael S. Tsirkin &lt;mst@redhat.com&gt;
Signed-off-by: Rusty Russell &lt;rusty@rustcorp.com.au&gt;
</content>
</entry>
<entry>
<title>virtio: make VIRTIO_F_VERSION_1 a transport bit</title>
<updated>2014-12-09T10:06:32Z</updated>
<author>
<name>Michael S. Tsirkin</name>
<email>mst@redhat.com</email>
</author>
<published>2014-12-01T13:52:40Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=747ae34a6ef681fbd993be214d8c0a30bd4a2fda'/>
<id>urn:sha1:747ae34a6ef681fbd993be214d8c0a30bd4a2fda</id>
<content type='text'>
Activate VIRTIO_F_VERSION_1 automatically unless legacy_only
is set.

Signed-off-by: Michael S. Tsirkin &lt;mst@redhat.com&gt;




</content>
</entry>
<entry>
<title>virtio: allow transports to get avail/used addresses</title>
<updated>2014-12-09T10:05:25Z</updated>
<author>
<name>Cornelia Huck</name>
<email>cornelia.huck@de.ibm.com</email>
</author>
<published>2014-10-07T14:39:47Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=890626521503318d7ac92a4a3b9feba55c0131ec'/>
<id>urn:sha1:890626521503318d7ac92a4a3b9feba55c0131ec</id>
<content type='text'>
For virtio-1, we can theoretically have a more complex virtqueue
layout with avail and used buffers not on a contiguous memory area
with the descriptor table. For now, it's fine for a transport driver
to stay with the old layout: It needs, however, a way to access
the locations of the avail/used rings so it can register them with
the host.

Reviewed-by: David Hildenbrand &lt;dahi@linux.vnet.ibm.com&gt;
Signed-off-by: Cornelia Huck &lt;cornelia.huck@de.ibm.com&gt;

Signed-off-by: Michael S. Tsirkin &lt;mst@redhat.com&gt;




</content>
</entry>
<entry>
<title>virtio_ring: switch to new memory access APIs</title>
<updated>2014-12-09T10:05:25Z</updated>
<author>
<name>Michael S. Tsirkin</name>
<email>mst@redhat.com</email>
</author>
<published>2014-10-22T12:42:09Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=00e6f3d9d9e356dbf08369ffc4576f79438d51ea'/>
<id>urn:sha1:00e6f3d9d9e356dbf08369ffc4576f79438d51ea</id>
<content type='text'>
Use virtioXX_to_cpu and friends for access to
all multibyte structures in memory.

Note: this is intentionally mechanical.
A follow-up patch will split long lines etc.

Signed-off-by: Michael S. Tsirkin &lt;mst@redhat.com&gt;
Reviewed-by: Cornelia Huck &lt;cornelia.huck@de.ibm.com&gt;




</content>
</entry>
<entry>
<title>virtio: use u32, not bitmap for features</title>
<updated>2014-12-09T10:05:23Z</updated>
<author>
<name>Michael S. Tsirkin</name>
<email>mst@redhat.com</email>
</author>
<published>2014-10-07T14:39:42Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=e16e12be34648777606a2c03a3526409b38f0e63'/>
<id>urn:sha1:e16e12be34648777606a2c03a3526409b38f0e63</id>
<content type='text'>
It seemed like a good idea to use bitmap for features
in struct virtio_device, but it's actually a pain,
and seems to become even more painful when we get more
than 32 feature bits.  Just change it to a u32 for now.

Based on patch by Rusty.

Suggested-by: David Hildenbrand &lt;dahi@linux.vnet.ibm.com&gt;
Signed-off-by: Rusty Russell &lt;rusty@rustcorp.com.au&gt;
Signed-off-by: Cornelia Huck &lt;cornelia.huck@de.ibm.com&gt;
Signed-off-by: Michael S. Tsirkin &lt;mst@redhat.com&gt;
Reviewed-by: Cornelia Huck &lt;cornelia.huck@de.ibm.com&gt;




</content>
</entry>
<entry>
<title>virtio_ring: unify direct/indirect code paths.</title>
<updated>2014-09-13T16:52:35Z</updated>
<author>
<name>Rusty Russell</name>
<email>rusty@rustcorp.com.au</email>
</author>
<published>2014-09-11T00:47:38Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b25bd2515ea32cf5ddd5fd5a2a93b8c9dd875e4f'/>
<id>urn:sha1:b25bd2515ea32cf5ddd5fd5a2a93b8c9dd875e4f</id>
<content type='text'>
virtqueue_add() populates the virtqueue descriptor table from the sgs
given.  If it uses an indirect descriptor table, then it puts a single
descriptor in the descriptor table pointing to the kmalloc'ed indirect
table where the sg is populated.

Previously vring_add_indirect() did the allocation and the simple
linear layout.  We replace that with alloc_indirect() which allocates
the indirect table then chains it like the normal descriptor table so
we can reuse the core logic.

This slows down pktgen by less than 1/2 a percent (which uses direct
descriptors), as well as vring_bench, but it's far neater.

vring_bench before:
	1061485790-1104800648(1.08254e+09+/-6.6e+06)ns
vring_bench after:
	1125610268-1183528965(1.14172e+09+/-8e+06)ns

pktgen before:
   787781-796334(793165+/-2.4e+03)pps 365-369(367.5+/-1.2)Mb/sec (365530384-369498976(3.68028e+08+/-1.1e+06)bps) errors: 0

pktgen after:
   779988-790404(786391+/-2.5e+03)pps 361-366(364.35+/-1.3)Mb/sec (361914432-366747456(3.64885e+08+/-1.2e+06)bps) errors: 0

Now, if we make force indirect descriptors by turning off any_header_sg
in virtio_net.c:

pktgen before:
  713773-721062(718374+/-2.1e+03)pps 331-334(332.95+/-0.92)Mb/sec (331190672-334572768(3.33325e+08+/-9.6e+05)bps) errors: 0
pktgen after:
  710542-719195(714898+/-2.4e+03)pps 329-333(331.15+/-1.1)Mb/sec (329691488-333706480(3.31713e+08+/-1.1e+06)bps) errors: 0

Signed-off-by: Rusty Russell &lt;rusty@rustcorp.com.au&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>virtio_ring: assume sgs are always well-formed.</title>
<updated>2014-09-13T16:50:46Z</updated>
<author>
<name>Rusty Russell</name>
<email>rusty@rustcorp.com.au</email>
</author>
<published>2014-09-11T00:47:37Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=eeebf9b1fc0862466c5661d63fbaf66ab4a50210'/>
<id>urn:sha1:eeebf9b1fc0862466c5661d63fbaf66ab4a50210</id>
<content type='text'>
We used to have several callers which just used arrays.  They're
gone, so we can use sg_next() everywhere, simplifying the code.

On my laptop, this slowed down vring_bench by 15%:

vring_bench before:
	936153354-967745359(9.44739e+08+/-6.1e+06)ns
vring_bench after:
	1061485790-1104800648(1.08254e+09+/-6.6e+06)ns

However, a more realistic test using pktgen on a AMD FX(tm)-8320 saw
a few percent improvement:

pktgen before:
  767390-792966(785159+/-6.5e+03)pps 356-367(363.75+/-2.9)Mb/sec (356068960-367936224(3.64314e+08+/-3e+06)bps) errors: 0

pktgen after:
   787781-796334(793165+/-2.4e+03)pps 365-369(367.5+/-1.2)Mb/sec (365530384-369498976(3.68028e+08+/-1.1e+06)bps) errors: 0

Signed-off-by: Rusty Russell &lt;rusty@rustcorp.com.au&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
