<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/linux/tcp.h, branch v5.4.279</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.4.279</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.4.279'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2023-07-27T06:37:45Z</updated>
<entry>
<title>tcp: annotate data-races around fastopenq.max_qlen</title>
<updated>2023-07-27T06:37:45Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2023-07-19T21:28:57Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b1062596556e59e18efa9561998198da65de4c92'/>
<id>urn:sha1:b1062596556e59e18efa9561998198da65de4c92</id>
<content type='text'>
[ Upstream commit 70f360dd7042cb843635ece9d28335a4addff9eb ]

This field can be read locklessly.

Fixes: 1536e2857bd3 ("tcp: Add a TCP_FASTOPEN socket option to get a max backlog on its listner")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Link: https://lore.kernel.org/r/20230719212857.3943972-12-edumazet@google.com
Signed-off-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: fix tcp_cwnd_validate() to not forget is_cwnd_limited</title>
<updated>2022-10-26T11:22:26Z</updated>
<author>
<name>Neal Cardwell</name>
<email>ncardwell@google.com</email>
</author>
<published>2022-09-28T20:03:31Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3625b684a285a3dcead621b7442828371caa5f85'/>
<id>urn:sha1:3625b684a285a3dcead621b7442828371caa5f85</id>
<content type='text'>
[ Upstream commit f4ce91ce12a7c6ead19b128ffa8cff6e3ded2a14 ]

This commit fixes a bug in the tracking of max_packets_out and
is_cwnd_limited. This bug can cause the connection to fail to remember
that is_cwnd_limited is true, causing the connection to fail to grow
cwnd when it should, causing throughput to be lower than it should be.

The following event sequence is an example that triggers the bug:

 (a) The connection is cwnd_limited, but packets_out is not at its
     peak due to TSO deferral deciding not to send another skb yet.
     In such cases the connection can advance max_packets_seq and set
     tp-&gt;is_cwnd_limited to true and max_packets_out to a small
     number.

(b) Then later in the round trip the connection is pacing-limited (not
     cwnd-limited), and packets_out is larger. In such cases the
     connection would raise max_packets_out to a bigger number but
     (unexpectedly) flip tp-&gt;is_cwnd_limited from true to false.

This commit fixes that bug.

One straightforward fix would be to separately track (a) the next
window after max_packets_out reaches a maximum, and (b) the next
window after tp-&gt;is_cwnd_limited is set to true. But this would
require consuming an extra u32 sequence number.

Instead, to save space we track only the most important
information. Specifically, we track the strongest available signal of
the degree to which the cwnd is fully utilized:

(1) If the connection is cwnd-limited then we remember that fact for
the current window.

(2) If the connection not cwnd-limited then we track the maximum
number of outstanding packets in the current window.

In particular, note that the new logic cannot trigger the buggy
(a)/(b) sequence above because with the new logic a condition where
tp-&gt;packets_out &gt; tp-&gt;max_packets_out can only trigger an update of
tp-&gt;is_cwnd_limited if tp-&gt;is_cwnd_limited is false.

This first showed up in a testing of a BBRv2 dev branch, but this
buggy behavior highlighted a general issue with the
tcp_cwnd_validate() logic that can cause cwnd to fail to increase at
the proper rate for any TCP congestion control, including Reno or
CUBIC.

Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler")
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: Kevin(Yudong) Yang &lt;yyd@google.com&gt;
Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>tcp: allow at most one TLP probe per flight</title>
<updated>2020-07-31T16:39:31Z</updated>
<author>
<name>Yuchung Cheng</name>
<email>ycheng@google.com</email>
</author>
<published>2020-07-23T19:00:06Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=182ffc66456b20530def3b2d4f6b9a07545ac475'/>
<id>urn:sha1:182ffc66456b20530def3b2d4f6b9a07545ac475</id>
<content type='text'>
[ Upstream commit 76be93fc0702322179bb0ea87295d820ee46ad14 ]

Previously TLP may send multiple probes of new data in one
flight. This happens when the sender is cwnd limited. After the
initial TLP containing new data is sent, the sender receives another
ACK that acks partial inflight.  It may re-arm another TLP timer
to send more, if no further ACK returns before the next TLP timeout
(PTO) expires. The sender may send in theory a large amount of TLP
until send queue is depleted. This only happens if the sender sees
such irregular uncommon ACK pattern. But it is generally undesirable
behavior during congestion especially.

The original TLP design restrict only one TLP probe per inflight as
published in "Reducing Web Latency: the Virtue of Gentle Aggression",
SIGCOMM 2013. This patch changes TLP to send at most one probe
per inflight.

Note that if the sender is app-limited, TLP retransmits old data
and did not have this issue.

Signed-off-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>tcp: add rcu protection around tp-&gt;fastopen_rsk</title>
<updated>2019-10-13T17:13:08Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2019-10-11T03:17:38Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=d983ea6f16b835dcde2ee9a58a1e764ce68bfccc'/>
<id>urn:sha1:d983ea6f16b835dcde2ee9a58a1e764ce68bfccc</id>
<content type='text'>
Both tcp_v4_err() and tcp_v6_err() do the following operations
while they do not own the socket lock :

	fastopen = tp-&gt;fastopen_rsk;
 	snd_una = fastopen ? tcp_rsk(fastopen)-&gt;snt_isn : tp-&gt;snd_una;

The problem is that without appropriate barrier, the compiler
might reload tp-&gt;fastopen_rsk and trigger a NULL deref.

request sockets are protected by RCU, we can simply add
the missing annotations and barriers to solve the issue.

Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: Add TCP_INFO counter for packets received out-of-order</title>
<updated>2019-09-16T14:26:11Z</updated>
<author>
<name>Thomas Higdon</name>
<email>tph@fb.com</email>
</author>
<published>2019-09-13T23:23:34Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f9af2dbbfe01def62765a58af7fbc488351893c3'/>
<id>urn:sha1:f9af2dbbfe01def62765a58af7fbc488351893c3</id>
<content type='text'>
For receive-heavy cases on the server-side, we want to track the
connection quality for individual client IPs. This counter, similar to
the existing system-wide TCPOFOQueue counter in /proc/net/netstat,
tracks out-of-order packet reception. By providing this counter in
TCP_INFO, it will allow understanding to what degree receive-heavy
sockets are experiencing out-of-order delivery and packet drops
indicating congestion.

Please note that this is similar to the counter in NetBSD TCP_INFO, and
has the same name.

Also note that we avoid increasing the size of the tcp_sock struct by
taking advantage of a hole.

Signed-off-by: Thomas Higdon &lt;tph@fb.com&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: fastopen: robustness and endianness fixes for SipHash</title>
<updated>2019-06-22T23:30:37Z</updated>
<author>
<name>Ard Biesheuvel</name>
<email>ard.biesheuvel@linaro.org</email>
</author>
<published>2019-06-19T21:46:28Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=438ac88009bcb10f9ced07fbb4b32d5377ee936b'/>
<id>urn:sha1:438ac88009bcb10f9ced07fbb4b32d5377ee936b</id>
<content type='text'>
Some changes to the TCP fastopen code to make it more robust
against future changes in the choice of key/cookie size, etc.

- Instead of keeping the SipHash key in an untyped u8[] buffer
  and casting it to the right type upon use, use the correct
  type directly. This ensures that the key will appear at the
  correct alignment if we ever change the way these data
  structures are allocated. (Currently, they are only allocated
  via kmalloc so they always appear at the correct alignment)

- Use DIV_ROUND_UP when sizing the u64[] array to hold the
  cookie, so it is always of sufficient size, even if
  TCP_FASTOPEN_COOKIE_MAX is no longer a multiple of 8.

- Drop the 'len' parameter from the tcp_fastopen_reset_cipher()
  function, which is no longer used.

- Add endian swabbing when setting the keys and calculating the hash,
  to ensure that cookie values are the same for a given key and
  source/destination address pair regardless of the endianness of
  the server.

Note that none of these are functional changes wrt the current
state of the code, with the exception of the swabbing, which only
affects big endian systems.

Signed-off-by: Ard Biesheuvel &lt;ard.biesheuvel@linaro.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net</title>
<updated>2019-06-18T03:20:36Z</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2019-06-18T02:48:13Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=13091aa30535b719e269f20a7bc34002bf5afae5'/>
<id>urn:sha1:13091aa30535b719e269f20a7bc34002bf5afae5</id>
<content type='text'>
Honestly all the conflicts were simple overlapping changes,
nothing really interesting to report.

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>net: ipv4: move tcp_fastopen server side code to SipHash library</title>
<updated>2019-06-17T20:56:26Z</updated>
<author>
<name>Ard Biesheuvel</name>
<email>ard.biesheuvel@linaro.org</email>
</author>
<published>2019-06-17T08:09:33Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c681edae33e86ff27be2d6cc717663d91df20b0e'/>
<id>urn:sha1:c681edae33e86ff27be2d6cc717663d91df20b0e</id>
<content type='text'>
Using a bare block cipher in non-crypto code is almost always a bad idea,
not only for security reasons (and we've seen some examples of this in
the kernel in the past), but also for performance reasons.

In the TCP fastopen case, we call into the bare AES block cipher one or
two times (depending on whether the connection is IPv4 or IPv6). On most
systems, this results in a call chain such as

  crypto_cipher_encrypt_one(ctx, dst, src)
    crypto_cipher_crt(tfm)-&gt;cit_encrypt_one(crypto_cipher_tfm(tfm), ...);
      aesni_encrypt
        kernel_fpu_begin();
        aesni_enc(ctx, dst, src); // asm routine
        kernel_fpu_end();

It is highly unlikely that the use of special AES instructions has a
benefit in this case, especially since we are doing the above twice
for IPv6 connections, instead of using a transform which can process
the entire input in one go.

We could switch to the cbcmac(aes) shash, which would at least get
rid of the duplicated overhead in *some* cases (i.e., today, only
arm64 has an accelerated implementation of cbcmac(aes), while x86 will
end up using the generic cbcmac template wrapping the AES-NI cipher,
which basically ends up doing exactly the above). However, in the given
context, it makes more sense to use a light-weight MAC algorithm that
is more suitable for the purpose at hand, such as SipHash.

Since the output size of SipHash already matches our chosen value for
TCP_FASTOPEN_COOKIE_SIZE, and given that it accepts arbitrary input
sizes, this greatly simplifies the code as well.

NOTE: Server farms backing a single server IP for load balancing purposes
      and sharing a single fastopen key will be adversely affected by
      this change unless all systems in the pool receive their kernel
      upgrades at the same time.

Signed-off-by: Ard Biesheuvel &lt;ard.biesheuvel@linaro.org&gt;
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: limit payload size of sacked skbs</title>
<updated>2019-06-16T01:47:31Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2019-05-18T00:17:22Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3b4929f65b0d8249f19a50245cd88ed1a2f78cff'/>
<id>urn:sha1:3b4929f65b0d8249f19a50245cd88ed1a2f78cff</id>
<content type='text'>
Jonathan Looney reported that TCP can trigger the following crash
in tcp_shifted_skb() :

	BUG_ON(tcp_skb_pcount(skb) &lt; pcount);

This can happen if the remote peer has advertized the smallest
MSS that linux TCP accepts : 48

An skb can hold 17 fragments, and each fragment can hold 32KB
on x86, or 64KB on PowerPC.

This means that the 16bit witdh of TCP_SKB_CB(skb)-&gt;tcp_gso_segs
can overflow.

Note that tcp_sendmsg() builds skbs with less than 64KB
of payload, so this problem needs SACK to be enabled.
SACK blocks allow TCP to coalesce multiple skbs in the retransmit
queue, thus filling the 17 fragments to maximal capacity.

CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)-&gt;tcp_gso_segs

Fixes: 832d11c5cd07 ("tcp: Try to restore large SKBs while SACK processing")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reported-by: Jonathan Looney &lt;jtl@netflix.com&gt;
Acked-by: Neal Cardwell &lt;ncardwell@google.com&gt;
Reviewed-by: Tyler Hicks &lt;tyhicks@canonical.com&gt;
Cc: Yuchung Cheng &lt;ycheng@google.com&gt;
Cc: Bruce Curtis &lt;brucec@netflix.com&gt;
Cc: Jonathan Lemon &lt;jonathan.lemon@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>tcp: add optional per socket transmit delay</title>
<updated>2019-06-12T20:05:43Z</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2019-06-12T18:57:25Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a842fe1425cb20f457abd3f8ef98b468f83ca98b'/>
<id>urn:sha1:a842fe1425cb20f457abd3f8ef98b468f83ca98b</id>
<content type='text'>
Adding delays to TCP flows is crucial for studying behavior
of TCP stacks, including congestion control modules.

Linux offers netem module, but it has unpractical constraints :
- Need root access to change qdisc
- Hard to setup on egress if combined with non trivial qdisc like FQ
- Single delay for all flows.

EDT (Earliest Departure Time) adoption in TCP stack allows us
to enable a per socket delay at a very small cost.

Networking tools can now establish thousands of flows, each of them
with a different delay, simulating real world conditions.

This requires FQ packet scheduler or a EDT-enabled NIC.

This patchs adds TCP_TX_DELAY socket option, to set a delay in
usec units.

  unsigned int tx_delay = 10000; /* 10 msec */

  setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &amp;tx_delay, sizeof(tx_delay));

Note that FQ packet scheduler limits might need some tweaking :

man tc-fq

PARAMETERS
   limit
       Hard  limit  on  the  real  queue  size. When this limit is
       reached, new packets are dropped. If the value is  lowered,
       packets  are  dropped so that the new limit is met. Default
       is 10000 packets.

   flow_limit
       Hard limit on the maximum  number  of  packets  queued  per
       flow.  Default value is 100.

Use of TCP_TX_DELAY option will increase number of skbs in FQ qdisc,
so packets would be dropped if any of the previous limit is hit.

Use of a jump label makes this support runtime-free, for hosts
never using the option.

Also note that TSQ (TCP Small Queues) limits are slightly changed
with this patch : we need to account that skbs artificially delayed
wont stop us providind more skbs to feed the pipe (netem uses
skb_orphan_partial() for this purpose, but FQ can not use this trick)

Because of that, using big delays might very well trigger
old bugs in TSO auto defer logic and/or sndbuf limited detection.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
