Diffstat (limited to 'Documentation/networking')
-rw-r--r-- Documentation/networking/6pack.rst | 2
-rw-r--r-- Documentation/networking/arcnet-hardware.rst | 22
-rw-r--r-- Documentation/networking/arcnet.rst | 48
-rw-r--r-- Documentation/networking/ax25.rst | 7
-rw-r--r-- Documentation/networking/can.rst | 71
-rw-r--r-- Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst | 22
-rw-r--r-- Documentation/networking/device_drivers/ethernet/index.rst | 1
-rw-r--r-- Documentation/networking/device_drivers/ethernet/mucse/rnpgbe.rst | 17
-rw-r--r-- Documentation/networking/devlink/devlink-eswitch-attr.rst | 13
-rw-r--r-- Documentation/networking/devlink/devlink-params.rst | 14
-rw-r--r-- Documentation/networking/devlink/i40e.rst | 34
-rw-r--r-- Documentation/networking/devlink/index.rst | 1
-rw-r--r-- Documentation/networking/devlink/mlx5.rst | 14
-rw-r--r-- Documentation/networking/devlink/stmmac.rst | 40
-rw-r--r-- Documentation/networking/dsa/dsa.rst | 17
-rw-r--r-- Documentation/networking/ethtool-netlink.rst | 64
-rw-r--r-- Documentation/networking/index.rst | 5
-rw-r--r-- Documentation/networking/ip-sysctl.rst | 60
-rw-r--r-- Documentation/networking/napi.rst | 50
-rw-r--r-- Documentation/networking/net_cachelines/inet_connection_sock.rst | 2
-rw-r--r-- Documentation/networking/net_cachelines/inet_sock.rst | 79
-rw-r--r-- Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst | 3
-rw-r--r-- Documentation/networking/net_failover.rst | 6
-rw-r--r-- Documentation/networking/netconsole.rst | 5
-rw-r--r-- Documentation/networking/nfc.rst | 6
-rw-r--r-- Documentation/networking/seg6-sysctl.rst | 3
-rw-r--r-- Documentation/networking/smc-sysctl.rst | 40
-rw-r--r-- Documentation/networking/statistics.rst | 4
-rw-r--r-- Documentation/networking/tls.rst | 20
-rw-r--r-- Documentation/networking/xfrm/index.rst | 13
-rw-r--r-- Documentation/networking/xfrm/xfrm_device.rst (renamed from Documentation/networking/xfrm_device.rst) | 20
-rw-r--r-- Documentation/networking/xfrm/xfrm_proc.rst (renamed from Documentation/networking/xfrm_proc.rst) | 0
-rw-r--r-- Documentation/networking/xfrm/xfrm_sync.rst (renamed from Documentation/networking/xfrm_sync.rst) | 97
-rw-r--r-- Documentation/networking/xfrm/xfrm_sysctl.rst (renamed from Documentation/networking/xfrm_sysctl.rst) | 4
34 files changed, 621 insertions, 183 deletions
diff --git a/Documentation/networking/6pack.rst b/Documentation/networking/6pack.rst
index bc5bf1f1a98f..66d5fd4fc821 100644
--- a/Documentation/networking/6pack.rst
+++ b/Documentation/networking/6pack.rst
@@ -94,7 +94,7 @@ kernels may lead to a compilation error because the interface to a kernel
function has been changed in the 2.1.8x kernels.
How to turn on 6pack support:
-=============================
+-----------------------------
- In the linux kernel configuration program, select the code maturity level
options menu and turn on the prompting for development drivers.
diff --git a/Documentation/networking/arcnet-hardware.rst b/Documentation/networking/arcnet-hardware.rst
index 3bf7f99cd7bb..20e5075d0d0e 100644
--- a/Documentation/networking/arcnet-hardware.rst
+++ b/Documentation/networking/arcnet-hardware.rst
@@ -4,18 +4,20 @@
ARCnet Hardware
===============
+:Author: Avery Pennarun <apenwarr@worldvisions.ca>
+
.. note::
- 1) This file is a supplement to arcnet.txt. Please read that for general
+ 1) This file is a supplement to arcnet.rst. Please read that for general
driver configuration help.
2) This file is no longer Linux-specific. It should probably be moved out
of the kernel sources. Ideas?
Because so many people (myself included) seem to have obtained ARCnet cards
without manuals, this file contains a quick introduction to ARCnet hardware,
-some cabling tips, and a listing of all jumper settings I can find. Please
-e-mail apenwarr@worldvisions.ca with any settings for your particular card,
-or any other information you have!
+some cabling tips, and a listing of all jumper settings I can find. If you
+have settings for your particular card, or any other information, do not
+hesitate to :ref:`email netdev <arcnet-netdev>`.
Introduction to ARCnet
@@ -72,11 +74,10 @@ level of encapsulation is defined by RFC1201, which I call "packet
splitting," that allows "virtual packets" to grow as large as 64K each,
although they are generally kept down to the Ethernet-style 1500 bytes.
-For more information on the advantages and disadvantages (mostly the
-advantages) of ARCnet networks, you might try the "ARCnet Trade Association"
-WWW page:
+For more information on ARCnet networks, visit the "ARCNET Resource Center"
+WWW page at:
- http://www.arcnet.com
+ https://www.arcnet.cc
Cabling ARCnet Networks
@@ -3226,9 +3227,6 @@ Settings for IRQ Selection (Lower Jumper Line)
Other Cards
===========
-I have no information on other models of ARCnet cards at the moment. Please
-send any and all info to:
-
- apenwarr@worldvisions.ca
+I have no information on other models of ARCnet cards at the moment.
Thanks.
diff --git a/Documentation/networking/arcnet.rst b/Documentation/networking/arcnet.rst
index 82fce606c0f0..cd43a18ad149 100644
--- a/Documentation/networking/arcnet.rst
+++ b/Documentation/networking/arcnet.rst
@@ -4,6 +4,8 @@
ARCnet
======
+:Author: Avery Pennarun <apenwarr@worldvisions.ca>
+
.. note::
See also arcnet-hardware.txt in this directory for jumper-setting
@@ -30,18 +32,7 @@ Come on, be a sport! Send me a success report!
(hey, that was even better than my original poem... this is getting bad!)
-
-.. warning::
-
- If you don't e-mail me about your success/failure soon, I may be forced to
- start SINGING. And we don't want that, do we?
-
- (You know, it might be argued that I'm pushing this point a little too much.
- If you think so, why not flame me in a quick little e-mail? Please also
- include the type of card(s) you're using, software, size of network, and
- whether it's working or not.)
-
- My e-mail address is: apenwarr@worldvisions.ca
+----
These are the ARCnet drivers for Linux.
@@ -59,23 +50,14 @@ ARCnet 2.10 ALPHA, Tomasz's all-new-and-improved RFC1051 support has been
included and seems to be working fine!
+.. _arcnet-netdev:
+
Where do I discuss these drivers?
---------------------------------
-Tomasz has been so kind as to set up a new and improved mailing list.
-Subscribe by sending a message with the BODY "subscribe linux-arcnet YOUR
-REAL NAME" to listserv@tichy.ch.uj.edu.pl. Then, to submit messages to the
-list, mail to linux-arcnet@tichy.ch.uj.edu.pl.
-
-There are archives of the mailing list at:
-
- http://epistolary.org/mailman/listinfo.cgi/arcnet
-
-The people on linux-net@vger.kernel.org (now defunct, replaced by
-netdev@vger.kernel.org) have also been known to be very helpful, especially
-when we're talking about ALPHA Linux kernels that may or may not work right
-in the first place.
-
+ARCnet discussions take place on netdev. Simply send your email to
+netdev@vger.kernel.org and make sure to Cc: the maintainer listed under the
+"ARCNET NETWORK LAYER" heading of Documentation/process/maintainers.rst.
Other Drivers and Info
----------------------
@@ -523,17 +505,9 @@ can set up your network then:
It works: what now?
-------------------
-Send mail describing your setup, preferably including driver version, kernel
-version, ARCnet card model, CPU type, number of systems on your network, and
-list of software in use to me at the following address:
-
- apenwarr@worldvisions.ca
-
-I do send (sometimes automated) replies to all messages I receive. My email
-can be weird (and also usually gets forwarded all over the place along the
-way to me), so if you don't get a reply within a reasonable time, please
-resend.
-
+Send mail following :ref:`arcnet-netdev`. Describe your setup, preferably
+including driver version, kernel version, ARCnet card model, CPU type, number
+of systems on your network, and list of software in use.
It doesn't work: what now?
--------------------------
diff --git a/Documentation/networking/ax25.rst b/Documentation/networking/ax25.rst
index 605e72c6c877..89c79dd6c6f9 100644
--- a/Documentation/networking/ax25.rst
+++ b/Documentation/networking/ax25.rst
@@ -11,6 +11,7 @@ found on https://linux-ax25.in-berlin.de.
There is a mailing list for discussing Linux amateur radio matters
called linux-hams@vger.kernel.org. To subscribe to it, send a message to
-majordomo@vger.kernel.org with the words "subscribe linux-hams" in the body
-of the message, the subject field is ignored. You don't need to be
-subscribed to post but of course that means you might miss an answer.
+linux-hams+subscribe@vger.kernel.org or use the web interface at
+https://vger.kernel.org. The subject and body of the message are
+ignored. You don't need to be subscribed to post but of course that
+means you might miss an answer.
diff --git a/Documentation/networking/can.rst b/Documentation/networking/can.rst
index f93049f03a37..536ff411da1d 100644
--- a/Documentation/networking/can.rst
+++ b/Documentation/networking/can.rst
@@ -1398,10 +1398,9 @@ second bit timing has to be specified in order to enable the CAN FD bitrate.
Additionally CAN FD capable CAN controllers support up to 64 bytes of
payload. The representation of this length in can_frame.len and
canfd_frame.len for userspace applications and inside the Linux network
-layer is a plain value from 0 .. 64 instead of the CAN 'data length code'.
-The data length code was a 1:1 mapping to the payload length in the Classical
-CAN frames anyway. The payload length to the bus-relevant DLC mapping is
-only performed inside the CAN drivers, preferably with the helper
+layer is a plain value from 0 .. 64 instead of the Classical CAN length
+which ranges from 0 to 8. The payload length to the bus-relevant DLC mapping
+is only performed inside the CAN drivers, preferably with the helper
functions can_fd_dlc2len() and can_fd_len2dlc().
The CAN netdevice driver capabilities can be distinguished by the network
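The length-to-DLC mapping mentioned in the hunk above can be illustrated with a small userspace sketch. This is not the kernel implementation of can_fd_dlc2len() or can_fd_len2dlc(); the names `dlc_to_len` and `len_to_dlc` are hypothetical, and only the standard CAN FD length table is assumed.

```python
# Payload lengths that a 4-bit CAN FD data length code (DLC) can express.
CAN_FD_DLC_TO_LEN = [0, 1, 2, 3, 4, 5, 6, 7, 8, 12, 16, 20, 24, 32, 48, 64]


def dlc_to_len(dlc: int) -> int:
    """Return the payload length in bytes for a 4-bit DLC."""
    return CAN_FD_DLC_TO_LEN[dlc & 0x0F]


def len_to_dlc(length: int) -> int:
    """Return the smallest DLC whose payload length can hold `length` bytes."""
    if not 0 <= length <= 64:
        raise ValueError("CAN FD payload length must be within 0..64")
    for dlc, max_len in enumerate(CAN_FD_DLC_TO_LEN):
        if length <= max_len:
            return dlc
    raise AssertionError("unreachable: 64 is the last table entry")


if __name__ == "__main__":
    # A 13-byte payload does not fit DLC 9 (12 bytes), so DLC 10 (16 bytes) is used.
    print(len_to_dlc(13), dlc_to_len(len_to_dlc(13)))
```

Lengths up to 8 map 1:1, which is why Classical CAN never needed the distinction; above 8 the DLC only selects one of seven discrete sizes.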
@@ -1465,6 +1464,70 @@ Example when 'fd-non-iso on' is added on this switchable CAN FD adapter::
can <FD,FD-NON-ISO> state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
+Transmitter Delay Compensation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+At high bit rates, the propagation delay from the TX pin to the RX pin of
+the transceiver might become greater than the actual bit time causing
+measurement errors: the RX pin would still be measuring the previous bit.
+
+The Transmitter Delay Compensation (hereafter, TDC) resolves this problem
+by introducing a Secondary Sample Point (SSP) equal to the distance, in
+minimum time quanta, from the start of the bit time on the TX pin to the
+actual measurement on the RX pin. The SSP is calculated as the sum of two
+configurable values: the TDC Value (TDCV) and the TDC offset (TDCO).
+
+TDC, if supported by the device, can be configured together with CAN-FD
+using the ip tool's "tdc-mode" argument as follows:
+
+**omitted**
+ When no "tdc-mode" option is provided, the kernel will automatically
+ decide whether TDC should be turned on, in which case it will
+ calculate a default TDCO and use the TDCV as measured by the
+ device. This is the recommended method to use TDC.
+
+**"tdc-mode off"**
+ TDC is explicitly disabled.
+
+**"tdc-mode auto"**
+ The user must provide the "tdco" argument. The TDCV will be
+ automatically calculated by the device. This option is only
+ available if the device supports the TDC-AUTO CAN controller mode.
+
+**"tdc-mode manual"**
+ The user must provide both the "tdco" and "tdcv" arguments. This
+ option is only available if the device supports the TDC-MANUAL CAN
+ controller mode.
+
+Note that some devices may offer an additional parameter: "tdcf" (TDC Filter
+window). If supported by your device, this can be added as an optional
+argument to either "tdc-mode auto" or "tdc-mode manual".
+
+Example configuring a 500 kbit/s arbitration bitrate, a 4 Mbit/s data
+bitrate, a TDCO of 15 minimum time quanta and a TDCV automatically measured
+by the device::
+
+ $ ip link set can0 up type can bitrate 500000 \
+ fd on dbitrate 4000000 \
+ tdc-mode auto tdco 15
+ $ ip -details link show can0
+ 5: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UP \
+ mode DEFAULT group default qlen 10
+ link/can promiscuity 0 allmulti 0 minmtu 72 maxmtu 72
+ can <FD,TDC-AUTO> state ERROR-ACTIVE restart-ms 0
+ bitrate 500000 sample-point 0.875
+ tq 12 prop-seg 69 phase-seg1 70 phase-seg2 20 sjw 10 brp 1
+ ES582.1/ES584.1: tseg1 2..256 tseg2 2..128 sjw 1..128 brp 1..512 \
+ brp_inc 1
+ dbitrate 4000000 dsample-point 0.750
+ dtq 12 dprop-seg 7 dphase-seg1 7 dphase-seg2 5 dsjw 2 dbrp 1
+ tdco 15 tdcf 0
+ ES582.1/ES584.1: dtseg1 2..32 dtseg2 1..16 dsjw 1..8 dbrp 1..32 \
+ dbrp_inc 1
+ tdco 0..127 tdcf 0..127
+ clock 80000000
+
+
Supported CAN Hardware
----------------------
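The SSP arithmetic described in the Transmitter Delay Compensation hunk above (SSP = TDCV + TDCO) can be worked through as a short sketch. It assumes that one minimum time quantum equals one period of the CAN clock; the TDCV of 20 is a made-up measurement, while the TDCO of 15 and the 80 MHz clock come from the example output. The function names are hypothetical.

```python
def secondary_sample_point(tdcv: int, tdco: int) -> int:
    """SSP in minimum time quanta: measured delay (TDCV) plus offset (TDCO)."""
    return tdcv + tdco


def mtq_to_ns(mtq: int, clock_hz: int) -> float:
    """Convert minimum time quanta to nanoseconds.

    Assumption: one minimum time quantum is one CAN clock period.
    """
    return mtq * 1e9 / clock_hz


if __name__ == "__main__":
    ssp = secondary_sample_point(tdcv=20, tdco=15)  # tdcv=20 is hypothetical
    print(ssp, mtq_to_ns(ssp, clock_hz=80_000_000))
```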
diff --git a/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst b/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst
index 289c146a8291..5aedbabb7382 100644
--- a/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst
+++ b/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst
@@ -28,6 +28,7 @@ these MAP frames and send them to appropriate PDN's.
================
a. MAP packet v1 (data / control)
+---------------------------------
MAP header fields are in big endian format.
@@ -54,6 +55,7 @@ Payload length includes the padding length but does not include MAP header
length.
b. Map packet v4 (data / control)
+---------------------------------
MAP header fields are in big endian format.
@@ -107,6 +109,7 @@ over which checksum is computed.
Checksum value, indicates the checksum computed.
c. MAP packet v5 (data / control)
+---------------------------------
MAP header fields are in big endian format.
@@ -134,19 +137,24 @@ Payload length includes the padding length but does not include MAP header
length.
d. Checksum offload header v5
+-----------------------------
Checksum offload header fields are in big endian format.
+Packet format::
+
Bit 0 - 6 7 8-15 16-31
Function Header Type Next Header Checksum Valid Reserved
Header Type is to indicate the type of header, this usually is set to CHECKSUM
Header types
-= ==========================================
+
+= ===============
0 Reserved
1 Reserved
2 checksum header
+= ===============
Checksum Valid is to indicate whether the header checksum is valid. Value of 1
implies that checksum is calculated on this packet and is valid, value of 0
@@ -154,7 +162,10 @@ indicates that the calculated packet checksum is invalid.
Reserved bits must be zero when sent and ignored when received.
-e. MAP packet v1/v5 (command specific)::
+e. MAP packet v1/v5 (command specific)
+--------------------------------------
+
+Packet format::
Bit 0 1 2-7 8 - 15 16 - 31
Function Command Reserved Pad Multiplexer ID Payload length
@@ -177,15 +188,18 @@ Command types
= ==========================================
f. Aggregation
+--------------
Aggregation is multiple MAP packets (can be data or command) delivered to
rmnet in a single linear skb. rmnet will process the individual
packets and either ACK the MAP command or deliver the IP packet to the
network stack as needed
-MAP header|IP Packet|Optional padding|MAP header|IP Packet|Optional padding....
+Packet format::
+
+ MAP header|IP Packet|Optional padding|MAP header|IP Packet|Optional padding....
-MAP header|IP Packet|Optional padding|MAP header|Command Packet|Optional pad...
+ MAP header|IP Packet|Optional padding|MAP header|Command Packet|Optional pad...
3. Userspace configuration
==========================
diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst
index 7cfcd183054f..bcc02355f828 100644
--- a/Documentation/networking/device_drivers/ethernet/index.rst
+++ b/Documentation/networking/device_drivers/ethernet/index.rst
@@ -47,6 +47,7 @@ Contents:
mellanox/mlx5/index
meta/fbnic
microsoft/netvsc
+ mucse/rnpgbe
neterion/s2io
netronome/nfp
pensando/ionic
diff --git a/Documentation/networking/device_drivers/ethernet/mucse/rnpgbe.rst b/Documentation/networking/device_drivers/ethernet/mucse/rnpgbe.rst
new file mode 100644
index 000000000000..d35cf8a46b6c
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mucse/rnpgbe.rst
@@ -0,0 +1,17 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================================================
+Linux Base Driver for MUCSE(R) Gigabit PCI Express Adapters
+===========================================================
+
+Contents
+========
+
+- Identifying Your Adapter
+
+Identifying Your Adapter
+========================
+The driver is compatible with devices based on the following:
+
+ * MUCSE(R) Ethernet Controller N210 series
+ * MUCSE(R) Ethernet Controller N500 series
diff --git a/Documentation/networking/devlink/devlink-eswitch-attr.rst b/Documentation/networking/devlink/devlink-eswitch-attr.rst
index 08bb39ab1528..eafe09abc40c 100644
--- a/Documentation/networking/devlink/devlink-eswitch-attr.rst
+++ b/Documentation/networking/devlink/devlink-eswitch-attr.rst
@@ -39,6 +39,10 @@ The following is a list of E-Switch attributes.
rules.
* ``switchdev`` allows for more advanced offloading capabilities of
the E-Switch to hardware.
+ * ``switchdev_inactive`` enters switchdev mode in an inactive state that
+ blocks traffic until explicitly activated. This mode is useful for
+ orchestrators that want to prepare the device in switchdev mode but only
+ activate it when all configurations are done.
* - ``inline-mode``
- enum
- Some HWs need the VF driver to put part of the packet
@@ -74,3 +78,12 @@ Example Usage
# enable encap-mode with legacy mode
$ devlink dev eswitch set pci/0000:08:00.0 mode legacy inline-mode none encap-mode basic
+
+ # start switchdev mode in inactive state
+ $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev_inactive
+
+ # set up switchdev configurations, representors, FDB entries, etc.
+ ...
+
+ # activate switchdev mode to allow traffic
+ $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev
diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst
index 0a9c20d70122..ea17756dcda6 100644
--- a/Documentation/networking/devlink/devlink-params.rst
+++ b/Documentation/networking/devlink/devlink-params.rst
@@ -41,6 +41,16 @@ In order for ``driverinit`` parameters to take effect, the driver must
support reloading via the ``devlink-reload`` command. This command will
request a reload of the device driver.
+Default parameter values
+=========================
+
+Drivers may optionally export default values for parameters of cmode
+``runtime`` and ``permanent``. For ``driverinit`` parameters, the last
+value set by the driver will be used as the default value. Drivers can
+also support resetting params with cmode ``runtime`` and ``permanent``
+to their default values. Resetting ``driverinit`` params is supported
+by devlink core without additional driver support needed.
+
.. _devlink_params_generic:
Generic configuration parameters
@@ -151,3 +161,7 @@ own name.
* - ``num_doorbells``
- u32
- Controls the number of doorbells used by the device.
+ * - ``max_mac_per_vf``
+ - u32
+ - Controls the maximum number of MAC address filters that can be assigned
+ to a Virtual Function (VF).
diff --git a/Documentation/networking/devlink/i40e.rst b/Documentation/networking/devlink/i40e.rst
index d3cb5bb5197e..51c887f0dc83 100644
--- a/Documentation/networking/devlink/i40e.rst
+++ b/Documentation/networking/devlink/i40e.rst
@@ -7,6 +7,40 @@ i40e devlink support
This document describes the devlink features implemented by the ``i40e``
device driver.
+Parameters
+==========
+
+.. list-table:: Generic parameters implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Mode
+ - Notes
+ * - ``max_mac_per_vf``
+ - runtime
+ - Controls the maximum number of MAC addresses a VF can use
+ on i40e devices.
+
+ By default (``0``), the driver enforces its internally calculated per-VF
+ MAC filter limit, which is based on the number of allocated VFs.
+
+ If set to a non-zero value, this parameter acts as a strict cap:
+ the driver will use the user-provided value instead of its internal
+ calculation.
+
+ **Important notes:**
+
+ - This value **must be set before enabling SR-IOV**.
+ Attempting to change it while SR-IOV is enabled will return an error.
+ - MAC filters are a **shared hardware resource** across all VFs.
+ Setting a high value may cause other VFs to be starved of filters.
+ - This value is an **administrative policy**. The hardware may return
+ errors when its absolute limit is reached, regardless of the value
+ set here.
+
+ The default value is ``0`` (internal calculation is used).
+
+
Info versions
=============
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
index 0c58e5c729d9..35b12a2bfeba 100644
--- a/Documentation/networking/devlink/index.rst
+++ b/Documentation/networking/devlink/index.rst
@@ -99,5 +99,6 @@ parameters, info versions, and other features it supports.
prestera
qed
sfc
+ stmmac
ti-cpsw-switch
zl3073x
diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst
index 0e5f9c76e514..4bba4d780a4a 100644
--- a/Documentation/networking/devlink/mlx5.rst
+++ b/Documentation/networking/devlink/mlx5.rst
@@ -218,6 +218,20 @@ parameters.
* ``balanced`` : Merges fewer CQEs, resulting in a moderate compression ratio but maintaining a balance between bandwidth savings and performance
* ``aggressive`` : Merges more CQEs into a single entry, achieving a higher compression rate and maximizing performance, particularly under high traffic loads
+ * - ``swp_l4_csum_mode``
+ - string
+ - permanent
+ - Configure how the L4 checksum is calculated by the device when using
+ Software Parser (SWP) hints for header locations.
+
+ * ``default`` : Use the device's default checksum calculation
+ mode. The driver will discover during init whether
+ full_csum or l4_only is in use. Setting this value explicitly
+ from userspace is not allowed, but some firmware versions may
+ return this value on param read.
+ * ``full_csum`` : Calculate full checksum including the pseudo-header
+ * ``l4_only`` : Calculate L4-only checksum, excluding the pseudo-header
+
The ``mlx5`` driver supports reloading via ``DEVLINK_CMD_RELOAD``
Info versions
diff --git a/Documentation/networking/devlink/stmmac.rst b/Documentation/networking/devlink/stmmac.rst
new file mode 100644
index 000000000000..47e3ff10bc08
--- /dev/null
+++ b/Documentation/networking/devlink/stmmac.rst
@@ -0,0 +1,40 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
+stmmac (synopsys dwmac) devlink support
+=======================================
+
+This document describes the devlink features implemented by the ``stmmac``
+device driver.
+
+Parameters
+==========
+
+The ``stmmac`` driver implements the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``phc_coarse_adj``
+ - Boolean
+ - runtime
+ - Enable the Coarse timestamping mode, as defined in the DWMAC TRM.
+ A detailed explanation of this timestamping mode can be found in the
+ Socfpga Functional Description [1].
+
+ In Coarse mode, the ptp clock is expected to be fed by a high-precision
+ clock that is externally adjusted, and the subsecond increment used for
+ timestamping is set to 1/ptp_clock_rate.
+
+ In Fine mode (i.e. Coarse mode == false), the ptp clock frequency is
+ continuously adjusted, but the subsecond increment is set to
+ 2/ptp_clock_rate.
+
+ Coarse mode is suitable for PTP Grand Master operation. If unsure, leave
+ the parameter set to False.
+
+ [1] https://www.intel.com/content/www/us/en/docs/programmable/683126/21-2/functional-description-of-the-emac.html
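The two subsecond increments named in the ``phc_coarse_adj`` description above (1/ptp_clock_rate in Coarse mode, 2/ptp_clock_rate in Fine mode) can be compared with a short calculation. This is a worked-arithmetic sketch only; the function name and the 50 MHz clock rate are assumptions, not values taken from the driver.

```python
def subsecond_increment_ns(ptp_clock_hz: int, coarse: bool) -> float:
    """Subsecond increment in nanoseconds for the given PTP clock rate.

    Coarse mode uses one clock period per increment, Fine mode two.
    """
    periods = 1 if coarse else 2
    return periods * 1e9 / ptp_clock_hz


if __name__ == "__main__":
    clock_hz = 50_000_000  # hypothetical 50 MHz PTP reference clock
    print("coarse:", subsecond_increment_ns(clock_hz, coarse=True), "ns")
    print("fine:  ", subsecond_increment_ns(clock_hz, coarse=False), "ns")
```

With the same clock, Coarse mode halves the increment, which is consistent with the text's recommendation of Coarse mode when an externally adjusted high-precision clock is available.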
diff --git a/Documentation/networking/dsa/dsa.rst b/Documentation/networking/dsa/dsa.rst
index 7b2e69cd7ef0..5c79740a533b 100644
--- a/Documentation/networking/dsa/dsa.rst
+++ b/Documentation/networking/dsa/dsa.rst
@@ -1104,12 +1104,11 @@ health of the network and for discovery of other nodes.
In Linux, both HSR and PRP are implemented in the hsr driver, which
instantiates a virtual, stackable network interface with two member ports.
The driver only implements the basic roles of DANH (Doubly Attached Node
-implementing HSR) and DANP (Doubly Attached Node implementing PRP); the roles
-of RedBox and QuadBox are not implemented (therefore, bridging a hsr network
-interface with a physical switch port does not produce the expected result).
+implementing HSR), DANP (Doubly Attached Node implementing PRP) and RedBox
+(allows non-HSR devices to connect to the ring via Interlink ports).
-A driver which is able of offloading certain functions of a DANP or DANH should
-declare the corresponding netdev features as indicated by the documentation at
+A driver which is capable of offloading certain functions should declare the
+corresponding netdev features as indicated by the documentation at
``Documentation/networking/netdev-features.rst``. Additionally, the following
methods must be implemented:
@@ -1120,6 +1119,14 @@ methods must be implemented:
- ``port_hsr_leave``: function invoked when a given switch port leaves a
DANP/DANH and returns to normal operation as a standalone port.
+Note that the ``NETIF_F_HW_HSR_DUP`` feature relies on transmission towards
+multiple ports, which is generally available whenever the tagging protocol uses
+the ``dsa_xmit_port_mask()`` helper function. If the helper is used, the HSR
+offload feature should also be set. The ``dsa_port_simple_hsr_join()`` and
+``dsa_port_simple_hsr_leave()`` methods can be used as generic implementations
+of ``port_hsr_join`` and ``port_hsr_leave``, if this is the only supported
+offload feature.
+
TODO
====
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index b270886c5f5d..af56c304cef4 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -242,6 +242,7 @@ Userspace to kernel:
``ETHTOOL_MSG_RSS_SET`` set RSS settings
``ETHTOOL_MSG_RSS_CREATE_ACT`` create an additional RSS context
``ETHTOOL_MSG_RSS_DELETE_ACT`` delete an additional RSS context
+ ``ETHTOOL_MSG_MSE_GET`` get MSE diagnostic data
===================================== =================================
Kernel to userspace:
@@ -299,6 +300,7 @@ Kernel to userspace:
``ETHTOOL_MSG_RSS_CREATE_ACT_REPLY`` create an additional RSS context
``ETHTOOL_MSG_RSS_CREATE_NTF`` additional RSS context created
``ETHTOOL_MSG_RSS_DELETE_NTF`` additional RSS context deleted
+ ``ETHTOOL_MSG_MSE_GET_REPLY`` MSE diagnostic data
======================================== =================================
``GET`` requests are sent by userspace applications to retrieve device
@@ -2458,6 +2460,68 @@ Kernel response contents:
For a description of each attribute, see ``TSCONFIG_GET``.
+MSE_GET
+=======
+
+Retrieves detailed Mean Square Error (MSE) diagnostic information from the PHY.
+
+Request Contents:
+
+ ==================================== ====== ============================
+ ``ETHTOOL_A_MSE_HEADER`` nested request header
+ ==================================== ====== ============================
+
+Kernel Response Contents:
+
+ ==================================== ====== ================================
+ ``ETHTOOL_A_MSE_HEADER`` nested reply header
+ ``ETHTOOL_A_MSE_CAPABILITIES`` nested capability/scale info for MSE
+ measurements
+ ``ETHTOOL_A_MSE_CHANNEL_A`` nested snapshot for Channel A
+ ``ETHTOOL_A_MSE_CHANNEL_B`` nested snapshot for Channel B
+ ``ETHTOOL_A_MSE_CHANNEL_C`` nested snapshot for Channel C
+ ``ETHTOOL_A_MSE_CHANNEL_D`` nested snapshot for Channel D
+ ``ETHTOOL_A_MSE_WORST_CHANNEL`` nested snapshot for worst channel
+ ``ETHTOOL_A_MSE_LINK`` nested snapshot for link-wide aggregate
+ ==================================== ====== ================================
+
+MSE Capabilities
+----------------
+
+This nested attribute reports the capability / scaling properties used to
+interpret snapshot values.
+
+ ============================================== ====== =========================
+ ``ETHTOOL_A_MSE_CAPABILITIES_MAX_AVERAGE_MSE`` uint max avg_mse scale
+ ``ETHTOOL_A_MSE_CAPABILITIES_MAX_PEAK_MSE`` uint max peak_mse scale
+ ``ETHTOOL_A_MSE_CAPABILITIES_REFRESH_RATE_PS`` uint sample rate (picoseconds)
+ ``ETHTOOL_A_MSE_CAPABILITIES_NUM_SYMBOLS`` uint symbols per HW sample
+ ============================================== ====== =========================
+
+The max-average/peak fields are included only if the corresponding metric
+is supported by the PHY. Their absence indicates that the metric is not
+available.
+
+See ``struct phy_mse_capability`` kernel documentation in
+``include/linux/phy.h``.
+
+MSE Snapshot
+------------
+
+Each per-channel nest contains an atomic snapshot of MSE values for that
+selector (channel A/B/C/D, worst channel, or link).
+
+ ========================================== ====== ===================
+ ``ETHTOOL_A_MSE_SNAPSHOT_AVERAGE_MSE`` uint average MSE value
+ ``ETHTOOL_A_MSE_SNAPSHOT_PEAK_MSE`` uint current peak MSE
+ ``ETHTOOL_A_MSE_SNAPSHOT_WORST_PEAK_MSE`` uint worst-case peak MSE
+ ========================================== ====== ===================
+
+Within each channel nest, only the metrics supported by the PHY will be present.
+
+See ``struct phy_mse_snapshot`` kernel documentation in
+``include/linux/phy.h``.
+
Request translation
===================
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index c775cababc8c..75db2251649b 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -131,10 +131,7 @@ Contents:
vxlan
x25
x25-iface
- xfrm_device
- xfrm_proc
- xfrm_sync
- xfrm_sysctl
+ xfrm/index
xdp-rx-metadata
xsk-tx-metadata
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index a06cb99d66dc..bc9a01606daf 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -673,6 +673,16 @@ tcp_moderate_rcvbuf - BOOLEAN
Default: 1 (enabled)
+tcp_rcvbuf_low_rtt - INTEGER
+ rcvbuf autotuning can overestimate the final socket rcvbuf, which
+ can lead to cache thrashing for high throughput flows.
+
+ For small RTT flows (below tcp_rcvbuf_low_rtt usecs), we can relax
+ rcvbuf growth: a few additional ms to reach the final (and smaller)
+ rcvbuf are a good tradeoff.
+
+ Default : 1000 (1 ms)
+
tcp_mtu_probing - INTEGER
Controls TCP Packetization-Layer Path MTU Discovery. Takes three
values:
@@ -854,9 +864,18 @@ tcp_sack - BOOLEAN
Default: 1 (enabled)
+tcp_comp_sack_rtt_percent - INTEGER
+ Percentage of SRTT used for the compressed SACK feature.
+ See tcp_comp_sack_nr, tcp_comp_sack_delay_ns, tcp_comp_sack_slack_ns.
+
+ Possible values : 1 - 1000
+
+ Default : 33 %
+
tcp_comp_sack_delay_ns - LONG INTEGER
- TCP tries to reduce number of SACK sent, using a timer
- based on 5% of SRTT, capped by this sysctl, in nano seconds.
+ TCP tries to reduce the number of SACKs sent, using a timer based
+ on tcp_comp_sack_rtt_percent of SRTT, capped by this sysctl
+ in nanoseconds.
The default is 1ms, based on TSO autosizing period.
Default : 1,000,000 ns (1 ms)
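The relationship between tcp_comp_sack_rtt_percent and tcp_comp_sack_delay_ns described in the hunk above can be sketched as a percentage of SRTT capped by the delay sysctl. This is a simplified model using the documented defaults (33 % and 1,000,000 ns); it does not reproduce the kernel's exact arithmetic or the slack handling, and the function name is hypothetical.

```python
def comp_sack_delay_ns(srtt_ns: int,
                       rtt_percent: int = 33,
                       delay_cap_ns: int = 1_000_000) -> int:
    """Compressed-SACK timer: rtt_percent of SRTT, capped at delay_cap_ns."""
    if not 1 <= rtt_percent <= 1000:
        raise ValueError("tcp_comp_sack_rtt_percent must be within 1..1000")
    return min(srtt_ns * rtt_percent // 100, delay_cap_ns)


if __name__ == "__main__":
    # 200 us SRTT: the percentage term is below the cap and wins.
    print(comp_sack_delay_ns(200_000))
    # 10 ms SRTT: the percentage term exceeds the cap, so the cap wins.
    print(comp_sack_delay_ns(10_000_000))
```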
@@ -866,8 +885,9 @@ tcp_comp_sack_slack_ns - LONG INTEGER
timer used by SACK compression. This gives extra time
for small RTT flows, and reduces system overhead by allowing
opportunistic reduction of timer interrupts.
+ Values that are too large might reduce goodput.
- Default : 100,000 ns (100 us)
+ Default : 10,000 ns (10 us)
tcp_comp_sack_nr - INTEGER
Max number of SACK that can be compressed.
@@ -1796,6 +1816,23 @@ icmp_errors_use_inbound_ifaddr - BOOLEAN
Default: 0 (disabled)
+icmp_errors_extension_mask - UNSIGNED INTEGER
+ Bitmask of ICMP extensions to append to ICMPv4 error messages
+ ("Destination Unreachable", "Time Exceeded" and "Parameter Problem").
+ The original datagram is trimmed / padded to 128 bytes in order to be
+ compatible with applications that do not comply with RFC 4884.
+
+ Possible extensions are:
+
+ ==== ==============================================================
+ 0x01 Incoming IP interface information according to RFC 5837.
+ Extension will include the index, IPv4 address (if present),
+ name and MTU of the IP interface that received the datagram
+ which elicited the ICMP error.
+ ==== ==============================================================
+
+ Default: 0x00 (no extensions)
+
igmp_max_memberships - INTEGER
Change the maximum number of multicast groups we can subscribe to.
Default: 20
@@ -3262,6 +3299,23 @@ error_anycast_as_unicast - BOOLEAN
Default: 0 (disabled)
+errors_extension_mask - UNSIGNED INTEGER
+ Bitmask of ICMP extensions to append to ICMPv6 error messages
+ ("Destination Unreachable" and "Time Exceeded"). The original datagram
+ is trimmed / padded to 128 bytes in order to be compatible with
+ applications that do not comply with RFC 4884.
+
+ Possible extensions are:
+
+ ==== ==============================================================
+ 0x01 Incoming IP interface information according to RFC 5837.
+ Extension will include the index, IPv6 address (if present),
+ name and MTU of the IP interface that received the datagram
+ which elicited the ICMP error.
+ ==== ==============================================================
+
+ Default: 0x00 (no extensions)
+
xfrm6_gc_thresh - INTEGER
(Obsolete since linux-4.14)
The threshold at which we will start garbage collecting for IPv6
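As a rough illustration of how the tcp_comp_sack_* knobs documented in the ip-sysctl.rst hunks above fit together, the following Python sketch computes a compressed-SACK timer as a percentage of SRTT, capped by tcp_comp_sack_delay_ns. The function name, the parameter names and the integer arithmetic are assumptions made for illustration only; this is not the kernel's code and it ignores tcp_comp_sack_slack_ns.

```python
def comp_sack_delay_ns(srtt_ns, rtt_percent=33, delay_cap_ns=1_000_000):
    """Hypothetical helper: rtt_percent % of SRTT, capped by the delay sysctl."""
    # The documented range for tcp_comp_sack_rtt_percent is 1 - 1000.
    if not 1 <= rtt_percent <= 1000:
        raise ValueError("tcp_comp_sack_rtt_percent must be in 1..1000")
    return min(srtt_ns * rtt_percent // 100, delay_cap_ns)


# 200 us SRTT with the documented defaults (33 %, 1 ms cap).
print(comp_sack_delay_ns(200_000))
# 10 ms SRTT: 33 % would be 3.3 ms, so the 1 ms cap applies.
print(comp_sack_delay_ns(10_000_000))
```

With the documented defaults, the cap only starts to matter once SRTT exceeds roughly 3 ms.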
diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
index 7dd60366f4ff..4e008efebb35 100644
--- a/Documentation/networking/napi.rst
+++ b/Documentation/networking/napi.rst
@@ -263,7 +263,9 @@ are not well known).
Busy polling is enabled by either setting ``SO_BUSY_POLL`` on
selected sockets or using the global ``net.core.busy_poll`` and
``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
-also exists.
+also exists. Threaded polling of NAPI also has a mode to busy poll for
+packets (:ref:`threaded busy polling<threaded_busy_poll>`) using the NAPI
+processing kthread.
epoll-based busy polling
------------------------
@@ -426,6 +428,52 @@ Therefore, setting ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` is
the recommended usage, because otherwise setting ``irq-suspend-timeout``
might not have any discernible effect.
+.. _threaded_busy_poll:
+
+Threaded NAPI busy polling
+--------------------------
+
+Threaded NAPI busy polling extends threaded NAPI and adds support to do
+continuous busy polling of the NAPI. This can be useful for forwarding or
+AF_XDP applications.
+
+Threaded NAPI busy polling can be enabled on a per-NIC-queue basis using Netlink.
+
+For example, using the following script:
+
+.. code-block:: bash
+
+ $ ynl --family netdev --do napi-set \
+ --json='{"id": 66, "threaded": "busy-poll"}'
+
+The kernel will create a kthread that busy polls on this NAPI.
+
+The user may elect to set the CPU affinity of this kthread to an unused CPU
+core to improve how often the NAPI is polled at the expense of wasted CPU
+cycles. Note that this will keep the CPU core busy with 100% usage.
+
+Once threaded busy polling is enabled for a NAPI, the PID of the kthread can
+be retrieved using Netlink so the affinity of the kthread can be set up.
+
+For example, the following script can be used to fetch the PID:
+
+.. code-block:: bash
+
+ $ ynl --family netdev --do napi-get --json='{"id": 66}'
+
+This will output something like the following, where `258` is the PID of the
+kthread that is polling this NAPI.
+
+.. code-block:: bash
+
+ $ {'defer-hard-irqs': 0,
+ 'gro-flush-timeout': 0,
+ 'id': 66,
+ 'ifindex': 2,
+ 'irq-suspend-timeout': 0,
+ 'pid': 258,
+ 'threaded': 'busy-poll'}
+
.. _threaded:
Threaded NAPI
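The napi.rst hunk above describes fetching the kthread PID and then setting its CPU affinity. A minimal Python sketch of those two steps follows, assuming the ``napi-get`` reply is printed as a Python-style dict exactly as in the sample output. The helper names ``kthread_pid`` and ``pin_to_cpu`` are hypothetical, and CPU 3 is an arbitrary example of an otherwise idle core.

```python
import ast
import os


def kthread_pid(napi_get_output):
    """Extract the 'pid' field from a dict-style napi-get reply."""
    reply = ast.literal_eval(napi_get_output.strip())
    return reply["pid"]


def pin_to_cpu(pid, cpu):
    """Pin a task to a single CPU (Linux only; needs sufficient privileges)."""
    os.sched_setaffinity(pid, {cpu})


# Sample reply copied from the documentation text above.
sample = """{'defer-hard-irqs': 0,
 'gro-flush-timeout': 0,
 'id': 66,
 'ifindex': 2,
 'irq-suspend-timeout': 0,
 'pid': 258,
 'threaded': 'busy-poll'}"""

pid = kthread_pid(sample)
print(pid)
# pin_to_cpu(pid, 3)  # left commented out: PID 258 is only an example
```

Pinning is left commented out because it only makes sense on a machine where that PID really is the NAPI kthread.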
diff --git a/Documentation/networking/net_cachelines/inet_connection_sock.rst b/Documentation/networking/net_cachelines/inet_connection_sock.rst
index 8fae85ebb773..cc2000f55c29 100644
--- a/Documentation/networking/net_cachelines/inet_connection_sock.rst
+++ b/Documentation/networking/net_cachelines/inet_connection_sock.rst
@@ -12,8 +12,8 @@ struct inet_sock icsk_inet read_mostly r
struct request_sock_queue icsk_accept_queue
struct inet_bind_bucket icsk_bind_hash read_mostly tcp_set_state
struct inet_bind2_bucket icsk_bind2_hash read_mostly tcp_set_state,inet_put_port
-struct timer_list icsk_retransmit_timer read_write inet_csk_reset_xmit_timer,tcp_connect
struct timer_list icsk_delack_timer read_mostly inet_csk_reset_xmit_timer,tcp_connect
+struct timer_list icsk_keepalive_timer
u32 icsk_rto read_write tcp_cwnd_validate,tcp_schedule_loss_probe,tcp_connect_init,tcp_connect,tcp_write_xmit,tcp_push_one
u32 icsk_rto_min
u32 icsk_rto_max read_mostly tcp_reset_xmit_timer
diff --git a/Documentation/networking/net_cachelines/inet_sock.rst b/Documentation/networking/net_cachelines/inet_sock.rst
index b11bf48fa2b3..4c72a28a7012 100644
--- a/Documentation/networking/net_cachelines/inet_sock.rst
+++ b/Documentation/networking/net_cachelines/inet_sock.rst
@@ -5,42 +5,43 @@
inet_sock struct fast path usage breakdown
==========================================
-======================= ===================== =================== =================== ======================================================================================================
-Type Name fastpath_tx_access fastpath_rx_access comment
-======================= ===================== =================== =================== ======================================================================================================
-struct sock sk read_mostly read_mostly tcp_init_buffer_space,tcp_init_transfer,tcp_finish_connect,tcp_connect,tcp_send_rcvq,tcp_send_syn_data
-struct ipv6_pinfo* pinet6
-be16 inet_sport read_mostly __tcp_transmit_skb
-be32 inet_daddr read_mostly ip_select_ident_segs
-be32 inet_rcv_saddr
-be16 inet_dport read_mostly __tcp_transmit_skb
-u16 inet_num
-be32 inet_saddr
-s16 uc_ttl read_mostly __ip_queue_xmit/ip_select_ttl
-u16 cmsg_flags
-struct ip_options_rcu* inet_opt read_mostly __ip_queue_xmit
-u16 inet_id read_mostly ip_select_ident_segs
-u8 tos read_mostly ip_queue_xmit
-u8 min_ttl
-u8 mc_ttl
-u8 pmtudisc
-u8:1 recverr
-u8:1 is_icsk
-u8:1 freebind
-u8:1 hdrincl
-u8:1 mc_loop
-u8:1 transparent
-u8:1 mc_all
-u8:1 nodefrag
-u8:1 bind_address_no_port
-u8:1 recverr_rfc4884
-u8:1 defer_connect read_mostly tcp_sendmsg_fastopen
-u8 rcv_tos
-u8 convert_csum
-int uc_index
-int mc_index
-be32 mc_addr
-struct ip_mc_socklist* mc_list
-struct inet_cork_full cork read_mostly __tcp_transmit_skb
-struct local_port_range
-======================= ===================== =================== =================== ======================================================================================================
+======================== ===================== =================== =================== ======================================================================================================
+Type Name fastpath_tx_access fastpath_rx_access comment
+======================== ===================== =================== =================== ======================================================================================================
+struct sock sk read_mostly read_mostly tcp_init_buffer_space,tcp_init_transfer,tcp_finish_connect,tcp_connect,tcp_send_rcvq,tcp_send_syn_data
+struct ipv6_pinfo* pinet6
+struct ipv6_fl_socklist* ipv6_fl_list read_mostly tcp_v6_connect,__ip6_datagram_connect,udpv6_sendmsg,rawv6_sendmsg
+be16 inet_sport read_mostly __tcp_transmit_skb
+be32 inet_daddr read_mostly ip_select_ident_segs
+be32 inet_rcv_saddr
+be16 inet_dport read_mostly __tcp_transmit_skb
+u16 inet_num
+be32 inet_saddr
+s16 uc_ttl read_mostly __ip_queue_xmit/ip_select_ttl
+u16 cmsg_flags
+struct ip_options_rcu* inet_opt read_mostly __ip_queue_xmit
+u16 inet_id read_mostly ip_select_ident_segs
+u8 tos read_mostly ip_queue_xmit
+u8 min_ttl
+u8 mc_ttl
+u8 pmtudisc
+u8:1 recverr
+u8:1 is_icsk
+u8:1 freebind
+u8:1 hdrincl
+u8:1 mc_loop
+u8:1 transparent
+u8:1 mc_all
+u8:1 nodefrag
+u8:1 bind_address_no_port
+u8:1 recverr_rfc4884
+u8:1 defer_connect read_mostly tcp_sendmsg_fastopen
+u8 rcv_tos
+u8 convert_csum
+int uc_index
+int mc_index
+be32 mc_addr
+struct ip_mc_socklist* mc_list
+struct inet_cork_full cork read_mostly __tcp_transmit_skb
+struct local_port_range
+======================== ===================== =================== =================== ======================================================================================================
diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
index 6e7b20afd2d4..beaf1880a19b 100644
--- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
+++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
@@ -102,7 +102,8 @@ u8 sysctl_tcp_app_win
u8 sysctl_tcp_frto tcp_enter_loss
u8 sysctl_tcp_nometrics_save TCP_LAST_ACK/tcp_update_metrics
u8 sysctl_tcp_no_ssthresh_metrics_save TCP_LAST_ACK/tcp_(update/init)_metrics
-u8 sysctl_tcp_moderate_rcvbuf read_mostly read_mostly tcp_tso_should_defer(tx);tcp_rcv_space_adjust(rx)
+u8 sysctl_tcp_moderate_rcvbuf read_mostly tcp_rcvbuf_grow()
+u32 sysctl_tcp_rcvbuf_low_rtt read_mostly tcp_rcvbuf_grow()
u8 sysctl_tcp_tso_win_divisor read_mostly tcp_tso_should_defer(tcp_write_xmit)
u8 sysctl_tcp_workaround_signed_windows tcp_select_window
int sysctl_tcp_limit_output_bytes read_mostly tcp_small_queue_check(tcp_write_xmit)
diff --git a/Documentation/networking/net_failover.rst b/Documentation/networking/net_failover.rst
index f4e1b4e07adc..2f776e90d318 100644
--- a/Documentation/networking/net_failover.rst
+++ b/Documentation/networking/net_failover.rst
@@ -96,9 +96,8 @@ needed to these network configuration daemons to make sure that an IP is
received only on the 'failover' device.
Below is the patch snippet used with 'cloud-ifupdown-helper' script found on
-Debian cloud images:
+Debian cloud images::
-::
@@ -27,6 +27,8 @@ do_setup() {
local working="$cfgdir/.$INTERFACE"
local final="$cfgdir/$INTERFACE"
@@ -172,9 +171,8 @@ appropriate FDB entry is added.
The following script is executed on the destination hypervisor once migration
completes, and it reattaches the VF to the VM and brings down the virtio-net
-interface.
+interface::
-::
# reattach-vf.sh
#!/bin/bash
diff --git a/Documentation/networking/netconsole.rst b/Documentation/networking/netconsole.rst
index 59cb9982afe6..4ab5d7b05cf1 100644
--- a/Documentation/networking/netconsole.rst
+++ b/Documentation/networking/netconsole.rst
@@ -19,9 +19,6 @@ Userdata append support by Matthew Wood <thepacketgeek@gmail.com>, Jan 22 2024
Sysdata append support by Breno Leitao <leitao@debian.org>, Jan 15 2025
-Please send bug reports to Matt Mackall <mpm@selenic.com>
-Satyam Sharma <satyam.sharma@gmail.com>, and Cong Wang <xiyou.wangcong@gmail.com>
-
Introduction:
=============
@@ -91,7 +88,7 @@ for example:
nc -u -l -p <port>' / 'nc -u -l <port>
- or::
+ or::
netcat -u -l -p <port>' / 'netcat -u -l <port>
diff --git a/Documentation/networking/nfc.rst b/Documentation/networking/nfc.rst
index 9aab3a88c9b2..401735006143 100644
--- a/Documentation/networking/nfc.rst
+++ b/Documentation/networking/nfc.rst
@@ -71,7 +71,8 @@ Userspace interface
The userspace interface is divided in control operations and low-level data
exchange operation.
-CONTROL OPERATIONS:
+Control operations
+------------------
Generic netlink is used to implement the interface to the control operations.
The operations are composed by commands and events, all listed below:
@@ -100,7 +101,8 @@ relevant information such as the supported NFC protocols.
All polling operations requested through one netlink socket are stopped when
it's closed.
-LOW-LEVEL DATA EXCHANGE:
+Low-level data exchange
+-----------------------
The userspace must use PF_NFC sockets to perform any data communication with
targets. All NFC sockets use AF_NFC::
diff --git a/Documentation/networking/seg6-sysctl.rst b/Documentation/networking/seg6-sysctl.rst
index 07c20e470baf..1b6af4779be1 100644
--- a/Documentation/networking/seg6-sysctl.rst
+++ b/Documentation/networking/seg6-sysctl.rst
@@ -25,6 +25,9 @@ seg6_require_hmac - INTEGER
Default is 0.
+/proc/sys/net/ipv6/seg6_* variables:
+====================================
+
seg6_flowlabel - INTEGER
Controls the behaviour of computing the flowlabel of outer
IPv6 header in case of SR T.encaps
diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
index a874d007f2db..904a910f198e 100644
--- a/Documentation/networking/smc-sysctl.rst
+++ b/Documentation/networking/smc-sysctl.rst
@@ -71,3 +71,43 @@ smcr_max_conns_per_lgr - INTEGER
acceptable value ranges from 16 to 255. Only for SMC-R v2.1 and later.
Default: 255
+
+smcr_max_send_wr - INTEGER
+ So-called work request buffers are SMCR link (and RDMA queue pair) level
+ resources necessary for performing RDMA operations. Since up to 255
+ connections can share a link group and thus also a link and the number
+ of the work request buffers is decided when the link is allocated,
+ depending on the workload it can be a bottleneck in a sense that threads
+ have to wait for work request buffers to become available. Before the
+ introduction of this control the maximal number of work request buffers
+ available on the send path used to be hard coded to 16. With this control
+ it becomes configurable. The acceptable range is between 2 and 2048.
+
+ Please be aware that all the buffers need to be allocated as a physically
+ contiguous array in which each element is a single buffer and has the size
+ of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep retrying
+ with half of the buffer count until it is either successful or (unlikely)
+ we dip below the old hard-coded value of 16, at which point we give up,
+ much as before this control existed.
+
+ Default: 16
+
+smcr_max_recv_wr - INTEGER
+ So-called work request buffers are SMCR link (and RDMA queue pair) level
+ resources necessary for performing RDMA operations. Since up to 255
+ connections can share a link group and thus also a link and the number
+ of the work request buffers is decided when the link is allocated,
+ depending on the workload it can be a bottleneck in a sense that threads
+ have to wait for work request buffers to become available. Before the
+ introduction of this control the maximal number of work request buffers
+ available on the receive path used to be hard coded to 16. With this control
+ it becomes configurable. The acceptable range is between 2 and 2048.
+
+ Please be aware that all the buffers need to be allocated as a physically
+ contiguous array in which each element is a single buffer and has the size
+ of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep retrying
+ with half of the buffer count until it is either successful or (unlikely)
+ we dip below the old hard-coded value of 16, at which point we give up,
+ much as before this control existed.
+
+ Default: 48
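The halving fallback described for smcr_max_send_wr and smcr_max_recv_wr can be sketched in Python as below. This is only a model of the documented behaviour, not the SMC implementation: ``alloc_wr_buffers`` and the ``try_alloc`` callback are hypothetical names, and the callback stands in for the real contiguous allocation.

```python
def alloc_wr_buffers(requested, try_alloc, floor=16):
    """Return the buffer count that could be allocated, or None on giving up.

    try_alloc(count) is a hypothetical callback returning True on success.
    floor models the old hard-coded value of 16.
    """
    count = requested
    while True:
        if try_alloc(count):
            return count
        # Allocation failed: retry with half the buffer count.
        count //= 2
        if count < floor:
            # Dipped below the old hard-coded value: give up.
            return None


# Pretend that contiguous allocations of more than 100 buffers fail.
print(alloc_wr_buffers(2048, lambda n: n <= 100))
# Pretend that every allocation fails.
print(alloc_wr_buffers(2048, lambda n: False))
```

In this model a request for 2048 buffers is retried at 1024, 512, 256 and 128 before succeeding at 64.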
diff --git a/Documentation/networking/statistics.rst b/Documentation/networking/statistics.rst
index 518284e287b0..66b0ef941457 100644
--- a/Documentation/networking/statistics.rst
+++ b/Documentation/networking/statistics.rst
@@ -184,9 +184,11 @@ Protocol-related statistics can be requested in get commands by setting
the `ETHTOOL_FLAG_STATS` flag in `ETHTOOL_A_HEADER_FLAGS`. Currently
statistics are supported in the following commands:
- - `ETHTOOL_MSG_PAUSE_GET`
- `ETHTOOL_MSG_FEC_GET`
+ - `ETHTOOL_MSG_LINKSTATE_GET`
- `ETHTOOL_MSG_MM_GET`
+ - `ETHTOOL_MSG_PAUSE_GET`
+ - `ETHTOOL_MSG_TSINFO_GET`
debugfs
-------
diff --git a/Documentation/networking/tls.rst b/Documentation/networking/tls.rst
index 36cc7afc2527..980c442d7161 100644
--- a/Documentation/networking/tls.rst
+++ b/Documentation/networking/tls.rst
@@ -280,6 +280,26 @@ If the record decrypted turns out to had been padded or is not a data
record it will be decrypted again into a kernel buffer without zero copy.
Such events are counted in the ``TlsDecryptRetry`` statistic.
+TLS_TX_MAX_PAYLOAD_LEN
+~~~~~~~~~~~~~~~~~~~~~~
+
+Specifies the maximum size of the plaintext payload for transmitted TLS records.
+
+When this option is set, the kernel enforces the specified limit on all outgoing
+TLS records. No plaintext fragment will exceed this size. This option can be used
+to implement the TLS Record Size Limit extension [1].
+
+* For TLS 1.2, the value corresponds directly to the record size limit.
+* For TLS 1.3, the value should be set to record_size_limit - 1, since
+ the record size limit includes one additional byte for the ContentType
+ field.
+
+The valid range for this option is 64 to 16384 bytes for TLS 1.2, and 63 to
+16384 bytes for TLS 1.3. The lower minimum for TLS 1.3 accounts for the
+extra byte used by the ContentType field.
+
+[1] https://datatracker.ietf.org/doc/html/rfc8449
+
Statistics
==========
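The relationship between the negotiated record size limit and the value passed to TLS_TX_MAX_PAYLOAD_LEN, as described in the tls.rst hunk above, can be sketched as follows. The helper name is hypothetical and the sketch only performs the arithmetic and range check stated in the text; it does not set the socket option.

```python
def tls_tx_max_payload_len(record_size_limit, tls13):
    """Hypothetical helper: map an RFC 8449 record_size_limit to the option value."""
    # For TLS 1.3 the limit includes one byte for the ContentType field.
    value = record_size_limit - 1 if tls13 else record_size_limit
    low = 63 if tls13 else 64
    if not low <= value <= 16384:
        raise ValueError(f"value {value} outside valid range {low}..16384")
    return value


print(tls_tx_max_payload_len(4096, tls13=False))
print(tls_tx_max_payload_len(4096, tls13=True))
```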
diff --git a/Documentation/networking/xfrm/index.rst b/Documentation/networking/xfrm/index.rst
new file mode 100644
index 000000000000..7d866da836fe
--- /dev/null
+++ b/Documentation/networking/xfrm/index.rst
@@ -0,0 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+XFRM Framework
+==============
+
+.. toctree::
+ :maxdepth: 2
+
+ xfrm_device
+ xfrm_proc
+ xfrm_sync
+ xfrm_sysctl
diff --git a/Documentation/networking/xfrm_device.rst b/Documentation/networking/xfrm/xfrm_device.rst
index 122204da0fff..b0d85a5f57d1 100644
--- a/Documentation/networking/xfrm_device.rst
+++ b/Documentation/networking/xfrm/xfrm_device.rst
@@ -20,11 +20,15 @@ can radically increase throughput and decrease CPU utilization. The XFRM
Device interface allows NIC drivers to offer to the stack access to the
hardware offload.
-Right now, there are two types of hardware offload that kernel supports.
+Right now, there are two types of hardware offload that the kernel supports:
+
* IPsec crypto offload:
+
* NIC performs encrypt/decrypt
* Kernel does everything else
+
* IPsec packet offload:
+
* NIC performs encrypt/decrypt
* NIC does encapsulation
* Kernel and NIC have SA and policy in-sync
@@ -34,7 +38,7 @@ Right now, there are two types of hardware offload that kernel supports.
Userland access to the offload is typically through a system such as
libreswan or KAME/raccoon, but the iproute2 'ip xfrm' command set can
be handy when experimenting. An example command might look something
-like this for crypto offload:
+like this for crypto offload::
ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
reqid 0x07 replay-window 32 \
@@ -42,7 +46,7 @@ like this for crypto offload:
sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \
offload dev eth4 dir in
-and for packet offload
+and for packet offload::
ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
reqid 0x07 replay-window 32 \
@@ -153,26 +157,26 @@ the packet's skb. At this point the data should be decrypted but the
IPsec headers are still in the packet data; they are removed later up
the stack in xfrm_input().
- find and hold the SA that was used to the Rx skb::
+1. Find and hold the SA that was used for the Rx skb::
- get spi, protocol, and destination IP from packet headers
+ /* get spi, protocol, and destination IP from packet headers */
xs = find xs from (spi, protocol, dest_IP)
xfrm_state_hold(xs);
- store the state information into the skb::
+2. Store the state information into the skb::
sp = secpath_set(skb);
if (!sp) return;
sp->xvec[sp->len++] = xs;
sp->olen++;
- indicate the success and/or error status of the offload::
+3. Indicate the success and/or error status of the offload::
xo = xfrm_offload(skb);
xo->flags = CRYPTO_DONE;
xo->status = crypto_status;
- hand the packet to napi_gro_receive() as usual
+4. Hand the packet to napi_gro_receive() as usual.
In ESN mode, xdo_dev_state_advance_esn() is called from
xfrm_replay_advance_esn() for RX, and xfrm_replay_overflow_offload_esn for TX.
diff --git a/Documentation/networking/xfrm_proc.rst b/Documentation/networking/xfrm/xfrm_proc.rst
index 973d1571acac..973d1571acac 100644
--- a/Documentation/networking/xfrm_proc.rst
+++ b/Documentation/networking/xfrm/xfrm_proc.rst
diff --git a/Documentation/networking/xfrm_sync.rst b/Documentation/networking/xfrm/xfrm_sync.rst
index 6246503ceab2..dfc2ec0df380 100644
--- a/Documentation/networking/xfrm_sync.rst
+++ b/Documentation/networking/xfrm/xfrm_sync.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
-====
-XFRM
-====
+=========
+XFRM sync
+=========
The sync patches work is based on initial patches from
Krisztian <hidden@balabit.hu> and others and additional patches
@@ -36,7 +36,7 @@ is not driven by packet arrival.
- the replay sequence for both inbound and outbound
1) Message Structure
-----------------------
+--------------------
nlmsghdr:aevent_id:optional-TLVs.
@@ -83,31 +83,31 @@ when going from kernel to user space)
A program needs to subscribe to multicast group XFRMNLGRP_AEVENTS
to get notified of these events.
-2) TLVS reflect the different parameters:
------------------------------------------
+2) TLVS reflect the different parameters
+----------------------------------------
a) byte value (XFRMA_LTIME_VAL)
-This TLV carries the running/current counter for byte lifetime since
-last event.
+ This TLV carries the running/current counter for byte lifetime since
+ last event.
-b)replay value (XFRMA_REPLAY_VAL)
+b) replay value (XFRMA_REPLAY_VAL)
-This TLV carries the running/current counter for replay sequence since
-last event.
+ This TLV carries the running/current counter for replay sequence since
+ last event.
-c)replay threshold (XFRMA_REPLAY_THRESH)
+c) replay threshold (XFRMA_REPLAY_THRESH)
-This TLV carries the threshold being used by the kernel to trigger events
-when the replay sequence is exceeded.
+ This TLV carries the threshold being used by the kernel to trigger events
+ when the replay sequence is exceeded.
d) expiry timer (XFRMA_ETIMER_THRESH)
-This is a timer value in milliseconds which is used as the nagle
-value to rate limit the events.
+ This is a timer value in milliseconds which is used as the nagle
+ value to rate limit the events.
-3) Default configurations for the parameters:
----------------------------------------------
+3) Default configurations for the parameters
+--------------------------------------------
By default these events should be turned off unless there is
at least one listener registered to listen to the multicast
@@ -121,12 +121,14 @@ in case they are not specified.
the two sysctls/proc entries are:
a) /proc/sys/net/core/sysctl_xfrm_aevent_etime
-used to provide default values for the XFRMA_ETIMER_THRESH in incremental
-units of time of 100ms. The default is 10 (1 second)
+
+ Used to provide default values for the XFRMA_ETIMER_THRESH in incremental
+ units of time of 100ms. The default is 10 (1 second)
b) /proc/sys/net/core/sysctl_xfrm_aevent_rseqth
-used to provide default values for XFRMA_REPLAY_THRESH parameter
-in incremental packet count. The default is two packets.
+
+ Used to provide default values for XFRMA_REPLAY_THRESH parameter
+ in incremental packet count. The default is two packets.
4) Message types
----------------
@@ -134,50 +136,51 @@ in incremental packet count. The default is two packets.
a) XFRM_MSG_GETAE issued by user-->kernel.
XFRM_MSG_GETAE does not carry any TLVs.
-The response is a XFRM_MSG_NEWAE which is formatted based on what
-XFRM_MSG_GETAE queried for.
+ The response is a XFRM_MSG_NEWAE which is formatted based on what
+ XFRM_MSG_GETAE queried for.
+
+ The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
-The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
-* if XFRM_AE_RTHR flag is set, then XFRMA_REPLAY_THRESH is also retrieved
-* if XFRM_AE_ETHR flag is set, then XFRMA_ETIMER_THRESH is also retrieved
+ * if XFRM_AE_RTHR flag is set, then XFRMA_REPLAY_THRESH is also retrieved
+ * if XFRM_AE_ETHR flag is set, then XFRMA_ETIMER_THRESH is also retrieved
b) XFRM_MSG_NEWAE is issued by either user space to configure
or kernel to announce events or respond to a XFRM_MSG_GETAE.
-i) user --> kernel to configure a specific SA.
+ i) user --> kernel to configure a specific SA.
-any of the values or threshold parameters can be updated by passing the
-appropriate TLV.
+ any of the values or threshold parameters can be updated by passing the
+ appropriate TLV.
-A response is issued back to the sender in user space to indicate success
-or failure.
+ A response is issued back to the sender in user space to indicate success
+ or failure.
-In the case of success, additionally an event with
-XFRM_MSG_NEWAE is also issued to any listeners as described in iii).
+ In the case of success, additionally an event with
+ XFRM_MSG_NEWAE is also issued to any listeners as described in iii).
-ii) kernel->user direction as a response to XFRM_MSG_GETAE
+ ii) kernel->user direction as a response to XFRM_MSG_GETAE
-The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
+ The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
-The threshold TLVs will be included if explicitly requested in
-the XFRM_MSG_GETAE message.
+ The threshold TLVs will be included if explicitly requested in
+ the XFRM_MSG_GETAE message.
-iii) kernel->user to report as event if someone sets any values or
- thresholds for an SA using XFRM_MSG_NEWAE (as described in #i above).
- In such a case XFRM_AE_CU flag is set to inform the user that
- the change happened as a result of an update.
- The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
+ iii) kernel->user to report as event if someone sets any values or
+ thresholds for an SA using XFRM_MSG_NEWAE (as described in #i above).
+ In such a case XFRM_AE_CU flag is set to inform the user that
+ the change happened as a result of an update.
+ The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
-iv) kernel->user to report event when replay threshold or a timeout
- is exceeded.
+ iv) kernel->user to report event when replay threshold or a timeout
+ is exceeded.
In such a case either XFRM_AE_CR (replay exceeded) or XFRM_AE_CE (timeout
happened) is set to inform the user what happened.
Note the two flags are mutually exclusive.
The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
-Exceptions to threshold settings
---------------------------------
+5) Exceptions to threshold settings
+-----------------------------------
If you have an SA that is getting hit by traffic in bursts such that
there is a period where the timer threshold expires with no packets
diff --git a/Documentation/networking/xfrm_sysctl.rst b/Documentation/networking/xfrm/xfrm_sysctl.rst
index 47b9bbdd0179..7d0c4b17c0bd 100644
--- a/Documentation/networking/xfrm_sysctl.rst
+++ b/Documentation/networking/xfrm/xfrm_sysctl.rst
@@ -4,8 +4,8 @@
XFRM Syscall
============
-/proc/sys/net/core/xfrm_* Variables:
-====================================
+/proc/sys/net/core/xfrm_* Variables
+===================================
xfrm_acq_expires - INTEGER
default 30 - hard timeout in seconds for acquire requests