Diffstat (limited to 'Documentation/networking/xfrm')
-rw-r--r-- Documentation/networking/xfrm/index.rst | 13
-rw-r--r-- Documentation/networking/xfrm/xfrm_device.rst | 206
-rw-r--r-- Documentation/networking/xfrm/xfrm_proc.rst | 119
-rw-r--r-- Documentation/networking/xfrm/xfrm_sync.rst | 192
-rw-r--r-- Documentation/networking/xfrm/xfrm_sysctl.rst | 11
5 files changed, 541 insertions, 0 deletions
diff --git a/Documentation/networking/xfrm/index.rst b/Documentation/networking/xfrm/index.rst
new file mode 100644
index 000000000000..7d866da836fe
--- /dev/null
+++ b/Documentation/networking/xfrm/index.rst
@@ -0,0 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+XFRM Framework
+==============
+
+.. toctree::
+ :maxdepth: 2
+
+ xfrm_device
+ xfrm_proc
+ xfrm_sync
+ xfrm_sysctl
diff --git a/Documentation/networking/xfrm/xfrm_device.rst b/Documentation/networking/xfrm/xfrm_device.rst
new file mode 100644
index 000000000000..b0d85a5f57d1
--- /dev/null
+++ b/Documentation/networking/xfrm/xfrm_device.rst
@@ -0,0 +1,206 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _xfrm_device:
+
+===============================================
+XFRM device - offloading the IPsec computations
+===============================================
+
+Shannon Nelson <shannon.nelson@oracle.com>
+Leon Romanovsky <leonro@nvidia.com>
+
+
+Overview
+========
+
+IPsec is a useful feature for securing network traffic, but the
+computational cost is high: a 10Gbps link can easily be brought down
+to under 1Gbps, depending on the traffic and link configuration.
+Luckily, there are NICs that offer a hardware-based IPsec offload which
+can radically increase throughput and decrease CPU utilization. The XFRM
+Device interface allows NIC drivers to offer the stack access to the
+hardware offload.
+
+Right now, there are two types of hardware offload that the kernel supports:
+
+ * IPsec crypto offload:
+
+ * NIC performs encrypt/decrypt
+ * Kernel does everything else
+
+ * IPsec packet offload:
+
+ * NIC performs encrypt/decrypt
+ * NIC does encapsulation
+ * Kernel and NIC have SA and policy in-sync
+ * NIC handles the SA and policy states
+ * Kernel talks to the key manager
+
+Userland access to the offload is typically through a system such as
+libreswan or KAME/racoon, but the iproute2 'ip xfrm' command set can
+be handy when experimenting. An example command might look something
+like this for crypto offload::
+
+ ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
+ reqid 0x07 replay-window 32 \
+ aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \
+ sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \
+ offload dev eth4 dir in
+
+and for packet offload::
+
+ ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
+ reqid 0x07 replay-window 32 \
+ aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \
+ sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \
+ offload packet dev eth4 dir in
+
+ ip x p add src 14.0.0.70 dst 14.0.0.52 offload packet dev eth4 dir in \
+ tmpl src 14.0.0.70 dst 14.0.0.52 proto esp reqid 10000 mode transport
+
+Yes, that's ugly, but that's what shell scripts and/or libreswan are for.
+
+
+
+Callbacks to implement
+======================
+
+::
+
+ /* from include/linux/netdevice.h */
+ struct xfrmdev_ops {
+ /* Crypto and Packet offload callbacks */
+ int (*xdo_dev_state_add)(struct net_device *dev,
+ struct xfrm_state *x,
+ struct netlink_ext_ack *extack);
+ void (*xdo_dev_state_delete)(struct net_device *dev,
+ struct xfrm_state *x);
+ void (*xdo_dev_state_free)(struct net_device *dev,
+ struct xfrm_state *x);
+ bool (*xdo_dev_offload_ok) (struct sk_buff *skb,
+ struct xfrm_state *x);
+ void (*xdo_dev_state_advance_esn) (struct xfrm_state *x);
+ void (*xdo_dev_state_update_stats) (struct xfrm_state *x);
+
+ /* Solely packet offload callbacks */
+ int (*xdo_dev_policy_add) (struct xfrm_policy *x, struct netlink_ext_ack *extack);
+ void (*xdo_dev_policy_delete) (struct xfrm_policy *x);
+ void (*xdo_dev_policy_free) (struct xfrm_policy *x);
+ };
+
+The NIC driver offering IPsec offload will need to implement the callbacks
+relevant to the supported offload type to make the offload available to the
+network stack's XFRM subsystem. Additionally, the feature bits NETIF_F_HW_ESP
+and NETIF_F_HW_ESP_TX_CSUM will signal the availability of the offload.
+
+
+
+Flow
+====
+
+At probe time and before the call to register_netdev(), the driver should
+set up local data structures and XFRM callbacks, and set the feature bits.
+The XFRM code's listener will finish the setup on NETDEV_REGISTER.
+
+::
+
+ adapter->netdev->xfrmdev_ops = &ixgbe_xfrmdev_ops;
+ adapter->netdev->features |= NETIF_F_HW_ESP;
+ adapter->netdev->hw_enc_features |= NETIF_F_HW_ESP;
+
+When new SAs are set up with a request for the "offload" feature, the
+driver's xdo_dev_state_add() will be given the new SA to be offloaded
+and an indication of whether it is for Rx or Tx. The driver should
+
+ - verify the algorithm is supported for offloads
+ - store the SA information (key, salt, target-ip, protocol, etc)
+ - enable the HW offload of the SA
+ - return status value:
+
+ =========== ===================================
+ 0 success
+ -EOPNOTSUPP offload not supported, try SW IPsec,
+ not applicable for packet offload mode
+ other fail the request
+ =========== ===================================
+
+The driver can also set an offload_handle in the SA, an opaque void pointer
+that can be used to convey context into the fast-path offload requests::
+
+ xs->xso.offload_handle = context;
+
+
+When the network stack is preparing an IPsec packet for an SA that has
+been set up for offload, it first calls into xdo_dev_offload_ok() with
+the skb and the intended offload state to ask the driver if the offload
+will be serviceable. This can check the packet information to be sure the
+offload can be supported (e.g. IPv4 or IPv6, no IPv4 options, etc) and
+return true or false to signify its support. If the driver doesn't implement
+this callback, the stack provides reasonable defaults.
+
+Crypto offload mode:
+When ready to send, the driver needs to inspect the Tx packet for the
+offload information, including the opaque context, and set up the packet
+send accordingly::
+
+ xs = xfrm_input_state(skb);
+ context = xs->xso.offload_handle;
+ set up HW for send
+
+The stack has already inserted the appropriate IPsec headers in the
+packet data, the offload just needs to do the encryption and fix up the
+header values.
+
+
+When a packet is received and the HW has indicated that it offloaded a
+decryption, the driver needs to add a reference to the decoded SA into
+the packet's skb. At this point the data should be decrypted but the
+IPsec headers are still in the packet data; they are removed later up
+the stack in xfrm_input().
+
+1. Find and hold the SA that was used for the Rx skb::
+
+ /* get spi, protocol, and destination IP from packet headers */
+ xs = find xs from (spi, protocol, dest_IP)
+ xfrm_state_hold(xs);
+
+2. Store the state information into the skb::
+
+ sp = secpath_set(skb);
+ if (!sp) return;
+ sp->xvec[sp->len++] = xs;
+ sp->olen++;
+
+3. Indicate the success and/or error status of the offload::
+
+ xo = xfrm_offload(skb);
+ xo->flags = CRYPTO_DONE;
+ xo->status = crypto_status;
+
+4. Hand the packet to napi_gro_receive() as usual.
+
+In ESN mode, xdo_dev_state_advance_esn() is called from
+xfrm_replay_advance_esn() for RX, and xfrm_replay_overflow_offload_esn() for TX.
+The driver checks the packet sequence number and updates the HW ESN state machine if needed.
+
+Packet offload mode:
+HW adds and deletes XFRM headers. So in the RX path, the XFRM stack is bypassed
+if HW reported success. In the TX path, the packet leaves the kernel without the
+extra header and unencrypted; the HW is responsible for encapsulating and encrypting it.
+
+When the SA is removed by the user, the driver's xdo_dev_state_delete()
+and xdo_dev_policy_delete() are asked to disable the offload. Later,
+xdo_dev_state_free() and xdo_dev_policy_free() are called from a garbage
+collection routine after all reference counts to the state and policy
+have been removed and any remaining resources can be cleared for the
+offload state. How these are used by the driver will depend on specific
+hardware needs.
+
+When a netdev is set to DOWN, the XFRM stack's netdev listener will call
+xdo_dev_state_delete(), xdo_dev_policy_delete(), xdo_dev_state_free() and
+xdo_dev_policy_free() on any remaining offloaded states and policies.
+
+Because the HW handles the packets, the XFRM core can't count hard and soft
+limits. The HW/driver is responsible for doing so and for providing accurate
+data when xdo_dev_state_update_stats() is called. If one of these limits is
+reached, the driver needs to call xfrm_state_check_expire() to make sure
+that XFRM performs the rekeying sequence.
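
The status table for xdo_dev_state_add() in xfrm_device.rst can be read as a small decision rule. The following is a minimal userspace sketch of that reading, not kernel code: the `demo_` names are hypothetical, and treating -EOPNOTSUPP as a failure in packet offload mode is an assumption drawn from the "not applicable for packet offload mode" note in the table.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcomes a caller could derive from the driver's return value. */
enum demo_outcome {
	DEMO_OFFLOADED,   /* 0: the SA is offloaded to the NIC */
	DEMO_SW_FALLBACK, /* -EOPNOTSUPP in crypto mode: try SW IPsec */
	DEMO_FAILED,      /* anything else: fail the request */
};

/*
 * Classify a return value according to the status table.
 * Assumption: in packet offload mode there is no software fallback,
 * so -EOPNOTSUPP is treated like any other error.
 */
enum demo_outcome demo_classify_state_add(int ret, bool packet_mode)
{
	if (ret == 0)
		return DEMO_OFFLOADED;
	if (ret == -EOPNOTSUPP && !packet_mode)
		return DEMO_SW_FALLBACK;
	return DEMO_FAILED;
}
```

A driver author can use this as a checklist: return -EOPNOTSUPP only when falling back to software IPsec is acceptable, and another error code when the request must be rejected.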
diff --git a/Documentation/networking/xfrm/xfrm_proc.rst b/Documentation/networking/xfrm/xfrm_proc.rst
new file mode 100644
index 000000000000..973d1571acac
--- /dev/null
+++ b/Documentation/networking/xfrm/xfrm_proc.rst
@@ -0,0 +1,119 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+XFRM proc - /proc/net/xfrm_* files
+==================================
+
+Masahide NAKAMURA <nakam@linux-ipv6.org>
+
+
+Transformation Statistics
+-------------------------
+
+The xfrm_proc code is a set of statistics showing numbers of packets
+dropped by the transformation code and why. These counters are defined
+as part of the Linux private MIB and can be viewed in
+/proc/net/xfrm_stat.
+
+
+Inbound errors
+~~~~~~~~~~~~~~
+
+XfrmInError:
+ All errors not matched by any other counter
+
+XfrmInBufferError:
+ No buffer is left
+
+XfrmInHdrError:
+ Header error
+
+XfrmInNoStates:
+ No state is found
+ i.e. the inbound SPI, address, or IPsec protocol of the SA is wrong
+
+XfrmInStateProtoError:
+ Transformation protocol specific error
+ e.g. SA key is wrong
+
+XfrmInStateModeError:
+ Transformation mode specific error
+
+XfrmInStateSeqError:
+ Sequence error
+ i.e. Sequence number is out of window
+
+XfrmInStateExpired:
+ State is expired
+
+XfrmInStateMismatch:
+ State has a mismatched option
+ e.g. UDP encapsulation type is mismatched
+
+XfrmInStateInvalid:
+ State is invalid
+
+XfrmInTmplMismatch:
+ No matching template for states
+ e.g. Inbound SAs are correct but SP rule is wrong
+
+XfrmInNoPols:
+ No policy is found for states
+ e.g. Inbound SAs are correct but no SP is found
+
+XfrmInPolBlock:
+ Policy discards
+
+XfrmInPolError:
+ Policy error
+
+XfrmAcquireError:
+ State hasn't been fully acquired before use
+
+XfrmFwdHdrError:
+ Forward routing of a packet is not allowed
+
+XfrmInStateDirError:
+ State direction mismatch (lookup found an output state on the input path, expected input or no direction)
+
+Outbound errors
+~~~~~~~~~~~~~~~
+XfrmOutError:
+ All errors not matched by any other counter
+
+XfrmOutBundleGenError:
+ Bundle generation error
+
+XfrmOutBundleCheckError:
+ Bundle check error
+
+XfrmOutNoStates:
+ No state is found
+
+XfrmOutStateProtoError:
+ Transformation protocol specific error
+
+XfrmOutStateModeError:
+ Transformation mode specific error
+
+XfrmOutStateSeqError:
+ Sequence error
+ i.e. Sequence number overflow
+
+XfrmOutStateExpired:
+ State is expired
+
+XfrmOutPolBlock:
+ Policy discards
+
+XfrmOutPolDead:
+ Policy is dead
+
+XfrmOutPolError:
+ Policy error
+
+XfrmOutStateInvalid:
+ State is invalid, perhaps expired
+
+XfrmOutStateDirError:
+ State direction mismatch (lookup found an input state on the output path, expected output or no direction)
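
The counters described in xfrm_proc.rst can be read programmatically. The sketch below is a minimal example that assumes each line of /proc/net/xfrm_stat has the form "name, whitespace, unsigned value"; the text does not specify that layout, and the `demo_` helpers are hypothetical names, not part of any kernel or library API.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse one "<name> <value>" line (assumed layout).
 * Returns 1 when both fields were read, 0 otherwise.
 */
int demo_parse_stat_line(const char *line, char name[64], unsigned long *value)
{
	return sscanf(line, "%63s %lu", name, value) == 2;
}

/* Sum the values of all counters whose name starts with prefix. */
unsigned long demo_sum_prefix(const char *const lines[], size_t n,
			      const char *prefix)
{
	unsigned long total = 0;
	size_t plen = strlen(prefix);

	for (size_t i = 0; i < n; i++) {
		char name[64];
		unsigned long value;

		/* Lines that do not parse are skipped silently. */
		if (demo_parse_stat_line(lines[i], name, &value) &&
		    strncmp(name, prefix, plen) == 0)
			total += value;
	}
	return total;
}
```

Note that a prefix such as "XfrmIn" does not cover XfrmAcquireError or XfrmFwdHdrError, which the document lists among the inbound errors; a monitoring tool would need to add those by name.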
diff --git a/Documentation/networking/xfrm/xfrm_sync.rst b/Documentation/networking/xfrm/xfrm_sync.rst
new file mode 100644
index 000000000000..dfc2ec0df380
--- /dev/null
+++ b/Documentation/networking/xfrm/xfrm_sync.rst
@@ -0,0 +1,192 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========
+XFRM sync
+=========
+
+The sync patches work is based on initial patches from
+Krisztian <hidden@balabit.hu> and others and additional patches
+from Jamal <hadi@cyberus.ca>.
+
+The end goal for syncing is to be able to insert attributes + generate
+events so that the SA can be safely moved from one machine to another
+for HA purposes.
+The idea is to synchronize the SA so that the takeover machine can do
+the processing of the SA as accurately as possible if it has access to it.
+
+We already have the ability to generate SA add/del/upd events.
+These patches add the ability to sync and have accurate lifetime byte (to
+ensure proper decay of SAs) and replay counters to avoid replay attacks
+with minimal loss at failover time.
+This way a backup stays as closely up-to-date as an active member.
+
+Because the above items change for every packet the SA receives,
+a lot of events can be generated.
+For this reason, we also add a nagle-like algorithm to restrict
+the events, i.e. we are going to set thresholds to say "let me
+know if the replay sequence threshold is reached or 10 secs have passed".
+These thresholds are set system-wide via sysctls or can be updated
+per SA.
+
+The identified items that need to be synchronized are:
+
+- the lifetime byte counter. Note that the lifetime time limit is not
+  important if you assume the failover machine is known ahead of time,
+  since the decay of the time countdown is not driven by packet arrival.
+- the replay sequence for both inbound and outbound
+
+1) Message Structure
+--------------------
+
+nlmsghdr:aevent_id:optional-TLVs.
+
+The netlink message types are:
+
+XFRM_MSG_NEWAE and XFRM_MSG_GETAE.
+
+An XFRM_MSG_GETAE does not have TLVs.
+
+An XFRM_MSG_NEWAE will have at least two TLVs (as is
+discussed further below).
+
+aevent_id structure looks like::
+
+ struct xfrm_aevent_id {
+ struct xfrm_usersa_id sa_id;
+ xfrm_address_t saddr;
+ __u32 flags;
+ __u32 reqid;
+ };
+
+The unique SA is identified by the combination of xfrm_usersa_id,
+reqid and saddr.
+
+flags are used to indicate different things. The possible
+flags are::
+
+ XFRM_AE_RTHR=1, /* replay threshold */
+ XFRM_AE_RVAL=2, /* replay value */
+ XFRM_AE_LVAL=4, /* lifetime value */
+ XFRM_AE_ETHR=8, /* expiry timer threshold */
+ XFRM_AE_CR=16, /* Event cause is replay update */
+ XFRM_AE_CE=32, /* Event cause is timer expiry */
+ XFRM_AE_CU=64, /* Event cause is policy update */
+
+How these flags are used is dependent on the direction of the
+message (kernel<->user) as well as the cause (config, query or event).
+This is described below in the different messages.
+
+The pid will be set appropriately in netlink to recognize direction
+(0 to the kernel and pid = processid that created the event
+when going from kernel to user space).
+
+A program needs to subscribe to multicast group XFRMNLGRP_AEVENTS
+to get notified of these events.
+
+2) TLVs reflect the different parameters
+----------------------------------------
+
+a) byte value (XFRMA_LTIME_VAL)
+
+ This TLV carries the running/current counter for byte lifetime since
+ last event.
+
+b) replay value (XFRMA_REPLAY_VAL)
+
+ This TLV carries the running/current counter for replay sequence since
+ last event.
+
+c) replay threshold (XFRMA_REPLAY_THRESH)
+
+ This TLV carries the threshold being used by the kernel to trigger events
+ when the replay sequence is exceeded.
+
+d) expiry timer (XFRMA_ETIMER_THRESH)
+
+ This is a timer value in milliseconds which is used as the nagle
+ value to rate limit the events.
+
+3) Default configurations for the parameters
+--------------------------------------------
+
+By default these events should be turned off unless there is
+at least one listener registered to listen to the multicast
+group XFRMNLGRP_AEVENTS.
+
+Programs installing SAs will need to specify the two thresholds. However,
+so that existing applications such as racoon do not have to change,
+we also provide default threshold values for these different parameters
+in case they are not specified.
+
+The two sysctls/proc entries are:
+
+a) /proc/sys/net/core/xfrm_aevent_etime
+
+ Used to provide default values for the XFRMA_ETIMER_THRESH in incremental
+ units of time of 100ms. The default is 10 (1 second).
+
+b) /proc/sys/net/core/xfrm_aevent_rseqth
+
+ Used to provide default values for the XFRMA_REPLAY_THRESH parameter
+ in incremental packet count. The default is two packets.
+
+4) Message types
+----------------
+
+a) XFRM_MSG_GETAE issued by user-->kernel.
+ XFRM_MSG_GETAE does not carry any TLVs.
+
+ The response is an XFRM_MSG_NEWAE which is formatted based on what
+ XFRM_MSG_GETAE queried for.
+
+ The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
+
+ * if XFRM_AE_RTHR flag is set, then XFRMA_REPLAY_THRESH is also retrieved
+ * if XFRM_AE_ETHR flag is set, then XFRMA_ETIMER_THRESH is also retrieved
+
+b) XFRM_MSG_NEWAE is issued by either user space to configure
+ or kernel to announce events or respond to an XFRM_MSG_GETAE.
+
+ i) user --> kernel to configure a specific SA.
+
+ Any of the values or threshold parameters can be updated by passing the
+ appropriate TLV.
+
+ A response is issued back to the sender in user space to indicate success
+ or failure.
+
+ On success, an XFRM_MSG_NEWAE event is also issued to any
+ listeners as described in iii).
+
+ ii) kernel->user direction as a response to XFRM_MSG_GETAE
+
+ The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
+
+ The threshold TLVs will be included if explicitly requested in
+ the XFRM_MSG_GETAE message.
+
+ iii) kernel->user to report as event if someone sets any values or
+ thresholds for an SA using XFRM_MSG_NEWAE (as described in #i above).
+ In such a case XFRM_AE_CU flag is set to inform the user that
+ the change happened as a result of an update.
+ The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
+
+ iv) kernel->user to report event when replay threshold or a timeout
+ is exceeded.
+
+In such a case either XFRM_AE_CR (replay exceeded) or XFRM_AE_CE (timeout
+happened) is set to inform the user what happened.
+Note the two flags are mutually exclusive.
+The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
+
+5) Exceptions to threshold settings
+-----------------------------------
+
+If you have an SA that is getting hit by traffic in bursts such that
+there is a period where the timer threshold expires with no packets
+seen, then an odd behavior is seen as follows:
+The first packet arrival after a timer expiry will trigger a timeout
+event; i.e. we don't wait for a timeout period or a packet threshold
+to be reached. This is done for simplicity and efficiency reasons.
+
+-JHS
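
The flag and threshold rules in xfrm_sync.rst can be illustrated with a short userspace sketch. The flag values are copied from the list in that file; the `DEMO_`/`demo_` names are hypothetical and local to this example, and the helpers only restate two points from the text: XFRM_AE_CR and XFRM_AE_CE are mutually exclusive, and the etime default is expressed in units of 100ms.

```c
/* Flag values as listed in xfrm_sync.rst; DEMO_ names are local stand-ins. */
enum {
	DEMO_AE_RTHR = 1,  /* replay threshold */
	DEMO_AE_RVAL = 2,  /* replay value */
	DEMO_AE_LVAL = 4,  /* lifetime value */
	DEMO_AE_ETHR = 8,  /* expiry timer threshold */
	DEMO_AE_CR   = 16, /* event cause is replay update */
	DEMO_AE_CE   = 32, /* event cause is timer expiry */
	DEMO_AE_CU   = 64, /* event cause is policy update */
};

/* Extract only the cause bits from an aevent flags word. */
unsigned int demo_aevent_cause(unsigned int flags)
{
	return flags & (DEMO_AE_CR | DEMO_AE_CE | DEMO_AE_CU);
}

/* Returns 0 when both CR and CE are set, which the text rules out. */
int demo_aevent_cause_ok(unsigned int flags)
{
	unsigned int both = DEMO_AE_CR | DEMO_AE_CE;

	return (flags & both) != both;
}

/* Convert an etime sysctl value (units of 100ms) to milliseconds. */
unsigned int demo_etime_units_to_ms(unsigned int units)
{
	return units * 100u;
}
```

With the documented default of 10, demo_etime_units_to_ms() yields 1000 ms, matching the "1 second" stated for the etime sysctl.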
diff --git a/Documentation/networking/xfrm/xfrm_sysctl.rst b/Documentation/networking/xfrm/xfrm_sysctl.rst
new file mode 100644
index 000000000000..7d0c4b17c0bd
--- /dev/null
+++ b/Documentation/networking/xfrm/xfrm_sysctl.rst
@@ -0,0 +1,11 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========
+XFRM Sysctl
+===========
+
+/proc/sys/net/core/xfrm_* Variables
+===================================
+
+xfrm_acq_expires - INTEGER
+ default 30 - hard timeout in seconds for acquire requests