diff options
Diffstat (limited to 'Documentation/networking/xfrm')
| -rw-r--r-- | Documentation/networking/xfrm/index.rst | 13 | ||||
| -rw-r--r-- | Documentation/networking/xfrm/xfrm_device.rst | 206 | ||||
| -rw-r--r-- | Documentation/networking/xfrm/xfrm_proc.rst | 119 | ||||
| -rw-r--r-- | Documentation/networking/xfrm/xfrm_sync.rst | 192 | ||||
| -rw-r--r-- | Documentation/networking/xfrm/xfrm_sysctl.rst | 11 |
5 files changed, 541 insertions, 0 deletions
diff --git a/Documentation/networking/xfrm/index.rst b/Documentation/networking/xfrm/index.rst new file mode 100644 index 000000000000..7d866da836fe --- /dev/null +++ b/Documentation/networking/xfrm/index.rst @@ -0,0 +1,13 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============== +XFRM Framework +============== + +.. toctree:: + :maxdepth: 2 + + xfrm_device + xfrm_proc + xfrm_sync + xfrm_sysctl diff --git a/Documentation/networking/xfrm/xfrm_device.rst b/Documentation/networking/xfrm/xfrm_device.rst new file mode 100644 index 000000000000..b0d85a5f57d1 --- /dev/null +++ b/Documentation/networking/xfrm/xfrm_device.rst @@ -0,0 +1,206 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. _xfrm_device: + +=============================================== +XFRM device - offloading the IPsec computations +=============================================== + +Shannon Nelson <shannon.nelson@oracle.com> +Leon Romanovsky <leonro@nvidia.com> + + +Overview +======== + +IPsec is a useful feature for securing network traffic, but the +computational cost is high: a 10Gbps link can easily be brought down +to under 1Gbps, depending on the traffic and link configuration. +Luckily, there are NICs that offer a hardware based IPsec offload which +can radically increase throughput and decrease CPU utilization. The XFRM +Device interface allows NIC drivers to offer to the stack access to the +hardware offload. + +Right now, there are two types of hardware offload that kernel supports: + + * IPsec crypto offload: + + * NIC performs encrypt/decrypt + * Kernel does everything else + + * IPsec packet offload: + + * NIC performs encrypt/decrypt + * NIC does encapsulation + * Kernel and NIC have SA and policy in-sync + * NIC handles the SA and policies states + * The Kernel talks to the keymanager + +Userland access to the offload is typically through a system such as +libreswan or KAME/raccoon, but the iproute2 'ip xfrm' command set can +be handy when experimenting. An example command might look something +like this for crypto offload:: + + ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \ + reqid 0x07 replay-window 32 \ + aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \ + sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \ + offload dev eth4 dir in + +and for packet offload:: + + ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \ + reqid 0x07 replay-window 32 \ + aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \ + sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \ + offload packet dev eth4 dir in + + ip x p add src 14.0.0.70 dst 14.0.0.52 offload packet dev eth4 dir in + tmpl src 14.0.0.70 dst 14.0.0.52 proto esp reqid 10000 mode transport + +Yes, that's ugly, but that's what shell scripts and/or libreswan are for. + + + +Callbacks to implement +====================== + +:: + + /* from include/linux/netdevice.h */ + struct xfrmdev_ops { + /* Crypto and Packet offload callbacks */ + int (*xdo_dev_state_add)(struct net_device *dev, + struct xfrm_state *x, + struct netlink_ext_ack *extack); + void (*xdo_dev_state_delete)(struct net_device *dev, + struct xfrm_state *x); + void (*xdo_dev_state_free)(struct net_device *dev, + struct xfrm_state *x); + bool (*xdo_dev_offload_ok) (struct sk_buff *skb, + struct xfrm_state *x); + void (*xdo_dev_state_advance_esn) (struct xfrm_state *x); + void (*xdo_dev_state_update_stats) (struct xfrm_state *x); + + /* Solely packet offload callbacks */ + int (*xdo_dev_policy_add) (struct xfrm_policy *x, struct netlink_ext_ack *extack); + void (*xdo_dev_policy_delete) (struct xfrm_policy *x); + void (*xdo_dev_policy_free) (struct xfrm_policy *x); + }; + +The NIC driver offering ipsec offload will need to implement callbacks +relevant to supported offload to make the offload available to the network +stack's XFRM subsystem. Additionally, the feature bits NETIF_F_HW_ESP and +NETIF_F_HW_ESP_TX_CSUM will signal the availability of the offload. + + + +Flow +==== + +At probe time and before the call to register_netdev(), the driver should +set up local data structures and XFRM callbacks, and set the feature bits. +The XFRM code's listener will finish the setup on NETDEV_REGISTER. + +:: + + adapter->netdev->xfrmdev_ops = &ixgbe_xfrmdev_ops; + adapter->netdev->features |= NETIF_F_HW_ESP; + adapter->netdev->hw_enc_features |= NETIF_F_HW_ESP; + +When new SAs are set up with a request for "offload" feature, the +driver's xdo_dev_state_add() will be given the new SA to be offloaded +and an indication of whether it is for Rx or Tx. The driver should + + - verify the algorithm is supported for offloads + - store the SA information (key, salt, target-ip, protocol, etc) + - enable the HW offload of the SA + - return status value: + + =========== =================================== + 0 success + -EOPNETSUPP offload not supported, try SW IPsec, + not applicable for packet offload mode + other fail the request + =========== =================================== + +The driver can also set an offload_handle in the SA, an opaque void pointer +that can be used to convey context into the fast-path offload requests:: + + xs->xso.offload_handle = context; + + +When the network stack is preparing an IPsec packet for an SA that has +been setup for offload, it first calls into xdo_dev_offload_ok() with +the skb and the intended offload state to ask the driver if the offload +will serviceable. This can check the packet information to be sure the +offload can be supported (e.g. IPv4 or IPv6, no IPv4 options, etc) and +return true or false to signify its support. In case driver doesn't implement +this callback, the stack provides reasonable defaults. + +Crypto offload mode: +When ready to send, the driver needs to inspect the Tx packet for the +offload information, including the opaque context, and set up the packet +send accordingly:: + + xs = xfrm_input_state(skb); + context = xs->xso.offload_handle; + set up HW for send + +The stack has already inserted the appropriate IPsec headers in the +packet data, the offload just needs to do the encryption and fix up the +header values. + + +When a packet is received and the HW has indicated that it offloaded a +decryption, the driver needs to add a reference to the decoded SA into +the packet's skb. At this point the data should be decrypted but the +IPsec headers are still in the packet data; they are removed later up +the stack in xfrm_input(). + +1. Find and hold the SA that was used to the Rx skb:: + + /* get spi, protocol, and destination IP from packet headers */ + xs = find xs from (spi, protocol, dest_IP) + xfrm_state_hold(xs); + +2. Store the state information into the skb:: + + sp = secpath_set(skb); + if (!sp) return; + sp->xvec[sp->len++] = xs; + sp->olen++; + +3. Indicate the success and/or error status of the offload:: + + xo = xfrm_offload(skb); + xo->flags = CRYPTO_DONE; + xo->status = crypto_status; + +4. Hand the packet to napi_gro_receive() as usual. + +In ESN mode, xdo_dev_state_advance_esn() is called from +xfrm_replay_advance_esn() for RX, and xfrm_replay_overflow_offload_esn for TX. +Driver will check packet seq number and update HW ESN state machine if needed. + +Packet offload mode: +HW adds and deletes XFRM headers. So in RX path, XFRM stack is bypassed if HW +reported success. In TX path, the packet lefts kernel without extra header +and not encrypted, the HW is responsible to perform it. + +When the SA is removed by the user, the driver's xdo_dev_state_delete() +and xdo_dev_policy_delete() are asked to disable the offload. Later, +xdo_dev_state_free() and xdo_dev_policy_free() are called from a garbage +collection routine after all reference counts to the state and policy +have been removed and any remaining resources can be cleared for the +offload state. How these are used by the driver will depend on specific +hardware needs. + +As a netdev is set to DOWN the XFRM stack's netdev listener will call +xdo_dev_state_delete(), xdo_dev_policy_delete(), xdo_dev_state_free() and +xdo_dev_policy_free() on any remaining offloaded states. + +Outcome of HW handling packets, the XFRM core can't count hard, soft limits. +The HW/driver are responsible to perform it and provide accurate data when +xdo_dev_state_update_stats() is called. In case of one of these limits +occuried, the driver needs to call to xfrm_state_check_expire() to make sure +that XFRM performs rekeying sequence. diff --git a/Documentation/networking/xfrm/xfrm_proc.rst b/Documentation/networking/xfrm/xfrm_proc.rst new file mode 100644 index 000000000000..973d1571acac --- /dev/null +++ b/Documentation/networking/xfrm/xfrm_proc.rst @@ -0,0 +1,119 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================== +XFRM proc - /proc/net/xfrm_* files +================================== + +Masahide NAKAMURA <nakam@linux-ipv6.org> + + +Transformation Statistics +------------------------- + +The xfrm_proc code is a set of statistics showing numbers of packets +dropped by the transformation code and why. These counters are defined +as part of the linux private MIB. These counters can be viewed in +/proc/net/xfrm_stat. + + +Inbound errors +~~~~~~~~~~~~~~ + +XfrmInError: + All errors which is not matched others + +XfrmInBufferError: + No buffer is left + +XfrmInHdrError: + Header error + +XfrmInNoStates: + No state is found + i.e. Either inbound SPI, address, or IPsec protocol at SA is wrong + +XfrmInStateProtoError: + Transformation protocol specific error + e.g. SA key is wrong + +XfrmInStateModeError: + Transformation mode specific error + +XfrmInStateSeqError: + Sequence error + i.e. Sequence number is out of window + +XfrmInStateExpired: + State is expired + +XfrmInStateMismatch: + State has mismatch option + e.g. UDP encapsulation type is mismatch + +XfrmInStateInvalid: + State is invalid + +XfrmInTmplMismatch: + No matching template for states + e.g. Inbound SAs are correct but SP rule is wrong + +XfrmInNoPols: + No policy is found for states + e.g. Inbound SAs are correct but no SP is found + +XfrmInPolBlock: + Policy discards + +XfrmInPolError: + Policy error + +XfrmAcquireError: + State hasn't been fully acquired before use + +XfrmFwdHdrError: + Forward routing of a packet is not allowed + +XfrmInStateDirError: + State direction mismatch (lookup found an output state on the input path, expected input or no direction) + +Outbound errors +~~~~~~~~~~~~~~~ +XfrmOutError: + All errors which is not matched others + +XfrmOutBundleGenError: + Bundle generation error + +XfrmOutBundleCheckError: + Bundle check error + +XfrmOutNoStates: + No state is found + +XfrmOutStateProtoError: + Transformation protocol specific error + +XfrmOutStateModeError: + Transformation mode specific error + +XfrmOutStateSeqError: + Sequence error + i.e. Sequence number overflow + +XfrmOutStateExpired: + State is expired + +XfrmOutPolBlock: + Policy discards + +XfrmOutPolDead: + Policy is dead + +XfrmOutPolError: + Policy error + +XfrmOutStateInvalid: + State is invalid, perhaps expired + +XfrmOutStateDirError: + State direction mismatch (lookup found an input state on the output path, expected output or no direction) diff --git a/Documentation/networking/xfrm/xfrm_sync.rst b/Documentation/networking/xfrm/xfrm_sync.rst new file mode 100644 index 000000000000..dfc2ec0df380 --- /dev/null +++ b/Documentation/networking/xfrm/xfrm_sync.rst @@ -0,0 +1,192 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========= +XFRM sync +========= + +The sync patches work is based on initial patches from +Krisztian <hidden@balabit.hu> and others and additional patches +from Jamal <hadi@cyberus.ca>. + +The end goal for syncing is to be able to insert attributes + generate +events so that the SA can be safely moved from one machine to another +for HA purposes. +The idea is to synchronize the SA so that the takeover machine can do +the processing of the SA as accurate as possible if it has access to it. + +We already have the ability to generate SA add/del/upd events. +These patches add ability to sync and have accurate lifetime byte (to +ensure proper decay of SAs) and replay counters to avoid replay attacks +with as minimal loss at failover time. +This way a backup stays as closely up-to-date as an active member. + +Because the above items change for every packet the SA receives, +it is possible for a lot of the events to be generated. +For this reason, we also add a nagle-like algorithm to restrict +the events. i.e we are going to set thresholds to say "let me +know if the replay sequence threshold is reached or 10 secs have passed" +These thresholds are set system-wide via sysctls or can be updated +per SA. + +The identified items that need to be synchronized are: +- the lifetime byte counter +note that: lifetime time limit is not important if you assume the failover +machine is known ahead of time since the decay of the time countdown +is not driven by packet arrival. +- the replay sequence for both inbound and outbound + +1) Message Structure +-------------------- + +nlmsghdr:aevent_id:optional-TLVs. + +The netlink message types are: + +XFRM_MSG_NEWAE and XFRM_MSG_GETAE. + +A XFRM_MSG_GETAE does not have TLVs. + +A XFRM_MSG_NEWAE will have at least two TLVs (as is +discussed further below). + +aevent_id structure looks like:: + + struct xfrm_aevent_id { + struct xfrm_usersa_id sa_id; + xfrm_address_t saddr; + __u32 flags; + __u32 reqid; + }; + +The unique SA is identified by the combination of xfrm_usersa_id, +reqid and saddr. + +flags are used to indicate different things. The possible +flags are:: + + XFRM_AE_RTHR=1, /* replay threshold*/ + XFRM_AE_RVAL=2, /* replay value */ + XFRM_AE_LVAL=4, /* lifetime value */ + XFRM_AE_ETHR=8, /* expiry timer threshold */ + XFRM_AE_CR=16, /* Event cause is replay update */ + XFRM_AE_CE=32, /* Event cause is timer expiry */ + XFRM_AE_CU=64, /* Event cause is policy update */ + +How these flags are used is dependent on the direction of the +message (kernel<->user) as well the cause (config, query or event). +This is described below in the different messages. + +The pid will be set appropriately in netlink to recognize direction +(0 to the kernel and pid = processid that created the event +when going from kernel to user space) + +A program needs to subscribe to multicast group XFRMNLGRP_AEVENTS +to get notified of these events. + +2) TLVS reflect the different parameters +---------------------------------------- + +a) byte value (XFRMA_LTIME_VAL) + + This TLV carries the running/current counter for byte lifetime since + last event. + +b) replay value (XFRMA_REPLAY_VAL) + + This TLV carries the running/current counter for replay sequence since + last event. + +c) replay threshold (XFRMA_REPLAY_THRESH) + + This TLV carries the threshold being used by the kernel to trigger events + when the replay sequence is exceeded. + +d) expiry timer (XFRMA_ETIMER_THRESH) + + This is a timer value in milliseconds which is used as the nagle + value to rate limit the events. + +3) Default configurations for the parameters +-------------------------------------------- + +By default these events should be turned off unless there is +at least one listener registered to listen to the multicast +group XFRMNLGRP_AEVENTS. + +Programs installing SAs will need to specify the two thresholds, however, +in order to not change existing applications such as racoon +we also provide default threshold values for these different parameters +in case they are not specified. + +the two sysctls/proc entries are: + +a) /proc/sys/net/core/sysctl_xfrm_aevent_etime + + Used to provide default values for the XFRMA_ETIMER_THRESH in incremental + units of time of 100ms. The default is 10 (1 second) + +b) /proc/sys/net/core/sysctl_xfrm_aevent_rseqth + + Used to provide default values for XFRMA_REPLAY_THRESH parameter + in incremental packet count. The default is two packets. + +4) Message types +---------------- + +a) XFRM_MSG_GETAE issued by user-->kernel. + XFRM_MSG_GETAE does not carry any TLVs. + + The response is a XFRM_MSG_NEWAE which is formatted based on what + XFRM_MSG_GETAE queried for. + + The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. + + * if XFRM_AE_RTHR flag is set, then XFRMA_REPLAY_THRESH is also retrieved + * if XFRM_AE_ETHR flag is set, then XFRMA_ETIMER_THRESH is also retrieved + +b) XFRM_MSG_NEWAE is issued by either user space to configure + or kernel to announce events or respond to a XFRM_MSG_GETAE. + + i) user --> kernel to configure a specific SA. + + any of the values or threshold parameters can be updated by passing the + appropriate TLV. + + A response is issued back to the sender in user space to indicate success + or failure. + + In the case of success, additionally an event with + XFRM_MSG_NEWAE is also issued to any listeners as described in iii). + + ii) kernel->user direction as a response to XFRM_MSG_GETAE + + The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. + + The threshold TLVs will be included if explicitly requested in + the XFRM_MSG_GETAE message. + + iii) kernel->user to report as event if someone sets any values or + thresholds for an SA using XFRM_MSG_NEWAE (as described in #i above). + In such a case XFRM_AE_CU flag is set to inform the user that + the change happened as a result of an update. + The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. + + iv) kernel->user to report event when replay threshold or a timeout + is exceeded. + +In such a case either XFRM_AE_CR (replay exceeded) or XFRM_AE_CE (timeout +happened) is set to inform the user what happened. +Note the two flags are mutually exclusive. +The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. + +5) Exceptions to threshold settings +----------------------------------- + +If you have an SA that is getting hit by traffic in bursts such that +there is a period where the timer threshold expires with no packets +seen, then an odd behavior is seen as follows: +The first packet arrival after a timer expiry will trigger a timeout +event; i.e we don't wait for a timeout period or a packet threshold +to be reached. This is done for simplicity and efficiency reasons. + +-JHS diff --git a/Documentation/networking/xfrm/xfrm_sysctl.rst b/Documentation/networking/xfrm/xfrm_sysctl.rst new file mode 100644 index 000000000000..7d0c4b17c0bd --- /dev/null +++ b/Documentation/networking/xfrm/xfrm_sysctl.rst @@ -0,0 +1,11 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============ +XFRM Syscall +============ + +/proc/sys/net/core/xfrm_* Variables +=================================== + +xfrm_acq_expires - INTEGER + default 30 - hard timeout in seconds for acquire requests |
