| author | Linus Torvalds <torvalds@linux-foundation.org> | 2026-02-11 19:31:52 -0800 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2026-02-11 19:31:52 -0800 |
| commit | 37a93dd5c49b5fda807fd204edf2547c3493319c | |
| tree | ce1ef5a642b9ea3d7242156438eb96dc5607a752 /net/rds/send.c | |
| parent | 098b6e44cbaa2d526d06af90c862d13fb414a0ec | |
| parent | 83310d613382f74070fc8b402f3f6c2af8439ead | |
Merge tag 'net-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core & protocols:
- A significant effort all around the stack to guide the compiler to
make the right choice when inlining code, to avoid unneeded calls
for small helpers and stack canary overhead in the fast path.
This generates better and faster code with very small or no text
size increases, as in many cases the call generated more code than
the actual inlined helper (see the sketch below).
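A minimal sketch of the annotation pattern this item refers to; foo_fast_path_check() is a made-up helper name, not one from the tree:

#include <linux/bits.h>
#include <linux/compiler.h>

/* Hypothetical fast-path helper: __always_inline guarantees the body
 * is inlined, so the caller pays neither the call sequence nor the
 * callee's stack canary setup; for a body this small, the call would
 * often be larger than the inlined code itself.
 */
static __always_inline bool foo_fast_path_check(unsigned int flags)
{
	return flags & BIT(0);
}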
- Extend the AccECN implementation so that it is now functionally
complete, and allow user-space to enable it on a per network
namespace basis.
- Add support for memory providers with large (above 4K) rx buffers.
Paired with hw-gro, larger rx buffer sizes reduce the number of
buffers traversing the stack, decreasing single stream CPU usage
by up to ~30%.
- Do not add the HBH header to Big TCP GSO packets. This simplifies
the RX path, the TX path and the NIC drivers, and is possible
because user-space taps can now correctly interpret such packets
without the HBH hint.
- Allow IPv6 routes to be configured with a gateway address that is
resolved out of a different interface than the one specified,
aligning IPv6 to IPv4 behavior.
- Multi-queue aware sch_cake. This makes it possible to scale the
rate shaper of sch_cake across multiple CPUs, while still enforcing
a single global rate on the interface.
- Add support for the nbcon (new buffer console) infrastructure to
netconsole, enabling lock-free, priority-based console operations
that are safer in crash scenarios.
- Improve the TCP ipv6 output path to cache the flow information,
saving cpu cycles, reducing cache line misses and stack use.
- Improve netfilter packet tracker to resolve clashes for most
protocols, avoiding unneeded drops on rare occasions.
- Add IP6IP6 tunneling acceleration to the flowtable infrastructure.
- Reduce tcp socket size by one cache line.
- Notify neighbour changes atomically, avoiding inconsistencies
between the notification sequence and the actual states sequence.
- Add vsock namespace support, allowing complete isolation of vsocks
across different network namespaces (see the example below).
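A small user-space sketch of what the isolation scopes: a plain AF_VSOCK listener, using the existing API from <linux/vm_sockets.h>. With vsock namespace support, its binding and CID visibility are confined to the creating task's network namespace rather than being global:

#include <unistd.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

/* Create a vsock listener; with namespaced vsocks, the port is
 * per network namespace rather than machine-wide.
 */
static int make_vsock_listener(unsigned int port)
{
	struct sockaddr_vm addr = {
		.svm_family = AF_VSOCK,
		.svm_cid = VMADDR_CID_ANY,
		.svm_port = port,
	};
	int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(fd, 1) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}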
- Improve xsk generic performance with cache-alignment-oriented
optimizations.
- Support netconsole automatic target recovery, allowing netconsole
to reestablish targets when the underlying low-level interface
comes back online.
Driver API:
- Support for switching the working mode (automatic vs manual) of a
DPLL device via netlink.
- Introduce PHY ports representation to expose multiple front-facing
media ports over a single MAC.
- Introduce "rx-polarity" and "tx-polarity" device tree properties,
to generalize polarity inversion requirements for differential
signaling.
- Add a helper to create, prepare and enable managed clocks (see the
sketch below).
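The pull message does not name the new helper; as a sketch of the managed-clock pattern it builds on, here is the long-standing devm_clk_get_enabled() form ("ref" is a hypothetical clock name):

#include <linux/clk.h>
#include <linux/err.h>
#include <linux/platform_device.h>

/* Acquire, prepare and enable a clock in one call; devres disables
 * and releases it automatically when the device is unbound, so the
 * error and remove paths need no explicit clk_disable_unprepare().
 */
static int foo_probe(struct platform_device *pdev)
{
	struct clk *clk = devm_clk_get_enabled(&pdev->dev, "ref");

	if (IS_ERR(clk))
		return PTR_ERR(clk);

	return 0;
}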
Device drivers:
- Add Huawei hinic3 PF Ethernet driver.
- Add DWMAC glue driver for Motorcomm YT6801 PCIe ethernet
controller.
- Add Ethernet driver for MaxLinear MxL862xx switches.
- Remove parallel-port Ethernet driver.
- Convert existing drivers' timestamp configuration reporting to
hwtstamp_get and remove the legacy ioctl() path (see the sketch
below).
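A rough sketch of the shape of such a conversion; struct foo_priv and its fields are assumptions, not a real driver:

#include <linux/net_tstamp.h>
#include <linux/netdevice.h>

/* Hypothetical driver state. */
struct foo_priv {
	enum hwtstamp_tx_types tx_type;
	enum hwtstamp_rx_filters rx_filter;
};

/* Report the current timestamping state through the .ndo_hwtstamp_get
 * callback and struct kernel_hwtstamp_config, instead of copying a
 * struct hwtstamp_config to user space from a SIOCGHWTSTAMP ioctl
 * handler.
 */
static int foo_hwtstamp_get(struct net_device *dev,
			    struct kernel_hwtstamp_config *config)
{
	struct foo_priv *priv = netdev_priv(dev);

	config->tx_type = priv->tx_type;
	config->rx_filter = priv->rx_filter;
	return 0;
}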
- Convert existing drivers to .get_rx_ring_count(), simplifying the
RX ring count retrieval. Also remove the legacy fallback path (see
the sketch below).
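A hypothetical sketch only; the exact callback signature is an assumption here, while real_num_rx_queues is an existing net_device field many drivers could report directly:

#include <linux/netdevice.h>

/* Report the RX ring count directly, so the core no longer needs the
 * legacy fallback of issuing a full .get_rxnfc(ETHTOOL_GRXRINGS)
 * query just to learn the ring count.
 */
static u32 foo_get_rx_ring_count(struct net_device *dev)
{
	return dev->real_num_rx_queues;
}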
- Ethernet high-speed NICs:
- Broadcom (bnxt, bng):
- bnxt: add FW interface update to support FEC stats histogram
and NVRAM defragmentation
- bng: add TSO and H/W GRO support
- nVidia/Mellanox (mlx5):
- improve latency of channel restart operations, reducing the
used H/W resources
- add TSO support for UDP over GRE over VLAN
- add flow counters support for hardware steering (HWS) rules
- use a static memory area to store headers for H/W GRO,
leading to a 12% RX throughput improvement
- Intel (100G, ice, idpf):
- ice: reorganizes layout of Tx and Rx rings for cacheline
locality and utilizes __cacheline_group* macros on the new
layouts
- ice: introduces Synchronous Ethernet (SyncE) support
- Meta (fbnic):
- adds debugfs for firmware mailbox and tx/rx rings vectors
- Ethernet virtual:
- geneve: introduce GRO/GSO support for double UDP encapsulation
- Ethernet NICs consumer, and embedded:
- Synopsys (stmmac):
- some code refactoring and cleanups
- RealTek (r8169):
- add support for RTL8127ATF (10G Fiber SFP)
- add DASH and LTR support
- Airoha:
- AN8811HB 2.5 Gbps PHY support
- Freescale (fec):
- add XDP zero-copy support
- Thunderbolt:
- add get link setting support to allow bonding
- Renesas:
- add support for RZ/G3L GBETH SoC
- Ethernet switches:
- Maxlinear:
- support R(G)MII slow rate configuration
- add support for Intel GSW150
- Motorcomm (yt921x):
- add DCB/QoS support
- TI:
- icssm-prueth: support bridging (STP/RSTP) via the switchdev
framework
- Ethernet PHYs:
- Realtek:
- enable SGMII and 2500Base-X in-band auto-negotiation
- simplify and reunify C22/C45 drivers
- Micrel: convert bindings to DT schema
- CAN:
- move skb headroom content into skb extensions, making CAN
metadata access more robust (see the sketch below)
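A sketch of the skb-extension pattern this refers to; SKB_EXT_CAN and struct can_skb_ext are assumed names for illustration, not the actual identifiers:

#include <linux/skbuff.h>

/* Hypothetical CAN metadata carried as an skb extension. */
struct can_skb_ext {
	int ifindex;
};

/* skb_ext_add() attaches (or finds) the extension on the skb itself,
 * so the metadata survives operations that would invalidate content
 * stashed in headroom.
 */
static int can_attach_meta(struct sk_buff *skb, int ifindex)
{
	struct can_skb_ext *ext = skb_ext_add(skb, SKB_EXT_CAN);

	if (!ext)
		return -ENOMEM;
	ext->ifindex = ifindex;
	return 0;
}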
- CAN drivers:
- rcar_canfd:
- add support for FD-only mode
- add support for the RZ/T2H SoC
- sja1000: cleanup the CAN state handling
- WiFi:
- implement EPPKE/802.1X over auth frames support
- split up drop reasons better, removing generic RX_DROP
- additional FTM capabilities: 6 GHz support, supported number of
spatial streams and supported number of LTF repetitions
- better mac80211 iterators to enumerate resources
- initial UHR (Wi-Fi 8) support for cfg80211/mac80211
- WiFi drivers:
- Qualcomm/Atheros:
- ath11k: support for Channel Frequency Response measurement
- ath12k: a significant driver refactor to support multi-wiphy
devices and pave the way for future device support in the
same driver (rather than splitting to ath13k)
- ath12k: support for the QCC2072 chipset
- Intel:
- iwlwifi: partial Neighbor Awareness Networking (NAN) support
- iwlwifi: initial support for U-NII-9 and IEEE 802.11bn
- RealTek (rtw89):
- preparations for RTL8922DE support
- Bluetooth:
- implement setsockopt(BT_PHY) to set the connection packet type/PHY (see the example below)
- set link_policy on incoming ACL connections
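An illustrative user-space use of the setsockopt(BT_PHY) item above, on an already-connected Bluetooth socket; SOL_BLUETOOTH and the BT_PHY* constants are pre-existing, but the exact set of values the kernel accepts here is an assumption:

#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>
#include <bluetooth/bluetooth.h>

/* Ask the kernel to restrict the connection to the LE 2M PHY. */
static int request_le_2m_phy(int fd)
{
	uint32_t phys = BT_PHY_LE_2M_TX | BT_PHY_LE_2M_RX;

	if (setsockopt(fd, SOL_BLUETOOTH, BT_PHY, &phys, sizeof(phys)) < 0) {
		perror("setsockopt(BT_PHY)");
		return -1;
	}
	return 0;
}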
- Bluetooth drivers:
- btusb: add support for MediaTek7920, Realtek RTL8761BU and 8851BE
- btqca: add WCN6855 firmware priority selection feature"
* tag 'net-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1254 commits)
bnge/bng_re: Add a new HSI
net: macb: Fix tx/rx malfunction after phy link down and up
af_unix: Fix memleak of newsk in unix_stream_connect().
net: ti: icssg-prueth: Add optional dependency on HSR
net: dsa: add basic initial driver for MxL862xx switches
net: mdio: add unlocked mdiodev C45 bus accessors
net: dsa: add tag format for MxL862xx switches
dt-bindings: net: dsa: add MaxLinear MxL862xx
selftests: drivers: net: hw: Modify toeplitz.c to poll for packets
octeontx2-pf: Unregister devlink on probe failure
net: renesas: rswitch: fix forwarding offload statemachine
ionic: Rate limit unknown xcvr type messages
tcp: inet6_csk_xmit() optimization
tcp: populate inet->cork.fl.u.ip6 in tcp_v6_syn_recv_sock()
tcp: populate inet->cork.fl.u.ip6 in tcp_v6_connect()
ipv6: inet6_csk_xmit() and inet6_csk_update_pmtu() use inet->cork.fl.u.ip6
ipv6: use inet->cork.fl.u.ip6 and np->final in ip6_datagram_dst_update()
ipv6: use np->final in inet6_sk_rebuild_header()
ipv6: add daddr/final storage in struct ipv6_pinfo
net: stmmac: qcom-ethqos: fix qcom_ethqos_serdes_powerup()
...
Diffstat (limited to 'net/rds/send.c')
| -rw-r--r-- | net/rds/send.c | 139 |
1 file changed, 95 insertions, 44 deletions
diff --git a/net/rds/send.c b/net/rds/send.c
index 0b3d0ef2f008..6e96f108473e 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -120,6 +120,57 @@ static void release_in_xmit(struct rds_conn_path *cp)
 }
 
 /*
+ * Helper function for multipath fanout to ensure lane 0 transmits queued
+ * messages before other lanes to prevent out-of-order delivery.
+ *
+ * Returns true if lane 0 still has messages or false otherwise
+ */
+static bool rds_mprds_cp0_catchup(struct rds_connection *conn)
+{
+	struct rds_conn_path *cp0 = conn->c_path;
+	struct rds_message *rm0;
+	unsigned long flags;
+	bool ret = false;
+
+	spin_lock_irqsave(&cp0->cp_lock, flags);
+
+	/* the oldest / first message in the retransmit queue
+	 * has to be at or beyond c_cp0_mprds_catchup_tx_seq
+	 */
+	if (!list_empty(&cp0->cp_retrans)) {
+		rm0 = list_entry(cp0->cp_retrans.next, struct rds_message,
+				 m_conn_item);
+		if (be64_to_cpu(rm0->m_inc.i_hdr.h_sequence) <
+		    conn->c_cp0_mprds_catchup_tx_seq) {
+			/* the retransmit queue of cp_index#0 has not
+			 * quite caught up yet
+			 */
+			ret = true;
+			goto unlock;
+		}
+	}
+
+	/* the oldest / first message of the send queue
+	 * has to be at or beyond c_cp0_mprds_catchup_tx_seq
+	 */
+	rm0 = cp0->cp_xmit_rm;
+	if (!rm0 && !list_empty(&cp0->cp_send_queue))
+		rm0 = list_entry(cp0->cp_send_queue.next, struct rds_message,
+				 m_conn_item);
+	if (rm0 && be64_to_cpu(rm0->m_inc.i_hdr.h_sequence) <
+	    conn->c_cp0_mprds_catchup_tx_seq) {
+		/* the send queue of cp_index#0 has not quite
+		 * caught up yet
+		 */
+		ret = true;
+	}
+
+unlock:
+	spin_unlock_irqrestore(&cp0->cp_lock, flags);
+	return ret;
+}
+
+/*
  * We're making the conscious trade-off here to only send one message
  * down the connection at a time.
  * Pro:
@@ -248,6 +299,14 @@ restart:
 		if (batch_count >= send_batch_count)
 			goto over_batch;
 
+		/* make sure cp_index#0 caught up during fan-out in
+		 * order to avoid lane races
+		 */
+		if (cp->cp_index > 0 && rds_mprds_cp0_catchup(conn)) {
+			rds_stats_inc(s_mprds_catchup_tx0_retries);
+			goto over_batch;
+		}
+
 		spin_lock_irqsave(&cp->cp_lock, flags);
 
 		if (!list_empty(&cp->cp_send_queue)) {
@@ -458,7 +517,8 @@ over_batch:
 			if (rds_destroy_pending(cp->cp_conn))
 				ret = -ENETUNREACH;
 			else
-				queue_delayed_work(rds_wq, &cp->cp_send_w, 1);
+				queue_delayed_work(cp->cp_wq,
+						   &cp->cp_send_w, 1);
 			rcu_read_unlock();
 		} else if (raced) {
 			rds_stats_inc(s_send_lock_queue_raced);
@@ -1041,39 +1101,6 @@ static int rds_cmsg_send(struct rds_sock *rs, struct rds_message *rm,
 	return ret;
 }
 
-static int rds_send_mprds_hash(struct rds_sock *rs,
-			       struct rds_connection *conn, int nonblock)
-{
-	int hash;
-
-	if (conn->c_npaths == 0)
-		hash = RDS_MPATH_HASH(rs, RDS_MPATH_WORKERS);
-	else
-		hash = RDS_MPATH_HASH(rs, conn->c_npaths);
-	if (conn->c_npaths == 0 && hash != 0) {
-		rds_send_ping(conn, 0);
-
-		/* The underlying connection is not up yet. Need to wait
-		 * until it is up to be sure that the non-zero c_path can be
-		 * used. But if we are interrupted, we have to use the zero
-		 * c_path in case the connection ends up being non-MP capable.
-		 */
-		if (conn->c_npaths == 0) {
-			/* Cannot wait for the connection be made, so just use
-			 * the base c_path.
-			 */
-			if (nonblock)
-				return 0;
-			if (wait_event_interruptible(conn->c_hs_waitq,
-						     conn->c_npaths != 0))
-				hash = 0;
-		}
-		if (conn->c_npaths == 1)
-			hash = 0;
-	}
-	return hash;
-}
-
 static int rds_rdma_bytes(struct msghdr *msg, size_t *rdma_bytes)
 {
 	struct rds_rdma_args *args;
@@ -1303,10 +1330,32 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 		rs->rs_conn = conn;
 	}
 
-	if (conn->c_trans->t_mp_capable)
-		cpath = &conn->c_path[rds_send_mprds_hash(rs, conn, nonblock)];
-	else
+	if (conn->c_trans->t_mp_capable) {
+		/* Use c_path[0] until we learn that
+		 * the peer supports more (c_npaths > 1)
+		 */
+		cpath = &conn->c_path[RDS_MPATH_HASH(rs, conn->c_npaths ? : 1)];
+	} else {
 		cpath = &conn->c_path[0];
+	}
+
+	/* If we're multipath capable and path 0 is down, queue reconnect
+	 * and send a ping. This initiates the multipath handshake through
+	 * rds_send_probe(), which sends RDS_EXTHDR_NPATHS to the peer,
+	 * starting multipath capability negotiation.
+	 */
+	if (conn->c_trans->t_mp_capable &&
+	    !rds_conn_path_up(&conn->c_path[0])) {
+		/* Ensures that only one request is queued. And
+		 * rds_send_ping() ensures that only one ping is
+		 * outstanding.
+		 */
+		if (!test_and_set_bit(RDS_RECONNECT_PENDING,
+				      &conn->c_path[0].cp_flags))
+			queue_delayed_work(conn->c_path[0].cp_wq,
+					   &conn->c_path[0].cp_conn_w, 0);
+		rds_send_ping(conn, 0);
+	}
 
 	rm->m_conn_path = cpath;
 
@@ -1380,7 +1429,7 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 		if (rds_destroy_pending(cpath->cp_conn))
 			ret = -ENETUNREACH;
 		else
-			queue_delayed_work(rds_wq, &cpath->cp_send_w, 1);
+			queue_delayed_work(cpath->cp_wq, &cpath->cp_send_w, 1);
 		rcu_read_unlock();
 	}
 	if (ret)
@@ -1456,24 +1505,26 @@ rds_send_probe(struct rds_conn_path *cp, __be16 sport,
 	    cp->cp_conn->c_trans->t_mp_capable) {
 		__be16 npaths = cpu_to_be16(RDS_MPATH_WORKERS);
 		__be32 my_gen_num = cpu_to_be32(cp->cp_conn->c_my_gen_num);
+		u8 dummy = 0;
 
 		rds_message_add_extension(&rm->m_inc.i_hdr,
-					  RDS_EXTHDR_NPATHS, &npaths,
-					  sizeof(npaths));
+					  RDS_EXTHDR_NPATHS, &npaths);
 		rds_message_add_extension(&rm->m_inc.i_hdr,
 					  RDS_EXTHDR_GEN_NUM,
-					  &my_gen_num,
-					  sizeof(u32));
+					  &my_gen_num);
+		rds_message_add_extension(&rm->m_inc.i_hdr,
+					  RDS_EXTHDR_SPORT_IDX,
+					  &dummy);
 	}
 	spin_unlock_irqrestore(&cp->cp_lock, flags);
 
 	rds_stats_inc(s_send_queued);
 	rds_stats_inc(s_send_pong);
 
-	/* schedule the send work on rds_wq */
+	/* schedule the send work on cp_wq */
 	rcu_read_lock();
 	if (!rds_destroy_pending(cp->cp_conn))
-		queue_delayed_work(rds_wq, &cp->cp_send_w, 1);
+		queue_delayed_work(cp->cp_wq, &cp->cp_send_w, 1);
 	rcu_read_unlock();
 
 	rds_message_put(rm);
