| Age | Commit message (Collapse) | Author |
|
Pull rdma fixes from Jason Gunthorpe:
- Quite a few irdma bug fixes, several user triggerable
- Fix a 0 SMAC header in ionic
- Tolerate FW errors for RAAS in bng_re
- Don't UAF in efa when printing error events
- Better handle pool exhaustion in the new bvec paths
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/irdma: Harden depth calculation functions
RDMA/irdma: Return EINVAL for invalid arp index error
RDMA/irdma: Fix deadlock during netdev reset with active connections
RDMA/irdma: Remove reset check from irdma_modify_qp_to_err()
RDMA/irdma: Clean up unnecessary dereference of event->cm_node
RDMA/irdma: Remove a NOP wait_event() in irdma_modify_qp_roce()
RDMA/irdma: Update ibqp state to error if QP is already in error state
RDMA/irdma: Initialize free_qp completion before using it
RDMA/efa: Fix possible deadlock
RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
RDMA/rw: Fall back to direct SGE on MR pool exhaustion
RDMA/efa: Fix use of completion ctx after free
RDMA/bng_re: Fix silent failure in HWRM version query
RDMA/ionic: Preserve and set Ethernet source MAC after ib_ud_header_init()
RDMA/irdma: Fix double free related to rereg_user_mr
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
"A set of fixes for DMA-mapping subsystem, which resolve false-
positive warnings from KMSAN and DMA-API debug (Shigeru Yoshida
and Leon Romanovsky) as well as a simple build fix (Miguel Ojeda)"
* tag 'dma-mapping-7.0-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-mapping: add missing `inline` for `dma_free_attrs`
mm/hmm: Indicate that HMM requires DMA coherency
RDMA/umem: Tell DMA mapping that UMEM requires coherency
iommu/dma: add support for DMA_ATTR_REQUIRE_COHERENT attribute
dma-direct: prevent SWIOTLB path when DMA_ATTR_REQUIRE_COHERENT is set
dma-mapping: Introduce DMA require coherency attribute
dma-mapping: Clarify valid conditions for CPU cache line overlap
dma-mapping: handle DMA_ATTR_CPU_CACHE_CLEAN in trace output
dma-debug: Allow multiple invocations of overlapping entries
dma: swiotlb: add KMSAN annotations to swiotlb_bounce()
|
|
The RDMA subsystem exposes DMA regions through the verbs interface, which
assumes a coherent system. Use the DMA_ATTR_REQUIRE_COHERENCE attribute
to ensure coherency and avoid taking the SWIOTLB path.
The RDMA verbs programming model resembles HMM and assumes concurrent DMA
and CPU access to userspace memory. The hardware and programming model
support "one-sided" operations initiated remotely without any local CPU
involvement or notification. These include ATOMIC compare/swap, READ, and
WRITE. A remote CPU can use these operations to traverse data structures,
manipulate locks, and perform similar tasks without the host CPU’s
awareness. If SWIOTLB substitutes memory or DMA is not cache coherent,
these use cases break entirely.
In-kernel RDMA is fine with incoherent mappings because kernel users do
not rely on one-sided operations in ways that would expose these issues.
A given region may also be exported multiple times, which can trigger
warnings about cacheline overlaps. These warnings are suppressed when the
new attribute is used.
infiniband rocep8s0f0: mlx5_ib_reg_user_mr:1592:(pid 5812): start 0x2b28c000, iova 0x2b28c000, length 0x1000, access_flags 0x1
infiniband rocep8s0f0: mlx5_ib_reg_user_mr:1592:(pid 5812): start 0x2b28c001, iova 0x2b28c001, length 0xfff, access_flags 0x1
------------[ cut here ]------------
DMA-API: mlx5_core 0000:08:00.0: cacheline tracking EEXIST, overlapping mappings aren't supported
WARNING: kernel/dma/debug.c:620 at add_dma_entry+0x1bb/0x280, CPU#6: ibv_rc_pingpong/5812
Modules linked in: veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat xt_addrtype br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay mlx5_fwctl zram zsmalloc mlx5_ib fuse rpcrdma rdma_ucm ib_uverbs ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_core ib_core
CPU: 6 UID: 2733 PID: 5812 Comm: ibv_rc_pingpong Tainted: G W 6.19.0+ #129 PREEMPT
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:add_dma_entry+0x1be/0x280
Code: 8b 7b 10 48 85 ff 0f 84 c3 00 00 00 48 8b 6f 50 48 85 ed 75 03 48 8b 2f e8 ff 8e 6a 00 48 89 c6 48 8d 3d 55 ef 2d 01 48 89 ea <67> 48 0f b9 3a 48 85 db 74 1a 48 c7 c7 b0 00 2b 82 e8 9c 25 fd ff
RSP: 0018:ff11000138717978 EFLAGS: 00010286
RAX: ffffffffa02d7831 RBX: ff1100010246de00 RCX: 0000000000000000
RDX: ff110001036fac30 RSI: ffffffffa02d7831 RDI: ffffffff82678650
RBP: ff110001036fac30 R08: ff11000110dcb4a0 R09: ff11000110dcb478
R10: 0000000000000000 R11: ffffffff824b30a8 R12: 0000000000000000
R13: 00000000ffffffef R14: 0000000000000202 R15: ff1100010246de00
FS: 00007f59b411c740(0000) GS:ff110008dcc99000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffe538f7000 CR3: 000000010e066005 CR4: 0000000000373eb0
Call Trace:
<TASK>
debug_dma_map_sg+0x1b4/0x390
__dma_map_sg_attrs+0x6d/0x1a0
dma_map_sgtable+0x19/0x30
ib_umem_get+0x254/0x380 [ib_uverbs]
mlx5_ib_reg_user_mr+0x68/0x2a0 [mlx5_ib]
ib_uverbs_reg_mr+0x17f/0x2a0 [ib_uverbs]
ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0x130 [ib_uverbs]
ib_uverbs_cmd_verbs+0xa0b/0xae0 [ib_uverbs]
? ib_uverbs_handler_UVERBS_METHOD_QUERY_PORT_SPEED+0xe0/0xe0 [ib_uverbs]
? mmap_region+0x7a/0xb0
? do_mmap+0x3b8/0x5c0
ib_uverbs_ioctl+0xa7/0x110 [ib_uverbs]
__x64_sys_ioctl+0x14f/0x8b0
? ksys_mmap_pgoff+0xc5/0x190
do_syscall_64+0x8c/0xbf0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f59b430aeed
Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
RSP: 002b:00007ffe538f9430 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffe538f94c0 RCX: 00007f59b430aeed
RDX: 00007ffe538f94e0 RSI: 00000000c0181b01 RDI: 0000000000000003
RBP: 00007ffe538f9480 R08: 0000000000000028 R09: 00007ffe538f9684
R10: 0000000000000001 R11: 0000000000000246 R12: 00007ffe538f9684
R13: 000000000000000c R14: 000000002b28d170 R15: 000000000000000c
</TASK>
---[ end trace 0000000000000000 ]---
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260316-dma-debug-overlap-v3-7-1dde90a7f08b@nvidia.com
|
|
When IOVA-based DMA mapping is unavailable (e.g., IOMMU
passthrough mode), rdma_rw_ctx_init_bvec() falls back to
checking rdma_rw_io_needs_mr() with the raw bvec count.
Unlike the scatterlist path in rdma_rw_ctx_init(), which
passes a post-DMA-mapping entry count that reflects
coalescing of physically contiguous pages, the bvec path
passes the pre-mapping page count. This overstates the
number of DMA entries, causing every multi-bvec RDMA READ
to consume an MR from the QP's pool.
Under NFS WRITE workloads the server performs RDMA READs
to pull data from the client. With the inflated MR demand,
the pool is rapidly exhausted, ib_mr_pool_get() returns
NULL, and rdma_rw_init_one_mr() returns -EAGAIN. svcrdma
treats this as a DMA mapping failure, closes the connection,
and the client reconnects -- producing a cycle of 71% RPC
retransmissions and ~100 reconnections per test run. RDMA
WRITEs (NFS READ direction) are unaffected because
DMA_TO_DEVICE never triggers the max_sgl_rd check.
Remove the rdma_rw_io_needs_mr() gate from the bvec path
entirely, so that bvec RDMA operations always use the
map_wrs path (direct WR posting without MR allocation).
The bvec caller has no post-DMA-coalescing segment count
available -- xdr_buf and svc_rqst hold pages as individual
pointers, and physical contiguity is discovered only during
DMA mapping -- so the raw page count cannot serve as a
reliable input to rdma_rw_io_needs_mr(). iWARP devices,
which require MRs unconditionally, are handled by an
earlier check in rdma_rw_ctx_init_bvec() and are unaffected.
Fixes: bea28ac14cab ("RDMA/core: add MR support for bvec-based RDMA operations")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260313194201.5818-3-cel@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
When IOMMU passthrough mode is active, ib_dma_map_sgtable_attrs()
produces no coalescing: each scatterlist page maps 1:1 to a DMA
entry, so sgt.nents equals the raw page count. A 1 MB transfer
yields 256 DMA entries. If that count exceeds the device's
max_sgl_rd threshold (an optimization hint from mlx5 firmware),
rdma_rw_io_needs_mr() steers the operation into the MR
registration path. Each such operation consumes one or more MRs
from a pool sized at max_rdma_ctxs -- roughly one MR per
concurrent context. Under write-intensive workloads that issue
many concurrent RDMA READs, the pool is rapidly exhausted,
ib_mr_pool_get() returns NULL, and rdma_rw_init_one_mr() returns
-EAGAIN. Upper layer protocols treat this as a fatal DMA mapping
failure and tear down the connection.
The max_sgl_rd check is a performance optimization, not a
correctness requirement: the device can handle large SGE counts
via direct posting, just less efficiently than with MR
registration. When the MR pool cannot satisfy a request, falling
back to the direct SGE (map_wrs) path avoids the connection
reset while preserving the MR optimization for the common case
where pool resources are available.
Add a fallback in rdma_rw_ctx_init() so that -EAGAIN from
rdma_rw_init_mr_wrs() triggers direct SGE posting instead of
propagating the error. iWARP devices, which mandate MR
registration for RDMA READs, and force_mr debug mode continue
to treat -EAGAIN as terminal.
Fixes: 00bd1439f464 ("RDMA/rw: Support threshold for registration vs scattering to local pages")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260313194201.5818-2-cel@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Fix the following compilation error:
ERROR: modpost: module ib_uverbs uses symbol dma_buf_move_notify
from namespace DMA_BUF, but does not import it.
Fixes: 0ac6f4056c4a ("RDMA/uverbs: Add DMABUF object type and operations")
Link: https://patch.msgid.link/20260225-fix-uverbs-compilation-v1-1-acf7b3d0f9fa@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
In ib_umem_dmabuf_get_pinned_with_dma_device(), the call to
ib_umem_dmabuf_map_pages() can fail. If this occurs, the dmabuf
is immediately unpinned but the umem_dmabuf->pinned flag is still
set. Then, when ib_umem_release() is called, it calls
ib_umem_dmabuf_revoke() which will call dma_buf_unpin() again.
Fix this by removing the immediate unpin upon failure and just let
the ib_umem_release/revoke path handle it. This also ensures the
proper unmap-unpin unwind ordering if the dmabuf_map_pages call
happened to fail due to dma_resv_wait_timeout (and therefore has
a non-NULL umem_dmabuf->sgt).
Fixes: 1e4df4a21c5a ("RDMA/umem: Allow pinned dmabuf umem usage")
Signed-off-by: Jacob Moroni <jmoroni@google.com>
Link: https://patch.msgid.link/20260224234153.1207849-1-jmoroni@google.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
When listening on wildcard addresses we have a global list for the application
layer rdma_cm_id and for any existing device or any device added in future we
try to listen on any wildcard listener.
When the listener has a restricted_node_type we should prevent listening on
devices with a different node type.
While there fix the documentation comment of rdma_restrict_node_type()
to include rdma_resolve_addr() instead of having rdma_bind_addr() twice.
Fixes: a760e80e90f5 ("RDMA/core: introduce rdma_restrict_node_type()")
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Steve French <smfrench@gmail.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-rdma@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Link: https://patch.msgid.link/20260224165951.3582093-2-metze@samba.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
RoCE GID entries become stale when netdev properties change during the
IB device registration window. This is reproducible with a udev rule
that sets a MAC address when a VF netdev appears:
ACTION=="add", SUBSYSTEM=="net", KERNEL=="eth4", \
RUN+="/sbin/ip link set eth4 address 88:22:33:44:55:66"
After VF creation, show_gids displays GIDs derived from the original
random MAC rather than the configured one.
The root cause is a race between netdev event processing and device
registration:
CPU 0 (driver) CPU 1 (udev/workqueue)
────────────── ──────────────────────
ib_register_device()
ib_cache_setup_one()
gid_table_setup_one()
_gid_table_setup_one()
← GID table allocated
rdma_roce_rescan_device()
← GIDs populated with
OLD MAC
ip link set eth4 addr NEW_MAC
NETDEV_CHANGEADDR queued
netdevice_event_work_handler()
ib_enum_all_roce_netdevs()
← Iterates DEVICE_REGISTERED
← Device NOT marked yet, SKIP!
enable_device_and_get()
xa_set_mark(DEVICE_REGISTERED)
← Too late, event was lost
The netdev event handler uses ib_enum_all_roce_netdevs() which only
iterates devices marked DEVICE_REGISTERED. However, this mark is set
late in the registration process, after the GID cache is already
populated. Events arriving in this window are silently dropped.
Fix this by introducing a new xarray mark DEVICE_GID_UPDATES that is
set immediately after the GID table is allocated and initialized. Use
the new mark in ib_enum_all_roce_netdevs() function to iterate devices
instead of DEVICE_REGISTERED.
This is safe because:
- After _gid_table_setup_one(), all required structures exist (port_data,
immutable, cache.gid)
- The GID table mutex serializes concurrent access between the initial
rescan and event handlers
- Event handlers correctly update stale GIDs even when racing with rescan
- The mark is cleared in ib_cache_cleanup_one() before teardown
This also fixes similar races for IP address events (inetaddr_event,
inet6addr_event) which use the same enumeration path.
Fixes: 0df91bb67334 ("RDMA/devices: Use xarray to store the client_data")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20260127093839.126291-1-jiri@resnulli.us
Reported-by: syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=881d65229ca4f9ae8c84
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Conversion performed via this Coccinelle script:
// SPDX-License-Identifier: GPL-2.0-only
// Options: --include-headers-for-types --all-includes --include-headers --keep-comments
virtual patch
@gfp depends on patch && !(file in "tools") && !(file in "samples")@
identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex,
kzalloc_obj,kzalloc_objs,kzalloc_flex,
kvmalloc_obj,kvmalloc_objs,kvmalloc_flex,
kvzalloc_obj,kvzalloc_objs,kvzalloc_flex};
@@
ALLOC(...
- , GFP_KERNEL
)
$ make coccicheck MODE=patch COCCI=gfp.cocci
Build and boot tested x86_64 with Fedora 42's GCC and Clang:
Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This converts some of the visually simpler cases that have been split
over multiple lines. I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.
Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script. I probably had made it a bit _too_ trivial.
So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.
The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.
As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
Pull rdma updates from Jason Gunthorpe:
"Usual smallish cycle. The NFS biovec work to push it down into RDMA
instead of indirecting through a scatterlist is pretty nice to see,
been talked about for a long time now.
- Various code improvements in irdma, rtrs, qedr, ocrdma, irdma, rxe
- Small driver improvements and minor bug fixes to hns, mlx5, rxe,
mana, mlx5, irdma
- Robusness improvements in completion processing for EFA
- New query_port_speed() verb to move past limited IBA defined speed
steps
- Support for SG_GAPS in rts and many other small improvements
- Rare list corruption fix in iwcm
- Better support different page sizes in rxe
- Device memory support for mana
- Direct bio vec to kernel MR for use by NFS-RDMA
- QP rate limiting for bnxt_re
- Remote triggerable NULL pointer crash in siw
- DMA-buf exporter support for RDMA mmaps like doorbells"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (66 commits)
RDMA/mlx5: Implement DMABUF export ops
RDMA/uverbs: Add DMABUF object type and operations
RDMA/uverbs: Support external FD uobjects
RDMA/siw: Fix potential NULL pointer dereference in header processing
RDMA/umad: Reject negative data_len in ib_umad_write
IB/core: Extend rate limit support for RC QPs
RDMA/mlx5: Support rate limit only for Raw Packet QP
RDMA/bnxt_re: Report QP rate limit in debugfs
RDMA/bnxt_re: Report packet pacing capabilities when querying device
RDMA/bnxt_re: Add support for QP rate limiting
MAINTAINERS: Drop RDMA files from Hyper-V section
RDMA/uverbs: Add __GFP_NOWARN to ib_uverbs_unmarshall_recv() kmalloc
svcrdma: use bvec-based RDMA read/write API
RDMA/core: add rdma_rw_max_sge() helper for SQ sizing
RDMA/core: add MR support for bvec-based RDMA operations
RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations
RDMA/core: add bio_vec based RDMA read/write API
RDMA/irdma: Use kvzalloc for paged memory DMA address array
RDMA/rxe: Fix race condition in QP timer handlers
RDMA/mana_ib: Add device‑memory support
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves
disk space by teaching ocfs2 to reclaim suballocator block group
space (Heming Zhao)
- "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the
ARRAY_END() macro and uses it in various places (Alejandro Colomar)
- "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes
the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the
page size (Pnina Feder)
- "kallsyms: Prevent invalid access when showing module buildid" cleans
up kallsyms code related to module buildid and fixes an invalid
access crash when printing backtraces (Petr Mladek)
- "Address page fault in ima_restore_measurement_list()" fixes a
kexec-related crash that can occur when booting the second-stage
kernel on x86 (Harshit Mogalapalli)
- "kho: ABI headers and Documentation updates" updates the kexec
handover ABI documentation (Mike Rapoport)
- "Align atomic storage" adds the __aligned attribute to atomic_t and
atomic64_t definitions to get natural alignment of both types on
csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain)
- "kho: clean up page initialization logic" simplifies the page
initialization logic in kho_restore_page() (Pratyush Yadav)
- "Unload linux/kernel.h" moves several things out of kernel.h and into
more appropriate places (Yury Norov)
- "don't abuse task_struct.group_leader" removes the usage of
->group_leader when it is "obviously unnecessary" (Oleg Nesterov)
- "list private v2 & luo flb" adds some infrastructure improvements to
the live update orchestrator (Pasha Tatashin)
* tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits)
watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency
procfs: fix missing RCU protection when reading real_parent in do_task_stat()
watchdog/softlockup: fix sample ring index wrap in need_counting_irqs()
kcsan, compiler_types: avoid duplicate type issues in BPF Type Format
kho: fix doc for kho_restore_pages()
tests/liveupdate: add in-kernel liveupdate test
liveupdate: luo_flb: introduce File-Lifecycle-Bound global state
liveupdate: luo_file: Use private list
list: add kunit test for private list primitives
list: add primitives for private list manipulations
delayacct: fix uapi timespec64 definition
panic: add panic_force_cpu= parameter to redirect panic to a specific CPU
netclassid: use thread_group_leader(p) in update_classid_task()
RDMA/umem: don't abuse current->group_leader
drm/pan*: don't abuse current->group_leader
drm/amd: kill the outdated "Only the pthreads threading model is supported" checks
drm/amdgpu: don't abuse current->group_leader
android/binder: use same_thread_group(proc->tsk, current) in binder_mmap()
android/binder: don't abuse current->group_leader
kho: skip memoryless NUMA nodes when reserving scratch areas
...
|
|
Expose DMABUF functionality to userspace through the uverbs interface,
enabling InfiniBand/RDMA devices to export PCI based memory regions
(e.g. device memory) as DMABUF file descriptors. This allows
zero-copy sharing of RDMA memory with other subsystems that support the
dma-buf framework.
A new UVERBS_OBJECT_DMABUF object type and allocation method were
introduced.
During allocation, uverbs invokes the driver to supply the
rdma_user_mmap_entry associated with the given page offset (pgoff).
Based on the returned rdma_user_mmap_entry, uverbs requests the driver
to provide the corresponding physical-memory details as well as the
driver’s PCI provider information.
Using this information, dma_buf_export() is called; if it succeeds,
uobj->object is set to the underlying file pointer returned by the
dma-buf framework.
The file descriptor number follows the standard uverbs allocation flow,
but the file pointer comes from the dma-buf subsystem, including its own
fops and private data.
When an mmap entry is removed, uverbs iterates over its associated
DMABUFs, marks them as revoked, and calls dma_buf_move_notify() so that
their importers are notified.
The same procedure applies during the disassociate flow; final cleanup
occurs when the application closes the file.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260201-dmabuf-export-v3-2-da238b614fe3@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add support for uobjects that wrap externally allocated file
descriptors (FDs).
In this mode, the FD number still follows the standard uverbs allocation
flow, but the file pointer is allocated externally and has its own fops
and private data.
As a result, alloc_begin_fd_uobject() must handle cases where
fd_type->fops is NULL, and both alloc_commit_fd_uobject() and
alloc_abort_fd_uobject() must account for whether filp->private_data
exists, since it is populated outside the standard uverbs flow.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260201-dmabuf-export-v3-1-da238b614fe3@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
For smbdirect it required to use different ports depending
on the RDMA protocol. E.g. for iWarp 5445 is needed
(as tcp port 445 already used by the raw tcp transport for SMB),
while InfiniBand, RoCEv1 and RoCEv2 use port 445, as they
use an independent port range (even for RoCEv2, which uses udp
port 4791 itself).
Currently ksmbd is not able to function correctly at
all if the system has iWarp (RDMA_NODE_RNIC) interface(s)
and any InfiniBand, RoCEv1 and/or RoCEv2 interface(s)
at the same time.
And cifs.ko uses 5445 with a fallback to 445, which
means depending on the available interfaces, it tries
5445 in the RoCE range or may tries iWarp with 445
as a fallback. This leads to strange error messages
and strange network captures.
To avoid these problems they will be able to
use rdma_restrict_node_type(RDMA_NODE_RNIC) before
trying port 5445 and rdma_restrict_node_type(RDMA_NODE_IB_CA)
before trying port 445. It means we'll get early
-ENODEV early from rdma_resolve_addr() without any
network traffic and timeouts.
This is designed to be called before calling any
of rdma_bind_addr(), rdma_resolve_addr() or rdma_listen().
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: linux-rdma@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Acked-by: Leon Romanovsky <leon@kernel.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
ib_umad_write computes data_len from user-controlled count and the
MAD header sizes. With a mismatched user MAD header size and RMPP
header length, data_len can become negative and reach ib_create_send_mad().
This can make the padding calculation exceed the segment size and trigger
an out-of-bounds memset in alloc_send_rmpp_list().
Add an explicit check to reject negative data_len before creating the
send buffer.
KASAN splat:
[ 211.363464] BUG: KASAN: slab-out-of-bounds in ib_create_send_mad+0xa01/0x11b0
[ 211.364077] Write of size 220 at addr ffff88800c3fa1f8 by task spray_thread/102
[ 211.365867] ib_create_send_mad+0xa01/0x11b0
[ 211.365887] ib_umad_write+0x853/0x1c80
Fixes: 2be8e3ee8efd ("IB/umad: Add P_Key index support")
Signed-off-by: YunJe Shin <ioerts@kookmin.ac.kr>
Link: https://patch.msgid.link/20260203100628.1215408-1-ioerts@kookmin.ac.kr
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Cleanup and preparation to simplify the next changes.
Use current->tgid instead of current->group_leader->pid.
Link: https://lkml.kernel.org/r/aXY_2JIhCeGAYC0r@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Leon Romanovsky <leon@kernel.org>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Boris Brezillon <boris.brezillon@collabora.com>
Cc: Christan König <christian.koenig@amd.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Broadcom devices supports setting the rate limit while changing
RC QP state from INIT to RTR, RTR to RTS and RTS to RTS.
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20260202133413.3182578-6-kalesh-anakkur.purayil@broadcom.com
Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Since wqe_size in ib_uverbs_unmarshall_recv() is user-provided and already
validated, but can still be large, add __GFP_NOWARN to suppress memory
allocation warnings for large sizes, consistent with the similar fix in
ib_uverbs_post_send().
Fixes: 67cdb40ca444 ("[IB] uverbs: Implement more commands")
Signed-off-by: Yi Liu <liuy22@mails.tsinghua.edu.cn>
Link: https://patch.msgid.link/20260129094900.3517706-1-liuy22@mails.tsinghua.edu.cn
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
svc_rdma_accept() computes sc_sq_depth as the sum of rq_depth and the
number of rdma_rw contexts (ctxts). This value is used to allocate the
Send CQ and to initialize the sc_sq_avail credit pool.
However, when the device uses memory registration for RDMA operations,
rdma_rw_init_qp() inflates the QP's max_send_wr by a factor of three
per context to account for REG and INV work requests. The Send CQ and
credit pool remain sized for only one work request per context,
causing Send Queue exhaustion under heavy NFS WRITE workloads.
Introduce rdma_rw_max_sge() to compute the actual number of Send Queue
entries required for a given number of rdma_rw contexts. Upper layer
protocols call this helper before creating a Queue Pair so that their
Send CQs and credit accounting match the QP's true capacity.
Update svc_rdma_accept() to use rdma_rw_max_sge() when computing
sc_sq_depth, ensuring the credit pool reflects the work requests
that rdma_rw_init_qp() will reserve.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 00bd1439f464 ("RDMA/rw: Support threshold for registration vs scattering to local pages")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20260128005400.25147-5-cel@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
The bvec-based RDMA API currently returns -EOPNOTSUPP when Memory
Region registration is required. This prevents iWARP devices from
using the bvec path, since iWARP requires MR registration for RDMA
READ operations. The force_mr debug parameter is also unusable with
bvec input.
Add rdma_rw_init_mr_wrs_bvec() to handle MR registration for bvec
arrays. The approach creates a synthetic scatterlist populated with
DMA addresses from the bvecs, then reuses the existing ib_map_mr_sg()
infrastructure. This avoids driver changes while keeping the
implementation small.
The synthetic scatterlist is stored in the rdma_rw_ctx for cleanup.
On destroy, the MRs are returned to the pool and the bvec DMA
mappings are released using the stored addresses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20260128005400.25147-4-cel@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
The bvec RDMA API maps each bvec individually via dma_map_phys(),
requiring an IOTLB sync for each mapping. For large I/O operations
with many bvecs, this overhead becomes significant.
The two-step IOVA API (dma_iova_try_alloc / dma_iova_link /
dma_iova_sync) allocates a contiguous IOVA range upfront, links
all physical pages without IOTLB syncs, then performs a single
sync at the end. This reduces IOTLB flushes from O(n) to O(1).
It also requires only a single output dma_addr_t compared to extra
per-input element storage in struct scatterlist.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20260128005400.25147-3-cel@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
The existing rdma_rw_ctx_init() API requires callers to construct a
scatterlist, which is then DMA-mapped page by page. Callers that
already have data in bio_vec form (such as the NVMe-oF target) must
first convert to scatterlist, adding overhead and complexity.
Introduce rdma_rw_ctx_init_bvec() and rdma_rw_ctx_destroy_bvec() to
accept bio_vec arrays directly. The new helpers use dma_map_phys()
for hardware RDMA devices and virtual addressing for software RDMA
devices (rxe, siw), avoiding intermediate scatterlist construction.
Memory registration (MR) path support is deferred to a follow-up
series; callers requiring MR-based transfers (iWARP devices or
force_mr=1) receive -EOPNOTSUPP and should use the scatterlist API.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://patch.msgid.link/20260128005400.25147-2-cel@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
ib_uverbs_post_send() uses cmd.wqe_size from userspace without any
validation before passing it to kmalloc() and using the allocated
buffer as struct ib_uverbs_send_wr.
If a user provides a small wqe_size value (e.g., 1), kmalloc() will
succeed, but subsequent accesses to user_wr->opcode, user_wr->num_sge,
and other fields will read beyond the allocated buffer, resulting in
an out-of-bounds read from kernel heap memory. This could potentially
leak sensitive kernel information to userspace.
Additionally, providing an excessively large wqe_size can trigger a
WARNING in the memory allocation path, as reported by syzkaller.
This is inconsistent with ib_uverbs_unmarshall_recv() which properly
validates that wqe_size >= sizeof(struct ib_uverbs_recv_wr) before
proceeding.
Add the same validation for ib_uverbs_post_send() to ensure wqe_size
is at least sizeof(struct ib_uverbs_send_wr).
Fixes: c3bea3d2dc53 ("RDMA/uverbs: Use the iterator for ib_uverbs_unmarshall_recv()")
Signed-off-by: Yi Liu <liuy22@mails.tsinghua.edu.cn>
Link: https://patch.msgid.link/20260122142900.2356276-2-liuy22@mails.tsinghua.edu.cn
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
The commit e1168f0 ("RDMA/iwcm: Simplify cm_event_handler()")
changed the work submission logic to unconditionally call
queue_work() with the expectation that queue_work() would
have no effect if work was already pending. The problem is
that a free list of struct iwcm_work is used (for which
struct work_struct is embedded), so each call to queue_work()
is basically unique and therefore does indeed queue the work.
This causes a problem in the work handler which walks the work_list
until it's empty to process entries. This means that a single
run of the work handler could process item N+1 and release it
back to the free list while the actual workqueue entry is still
queued. It could then get reused (INIT_WORK...) and lead to
list corruption in the workqueue logic.
Fix this by just removing the work_list. The workqueue already
does this for us.
This fixes the following error that was observed when stress
testing with ucmatose on an Intel E830 in iWARP mode:
[ 151.465780] list_del corruption. next->prev should be ffff9f0915c69c08, but was ffff9f0a1116be08. (next=ffff9f0a15b11c08)
[ 151.466639] ------------[ cut here ]------------
[ 151.466986] kernel BUG at lib/list_debug.c:67!
[ 151.467349] Oops: invalid opcode: 0000 [#1] SMP NOPTI
[ 151.467753] CPU: 14 UID: 0 PID: 2306 Comm: kworker/u64:18 Not tainted 6.19.0-rc4+ #1 PREEMPT(voluntary)
[ 151.468466] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 151.469192] Workqueue: 0x0 (iw_cm_wq)
[ 151.469478] RIP: 0010:__list_del_entry_valid_or_report+0xf0/0x100
[ 151.469942] Code: c7 58 5f 4c b2 e8 10 50 aa ff 0f 0b 48 89 ef e8 36 57 cb ff 48 8b 55 08 48 89 e9 48 89 de 48 c7 c7 a8 5f 4c b2 e8 f0 4f aa ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90
[ 151.471323] RSP: 0000:ffffb15644e7bd68 EFLAGS: 00010046
[ 151.471712] RAX: 000000000000006d RBX: ffff9f0915c69c08 RCX: 0000000000000027
[ 151.472243] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9f0a37d9c600
[ 151.472768] RBP: ffff9f0a15b11c08 R08: 0000000000000000 R09: c0000000ffff7fff
[ 151.473294] R10: 0000000000000001 R11: ffffb15644e7bba8 R12: ffff9f092339ee68
[ 151.473817] R13: ffff9f0900059c28 R14: ffff9f092339ee78 R15: 0000000000000000
[ 151.474344] FS: 0000000000000000(0000) GS:ffff9f0a847b5000(0000) knlGS:0000000000000000
[ 151.474934] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 151.475362] CR2: 0000559e233a9088 CR3: 000000020296b004 CR4: 0000000000770ef0
[ 151.475895] PKRU: 55555554
[ 151.476118] Call Trace:
[ 151.476331] <TASK>
[ 151.476497] move_linked_works+0x49/0xa0
[ 151.476792] __pwq_activate_work.isra.46+0x2f/0xa0
[ 151.477151] pwq_dec_nr_in_flight+0x1e0/0x2f0
[ 151.477479] process_scheduled_works+0x1c8/0x410
[ 151.477823] worker_thread+0x125/0x260
[ 151.478108] ? __pfx_worker_thread+0x10/0x10
[ 151.478430] kthread+0xfe/0x240
[ 151.478671] ? __pfx_kthread+0x10/0x10
[ 151.478955] ? __pfx_kthread+0x10/0x10
[ 151.479240] ret_from_fork+0x208/0x270
[ 151.479523] ? __pfx_kthread+0x10/0x10
[ 151.479806] ret_from_fork_asm+0x1a/0x30
[ 151.480103] </TASK>
Fixes: e1168f09b331 ("RDMA/iwcm: Simplify cm_event_handler()")
Signed-off-by: Jacob Moroni <jmoroni@google.com>
Link: https://patch.msgid.link/20260112020006.1352438-1-jmoroni@google.com
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Some of the functions are local to the module and some are not used
starting from commit 36783dec8d79 ("RDMA/rxe: Delete deprecated module
parameters interface"). Delete and avoid exporting them.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/20260104-ib-core-misc-v1-2-00367f77f3a8@nvidia.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
ib_umem_dmabuf_get_with_dma_device() is an in-kernel function and does
not require a defensive check for the .move_notify callback. All current
callers guarantee that this callback is always present.
Link: https://patch.msgid.link/20260104-ib-core-misc-v1-1-00367f77f3a8@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
|
|
Add new ibv_query_port_speed() verb to enable applications to query
the effective bandwidth of a port.
This verb is particularly useful when the speed is not a multiplication
of IB speed and width where width is 2^n.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Update sysfs rate_show() to rely on ib_port_attr_to_speed_info() for
converting IB port speed and width attributes to data rate and speed
string.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Introduce ib_port_attr_to_rate() to compute the data rate in 100 Mbps
units (deci-Gb/sec) from a port's active_speed and active_width
attributes. This generic helper removes duplicated speed-to-rate
calculations, which are used by sysfs and the upcoming new verb.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add IB_EVENT_DEVICE_SPEED_CHANGE for notifying user applications on
device's ports speed changes.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Some HCAs (e.g: ConnectX4) do not trigger a IB_EVENT_GID_CHANGE on
subnet prefix update from SM (PortInfo).
Since the commit d58c23c92548 ("IB/core: Only update PKEY and GID caches
on respective events"), the GID cache is updated exclusively on
IB_EVENT_GID_CHANGE. If this event is not emitted, the subnet prefix in the
IPoIB interface’s hardware address remains set to its default value
(0xfe80000000000000).
Then rdma_bind_addr() failed because it relies on hardware address to
find the port GID (subnet_prefix + port GUID).
This patch fixes this issue by updating the GID cache on
IB_EVENT_CLIENT_REREGISTER event (emitted on PortInfo::ClientReregister=1).
Fixes: d58c23c92548 ("IB/core: Only update PKEY and GID caches on respective events")
Signed-off-by: Etienne AUJAMES <eaujames@ddn.com>
Link: https://patch.msgid.link/aVUfsO58QIDn5bGX@eaujamesFR0130
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Since nldev_deldev() (introduced by commit 060c642b2ab8 ("RDMA/nldev: Add
support to add/delete a sub IB device through netlink") grabs a reference
using ib_device_get_by_index() before calling ib_del_sub_device_and_put(),
we need to drop that reference before returning -EOPNOTSUPP error.
Reported-by: syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=881d65229ca4f9ae8c84
Fixes: bca51197620a ("RDMA/core: Support IB sub device with type "SMI"")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://patch.msgid.link/80749a85-cbe2-460c-8451-42516013f9fa@I-love.SAKURA.ne.jp
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Fix missing comparison operator for RDMA_NETWORK_ROCE_V1 in the
conditional statement. The constant was used directly instead of
being compared with net_type, causing the condition to always
evaluate to true.
Fixes: 1c15b4f2a42f ("RDMA/core: Modify enum ib_gid_type and enum rdma_network_type")
Signed-off-by: Jang Ingyu <ingyujang25@korea.ac.kr>
Link: https://patch.msgid.link/20251219041508.1725947-1-ingyujang25@korea.ac.kr
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
If the CM ID is destroyed while the CM event for multicast creating is
still queued the cancel_work_sync() will prevent the work from running
which also prevents destroying the ah_attr. This leaks a refcount and
triggers a WARN:
GID entry ref leak for dev syz1 index 2 ref=573
WARNING: CPU: 1 PID: 655 at drivers/infiniband/core/cache.c:809 release_gid_table drivers/infiniband/core/cache.c:806 [inline]
WARNING: CPU: 1 PID: 655 at drivers/infiniband/core/cache.c:809 gid_table_release_one+0x284/0x3cc drivers/infiniband/core/cache.c:886
Destroy the ah_attr after canceling the work, it is safe to call this
twice.
Link: https://patch.msgid.link/r/0-v1-4285d070a6b2+20a-rdma_mc_gid_leak_syz_jgg@nvidia.com
Cc: stable@vger.kernel.org
Fixes: fe454dc31e84 ("RDMA/ucma: Fix use-after-free bug in ucma_create_uevent")
Reported-by: syzbot+b0da83a6c0e2e2bddbd4@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68232e7b.050a0220.f2294.09f6.GAE@google.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The netlink response for RDMA_NL_LS_OP_IP_RESOLVE should always have a
LS_NLA_TYPE_DGID attribute, it is invalid if it does not.
Use the nl parsing logic properly and call nla_parse_deprecated() to fill
the nlattrs array and then directly index that array to get the data for
the DGID. Just fail if it is NULL.
Remove the for loop searching for the nla, and squash the validation and
parsing into one function.
Fixes an uninitialized read from the stack triggered by userspace if it
does not provide the DGID to a kernel initiated RDMA_NL_LS_OP_IP_RESOLVE
query.
BUG: KMSAN: uninit-value in hex_byte_pack include/linux/hex.h:13 [inline]
BUG: KMSAN: uninit-value in ip6_string+0xef4/0x13a0 lib/vsprintf.c:1490
hex_byte_pack include/linux/hex.h:13 [inline]
ip6_string+0xef4/0x13a0 lib/vsprintf.c:1490
ip6_addr_string+0x18a/0x3e0 lib/vsprintf.c:1509
ip_addr_string+0x245/0xee0 lib/vsprintf.c:1633
pointer+0xc09/0x1bd0 lib/vsprintf.c:2542
vsnprintf+0xf8a/0x1bd0 lib/vsprintf.c:2930
vprintk_store+0x3ae/0x1530 kernel/printk/printk.c:2279
vprintk_emit+0x307/0xcd0 kernel/printk/printk.c:2426
vprintk_default+0x3f/0x50 kernel/printk/printk.c:2465
vprintk+0x36/0x50 kernel/printk/printk_safe.c:82
_printk+0x17e/0x1b0 kernel/printk/printk.c:2475
ib_nl_process_good_ip_rsep drivers/infiniband/core/addr.c:128 [inline]
ib_nl_handle_ip_res_resp+0x963/0x9d0 drivers/infiniband/core/addr.c:141
rdma_nl_rcv_msg drivers/infiniband/core/netlink.c:-1 [inline]
rdma_nl_rcv_skb drivers/infiniband/core/netlink.c:239 [inline]
rdma_nl_rcv+0xefa/0x11c0 drivers/infiniband/core/netlink.c:259
netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline]
netlink_unicast+0xf04/0x12b0 net/netlink/af_netlink.c:1346
netlink_sendmsg+0x10b3/0x1250 net/netlink/af_netlink.c:1896
sock_sendmsg_nosec net/socket.c:714 [inline]
__sock_sendmsg+0x333/0x3d0 net/socket.c:729
____sys_sendmsg+0x7e0/0xd80 net/socket.c:2617
___sys_sendmsg+0x271/0x3b0 net/socket.c:2671
__sys_sendmsg+0x1aa/0x300 net/socket.c:2703
__compat_sys_sendmsg net/compat.c:346 [inline]
__do_compat_sys_sendmsg net/compat.c:353 [inline]
__se_compat_sys_sendmsg net/compat.c:350 [inline]
__ia32_compat_sys_sendmsg+0xa4/0x100 net/compat.c:350
ia32_sys_call+0x3f6c/0x4310 arch/x86/include/generated/asm/syscalls_32.h:371
do_syscall_32_irqs_on arch/x86/entry/syscall_32.c:83 [inline]
__do_fast_syscall_32+0xb0/0x150 arch/x86/entry/syscall_32.c:306
do_fast_syscall_32+0x38/0x80 arch/x86/entry/syscall_32.c:331
do_SYSENTER_32+0x1f/0x30 arch/x86/entry/syscall_32.c:3
Link: https://patch.msgid.link/r/0-v1-3fbaef094271+2cf-rdma_op_ip_rslv_syz_jgg@nvidia.com
Cc: stable@vger.kernel.org
Fixes: ae43f8286730 ("IB/core: Add IP to GID netlink offload")
Reported-by: syzbot+938fcd548c303fe33c1a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/68dc3dac.a00a0220.102ee.004f.GAE@google.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Pull rdma updates from Jason Gunthorpe:
"This has another new RDMA driver 'bng_en' for latest generation
Broadcom NICs. There might be one more new driver still to come.
Otherwise it is a fairly quite cycle. Summary:
- Minor driver bug fixes and updates to cxgb4, rxe, rdmavt, bnxt_re,
mlx5
- Many bug fix patches for irdma
- WQ_PERCPU annotations and system_dfl_wq changes
- Improved mlx5 support for "other eswitches" and multiple PFs
- 1600Gbps link speed reporting support. Four Digits Now!
- New driver bng_en for latest generation Broadcom NICs
- Bonding support for hns
- Adjust mlx5's hmm based ODP to work with the very large address
space created by the new 5 level paging default on x86
- Lockdep fixups in rxe and siw"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (65 commits)
RDMA/rxe: reclassify sockets in order to avoid false positives from lockdep
RDMA/siw: reclassify sockets in order to avoid false positives from lockdep
RDMA/bng_re: Remove prefetch instruction
RDMA/core: Reduce cond_resched() frequency in __ib_umem_release
RDMA/irdma: Fix SRQ shadow area address initialization
RDMA/irdma: Remove doorbell elision logic
RDMA/irdma: Do not set IBK_LOCAL_DMA_LKEY for GEN3+
RDMA/irdma: Do not directly rely on IB_PD_UNSAFE_GLOBAL_RKEY
RDMA/irdma: Add missing mutex destroy
RDMA/irdma: Fix SIGBUS in AEQ destroy
RDMA/irdma: Add a missing kfree of struct irdma_pci_f for GEN2
RDMA/irdma: Fix data race in irdma_free_pble
RDMA/irdma: Fix data race in irdma_sc_ccq_arm
RDMA/mlx5: Add support for 1600_8x lane speed
RDMA/core: Add new IB rate for XDR (8x) support
IB/mlx5: Reduce IMR KSM size when 5-level paging is enabled
RDMA/bnxt_re: Pass correct flag for dma mr creation
RDMA/bnxt_re: Fix the inline size for GenP7 devices
RDMA/hns: Support reset recovery for bond
RDMA/hns: Support link state reporting for bond
...
|
|
The current implementation calls cond_resched() for every SG entry
in __ib_umem_release(), which can increase needless overhead.
This patch introduces RESCHED_LOOP_CNT_THRESHOLD (0x1000) to limit
how often cond_resched() is called. The function now yields the CPU
once every 4096 iterations, and yield at the very first iteration
for lots of small umem case, to reduce scheduling overhead.
Fixes: d056bc45b62b ("RDMA/core: Prevent soft lockup during large user memory region cleanup")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://patch.msgid.link/20251126025147.2627-1-lirongqing@baidu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add the new rates as defined in the Infiniband spec for XDR and 8x
link width support.
Furthermore, modify the utility conversion methods accordingly.
Reference: IB Spec Release 1.8
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Link: https://patch.msgid.link/20251120-speed-8-v1-1-e6a7efef8cb8@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
When a process exits with numerous large, pinned memory regions consisting
of 4KB pages, the cleanup of the memory region through __ib_umem_release()
may cause soft lockups. This is because unpin_user_page_range_dirty_lock()
is called in a tight loop for unpin and releasing page without yielding the
CPU.
watchdog: BUG: soft lockup - CPU#44 stuck for 26s! [python3:73464]
Kernel panic - not syncing: softlockup: hung tasks
CPU: 44 PID: 73464 Comm: python3 Tainted: G OEL
asm_sysvec_apic_timer_interrupt+0x1b/0x20
RIP: 0010:free_unref_page+0xff/0x190
? free_unref_page+0xe3/0x190
__put_page+0x77/0xe0
put_compound_head+0xed/0x100
unpin_user_page_range_dirty_lock+0xb2/0x180
__ib_umem_release+0x57/0xb0 [ib_core]
ib_umem_release+0x3f/0xd0 [ib_core]
mlx5_ib_dereg_mr+0x2e9/0x440 [mlx5_ib]
ib_dereg_mr_user+0x43/0xb0 [ib_core]
uverbs_free_mr+0x15/0x20 [ib_uverbs]
destroy_hw_idr_uobject+0x21/0x60 [ib_uverbs]
uverbs_destroy_uobject+0x38/0x1b0 [ib_uverbs]
__uverbs_cleanup_ufile+0xd1/0x150 [ib_uverbs]
uverbs_destroy_ufile_hw+0x3f/0x100 [ib_uverbs]
ib_uverbs_close+0x1f/0xb0 [ib_uverbs]
__fput+0x9c/0x280
____fput+0xe/0x20
task_work_run+0x6a/0xb0
do_exit+0x217/0x3c0
do_group_exit+0x3b/0xb0
get_signal+0x150/0x900
arch_do_signal_or_restart+0xde/0x100
exit_to_user_mode_loop+0xc4/0x160
exit_to_user_mode_prepare+0xa0/0xb0
syscall_exit_to_user_mode+0x27/0x50
do_syscall_64+0x63/0xb0
Fix soft lockup issues by incorporating cond_resched() calls within
__ib_umem_release(), and this SG entries are typically grouped in 2MB
chunks on x86_64, adding cond_resched() should has minimal performance
impact.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://patch.msgid.link/20251113095317.2628-1-lirongqing@baidu.com
Acked-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Fix couple of occurrences of the misspelled word "reource"
in the comments with the correct spelling "resource".
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20251113105457.879903-1-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-3-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.
Adding system_dfl_wq to encourage its use when unbound work should be used.
The old system_unbound_wq will be kept for a few release cycles.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-2-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
When a GSI MAD packet is sent on the QP, it will potentially be
retried CMA_MAX_CM_RETRIES times with a timeout value of:
4.096usec * 2 ^ CMA_CM_RESPONSE_TIMEOUT
The above equates to ~64 seconds using the default CMA values.
The cm_id_priv's refcount will be incremented for this period.
Therefore, the timeout value waiting for a cm_id destruction must be
based on the effective timeout of MAD packets. To provide additional
leeway, we add 25% to this timeout and use that instead of the
constant 10 seconds timeout, which may result in false negatives.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Link: https://patch.msgid.link/20251021132738.4179604-1-haakon.bugge@oracle.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
In `UVERBS_METHOD_CQ_CREATE`, umem should be released if anything goes
wrong. Currently, if `create_cq_umem` fails, umem would not be
released or referenced, causing a possible leak.
In this patch, we release umem at `UVERBS_METHOD_CQ_CREATE`, the driver
should not release umem if it returns an error code.
Fixes: 1a40c362ae26 ("RDMA/uverbs: Add a common way to create CQ with umem")
Signed-off-by: Shuhao Fu <sfual@cse.ust.hk>
Link: https://patch.msgid.link/aOh1le4YqtYwj-hH@osx.local
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
lockdep_assert_held(&id_priv->handler_mutex)
rdma_accept() also has this, so this is now more consistent and may
prevent bugs in future.
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Link: https://patch.msgid.link/20251008165913.444276-1-metze@samba.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|