summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2025-06-27mm/hugetlb: unshare page tables during VMA split, not beforeJann Horn
commit 081056dc00a27bccb55ccc3c6f230a3d5fd3f7e0 upstream. Currently, __split_vma() triggers hugetlb page table unsharing through vm_ops->may_split(). This happens before the VMA lock and rmap locks are taken - which is too early, it allows racing VMA-locked page faults in our process and racing rmap walks from other processes to cause page tables to be shared again before we actually perform the split. Fix it by explicitly calling into the hugetlb unshare logic from __split_vma() in the same place where THP splitting also happens. At that point, both the VMA and the rmap(s) are write-locked. An annoying detail is that we can now call into the helper hugetlb_unshare_pmds() from two different locking contexts: 1. from hugetlb_split(), holding: - mmap lock (exclusively) - VMA lock - file rmap lock (exclusively) 2. hugetlb_unshare_all_pmds(), which I think is designed to be able to call us with only the mmap lock held (in shared mode), but currently only runs while holding mmap lock (exclusively) and VMA lock Backporting note: This commit fixes a racy protection that was introduced in commit b30c14cd6102 ("hugetlb: unshare some PMDs when splitting VMAs"); that commit claimed to fix an issue introduced in 5.13, but it should actually also go all the way back. [jannh@google.com: v2] Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-1-1329349bad1a@google.com Link: https://lkml.kernel.org/r/20250528-hugetlb-fixes-splitrace-v2-0-1329349bad1a@google.com Link: https://lkml.kernel.org/r/20250527-hugetlb-fixes-splitrace-v1-1-f4136f5ec58a@google.com Fixes: 39dde65c9940 ("[PATCH] shared page table for hugetlb page") Signed-off-by: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [b30c14cd6102: hugetlb: unshare some PMDs when splitting VMAs] Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [stable backport: added missing include] Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27atm: Revert atm_account_tx() if copy_from_iter_full() fails.Kuniyuki Iwashima
commit 7851263998d4269125fd6cb3fdbfc7c6db853859 upstream. In vcc_sendmsg(), we account skb->truesize to sk->sk_wmem_alloc by atm_account_tx(). It is expected to be reverted by atm_pop_raw() later called by vcc->dev->ops->send(vcc, skb). However, vcc_sendmsg() misses the same revert when copy_from_iter_full() fails, and then we will leak a socket. Let's factorise the revert part as atm_return_tx() and call it in the failure path. Note that the corresponding sk_wmem_alloc operation can be found in alloc_tx() as of the blamed commit. $ git blame -L:alloc_tx net/atm/common.c c55fa3cccbc2c~ Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Simon Horman <horms@kernel.org> Closes: https://lore.kernel.org/netdev/20250614161959.GR414686@horms.kernel.org/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250616182147.963333-3-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27f2fs: fix to bail out in get_new_segment()Chao Yu
[ Upstream commit bb5eb8a5b222fa5092f60d5555867a05ebc3bdf2 ] ------------[ cut here ]------------ WARNING: CPU: 3 PID: 579 at fs/f2fs/segment.c:2832 new_curseg+0x5e8/0x6dc pc : new_curseg+0x5e8/0x6dc Call trace: new_curseg+0x5e8/0x6dc f2fs_allocate_data_block+0xa54/0xe28 do_write_page+0x6c/0x194 f2fs_do_write_node_page+0x38/0x78 __write_node_page+0x248/0x6d4 f2fs_sync_node_pages+0x524/0x72c f2fs_write_checkpoint+0x4bc/0x9b0 __checkpoint_and_complete_reqs+0x80/0x244 issue_checkpoint_thread+0x8c/0xec kthread+0x114/0x1bc ret_from_fork+0x10/0x20 get_new_segment() detects inconsistent status in between free_segmap and free_secmap, let's record such error into super block, and bail out get_new_segment() instead of continue using the segment. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27tcp: add receive queue awareness in tcp_rcv_space_adjust()Eric Dumazet
[ Upstream commit ea33537d82921e71f852ea2ed985acc562125efe ] If the application can not drain fast enough a TCP socket queue, tcp_rcv_space_adjust() can overestimate tp->rcvq_space.space. Then sk->sk_rcvbuf can grow and hit tcp_rmem[2] for no good reason. Fix this by taking into acount the number of available bytes. Keeping sk->sk_rcvbuf at the right size allows better cache efficiency. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Wei Wang <weiwan@google.com> Link: https://patch.msgid.link/20250513193919.1089692-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27ACPI: Add missing prototype for non CONFIG_SUSPEND/CONFIG_X86 caseMario Limonciello
[ Upstream commit e1bdbbc98279164d910d2de82a745f090a8b249f ] acpi_register_lps0_dev() and acpi_unregister_lps0_dev() may be used in drivers that don't require CONFIG_SUSPEND or compile on !X86. Add prototypes for those cases. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202502191627.fRgoBwcZ-lkp@intel.com/ Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://patch.msgid.link/20250407183656.1503446-1-superm1@kernel.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27mmc: Add quirk to disable DDR50 tuningErick Shepherd
[ Upstream commit 9510b38dc0ba358c93cbf5ee7c28820afb85937b ] Adds the MMC_QUIRK_NO_UHS_DDR50_TUNING quirk and updates mmc_execute_tuning() to return 0 if that quirk is set. This fixes an issue on certain Swissbit SD cards that do not support DDR50 tuning where tuning requests caused I/O errors to be thrown. Signed-off-by: Erick Shepherd <erick.shepherd@ni.com> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Link: https://lore.kernel.org/r/20250331221337.1414534-1-erick.shepherd@ni.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-27bus: firewall: Fix missing static inline annotations for stubsKrzysztof Kozlowski
commit 66db876162155c1cec87359cd78c62aaafde9257 upstream. Stubs in the header file for !CONFIG_STM32_FIREWALL case should be both static and inline, because they do not come with earlier declaration and should be inlined in every unit including the header. Cc: Patrice Chotard <patrice.chotard@foss.st.com> Cc: stable@vger.kernel.org Fixes: 5c9668cfc6d7 ("firewall: introduce stm32_firewall framework") Link: https://lore.kernel.org/r/20250507092121.95121-2-krzysztof.kozlowski@linaro.org Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-19overflow: Introduce __DEFINE_FLEX for having no initializerKees Cook
commit 5c78e793f78732b60276401f75cc1a101f9ad121 upstream. While not yet in the tree, there is a proposed patch[1] that was depending on the prior behavior of _DEFINE_FLEX, which did not have an explicit initializer. Provide this via __DEFINE_FLEX now, which can also have attributes applied (e.g. __uninitialized). Examples of the resulting initializer behaviors can be seen here: https://godbolt.org/z/P7Go8Tr33 Link: https://lore.kernel.org/netdev/20250520205920.2134829-9-anthony.l.nguyen@intel.com [1] Fixes: 47e36ed78406 ("overflow: Fix direct struct member initialization in _DEFINE_FLEX()") Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-19HID: usbhid: Eliminate recurrent out-of-bounds bug in usbhid_parse()Terry Junge
commit fe7f7ac8e0c708446ff017453add769ffc15deed upstream. Update struct hid_descriptor to better reflect the mandatory and optional parts of the HID Descriptor as per USB HID 1.11 specification. Note: the kernel currently does not parse any optional HID class descriptors, only the mandatory report descriptor. Update all references to member element desc[0] to rpt_desc. Add test to verify bLength and bNumDescriptors values are valid. Replace the for loop with direct access to the mandatory HID class descriptor member for the report descriptor. This eliminates the possibility of getting an out-of-bounds fault. Add a warning message if the HID descriptor contains any unsupported optional HID class descriptors. Reported-by: syzbot+c52569baf0c843f35495@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c52569baf0c843f35495 Fixes: f043bfc98c19 ("HID: usbhid: fix out-of-bounds bug") Cc: stable@vger.kernel.org Signed-off-by: Terry Junge <linuxhid@cosmicgizmosystems.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Jiri Kosina <jkosina@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-19block: Fix bvec_set_folio() for very large foliosMatthew Wilcox (Oracle)
[ Upstream commit 5e223e06ee7c6d8f630041a0645ac90e39a42cc6 ] Similarly to 26064d3e2b4d ("block: fix adding folio to bio"), if we attempt to add a folio that is larger than 4GB, we'll silently truncate the offset and len. Widen the parameters to size_t, assert that the length is less than 4GB and set the first page that contains the interesting data rather than the first page of the folio. Fixes: 26db5ee15851 (block: add a bvec_set_folio helper) Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20250612144255.2850278-1-willy@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAPMatthew Wilcox (Oracle)
[ Upstream commit f826ec7966a63d48e16e0868af4e038bf9a1a3ae ] It is possible for physically contiguous folios to have discontiguous struct pages if SPARSEMEM is enabled and SPARSEMEM_VMEMMAP is not. This is correctly handled by folio_page_idx(), so remove this open-coded implementation. Fixes: 640d1930bef4 (block: Add bio_for_each_folio_all()) Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20250612144126.2849931-1-willy@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19pmdomain: core: Introduce dev_pm_genpd_rpm_always_on()Ulf Hansson
[ Upstream commit cd3fa304ba5c93ce57b9b55b3cd893af2be96527 ] For some usecases a consumer driver requires its device to remain power-on from the PM domain perspective during runtime. Using dev PM qos along with the genpd governors, doesn't work for this case as would potentially prevent the device from being runtime suspended too. To support these usecases, let's introduce dev_pm_genpd_rpm_always_on() to allow consumers drivers to dynamically control the behaviour in genpd for a device that is attached to it. Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com> Acked-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Link: https://lore.kernel.org/r/1738736156-119203-4-git-send-email-shawn.lin@rock-chips.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Stable-dep-of: 08f959759e1e ("mmc: sdhci-of-dwcmshc: add PD workaround on RK3576") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19wifi: cfg80211/mac80211: correctly parse S1G beacon optional elementsLachlan Hodges
[ Upstream commit 1e1f706fc2ce90eaaf3480b3d5f27885960d751c ] S1G beacons are not traditional beacons but a type of extension frame. Extension frames contain the frame control and duration fields, followed by zero or more optional fields before the frame body. These optional fields are distinct from the variable length elements. The presence of optional fields is indicated in the frame control field. To correctly locate the elements offset, the frame control must be parsed to identify which optional fields are present. Currently, mac80211 parses S1G beacons based on fixed assumptions about the frame layout, without inspecting the frame control field. This can result in incorrect offsets to the "variable" portion of the frame. Properly parse S1G beacon frames by using the field lengths defined in IEEE 802.11-2024, section 9.3.4.3, ensuring that the elements offset is calculated accurately. Fixes: 9eaffe5078ca ("cfg80211: convert S1G beacon to scan results") Fixes: cd418ba63f0c ("mac80211: convert S1G beacon to scan results") Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com> Link: https://patch.msgid.link/20250603053538.468562-1-lachlan.hodges@morsemicro.com Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19nvme: fix command limits status codeKeith Busch
[ Upstream commit 10f4a7cd724e34b7a6ff96e57ac49dc0cadececc ] The command specific status code, 0x183, was introduced in the NVMe 2.0 specification defined to "Command Size Limits Exceeded" and only ever applied to DSM and Copy commands. Fix the name and, remove the incorrect translation to error codes and special treatment in the target code for it. Fixes: 3b7c33b28a44d4 ("nvme.h: add Write Zeroes definitions") Cc: Chaitanya Kulkarni <chaitanyak@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19coresight: Fixes device's owner field for registered using ↵Junhao He
coresight_init_driver() [ Upstream commit 9f52aecc952ddf307571517d5c91136c8c4e87c9 ] The coresight_init_driver() of the coresight-core module is called from the sub coresgiht device (such as tmc/stm/funnle/...) module. It calls amba_driver_register() and Platform_driver_register(), which are macro functions that use the coresight-core's module to initialize the caller's owner field. Therefore, when the sub coresight device calls coresight_init_driver(), an incorrect THIS_MODULE value is captured. The sub coesgiht modules can be removed while their callbacks are running, resulting in a general protection failure. Add module parameter to coresight_init_driver() so can be called with the module of the callback. Fixes: 075b7cd7ad7d ("coresight: Add helpers registering/removing both AMBA and platform drivers") Signed-off-by: Junhao He <hejunhao3@huawei.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20240918035327.9710-1-hejunhao3@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19PCI: endpoint: Retain fixed-size BAR size as well as aligned sizeJerome Brunet
[ Upstream commit 793908d60b8745c386b9f4e29eb702f74ceb0886 ] When allocating space for an endpoint function on a BAR with a fixed size, the size saved in 'struct pci_epf_bar.size' should be the fixed size as expected by pci_epc_set_bar(). However, if pci_epf_alloc_space() increased the allocation size to accommodate iATU alignment requirements, it previously saved the larger aligned size in .size, which broke pci_epc_set_bar(). To solve this, keep the fixed BAR size in .size and save the aligned size in a new .aligned_size for use when deallocating it. Fixes: 2a9a801620ef ("PCI: endpoint: Add support to specify alignment for buffers allocated to BARs") Signed-off-by: Jerome Brunet <jbrunet@baylibre.com> [mani: commit message fixup] Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> [bhelgaas: more specific subject, commit log, wrap comment to match file] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Niklas Cassel <cassel@kernel.org> Link: https://patch.msgid.link/20250424-pci-ep-size-alignment-v5-1-2d4ec2af23f5@baylibre.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19net: phy: fix up const issues in to_mdio_device() and to_phy_device()Greg Kroah-Hartman
[ Upstream commit e9cb929670a1e98b592b30f03f06e9e20110f318 ] Both to_mdio_device() and to_phy_device() "throw away" the const pointer attribute passed to them and return a non-const pointer, which generally is not a good thing overall. Fix this up by using container_of_const() which was designed for this very problem. Cc: Alexander Lobakin <alobakin@pm.me> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Heiner Kallweit <hkallweit1@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Fixes: 7eab14de73a8 ("mdio, phy: fix -Wshadow warnings triggered by nested container_of()") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://patch.msgid.link/2025052246-conduit-glory-8fc9@gregkh Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19vsock/virtio: fix `rx_bytes` accounting for stream socketsStefano Garzarella
[ Upstream commit 45ca7e9f0730ae36fc610e675b990e9cc9ca0714 ] In `struct virtio_vsock_sock`, we maintain two counters: - `rx_bytes`: used internally to track how many bytes have been read. This supports mechanisms like .stream_has_data() and sock_rcvlowat(). - `fwd_cnt`: used for the credit mechanism to inform available receive buffer space to the remote peer. These counters are updated via virtio_transport_inc_rx_pkt() and virtio_transport_dec_rx_pkt(). Since the beginning with commit 06a8fc78367d ("VSOCK: Introduce virtio_vsock_common.ko"), we call virtio_transport_dec_rx_pkt() in virtio_transport_stream_do_dequeue() only when we consume the entire packet, so partial reads, do not update `rx_bytes` and `fwd_cnt`. This is fine for `fwd_cnt`, because we still have space used for the entire packet, and we don't want to update the credit for the other peer until we free the space of the entire packet. However, this causes `rx_bytes` to be stale on partial reads. Previously, this didn’t cause issues because `rx_bytes` was used only by .stream_has_data(), and any unread portion of a packet implied data was still available. However, since commit 93b808876682 ("virtio/vsock: fix logic which reduces credit update messages"), we now rely on `rx_bytes` to determine if a credit update should be sent when the data in the RX queue drops below SO_RCVLOWAT value. This patch fixes the accounting by updating `rx_bytes` with the number of bytes actually read, even on partial reads, while leaving `fwd_cnt` untouched until the packet is fully consumed. Also introduce a new `buf_used` counter to check that the remote peer is honoring the given credit; this was previously done via `rx_bytes`. Fixes: 93b808876682 ("virtio/vsock: fix logic which reduces credit update messages") Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250521121705.196379-1-sgarzare@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19RDMA/mlx5: Fix error flow upon firmware failure for RQ destructionPatrisious Haddad
[ Upstream commit 5d2ea5aebbb2f3ebde4403f9c55b2b057e5dd2d6 ] Upon RQ destruction if the firmware command fails which is the last resource to be destroyed some SW resources were already cleaned regardless of the failure. Now properly rollback the object to its original state upon such failure. In order to avoid a use-after free in case someone tries to destroy the object again, which results in the following kernel trace: refcount_t: underflow; use-after-free. WARNING: CPU: 0 PID: 37589 at lib/refcount.c:28 refcount_warn_saturate+0xf4/0x148 Modules linked in: rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) rfkill mlx5_core(OE) mlxdevm(OE) ib_uverbs(OE) ib_core(OE) psample mlxfw(OE) mlx_compat(OE) macsec tls pci_hyperv_intf sunrpc vfat fat virtio_net net_failover failover fuse loop nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vmw_vmci vsock xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_console virtio_gpu virtio_blk virtio_dma_buf virtio_mmio dm_mirror dm_region_hash dm_log dm_mod xpmem(OE) CPU: 0 UID: 0 PID: 37589 Comm: python3 Kdump: loaded Tainted: G OE ------- --- 6.12.0-54.el10.aarch64 #1 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : refcount_warn_saturate+0xf4/0x148 lr : refcount_warn_saturate+0xf4/0x148 sp : ffff80008b81b7e0 x29: ffff80008b81b7e0 x28: ffff000133d51600 x27: 0000000000000001 x26: 0000000000000000 x25: 00000000ffffffea x24: ffff00010ae80f00 x23: ffff00010ae80f80 x22: ffff0000c66e5d08 x21: 0000000000000000 x20: ffff0000c66e0000 x19: ffff00010ae80340 x18: 0000000000000006 x17: 0000000000000000 x16: 0000000000000020 x15: ffff80008b81b37f x14: 0000000000000000 x13: 2e656572662d7265 x12: ffff80008283ef78 x11: ffff80008257efd0 x10: ffff80008283efd0 x9 : ffff80008021ed90 x8 : 0000000000000001 x7 : 00000000000bffe8 x6 : c0000000ffff7fff x5 : ffff0001fb8e3408 x4 : 0000000000000000 x3 : ffff800179993000 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000133d51600 Call trace: refcount_warn_saturate+0xf4/0x148 mlx5_core_put_rsc+0x88/0xa0 [mlx5_ib] mlx5_core_destroy_rq_tracked+0x64/0x98 [mlx5_ib] mlx5_ib_destroy_wq+0x34/0x80 [mlx5_ib] ib_destroy_wq_user+0x30/0xc0 [ib_core] uverbs_free_wq+0x28/0x58 [ib_uverbs] destroy_hw_idr_uobject+0x34/0x78 [ib_uverbs] uverbs_destroy_uobject+0x48/0x240 [ib_uverbs] __uverbs_cleanup_ufile+0xd4/0x1a8 [ib_uverbs] uverbs_destroy_ufile_hw+0x48/0x120 [ib_uverbs] ib_uverbs_close+0x2c/0x100 [ib_uverbs] __fput+0xd8/0x2f0 __fput_sync+0x50/0x70 __arm64_sys_close+0x40/0x90 invoke_syscall.constprop.0+0x74/0xd0 do_el0_svc+0x48/0xe8 el0_svc+0x44/0x1d0 el0t_64_sync_handler+0x120/0x130 el0t_64_sync+0x1a4/0x1a8 Fixes: e2013b212f9f ("net/mlx5_core: Add RQ and SQ event handling") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://patch.msgid.link/3181433ccdd695c63560eeeb3f0c990961732101.1745839855.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19page_pool: Track DMA-mapped pages and unmap them when destroying the poolToke Høiland-Jørgensen
[ Upstream commit ee62ce7a1d909ccba0399680a03c2dee83bcae95 ] When enabling DMA mapping in page_pool, pages are kept DMA mapped until they are released from the pool, to avoid the overhead of re-mapping the pages every time they are used. This causes resource leaks and/or crashes when there are pages still outstanding while the device is torn down, because page_pool will attempt an unmap through a non-existent DMA device on the subsequent page return. To fix this, implement a simple tracking of outstanding DMA-mapped pages in page pool using an xarray. This was first suggested by Mina[0], and turns out to be fairly straight forward: We simply store pointers to pages directly in the xarray with xa_alloc() when they are first DMA mapped, and remove them from the array on unmap. Then, when a page pool is torn down, it can simply walk the xarray and unmap all pages still present there before returning, which also allows us to get rid of the get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional synchronisation is needed, as a page will only ever be unmapped once. To avoid having to walk the entire xarray on unmap to find the page reference, we stash the ID assigned by xa_alloc() into the page structure itself, using the upper bits of the pp_magic field. This requires a couple of defines to avoid conflicting with the POINTER_POISON_DELTA define, but this is all evaluated at compile-time, so does not affect run-time performance. The bitmap calculations in this patch gives the following number of bits for different architectures: - 23 bits on 32-bit architectures - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE) - 32 bits on other 64-bit architectures Stashing a value into the unused bits of pp_magic does have the effect that it can make the value stored there lie outside the unmappable range (as governed by the mmap_min_addr sysctl), for architectures that don't define ILLEGAL_POINTER_VALUE. This means that if one of the pointers that is aliased to the pp_magic field (such as page->lru.next) is dereferenced while the page is owned by page_pool, that could lead to a dereference into userspace, which is a security concern. The risk of this is mitigated by the fact that (a) we always clear pp_magic before releasing a page from page_pool, and (b) this would need a use-after-free bug for struct page, which can have many other risks since page->lru.next is used as a generic list pointer in multiple places in the kernel. As such, with this patch we take the position that this risk is negligible in practice. For more discussion, see[1]. Since all the tracking added in this patch is performed on DMA map/unmap, no additional code is needed in the fast path, meaning the performance overhead of this tracking is negligible there. A micro-benchmark shows that the total overhead of the tracking itself is about 400 ns (39 cycles(tsc) 395.218 ns; sum for both map and unmap[2]). Since this cost is only paid on DMA map and unmap, it seems like an acceptable cost to fix the late unmap issue. Further optimisation can narrow the cases where this cost is paid (for instance by eliding the tracking when DMA map/unmap is a no-op). The extra memory needed to track the pages is neatly encapsulated inside xarray, which uses the 'struct xa_node' structure to track items. This structure is 576 bytes long, with slots for 64 items, meaning that a full node occurs only 9 bytes of overhead per slot it tracks (in practice, it probably won't be this efficient, but in any case it should be an acceptable overhead). [0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/ [1] https://lore.kernel.org/r/20250320023202.GA25514@openwall.com [2] https://lore.kernel.org/r/ae07144c-9295-4c9d-a400-153bb689fe9e@huawei.com Reported-by: Yonglong Liu <liuyonglong@huawei.com> Closes: https://lore.kernel.org/r/8743264a-9700-4227-a556-5f931c720211@huawei.com Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code") Suggested-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> Tested-by: Jesper Dangaard Brouer <hawk@kernel.org> Tested-by: Qiuling Ren <qren@redhat.com> Tested-by: Yuying Ma <yuma@redhat.com> Tested-by: Yonglong Liu <liuyonglong@huawei.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-2-6a9ef2e0cba8@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19page_pool: Move pp_magic check into helper functionsToke Høiland-Jørgensen
[ Upstream commit cd3c93167da0e760b5819246eae7a4ea30fd014b ] Since we are about to stash some more information into the pp_magic field, let's move the magic signature checks into a pair of helper functions so it can be changed in one place. Reviewed-by: Mina Almasry <almasrymina@google.com> Tested-by: Yonglong Liu <liuyonglong@huawei.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-1-6a9ef2e0cba8@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19overflow: Fix direct struct member initialization in _DEFINE_FLEX()Gustavo A. R. Silva
[ Upstream commit 47e36ed7840661a9f7fb53554a1b04a5f8daffea ] Currently, to statically initialize the struct members of the `type` object created by _DEFINE_FLEX(), the internal `obj` member must be explicitly referenced at the call site. See: struct flex { int a; int b; struct foo flex_array[]; }; _DEFINE_FLEX(struct flex, instance, flex_array, FIXED_SIZE, = { .obj = { .a = 0, .b = 1, }, }); This leaks _DEFINE_FLEX() internal implementation details and make the helper harder to use and read. Fix this and allow for a more natural and intuitive C99 init-style: _DEFINE_FLEX(struct flex, instance, flex_array, FIXED_SIZE, = { .a = 0, .b = 1, }); Note that before these changes, the `initializer` argument was optional, but now it's required. Also, update "counter" member initialization in DEFINE_FLEX(). Fixes: 26dd68d293fd ("overflow: add DEFINE_FLEX() for on-stack allocs") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://lore.kernel.org/r/aBQVeyKfLOkO9Yss@kspp Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-19firmware: SDEI: Allow sdei initialization without ACPI_APEI_GHESHuang Yiwei
[ Upstream commit 59529bbe642de4eb2191a541d9b4bae7eb73862e ] SDEI usually initialize with the ACPI table, but on platforms where ACPI is not used, the SDEI feature can still be used to handle specific firmware calls or other customized purposes. Therefore, it is not necessary for ARM_SDE_INTERFACE to depend on ACPI_APEI_GHES. In commit dc4e8c07e9e2 ("ACPI: APEI: explicit init of HEST and GHES in acpi_init()"), to make APEI ready earlier, sdei_init was moved into acpi_ghes_init instead of being a standalone initcall, adding ACPI_APEI_GHES dependency to ARM_SDE_INTERFACE. This restricts the flexibility and usability of SDEI. This patch corrects the dependency in Kconfig and splits sdei_init() into two separate functions: sdei_init() and acpi_sdei_init(). sdei_init() will be called by arch_initcall and will only initialize the platform driver, while acpi_sdei_init() will initialize the device from acpi_ghes_init() when ACPI is ready. This allows the initialization of SDEI without ACPI_APEI_GHES enabled. Fixes: dc4e8c07e9e2 ("ACPI: APEI: explicit init of HEST and GHES in apci_init()") Cc: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Huang Yiwei <quic_hyiwei@quicinc.com> Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20250507045757.2658795-1-quic_hyiwei@quicinc.com Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-04NFS: Avoid flushing data while holding directory locks in nfs_rename()Trond Myklebust
[ Upstream commit dcd21b609d4abc7303f8683bce4f35d78d7d6830 ] The Linux client assumes that all filehandles are non-volatile for renames within the same directory (otherwise sillyrename cannot work). However, the existence of the Linux 'subtree_check' export option has meant that nfs_rename() has always assumed it needs to flush writes before attempting to rename. Since NFSv4 does allow the client to query whether or not the server exhibits this behaviour, and since knfsd does actually set the appropriate flag when 'subtree_check' is enabled on an export, it should be OK to optimise away the write flushing behaviour in the cases where it is clearly not needed. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-06-04coredump: hand a pidfd to the usermode coredump helperChristian Brauner
commit b5325b2a270fcaf7b2a9a0f23d422ca8a5a8bdea upstream. Give userspace a way to instruct the kernel to install a pidfd into the usermode helper process. This makes coredump handling a lot more reliable for userspace. In parallel with this commit we already have systemd adding support for this in [1]. We create a pidfs file for the coredumping process when we process the corename pattern. When the usermode helper process is forked we then install the pidfs file as file descriptor three into the usermode helpers file descriptor table so it's available to the exec'd program. Since usermode helpers are either children of the system_unbound_wq workqueue or kthreadd we know that the file descriptor table is empty and can thus always use three as the file descriptor number. Note, that we'll install a pidfd for the thread-group leader even if a subthread is calling do_coredump(). We know that task linkage hasn't been removed due to delay_group_leader() and even if this @current isn't the actual thread-group leader we know that the thread-group leader cannot be reaped until @current has exited. [brauner: This is a backport for the v6.12 series. The upstream kernel has changed pidfs_alloc_file() to set O_RDWR implicitly instead of forcing callers to set it. Let's minimize the churn and just let the coredump umh handler raise O_RDWR.] Link: https://github.com/systemd/systemd/pull/37125 [1] Link: https://lore.kernel.org/20250414-work-coredump-v2-3-685bf231f828@kernel.org Tested-by: Luca Boccassi <luca.boccassi@gmail.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-29err.h: move IOMEM_ERR_PTR() to err.hRaag Jadav
commit 18311a766c587fc69b1806f1d5943305903b7e6e upstream. Since IOMEM_ERR_PTR() macro deals with an error pointer, a better place for it is err.h. This helps avoid dependency on io.h for the users that don't need it. Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Raag Jadav <raag.jadav@intel.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-29spi: use container_of_cont() for to_spi_device()Greg Kroah-Hartman
[ Upstream commit 1007ae0d464ceb55a3740634790521d3543aaab9 ] Some places in the spi core pass in a const pointer to a device and the default container_of() casts that away, which is not a good idea. Preserve the proper const attribute by using container_of_const() for to_spi_device() instead, which is what it was designed for. Note, this removes the NULL check for a device pointer in the call, but no one was ever checking for that return value, and a device pointer should never be NULL overall anyway, so this should be a safe change. Cc: Mark Brown <broonie@kernel.org> Fixes: d69d80484598 ("driver core: have match() callback in struct bus_type take a const *") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://patch.msgid.link/2025052230-fidgeting-stooge-66f5@gregkh Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29mm: mmap: map MAP_STACK to VM_NOHUGEPAGE only if THP is enabledIgnacio Moreno Gonzalez
commit 7190b3c8bd2b0cde483bd440cf91ba1c518b4261 upstream. commit c4608d1bf7c6 ("mm: mmap: map MAP_STACK to VM_NOHUGEPAGE") maps the mmap option MAP_STACK to VM_NOHUGEPAGE. This is also done if CONFIG_TRANSPARENT_HUGEPAGE is not defined. But in that case, the VM_NOHUGEPAGE does not make sense. I discovered this issue when trying to use the tool CRIU to checkpoint and restore a container. Our running kernel is compiled without CONFIG_TRANSPARENT_HUGEPAGE. CRIU parses the output of /proc/<pid>/smaps and saves the "nh" flag. When trying to restore the container, CRIU fails to restore the "nh" mappings, since madvise() MADV_NOHUGEPAGE always returns an error because CONFIG_TRANSPARENT_HUGEPAGE is not defined. Link: https://lkml.kernel.org/r/20250507-map-map_stack-to-vm_nohugepage-only-if-thp-is-enabled-v5-1-c6c38cfefd6e@kuka.com Fixes: c4608d1bf7c6 ("mm: mmap: map MAP_STACK to VM_NOHUGEPAGE") Signed-off-by: Ignacio Moreno Gonzalez <Ignacio.MorenoGonzalez@kuka.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Yang Shi <yang@os.amperecomputing.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-29highmem: add folio_test_partial_kmap()Matthew Wilcox (Oracle)
commit 97dfbbd135cb5e4426f37ca53a8fa87eaaa4e376 upstream. In commit c749d9b7ebbc ("iov_iter: fix copy_page_from_iter_atomic() if KMAP_LOCAL_FORCE_MAP"), Hugh correctly noted that if KMAP_LOCAL_FORCE_MAP is enabled, we must limit ourselves to PAGE_SIZE bytes per call to kmap_local(). The same problem exists in memcpy_from_folio(), memcpy_to_folio(), folio_zero_tail(), folio_fill_tail() and memcpy_from_file_folio(), so add folio_test_partial_kmap() to do this more succinctly. Link: https://lkml.kernel.org/r/20250514170607.3000994-2-willy@infradead.org Fixes: 00cdf76012ab ("mm: add memcpy_from_file_folio()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-29devres: Introduce devm_kmemdup_array()Raag Jadav
[ Upstream commit a103b833ac3806b816bc993cba77d0b17cf801f1 ] Introduce '_array' variant of devm_kmemdup() which is more robust and consistent with alloc family of helpers. Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Raag Jadav <raag.jadav@intel.com> Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Stable-dep-of: 7dd7f39fce00 ("ASoC: SOF: Intel: hda: Fix UAF when reloading module") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29driver core: Split devres APIs to device/devres.hAndy Shevchenko
[ Upstream commit a21cad9312767d26b5257ce0662699bb202cdda1 ] device.h is a huge header which is hard to follow and easy to miss something. Improve that by splitting devres APIs to device/devres.h. In particular this helps to speedup the build of the code that includes device.h solely for a devres APIs. While at it, cast the error pointers to __iomem using IOMEM_ERR_PTR() and fix sparse warnings. Signed-off-by: Raag Jadav <raag.jadav@intel.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Stable-dep-of: 7dd7f39fce00 ("ASoC: SOF: Intel: hda: Fix UAF when reloading module") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29perf: Avoid the read if the count is already updatedPeter Zijlstra (Intel)
[ Upstream commit 8ce939a0fa194939cc1f92dbd8bc1a7806e7d40a ] The event may have been updated in the PMU-specific implementation, e.g., Intel PEBS counters snapshotting. The common code should not read and overwrite the value. The PERF_SAMPLE_READ in the data->sample_type can be used to detect whether the PMU-specific value is available. If yes, avoid the pmu->read() in the common code. Add a new flag, skip_read, to track the case. Factor out a perf_pmu_read() to clean up the code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250121152303.3128733-3-kan.liang@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29rcu: fix header guard for rcu_all_qs()Ankur Arora
[ Upstream commit ad6b5b73ff565e88aca7a7d1286788d80c97ba71 ] rcu_all_qs() is defined for !CONFIG_PREEMPT_RCU but the declaration is conditioned on CONFIG_PREEMPTION. With CONFIG_PREEMPT_LAZY, CONFIG_PREEMPTION=y does not imply CONFIG_PREEMPT_RCU=y. Decouple the two. Cc: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29rcu: handle unstable rdp in rcu_read_unlock_strict()Ankur Arora
[ Upstream commit fcf0e25ad4c8d14d2faab4d9a17040f31efce205 ] rcu_read_unlock_strict() can be called with preemption enabled which can make for an unstable rdp and a racy norm value. Fix this by dropping the preempt-count in __rcu_read_unlock() after the call to rcu_read_unlock_strict(), adjusting the preempt-count check appropriately. Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29r8152: add vendor/device ID pair for Dell Alienware AW1022zAleksander Jan Bajkowski
[ Upstream commit 848b09d53d923b4caee5491f57a5c5b22d81febc ] The Dell AW1022z is an RTL8156B based 2.5G Ethernet controller. Add the vendor and product ID values to the driver. This makes Ethernet work with the adapter. Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl> Link: https://patch.msgid.link/20250206224033.980115-1-olek2@wp.pl Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29net/mlx4_core: Avoid impossible mlx4_db_alloc() order valueKees Cook
[ Upstream commit 4a6f18f28627e121bd1f74b5fcc9f945d6dbeb1e ] GCC can see that the value range for "order" is capped, but this leads it to consider that it might be negative, leading to a false positive warning (with GCC 15 with -Warray-bounds -fdiagnostics-details): ../drivers/net/ethernet/mellanox/mlx4/alloc.c:691:47: error: array subscript -1 is below array bounds of 'long unsigned int *[2]' [-Werror=array-bounds=] 691 | i = find_first_bit(pgdir->bits[o], MLX4_DB_PER_PAGE >> o); | ~~~~~~~~~~~^~~ 'mlx4_alloc_db_from_pgdir': events 1-2 691 | i = find_first_bit(pgdir->bits[o], MLX4_DB_PER_PAGE >> o); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | | | | | (2) out of array bounds here | (1) when the condition is evaluated to true In file included from ../drivers/net/ethernet/mellanox/mlx4/mlx4.h:53, from ../drivers/net/ethernet/mellanox/mlx4/alloc.c:42: ../include/linux/mlx4/device.h:664:33: note: while referencing 'bits' 664 | unsigned long *bits[2]; | ^~~~ Switch the argument to unsigned int, which removes the compiler needing to consider negative values. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20250210174504.work.075-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29mfd: axp20x: AXP717: Add AXP717_TS_PIN_CFG to writeable regsChris Morgan
[ Upstream commit bfad07fe298bfba0c7ddab87c5b5325970203a1e ] Add AXP717_TS_PIN_CFG (register 0x50) to the table of writeable registers so that the temperature sensor can be configured by the battery driver. Signed-off-by: Chris Morgan <macromorgan@hotmail.com> Link: https://lore.kernel.org/r/20250204155835.161973-3-macroalpha82@gmail.com Signed-off-by: Lee Jones <lee@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29genirq/msi: Store the IOMMU IOVA directly in msi_desc instead of iommu_cookieJason Gunthorpe
[ Upstream commit 1f7df3a691740a7736bbc99dc4ed536120eb4746 ] The IOMMU translation for MSI message addresses has been a 2-step process, separated in time: 1) iommu_dma_prepare_msi(): A cookie pointer containing the IOVA address is stored in the MSI descriptor when an MSI interrupt is allocated. 2) iommu_dma_compose_msi_msg(): this cookie pointer is used to compute a translated message address. This has an inherent lifetime problem for the pointer stored in the cookie that must remain valid between the two steps. However, there is no locking at the irq layer that helps protect the lifetime. Today, this works under the assumption that the iommu domain is not changed while MSI interrupts being programmed. This is true for normal DMA API users within the kernel, as the iommu domain is attached before the driver is probed and cannot be changed while a driver is attached. Classic VFIO type1 also prevented changing the iommu domain while VFIO was running as it does not support changing the "container" after starting up. However, iommufd has improved this so that the iommu domain can be changed during VFIO operation. This potentially allows userspace to directly race VFIO_DEVICE_ATTACH_IOMMUFD_PT (which calls iommu_attach_group()) and VFIO_DEVICE_SET_IRQS (which calls into iommu_dma_compose_msi_msg()). This potentially causes both the cookie pointer and the unlocked call to iommu_get_domain_for_dev() on the MSI translation path to become UAFs. Fix the MSI cookie UAF by removing the cookie pointer. The translated IOVA address is already known during iommu_dma_prepare_msi() and cannot change. Thus, it can simply be stored as an integer in the MSI descriptor. The other UAF related to iommu_get_domain_for_dev() will be addressed in patch "iommu: Make iommu_dma_prepare_msi() into a generic operation" by using the IOMMU group mutex. Link: https://patch.msgid.link/r/a4f2cd76b9dc1833ee6c1cf325cba57def22231c.1740014950.git.nicolinc@nvidia.com Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29net/mlx5: Change POOL_NEXT_SIZE define value and make it globalPatrisious Haddad
[ Upstream commit 80df31f384b4146a62a01b3d4beb376cc7b9a89e ] Change POOL_NEXT_SIZE define value from 0 to BIT(30), since this define is used to request the available maximum sized flow table, and zero doesn't make sense for it, whereas some places in the driver use zero explicitly expecting the smallest table size possible but instead due to this define they end up allocating the biggest table size unawarely. In addition move the definition to "include/linux/mlx5/fs.h" to expose the define to IB driver as well, while appropriately renaming it. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250219085808.349923-3-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29net/mlx5e: Add correct match to check IPSec syndromes for switchdev modeJianbo Liu
[ Upstream commit 85e4a808af2545fefaf18c8fe50071b06fcbdabc ] In commit dddb49b63d86 ("net/mlx5e: Add IPsec and ASO syndromes check in HW"), IPSec and ASO syndromes checks after decryption for the specified ASO object were added. But they are correct only for eswith in legacy mode. For switchdev mode, metadata register c1 is used to save the mapped id (not ASO object id). So, need to change the match accordingly for the check rules in status table. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250220213959.504304-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29crypto: lzo - Fix compression buffer overrunHerbert Xu
[ Upstream commit cc47f07234f72cbd8e2c973cdbf2a6730660a463 ] Unlike the decompression code, the compression code in LZO never checked for output overruns. It instead assumes that the caller always provides enough buffer space, disregarding the buffer length provided by the caller. Add a safe compression interface that checks for the end of buffer before each write. Use the safe interface in crypto/lzo. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29ipv6: save dontfrag in corkWillem de Bruijn
[ Upstream commit a18dfa9925b9ef6107ea3aa5814ca3c704d34a8a ] When spanning datagram construction over multiple send calls using MSG_MORE, per datagram settings are configured on the first send. That is when ip(6)_setup_cork stores these settings for subsequent use in __ip(6)_append_data and others. The only flag that escaped this was dontfrag. As a result, a datagram could be constructed with df=0 on the first sendmsg, but df=1 on a next. Which is what cmsg_ip.sh does in an upcoming MSG_MORE test in the "diff" scenario. Changing datagram conditions in the middle of constructing an skb makes this already complex code path even more convoluted. It is here unintentional. Bring this flag in line with expected sockopt/cmsg behavior. And stop passing ipc6 to __ip6_append_data, to avoid such issues in the future. This is already the case for __ip_append_data. inet6_cork had a 6 byte hole, so the 1B flag has no impact. Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250307033620.411611-3-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29PNP: Expand length of fixup id stringKees Cook
[ Upstream commit 425b1c97b07f2290700f708edabef32861e2b2db ] GCC 15's -Wunterminated-string-initialization saw that "id" was not including the required trailing NUL character. Instead of marking "id" with __nonstring[1], expand the length of the string as it is used in (debugging) format strings that expect a properly formed C string. Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117178 [1] Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20250310222432.work.826-kees@kernel.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29bpf: Allow pre-ordering for bpf cgroup progsYonghong Song
[ Upstream commit 4b82b181a26cff8bf7adc3a85a88d121d92edeaf ] Currently for bpf progs in a cgroup hierarchy, the effective prog array is computed from bottom cgroup to upper cgroups (post-ordering). For example, the following cgroup hierarchy root cgroup: p1, p2 subcgroup: p3, p4 have BPF_F_ALLOW_MULTI for both cgroup levels. The effective cgroup array ordering looks like p3 p4 p1 p2 and at run time, progs will execute based on that order. But in some cases, it is desirable to have root prog executes earlier than children progs (pre-ordering). For example, - prog p1 intends to collect original pkt dest addresses. - prog p3 will modify original pkt dest addresses to a proxy address for security reason. The end result is that prog p1 gets proxy address which is not what it wants. Putting p1 to every child cgroup is not desirable either as it will duplicate itself in many child cgroups. And this is exactly a use case we are encountering in Meta. To fix this issue, let us introduce a flag BPF_F_PREORDER. If the flag is specified at attachment time, the prog has higher priority and the ordering with that flag will be from top to bottom (pre-ordering). For example, in the above example, root cgroup: p1, p2 subcgroup: p3, p4 Let us say p2 and p4 are marked with BPF_F_PREORDER. The final effective array ordering will be p2 p4 p3 p1 Suggested-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250224230116.283071-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29tracing: Mark binary printing functions with __printf() attributeAndy Shevchenko
[ Upstream commit 196a062641fe68d9bfe0ad36b6cd7628c99ad22c ] Binary printing functions are using printf() type of format, and compiler is not happy about them as is: kernel/trace/trace.c:3292:9: error: function ‘trace_vbprintk’ might be a candidate for ‘gnu_printf’ format attribute [-Werror=suggest-attribute=format] kernel/trace/trace_seq.c:182:9: error: function ‘trace_seq_bprintf’ might be a candidate for ‘gnu_printf’ format attribute [-Werror=suggest-attribute=format] Fix the compilation errors by adding __printf() attribute. While at it, move existing __printf() attributes from the implementations to the declarations. IT also fixes incorrect attribute parameters that are used for trace_array_printk(). Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20250321144822.324050-4-andriy.shevchenko@linux.intel.com Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29iommufd: Extend IOMMU_GET_HW_INFO to report PASID capabilityYi Liu
[ Upstream commit 803f97298e7de9242eb677a1351dcafbbcc9117e ] PASID usage requires PASID support in both device and IOMMU. Since the iommu drivers always enable the PASID capability for the device if it is supported, this extends the IOMMU_GET_HW_INFO to report the PASID capability to userspace. Also, enhances the selftest accordingly. Link: https://patch.msgid.link/r/20250321180143.8468-5-yi.l.liu@intel.com Cc: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> #aarch64 platform Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29fs/buffer: introduce sleeping flavors for pagecache lookupsDavidlohr Bueso
[ Upstream commit 2814a7d3d2ff5d2cdd22936f641f758fdb971fa0 ] Add __find_get_block_nonatomic() and sb_find_get_block_nonatomic() calls for which users will be converted where safe. These versions will take the folio lock instead of the mapping's private_lock. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://kdevops.org/ext4/v6.15-rc2.html # [0] Link: https://lore.kernel.org/all/aAAEvcrmREWa1SKF@bombadil.infradead.org/ # [1] Link: https://lore.kernel.org/20250418015921.132400-3-dave@stgolabs.net Tested-by: kdevops@lists.linux.dev Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-29dma-mapping: avoid potential unused data compilation warningMarek Szyprowski
[ Upstream commit c9b19ea63036fc537a69265acea1b18dabd1cbd3 ] When CONFIG_NEED_DMA_MAP_STATE is not defined, dma-mapping clients might report unused data compilation warnings for dma_unmap_*() calls arguments. Redefine macros for those calls to let compiler to notice that it is okay when the provided arguments are not used. Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20250415075659.428549-1-m.szyprowski@samsung.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-22tpm: Mask TPM RC in tpm2_start_auth_session()Jarkko Sakkinen
commit 539fbab37881e32ba6a708a100de6db19e1e7e7d upstream. tpm2_start_auth_session() does not mask TPM RC correctly from the callers: [ 28.766528] tpm tpm0: A TPM error (2307) occurred start auth session Process TPM RCs inside tpm2_start_auth_session(), and map them to POSIX error codes. Cc: stable@vger.kernel.org # v6.10+ Fixes: 699e3efd6c64 ("tpm: Add HMAC session start and end functions") Reported-by: Herbert Xu <herbert@gondor.apana.org.au> Closes: https://lore.kernel.org/linux-integrity/Z_NgdRHuTKP6JK--@gondor.apana.org.au/ Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-22scsi: sd_zbc: block: Respect bio vector limits for REPORT ZONES bufferSteve Siwinski
commit e8007fad5457ea547ca63bb011fdb03213571c7e upstream. The REPORT ZONES buffer size is currently limited by the HBA's maximum segment count to ensure the buffer can be mapped. However, the block layer further limits the number of iovec entries to 1024 when allocating a bio. To avoid allocation of buffers too large to be mapped, further restrict the maximum buffer size to BIO_MAX_INLINE_VECS. Replace the UIO_MAXIOV symbolic name with the more contextually appropriate BIO_MAX_INLINE_VECS. Fixes: b091ac616846 ("sd_zbc: Fix report zones buffer allocation") Cc: stable@vger.kernel.org Signed-off-by: Steve Siwinski <ssiwinski@atto.com> Link: https://lore.kernel.org/r/20250508200122.243129-1-ssiwinski@atto.com Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>