summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2025-10-12driver core/PM: Set power.no_callbacks along with power.no_pmRafael J. Wysocki
commit c2ce2453413d429e302659abc5ace634e873f6f5 upstream. Devices with power.no_pm set are not expected to need any power management at all, so modify device_set_pm_not_required() to set power.no_callbacks for them too in case runtime PM will be enabled for any of them (which in principle may be done for convenience if such a device participates in a dependency chain). Since device_set_pm_not_required() must be called before device_add() or it would not have any effect, it can update power.no_callbacks without locking, unlike pm_runtime_no_callbacks() that can be called after registering the target device. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: stable <stable@kernel.org> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/1950054.tdWV9SEqCh@rafael.j.wysocki Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-02net/mlx5: fs, fix UAF in flow counter releaseMoshe Shemesh
[ Upstream commit 6043819e707cefb1c9e59d6e431dcfa735c4f975 ] Fix a kernel trace [1] caused by releasing an HWS action of a local flow counter in mlx5_cmd_hws_delete_fte(), where the HWS action refcount and mutex were not initialized and the counter struct could already be freed when deleting the rule. Fix it by adding the missing initializations and adding refcount for the local flow counter struct. [1] Kernel log: Call Trace: <TASK> dump_stack_lvl+0x34/0x48 mlx5_fs_put_hws_action.part.0.cold+0x21/0x94 [mlx5_core] mlx5_fc_put_hws_action+0x96/0xad [mlx5_core] mlx5_fs_destroy_fs_actions+0x8b/0x152 [mlx5_core] mlx5_cmd_hws_delete_fte+0x5a/0xa0 [mlx5_core] del_hw_fte+0x1ce/0x260 [mlx5_core] mlx5_del_flow_rules+0x12d/0x240 [mlx5_core] ? ttwu_queue_wakelist+0xf4/0x110 mlx5_ib_destroy_flow+0x103/0x1b0 [mlx5_ib] uverbs_free_flow+0x20/0x50 [ib_uverbs] destroy_hw_idr_uobject+0x1b/0x50 [ib_uverbs] uverbs_destroy_uobject+0x34/0x1a0 [ib_uverbs] uobj_destroy+0x3c/0x80 [ib_uverbs] ib_uverbs_run_method+0x23e/0x360 [ib_uverbs] ? uverbs_finalize_object+0x60/0x60 [ib_uverbs] ib_uverbs_cmd_verbs+0x14f/0x2c0 [ib_uverbs] ? do_tty_write+0x1a9/0x270 ? file_tty_write.constprop.0+0x98/0xc0 ? new_sync_write+0xfc/0x190 ib_uverbs_ioctl+0xd7/0x160 [ib_uverbs] __x64_sys_ioctl+0x87/0xc0 do_syscall_64+0x59/0x90 Fixes: b581f4266928 ("net/mlx5: fs, manage flow counters HWS action sharing by refcount") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1758525094-816583-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-02firmware: imx: Add stub functions for SCMI CPU APIPeng Fan
[ Upstream commit 222accf05fc42f68ae02065d9c1542c20315118b ] To ensure successful builds when CONFIG_IMX_SCMI_CPU_DRV is not enabled, this patch adds static inline stub implementations for the following functions: - scmi_imx_cpu_start() - scmi_imx_cpu_started() - scmi_imx_cpu_reset_vector_set() These stubs return -EOPNOTSUPP to indicate that the functionality is not supported in the current configuration. This avoids potential build or link errors in code that conditionally calls these functions based on feature availability. Fixes: 1055faa5d660 ("firmware: imx: Add i.MX95 SCMI CPU driver") Reviewed-by: Cristian Marussi <cristian.marussi@arm.com> Signed-off-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-02firmware: imx: Add stub functions for SCMI LMM APIPeng Fan
[ Upstream commit 3fb91b5c86d0fb5ff6f65c30a4f20193166e22fe ] To ensure successful builds when CONFIG_IMX_SCMI_LMM_DRV is not enabled, this patch adds static inline stub implementations for the following functions: - scmi_imx_lmm_operation() - scmi_imx_lmm_info() - scmi_imx_lmm_reset_vector_set() These stubs return -EOPNOTSUPP to indicate that the functionality is not supported in the current configuration. This avoids potential build or link errors in code that conditionally calls these functions based on feature availability. Fixes: 7242bbf418f0 ("firmware: imx: Add i.MX95 SCMI LMM driver") Reviewed-by: Cristian Marussi <cristian.marussi@arm.com> Signed-off-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-02firmware: imx: Add stub functions for SCMI MISC APIPeng Fan
[ Upstream commit b2461e20fa9ac18b1305bba5bc7e22ebf644ea01 ] To ensure successful builds when CONFIG_IMX_SCMI_MISC_DRV is not enabled, this patch adds static inline stub implementations for the following functions: - scmi_imx_misc_ctrl_get() - scmi_imx_misc_ctrl_set() These stubs return -EOPNOTSUPP to indicate that the functionality is not supported in the current configuration. This avoids potential build or link errors in code that conditionally calls these functions based on feature availability. This patch also drops the changes in commit 540c830212ed ("firmware: imx: remove duplicate scmi_imx_misc_ctrl_get()"). The original change aimed to simplify the handling of optional features by removing conditional stubs. However, the use of conditional stubs is necessary when CONFIG_IMX_SCMI_MISC_DRV is n, while consumer driver is set to y. This is not a matter of preserving legacy patterns, but rather to ensure that there is no link error whether for module or built-in. Fixes: 0b4f8a68b292 ("firmware: imx: Add i.MX95 MISC driver") Reviewed-by: Cristian Marussi <cristian.marussi@arm.com> Signed-off-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: Shawn Guo <shawnguo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-25io_uring/msg_ring: kill alloc_cache for io_kiocb allocationsJens Axboe
[ Upstream commit df8922afc37aa2111ca79a216653a629146763ad ] A recent commit: fc582cd26e88 ("io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCU") fixed an issue with not deferring freeing of io_kiocb structs that msg_ring allocates to after the current RCU grace period. But this only covers requests that don't end up in the allocation cache. If a request goes into the alloc cache, it can get reused before it is sane to do so. A recent syzbot report would seem to indicate that there's something there, however it may very well just be because of the KASAN poisoning that the alloc_cache handles manually. Rather than attempt to make the alloc_cache sane for that use case, just drop the usage of the alloc_cache for msg_ring request payload data. Fixes: 50cf5f3842af ("io_uring/msg_ring: add an alloc cache for io_kiocb entries") Link: https://lore.kernel.org/io-uring/68cc2687.050a0220.139b6.0005.GAE@google.com/ Reported-by: syzbot+baa2e0f4e02df602583e@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-25mm: folio_may_be_lru_cached() unless folio_test_large()Hugh Dickins
commit 2da6de30e60dd9bb14600eff1cc99df2fa2ddae3 upstream. mm/swap.c and mm/mlock.c agree to drain any per-CPU batch as soon as a large folio is added: so collect_longterm_unpinnable_folios() just wastes effort when calling lru_add_drain[_all]() on a large folio. But although there is good reason not to batch up PMD-sized folios, we might well benefit from batching a small number of low-order mTHPs (though unclear how that "small number" limitation will be implemented). So ask if folio_may_be_lru_cached() rather than !folio_test_large(), to insulate those particular checks from future change. Name preferred to "folio_is_batchable" because large folios can well be put on a batch: it's just the per-CPU LRU caches, drained much later, which need care. Marked for stable, to counter the increase in lru_add_drain_all()s from "mm/gup: check ref_count instead of lru before migration". Link: https://lkml.kernel.org/r/57d2eaf8-3607-f318-e0c5-be02dce61ad0@google.com Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region") Signed-off-by: Hugh Dickins <hughd@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Keir Fraser <keirf@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: yangge <yangge1116@126.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-25net/mlx5e: Harden uplink netdev access against device unbindJianbo Liu
[ Upstream commit 6b4be64fd9fec16418f365c2d8e47a7566e9eba5 ] The function mlx5_uplink_netdev_get() gets the uplink netdevice pointer from mdev->mlx5e_res.uplink_netdev. However, the netdevice can be removed and its pointer cleared when unbound from the mlx5_core.eth driver. This results in a NULL pointer, causing a kernel panic. BUG: unable to handle page fault for address: 0000000000001300 at RIP: 0010:mlx5e_vport_rep_load+0x22a/0x270 [mlx5_core] Call Trace: <TASK> mlx5_esw_offloads_rep_load+0x68/0xe0 [mlx5_core] esw_offloads_enable+0x593/0x910 [mlx5_core] mlx5_eswitch_enable_locked+0x341/0x420 [mlx5_core] mlx5_devlink_eswitch_mode_set+0x17e/0x3a0 [mlx5_core] devlink_nl_eswitch_set_doit+0x60/0xd0 genl_family_rcv_msg_doit+0xe0/0x130 genl_rcv_msg+0x183/0x290 netlink_rcv_skb+0x4b/0xf0 genl_rcv+0x24/0x40 netlink_unicast+0x255/0x380 netlink_sendmsg+0x1f3/0x420 __sock_sendmsg+0x38/0x60 __sys_sendto+0x119/0x180 do_syscall_64+0x53/0x1d0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Ensure the pointer is valid before use by checking it for NULL. If it is valid, immediately call netdev_hold() to take a reference, and preventing the netdevice from being freed while it is in use. Fixes: 7a9fb35e8c3a ("net/mlx5e: Do not reload ethernet ports when changing eswitch mode") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1757939074-617281-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-19mtd: spinand: winbond: Enable high-speed modes on w25n0xjwMiquel Raynal
[ Upstream commit f1a91175faaab02a45d1ceb313a315a5bfeb5416 ] w25n0xjw chips have a high-speed capability hidden in a configuration register. Once enabled, dual/quad SDR reads may be performed at a much higher frequency. Implement the new ->configure_chip() hook for this purpose and configure the SR4 register accordingly. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Stable-dep-of: 4550d33e1811 ("mtd: spinand: winbond: Fix oob_layout for W25N01JW") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-19mtd: spinand: Add a ->configure_chip() hookMiquel Raynal
[ Upstream commit da55809ebb45d1d80b7a388ffef841ed683e1a6f ] There is already a manufacturer hook, which is manufacturer specific but not chip specific. We no longer have access to the actual NAND identity at this stage so let's add a per-chip configuration hook to align the chip configuration (if any) with the core's setting. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Stable-dep-of: 4550d33e1811 ("mtd: spinand: winbond: Fix oob_layout for W25N01JW") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-19mm/vmalloc, mm/kasan: respect gfp mask in kasan_populate_vmalloc()Uladzislau Rezki (Sony)
commit 79357cd06d41d0f5a11b17d7c86176e395d10ef2 upstream. kasan_populate_vmalloc() and its helpers ignore the caller's gfp_mask and always allocate memory using the hardcoded GFP_KERNEL flag. This makes them inconsistent with vmalloc(), which was recently extended to support GFP_NOFS and GFP_NOIO allocations. Page table allocations performed during shadow population also ignore the external gfp_mask. To preserve the intended semantics of GFP_NOFS and GFP_NOIO, wrap the apply_to_page_range() calls into the appropriate memalloc scope. xfs calls vmalloc with GFP_NOFS, so this bug could lead to deadlock. There was a report here https://lkml.kernel.org/r/686ea951.050a0220.385921.0016.GAE@google.com This patch: - Extends kasan_populate_vmalloc() and helpers to take gfp_mask; - Passes gfp_mask down to alloc_pages_bulk() and __get_free_page(); - Enforces GFP_NOFS/NOIO semantics with memalloc_*_save()/restore() around apply_to_page_range(); - Updates vmalloc.c and percpu allocator call sites accordingly. Link: https://lkml.kernel.org/r/20250831121058.92971-1-urezki@gmail.com Fixes: 451769ebb7e7 ("mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc") Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reported-by: syzbot+3470c9ffee63e4abafeb@syzkaller.appspotmail.com Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-19PM: EM: Add function for registering a PD without capacity updateRafael J. Wysocki
commit e0423541477dfb684fbc6e6b5386054bc650f264 upstream. The intel_pstate driver manages CPU capacity changes itself and it does not need an update of the capacity of all CPUs in the system to be carried out after registering a PD. Moreover, in some configurations (for instance, an SMT-capable hybrid x86 system booted with nosmt in the kernel command line) the em_check_capacity_update() call at the end of em_dev_register_perf_domain() always fails and reschedules itself to run once again in 1 s, so effectively it runs in vain every 1 s forever. To address this, introduce a new variant of em_dev_register_perf_domain(), called em_dev_register_pd_no_update(), that does not invoke em_check_capacity_update(), and make intel_pstate use it instead of the original. Fixes: 7b010f9b9061 ("cpufreq: intel_pstate: EAS support for hybrid platforms") Closes: https://lore.kernel.org/linux-pm/40212796-734c-4140-8a85-854f72b8144d@panix.com/ Reported-by: Kenneth R. Crudup <kenny@panix.com> Tested-by: Kenneth R. Crudup <kenny@panix.com> Cc: 6.16+ <stable@vger.kernel.org> # 6.16+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-19compiler-clang.h: define __SANITIZE_*__ macros only when undefinedNathan Chancellor
commit 3fac212fe489aa0dbe8d80a42a7809840ca7b0f9 upstream. Clang 22 recently added support for defining __SANITIZE__ macros similar to GCC [1], which causes warnings (or errors with CONFIG_WERROR=y or W=e) with the existing defines that the kernel creates to emulate this behavior with existing clang versions. In file included from <built-in>:3: In file included from include/linux/compiler_types.h:171: include/linux/compiler-clang.h:37:9: error: '__SANITIZE_THREAD__' macro redefined [-Werror,-Wmacro-redefined] 37 | #define __SANITIZE_THREAD__ | ^ <built-in>:352:9: note: previous definition is here 352 | #define __SANITIZE_THREAD__ 1 | ^ Refactor compiler-clang.h to only define the sanitizer macros when they are undefined and adjust the rest of the code to use these macros for checking if the sanitizers are enabled, clearing up the warnings and allowing the kernel to easily drop these defines when the minimum supported version of LLVM for building the kernel becomes 22.0.0 or newer. Link: https://lkml.kernel.org/r/20250902-clang-update-sanitize-defines-v1-1-cf3702ca3d92@kernel.org Link: https://github.com/llvm/llvm-project/commit/568c23bbd3303518c5056d7f03444dae4fdc8a9c [1] Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Justin Stitt <justinstitt@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Bill Wendling <morbo@google.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-19fs: add a FMODE_ flag to indicate IOCB_HAS_METADATA availabilityChristoph Hellwig
[ Upstream commit d072148a8631f102de60ed5a3a827e85d09d24f0 ] Currently the kernel will happily route io_uring requests with metadata to file operations that don't support it. Add a FMODE_ flag to guard that. Fixes: 4de2ce04c862 ("fs: introduce IOCB_HAS_METADATA for metadata") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250819082517.2038819-2-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-11x86/vmscape: Enable the mitigationPawan Gupta
Commit 556c1ad666ad90c50ec8fccb930dd5046cfbecfb upstream. Enable the previously added mitigation for VMscape. Add the cmdline vmscape={off|ibpb|force} and sysfs reporting. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-09mm: introduce and use {pgd,p4d}_populate_kernel()Harry Yoo
commit f2d2f9598ebb0158a3fe17cda0106d7752e654a2 upstream. Introduce and use {pgd,p4d}_populate_kernel() in core MM code when populating PGD and P4D entries for the kernel address space. These helpers ensure proper synchronization of page tables when updating the kernel portion of top-level page tables. Until now, the kernel has relied on each architecture to handle synchronization of top-level page tables in an ad-hoc manner. For example, see commit 9b861528a801 ("x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping changes"). However, this approach has proven fragile for following reasons: 1) It is easy to forget to perform the necessary page table synchronization when introducing new changes. For instance, commit 4917f55b4ef9 ("mm/sparse-vmemmap: improve memory savings for compound devmaps") overlooked the need to synchronize page tables for the vmemmap area. 2) It is also easy to overlook that the vmemmap and direct mapping areas must not be accessed before explicit page table synchronization. For example, commit 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges")) caused crashes by accessing the vmemmap area before calling sync_global_pgds(). To address this, as suggested by Dave Hansen, introduce _kernel() variants of the page table population helpers, which invoke architecture-specific hooks to properly synchronize page tables. These are introduced in a new header file, include/linux/pgalloc.h, so they can be called from common code. They reuse existing infrastructure for vmalloc and ioremap. Synchronization requirements are determined by ARCH_PAGE_TABLE_SYNC_MASK, and the actual synchronization is performed by arch_sync_kernel_mappings(). This change currently targets only x86_64, so only PGD and P4D level helpers are introduced. Currently, these helpers are no-ops since no architecture sets PGTBL_{PGD,P4D}_MODIFIED in ARCH_PAGE_TABLE_SYNC_MASK. In theory, PUD and PMD level helpers can be added later if needed by other architectures. For now, 32-bit architectures (x86-32 and arm) only handle PGTBL_PMD_MODIFIED, so p*d_populate_kernel() will never affect them unless we introduce a PMD level helper. [harry.yoo@oracle.com: fix KASAN build error due to p*d_populate_kernel()] Link: https://lkml.kernel.org/r/20250822020727.202749-1-harry.yoo@oracle.com Link: https://lkml.kernel.org/r/20250818020206.4517-3-harry.yoo@oracle.com Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges") Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: bibo mao <maobibo@loongson.cn> Cc: Borislav Betkov <bp@alien8.de> Cc: Christoph Lameter (Ampere) <cl@gentwo.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-09mm: move page table sync declarations to linux/pgtable.hHarry Yoo
commit 7cc183f2e67d19b03ee5c13a6664b8c6cc37ff9d upstream. During our internal testing, we started observing intermittent boot failures when the machine uses 4-level paging and has a large amount of persistent memory: BUG: unable to handle page fault for address: ffffe70000000034 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] SMP NOPTI RIP: 0010:__init_single_page+0x9/0x6d Call Trace: <TASK> __init_zone_device_page+0x17/0x5d memmap_init_zone_device+0x154/0x1bb pagemap_range+0x2e0/0x40f memremap_pages+0x10b/0x2f0 devm_memremap_pages+0x1e/0x60 dev_dax_probe+0xce/0x2ec [device_dax] dax_bus_probe+0x6d/0xc9 [... snip ...] </TASK> It turns out that the kernel panics while initializing vmemmap (struct page array) when the vmemmap region spans two PGD entries, because the new PGD entry is only installed in init_mm.pgd, but not in the page tables of other tasks. And looking at __populate_section_memmap(): if (vmemmap_can_optimize(altmap, pgmap)) // does not sync top level page tables r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap); else // sync top level page tables in x86 r = vmemmap_populate(start, end, nid, altmap); In the normal path, vmemmap_populate() in arch/x86/mm/init_64.c synchronizes the top level page table (See commit 9b861528a801 ("x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping changes")) so that all tasks in the system can see the new vmemmap area. However, when vmemmap_can_optimize() returns true, the optimized path skips synchronization of top-level page tables. This is because vmemmap_populate_compound_pages() is implemented in core MM code, which does not handle synchronization of the top-level page tables. Instead, the core MM has historically relied on each architecture to perform this synchronization manually. We're not the first party to encounter a crash caused by not-sync'd top level page tables: earlier this year, Gwan-gyeong Mun attempted to address the issue [1] [2] after hitting a kernel panic when x86 code accessed the vmemmap area before the corresponding top-level entries were synced. At that time, the issue was believed to be triggered only when struct page was enlarged for debugging purposes, and the patch did not get further updates. It turns out that current approach of relying on each arch to handle the page table sync manually is fragile because 1) it's easy to forget to sync the top level page table, and 2) it's also easy to overlook that the kernel should not access the vmemmap and direct mapping areas before the sync. # The solution: Make page table sync more code robust and harder to miss To address this, Dave Hansen suggested [3] [4] introducing {pgd,p4d}_populate_kernel() for updating kernel portion of the page tables and allow each architecture to explicitly perform synchronization when installing top-level entries. With this approach, we no longer need to worry about missing the sync step, reducing the risk of future regressions. The new interface reuses existing ARCH_PAGE_TABLE_SYNC_MASK, PGTBL_P*D_MODIFIED and arch_sync_kernel_mappings() facility used by vmalloc and ioremap to synchronize page tables. pgd_populate_kernel() looks like this: static inline void pgd_populate_kernel(unsigned long addr, pgd_t *pgd, p4d_t *p4d) { pgd_populate(&init_mm, pgd, p4d); if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED) arch_sync_kernel_mappings(addr, addr); } It is worth noting that vmalloc() and apply_to_range() carefully synchronizes page tables by calling p*d_alloc_track() and arch_sync_kernel_mappings(), and thus they are not affected by this patch series. This series was hugely inspired by Dave Hansen's suggestion and hence added Suggested-by: Dave Hansen. Cc stable because lack of this series opens the door to intermittent boot failures. This patch (of 3): Move ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings() to linux/pgtable.h so that they can be used outside of vmalloc and ioremap. Link: https://lkml.kernel.org/r/20250818020206.4517-1-harry.yoo@oracle.com Link: https://lkml.kernel.org/r/20250818020206.4517-2-harry.yoo@oracle.com Link: https://lore.kernel.org/linux-mm/20250220064105.808339-1-gwan-gyeong.mun@intel.com [1] Link: https://lore.kernel.org/linux-mm/20250311114420.240341-1-gwan-gyeong.mun@intel.com [2] Link: https://lore.kernel.org/linux-mm/d1da214c-53d3-45ac-a8b6-51821c5416e4@intel.com [3] Link: https://lore.kernel.org/linux-mm/4d800744-7b88-41aa-9979-b245e8bf794b@intel.com [4] Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges") Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: bibo mao <maobibo@loongson.cn> Cc: Borislav Betkov <bp@alien8.de> Cc: Christoph Lameter (Ampere) <cl@gentwo.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-09LoongArch: Add cpuhotplug hooks to fix high cpu usage of vCPU threadsXianglai Li
[ Upstream commit 8ef7f3132e4005a103b382e71abea7ad01fbeb86 ] When the CPU is offline, the timer of LoongArch is not correctly closed. This is harmless for real machines, but resulting in an excessively high cpu usage rate of the offline vCPU thread in the virtual machines. To correctly close the timer, we have made the following modifications: Register the cpu hotplug event (CPUHP_AP_LOONGARCH_ARCH_TIMER_STARTING) for LoongArch. This event's hooks will be called to close the timer when the CPU is offline. Clear the timer interrupt when the timer is turned off. Since before the timer is turned off, there may be a timer interrupt that has already been in the pending state due to the interruption of the disabled, which also affects the halt state of the offline vCPU. Signed-off-by: Xianglai Li <lixianglai@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-04firmware: qcom: scm: take struct device as argument in SHM bridge enableBartosz Golaszewski
commit dc3f4e75c54c19bad9a70419afae00ce6baf3ebf upstream. qcom_scm_shm_bridge_enable() is used early in the SCM initialization routine. It makes an SCM call and so expects the internal __scm pointer in the SCM driver to be assigned. For this reason the tzmem memory pool is allocated *after* this pointer is assigned. However, this can lead to a crash if another consumer of the SCM API makes a call using the memory pool between the assignment of the __scm pointer and the initialization of the tzmem memory pool. As qcom_scm_shm_bridge_enable() is a special case, not meant to be called by ordinary users, pull it into the local SCM header. Make it take struct device as argument. This is the device that will be used to make the SCM call as opposed to the global __scm pointer. This will allow us to move the tzmem initialization *before* the __scm assignment in the core SCM driver. Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Link: https://lore.kernel.org/r/20250630-qcom-scm-race-v2-2-fa3851c98611@linaro.org Signed-off-by: Bjorn Andersson <andersson@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-04firmware: qcom: scm: remove unused arguments from SHM bridge routinesBartosz Golaszewski
commit 23972da96e1eee7f10c8ef641d56202ab9af8ba7 upstream. qcom_scm_shm_bridge_create() and qcom_scm_shm_bridge_delete() take struct device as argument but don't use it. Remove it from these functions' prototypes. Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Link: https://lore.kernel.org/r/20250630-qcom-scm-race-v2-1-fa3851c98611@linaro.org Signed-off-by: Bjorn Andersson <andersson@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-04octeontx2: Set appropriate PF, VF masks and shifts based on siliconSubbaraya Sundeep
[ Upstream commit 25d51ebf0f54f9c2424f28bb29125cf24f120df0 ] Number of RVU PFs on CN20K silicon have increased to 96 from maximum of 32 that were supported on earlier silicons. Every RVU PF and VF is identified by HW using a 16bit PF_FUNC value. Due to the change in Max number of PFs in CN20K, the bit encoding of this PF_FUNC has changed. This patch handles the change by using helper functions(using silicon check) to use PF,VF masks and shifts to support both new silicon CN20K, OcteonTx series. These helper functions are used in different modules. Also moved the NIX AF register offset macros to other files which will be posted in coming patches. Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: Sai Krishna <saikrishnag@marvell.com> Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com> Link: https://patch.msgid.link/1749639716-13868-2-git-send-email-sbhatta@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: d280233fc866 ("Octeontx2-af: Fix NIX X2P calibration failures") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-04atm: atmtcp: Prevent arbitrary write in atmtcp_recv_control().Kuniyuki Iwashima
[ Upstream commit ec79003c5f9d2c7f9576fc69b8dbda80305cbe3a ] syzbot reported the splat below. [0] When atmtcp_v_open() or atmtcp_v_close() is called via connect() or close(), atmtcp_send_control() is called to send an in-kernel special message. The message has ATMTCP_HDR_MAGIC in atmtcp_control.hdr.length. Also, a pointer of struct atm_vcc is set to atmtcp_control.vcc. The notable thing is struct atmtcp_control is uAPI but has a space for an in-kernel pointer. struct atmtcp_control { struct atmtcp_hdr hdr; /* must be first */ ... atm_kptr_t vcc; /* both directions */ ... } __ATM_API_ALIGN; typedef struct { unsigned char _[8]; } __ATM_API_ALIGN atm_kptr_t; The special message is processed in atmtcp_recv_control() called from atmtcp_c_send(). atmtcp_c_send() is vcc->dev->ops->send() and called from 2 paths: 1. .ndo_start_xmit() (vcc->send() == atm_send_aal0()) 2. vcc_sendmsg() The problem is sendmsg() does not validate the message length and userspace can abuse atmtcp_recv_control() to overwrite any kptr by atmtcp_control. Let's add a new ->pre_send() hook to validate messages from sendmsg(). [0]: Oops: general protection fault, probably for non-canonical address 0xdffffc00200000ab: 0000 [#1] SMP KASAN PTI KASAN: probably user-memory-access in range [0x0000000100000558-0x000000010000055f] CPU: 0 UID: 0 PID: 5865 Comm: syz-executor331 Not tainted 6.17.0-rc1-syzkaller-00215-gbab3ce404553 #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/12/2025 RIP: 0010:atmtcp_recv_control drivers/atm/atmtcp.c:93 [inline] RIP: 0010:atmtcp_c_send+0x1da/0x950 drivers/atm/atmtcp.c:297 Code: 4d 8d 75 1a 4c 89 f0 48 c1 e8 03 42 0f b6 04 20 84 c0 0f 85 15 06 00 00 41 0f b7 1e 4d 8d b7 60 05 00 00 4c 89 f0 48 c1 e8 03 <42> 0f b6 04 20 84 c0 0f 85 13 06 00 00 66 41 89 1e 4d 8d 75 1c 4c RSP: 0018:ffffc90003f5f810 EFLAGS: 00010203 RAX: 00000000200000ab RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff88802a510000 RSI: 00000000ffffffff RDI: ffff888030a6068c RBP: ffff88802699fb40 R08: ffff888030a606eb R09: 1ffff1100614c0dd R10: dffffc0000000000 R11: ffffffff8718fc40 R12: dffffc0000000000 R13: ffff888030a60680 R14: 000000010000055f R15: 00000000ffffffff FS: 00007f8d7e9236c0(0000) GS:ffff888125c1c000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000045ad50 CR3: 0000000075bde000 CR4: 00000000003526f0 Call Trace: <TASK> vcc_sendmsg+0xa10/0xc60 net/atm/common.c:645 sock_sendmsg_nosec net/socket.c:714 [inline] __sock_sendmsg+0x219/0x270 net/socket.c:729 ____sys_sendmsg+0x505/0x830 net/socket.c:2614 ___sys_sendmsg+0x21f/0x2a0 net/socket.c:2668 __sys_sendmsg net/socket.c:2700 [inline] __do_sys_sendmsg net/socket.c:2705 [inline] __se_sys_sendmsg net/socket.c:2703 [inline] __x64_sys_sendmsg+0x19b/0x260 net/socket.c:2703 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f8d7e96a4a9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 51 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f8d7e923198 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f8d7e9f4308 RCX: 00007f8d7e96a4a9 RDX: 0000000000000000 RSI: 0000200000000240 RDI: 0000000000000005 RBP: 00007f8d7e9f4300 R08: 65732f636f72702f R09: 65732f636f72702f R10: 65732f636f72702f R11: 0000000000000246 R12: 00007f8d7e9c10ac R13: 00007f8d7e9231a0 R14: 0000200000000200 R15: 0000200000000250 </TASK> Modules linked in: Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+1741b56d54536f4ec349@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68a6767c.050a0220.3d78fd.0011.GAE@google.com/ Tested-by: syzbot+1741b56d54536f4ec349@syzkaller.appspotmail.com Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250821021901.2814721-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-04of: reserved_mem: Restructure call site for dma_contiguous_early_fixup()Oreoluwa Babatunde
[ Upstream commit 2c223f7239f376a90d71903ec474ba887cf21d94 ] Restructure the call site for dma_contiguous_early_fixup() to where the reserved_mem nodes are being parsed from the DT so that dma_mmu_remap[] is populated before dma_contiguous_remap() is called. Fixes: 8a6e02d0c00e ("of: reserved_mem: Restructure how the reserved memory regions are processed") Signed-off-by: Oreoluwa Babatunde <oreoluwa.babatunde@oss.qualcomm.com> Tested-by: William Zhang <william.zhang@broadcom.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20250806172421.2748302-1-oreoluwa.babatunde@oss.qualcomm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-04platform/x86: int3472: add hpd pin supportDongcheng Yan
commit a032fe30cf09b6723ab61a05aee057311b00f9e1 upstream. Typically HDMI to MIPI CSI-2 bridges have a pin to signal image data is being received. On the host side this is wired to a GPIO for polling or interrupts. This includes the Lontium HDMI to MIPI CSI-2 bridges lt6911uxe and lt6911uxc. The GPIO "hpd" is used already by other HDMI to CSI-2 bridges, use it here as well. Signed-off-by: Dongcheng Yan <dongcheng.yan@intel.com> Reviewed-by: Sakari Ailus <sakari.ailus@linux.intel.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Fixes: 20244cbafbd6 ("media: i2c: change lt6911uxe irq_gpio name to "hpd"") Cc: stable@vger.kernel.org Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-09-04Revert "virtio: reject shm region if length is zero"Igor Torrente
[ Upstream commit ced17ee32a9988b8a260628e7c31a100d7dc082e ] The commit 206cc44588f7 ("virtio: reject shm region if length is zero") breaks the Virtio-gpu `host_visible` feature. As you can see in the snippet below, host_visible_region is zero because of the `kzalloc`. It's using the `vm_get_shm_region` (drivers/virtio/virtio_mmio.c:536) to read the `addr` and `len` from qemu/crosvm. ``` drivers/gpu/drm/virtio/virtgpu_kms.c 132 vgdev = drmm_kzalloc(dev, sizeof(struct virtio_gpu_device), GFP_KERNEL); [...] 177 if (virtio_get_shm_region(vgdev->vdev, &vgdev->host_visible_region, 178 VIRTIO_GPU_SHM_ID_HOST_VISIBLE)) { ``` Now it always fails. To fix, revert the offending commit. Fixes: 206cc44588f7 ("virtio: reject shm region if length is zero") Signed-off-by: Igor Torrente <igor.torrente@collabora.com> Message-Id: <20250807124145.81816-1-igor.torrente@collabora.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28net/mlx5: Add IFC bits and enums for buf_ownershipOren Sidi
[ Upstream commit 6f09ee0b583cad4f2b6a82842c26235bee3d5c2e ] Extend structure layouts and defines buf_ownership. buf_ownership indicates whether the buffer is managed by SW or FW. Signed-off-by: Oren Sidi <osidi@nvidia.com> Reviewed-by: Alex Lazar <alazar@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1752734895-257735-3-git-send-email-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Stable-dep-of: 451d2849ea66 ("net/mlx5e: Query FW for buffer ownership") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28block: avoid cpu_hotplug_lock depedency on freeze_lockNilay Shroff
[ Upstream commit 370ac285f23aecae40600851fb4a1a9e75e50973 ] A recent lockdep[1] splat observed while running blktest block/005 reveals a potential deadlock caused by the cpu_hotplug_lock dependency on ->freeze_lock. This dependency was introduced by commit 033b667a823e ("block: blk-rq-qos: guard rq-qos helpers by static key"). That change added a static key to avoid fetching q->rq_qos when neither blk-wbt nor blk-iolatency is configured. The static key dynamically patches kernel text to a NOP when disabled, eliminating overhead of fetching q->rq_qos in the I/O hot path. However, enabling a static key at runtime requires acquiring both cpu_hotplug_lock and jump_label_mutex. When this happens after the queue has already been frozen (i.e., while holding ->freeze_lock), it creates a locking dependency from cpu_hotplug_lock to ->freeze_lock, which leads to a potential deadlock reported by lockdep [1]. To resolve this, replace the static key mechanism with q->queue_flags: QUEUE_FLAG_QOS_ENABLED. This flag is evaluated in the fast path before accessing q->rq_qos. If the flag is set, we proceed to fetch q->rq_qos; otherwise, the access is skipped. Since q->queue_flags is commonly accessed in IO hotpath and resides in the first cacheline of struct request_queue, checking it imposes minimal overhead while eliminating the deadlock risk. This change avoids the lockdep splat without introducing performance regressions. [1] https://lore.kernel.org/linux-block/4fdm37so3o4xricdgfosgmohn63aa7wj3ua4e5vpihoamwg3ui@fq42f5q5t5ic/ Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/linux-block/4fdm37so3o4xricdgfosgmohn63aa7wj3ua4e5vpihoamwg3ui@fq42f5q5t5ic/ Fixes: 033b667a823e ("block: blk-rq-qos: guard rq-qos helpers by static key") Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250814082612.500845-4-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28iosys-map: Fix undefined behavior in iosys_map_clear()Nitin Gote
[ Upstream commit 5634c8cb298a7146b4e38873473e280b50e27a2c ] The current iosys_map_clear() implementation reads the potentially uninitialized 'is_iomem' boolean field to decide which union member to clear. This causes undefined behavior when called on uninitialized structures, as 'is_iomem' may contain garbage values like 0xFF. UBSAN detects this as: UBSAN: invalid-load in include/linux/iosys-map.h:267 load of value 255 is not a valid value for type '_Bool' Fix by unconditionally clearing the entire structure with memset(), eliminating the need to read uninitialized data and ensuring all fields are set to known good values. Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14639 Fixes: 01fd30da0474 ("dma-buf: Add struct dma-buf-map for storing struct dma_buf.vaddr_ptr") Signed-off-by: Nitin Gote <nitin.r.gote@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://lore.kernel.org/r/20250718105051.2709487-1-nitin.r.gote@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-28kcov, usb: Don't disable interrupts in kcov_remote_start_usb_softirq()Sebastian Andrzej Siewior
commit 9528d32873b38281ae105f2f5799e79ae9d086c2 upstream. kcov_remote_start_usb_softirq() the begin of urb's completion callback. HCDs marked HCD_BH will invoke this function from the softirq and in_serving_softirq() will detect this properly. Root-HUB (RH) requests will not be delayed to softirq but complete immediately in IRQ context. This will confuse kcov because in_serving_softirq() will report true if the softirq is served after the hardirq and if the softirq got interrupted by the hardirq in which currently runs. This was addressed by simply disabling interrupts in kcov_remote_start_usb_softirq() which avoided the interruption by the RH while a regular completion callback was invoked. This not only changes the behaviour while kconv is enabled but also breaks PREEMPT_RT because now sleeping locks can no longer be acquired. Revert the previous fix. Address the issue by invoking kcov_remote_start_usb() only if the context is just "serving softirqs" which is identified by checking in_serving_softirq() and in_hardirq() must be false. Fixes: f85d39dd7ed89 ("kcov, usb: disable interrupts in kcov_remote_start_usb_softirq") Cc: stable <stable@kernel.org> Reported-by: Yunseong Kim <ysk@kzalloc.com> Closes: https://lore.kernel.org/all/20250725201400.1078395-2-ysk@kzalloc.com/ Tested-by: Yunseong Kim <ysk@kzalloc.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20250811082745.ycJqBXMs@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28compiler: remove __ADDRESSABLE_ASM{_STR,}() againJan Beulich
commit 8ea815399c3fcce1889bd951fec25b5b9a3979c1 upstream. __ADDRESSABLE_ASM_STR() is where the necessary stringification happens. As long as "sym" doesn't contain any odd characters, no quoting is required for its use with .quad / .long. In fact the quotation gets in the way with gas 2.25; it's only from 2.26 onwards that quoted symbols are half-way properly supported. However, assembly being different from C anyway, drop __ADDRESSABLE_ASM_STR() and its helper macro altogether. A simple .global directive will suffice to get the symbol "declared", i.e. into the symbol table. While there also stop open-coding STATIC_CALL_TRAMP() and STATIC_CALL_KEY(). Fixes: 0ef8047b737d ("x86/static-call: provide a way to do very early static-call updates") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <609d2c74-de13-4fae-ab1a-1ec44afb948d@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28NFS: Fix a race when updating an existing writeTrond Myklebust
commit 76d2e3890fb169168c73f2e4f8375c7cc24a765e upstream. After nfs_lock_and_join_requests() tests for whether the request is still attached to the mapping, nothing prevents a call to nfs_inode_remove_request() from succeeding until we actually lock the page group. The reason is that whoever called nfs_inode_remove_request() doesn't necessarily have a lock on the page group head. So in order to avoid races, let's take the page group lock earlier in nfs_lock_and_join_requests(), and hold it across the removal of the request in nfs_inode_remove_request(). Reported-by: Jeff Layton <jlayton@kernel.org> Tested-by: Joe Quanaim <jdq@meta.com> Tested-by: Andrew Steffen <aksteffen@meta.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Fixes: bd37d6fce184 ("NFSv4: Convert nfs_lock_and_join_requests() to use nfs_page_find_head_request()") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28iov_iter: iterate_folioq: fix handling of offset >= folio sizeDominique Martinet
commit 808471ddb0fa785559c3e7aee59be20a13b46ef5 upstream. It's apparently possible to get an iov advanced all the way up to the end of the current page we're looking at, e.g. (gdb) p *iter $24 = {iter_type = 4 '\004', nofault = false, data_source = false, iov_offset = 4096, {__ubuf_iovec = { iov_base = 0xffff88800f5bc000, iov_len = 655}, {{__iov = 0xffff88800f5bc000, kvec = 0xffff88800f5bc000, bvec = 0xffff88800f5bc000, folioq = 0xffff88800f5bc000, xarray = 0xffff88800f5bc000, ubuf = 0xffff88800f5bc000}, count = 655}}, {nr_segs = 2, folioq_slot = 2 '\002', xarray_start = 2}} Where iov_offset is 4k with 4k-sized folios This should have been fine because we're only in the 2nd slot and there's another one after this, but iterate_folioq should not try to map a folio that skips the whole size, and more importantly part here does not end up zero (because 'PAGE_SIZE - skip % PAGE_SIZE' ends up PAGE_SIZE and not zero..), so skip forward to the "advance to next folio" code Link: https://lkml.kernel.org/r/20250813-iot_iter_folio-v3-0-a0ffad2b665a@codewreck.org Link: https://lkml.kernel.org/r/20250813-iot_iter_folio-v3-1-a0ffad2b665a@codewreck.org Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Fixes: db0aa2e9566f ("mm: Define struct folio_queue and ITER_FOLIOQ to handle a sequence of folios") Reported-by: Maximilian Bosch <maximilian@mbosch.me> Reported-by: Ryan Lahfa <ryan@lahfa.xyz> Reported-by: Christian Theune <ct@flyingcircus.io> Reported-by: Arnout Engelen <arnout@bzzt.net> Link: https://lkml.kernel.org/r/D4LHHUNLG79Y.12PI0X6BEHRHW@mbosch.me/ Acked-by: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> [6.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-28netfs: Fix unbuffered write error handlingDavid Howells
commit a3de58b12ce074ec05b8741fa28d62ccb1070468 upstream. If all the subrequests in an unbuffered write stream fail, the subrequest collector doesn't update the stream->transferred value and it retains its initial LONG_MAX value. Unfortunately, if all active streams fail, then we take the smallest value of { LONG_MAX, LONG_MAX, ... } as the value to set in wreq->transferred - which is then returned from ->write_iter(). LONG_MAX was chosen as the initial value so that all the streams can be quickly assessed by taking the smallest value of all stream->transferred - but this only works if we've set any of them. Fix this by adding a flag to indicate whether the value in stream->transferred is valid and checking that when we integrate the values. stream->transferred can then be initialised to zero. This was found by running the generic/750 xfstest against cifs with cache=none. It splices data to the target file. Once (if) it has used up all the available scratch space, the writes start failing with ENOSPC. This causes ->write_iter() to fail. However, it was returning wreq->transferred, i.e. LONG_MAX, rather than an error (because it thought the amount transferred was non-zero) and iter_file_splice_write() would then try to clean up that amount of pipe bufferage - leading to an oops when it overran. The kernel log showed: CIFS: VFS: Send error in write = -28 followed by: BUG: kernel NULL pointer dereference, address: 0000000000000008 with: RIP: 0010:iter_file_splice_write+0x3a4/0x520 do_splice+0x197/0x4e0 or: RIP: 0010:pipe_buf_release (include/linux/pipe_fs_i.h:282) iter_file_splice_write (fs/splice.c:755) Also put a warning check into splice to announce if ->write_iter() returned that it had written more than it was asked to. Fixes: 288ace2f57c9 ("netfs: New writeback implementation") Reported-by: Xiaoli Feng <fengxiaoli0714@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220445 Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/915443.1755207950@warthog.procyon.org.uk cc: Paulo Alcantara <pc@manguebit.org> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <sprasad@microsoft.com> cc: netfs@lists.linux.dev cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20ACPI: Return -ENODEV from acpi_parse_spcr() when SPCR support is disabledLi Chen
commit b9f58d3572a8e1ef707b941eae58ec4014b9269d upstream. If CONFIG_ACPI_SPCR_TABLE is disabled, acpi_parse_spcr() currently returns 0, which may incorrectly suggest that SPCR parsing was successful. This patch changes the behavior to return -ENODEV to clearly indicate that SPCR support is not available. This prepares the codebase for future changes that depend on acpi_parse_spcr() failure detection, such as suppressing misleading console messages. Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Acked-by: Hanjun Guo <guohanjun@huawei.com> Link: https://lore.kernel.org/r/20250620131309.126555-2-me@linux.beauty Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20ata: libata-sata: Add link_power_management_supported sysfs attributeDamien Le Moal
commit 0060beec0bfa647c4b510df188b1c4673a197839 upstream. A port link power management (LPM) policy can be controlled using the link_power_management_policy sysfs host attribute. However, this attribute exists also for hosts that do not support LPM and in such case, attempting to change the LPM policy for the host (port) will fail with -EOPNOTSUPP. Introduce the new sysfs link_power_management_supported host attribute to indicate to the user if a the port and the devices connected to the port for the host support LPM, which implies that the link_power_management_policy attribute can be used. Since checking that a port and its devices support LPM is common between the new ata_scsi_lpm_supported_show() function and the existing ata_scsi_lpm_store() function, the new helper ata_scsi_lpm_supported() is introduced. Fixes: 413e800cadbf ("ata: libata-sata: Disallow changing LPM state if not supported") Reported-by: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202507251014.a5becc3b-lkp@intel.com Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20block: Introduce bio_needs_zone_write_plugging()Damien Le Moal
commit f70291411ba20d50008db90a6f0731efac27872c upstream. In preparation for fixing device mapper zone write handling, introduce the inline helper function bio_needs_zone_write_plugging() to test if a BIO requires handling through zone write plugging using the function blk_zone_plug_bio(). This function returns true for any write (op_is_write(bio) == true) operation directed at a zoned block device using zone write plugging, that is, a block device with a disk that has a zone write plug hash table. This helper allows simplifying the check on entry to blk_zone_plug_bio() and used in to protect calls to it for blk-mq devices and DM devices. Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250625093327.548866-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20lib/sbitmap: convert shallow_depth from one word to the whole sbitmapYu Kuai
[ Upstream commit 42e6c6ce03fd3e41e39a0f93f9b1a1d9fa664338 ] Currently elevators will record internal 'async_depth' to throttle asynchronous requests, and they both calculate shallow_dpeth based on sb->shift, with the respect that sb->shift is the available tags in one word. However, sb->shift is not the availbale tags in the last word, see __map_depth: if (index == sb->map_nr - 1) return sb->depth - (index << sb->shift); For consequence, if the last word is used, more tags can be get than expected, for example, assume nr_requests=256 and there are four words, in the worst case if user set nr_requests=32, then the first word is the last word, and still use bits per word, which is 64, to calculate async_depth is wrong. One the ohter hand, due to cgroup qos, bfq can allow only one request to be allocated, and set shallow_dpeth=1 will still allow the number of words request to be allocated. Fix this problems by using shallow_depth to the whole sbitmap instead of per word, also change kyber, mq-deadline and bfq to follow this, a new helper __map_depth_with_shallow() is introduced to calculate available bits in each word. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250807032413.1469456-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20HID: rate-limit hid_warn to prevent log floodingLi Chen
[ Upstream commit 4051ead99888f101be92c7ce90d2de09aac6fd1c ] Syzkaller can create many uhid devices that trigger repeated warnings like: "hid-generic xxxx: unknown main item tag 0x0" These messages can flood the system log, especially if a crash occurs (e.g., with a slow UART console, leading to soft lockups). To mitigate this, convert `hid_warn()` to use `dev_warn_ratelimited()`. This helps reduce log noise and improves system stability under fuzzing or faulty device scenarios. Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Signed-off-by: Jiri Kosina <jkosina@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20vsock/virtio: Resize receive buffers so that each SKB fits in a 4K pageWill Deacon
[ Upstream commit 03a92f036a04fed2b00d69f5f46f1a486e70dc5c ] When allocating receive buffers for the vsock virtio RX virtqueue, an SKB is allocated with a 4140 data payload (the 44-byte packet header + VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE). Even when factoring in the SKB overhead, the resulting 8KiB allocation thanks to the rounding in kmalloc_reserve() is wasteful (~3700 unusable bytes) and results in a higher-order page allocation on systems with 4KiB pages just for the sake of a few hundred bytes of packet data. Limit the vsock virtio RX buffers to 4KiB per SKB, resulting in much better memory utilisation and removing the need to allocate higher-order pages entirely. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20250717090116.11987-5-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20net: vlan: Replace BUG() with WARN_ON_ONCE() in vlan_dev_* stubsGal Pressman
[ Upstream commit 60a8b1a5d0824afda869f18dc0ecfe72f8dfda42 ] When CONFIG_VLAN_8021Q=n, a set of stub helpers are used, three of these helpers use BUG() unconditionally. This code should not be reached, as callers of these functions should always check for is_vlan_dev() first, but the usage of BUG() is not recommended, replace it with WARN_ON() instead. Reviewed-by: Alex Lazar <alazar@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250616132626.1749331-3-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20net: vlan: Make is_vlan_dev() a stub when VLAN is not configuredGal Pressman
[ Upstream commit 2de1ba0887e5d3bf02d7c212f380039b34e10aa3 ] Add a stub implementation of is_vlan_dev() that returns false when VLAN support is not compiled in (CONFIG_VLAN_8021Q=n). This allows us to compile-out VLAN-dependent dead code when it is not needed. This also resolves the following compilation error when: * CONFIG_VLAN_8021Q=n * CONFIG_OBJTOOL=y * CONFIG_OBJTOOL_WERROR=y drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.o: error: objtool: parse_mirred.isra.0+0x370: mlx5e_tc_act_vlan_add_push_action() missing __noreturn in .c/.h or NORETURN() in noreturns.h The error occurs because objtool cannot determine that unreachable BUG() (which doesn't return) calls in VLAN code paths are actually dead code when VLAN support is disabled. Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250616132626.1749331-2-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20netmem: fix skb_frag_address_safe with unreadable skbsMina Almasry
[ Upstream commit 4672aec56d2e8edabcb74c3e2320301d106a377e ] skb_frag_address_safe() needs a check that the skb_frag_page exists check similar to skb_frag_address(). Cc: ap420073@gmail.com Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250619175239.3039329-1-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20lib: packing: Include necessary headersNathan Lynch
[ Upstream commit 8bd0af3154b2206ce19f8b1410339f7a2a56d0c3 ] packing.h uses ARRAY_SIZE(), BUILD_BUG_ON_MSG(), min(), max(), and sizeof_field() without including the headers where they are defined, potentially causing build failures. Fix this in packing.h and sort the result. Signed-off-by: Nathan Lynch <nathan.lynch@amd.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://patch.msgid.link/20250624-packing-includes-v1-1-c23c81fab508@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20net: usb: cdc-ncm: check for filtering capabilityOliver Neukum
[ Upstream commit 61c3e8940f2d8b5bfeaeec4bedc2f3e7d873abb3 ] If the decice does not support filtering, filtering must not be used and all packets delivered for the upper layers to sort. Signed-off-by: Oliver Neukum <oneukum@suse.com> Link: https://patch.msgid.link/20250717120649.2090929-1-oneukum@suse.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-08-20mm/memory-tier: fix abstract distance calculation overflowLi Zhijian
commit cce35103135c7ffc7bebc32ebfc74fe1f2c3cb5d upstream. In mt_perf_to_adistance(), the calculation of abstract distance (adist) involves multiplying several int values including MEMTIER_ADISTANCE_DRAM. *adist = MEMTIER_ADISTANCE_DRAM * (perf->read_latency + perf->write_latency) / (default_dram_perf.read_latency + default_dram_perf.write_latency) * (default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) / (perf->read_bandwidth + perf->write_bandwidth); Since these values can be large, the multiplication may exceed the maximum value of an int (INT_MAX) and overflow (Our platform did), leading to an incorrect adist. User-visible impact: The memory tiering subsystem will misinterpret slow memory (like CXL) as faster than DRAM, causing inappropriate demotion of pages from CXL (slow memory) to DRAM (fast memory). For example, we will see the following demotion chains from the dmesg, where Node0,1 are DRAM, and Node2,3 are CXL node: Demotion targets for Node 0: null Demotion targets for Node 1: null Demotion targets for Node 2: preferred: 0-1, fallback: 0-1 Demotion targets for Node 3: preferred: 0-1, fallback: 0-1 Change MEMTIER_ADISTANCE_DRAM to be a long constant by writing it with the 'L' suffix. This prevents the overflow because the multiplication will then be done in the long type which has a larger range. Link: https://lkml.kernel.org/r/20250611023439.2845785-1-lizhijian@fujitsu.com Link: https://lkml.kernel.org/r/20250610062751.2365436-1-lizhijian@fujitsu.com Fixes: 3718c02dbd4c ("acpi, hmat: calculate abstract distance with HMAT") Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com> Acked-by: Balbir Singh <balbirs@nvidia.com> Reviewed-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20block: Make REQ_OP_ZONE_FINISH a write operationDamien Le Moal
commit 3f66ccbaaef3a0c5bd844eab04e3207b4061c546 upstream. REQ_OP_ZONE_FINISH is defined as "12", which makes op_is_write(REQ_OP_ZONE_FINISH) return false, despite the fact that a zone finish operation is an operation that modifies a zone (transition it to full) and so should be considered as a write operation (albeit one that does not transfer any data to the device). Fix this by redefining REQ_OP_ZONE_FINISH to be an odd number (13), and redefine REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL using sequential odd numbers from that new value. Fixes: 6c1b1da58f8c ("block: add zone open, close and finish operations") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250625093327.548866-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20PCI/ACPI: Fix runtime PM ref imbalance on Hot-Plug Capable portsLukas Wunner
commit 6cff20ce3b92ffbf2fc5eb9e5a030b3672aa414a upstream. pci_bridge_d3_possible() is called from both pcie_portdrv_probe() and pcie_portdrv_remove() to determine whether runtime power management shall be enabled (on probe) or disabled (on remove) on a PCIe port. The underlying assumption is that pci_bridge_d3_possible() always returns the same value, else a runtime PM reference imbalance would occur. That assumption is not given if the PCIe port is inaccessible on remove due to hot-unplug: pci_bridge_d3_possible() calls pciehp_is_native(), which accesses Config Space to determine whether the port is Hot-Plug Capable. An inaccessible port returns "all ones", which is converted to "all zeroes" by pcie_capability_read_dword(). Hence the port no longer seems Hot-Plug Capable on remove even though it was on probe. The resulting runtime PM ref imbalance causes warning messages such as: pcieport 0000:02:04.0: Runtime PM usage count underflow! Avoid the Config Space access (and thus the runtime PM ref imbalance) by caching the Hot-Plug Capable bit in struct pci_dev. The struct already contains an "is_hotplug_bridge" flag, which however is not only set on Hot-Plug Capable PCIe ports, but also Conventional PCI Hot-Plug bridges and ACPI slots. The flag identifies bridges which are allocated additional MMIO and bus number resources to allow for hierarchy expansion. The kernel is somewhat sloppily using "is_hotplug_bridge" in a number of places to identify Hot-Plug Capable PCIe ports, even though the flag encompasses other devices. Subsequent commits replace these occurrences with the new flag to clearly delineate Hot-Plug Capable PCIe ports from other kinds of hotplug bridges. Document the existing "is_hotplug_bridge" and the new "is_pciehp" flag and document the (non-obvious) requirement that pci_bridge_d3_possible() always returns the same value across the entire lifetime of a bridge, including its hot-removal. Fixes: 5352a44a561d ("PCI: pciehp: Make pciehp_is_native() stricter") Reported-by: Laurent Bigonville <bigon@bigon.be> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220216 Reported-by: Mario Limonciello <mario.limonciello@amd.com> Closes: https://lore.kernel.org/r/20250609020223.269407-3-superm1@kernel.org/ Link: https://lore.kernel.org/all/20250620025535.3425049-3-superm1@kernel.org/T/#u Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Cc: stable@vger.kernel.org # v4.18+ Link: https://patch.msgid.link/fe5dcc3b2e62ee1df7905d746bde161eb1b3291c.1752390101.git.lukas@wunner.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-20PCI: Extend isolated function probing to LoongArchHuacai Chen
commit a02fd05661d73a8507dd70dd820e9b984490c545 upstream. Like s390 and the jailhouse hypervisor, LoongArch's PCI architecture allows passing isolated PCI functions to a guest OS instance. So it is possible that there is a multi-function device without function 0 for the host or guest. Allow probing such functions by adding a IS_ENABLED(CONFIG_LOONGARCH) case in the hypervisor_isolated_pci_functions() helper. This is similar to commit 189c6c33ff42 ("PCI: Extend isolated function probing to s390"). Signed-off-by: Huacai Chen <chenhuacai@loongson.cn> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20250624062927.4037734-1-chenhuacai@loongson.cn Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-15mm: fix a UAF when vma->mm is freed after vma->vm_refcnt got droppedSuren Baghdasaryan
commit 9bbffee67ffd16360179327b57f3b1245579ef08 upstream. By inducing delays in the right places, Jann Horn created a reproducer for a hard to hit UAF issue that became possible after VMAs were allowed to be recycled by adding SLAB_TYPESAFE_BY_RCU to their cache. Race description is borrowed from Jann's discovery report: lock_vma_under_rcu() looks up a VMA locklessly with mas_walk() under rcu_read_lock(). At that point, the VMA may be concurrently freed, and it can be recycled by another process. vma_start_read() then increments the vma->vm_refcnt (if it is in an acceptable range), and if this succeeds, vma_start_read() can return a recycled VMA. In this scenario where the VMA has been recycled, lock_vma_under_rcu() will then detect the mismatching ->vm_mm pointer and drop the VMA through vma_end_read(), which calls vma_refcount_put(). vma_refcount_put() drops the refcount and then calls rcuwait_wake_up() using a copy of vma->vm_mm. This is wrong: It implicitly assumes that the caller is keeping the VMA's mm alive, but in this scenario the caller has no relation to the VMA's mm, so the rcuwait_wake_up() can cause UAF. The diagram depicting the race: T1 T2 T3 == == == lock_vma_under_rcu mas_walk <VMA gets removed from mm> mmap <the same VMA is reallocated> vma_start_read __refcount_inc_not_zero_limited_acquire munmap __vma_enter_locked refcount_add_not_zero vma_end_read vma_refcount_put __refcount_dec_and_test rcuwait_wait_event <finish operation> rcuwait_wake_up [UAF] Note that rcuwait_wait_event() in T3 does not block because refcount was already dropped by T1. At this point T3 can exit and free the mm causing UAF in T1. To avoid this we move vma->vm_mm verification into vma_start_read() and grab vma->vm_mm to stabilize it before vma_refcount_put() operation. [surenb@google.com: v3] Link: https://lkml.kernel.org/r/20250729145709.2731370-1-surenb@google.com Link: https://lkml.kernel.org/r/20250728175355.2282375-1-surenb@google.com Fixes: 3104138517fc ("mm: make vma cache SLAB_TYPESAFE_BY_RCU") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: Jann Horn <jannh@google.com> Closes: https://lore.kernel.org/all/CAG48ez0-deFbVH=E3jbkWx=X3uVbd8nWeo6kbJPQ0KoUD+m2tA@mail.gmail.com/ Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-08-15net: usbnet: Avoid potential RCU stall on LINK_CHANGE eventJohn Ernberg
commit 0d9cfc9b8cb17dbc29a98792d36ec39a1cf1395f upstream. The Gemalto Cinterion PLS83-W modem (cdc_ether) is emitting confusing link up and down events when the WWAN interface is activated on the modem-side. Interrupt URBs will in consecutive polls grab: * Link Connected * Link Disconnected * Link Connected Where the last Connected is then a stable link state. When the system is under load this may cause the unlink_urbs() work in __handle_link_change() to not complete before the next usbnet_link_change() call turns the carrier on again, allowing rx_submit() to queue new SKBs. In that event the URB queue is filled faster than it can drain, ending up in a RCU stall: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 0-.... } 33108 jiffies s: 201 root: 0x1/. rcu: blocking rcu_node structures (internal RCU debug): Sending NMI from CPU 1 to CPUs 0: NMI backtrace for cpu 0 Call trace: arch_local_irq_enable+0x4/0x8 local_bh_enable+0x18/0x20 __netdev_alloc_skb+0x18c/0x1cc rx_submit+0x68/0x1f8 [usbnet] rx_alloc_submit+0x4c/0x74 [usbnet] usbnet_bh+0x1d8/0x218 [usbnet] usbnet_bh_tasklet+0x10/0x18 [usbnet] tasklet_action_common+0xa8/0x110 tasklet_action+0x2c/0x34 handle_softirqs+0x2cc/0x3a0 __do_softirq+0x10/0x18 ____do_softirq+0xc/0x14 call_on_irq_stack+0x24/0x34 do_softirq_own_stack+0x18/0x20 __irq_exit_rcu+0xa8/0xb8 irq_exit_rcu+0xc/0x30 el1_interrupt+0x34/0x48 el1h_64_irq_handler+0x14/0x1c el1h_64_irq+0x68/0x6c _raw_spin_unlock_irqrestore+0x38/0x48 xhci_urb_dequeue+0x1ac/0x45c [xhci_hcd] unlink1+0xd4/0xdc [usbcore] usb_hcd_unlink_urb+0x70/0xb0 [usbcore] usb_unlink_urb+0x24/0x44 [usbcore] unlink_urbs.constprop.0.isra.0+0x64/0xa8 [usbnet] __handle_link_change+0x34/0x70 [usbnet] usbnet_deferred_kevent+0x1c0/0x320 [usbnet] process_scheduled_works+0x2d0/0x48c worker_thread+0x150/0x1dc kthread+0xd8/0xe8 ret_from_fork+0x10/0x20 Get around the problem by delaying the carrier on to the scheduled work. This needs a new flag to keep track of the necessary action. The carrier ok check cannot be removed as it remains required for the LINK_RESET event flow. Fixes: 4b49f58fff00 ("usbnet: handle link change") Cc: stable@vger.kernel.org Signed-off-by: John Ernberg <john.ernberg@actia.se> Link: https://patch.msgid.link/20250723102526.1305339-1-john.ernberg@actia.se Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>