path: root/include/linux
Age  Commit message  Author
2025-03-22  mm/hugetlb: wait for hugetlb folios to be freed  (Ge Yang)
[ Upstream commit 67bab13307c83fb742c2556b06cdc39dbad27f07 ] Since the introduction of commit c77c0a8ac4c52 ("mm/hugetlb: defer freeing of huge pages if in non-task context"), which supports deferring the freeing of hugetlb pages, the allocation of contiguous memory through cma_alloc() may fail probabilistically. In the CMA allocation process, if it is found that the CMA area is occupied by in-use hugetlb folios, these in-use hugetlb folios need to be migrated to another location. When there are no available hugetlb folios in the free hugetlb pool during the migration of in-use hugetlb folios, new folios are allocated from the buddy system. A temporary state is set on the newly allocated folio. Upon completion of the hugetlb folio migration, the temporary state is transferred from the new folios to the old folios. Normally, when the old folios with the temporary state are freed, they are directly released back to the buddy system. However, due to the deferred freeing of hugetlb pages, the PageBuddy() check fails, ultimately leading to the failure of cma_alloc(). Here is a simplified call trace illustrating the process:

cma_alloc()
    ->__alloc_contig_migrate_range() // Migrate in-use hugetlb folios
      ->unmap_and_move_huge_page()
        ->folio_putback_hugetlb() // Free old folios
    ->test_pages_isolated()
      ->__test_page_isolated_in_pageblock()
        ->PageBuddy(page) // Check if the page is in buddy

To resolve this issue, we have implemented a function named wait_for_freed_hugetlb_folios(). This function ensures that the hugetlb folios are properly released back to the buddy system after their migration is completed. By invoking wait_for_freed_hugetlb_folios() before calling PageBuddy(), we ensure that PageBuddy() will succeed. Link: https://lkml.kernel.org/r/1739936804-18199-1-git-send-email-yangge1116@126.com Fixes: c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task context") Signed-off-by: Ge Yang <yangge1116@126.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
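For reference, a minimal sketch of the new helper and its call site, reconstructed from the description above; the deferred-free worker and llist names (free_hpage_work, hpage_freelist) are assumptions, not quotes of the upstream patch:

    /* Sketch: drain the deferred hugetlb free worker so freed folios
     * actually reach the buddy allocator before PageBuddy() is tested. */
    void wait_for_freed_hugetlb_folios(void)
    {
        if (llist_empty(&hpage_freelist))
            return;

        flush_work(&free_hpage_work);
    }

    /* mm/page_isolation.c, sketched call site: */
    int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
                            int isol_flags)
    {
        wait_for_freed_hugetlb_folios();
        /* ... existing __test_page_isolated_in_pageblock() walk ... */
    }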
2025-03-22  block: change blk_mq_add_to_batch() third argument type to bool  (Shin'ichiro Kawasaki)
[ Upstream commit 9bce6b5f8987678b9c6c1fe433af6b5fe41feadc ] Commit 1f47ed294a2b ("block: cleanup and fix batch completion adding conditions") modified the evaluation criteria for the third argument, 'ioerror', in the blk_mq_add_to_batch() function. Initially, the function had checked if 'ioerror' equals zero. Following the commit, it started checking for negative error values, with the presumption that such values, for instance -EIO, would be passed in. However, blk_mq_add_to_batch() callers do not pass negative error values. Instead, they pass status codes defined in various ways:

- NVMe PCI and Apple drivers pass NVMe status code
- virtio_blk driver passes the virtblk request header status byte
- null_blk driver passes blk_status_t

These codes are either zero or positive, therefore the revised check fails to function as intended. Specifically, with the NVMe PCI driver, this modification led to the failure of the blktests test case nvme/039. In this test scenario, errors are artificially injected to the NVMe driver, resulting in positive NVMe status codes passed to blk_mq_add_to_batch(), which unexpectedly processes the failed I/O in a batch. Hence the failure. To correct the ioerror check within blk_mq_add_to_batch(), make all callers uniformly pass the argument as a boolean. Modify the callers to check their specific status codes and pass the boolean value 'is_error'. Also describe the arguments of blk_mq_add_to_batch as kerneldoc. Fixes: 1f47ed294a2b ("block: cleanup and fix batch completion adding conditions") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20250311104359.1767728-3-shinichiro.kawasaki@wdc.com [axboe: fold in documentation update] Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
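A sketch of what a converted caller looks like, using the NVMe PCI driver as an example (the exact placement in the driver's completion path is assumed, not quoted from the patch):

    /* Sketch: the driver evaluates its own status code and hands
     * blk_mq_add_to_batch() an unambiguous boolean. */
    bool is_error = nvme_req(req)->status != NVME_SC_SUCCESS;

    if (!blk_mq_add_to_batch(req, iob, is_error, nvme_pci_complete_batch))
        nvme_pci_complete_rq(req);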
2025-03-22  fuse: don't truncate cached, mutated symlink  (Miklos Szeredi)
[ Upstream commit b4c173dfbb6c78568578ff18f9e8822d7bd0e31b ] Fuse allows the value of a symlink to change and this property is exploited by some filesystems (e.g. CVMFS). It has been observed, that sometimes after changing the symlink contents, the value is truncated to the old size. This is caused by fuse_getattr() racing with fuse_reverse_inval_inode(). fuse_reverse_inval_inode() updates the fuse_inode's attr_version, which results in fuse_change_attributes() exiting before updating the cached attributes. This is okay, as the cached attributes remain invalid and the next call to fuse_change_attributes() will likely update the inode with the correct values. The reason this causes problems is that cached symlinks will be returned through page_get_link(), which truncates the symlink to inode->i_size. This is correct for filesystems that don't mutate symlinks, but in this case it causes bad behavior. The solution is to just remove this truncation. This can cause a regression in a filesystem that relies on supplying a symlink larger than the file size, but this is unlikely. If that happens we'd need to make this behavior conditional. Reported-by: Laura Promberger <laura.promberger@cern.ch> Tested-by: Sam Lewis <samclewis@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/r/20250220100258.793363-1-mszeredi@redhat.com Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-22  PCI: pci_ids: add INTEL_HDA_PTL_H  (Pierre-Louis Bossart)
[ Upstream commit a1f7b7ff0e10ae574d388131596390157222f986 ] Add Intel PTL-H audio Device ID. Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Link: https://patch.msgid.link/20250210081730.22916-2-peter.ujfalusi@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-22  kbuild: keep symbols for symbol_get() even with CONFIG_TRIM_UNUSED_KSYMS  (Masahiro Yamada)
[ Upstream commit 4c56eb33e603c3b9eb4bd24efbfdd0283c1c37e4 ] Linus observed that the symbol_request(utf8_data_table) call fails when CONFIG_UNICODE=y and CONFIG_TRIM_UNUSED_KSYMS=y. symbol_get() relies on the symbol data being present in the ksymtab for symbol lookups. However, EXPORT_SYMBOL_GPL(utf8_data_table) is dropped due to CONFIG_TRIM_UNUSED_KSYMS, as no module references it in this case. Probably, this has been broken since commit dbacb0ef670d ("kconfig option for TRIM_UNUSED_KSYMS"). This commit addresses the issue by leveraging modpost. Symbol names passed to symbol_get() are recorded in the special .no_trim_symbol section, which is then parsed by modpost to forcibly keep such symbols. The .no_trim_symbol section is discarded by the linker scripts, so there is no impact on the size of the final vmlinux or modules. This commit cannot resolve the issue for direct calls to __symbol_get() because the symbol name is not known at compile-time. Although symbol_get() may eventually be deprecated, this workaround should be good enough meanwhile. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
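A minimal sketch of the mechanism described above; the section name comes from the commit text, while the exact macro body is an assumption:

    /* Sketch: record every symbol_get() target in .no_trim_symbol so
     * modpost can forcibly keep that export even when
     * CONFIG_TRIM_UNUSED_KSYMS would otherwise drop it. The section is
     * discarded by the linker scripts, so vmlinux size is unaffected. */
    #define symbol_get(x)                                              \
    ({                                                                 \
        static const char __notrim[] __used                            \
            __section(".no_trim_symbol") = __stringify(x);             \
        (typeof(&(x)))__symbol_get(__stringify(x));                    \
    })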
2025-03-13  mm: hugetlb: Add huge page size param to huge_ptep_get_and_clear()  (Ryan Roberts)
commit 02410ac72ac3707936c07ede66e94360d0d65319 upstream. In order to fix a bug, arm64 needs to be told the size of the huge page for which the huge_pte is being cleared in huge_ptep_get_and_clear(). Provide for this by adding an `unsigned long sz` parameter to the function. This follows the same pattern as huge_pte_clear() and set_huge_pte_at(). This commit makes the required interface modifications to the core mm as well as all arches that implement this function (arm64, loongarch, mips, parisc, powerpc, riscv, s390, sparc). The actual arm64 bug will be fixed in a separate commit. Cc: stable@vger.kernel.org Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit") Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> # riscv Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> # s390 Link: https://lore.kernel.org/r/20250226120656.2400136-2-ryan.roberts@arm.com Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
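The interface change, sketched; the generic fallback below assumes that arches without their own implementation simply ignore the new size argument, mirroring how huge_pte_clear() and set_huge_pte_at() are structured:

    /* New interface: callers now tell the arch how large the mapping is. */
    pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
                                  pte_t *ptep, unsigned long sz);

    /* Generic fallback, sketched: sz is unused where the arch doesn't
     * need it. */
    #ifndef __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
    static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
                                                unsigned long addr,
                                                pte_t *ptep, unsigned long sz)
    {
        return ptep_get_and_clear(mm, addr, ptep);
    }
    #endif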
2025-03-13  net: ethtool: plumb PHY stats to PHY drivers  (Jakub Kicinski)
[ Upstream commit b7a2c1fe6b55364e61b4b54b991eb43a47bb1104 ] Introduce support for standardized PHY statistics reporting in ethtool by extending the PHYLIB framework. Add the functions phy_ethtool_get_phy_stats() and phy_ethtool_get_link_ext_stats() to provide a consistent interface for retrieving PHY-level and link-specific statistics. These functions are used within the ethtool implementation to avoid direct access to the phy_device structure outside of the PHYLIB framework. A new structure, ethtool_phy_stats, is introduced to standardize PHY statistics such as packet counts, byte counts, and error counters. Drivers are updated to include callbacks for retrieving PHY and link-specific statistics, ensuring values are explicitly set only for supported fields, initialized with ETHTOOL_STAT_NOT_SET to avoid ambiguity. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Stable-dep-of: 637399bf7e77 ("net: ethtool: netlink: Allow NULL nlattrs when getting a phy_device") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-13  cred: Fix RCU warnings in override/revert_creds  (Herbert Xu)
[ Upstream commit e04918dc594669068f5d59d567d08db531167188 ] Fix RCU warnings in override_creds and revert_creds by turning the RCU pointer into a normal pointer using rcu_replace_pointer. These warnings were previously private to the cred code, but due to the move into the header file they are now polluting unrelated subsystems. Fixes: 49dffdfde462 ("cred: Add a light version of override/revert_creds()") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Link: https://lore.kernel.org/r/Z8QGQGW0IaSklKG7@gondor.apana.org.au Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-13  cred: return old creds from revert_creds_light()  (Christian Brauner)
[ Upstream commit 95c54bc81791c210b131f2b1013942487e74896f ] So we can easily convert revert_creds() callers over to drop the reference count explicitly. Link: https://lore.kernel.org/r/20241125-work-cred-v2-2-68b9d38bb5b2@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org> Stable-dep-of: e04918dc5946 ("cred: Fix RCU warnings in override/revert_creds") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-13  nvme-tcp: add basic support for the C2HTermReq PDU  (Maurizio Lombardi)
[ Upstream commit 84e009042d0f3dfe91bec60bcd208ee3f866cbcd ] Previously, the NVMe/TCP host driver did not handle the C2HTermReq PDU, instead printing "unsupported pdu type (3)" when received. This patch adds support for processing the C2HTermReq PDU, allowing the driver to print the Fatal Error Status field. Example of output:

nvme nvme4: Received C2HTermReq (FES = Invalid PDU Header Field)

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Stable-dep-of: ad95bab0cd28 ("nvme-tcp: fix potential memory corruption in nvme_tcp_recv_pdu()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-13  NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback  (Mike Snitzer)
commit ce6d9c1c2b5cc785016faa11b48b6cd317eb367e upstream. Add PF_KCOMPACTD flag and current_is_kcompactd() helper to check for it so nfs_release_folio() can skip calling nfs_wb_folio() from kcompactd. Otherwise NFS can deadlock waiting for kcompactd-induced writeback which recurses back to NFS (which triggers writeback to NFSD via NFS loopback mount on the same host, NFSD blocks waiting for XFS's call to __filemap_get_folio):

[ 6070.550357] INFO: task kcompactd0:58 blocked for more than 4435 seconds.
{---
[58] "kcompactd0"
[<0>] folio_wait_bit+0xe8/0x200
[<0>] folio_wait_writeback+0x2b/0x80
[<0>] nfs_wb_folio+0x80/0x1b0 [nfs]
[<0>] nfs_release_folio+0x68/0x130 [nfs]
[<0>] split_huge_page_to_list_to_order+0x362/0x840
[<0>] migrate_pages_batch+0x43d/0xb90
[<0>] migrate_pages_sync+0x9a/0x240
[<0>] migrate_pages+0x93c/0x9f0
[<0>] compact_zone+0x8e2/0x1030
[<0>] compact_node+0xdb/0x120
[<0>] kcompactd+0x121/0x2e0
[<0>] kthread+0xcf/0x100
[<0>] ret_from_fork+0x31/0x40
[<0>] ret_from_fork_asm+0x1a/0x30
---}

[akpm@linux-foundation.org: fix build] Link: https://lkml.kernel.org/r/20250225022002.26141-1-snitzer@kernel.org Fixes: 96780ca55e3c ("NFS: fix up nfs_release_folio() to try to release the page") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
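A sketch of the helper and the guard it enables; the flag value below is an assumption (the real patch picks a free PF_* bit):

    #define PF_KCOMPACTD    0x00010000  /* assumed free PF_* bit */

    static inline bool current_is_kcompactd(void)
    {
        return current->flags & PF_KCOMPACTD;
    }

    /* fs/nfs/file.c, sketched: bail out instead of waiting on writeback
     * when the caller is kcompactd, breaking the recursion above. */
    if (current_is_kcompactd())
        return false;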
2025-03-07  rcuref: Plug slowpath race in rcuref_put()  (Thomas Gleixner)
commit b9a49520679e98700d3d89689cc91c08a1c88c1d upstream. Kernel test robot reported an "imbalanced put" in the rcuref_put() slow path, which turned out to be a false positive. Consider the following race:

ref = 0 (via rcuref_init(ref, 1))

T1                                      T2
rcuref_put(ref)
-> atomic_add_negative_release(-1, ref) # ref -> 0xffffffff
-> rcuref_put_slowpath(ref)
                                        rcuref_get(ref)
                                        -> atomic_add_negative_relaxed(1, &ref->refcnt)
                                        -> return true; # ref -> 0
                                        rcuref_put(ref)
                                        -> atomic_add_negative_release(-1, ref) # ref -> 0xffffffff
                                        -> rcuref_put_slowpath()
                                        -> cnt = atomic_read(&ref->refcnt); # cnt -> 0xffffffff / RCUREF_NOREF
                                        -> atomic_try_cmpxchg_release(&ref->refcnt, &cnt, RCUREF_DEAD)) # ref -> 0xe0000000 / RCUREF_DEAD
                                        -> return true
-> cnt = atomic_read(&ref->refcnt); # cnt -> 0xe0000000 / RCUREF_DEAD
-> if (cnt > RCUREF_RELEASED) # 0xe0000000 > 0xc0000000
-> WARN_ONCE(cnt >= RCUREF_RELEASED, "rcuref - imbalanced put()")

The problem is the additional read in the slow path (after it decremented to RCUREF_NOREF) which can happen after the counter has been marked RCUREF_DEAD. Prevent this by reusing the return value of the decrement. Now every "final" put uses RCUREF_NOREF in the slow path and attempts the final cmpxchg() to RCUREF_DEAD. [ bigeasy: Add changelog ] Fixes: ee1ee6db07795 ("atomics: Provide rcuref - scalable reference counting") Reported-by: kernel test robot <oliver.sang@intel.com> Debugged-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: stable@vger.kernel.org Closes: https://lore.kernel.org/oe-lkp/202412311453.9d7636a2-lkp@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
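A sketch of the fixed fast path, reconstructed from the description; the exact atomic primitive used upstream may differ:

    /* Sketch: the slow path reuses the value observed by the decrement
     * itself, instead of re-reading a counter that may have been marked
     * RCUREF_DEAD by a concurrent final put in the meantime. */
    static inline __must_check bool __rcuref_put(rcuref_t *ref)
    {
        int cnt;

        cnt = atomic_sub_return_release(1, &ref->refcnt);
        if (likely(cnt >= 0))
            return false;   /* not the final put */

        return rcuref_put_slowpath(ref, cnt);   /* cnt, not a fresh read */
    }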
2025-03-07  block: Remove zone write plugs when handling native zone append writes  (Damien Le Moal)
commit a6aa36e957a1bfb5341986dec32d013d23228fe1 upstream. For devices that natively support zone append operations, REQ_OP_ZONE_APPEND BIOs are not processed through zone write plugging and are immediately issued to the zoned device. This means that there is no write pointer offset tracking done for these operations and that a zone write plug is not necessary. However, when receiving a zone append BIO, we may already have a zone write plug for the target zone if that zone was previously partially written using regular write operations. In such case, since the write pointer offset of the zone write plug is not incremented by the amount of sectors appended to the zone, 2 issues arise:

1) We risk leaving the plug in the disk hash table if the zone is fully written using zone append or regular write operations, because the write pointer offset will never reach the "zone full" state.
2) Regular write operations that are issued after zone append operations will always be failed by blk_zone_wplug_prepare_bio() as the write pointer alignment check will fail, even if the user correctly accounted for the zone append operations and issued the regular writes with a correct sector.

Avoid these issues by immediately removing the zone write plug of zones that are the target of zone append operations when blk_zone_plug_bio() is called. The new function blk_zone_wplug_handle_native_zone_append() implements this for devices that natively support zone append. The removal of the zone write plug using disk_remove_zone_wplug() requires aborting all plugged regular write using disk_zone_wplug_abort() as otherwise the plugged write BIOs would never be executed (with the plug removed, the completion path will never see again the zone write plug as disk_get_zone_wplug() will return NULL). Rate-limited warnings are added to blk_zone_wplug_handle_native_zone_append() and to disk_zone_wplug_abort() to signal this. Since blk_zone_wplug_handle_native_zone_append() is called in the hot path for operations that will not be plugged, disk_get_zone_wplug() is optimized under the assumption that a user issuing zone append operations is not at the same time issuing regular writes and that there are no hashed zone write plugs. The struct gendisk atomic counter nr_zone_wplugs is added to check this, with this counter incremented in disk_insert_zone_wplug() and decremented in disk_remove_zone_wplug(). To be consistent with this fix, we do not need to fill the zone write plug hash table with zone write plugs for zones that are partially written for a device that supports native zone append operations. So modify blk_revalidate_seq_zone() to return early to avoid allocating and inserting a zone write plug for partially written sequential zones if the device natively supports zone append. Reported-by: Jorgen Hansen <Jorgen.Hansen@wdc.com> Fixes: 9b1ce7f0c6f8 ("block: Implement zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Jorgen Hansen <Jorgen.Hansen@wdc.com> Link: https://lore.kernel.org/r/20250214041434.82564-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07  objtool: Fix C jump table annotations for Clang  (Ard Biesheuvel)
[ Upstream commit 73cfc53cc3b6380eccf013049574485f64cb83ca ] A C jump table (such as the one used by the BPF interpreter) is a const global array of absolute code addresses, and this means that the actual values in the table may not be known until the kernel is booted (e.g., when using KASLR or when the kernel VA space is sized dynamically). When using PIE codegen, the compiler will default to placing such const global objects in .data.rel.ro (which is annotated as writable), rather than .rodata (which is annotated as read-only). As C jump tables are explicitly emitted into .rodata, this used to result in warnings for LoongArch builds (which uses PIE codegen for the entire kernel) like Warning: setting incorrect section attributes for .rodata..c_jump_table due to the fact that the explicitly specified .rodata section inherited the read-write annotation that the compiler uses for such objects when using PIE codegen. This warning was suppressed by explicitly adding the read-only annotation to the __attribute__((section(""))) string, by commit c5b1184decc8 ("compiler.h: specify correct attribute for .rodata..c_jump_table") Unfortunately, this hack does not work on Clang's integrated assembler, which happily interprets the appended section type and permission specifiers as part of the section name, which therefore no longer matches the hard-coded pattern '.rodata..c_jump_table' that objtool expects, causing it to emit a warning kernel/bpf/core.o: warning: objtool: ___bpf_prog_run+0x20: sibling call from callable instruction with modified stack frame Work around this, by emitting C jump tables into .data.rel.ro instead, which is treated as .rodata by the linker script for all builds, not just PIE based ones. Fixes: c5b1184decc8 ("compiler.h: specify correct attribute for .rodata..c_jump_table") Tested-by: Tiezhu Yang <yangtiezhu@loongson.cn> # on LoongArch Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250221135704.431269-6-ardb+git@google.com Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
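The section change, sketched from the description above; the annotation macro lives in compiler.h and its exact body upstream may differ in detail:

    /* Before (Clang's integrated assembler treats everything after the
     * comma as part of the section *name*, breaking objtool's
     * '.rodata..c_jump_table' pattern match): */
    #define __annotate_jump_table \
        __section(".rodata..c_jump_table,\"a\",@progbits #")

    /* After: a plain section name; .data.rel.ro is mapped read-only by
     * the linker script for all builds, PIE or not. */
    #define __annotate_jump_table __section(".data.rel.ro.c_jump_table")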
2025-03-07  objtool: Remove annotate_{,un}reachable()  (Peter Zijlstra)
[ Upstream commit 06e24745985c8dd0da18337503afcf2f2fdbdff1 ] There are no users of annotate_reachable() left. And the annotate_unreachable() usage in unreachable() is plain wrong; it will hide dangerous fall-through code-gen. Remove both. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/20241128094312.235637588@infradead.org Stable-dep-of: 73cfc53cc3b6 ("objtool: Fix C jump table annotations for Clang") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-07  unreachable: Unify  (Peter Zijlstra)
[ Upstream commit c837de3810982cd41cd70e5170da1931439f025c ] Since barrier_before_unreachable() is empty for !GCC it is trivial to unify the two definitions. Less is more. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/20241128094311.924381359@infradead.org Stable-dep-of: 73cfc53cc3b6 ("objtool: Fix C jump table annotations for Clang") Signed-off-by: Sasha Levin <sashal@kernel.org>
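The unified definition this implies, sketched; since barrier_before_unreachable() compiles to nothing for non-GCC compilers, one definition serves everyone:

    #define unreachable() do {              \
        barrier_before_unreachable();       \
        __builtin_unreachable();            \
    } while (0)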
2025-03-07  tcp: devmem: don't write truncated dmabuf CMSGs to userspace  (Stanislav Fomichev)
[ Upstream commit 18912c520674ec4d920fe3826e7e4fefeecdf5ae ] Currently, we report -ETOOSMALL (err) only on the first iteration (!sent). When we get put_cmsg error after a bunch of successful put_cmsg calls, we don't signal the error at all. This might be confusing on the userspace side which will see truncated CMSGs but no MSG_CTRUNC signal. Consider the following case:

- sizeof(struct cmsghdr) = 16
- sizeof(struct dmabuf_cmsg) = 24
- total cmsg size (CMSG_LEN) = 40 (16+24)

When calling recvmsg with msg_controllen=60, the userspace will receive two(!) dmabuf_cmsg(s), the first one will be a valid one and the second one will be silently truncated. There is no easy way to discover the truncation besides doing something like "cm->cmsg_len != CMSG_LEN(sizeof(dmabuf_cmsg))". Introduce new put_devmem_cmsg wrapper that reports an error instead of doing the truncation. Mina suggests that it's the intended way this API should work. Note that we might now report MSG_CTRUNC when the users (incorrectly) call us with msg_control == NULL. Fixes: 8f0b3cc9a4c1 ("tcp: RX path for devmem TCP") Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250224174401.3582695-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
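A sketch of the wrapper; the name comes from the commit text, while the body below is an assumption about how the truncation is refused:

    /* Sketch: refuse to emit a truncated dmabuf cmsg; either the whole
     * control message fits, or the caller sees an error + MSG_CTRUNC. */
    static int put_devmem_cmsg(struct msghdr *msg, int level, int type,
                               int len, void *data)
    {
        if (msg->msg_controllen < CMSG_LEN(len)) {
            msg->msg_flags |= MSG_CTRUNC;
            return -ETOOSMALL;
        }

        return put_cmsg(msg, level, type, len, data);
    }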
2025-03-07  SUNRPC: Prevent looping due to rpc_signal_task() races  (Trond Myklebust)
[ Upstream commit 5bbd6e863b15a85221e49b9bdb2d5d8f0bb91f3d ] If rpc_signal_task() is called while a task is in an rpc_call_done() callback function, and the latter calls rpc_restart_call(), the task can end up looping due to the RPC_TASK_SIGNALLED flag being set without the tk_rpc_status being set. Removing the redundant mechanism for signalling the task fixes the looping behaviour. Reported-by: Li Lingfeng <lilingfeng3@huawei.com> Fixes: 39494194f93b ("SUNRPC: Fix races with rpc_killall_tasks()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-27  sched: Compact RSEQ concurrency IDs with reduced threads and affinity  (Mathieu Desnoyers)
[ Upstream commit 02d954c0fdf91845169cdacc7405b120f90afe01 ] When a process reduces its number of threads or clears bits in its CPU affinity mask, the mm_cid allocation should eventually converge towards smaller values. However, the change introduced by: commit 7e019dcc470f ("sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads") adds a per-mm/CPU recent_cid which is never unset unless a thread migrates. This is a tradeoff between:

A) Preserving cache locality after a transition from many threads to few threads, or after reducing the hamming weight of the allowed CPU mask.
B) Making the mm_cid upper bounds wrt nr threads and allowed CPU mask easy to document and understand.
C) Allowing applications to eventually react to mm_cid compaction after reduction of the nr threads or allowed CPU mask, making the tracking of mm_cid compaction easier by shrinking it back towards 0 or not.
D) Making sure applications that periodically reduce and then increase again the nr threads or allowed CPU mask still benefit from good cache locality with mm_cid.

Introduce the following changes:

* After shrinking the number of threads or reducing the number of allowed CPUs, reduce the value of max_nr_cid so expansion of CID allocation will preserve cache locality if the number of threads or allowed CPUs increase again.
* Only re-use a recent_cid if it is within the max_nr_cid upper bound, else find the first available CID.

Fixes: 7e019dcc470f ("sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lkml.kernel.org/r/20250210153253.460471-2-gmonaco@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-27  net: Add non-RCU dev_getbyhwaddr() helper  (Breno Leitao)
[ Upstream commit 4b5a28b38c4a0106c64416a1b2042405166b26ce ] Add dedicated helper for finding devices by hardware address when holding rtnl_lock, similar to existing dev_getbyhwaddr_rcu(). This prevents PROVE_LOCKING warnings when rtnl_lock is held but RCU read lock is not. Extract common address comparison logic into dev_addr_cmp(). The context about this change could be found in the following discussion: Link: https://lore.kernel.org/all/20250206-scarlet-ermine-of-improvement-1fcac5@leitao/ Cc: kuniyu@amazon.com Cc: ushankar@purestorage.com Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250218-arm_fix_selftest-v5-1-d3d6892db9e1@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 4eae0ee0f1e6 ("arp: switch to dev_getbyhwaddr() in arp_req_set_public()") Signed-off-by: Sasha Levin <sashal@kernel.org>
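A sketch of the new helper pair, following the description above (close to, but not a quote of, the upstream patch):

    /* Shared comparison logic, factored out of the RCU variant. */
    static bool dev_addr_cmp(struct net_device *dev, unsigned short type,
                             const char *ha)
    {
        return dev->type == type && !memcmp(dev->dev_addr, ha, dev->addr_len);
    }

    /* rtnl_lock() flavour: a plain netdev walk, no RCU read lock needed. */
    struct net_device *dev_getbyhwaddr(struct net *net, unsigned short type,
                                       const char *ha)
    {
        struct net_device *dev;

        ASSERT_RTNL();
        for_each_netdev(net, dev)
            if (dev_addr_cmp(dev, type, ha))
                return dev;

        return NULL;
    }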
2025-02-27  net: pse-pd: Use power limit at driver side instead of current limit  (Kory Maincent)
[ Upstream commit e0a5e2bba38aa61a900934b45d6e846e0a6d7524 ] The regulator framework uses current limits, but the PSE standard and known PSE controllers rely on power limits. Instead of converting current to power within each driver, perform the conversion in the PSE core. This avoids redundancy in driver implementation and aligns better with the standard, simplifying driver development. Remove at the same time the _pse_ethtool_get_status() function which is not needed anymore. Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Stable-dep-of: f6093c5ec74d ("net: pse-pd: pd692x0: Fix power limit retrieval") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-27  PCI: Remove devres from pci_intx()  (Philipp Stanner)
[ Upstream commit dfa2f4d5f9e5d757700cefa8ee480099889f1c69 ] pci_intx() is a hybrid function which can sometimes be managed through devres. This hybrid nature is undesirable. Since all users of pci_intx() have by now been ported either to always-managed pcim_intx() or never-managed pci_intx_unmanaged(), the devres functionality can be removed from pci_intx(). Consequently, pci_intx_unmanaged() is now redundant, because pci_intx() itself is now unmanaged. Remove the devres functionality from pci_intx(). Have all users of pci_intx_unmanaged() call pci_intx(). Remove pci_intx_unmanaged(). Link: https://lore.kernel.org/r/20241209130632.132074-13-pstanner@redhat.com Signed-off-by: Philipp Stanner <pstanner@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Stable-dep-of: d555ed45a5a1 ("PCI: Restore original INTX_DISABLE bit by pcim_intx()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-27  PCI: Export pci_intx_unmanaged() and pcim_intx()  (Philipp Stanner)
[ Upstream commit f546e8033d8f3e45d49622f04ca2fde650b80f6d ] pci_intx() is a hybrid function which sometimes performs devres operations, depending on whether pcim_enable_device() has been used to enable the pci_dev. This sometimes-managed nature of the function is problematic. Notably, it causes the function to allocate under some circumstances which makes it unusable from interrupt context. Export pcim_intx() (which is always managed) and rename __pcim_intx() (which is never managed) to pci_intx_unmanaged() and export it as well. Then all callers of pci_intx() can be ported to the version they need, depending whether they use pci_enable_device() or pcim_enable_device(). Link: https://lore.kernel.org/r/20241209130632.132074-3-pstanner@redhat.com Signed-off-by: Philipp Stanner <pstanner@redhat.com> [bhelgaas: commit log] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Stable-dep-of: d555ed45a5a1 ("PCI: Restore original INTX_DISABLE bit by pcim_intx()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-21  compiler.h: Move C string helpers into C-only kernel section  (Kees Cook)
[ Upstream commit cb7380de9e4cbc9a24216b722ec50e092ae83036 ] The C kernel helpers for evaluating C Strings were positioned where they were visible to assembly inclusion, which was not intended. Move them into the kernel and C-only area of the header so future changes won't confuse the assembler. Fixes: d7a516c6eeae ("compiler.h: Fix undefined BUILD_BUG_ON_ZERO()") Fixes: 559048d156ff ("string: Check for "nonstring" attribute on strscpy() arguments") Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-21  net: add dev_net_rcu() helper  (Eric Dumazet)
[ Upstream commit 482ad2a4ace2740ca0ff1cbc8f3c7f862f3ab507 ] dev->nd_net can change, readers should either use rcu_read_lock() or RTNL. We currently use a generic helper, dev_net() with no debugging support. We probably have many hidden bugs. Add dev_net_rcu() helper for callers using rcu_read_lock() protection. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250205155120.1676781-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 71b8471c93fa ("ipv4: use RCU protection in ipv4_default_advmss()") Signed-off-by: Sasha Levin <sashal@kernel.org>
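A sketch of the helper; read_pnet_rcu() is assumed to be the existing RCU accessor for possible-net pointers:

    /* RCU-safe counterpart of dev_net(); callers must hold
     * rcu_read_lock(). */
    static inline struct net *dev_net_rcu(const struct net_device *dev)
    {
        return read_pnet_rcu(&dev->nd_net);
    }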
2025-02-21  cgroup: fix race between fork and cgroup.kill  (Shakeel Butt)
commit b69bb476dee99d564d65d418e9a20acca6f32c3f upstream. Tejun reported the following race between fork() and cgroup.kill at [1]. Tejun: I was looking at cgroup.kill implementation and wondering whether there could be a race window. So, __cgroup_kill() does the following:

k1. Set CGRP_KILL.
k2. Iterate tasks and deliver SIGKILL.
k3. Clear CGRP_KILL.

The copy_process() does the following:

c1. Copy a bunch of stuff.
c2. Grab siglock.
c3. Check fatal_signal_pending().
c4. Commit to forking.
c5. Release siglock.
c6. Call cgroup_post_fork() which puts the task on the css_set and tests CGRP_KILL.

The intention seems to be that either a forking task gets SIGKILL and terminates on c3 or it sees CGRP_KILL on c6 and kills the child. However, I don't see what guarantees that k3 can't happen before c6. ie. After a forking task passes c5, k2 can take place and then before the forking task reaches c6, k3 can happen. Then, nobody would send SIGKILL to the child. What am I missing? This is indeed a race. One way to fix this race is by taking cgroup_threadgroup_rwsem in write mode in __cgroup_kill() as the fork() side takes cgroup_threadgroup_rwsem in read mode from cgroup_can_fork() to cgroup_post_fork(). However that would be heavy handed as this adds one more potential stall scenario for cgroup.kill which is usually called under extreme situation like memory pressure. To fix this race, let's maintain a sequence number per cgroup which gets incremented on __cgroup_kill() call. On the fork() side, the cgroup_can_fork() will cache the sequence number locally and recheck it against the cgroup's sequence number at cgroup_post_fork() site. If the sequence numbers mismatch, it means __cgroup_kill() has been called and we should send SIGKILL to the newly created task. Reported-by: Tejun Heo <tj@kernel.org> Closes: https://lore.kernel.org/all/Z5QHE2Qn-QZ6M-KW@slm.duckdns.org/ [1] Fixes: 661ee6280931 ("cgroup: introduce cgroup.kill") Cc: stable@vger.kernel.org # v5.14+ Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
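The sequence-number handshake, sketched (field and argument names are assumptions, and locking is elided):

    /* kernel/cgroup/cgroup.c, sketched */

    /* __cgroup_kill(): bump the kill generation before signalling. */
    cgrp->kill_seq++;

    /* cgroup_can_fork(): snapshot the generation the child forks under. */
    kargs->kill_seq = dst_cgrp->kill_seq;

    /* cgroup_post_fork(): a mismatch means cgroup.kill ran mid-fork, so
     * the new child must be killed too. */
    if (unlikely(cgrp->kill_seq != kargs->kill_seq))
        do_send_sig_info(SIGKILL, SEND_SIG_NOINFO, child, PIDTYPE_TGID);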
2025-02-21  efi: Avoid cold plugged memory for placing the kernel  (Ard Biesheuvel)
commit ba69e0750b0362870294adab09339a0c39c3beaf upstream. UEFI 2.11 introduced EFI_MEMORY_HOT_PLUGGABLE to annotate system memory regions that are 'cold plugged' at boot, i.e., hot pluggable memory that is available from early boot, and described as system RAM by the firmware. Existing loaders and EFI applications running in the boot context will happily use this memory for allocating data structures that cannot be freed or moved at runtime, and this prevents the memory from being unplugged. Going forward, the new EFI_MEMORY_HOT_PLUGGABLE attribute should be tested, and memory annotated as such should be avoided for such allocations. In the EFI stub, there are a couple of occurrences where, instead of the high-level AllocatePages() UEFI boot service, a low-level code sequence is used that traverses the EFI memory map and carves out the requested number of pages from a free region. This is needed, e.g., for allocating as low as possible, or for allocating pages at random. While AllocatePages() should presumably avoid special purpose memory and cold plugged regions, this manual approach needs to incorporate this logic itself, in order to prevent the kernel itself from ending up in a hot unpluggable region, preventing it from being unplugged. So add the EFI_MEMORY_HOTPLUGGABLE macro definition, and check for it where appropriate. Cc: stable@vger.kernel.org Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-21  serial: 8250_pci: Share WCH IDs with parport_serial driver  (Andy Shevchenko)
[ Upstream commit 535a07698b8b3e6f305673102d297262cae2360a ] parport_serial driver uses subset of WCH IDs that are present in 8250_pci. Share them via pci_ids.h and switch parport_serial to use defined constants. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20241204031114.1029882-3-andriy.shevchenko@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-21  block: cleanup and fix batch completion adding conditions  (Jens Axboe)
[ Upstream commit 1f47ed294a2bd577d5ae43e6e28e1c9a3be4a833 ] The conditions for whether or not a request may be added to a completion batch are a bit hard to read, and they also have a few issues. One is that ioerror may indeed be a random value on passthrough, and it's being checked regardless of whether or not the given request is a passthrough request. Rewrite the conditions to be separate for easier reading, and only check ioerror for non-passthrough requests. This fixes an issue with bio unmapping on passthrough, where it fails getting added to a batch. This both leads to suboptimal performance, and may trigger a potential schedule-under-atomic condition for polled passthrough IO. Fixes: f794f3351f26 ("block: add support for blk_mq_end_request_batch()") Link: https://lore.kernel.org/r/20575f0a-656e-4bb3-9d82-dec6c7e3a35c@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
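A sketch of the rewritten checks, reconstructed from the commit text (the batch-insertion tail is elided):

    static inline bool blk_mq_add_to_batch(struct request *req,
                                           struct io_comp_batch *iob,
                                           int ioerror,
                                           void (*complete)(struct io_comp_batch *))
    {
        /* 1) no batch container */
        if (!iob)
            return false;
        /* 2) scheduler data attached */
        if (req->rq_flags & RQF_SCHED_TAGS)
            return false;
        if (!blk_rq_is_passthrough(req)) {
            /* 3) end_io handlers need the full completion path */
            if (req->end_io)
                return false;
            /* 4) ioerror is only meaningful for non-passthrough */
            if (ioerror < 0)
                return false;
        }
        /* ... existing batch insertion unchanged ... */
        return true;
    }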
2025-02-17  jiffies: Cast to unsigned long in secs_to_jiffies() conversion  (Easwar Hariharan)
commit bb2784d9ab49587ba4fbff37a319fff2924db289 upstream. While converting users of msecs_to_jiffies(), lkp reported that some range checks would always be true because of the mismatch between the implied int value of secs_to_jiffies() vs the unsigned long return value of the msecs_to_jiffies() calls it was replacing. Fix this by casting the secs_to_jiffies() input value to unsigned long. Fixes: b35108a51cf7ba ("jiffies: Define secs_to_jiffies()") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20250130192701.99626-1-eahariha@linux.microsoft.com Closes: https://lore.kernel.org/oe-kbuild-all/202501301334.NB6NszQR-lkp@intel.com/ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
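The change this describes, sketched:

    /* Before: int arithmetic, so range checks against the result could
     * be statically true/false. */
    #define secs_to_jiffies(_secs) ((_secs) * HZ)

    /* After: force unsigned long, matching msecs_to_jiffies()'s return
     * type. */
    #define secs_to_jiffies(_secs) ((unsigned long)(_secs) * HZ)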
2025-02-17  hrtimers: Force migrate away hrtimers queued after CPUHP_AP_HRTIMERS_DYING  (Frederic Weisbecker)
commit 53dac345395c0d2493cbc2f4c85fe38aef5b63f5 upstream. hrtimers are migrated away from the dying CPU to any online target at the CPUHP_AP_HRTIMERS_DYING stage in order not to delay bandwidth timers handling tasks involved in the CPU hotplug forward progress. However wakeups can still be performed by the outgoing CPU after CPUHP_AP_HRTIMERS_DYING. Those can result again in bandwidth timers being armed. Depending on several considerations (crystal ball power management based election, earliest timer already enqueued, timer migration enabled or not), the target may eventually be the current CPU even if offline. If that happens, the timer is eventually ignored. The most notable example is RCU which had to deal with each and every of those wake-ups by deferring them to an online CPU, along with related workarounds:

_ e787644caf76 (rcu: Defer RCU kthreads wakeup when CPU is dying)
_ 9139f93209d1 (rcu/nocb: Fix RT throttling hrtimer armed from offline CPU)
_ f7345ccc62a4 (rcu/nocb: Fix rcuog wake-up from offline softirq)

The problem isn't confined to RCU though as the stop machine kthread (which runs CPUHP_AP_HRTIMERS_DYING) reports its completion at the end of its work through cpu_stop_signal_done() and performs a wake up that eventually arms the deadline server timer:

WARNING: CPU: 94 PID: 588 at kernel/time/hrtimer.c:1086 hrtimer_start_range_ns+0x289/0x2d0
CPU: 94 UID: 0 PID: 588 Comm: migration/94 Not tainted
Stopper: multi_cpu_stop+0x0/0x120 <- stop_machine_cpuslocked+0x66/0xc0
RIP: 0010:hrtimer_start_range_ns+0x289/0x2d0
Call Trace:
<TASK>
  start_dl_timer
  enqueue_dl_entity
  dl_server_start
  enqueue_task_fair
  enqueue_task
  ttwu_do_activate
  try_to_wake_up
  complete
  cpu_stopper_thread

Instead of providing yet another bandaid to work around the situation, fix it in the hrtimers infrastructure instead: always migrate away a timer to an online target whenever it is enqueued from an offline CPU. This will also allow to revert all the above RCU disgraceful hacks. Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Reported-by: Vlad Poenaru <vlad.wing@gmail.com> Reported-by: Usama Arif <usamaarif642@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Tested-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/all/20250117232433.24027-1-frederic@kernel.org Closes: 20241213203739.1519801-1-usamaarif642@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-17  kvm: defer huge page recovery vhost task to later  (Keith Busch)
commit 931656b9e2ff7029aee0b36e17780621948a6ac1 upstream. Some libraries want to ensure they are single threaded before forking, so making the kernel's kvm huge page recovery process a vhost task of the user process breaks those. The minijail library used by crosvm is one such affected application. Defer the task to after the first VM_RUN call, which occurs after the parent process has forked all its jailed processes. This needs to happen only once for the kvm instance, so introduce some general-purpose infrastructure for that, too. It's similar in concept to pthread_once; except it is actually usable, because the callback takes a parameter. Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Alyssa Ross <hi@alyssa.is> Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250123153543.2769928-1-kbusch@meta.com> [Move call_once API to include/linux. - Paolo] Cc: stable@vger.kernel.org Fixes: d96c77bd4eeb ("KVM: x86: switch hugepage recovery thread to vhost_task") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
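A sketch of the call_once infrastructure the message describes; the struct layout and state values below are assumptions:

    /* Run cb exactly once per 'struct once' instance, tolerating
     * concurrent callers; unlike pthread_once, the callback takes a
     * parameter. */
    struct once {
        atomic_t state;     /* 0 = not run, 1 = running, 2 = done */
        struct mutex lock;
    };

    static inline void call_once(struct once *once, void (*cb)(struct once *))
    {
        /* Fast path: pairs with the release store below. */
        if (atomic_read_acquire(&once->state) == 2)
            return;

        mutex_lock(&once->lock);
        if (atomic_read(&once->state) == 0) {
            atomic_set(&once->state, 1);
            cb(once);
            atomic_set_release(&once->state, 2);
        }
        mutex_unlock(&once->lock);
    }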
2025-02-17  KVM: Explicitly verify target vCPU is online in kvm_get_vcpu()  (Sean Christopherson)
commit 1e7381f3617d14b3c11da80ff5f8a93ab14cfc46 upstream. Explicitly verify the target vCPU is fully online _prior_ to clamping the index in kvm_get_vcpu(). If the index is "bad", the nospec clamping will generate '0', i.e. KVM will return vCPU0 instead of NULL. In practice, the bug is unlikely to cause problems, as it will only come into play if userspace or the guest is buggy or misbehaving, e.g. KVM may send interrupts to vCPU0 instead of dropping them on the floor. However, returning vCPU0 when it shouldn't exist per online_vcpus is problematic now that KVM uses an xarray for the vCPUs array, as KVM needs to insert into the xarray before publishing the vCPU to userspace (see commit c5b077549136 ("KVM: Convert the kvm->vcpus array to a xarray")), i.e. before vCPU creation is guaranteed to succeed. As a result, incorrectly providing access to vCPU0 will trigger a use-after-free if vCPU0 is dereferenced and kvm_vm_ioctl_create_vcpu() bails out of vCPU creation due to an error and frees vCPU0. Commit afb2acb2e3a3 ("KVM: Fix vcpu_array[0] races") papered over that issue, but in doing so introduced an unsolvable teardown conundrum. Preventing accesses to vCPU0 before it's fully online will allow reverting commit afb2acb2e3a3, without re-introducing the vcpu_array[0] UAF race. Fixes: 1d487e9bf8ba ("KVM: fix spectrev1 gadgets") Cc: stable@vger.kernel.org Cc: Will Deacon <will@kernel.org> Cc: Michal Luczaj <mhal@rbox.co> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20241009150455.1057573-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
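A sketch of the reordered check; the xarray lookup follows the vcpu_array conversion referenced above:

    /* Validate against online_vcpus *before* the nospec clamp, so a bad
     * index yields NULL instead of being clamped to vCPU0. */
    static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
    {
        int num_vcpus = atomic_read(&kvm->online_vcpus);

        /* Explicitly reject out-of-range indices up front. */
        if (i < 0 || i >= num_vcpus)
            return NULL;

        /* Clamp only to defeat speculation, not to "fix up" bad input. */
        i = array_index_nospec(i, num_vcpus);
        return xa_load(&kvm->vcpu_array, i);
    }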
2025-02-17  HID: hid-asus: Disable OOBE mode on the ProArt P16  (Luke D. Jones)
[ Upstream commit 53078a736fbc60e5d3a1e14f4cd4214003815026 ] The new ASUS ProArt 16" laptop series come with their keyboards stuck in an Out-Of-Box-Experience mode. While in this mode most functions will not work such as LED control or Fn key combos. The correct init sequence is now done to disable this OOBE. This patch addresses only the ProArt series so far and it is unknown if there may be others, in which case a new quirk may be required. Signed-off-by: Luke D. Jones <luke@ljones.dev> Co-developed-by: Connor Belli <connorbelli2003@gmail.com> Signed-off-by: Connor Belli <connorbelli2003@gmail.com> Tested-by: Jan Schmidt <jan@centricular.com> Signed-off-by: Jiri Kosina <jkosina@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-17  net/mlx5: use do_aux_work for PHC overflow checks  (Vadim Fedorenko)
[ Upstream commit e61e6c415ba9ff2b32bb6780ce1b17d1d76238f1 ] The overflow_work is using the system wq to do overflow checks and updates for the PHC device timecounter, which might be overwhelmed by other tasks. But there is a dedicated kthread in the PTP subsystem designed for such things. This patch changes the work queue to properly align with the PTP subsystem and to avoid overloading the system work queue. The adjfine() function acts the same way as the overflow check worker; we can postpone the ptp aux worker till the next overflow period after adjfine() was called. Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250107104812.380225-1-vadfed@meta.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-17  exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case  (Kees Cook)
[ Upstream commit 543841d1806029889c2f69f040e88b247aba8e22 ] Zbigniew mentioned at Linux Plumber's that systemd is interested in switching to execveat() for service execution, but can't, because the contents of /proc/pid/comm are the file descriptor which was used, instead of the path to the binary[1]. This makes the output of tools like top and ps useless, especially in a world where most fds are opened CLOEXEC so the number is truly meaningless. When the filename passed in is empty (e.g. with AT_EMPTY_PATH), use the dentry's filename for "comm" instead of using the useless numeral from the synthetic fdpath construction. This way the actual exec machinery is unchanged, but cosmetically the comm looks reasonable to admins investigating things. Instead of adding TASK_COMM_LEN more bytes to bprm, use one of the unused flag bits to indicate that we need to set "comm" from the dentry. Suggested-by: Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl> Suggested-by: Tycho Andersen <tandersen@netflix.com> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://github.com/uapi-group/kernel-features#set-comm-field-before-exec [1] Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Tested-by: Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl> Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-08  pwm: Ensure callbacks exist before calling them  (Uwe Kleine-König)
commit da6b353786997c0ffa67127355ad1d54ed3324c2 upstream. If one of the waveform functions is called for a chip that only supports .apply(), we want that an error code is returned and not a NULL pointer exception. Fixes: 6c5126c6406d ("pwm: Provide new consumer API functions for waveforms") Cc: stable@vger.kernel.org Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Tested-by: Trevor Gamblin <tgamblin@baylibre.com> Link: https://lore.kernel.org/r/20250123172709.391349-2-u.kleine-koenig@baylibre.com Signed-off-by: Uwe Kleine-König <ukleinek@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
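The shape of the guard, sketched; the function and member names here follow the waveform consumer API loosely and are assumptions:

    /* Sketch: return an error instead of dereferencing a NULL op when a
     * chip only implements .apply(). */
    int pwm_set_waveform_might_sleep(struct pwm_device *pwm,
                                     const struct pwm_waveform *wf, bool exact)
    {
        struct pwm_chip *chip = pwm->chip;
        const struct pwm_ops *ops = chip->ops;

        if (!ops->write_waveform)
            return -EOPNOTSUPP; /* chip predates the waveform callbacks */

        /* ... existing rounding + write logic ... */
    }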
2025-02-08  pps: Fix a use-after-free  (Calvin Owens)
commit c79a39dc8d060b9e64e8b0fa9d245d44befeefbe upstream. On a board running ntpd and gpsd, I'm seeing a consistent use-after-free in sys_exit() from gpsd when rebooting:

pps pps1: removed
------------[ cut here ]------------
kobject: '(null)' (00000000db4bec24): is not initialized, yet kobject_put() is being called.
WARNING: CPU: 2 PID: 440 at lib/kobject.c:734 kobject_put+0x120/0x150
CPU: 2 UID: 299 PID: 440 Comm: gpsd Not tainted 6.11.0-rc6-00308-gb31c44928842 #1
Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT)
pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : kobject_put+0x120/0x150
lr : kobject_put+0x120/0x150
sp : ffffffc0803d3ae0
x29: ffffffc0803d3ae0 x28: ffffff8042dc9738 x27: 0000000000000001
x26: 0000000000000000 x25: ffffff8042dc9040 x24: ffffff8042dc9440
x23: ffffff80402a4620 x22: ffffff8042ef4bd0 x21: ffffff80405cb600
x20: 000000000008001b x19: ffffff8040b3b6e0 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: 696e6920746f6e20
x14: 7369203a29343263 x13: 205d303434542020 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
Call trace:
 kobject_put+0x120/0x150
 cdev_put+0x20/0x3c
 __fput+0x2c4/0x2d8
 ____fput+0x1c/0x38
 task_work_run+0x70/0xfc
 do_exit+0x2a0/0x924
 do_group_exit+0x34/0x90
 get_signal+0x7fc/0x8c0
 do_signal+0x128/0x13b4
 do_notify_resume+0xdc/0x160
 el0_svc+0xd4/0xf8
 el0t_64_sync_handler+0x140/0x14c
 el0t_64_sync+0x190/0x194
---[ end trace 0000000000000000 ]---

...followed by more symptoms of corruption, with similar stacks:

refcount_t: underflow; use-after-free.
kernel BUG at lib/list_debug.c:62!
Kernel panic - not syncing: Oops - BUG: Fatal exception

This happens because pps_device_destruct() frees the pps_device with the embedded cdev immediately after calling cdev_del(), but, as the comment above cdev_del() notes, fops for previously opened cdevs are still callable even after cdev_del() returns. I think this bug has always been there: I can't explain why it suddenly started happening every time I reboot this particular board. In commit d953e0e837e6 ("pps: Fix a use-after free bug when unregistering a source."), George Spelvin suggested removing the embedded cdev. That seems like the simplest way to fix this, so I've implemented his suggestion, using __register_chrdev() with pps_idr becoming the source of truth for which minor corresponds to which device. But now that pps_idr defines userspace visibility instead of cdev_add(), we need to be sure the pps->dev refcount can't reach zero while userspace can still find it again. So, the idr_remove() call moves to pps_unregister_cdev(), and pps_idr now holds a reference to pps->dev.

pps_core: source serial1 got cdev (251:1)
<...>
pps pps1: removed
pps_core: unregistering pps1
pps_core: deallocating pps1

Fixes: d953e0e837e6 ("pps: Fix a use-after free bug when unregistering a source.") Cc: stable@vger.kernel.org Signed-off-by: Calvin Owens <calvin@wbinvd.org> Reviewed-by: Michal Schmidt <mschmidt@redhat.com> Link: https://lore.kernel.org/r/a17975fd5ae99385791929e563f72564edbcf28f.1731383727.git.calvin@wbinvd.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-08  usb: typec: tcpci: Prevent Sink disconnection before vPpsShutdown in SPR PPS  (Kyle Tso)
commit 4d27afbf256028a1f54363367f30efc8854433c3 upstream. The Source can drop its output voltage to the minimum of the requested PPS APDO voltage range when it is in Current Limit Mode. If this voltage falls within the range of vPpsShutdown, the Source initiates a Hard Reset and discharges Vbus. However, currently the Sink may disconnect before the voltage reaches vPpsShutdown, leading to unexpected behavior. Prevent premature disconnection by setting the Sink's disconnect threshold to the minimum vPpsShutdown value. Additionally, consider the voltage drop due to IR drop when calculating the appropriate threshold. This ensures a robust and reliable interaction between the Source and Sink during SPR PPS Current Limit Mode operation. Fixes: 4288debeaa4e ("usb: typec: tcpci: Fix up sink disconnect thresholds for PD") Cc: stable <stable@kernel.org> Signed-off-by: Kyle Tso <kyletso@google.com> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Reviewed-by: Badhri Jagan Sridharan <badhri@google.com> Link: https://lore.kernel.org/r/20250114142435.2093857-1-kyletso@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-08  nfs: fix incorrect error handling in LOCALIO  (Mike Snitzer)
[ Upstream commit ead11ac50ad4b8ef1b64806e962ea984862d96ad ] nfs4_stat_to_errno() expects an NFSv4 error code as an argument and returns a POSIX errno. The problem is that LOCALIO is passing nfs4_stat_to_errno() the POSIX errno return values from filp->f_op->read_iter(), filp->f_op->write_iter() and vfs_fsync_range(). So the POSIX errno that nfs_local_pgio_done() and nfs_local_commit_done() are passing to nfs4_stat_to_errno() are failing to match any NFSv4 error code, which results in nfs4_stat_to_errno() defaulting to returning -EREMOTEIO. This causes assertions in upper layers due to -EREMOTEIO not being a valid NFSv4 error code. Fix this by updating nfs_local_pgio_done() and nfs_local_commit_done() to use the new nfs_localio_errno_to_nfs4_stat() to map a POSIX errno to an NFSv4 error code. Care was taken to factor out nfs4_errtbl_common[] to avoid duplicating the same NFS error to errno table. nfs4_errtbl_common[] is checked first by both nfs4_stat_to_errno and nfs_localio_errno_to_nfs4_stat before they check their own more specialized tables (nfs4_errtbl[] and nfs4_errtbl_localio[] respectively). While auditing the associated error mapping tables, the (ab)use of -1 for the last table entry was removed in favor of using ARRAY_SIZE to iterate the nfs_errtbl[] and nfs4_errtbl[]. And 'errno_NFSERR_IO' was removed because it caused needless obfuscation. Fixes: 70ba381e1a431 ("nfs: add LOCALIO support") Reported-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-08  module: Extend the preempt disabled section in dereference_symbol_descriptor().  (Sebastian Andrzej Siewior)
[ Upstream commit a145c848d69f9c6f32008d8319edaa133360dd74 ] dereference_symbol_descriptor() needs to obtain the module that a pointer belongs to in order to resolve that pointer. The returned mod pointer is obtained under RCU-sched/preempt_disable() guarantees and needs to be used within this section to ensure that the module is not removed in the meantime. Extend the preempt_disable() section to also cover dereference_module_function_descriptor(). Fixes: 04b8eb7a4ccd9 ("symbol lookup: introduce dereference_symbol_descriptor()") Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Helge Deller <deller@gmx.de> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N Rao <naveen@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20250108090457.512198-2-bigeasy@linutronix.de Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
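The widened critical section, sketched from the description (details such as the descriptor helpers are assumed from the existing kallsyms code):

    static inline void *dereference_symbol_descriptor(void *ptr)
    {
    #ifdef CONFIG_HAVE_FUNCTION_DESCRIPTORS
        struct module *mod;

        ptr = dereference_kernel_function_descriptor(ptr);
        if (is_ksym_addr((unsigned long)ptr))
            return ptr;

        preempt_disable();
        mod = __module_address((unsigned long)ptr);

        if (mod)
            ptr = dereference_module_function_descriptor(mod, ptr);
        /* previously re-enabled right after __module_address() */
        preempt_enable();
    #endif
        return ptr;
    }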
2025-02-08  alloc_tag: avoid current->alloc_tag manipulations when profiling is disabled  (Suren Baghdasaryan)
[ Upstream commit 07438779313caafe52ac1a1a6958d735a5938988 ] When memory allocation profiling is disabled there is no need to update current->alloc_tag and these manipulations add unnecessary overhead. Fix the overhead by skipping these extra updates. I ran comprehensive testing on Pixel 6 on Big, Medium and Little cores:

        Overhead before fixes      Overhead after fixes
        slab alloc   page alloc    slab alloc   page alloc
Big     6.21%        5.32%         3.31%        4.93%
Medium  4.51%        5.05%         3.79%        4.39%
Little  7.62%        1.82%         6.68%        1.02%

This is an allocation microbenchmark doing allocations in a tight loop. Not a really realistic scenario and useful only to make performance comparisons. Link: https://lkml.kernel.org/r/20241226211639.1357704-1-surenb@google.com Fixes: b951aaff5035 ("mm: enable page allocation tagging") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: David Wang <00107082@163.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Yu Zhao <yuzhao@google.com> Cc: Zhenhua Huang <quic_zhenhuah@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
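The skip, sketched; mem_alloc_profiling_enabled() is assumed to be the existing static-key test in the alloc_tag code:

    /* Sketch: skip the per-task tag bookkeeping entirely when profiling
     * is off. */
    static inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag)
    {
        struct alloc_tag *prev;

        if (!mem_alloc_profiling_enabled())
            return NULL;

        prev = current->alloc_tag;
        current->alloc_tag = tag;
        return prev;
    }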
2025-02-08bpf: Reject struct_ops registration that uses module ptr and the module btf_id is missingMartin KaFai Lau
[ Upstream commit 96ea081ed52bf077cad6d00153b6fba68e510767 ] There is a UAF report in bpf_struct_ops when CONFIG_MODULES=n. In particular, the report is on tcp_congestion_ops, which has a "struct module *owner" member. A struct_ops that has a "struct module *owner" member can be extended either by a regular kernel module or by bpf_struct_ops. bpf_try_module_get() will be used to do the refcounting, and different refcounting is done based on the owner pointer. When CONFIG_MODULES=n, the btf_id of the "struct module" is missing:

WARN: resolve_btfids: unresolved symbol module

Thus, bpf_try_module_get() cannot do the correct refcounting. Not all subsystems' struct_ops require the "struct module *owner" member, e.g. the recent sched_ext_ops. This patch disables bpf_struct_ops registration if the struct_ops has the "struct module *" member and the "struct module" btf_id is missing. The btf_type_is_fwd() helper is moved to the btf.h header file for this test.

This issue has existed since the beginning of bpf_struct_ops, which has gone through many changes. The Fixes tag is set to a recent commit that this patch can apply cleanly to. Considering that CONFIG_MODULES=n is not common and given the age of the issue, this is targeted at bpf-next as well.

Fixes: 1611603537a4 ("bpf: Create argument information for nullable arguments.") Reported-by: Robert Morris <rtm@csail.mit.edu> Closes: https://lore.kernel.org/bpf/74665.1733669976@localhost/ Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Tested-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241220201818.127152-1-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
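A standalone analogue of the registration-time check (the types and helpers here are illustrative, not the BTF API; only btf_type_is_fwd() is named by the commit):

    #include <stdbool.h>
    #include <string.h>

    enum kind { KIND_STRUCT, KIND_FWD };

    struct type_info { enum kind kind; const char *name; };

    /* A pointee that resolves only to a forward declaration of
     * "struct module" means no module BTF id exists (CONFIG_MODULES=n),
     * so refcounting via the owner pointer could not work. */
    static bool module_btf_id_missing(const struct type_info *pointee)
    {
            return pointee->kind == KIND_FWD && !strcmp(pointee->name, "module");
    }

    static int register_struct_ops(const struct type_info *pointees, int n)
    {
            int i;

            for (i = 0; i < n; i++)
                    if (module_btf_id_missing(&pointees[i]))
                            return -95;  /* reject; errno value illustrative */
            return 0;
    }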
2025-02-08inet: ipmr: fix data-racesEric Dumazet
[ Upstream commit 3440fa34ad99d471f1085bc2f4dedeaebc310261 ] The following fields of 'struct mr_mfc' can be updated concurrently (no lock protection) from ip_mr_forward() and ip6_mr_forward():

- bytes
- pkt
- wrong_if
- lastuse

They can also be read from other functions. Convert bytes, pkt and wrong_if to atomic_long_t, and use READ_ONCE()/WRITE_ONCE() for lastuse. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250114221049.1190631-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
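An illustrative userspace analogue of the conversion, with C11 atomics standing in for the kernel's atomic_long_t and READ_ONCE()/WRITE_ONCE():

    #include <stdatomic.h>

    /* Counters that several forwarding paths may bump with no lock held. */
    struct mfc_stats {
            atomic_long bytes;
            atomic_long pkt;
            atomic_long wrong_if;
            atomic_ulong lastuse;  /* a plain field in the kernel, accessed
                                      with READ_ONCE()/WRITE_ONCE(); relaxed
                                      atomics model that here */
    };

    static void account_forward(struct mfc_stats *s, long len, unsigned long now)
    {
            atomic_fetch_add_explicit(&s->bytes, len, memory_order_relaxed);
            atomic_fetch_add_explicit(&s->pkt, 1, memory_order_relaxed);
            atomic_store_explicit(&s->lastuse, now, memory_order_relaxed);
    }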
2025-02-08wifi: mac80211: Fix common size calculation for ML elementIlan Peer
[ Upstream commit 19aa842dcbb5860509b7e1b7745dbae0b791f6c4 ] When the ML type is EPCS, the control bitmap is reserved; the length is always 7 and is captured by the first octet after the control. Fixes: 0f48b8b88aa9 ("wifi: ieee80211: add definitions for multi-link element") Signed-off-by: Ilan Peer <ilan.peer@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250102161730.5790376754a7.I381208cbb72b1be2a88239509294099e9337e254@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-08HID: fix generic desktop D-Pad controlsTerry Tritton
[ Upstream commit 80818fdc068eaab729bb793d790ae9fd053f7053 ] The addition of the "System Do Not Disturb" event code caused the Generic Desktop D-Pad configuration to be skipped. This commit allows both to be configured without conflicting with each other. Fixes: 22d6d060ac77 ("input: Add support for "Do Not Disturb"") Signed-off-by: Terry Tritton <terry.tritton@linaro.org> Reviewed-by: Aseda Aboagye <aaboagye@chromium.org> Reviewed-by: Carlos Llamas <cmllamas@google.com> Signed-off-by: Jiri Kosina <jkosina@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-08ax25: rcu protect dev->ax25_ptrEric Dumazet
[ Upstream commit 95fc45d1dea8e1253f8ec58abc5befb71553d666 ] syzbot found a lockdep issue [1]. We should remove the ax25 RTNL dependency in ax25_setsockopt(). This should also fix a variety of possible UAFs in ax25.

[1]

WARNING: possible circular locking dependency detected
6.13.0-rc3-syzkaller-00762-g9268abe611b0 #0 Not tainted
------------------------------------------------------
syz.5.1818/12806 is trying to acquire lock:
ffffffff8fcb3988 (rtnl_mutex){+.+.}-{4:4}, at: ax25_setsockopt+0xa55/0xe90 net/ax25/af_ax25.c:680

but task is already holding lock:
ffff8880617ac258 (sk_lock-AF_AX25){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1618 [inline]
ffff8880617ac258 (sk_lock-AF_AX25){+.+.}-{0:0}, at: ax25_setsockopt+0x209/0xe90 net/ax25/af_ax25.c:574

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (sk_lock-AF_AX25){+.+.}-{0:0}:
       lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
       lock_sock_nested+0x48/0x100 net/core/sock.c:3642
       lock_sock include/net/sock.h:1618 [inline]
       ax25_kill_by_device net/ax25/af_ax25.c:101 [inline]
       ax25_device_event+0x24d/0x580 net/ax25/af_ax25.c:146
       notifier_call_chain+0x1a5/0x3f0 kernel/notifier.c:85
       __dev_notify_flags+0x207/0x400
       dev_change_flags+0xf0/0x1a0 net/core/dev.c:9026
       dev_ifsioc+0x7c8/0xe70 net/core/dev_ioctl.c:563
       dev_ioctl+0x719/0x1340 net/core/dev_ioctl.c:820
       sock_do_ioctl+0x240/0x460 net/socket.c:1234
       sock_ioctl+0x626/0x8e0 net/socket.c:1339
       vfs_ioctl fs/ioctl.c:51 [inline]
       __do_sys_ioctl fs/ioctl.c:906 [inline]
       __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #0 (rtnl_mutex){+.+.}-{4:4}:
       check_prev_add kernel/locking/lockdep.c:3161 [inline]
       check_prevs_add kernel/locking/lockdep.c:3280 [inline]
       validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904
       __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226
       lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
       __mutex_lock_common kernel/locking/mutex.c:585 [inline]
       __mutex_lock+0x1ac/0xee0 kernel/locking/mutex.c:735
       ax25_setsockopt+0xa55/0xe90 net/ax25/af_ax25.c:680
       do_sock_setsockopt+0x3af/0x720 net/socket.c:2324
       __sys_setsockopt net/socket.c:2349 [inline]
       __do_sys_setsockopt net/socket.c:2355 [inline]
       __se_sys_setsockopt net/socket.c:2352 [inline]
       __x64_sys_setsockopt+0x1ee/0x280 net/socket.c:2352
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sk_lock-AF_AX25);
                               lock(rtnl_mutex);
                               lock(sk_lock-AF_AX25);
  lock(rtnl_mutex);

 *** DEADLOCK ***

1 lock held by syz.5.1818/12806:
 #0: ffff8880617ac258 (sk_lock-AF_AX25){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1618 [inline]
 #0: ffff8880617ac258 (sk_lock-AF_AX25){+.+.}-{0:0}, at: ax25_setsockopt+0x209/0xe90 net/ax25/af_ax25.c:574

stack backtrace:
CPU: 1 UID: 0 PID: 12806 Comm: syz.5.1818 Not tainted 6.13.0-rc3-syzkaller-00762-g9268abe611b0 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
 print_circular_bug+0x13a/0x1b0 kernel/locking/lockdep.c:2074
 check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2206
 check_prev_add kernel/locking/lockdep.c:3161 [inline]
 check_prevs_add kernel/locking/lockdep.c:3280 [inline]
 validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904
 __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226
 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
 __mutex_lock_common kernel/locking/mutex.c:585 [inline]
 __mutex_lock+0x1ac/0xee0 kernel/locking/mutex.c:735
 ax25_setsockopt+0xa55/0xe90 net/ax25/af_ax25.c:680
 do_sock_setsockopt+0x3af/0x720 net/socket.c:2324
 __sys_setsockopt net/socket.c:2349 [inline]
 __do_sys_setsockopt net/socket.c:2355 [inline]
 __se_sys_setsockopt net/socket.c:2352 [inline]
 __x64_sys_setsockopt+0x1ee/0x280 net/socket.c:2352
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f7b62385d29

Fixes: c433570458e4 ("ax25: fix a use-after-free in ax25_fillin_cb()") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250103210514.87290-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
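The reader-side pattern that replaces the RTNL dependency, as a hedged sketch (kernel RCU primitives assumed; not the verbatim patch):

    /* Fetch dev->ax25_ptr under the RCU read lock instead of relying
     * on RTNL; the pointer must not be used after rcu_read_unlock(). */
    rcu_read_lock();
    ax25_dev = rcu_dereference(dev->ax25_ptr);
    if (ax25_dev) {
            /* use ax25_dev only inside this RCU section */
    }
    rcu_read_unlock();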
2025-02-08ptr_ring: do not block hard interrupts in ptr_ring_resize_multiple()Eric Dumazet
[ Upstream commit a126061c80d5efb4baef4bcf346094139cd81df6 ] Jakub added a lockdep_assert_no_hardirq() check in __page_pool_put_page() to increase test coverage. syzbot found a splat caused by hard irq blocking in ptr_ring_resize_multiple() [1]. As current users of ptr_ring_resize_multiple() do not require hard irqs to be masked, change it to only block BH. Rename the helpers to better reflect that they are safe against BH only:

- ptr_ring_resize_multiple() to ptr_ring_resize_multiple_bh()
- skb_array_resize_multiple() to skb_array_resize_multiple_bh()

[1]

WARNING: CPU: 1 PID: 9150 at net/core/page_pool.c:709 __page_pool_put_page net/core/page_pool.c:709 [inline]
WARNING: CPU: 1 PID: 9150 at net/core/page_pool.c:709 page_pool_put_unrefed_netmem+0x157/0xa40 net/core/page_pool.c:780
Modules linked in:
CPU: 1 UID: 0 PID: 9150 Comm: syz.1.1052 Not tainted 6.11.0-rc3-syzkaller-00202-gf8669d7b5f5d #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024
RIP: 0010:__page_pool_put_page net/core/page_pool.c:709 [inline]
RIP: 0010:page_pool_put_unrefed_netmem+0x157/0xa40 net/core/page_pool.c:780
Code: 74 0e e8 7c aa fb f7 eb 43 e8 75 aa fb f7 eb 3c 65 8b 1d 38 a8 6a 76 31 ff 89 de e8 a3 ae fb f7 85 db 74 0b e8 5a aa fb f7 90 <0f> 0b 90 eb 1d 65 8b 1d 15 a8 6a 76 31 ff 89 de e8 84 ae fb f7 85
RSP: 0018:ffffc9000bda6b58 EFLAGS: 00010083
RAX: ffffffff8997e523 RBX: 0000000000000000 RCX: 0000000000040000
RDX: ffffc9000fbd0000 RSI: 0000000000001842 RDI: 0000000000001843
RBP: 0000000000000000 R08: ffffffff8997df2c R09: 1ffffd40003a000d
R10: dffffc0000000000 R11: fffff940003a000e R12: ffffea0001d00040
R13: ffff88802e8a4000 R14: dffffc0000000000 R15: 00000000ffffffff
FS:  00007fb7aaf716c0(0000) GS:ffff8880b9300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa15a0d4b72 CR3: 00000000561b0000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 tun_ptr_free drivers/net/tun.c:617 [inline]
 __ptr_ring_swap_queue include/linux/ptr_ring.h:571 [inline]
 ptr_ring_resize_multiple_noprof include/linux/ptr_ring.h:643 [inline]
 tun_queue_resize drivers/net/tun.c:3694 [inline]
 tun_device_event+0xaaf/0x1080 drivers/net/tun.c:3714
 notifier_call_chain+0x19f/0x3e0 kernel/notifier.c:93
 call_netdevice_notifiers_extack net/core/dev.c:2032 [inline]
 call_netdevice_notifiers net/core/dev.c:2046 [inline]
 dev_change_tx_queue_len+0x158/0x2a0 net/core/dev.c:9024
 do_setlink+0xff6/0x41f0 net/core/rtnetlink.c:2923
 rtnl_setlink+0x40d/0x5a0 net/core/rtnetlink.c:3201
 rtnetlink_rcv_msg+0x73f/0xcf0 net/core/rtnetlink.c:6647
 netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2550

Fixes: ff4e538c8c3e ("page_pool: add a lockdep check for recycling in hardirq") Reported-by: syzbot+f56a5c5eac2b28439810@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/671e10df.050a0220.2b8c0f.01cf.GAE@google.com/T/ Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20241217135121.326370-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
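The gist of the locking change, sketched against the helpers named in the trace above (hedged; helper usage assumed from include/linux/ptr_ring.h, not the verbatim patch):

    /* BH-safe resize: blocking bottom halves is enough for all current
     * callers, so hard interrupts stay enabled during the swap. */
    static inline int ptr_ring_resize_bh(struct ptr_ring *r, int size, gfp_t gfp,
                                         void (*destroy)(void *))
    {
            void **queue = __ptr_ring_init_queue_alloc(size, gfp);

            if (!queue)
                    return -ENOMEM;

            spin_lock_bh(&r->consumer_lock);    /* was spin_lock_irqsave() */
            spin_lock(&r->producer_lock);
            queue = __ptr_ring_swap_queue(r, queue, size, gfp, destroy);
            spin_unlock(&r->producer_lock);
            spin_unlock_bh(&r->consumer_lock);  /* was spin_unlock_irqrestore() */

            kvfree(queue);
            return 0;
    }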
2025-02-08sched/fair: Fix value reported by hot tasks pulled in /proc/schedstatPeter Zijlstra
[ Upstream commit a430d99e349026d53e2557b7b22bd2ebd61fe12a ] In /proc/schedstat, lb_hot_gained reports the number of hot tasks pulled during load balancing. This value is incremented in can_migrate_task() if the task is migratable and hot. After incrementing the value, the load balancer can still decide not to migrate this task, leading to wrong accounting. Fix this by incrementing the stats when hot tasks are detached. This issue only exists in detach_tasks(), where we can decide not to migrate a hot task even if it is migratable. However, in detach_one_task(), we migrate it unconditionally. [Swapnil: Handled the case where nr_failed_migrations_hot was not accounted properly and wrote the commit log] Fixes: d31980846f96 ("sched: Move up affinity check to mitigate useless redoing overhead") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com> Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241220063224.17767-2-swapnil.sapkal@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
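One way to picture the fix (illustrative fragment; the hotness recording is simplified and task_is_hot() is a hypothetical stand-in):

    /* In a detach_tasks()-style loop: bump lb_hot_gained only once the
     * hot task is really detached, not when it merely passes the
     * can_migrate_task() check. */
    hot = task_is_hot(p, env);  /* hypothetical helper */
    if (!can_migrate_task(p, env))
            continue;
    detach_task(p, env);
    if (hot)
            schedstat_inc(env->sd->lb_hot_gained[env->idle]);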
2025-02-08perf/core: Save raw sample data conditionally based on sample typeYabin Cui
[ Upstream commit b9c44b91476b67327a521568a854babecc4070ab ] Currently, space for raw sample data is always allocated within sample records for both BPF output and tracepoint events. This leads to unused space in sample records when raw sample data is not requested. This patch enforces checking the sample type of an event in perf_sample_save_raw_data(), so raw sample data will only be saved if explicitly requested, reducing overhead when it is not needed. Fixes: 0a9081cf0a11 ("perf/core: Add perf_sample_save_raw_data() helper") Signed-off-by: Yabin Cui <yabinc@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240515193610.2350456-2-yabinc@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
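The essence of the added guard, as a hedged fragment (assumed shape of the early return inside perf_sample_save_raw_data(); not the verbatim patch):

    /* Skip reserving record space unless the event asked for raw data. */
    if (!(event->attr.sample_type & PERF_SAMPLE_RAW))
            return;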