summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2025-04-10include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.hJuri Lelli
commit 34929a070b7fd06c386080c926b61ee844e6ad34 upstream. dl_rebuild_rd_accounting() is defined in cpuset.c, so it makes more sense to move related declarations to cpuset.h. Implement the move. Suggested-by: Waiman Long <llong@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Waiman Long <llong@redhat.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MSOVMpU7jpVrMU@jlelli-thinkpadt14gen4.remote.csb Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-10uprobes/x86: Harden uretprobe syscall trampoline checkJiri Olsa
commit fa6192adc32f4fdfe5b74edd5b210e12afd6ecc0 upstream. Jann reported a possible issue when trampoline_check_ip returns address near the bottom of the address space that is allowed to call into the syscall if uretprobes are not set up: https://lore.kernel.org/bpf/202502081235.5A6F352985@keescook/T/#m9d416df341b8fbc11737dacbcd29f0054413cbbf Though the mmap minimum address restrictions will typically prevent creating mappings there, let's make sure uretprobe syscall checks for that. Fixes: ff474a78cef5 ("uprobe: Add uretprobe syscall to speed up return probe") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Kees Cook <kees@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250212220433.3624297-1-jolsa@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-10xsk: Add launch time hardware offload support to XDP Tx metadataSong Yoong Siang
[ Upstream commit ca4419f15abd19ba8be1e109661b60f9f5b6c9f0 ] Extend the XDP Tx metadata framework so that user can requests launch time hardware offload, where the Ethernet device will schedule the packet for transmission at a pre-determined time called launch time. The value of launch time is communicated from user space to Ethernet driver via launch_time field of struct xsk_tx_metadata. Suggested-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250216093430.957880-2-yoong.siang.song@intel.com Stable-dep-of: d931cf9b38da ("igc: Fix TX drops in XDP ZC") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10rcu-tasks: Always inline rcu_irq_work_resched()Josh Poimboeuf
[ Upstream commit 6309a5c43b0dc629851f25b2e5ef8beff61d08e5 ] Thanks to CONFIG_DEBUG_SECTION_MISMATCH, empty functions can be generated out of line. rcu_irq_work_resched() can be called from noinstr code, so make sure it's always inlined. Fixes: 564506495ca9 ("rcu/context-tracking: Move deferred nocb resched to context tracking") Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/e84f15f013c07e4c410d972e75620c53b62c1b3e.1743481539.git.jpoimboe@kernel.org Closes: https://lore.kernel.org/d1eca076-fdde-484a-b33e-70e0d167c36d@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10context_tracking: Always inline ct_{nmi,irq}_{enter,exit}()Josh Poimboeuf
[ Upstream commit 9ac50f7311dc8b39e355582f14c1e82da47a8196 ] Thanks to CONFIG_DEBUG_SECTION_MISMATCH, empty functions can be generated out of line. These can be called from noinstr code, so make sure they're always inlined. Fixes the following warnings: vmlinux.o: warning: objtool: irqentry_nmi_enter+0xa2: call to ct_nmi_enter() leaves .noinstr.text section vmlinux.o: warning: objtool: irqentry_nmi_exit+0x16: call to ct_nmi_exit() leaves .noinstr.text section vmlinux.o: warning: objtool: irqentry_exit+0x78: call to ct_irq_exit() leaves .noinstr.text section Fixes: 6f0e6c1598b1 ("context_tracking: Take IRQ eqs entrypoints over RCU") Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/8509bce3f536bcd4ae7af3a2cf6930d48c5e631a.1743481539.git.jpoimboe@kernel.org Closes: https://lore.kernel.org/d1eca076-fdde-484a-b33e-70e0d167c36d@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10sched/smt: Always inline sched_smt_active()Josh Poimboeuf
[ Upstream commit 09f37f2d7b21ff35b8b533f9ab8cfad2fe8f72f6 ] sched_smt_active() can be called from noinstr code, so it should always be inlined. The CONFIG_SCHED_SMT version already has __always_inline. Do the same for its !CONFIG_SCHED_SMT counterpart. Fixes the following warning: vmlinux.o: error: objtool: intel_idle_ibrs+0x13: call to sched_smt_active() leaves .noinstr.text section Fixes: 321a874a7ef8 ("sched/smt: Expose sched_smt_present static key") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/1d03907b0a247cf7fb5c1d518de378864f603060.1743481539.git.jpoimboe@kernel.org Closes: https://lore.kernel.org/r/202503311434.lyw2Tveh-lkp@intel.com/ Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10thermal: core: Remove duplicate struct declarationxueqin Luo
[ Upstream commit 9e6ec8cf64e2973f0ec74f09023988cabd218426 ] The struct thermal_zone_device is already declared on line 32, so the duplicate declaration has been removed. Fixes: b1ae92dcfa8e ("thermal: core: Make struct thermal_zone_device definition internal") Signed-off-by: xueqin Luo <luoxueqin@kylinos.cn> Link: https://lore.kernel.org/r/20250206081436.51785-1-luoxueqin@kylinos.cn Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10NFSv4: Avoid unnecessary scans of filesystems for delayed delegationsTrond Myklebust
[ Upstream commit e767b59e29b8327d25edde65efc743f479f30d0a ] The amount of looping through the list of delegations is occasionally leading to soft lockups. If the state manager was asked to manage the delayed return of delegations, then only scan those filesystems containing delegations that were marked as being delayed. Fixes: be20037725d1 ("NFSv4: Fix delegation return in cases where we have to retry") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10NFSv4: Avoid unnecessary scans of filesystems for expired delegationsTrond Myklebust
[ Upstream commit f163aa81a799e2d46d7f8f0b42a0e7770eaa0d06 ] The amount of looping through the list of delegations is occasionally leading to soft lockups. If the state manager was asked to reap the expired delegations, it should scan only those filesystems that hold delegations that need to be reaped. Fixes: 7f156ef0bf45 ("NFSv4: Clean up nfs_delegation_reap_expired()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10NFSv4: Avoid unnecessary scans of filesystems for returning delegationsTrond Myklebust
[ Upstream commit 35a566a24e58f1b5f89737edf60b77de58719ed0 ] The amount of looping through the list of delegations is occasionally leading to soft lockups. If the state manager was asked to return delegations asynchronously, it should only scan those filesystems that hold delegations that need to be returned. Fixes: af3b61bf6131 ("NFSv4: Clean up nfs_client_return_marked_delegations()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10writeback: fix calculations in trace_balance_dirty_pages() for cgwbTang Yizhou
[ Upstream commit 6cc4c3aa714bc58ec5d20f3054ca5f23534984d1 ] In the commit dcc25ae76eb7 ("writeback: move global_dirty_limit into wb_domain") of the cgroup writeback backpressure propagation patchset, Tejun made some adaptations to trace_balance_dirty_pages() for cgroup writeback. However, this adaptation was incomplete and Tejun missed further adaptation in the subsequent patches. In the cgroup writeback scenario, if sdtc in balance_dirty_pages() is assigned to mdtc, then upon entering trace_balance_dirty_pages(), __entry->limit should be assigned based on the dirty_limit of the corresponding memcg's wb_domain, rather than global_wb_domain. To address this issue and simplify the implementation, introduce a 'limit' field in struct dirty_throttle_control to store the hard_limit value computed in wb_position_ratio() by calling hard_dirty_limit(). This field will then be used in trace_balance_dirty_pages() to assign the value to __entry->limit. Link: https://lkml.kernel.org/r/20250304110318.159567-4-yizhou.tang@shopee.com Fixes: dcc25ae76eb7 ("writeback: move global_dirty_limit into wb_domain") Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10writeback: let trace_balance_dirty_pages() take struct dtc as parameterTang Yizhou
[ Upstream commit f1ab2831e2a4312046bca79256b2efc41d373eaf ] Patch series "Fix calculations in trace_balance_dirty_pages() for cgwb", v2. In my experiment, I found that the output of trace_balance_dirty_pages() in the cgroup writeback scenario was strange because trace_balance_dirty_pages() always uses global_wb_domain.dirty_limit for related calculations instead of the dirty_limit of the corresponding memcg's wb_domain. The basic idea of the fix is to store the hard dirty limit value computed in wb_position_ratio() into struct dirty_throttle_control and use it for calculations in trace_balance_dirty_pages(). This patch (of 3): Currently, trace_balance_dirty_pages() already has 12 parameters. In the patch #3, I initially attempted to introduce an additional parameter. However, in include/linux/trace_events.h, bpf_trace_run12() only supports up to 12 parameters and bpf_trace_run13() does not exist. To reduce the number of parameters in trace_balance_dirty_pages(), we can make it accept a pointer to struct dirty_throttle_control as a parameter. To achieve this, we need to move the definition of struct dirty_throttle_control from mm/page-writeback.c to include/linux/writeback.h. Link: https://lkml.kernel.org/r/20250304110318.159567-1-yizhou.tang@shopee.com Link: https://lkml.kernel.org/r/20250304110318.159567-2-yizhou.tang@shopee.com Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jan Kara <jack@suse.cz> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Tang Yizhou <yizhou.tang@shopee.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 6cc4c3aa714b ("writeback: fix calculations in trace_balance_dirty_pages() for cgwb") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10reboot: replace __hw_protection_shutdown bool action parameter with an enumAhmad Fatoum
[ Upstream commit 318f05a057157c400a92f17c5f2fa3dd96f57c7e ] Patch series "reboot: support runtime configuration of emergency hw_protection action", v3. We currently leave the decision of whether to shutdown or reboot to protect hardware in an emergency situation to the individual drivers. This works out in some cases, where the driver detecting the critical failure has inside knowledge: It binds to the system management controller for example or is guided by hardware description that defines what to do. This is inadequate in the general case though as a driver reporting e.g. an imminent power failure can't know whether a shutdown or a reboot would be more appropriate for a given hardware platform. To address this, this series adds a hw_protection kernel parameter and sysfs toggle that can be used to change the action from the shutdown default to reboot. A new hw_protection_trigger API then makes use of this default action. My particular use case is unattended embedded systems that don't have support for shutdown and that power on automatically when power is supplied: - A brief power cycle gets detected by the driver - The kernel powers down the system and SoC goes into shutdown mode - Power is restored - The system remains oblivious to the restored power - System needs to be manually power cycled for a duration long enough to drain the capacitors With this series, such systems can configure the kernel with hw_protection=reboot to have the boot firmware worry about critical conditions. This patch (of 12): Currently __hw_protection_shutdown() either reboots or shuts down the system according to its shutdown argument. To make the logic easier to follow, both inside __hw_protection_shutdown and at caller sites, lets replace the bool parameter with an enum. This will be extra useful, when in a later commit, a third action is added to the enumeration. No functional change. Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-0-e1c09b090c0c@pengutronix.de Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-1-e1c09b090c0c@pengutronix.de Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Cc: Benson Leung <bleung@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Matteo Croce <teknoraver@meta.com> Cc: Matti Vaittinen <mazziesaccount@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: bbf0ec4f57e0 ("reboot: reboot, not shutdown, on hw_protection_reboot timeout") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10iio: core: Rework claim and release of direct mode to work with sparse.Jonathan Cameron
[ Upstream commit d795e38df4b7ebac1072bbf7d8a5500c1ea83332 ] Initial thought was to do something similar to __cond_lock() do_iio_device_claim_direct_mode(iio_dev) ? : ({ __acquire(iio_dev); 0; }) + Appropriate static inline iio_device_release_direct_mode() However with that, sparse generates false positives. E.g. drivers/iio/imu/st_lsm6dsx/st_lsm6dsx_core.c:1811:17: warning: context imbalance in 'st_lsm6dsx_read_raw' - unexpected unlock So instead, this patch rethinks the return type and makes it more 'conditional lock like' (which is part of what is going on under the hood anyway) and return a boolean - true for successfully acquired, false for did not acquire. To allow a migration path given the rework is now non trivial, take a leaf out of the naming of the conditional guard we currently have for IIO device direct mode and drop the _mode postfix from the new functions giving iio_device_claim_direct() and iio_device_release_direct() Whilst the kernel supports __cond_acquires() upstream sparse does not yet do so. Hence rely on sparse expanding a static inline wrapper to explicitly see whether __acquire() is called. Note that even with the solution here, sparse sometimes gives false positives. However in the few cases seen they were complex code structures that benefited from simplification anyway. Reviewed-by: David Lechner <dlechner@baylibre.com> Reviewed-by: Nuno Sa <nuno.sa@analog.com> Link: https://patch.msgid.link/20250209180624.701140-2-jic23@kernel.org Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Stable-dep-of: 7021d97fb89b ("iio: adc: ad7173: Grab direct mode for calibration") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10coresight-etm4x: add isb() before reading the TRCSTATRYuanfang Zhang
[ Upstream commit 4ff6039ffb79a4a8a44b63810a8a2f2b43264856 ] As recommended by section 4.3.7 ("Synchronization when using system instructions to progrom the trace unit") of ARM IHI 0064H.b, the self-hosted trace analyzer must perform a Context synchronization event between writing to the TRCPRGCTLR and reading the TRCSTATR. Additionally, add an ISB between the each read of TRCSTATR on coresight_timeout() when using system instructions to program the trace unit. Fixes: 1ab3bb9df5e3 ("coresight: etm4x: Add necessary synchronization for sysreg access") Signed-off-by: Yuanfang Zhang <quic_yuanfang@quicinc.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20250116-etm_sync-v4-1-39f2b05e9514@quicinc.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10iio: gts-helper: export iio_gts_get_total_gain()Javier Carrasco
[ Upstream commit dbd2e08ff09fd6bd51215b44474899cc1b7b7a16 ] Export this function in preparation for the fix in veml6030.c, where the total gain can be used to ease the calculation of the processed value of the IIO_LIGHT channel compared to acquiring the scale in NANO. Suggested-by: Matti Vaittinen <mazziesaccount@gmail.com> Signed-off-by: Javier Carrasco <javier.carrasco.cruz@gmail.com> Reviewed-by: Matti Vaittinen <mazziesaccount@gmail.com> Link: https://patch.msgid.link/20250127-veml6030-scale-v3-1-4f32ba03df94@gmail.com Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Stable-dep-of: 22eaca4283b2 ("iio: light: veml6030: fix scale to conform to ABI") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10RDMA/core: Don't expose hw_counters outside of init net namespaceRoman Gushchin
[ Upstream commit a1ecb30f90856b0be4168ad51b8875148e285c1f ] Commit 467f432a521a ("RDMA/core: Split port and device counter sysfs attributes") accidentally almost exposed hw counters to non-init net namespaces. It didn't expose them fully, as an attempt to read any of those counters leads to a crash like this one: [42021.807566] BUG: kernel NULL pointer dereference, address: 0000000000000028 [42021.814463] #PF: supervisor read access in kernel mode [42021.819549] #PF: error_code(0x0000) - not-present page [42021.824636] PGD 0 P4D 0 [42021.827145] Oops: 0000 [#1] SMP PTI [42021.830598] CPU: 82 PID: 2843922 Comm: switchto-defaul Kdump: loaded Tainted: G S W I XXX [42021.841697] Hardware name: XXX [42021.849619] RIP: 0010:hw_stat_device_show+0x1e/0x40 [ib_core] [42021.855362] Code: 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 49 89 d0 4c 8b 5e 20 48 8b 8f b8 04 00 00 48 81 c7 f0 fa ff ff <48> 8b 41 28 48 29 ce 48 83 c6 d0 48 c1 ee 04 69 d6 ab aa aa aa 48 [42021.873931] RSP: 0018:ffff97fe90f03da0 EFLAGS: 00010287 [42021.879108] RAX: ffff9406988a8c60 RBX: ffff940e1072d438 RCX: 0000000000000000 [42021.886169] RDX: ffff94085f1aa000 RSI: ffff93c6cbbdbcb0 RDI: ffff940c7517aef0 [42021.893230] RBP: ffff97fe90f03e70 R08: ffff94085f1aa000 R09: 0000000000000000 [42021.900294] R10: ffff94085f1aa000 R11: ffffffffc0775680 R12: ffffffff87ca2530 [42021.907355] R13: ffff940651602840 R14: ffff93c6cbbdbcb0 R15: ffff94085f1aa000 [42021.914418] FS: 00007fda1a3b9700(0000) GS:ffff94453fb80000(0000) knlGS:0000000000000000 [42021.922423] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [42021.928130] CR2: 0000000000000028 CR3: 00000042dcfb8003 CR4: 00000000003726f0 [42021.935194] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [42021.942257] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [42021.949324] Call Trace: [42021.951756] <TASK> [42021.953842] [<ffffffff86c58674>] ? show_regs+0x64/0x70 [42021.959030] [<ffffffff86c58468>] ? __die+0x78/0xc0 [42021.963874] [<ffffffff86c9ef75>] ? page_fault_oops+0x2b5/0x3b0 [42021.969749] [<ffffffff87674b92>] ? exc_page_fault+0x1a2/0x3c0 [42021.975549] [<ffffffff87801326>] ? asm_exc_page_fault+0x26/0x30 [42021.981517] [<ffffffffc0775680>] ? __pfx_show_hw_stats+0x10/0x10 [ib_core] [42021.988482] [<ffffffffc077564e>] ? hw_stat_device_show+0x1e/0x40 [ib_core] [42021.995438] [<ffffffff86ac7f8e>] dev_attr_show+0x1e/0x50 [42022.000803] [<ffffffff86a3eeb1>] sysfs_kf_seq_show+0x81/0xe0 [42022.006508] [<ffffffff86a11134>] seq_read_iter+0xf4/0x410 [42022.011954] [<ffffffff869f4b2e>] vfs_read+0x16e/0x2f0 [42022.017058] [<ffffffff869f50ee>] ksys_read+0x6e/0xe0 [42022.022073] [<ffffffff8766f1ca>] do_syscall_64+0x6a/0xa0 [42022.027441] [<ffffffff8780013b>] entry_SYSCALL_64_after_hwframe+0x78/0xe2 The problem can be reproduced using the following steps: ip netns add foo ip netns exec foo bash cat /sys/class/infiniband/mlx4_0/hw_counters/* The panic occurs because of casting the device pointer into an ib_device pointer using container_of() in hw_stat_device_show() is wrong and leads to a memory corruption. However the real problem is that hw counters should never been exposed outside of the non-init net namespace. Fix this by saving the index of the corresponding attribute group (it might be 1 or 2 depending on the presence of driver-specific attributes) and zeroing the pointer to hw_counters group for compat devices during the initialization. With this fix applied hw_counters are not available in a non-init net namespace: find /sys/class/infiniband/mlx4_0/ -name hw_counters /sys/class/infiniband/mlx4_0/ports/1/hw_counters /sys/class/infiniband/mlx4_0/ports/2/hw_counters /sys/class/infiniband/mlx4_0/hw_counters ip netns add foo ip netns exec foo bash find /sys/class/infiniband/mlx4_0/ -name hw_counters Fixes: 467f432a521a ("RDMA/core: Split port and device counter sysfs attributes") Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Maher Sanalla <msanalla@nvidia.com> Cc: linux-rdma@vger.kernel.org Cc: linux-kernel@vger.kernel.org Link: https://patch.msgid.link/20250227165420.3430301-1-roman.gushchin@linux.dev Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()David Hildenbrand
[ Upstream commit dc84bc2aba85a1508f04a936f9f9a15f64ebfb31 ] If track_pfn_copy() fails, we already added the dst VMA to the maple tree. As fork() fails, we'll cleanup the maple tree, and stumble over the dst VMA for which we neither performed any reservation nor copied any page tables. Consequently untrack_pfn() will see VM_PAT and try obtaining the PAT information from the page table -- which fails because the page table was not copied. The easiest fix would be to simply clear the VM_PAT flag of the dst VMA if track_pfn_copy() fails. However, the whole thing is about "simply" clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy() and performed a reservation, but copying the page tables fails, we'll simply clear the VM_PAT flag, not properly undoing the reservation ... which is also wrong. So let's fix it properly: set the VM_PAT flag only if the reservation succeeded (leaving it clear initially), and undo the reservation if anything goes wrong while copying the page tables: clearing the VM_PAT flag after undoing the reservation. Note that any copied page table entries will get zapped when the VMA will get removed later, after copy_page_range() succeeded; as VM_PAT is not set then, we won't try cleaning VM_PAT up once more and untrack_pfn() will be happy. Note that leaving these page tables in place without a reservation is not a problem, as we are aborting fork(); this process will never run. A reproducer can trigger this usually at the first try: https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/reproducers/pat_fork.c WARNING: CPU: 26 PID: 11650 at arch/x86/mm/pat/memtype.c:983 get_pat_info+0xf6/0x110 Modules linked in: ... CPU: 26 UID: 0 PID: 11650 Comm: repro3 Not tainted 6.12.0-rc5+ #92 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014 RIP: 0010:get_pat_info+0xf6/0x110 ... Call Trace: <TASK> ... untrack_pfn+0x52/0x110 unmap_single_vma+0xa6/0xe0 unmap_vmas+0x105/0x1f0 exit_mmap+0xf6/0x460 __mmput+0x4b/0x120 copy_process+0x1bf6/0x2aa0 kernel_clone+0xab/0x440 __do_sys_clone+0x66/0x90 do_syscall_64+0x95/0x180 Likely this case was missed in: d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed") ... and instead of undoing the reservation we simply cleared the VM_PAT flag. Keep the documentation of these functions in include/linux/pgtable.h, one place is more than sufficient -- we should clean that up for the other functions like track_pfn_remap/untrack_pfn separately. Fixes: d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed") Fixes: 2ab640379a0a ("x86: PAT: hooks in generic vm code to help archs to track pfnmap regions - v3") Reported-by: xingwei lee <xrivendell7@gmail.com> Reported-by: yuxin wang <wang1315768607@163.com> Reported-by: Marius Fleischer <fleischermarius@gmail.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20250321112323.153741-1-david@redhat.com Closes: https://lore.kernel.org/lkml/CABOYnLx_dnqzpCW99G81DmOr+2UzdmZMk=T3uxwNxwz+R1RAwg@mail.gmail.com/ Closes: https://lore.kernel.org/lkml/CAJg=8jwijTP5fre8woS4JVJQ8iUA6v+iNcsOgtj9Zfpc3obDOQ@mail.gmail.com/ Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10of: property: Increase NR_FWNODE_REFERENCE_ARGSZijun Hu
[ Upstream commit eb50844d728f11e87491f7c7af15a4a737f1159d ] Currently, the following two macros have different values: // The maximal argument count for firmware node reference #define NR_FWNODE_REFERENCE_ARGS 8 // The maximal argument count for DT node reference #define MAX_PHANDLE_ARGS 16 It may cause firmware node reference's argument count out of range if directly assign DT node reference's argument count to firmware's. drivers/of/property.c:of_fwnode_get_reference_args() is doing the direct assignment, so may cause firmware's argument count @args->nargs got out of range, namely, in [9, 16]. Fix by increasing NR_FWNODE_REFERENCE_ARGS to 16 to meet DT requirement. Will align both macros later to avoid such inconsistency. Fixes: 3e3119d3088f ("device property: Introduce fwnode_property_get_reference_args") Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com> Link: https://lore.kernel.org/r/20250225-fix_arg_count-v4-1-13cdc519eb31@quicinc.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10drm/file: Add fdinfo helper for printing regions with prefixAdrián Larumbe
[ Upstream commit af6c2b7c46e16701fba44a21326cb634786e3e71 ] This is motivated by the desire of some drivers (eg. Panthor) to print the size of internal memory regions with a prefix that reflects the driver name, as suggested in the previous documentation commit. That means adding a new argument to print_size and making it available for DRM users. Cc: Tvrtko Ursulin <tursulin@ursulin.net> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250130172851.941597-3-adrian.larumbe@collabora.com Stable-dep-of: e379856b428a ("drm/panthor: Replace sleep locks with spinlocks in fdinfo path") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10drm/dp_mst: Fix drm RAD printWayne Lin
[ Upstream commit 6bbce873a9c97cb12f5455c497be279ac58e707f ] [Why] The RAD of sideband message printed today is incorrect. For RAD stored within MST branch - If MST branch LCT is 1, it's RAD array is untouched and remained as 0. - If MST branch LCT is larger than 1, use nibble to store the up facing port number in cascaded sequence as illustrated below: u8 RAD[0] = (LCT_2_UFP << 4) | LCT_3_UFP RAD[1] = (LCT_4_UFP << 4) | LCT_5_UFP ... In drm_dp_mst_rad_to_str(), it wrongly to use BIT_MASK(4) to fetch the port number of one nibble. [How] Adjust the code by: - RAD array items are valuable only for LCT >= 1. - Use 0xF as the mask to replace BIT_MASK(4) V2: - Document how RAD is constructed (Imre) V3: - Adjust the comment for rad[] so kdoc formats it properly (Lyude) Fixes: 2f015ec6eab6 ("drm/dp_mst: Add sideband down request tracing + selftests") Cc: Imre Deak <imre.deak@intel.com> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: Harry Wentland <hwentlan@amd.com> Cc: Lyude Paul <lyude@redhat.com> Reviewed-by: Lyude Paul <lyude@redhat.com> Signed-off-by: Wayne Lin <Wayne.Lin@amd.com> Signed-off-by: Lyude Paul <lyude@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250113091100.3314533-2-Wayne.Lin@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10rwonce: fix crash by removing READ_ONCE() for unaligned readJann Horn
[ Upstream commit 47a60391ae0ed04ffbb9bd8dcd94ad9d08b41288 ] When arm64 is built with LTO, it upgrades READ_ONCE() to ldar / ldapr (load-acquire) to avoid issues that can be caused by the compiler optimizing away implicit address dependencies. Unlike plain loads, these load-acquire instructions actually require an aligned address. For now, fix it by removing the READ_ONCE() that the buggy commit introduced. Fixes: ece69af2ede1 ("rwonce: handle KCSAN like KASAN in read_word_at_a_time()") Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/r/20250326203926.GA10484@ax162 Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10rwonce: handle KCSAN like KASAN in read_word_at_a_time()Jann Horn
[ Upstream commit ece69af2ede103e190ffdfccd9f9ec850606ab5e ] read_word_at_a_time() is allowed to read out of bounds by straddling the end of an allocation (and the caller is expected to then mask off out-of-bounds data). This works as long as the caller guarantees that the access won't hit a pagefault (either by ensuring that addr is aligned or by explicitly checking where the next page boundary is). Such out-of-bounds data could include things like KASAN redzones, adjacent allocations that are concurrently written to, or simply an adjacent struct field that is concurrently updated. KCSAN should ignore racy reads of OOB data that is not actually used, just like KASAN, so (similar to the code above) change read_word_at_a_time() to use __no_sanitize_or_inline instead of __no_kasan_or_inline, and explicitly inform KCSAN that we're reading the first byte. We do have an instrument_read() helper that calls into both KASAN and KCSAN, but I'm instead open-coding that here to avoid having to pull the entire instrumented.h header into rwonce.h. Also, since this read can be racy by design, we should technically do READ_ONCE(), so add that. Fixes: dfd402a4c4ba ("kcsan: Add Kernel Concurrency Sanitizer infrastructure") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Marco Elver <elver@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10Bluetooth: HCI: Add definition of hci_rp_remote_name_req_cancelWentao Guan
[ Upstream commit e8c00f5433d020a2230226abe7e43f43dc686920 ] Return Parameters is not only status, also bdaddr: BLUETOOTH CORE SPECIFICATION Version 5.4 | Vol 4, Part E page 1870: BLUETOOTH CORE SPECIFICATION Version 5.0 | Vol 2, Part E page 802: Return parameters: Status: Size: 1 octet BD_ADDR: Size: 6 octets Note that it also fixes the warning: "Bluetooth: hci0: unexpected cc 0x041a length: 7 > 1" Fixes: c8992cffbe741 ("Bluetooth: hci_event: Use of a function table to handle Command Complete") Signed-off-by: Wentao Guan <guanwentao@uniontech.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10Bluetooth: hci_core: Enable buffer flow control for SCO/eSCOLuiz Augusto von Dentz
[ Upstream commit 13218453521d75916dfed55efb8e809bfc03cb4b ] This enables buffer flow control for SCO/eSCO (see: Bluetooth Core 6.0 spec: 6.22. Synchronous Flow Control Enable), recently this has caused the following problem and is actually a nice addition for the likes of Socket TX complete: < HCI Command: Read Buffer Size (0x04|0x0005) plen 0 > HCI Event: Command Complete (0x0e) plen 11 Read Buffer Size (0x04|0x0005) ncmd 1 Status: Success (0x00) ACL MTU: 1021 ACL max packet: 5 SCO MTU: 240 SCO max packet: 8 ... < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 > HCI Event: Hardware Error (0x10) plen 1 Code: 0x0a To fix the code will now attempt to enable buffer flow control when HCI_QUIRK_SYNC_FLOWCTL_SUPPORTED is set by the driver: < HCI Command: Write Sync Fl.. (0x03|0x002f) plen 1 Flow control: Enabled (0x01) > HCI Event: Command Complete (0x0e) plen 4 Write Sync Flow Control Enable (0x03|0x002f) ncmd 1 Status: Success (0x00) On success then HCI_SCO_FLOWCTL would be set which indicates sco_cnt shall be used for flow contro. Fixes: 7fedd3bb6b77 ("Bluetooth: Prioritize SCO traffic") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Tested-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10Bluetooth: Add quirk for broken READ_PAGE_SCAN_TYPEPedro Nishiyama
[ Upstream commit 127881334eaad639e0a19a399ee8c91d6c9dc982 ] Some fake controllers cannot be initialized because they return a smaller report than expected for READ_PAGE_SCAN_TYPE. Signed-off-by: Pedro Nishiyama <nishiyama.pedro@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Stable-dep-of: 1f04b0e5e3b9 ("Bluetooth: btusb: Fix regression in the initialization of fake Bluetooth controllers") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10Bluetooth: Add quirk for broken READ_VOICE_SETTINGPedro Nishiyama
[ Upstream commit ff26b2dd6568392f60fa67a4e58279938025c3af ] Some fake controllers cannot be initialized because they return a smaller report than expected for READ_VOICE_SETTING. Signed-off-by: Pedro Nishiyama <nishiyama.pedro@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Stable-dep-of: 1f04b0e5e3b9 ("Bluetooth: btusb: Fix regression in the initialization of fake Bluetooth controllers") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10bonding: check xdp prog when set bond modeWang Liang
[ Upstream commit 094ee6017ea09c11d6af187935a949df32803ce0 ] Following operations can trigger a warning[1]: ip netns add ns1 ip netns exec ns1 ip link add bond0 type bond mode balance-rr ip netns exec ns1 ip link set dev bond0 xdp obj af_xdp_kern.o sec xdp ip netns exec ns1 ip link set bond0 type bond mode broadcast ip netns del ns1 When delete the namespace, dev_xdp_uninstall() is called to remove xdp program on bond dev, and bond_xdp_set() will check the bond mode. If bond mode is changed after attaching xdp program, the warning may occur. Some bond modes (broadcast, etc.) do not support native xdp. Set bond mode with xdp program attached is not good. Add check for xdp program when set bond mode. [1] ------------[ cut here ]------------ WARNING: CPU: 0 PID: 11 at net/core/dev.c:9912 unregister_netdevice_many_notify+0x8d9/0x930 Modules linked in: CPU: 0 UID: 0 PID: 11 Comm: kworker/u4:0 Not tainted 6.14.0-rc4 #107 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 Workqueue: netns cleanup_net RIP: 0010:unregister_netdevice_many_notify+0x8d9/0x930 Code: 00 00 48 c7 c6 6f e3 a2 82 48 c7 c7 d0 b3 96 82 e8 9c 10 3e ... RSP: 0018:ffffc90000063d80 EFLAGS: 00000282 RAX: 00000000ffffffa1 RBX: ffff888004959000 RCX: 00000000ffffdfff RDX: 0000000000000000 RSI: 00000000ffffffea RDI: ffffc90000063b48 RBP: ffffc90000063e28 R08: ffffffff82d39b28 R09: 0000000000009ffb R10: 0000000000000175 R11: ffffffff82d09b40 R12: ffff8880049598e8 R13: 0000000000000001 R14: dead000000000100 R15: ffffc90000045000 FS: 0000000000000000(0000) GS:ffff888007a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000d406b60 CR3: 000000000483e000 CR4: 00000000000006f0 Call Trace: <TASK> ? __warn+0x83/0x130 ? unregister_netdevice_many_notify+0x8d9/0x930 ? report_bug+0x18e/0x1a0 ? handle_bug+0x54/0x90 ? exc_invalid_op+0x18/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? unregister_netdevice_many_notify+0x8d9/0x930 ? bond_net_exit_batch_rtnl+0x5c/0x90 cleanup_net+0x237/0x3d0 process_one_work+0x163/0x390 worker_thread+0x293/0x3b0 ? __pfx_worker_thread+0x10/0x10 kthread+0xec/0x1e0 ? __pfx_kthread+0x10/0x10 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2f/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> ---[ end trace 0000000000000000 ]--- Fixes: 9e2ee5c7e7c3 ("net, bonding: Add XDP support to the bonding driver") Signed-off-by: Wang Liang <wangliang74@huawei.com> Acked-by: Jussi Maki <joamaki@gmail.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250321044852.1086551-1-wangliang74@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10ax25: Remove broken autobindMurad Masimov
[ Upstream commit 2f6efbabceb6b2914ee9bafb86d9a51feae9cce8 ] Binding AX25 socket by using the autobind feature leads to memory leaks in ax25_connect() and also refcount leaks in ax25_release(). Memory leak was detected with kmemleak: ================================================================ unreferenced object 0xffff8880253cd680 (size 96): backtrace: __kmalloc_node_track_caller_noprof (./include/linux/kmemleak.h:43) kmemdup_noprof (mm/util.c:136) ax25_rt_autobind (net/ax25/ax25_route.c:428) ax25_connect (net/ax25/af_ax25.c:1282) __sys_connect_file (net/socket.c:2045) __sys_connect (net/socket.c:2064) __x64_sys_connect (net/socket.c:2067) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) ================================================================ When socket is bound, refcounts must be incremented the way it is done in ax25_bind() and ax25_setsockopt() (SO_BINDTODEVICE). In case of autobind, the refcounts are not incremented. This bug leads to the following issue reported by Syzkaller: ================================================================ ax25_connect(): syz-executor318 uses autobind, please contact jreuter@yaina.de ------------[ cut here ]------------ refcount_t: decrement hit 0; leaking memory. WARNING: CPU: 0 PID: 5317 at lib/refcount.c:31 refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31 Modules linked in: CPU: 0 UID: 0 PID: 5317 Comm: syz-executor318 Not tainted 6.14.0-rc4-syzkaller-00278-gece144f151ac #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31 ... Call Trace: <TASK> __refcount_dec include/linux/refcount.h:336 [inline] refcount_dec include/linux/refcount.h:351 [inline] ref_tracker_free+0x6af/0x7e0 lib/ref_tracker.c:236 netdev_tracker_free include/linux/netdevice.h:4302 [inline] netdev_put include/linux/netdevice.h:4319 [inline] ax25_release+0x368/0x960 net/ax25/af_ax25.c:1080 __sock_release net/socket.c:647 [inline] sock_close+0xbc/0x240 net/socket.c:1398 __fput+0x3e9/0x9f0 fs/file_table.c:464 __do_sys_close fs/open.c:1580 [inline] __se_sys_close fs/open.c:1565 [inline] __x64_sys_close+0x7f/0x110 fs/open.c:1565 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f ... </TASK> ================================================================ Considering the issues above and the comments left in the code that say: "check if we can remove this feature. It is broken."; "autobinding in this may or may not work"; - it is better to completely remove this feature than to fix it because it is broken and leads to various kinds of memory bugs. Now calling connect() without first binding socket will result in an error (-EINVAL). Userspace software that relies on the autobind feature might get broken. However, this feature does not seem widely used with this specific driver as it was not reliable at any point of time, and it is already broken anyway. E.g. ax25-tools and ax25-apps packages for popular distributions do not use the autobind feature for AF_AX25. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+33841dc6aa3e1d86b78a@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=33841dc6aa3e1d86b78a Signed-off-by: Murad Masimov <m.masimov@mt-integration.ru> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10net: Remove RTNL dance for SIOCBRADDIF and SIOCBRDELIF.Kuniyuki Iwashima
[ Upstream commit ed3ba9b6e280e14cc3148c1b226ba453f02fa76c ] SIOCBRDELIF is passed to dev_ioctl() first and later forwarded to br_ioctl_call(), which causes unnecessary RTNL dance and the splat below [0] under RTNL pressure. Let's say Thread A is trying to detach a device from a bridge and Thread B is trying to remove the bridge. In dev_ioctl(), Thread A bumps the bridge device's refcnt by netdev_hold() and releases RTNL because the following br_ioctl_call() also re-acquires RTNL. In the race window, Thread B could acquire RTNL and try to remove the bridge device. Then, rtnl_unlock() by Thread B will release RTNL and wait for netdev_put() by Thread A. Thread A, however, must hold RTNL after the unlock in dev_ifsioc(), which may take long under RTNL pressure, resulting in the splat by Thread B. Thread A (SIOCBRDELIF) Thread B (SIOCBRDELBR) ---------------------- ---------------------- sock_ioctl sock_ioctl `- sock_do_ioctl `- br_ioctl_call `- dev_ioctl `- br_ioctl_stub |- rtnl_lock | |- dev_ifsioc ' ' |- dev = __dev_get_by_name(...) |- netdev_hold(dev, ...) . / |- rtnl_unlock ------. | | |- br_ioctl_call `---> |- rtnl_lock Race | | `- br_ioctl_stub |- br_del_bridge Window | | | |- dev = __dev_get_by_name(...) | | | May take long | `- br_dev_delete(dev, ...) | | | under RTNL pressure | `- unregister_netdevice_queue(dev, ...) | | | | `- rtnl_unlock \ | |- rtnl_lock <-' `- netdev_run_todo | |- ... `- netdev_run_todo | `- rtnl_unlock |- __rtnl_unlock | |- netdev_wait_allrefs_any |- netdev_put(dev, ...) <----------------' Wait refcnt decrement and log splat below To avoid blocking SIOCBRDELBR unnecessarily, let's not call dev_ioctl() for SIOCBRADDIF and SIOCBRDELIF. In the dev_ioctl() path, we do the following: 1. Copy struct ifreq by get_user_ifreq in sock_do_ioctl() 2. Check CAP_NET_ADMIN in dev_ioctl() 3. Call dev_load() in dev_ioctl() 4. Fetch the master dev from ifr.ifr_name in dev_ifsioc() 3. can be done by request_module() in br_ioctl_call(), so we move 1., 2., and 4. to br_ioctl_stub(). Note that 2. is also checked later in add_del_if(), but it's better performed before RTNL. SIOCBRADDIF and SIOCBRDELIF have been processed in dev_ioctl() since the pre-git era, and there seems to be no specific reason to process them there. [0]: unregister_netdevice: waiting for wpan3 to become free. Usage count = 2 ref_tracker: wpan3@ffff8880662d8608 has 1/1 users at __netdev_tracker_alloc include/linux/netdevice.h:4282 [inline] netdev_hold include/linux/netdevice.h:4311 [inline] dev_ifsioc+0xc6a/0x1160 net/core/dev_ioctl.c:624 dev_ioctl+0x255/0x10c0 net/core/dev_ioctl.c:826 sock_do_ioctl+0x1ca/0x260 net/socket.c:1213 sock_ioctl+0x23a/0x6c0 net/socket.c:1318 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl fs/ioctl.c:892 [inline] __x64_sys_ioctl+0x1a4/0x210 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcb/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: 893b19587534 ("net: bridge: fix ioctl locking") Reported-by: syzkaller <syzkaller@googlegroups.com> Reported-by: yan kang <kangyan91@outlook.com> Reported-by: yue sun <samsun1006219@gmail.com> Closes: https://lore.kernel.org/netdev/SY8P300MB0421225D54EB92762AE8F0F2A1D32@SY8P300MB0421.AUSP300.PROD.OUTLOOK.COM/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250316192851.19781-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10virtchnl: make proto and filter action count unsignedJan Glaza
[ Upstream commit db5e8ea155fc1d89c87cb81f0e4a681a77b9b03f ] The count field in virtchnl_proto_hdrs and virtchnl_filter_action_set should never be negative while still being valid. Changing it from int to u32 ensures proper handling of values in virtchnl messages in driverrs and prevents unintended behavior. In its current signed form, a negative count does not trigger an error in ice driver but instead results in it being treated as 0. This can lead to unexpected outcomes when processing messages. By using u32, any invalid values will correctly trigger -EINVAL, making error detection more robust. Fixes: 1f7ea1cd6a374 ("ice: Enable FDIR Configure for AVF") Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jan Glaza <jan.glaza@intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10badblocks: use sector_t instead of int to avoid truncation of badblocks lengthZheng Qixing
[ Upstream commit d301f164c3fbff611bd71f57dfa553b9219f0f5e ] There is a truncation of badblocks length issue when set badblocks as follow: echo "2055 4294967299" > bad_blocks cat bad_blocks 2055 3 Change 'sectors' argument type from 'int' to 'sector_t'. This change avoids truncation of badblocks length for large sectors by replacing 'int' with 'sector_t' (u64), enabling proper handling of larger disk sizes and ensuring compatibility with 64-bit sector addressing. Fixes: 9e0e252a048b ("badblocks: Add core badblock management code") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-13-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10badblocks: return boolean from badblocks_set() and badblocks_clear()Zheng Qixing
[ Upstream commit c8775aefba959cdfbaa25408a84d3dd15bbeb991 ] Change the return type of badblocks_set() and badblocks_clear() from int to bool, indicating success or failure. Specifically: - _badblocks_set() and _badblocks_clear() functions now return true for success and false for failure. - All calls to these functions are updated to handle the new boolean return type. - This change improves code clarity and ensures a more consistent handling of success and failure states. Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Acked-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20250227075507.151331-11-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: d301f164c3fb ("badblocks: use sector_t instead of int to avoid truncation of badblocks length") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10tracing: Fix DECLARE_TRACE_CONDITIONGabriele Monaco
[ Upstream commit 486df3466daf7b185f534a7408fa6f9dbb16dbeb ] Commit 287050d39026 ("tracing: Add TRACE_EVENT_CONDITIONAL()") adds macros to define conditional trace events (TRACE_EVENT_CONDITIONAL) and tracepoints (DECLARE_TRACE_CONDITION), but sets up functionality for direct use only for the former. Add preprocessor bits in define_trace.h to allow usage of DECLARE_TRACE_CONDITION just like DECLARE_TRACE. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/20250218123121.253551-2-gmonaco@redhat.com Fixes: 287050d39026 ("tracing: Add TRACE_EVENT_CONDITIONAL()") Link: https://lore.kernel.org/linux-trace-kernel/20250128111926.303093-1-gmonaco@redhat.com Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10xfrm: delay initialization of offload path till its actually requestedLeon Romanovsky
[ Upstream commit 585b64f5a62089ef42889b106b063d089feb6599 ] XFRM offload path is probed even if offload isn't needed at all. Let's make sure that x->type_offload pointer stays NULL for such path to reduce ambiguity. Fixes: 9d389d7f84bb ("xfrm: Add a xfrm type offload.") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10firmware: arm_ffa: Unregister the FF-A devices when cleaning up the partitionsSudeep Holla
[ Upstream commit 46dcd68aaccac0812c12ec3f4e59c8963e2760ad ] Both the FF-A core and the bus were in a single module before the commit 18c250bd7ed0 ("firmware: arm_ffa: Split bus and driver into distinct modules"). The arm_ffa_bus_exit() takes care of unregistering all the FF-A devices. Now that there are 2 distinct modules, if the core driver is unloaded and reloaded, it will end up adding duplicate FF-A devices as the previously registered devices weren't unregistered when we cleaned up the modules. Fix the same by unregistering all the FF-A devices on the FF-A bus during the cleaning up of the partitions and hence the cleanup of the module. Fixes: 18c250bd7ed0 ("firmware: arm_ffa: Split bus and driver into distinct modules") Tested-by: Viresh Kumar <viresh.kumar@linaro.org> Message-Id: <20250217-ffa_updates-v3-8-bd1d9de615e7@arm.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10sched/deadline: Rebuild root domain accounting after every updateJuri Lelli
[ Upstream commit 2ff899e3516437354204423ef0a94994717b8e6a ] Rebuilding of root domains accounting information (total_bw) is currently broken on some cases, e.g. suspend/resume on aarch64. Problem is that the way we keep track of domain changes and try to add bandwidth back is convoluted and fragile. Fix it by simplify things by making sure bandwidth accounting is cleared and completely restored after root domains changes (after root domains are again stable). To be sure we always call dl_rebuild_rd_accounting while holding cpuset_mutex we also add cpuset_reset_sched_domains() wrapper. Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <jonathanh@nvidia.com> Co-developed-by: Waiman Long <llong@redhat.com> Signed-off-by: Waiman Long <llong@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MRfeJKJUOyUSto@jlelli-thinkpadt14gen4.remote.csb Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10sched/deadline: Generalize unique visiting of root domainsJuri Lelli
[ Upstream commit 45007c6fb5860cf63556a9cadc87c8984927e23d ] Bandwidth checks and updates that work on root domains currently employ a cookie mechanism for efficiency. This mechanism is very much tied to when root domains are first created and initialized. Generalize the cookie mechanism so that it can be used also later at runtime while updating root domains. Also, additionally guard it with sched_domains_mutex, since domains need to be stable while updating them (and it will be required for further dynamic changes). Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MQaiXPvEeW_v7x@jlelli-thinkpadt14gen4.remote.csb Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10sched/topology: Wrappers for sched_domains_mutexJuri Lelli
[ Upstream commit 56209334dda1832c0a919e1d74768c6d0f3b2ca9 ] Create wrappers for sched_domains_mutex so that it can transparently be used on both CONFIG_SMP and !CONFIG_SMP, as some function will need to do. Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MP5Oq9RB8jBs3y@jlelli-thinkpadt14gen4.remote.csb Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10perf: Supply task information to sched_task()Kan Liang
[ Upstream commit d57e94f5b891925e4f2796266eba31edd5a01903 ] To save/restore LBR call stack data in system-wide mode, the task_struct information is required. Extend the parameters of sched_task() to supply task_struct information. When schedule in, the LBR call stack data for new task will be restored. When schedule out, the LBR call stack data for old task will be saved. Only need to pass the required task_struct information. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250314172700.438923-4-kan.liang@linux.intel.com Stable-dep-of: 3cec9fd03543 ("perf/x86/lbr: Fix shorter LBRs call stacks for the system-wide mode") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10perf: Save PMU specific data in task_structKan Liang
[ Upstream commit cb4369129339060218baca718a578bb0b826e734 ] Some PMU specific data has to be saved/restored during context switch, e.g. LBR call stack data. Currently, the data is saved in event context structure, but only for per-process event. For system-wide event, because of missing the LBR call stack data after context switch, LBR callstacks are always shorter in comparison to per-process mode. For example, Per-process mode: $perf record --call-graph lbr -- taskset -c 0 ./tchain_edit - 99.90% 99.86% tchain_edit tchain_edit [.] f3 99.86% _start __libc_start_main generic_start_main main f1 - f2 f3 System-wide mode: $perf record --call-graph lbr -a -- taskset -c 0 ./tchain_edit - 99.88% 99.82% tchain_edit tchain_edit [.] f3 - 62.02% main f1 f2 f3 - 28.83% f1 - f2 f3 - 28.83% f1 - f2 f3 - 8.88% generic_start_main main f1 f2 f3 It isn't practical to simply allocate the data for system-wide event in CPU context structure for all tasks. We have no idea which CPU a task will be scheduled to. The duplicated LBR data has to be maintained on every CPU context structure. That's a huge waste. Otherwise, the LBR data still lost if the task is scheduled to another CPU. Save the pmu specific data in task_struct. The size of pmu specific data is 788 bytes for LBR call stack. Usually, the overall amount of threads doesn't exceed a few thousands. For 10K threads, keeping LBR data would consume additional ~8MB. The additional space will only be allocated during LBR call stack monitoring. It will be released when the monitoring is finished. Furthermore, moving task_ctx_data from perf_event_context to task_struct can reduce complexity and make things clearer. E.g. perf doesn't need to swap task_ctx_data on optimized context switch path. This patch set is just the first step. There could be other optimization/extension on top of this patch set. E.g. for cgroup profiling, perf just needs to save/store the LBR call stack information for tasks in specific cgroup. That could reduce the additional space. Also, the LBR call stack can be available for software events, or allow even debugging use cases, like LBRs on crash later. Because of the alignment requirement of Intel Arch LBR, the Kmem cache is used to allocate the PMU specific data. It's required when child task allocates the space. Save it in struct perf_ctx_data. The refcount in struct perf_ctx_data is used to track the users of pmu specific data. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Alexey Budankov <alexey.budankov@linux.intel.com> Link: https://lore.kernel.org/r/20250314172700.438923-1-kan.liang@linux.intel.com Stable-dep-of: 3cec9fd03543 ("perf/x86/lbr: Fix shorter LBRs call stacks for the system-wide mode") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10lockdep: Don't disable interrupts on RT in disable_irq_nosync_lockdep.*()Sebastian Andrzej Siewior
[ Upstream commit 87886b32d669abc11c7be95ef44099215e4f5788 ] disable_irq_nosync_lockdep() disables interrupts with lockdep enabled to avoid false positive reports by lockdep that a certain lock has not been acquired with disabled interrupts. The user of this macros expects that a lock can be acquried without disabling interrupts because the IRQ line triggering the interrupt is disabled. This triggers a warning on PREEMPT_RT because after disable_irq_nosync_lockdep.*() the following spinlock_t now is acquired with disabled interrupts. On PREEMPT_RT there is no difference between spin_lock() and spin_lock_irq() so avoiding disabling interrupts in this case works for the two remaining callers as of today. Don't disable interrupts on PREEMPT_RT in disable_irq_nosync_lockdep.*(). Closes: https://lore.kernel.org/760e34f9-6034-40e0-82a5-ee9becd24438@roeck-us.net Fixes: e8106b941ceab ("[PATCH] lockdep: core, add enable/disable_irq_irqsave/irqrestore() APIs") Reported-by: Guenter Roeck <linux@roeck-us.net> Suggested-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20250212103619.2560503-2-bigeasy@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10dma: Introduce generic dma_addr_*crypted helpersSuzuki K Poulose
[ Upstream commit b66e2ee7b6c8d45bbe4b6f6885ee27511506812c ] AMD SME added __sme_set/__sme_clr primitives to modify the DMA address for encrypted/decrypted traffic. However this doesn't fit in with other models, e.g., Arm CCA where the meanings are the opposite. i.e., "decrypted" traffic has a bit set and "encrypted" traffic has the top bit cleared. In preparation for adding the support for Arm CCA DMA conversions, convert the existing primitives to more generic ones that can be provided by the backends. i.e., add helpers to 1. dma_addr_encrypted - Convert a DMA address to "encrypted" [ == __sme_set() ] 2. dma_addr_unencrypted - Convert a DMA address to "decrypted" [ None exists today ] 3. dma_addr_canonical - Clear any "encryption"/"decryption" bits from DMA address [ SME uses __sme_clr() ] and convert to a canonical DMA address. Since the original __sme_xxx helpers come from linux/mem_encrypt.h, use that as the home for the new definitions and provide dummy ones when none is provided by the architectures. With the above, phys_to_dma_unencrypted() uses the newly added dma_addr_unencrypted() helper and to make it a bit more easier to read and avoid double conversion, provide __phys_to_dma(). Suggested-by: Robin Murphy <robin.murphy@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Steven Price <steven.price@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Fixes: 42be24a4178f ("arm64: Enable memory encrypt for Realms") Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250227144150.1667735-3-suzuki.poulose@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10dma: Fix encryption bit clearing for dma_to_physSuzuki K Poulose
[ Upstream commit c380931712d16e23f6aa90703f438330139e9731 ] phys_to_dma() sets the encryption bit on the translated DMA address. But dma_to_phys() clears the encryption bit after it has been translated back to the physical address, which could fail if the device uses DMA ranges. AMD SME doesn't use the DMA ranges and thus this is harmless. But as we are about to add support for other architectures, let us fix this. Reported-by: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Link: https://lkml.kernel.org/r/yq5amsen9stc.fsf@kernel.org Cc: Will Deacon <will@kernel.org> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Steven Price <steven.price@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Fixes: 42be24a4178f ("arm64: Enable memory encrypt for Realms") Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250227144150.1667735-2-suzuki.poulose@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10watchdog/hardlockup/perf: Fix perf_event memory leakLi Huafei
[ Upstream commit d6834d9c990333bfa433bc1816e2417f268eebbe ] During stress-testing, we found a kmemleak report for perf_event: unreferenced object 0xff110001410a33e0 (size 1328): comm "kworker/4:11", pid 288, jiffies 4294916004 hex dump (first 32 bytes): b8 be c2 3b 02 00 11 ff 22 01 00 00 00 00 ad de ...;...."....... f0 33 0a 41 01 00 11 ff f0 33 0a 41 01 00 11 ff .3.A.....3.A.... backtrace (crc 24eb7b3a): [<00000000e211b653>] kmem_cache_alloc_node_noprof+0x269/0x2e0 [<000000009d0985fa>] perf_event_alloc+0x5f/0xcf0 [<00000000084ad4a2>] perf_event_create_kernel_counter+0x38/0x1b0 [<00000000fde96401>] hardlockup_detector_event_create+0x50/0xe0 [<0000000051183158>] watchdog_hardlockup_enable+0x17/0x70 [<00000000ac89727f>] softlockup_start_fn+0x15/0x40 ... Our stress test includes CPU online and offline cycles, and updating the watchdog configuration. After reading the code, I found that there may be a race between cleaning up perf_event after updating watchdog and disabling event when the CPU goes offline: CPU0 CPU1 CPU2 (update watchdog) (hotplug offline CPU1) ... _cpu_down(CPU1) cpus_read_lock() // waiting for cpu lock softlockup_start_all smp_call_on_cpu(CPU1) softlockup_start_fn ... watchdog_hardlockup_enable(CPU1) perf create E1 watchdog_ev[CPU1] = E1 cpus_read_unlock() cpus_write_lock() cpuhp_kick_ap_work(CPU1) cpuhp_thread_fun ... watchdog_hardlockup_disable(CPU1) watchdog_ev[CPU1] = NULL dead_event[CPU1] = E1 __lockup_detector_cleanup for each dead_events_mask release each dead_event /* * CPU1 has not been added to * dead_events_mask, then E1 * will not be released */ CPU1 -> dead_events_mask cpumask_clear(&dead_events_mask) // dead_events_mask is cleared, E1 is leaked In this case, the leaked perf_event E1 matches the perf_event leak reported by kmemleak. Due to the low probability of problem recurrence (only reported once), I added some hack delays in the code: static void __lockup_detector_reconfigure(void) { ... watchdog_hardlockup_start(); cpus_read_unlock(); + mdelay(100); /* * Must be called outside the cpus locked section to prevent * recursive locking in the perf code. ... } void watchdog_hardlockup_disable(unsigned int cpu) { ... perf_event_disable(event); this_cpu_write(watchdog_ev, NULL); this_cpu_write(dead_event, event); + mdelay(100); cpumask_set_cpu(smp_processor_id(), &dead_events_mask); atomic_dec(&watchdog_cpus); ... } void hardlockup_detector_perf_cleanup(void) { ... perf_event_release_kernel(event); per_cpu(dead_event, cpu) = NULL; } + mdelay(100); cpumask_clear(&dead_events_mask); } Then, simultaneously performing CPU on/off and switching watchdog, it is almost certain to reproduce this leak. The problem here is that releasing perf_event is not within the CPU hotplug read-write lock. Commit: 941154bd6937 ("watchdog/hardlockup/perf: Prevent CPU hotplug deadlock") introduced deferred release to solve the deadlock caused by calling get_online_cpus() when releasing perf_event. Later, commit: efe951d3de91 ("perf/x86: Fix perf,x86,cpuhp deadlock") removed the get_online_cpus() call on the perf_event release path to solve another deadlock problem. Therefore, it is now possible to move the release of perf_event back into the CPU hotplug read-write lock, and release the event immediately after disabling it. Fixes: 941154bd6937 ("watchdog/hardlockup/perf: Prevent CPU hotplug deadlock") Signed-off-by: Li Huafei <lihuafei1@huawei.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20241021193004.308303-1-lihuafei1@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10PM: sleep: Adjust check before setting power.must_resumeRafael J. Wysocki
[ Upstream commit eeb87d17aceab7803a5a5bcb6cf2817b745157cf ] The check before setting power.must_resume in device_suspend_noirq() does not take power.child_count into account, but it should do that, so use pm_runtime_need_not_resume() in it for this purpose and adjust the comment next to it accordingly. Fixes: 107d47b2b95e ("PM: sleep: core: Simplify the SMART_SUSPEND flag handling") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://patch.msgid.link/3353728.44csPzL39Z@rjwysocki.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10seccomp: fix the __secure_computing() stub for !HAVE_ARCH_SECCOMP_FILTEROleg Nesterov
[ Upstream commit b37778bec82ba82058912ca069881397197cd3d5 ] Depending on CONFIG_HAVE_ARCH_SECCOMP_FILTER, __secure_computing(NULL) will crash or not. This is not consistent/safe, especially considering that after the previous change __secure_computing(sd) is always called with sd == NULL. Fortunately, if CONFIG_HAVE_ARCH_SECCOMP_FILTER=n, __secure_computing() has no callers, these architectures use secure_computing_strict(). Yet it make sense make __secure_computing(NULL) safe in this case. Note also that with this change we can unexport secure_computing_strict() and change the current callers to use __secure_computing(NULL). Fixes: 8cf8dfceebda ("seccomp: Stub for !HAVE_ARCH_SECCOMP_FILTER") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/20250128150307.GA15325@redhat.com Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-22keys: Fix UAF in key_put()David Howells
Once a key's reference count has been reduced to 0, the garbage collector thread may destroy it at any time and so key_put() is not allowed to touch the key after that point. The most key_put() is normally allowed to do is to touch key_gc_work as that's a static global variable. However, in an effort to speed up the reclamation of quota, this is now done in key_put() once the key's usage is reduced to 0 - but now the code is looking at the key after the deadline, which is forbidden. Fix this by using a flag to indicate that a key can be gc'd now rather than looking at the key's refcount in the garbage collector. Fixes: 9578e327b2b4 ("keys: update key quotas in key_put()") Reported-by: syzbot+6105ffc1ded71d194d6d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/673b6aec.050a0220.87769.004a.GAE@google.com/ Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: syzbot+6105ffc1ded71d194d6d@syzkaller.appspotmail.com Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2025-03-20Merge tag 'net-6.14-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from can, bluetooth and ipsec. This contains a last minute revert of a recent GRE patch, mostly to allow me stating there are no known regressions outstanding. Current release - regressions: - revert "gre: Fix IPv6 link-local address generation." - eth: ti: am65-cpsw: fix NAPI registration sequence Previous releases - regressions: - ipv6: fix memleak of nhc_pcpu_rth_output in fib_check_nh_v6_gw(). - mptcp: fix data stream corruption in the address announcement - bluetooth: fix connection regression between LE and non-LE adapters - can: - flexcan: only change CAN state when link up in system PM - ucan: fix out of bound read in strscpy() source Previous releases - always broken: - lwtunnel: fix reentry loops - ipv6: fix TCP GSO segmentation with NAT - xfrm: force software GSO only in tunnel mode - eth: ti: icssg-prueth: add lock to stats Misc: - add Andrea Mayer as a maintainer of SRv6" * tag 'net-6.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (33 commits) MAINTAINERS: Add Andrea Mayer as a maintainer of SRv6 Revert "gre: Fix IPv6 link-local address generation." Revert "selftests: Add IPv6 link-local address generation tests for GRE devices." net/neighbor: add missing policy for NDTPA_QUEUE_LENBYTES tools headers: Sync uapi/asm-generic/socket.h with the kernel sources mptcp: Fix data stream corruption in the address announcement selftests: net: test for lwtunnel dst ref loops net: ipv6: ioam6: fix lwtunnel_output() loop net: lwtunnel: fix recursion loops net: ti: icssg-prueth: Add lock to stats net: atm: fix use after free in lec_send() xsk: fix an integer overflow in xp_create_and_assign_umem() net: stmmac: dwc-qos-eth: use devm_kzalloc() for AXI data selftests: drv-net: use defer in the ping test phy: fix xa_alloc_cyclic() error handling dpll: fix xa_alloc_cyclic() error handling devlink: fix xa_alloc_cyclic() error handling ipv6: Set errno after ip_fib_metrics_init() in ip6_route_info_create(). ipv6: Fix memleak of nhc_pcpu_rth_output in fib_check_nh_v6_gw(). net: ipv6: fix TCP GSO segmentation with NAT ...
2025-03-19Merge tag 'for-net-2025-03-14' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - hci_event: Fix connection regression between LE and non-LE adapters - Fix error code in chan_alloc_skb_cb() * tag 'for-net-2025-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: hci_event: Fix connection regression between LE and non-LE adapters Bluetooth: Fix error code in chan_alloc_skb_cb() ==================== Link: https://patch.msgid.link/20250314163847.110069-1-luiz.dentz@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>