summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2025-05-02tracing: Remove pointer (asterisk) and brackets from cpumask_t fieldSteven Rostedt (Google)
commit fab89a09c86f948adfc7e20a7d608bd9f323bbe1 upstream. To differentiate between long arrays and cpumasks, the __cpumask() field was created. Part of the TRACE_EVENT() macros test if the type is signed or not by using the is_signed_type() macro. The __cpumask() field used the __dynamic_array() helper but because cpumask_t is a structure, it could not be used in the is_signed_type() macro as that would fail to build, so instead it passed in the pointer to cpumask_t. Unfortunately, that creates in the format file: field:__data_loc cpumask_t *[] mask; offset:36; size:4; signed:0; Which looks like an array of pointers to cpumask_t and not a cpumask_t type, which is misleading to user space parsers. Douglas Raillard pointed out that the "[]" are also misleading, as cpumask_t is not an array. Since cpumask() hasn't been created yet, and the parsers currently fail on it (but will still produce the raw output), make it be: field:__data_loc cpumask_t mask; offset:36; size:4; signed:0; Which is the correct type of the field. Then the parsers can be updated to handle this. Link: https://lore.kernel.org/lkml/6dda5e1d-9416-b55e-88f3-31d148bc925f@arm.com/ Link: https://lore.kernel.org/linux-trace-kernel/20221212193814.0e3f1e43@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8230f27b1ccc ("tracing: Add __cpumask to denote a trace event field that is a cpumask_t") Reported-by: Douglas Raillard <douglas.raillard@arm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-02xdp: Reset bpf_redirect_info before running a xdp's BPF prog.Sebastian Andrzej Siewior
Ricardo reported a KASAN discovered use after free in v6.6-stable. The syzbot starts a BPF program via xdp_test_run_batch() which assigns ri->tgt_value via dev_hash_map_redirect() and the return code isn't XDP_REDIRECT it looks like nonsense. So the output in bpf_warn_invalid_xdp_action() appears once. Then the TUN driver runs another BPF program (on the same CPU) which returns XDP_REDIRECT without setting ri->tgt_value first. It invokes bpf_trace_printk() to print four characters and obtain the required return value. This is enough to get xdp_do_redirect() invoked which then accesses the pointer in tgt_value which might have been already deallocated. This problem does not affect upstream because since commit 401cb7dae8130 ("net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.") the per-CPU variable is referenced via task's task_struct and exists on the stack during NAPI callback. Therefore it is cleared once before the first invocation and remains valid within the RCU section of the NAPI callback. Instead of performing the huge backport of the commit (plus its fix ups) here is an alternative version which only resets the variable in question prior invoking the BPF program. Acked-by: Toke Høiland-Jørgensen <toke@kernel.org> Reported-by: Ricardo Cañuelo Navarro <rcn@igalia.com> Closes: https://lore.kernel.org/all/20250226-20250204-kasan-slab-use-after-free-read-in-dev_map_enqueue__submit-v3-0-360efec441ba@igalia.com/ Fixes: 97f91a7cf04ff ("bpf: add bpf_redirect_map helper routine") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-02wifi: mac80211: export ieee80211_purge_tx_queue() for driversPing-Ke Shih
commit 53bc1b73b67836ac9867f93dee7a443986b4a94f upstream. Drivers need to purge TX SKB when stopping. Using skb_queue_purge() can't report TX status to mac80211, causing ieee80211_free_ack_frame() warns "Have pending ack frames!". Export ieee80211_purge_tx_queue() for drivers to not have to reimplement it. Signed-off-by: Ping-Ke Shih <pkshih@realtek.com> Link: https://patch.msgid.link/20240822014255.10211-1-pkshih@realtek.com Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-02ASoC: qcom: q6dsp: add support to more display portsSrinivas Kandagatla
[ Upstream commit 90848a2557fec0a6f1a35e58031a1f6f5e44e7d6 ] Existing code base only supports one display port, this patch adds support upto 8 display ports. This support is required to allow platforms like X13s which have 3 display ports, and some of the Qualcomm SoCs there are upto 7 Display ports. Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org Link: https://lore.kernel.org/r/20230509112202.21471-4-srinivas.kandagatla@linaro.org Signed-off-by: Mark Brown <broonie@kernel.org Stable-dep-of: a31a4934b31f ("ASoC: qcom: Fix sc7280 lpass potential buffer overflow") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-02PCI: Assign PCI domain IDs by ida_alloc()Pali Rohár
[ Upstream commit c14f7ccc9f5dcf9d06ddeec706f85405b2c80600 ] Replace assignment of PCI domain IDs from atomic_inc_return() to ida_alloc(). Use two IDAs, one for static domain allocations (those which are defined in device tree) and second for dynamic allocations (all other). During removal of root bus / host bridge, also release the domain ID. The released ID can be reused again, for example when dynamically loading and unloading native PCI host bridge drivers. This change also allows to mix static device tree assignment and dynamic by kernel as all static allocations are reserved in dynamic pool. [bhelgaas: set "err" if "bus->domain_nr < 0"] Link: https://lore.kernel.org/r/20220714184130.5436-1-pali@kernel.org Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Stable-dep-of: 804443c1f278 ("PCI: Fix reference leak in pci_register_host_bridge()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-02net: dsa: add support for mac_prepare() and mac_finish() callsRussell King (Oracle)
[ Upstream commit dd805cf3e80e038aeb06902399ce9bd6fafb4ff3 ] Add DSA support for the phylink mac_prepare() and mac_finish() calls. These were introduced as part of the PCS support to allow MACs to perform preparatory steps prior to configuration, and finalisation steps after the MAC and PCS has been configured. Introducing phylink_pcs support to the mv88e6xxx DSA driver needs some code moved out of its mac_config() stage into the mac_prepare() and mac_finish() stages, and this commit facilitates such code in DSA drivers. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: 52fdc41c3278 ("net: dsa: mv88e6xxx: fix internal PHYs for 6320 family") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-02tracing: Add __print_dynamic_array() helperSteven Rostedt
[ Upstream commit e52750fb1458ae9ea5860a08ed7a149185bc5b97 ] When printing a dynamic array in a trace event, the method is rather ugly. It has the format of: __print_array(__get_dynamic_array(array), __get_dynmaic_array_len(array) / el_size, el_size) Since dynamic arrays are known to the tracing infrastructure, create a helper macro that does the above for you. __print_dynamic_array(array, el_size) Which would expand to the same output. Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://lore.kernel.org/r/20241022194158.110073-3-avadhut.naik@amd.com Stable-dep-of: ea8d7647f9dd ("tracing: Verify event formats that have "%*p.."") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-05-02tracing: Add __cpumask to denote a trace event field that is a cpumask_tSteven Rostedt (Google)
[ Upstream commit 8230f27b1ccc4b8976c137e3d6d690f9d4ffca8d ] The trace events have a __bitmask field that can be used for anything that requires bitmasks. Although currently it is only used for CPU masks, it could be used in the future for any type of bitmasks. There is some user space tooling that wants to know if a field is a CPU mask and not just some random unsigned long bitmask. Introduce "__cpumask()" helper functions that work the same as the current __bitmask() helpers but displays in the format file: field:__data_loc cpumask_t *[] mask; offset:36; size:4; signed:0; Instead of: field:__data_loc unsigned long[] mask; offset:32; size:4; signed:0; The main difference is the type. Instead of "unsigned long" it is "cpumask_t *". Note, this type field needs to be a real type in the __dynamic_array() logic that both __cpumask and__bitmask use, but the comparison field requires it to be a scalar type whereas cpumask_t is a structure (non-scalar). But everything works when making it a pointer. Valentin added changes to remove the need of passing in "nr_bits" and the __cpumask will always use nr_cpumask_bits as its size. Link: https://lkml.kernel.org/r/20221014080456.1d32b989@rorschach.local.home Requested-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Stable-dep-of: ea8d7647f9dd ("tracing: Verify event formats that have "%*p.."") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25bpf: Prevent tail call between progs attached to different hooksXu Kuohai
commit 28ead3eaabc16ecc907cfb71876da028080f6356 upstream. bpf progs can be attached to kernel functions, and the attached functions can take different parameters or return different return values. If prog attached to one kernel function tail calls prog attached to another kernel function, the ctx access or return value verification could be bypassed. For example, if prog1 is attached to func1 which takes only 1 parameter and prog2 is attached to func2 which takes two parameters. Since verifier assumes the bpf ctx passed to prog2 is constructed based on func2's prototype, verifier allows prog2 to access the second parameter from the bpf ctx passed to it. The problem is that verifier does not prevent prog1 from passing its bpf ctx to prog2 via tail call. In this case, the bpf ctx passed to prog2 is constructed from func1 instead of func2, that is, the assumption for ctx access verification is bypassed. Another example, if BPF LSM prog1 is attached to hook file_alloc_security, and BPF LSM prog2 is attached to hook bpf_lsm_audit_rule_known. Verifier knows the return value rules for these two hooks, e.g. it is legal for bpf_lsm_audit_rule_known to return positive number 1, and it is illegal for file_alloc_security to return positive number. So verifier allows prog2 to return positive number 1, but does not allow prog1 to return positive number. The problem is that verifier does not prevent prog1 from calling prog2 via tail call. In this case, prog2's return value 1 will be used as the return value for prog1's hook file_alloc_security. That is, the return value rule is bypassed. This patch adds restriction for tail call to prevent such bypasses. Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/r/20240719110059.797546-4-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> [Minor conflict resolved due to code context change.] Signed-off-by: Jianqi Ren <jianqi.ren.cn@windriver.com> Signed-off-by: He Zhe <zhe.he@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25landlock: Add the errata interfaceMickaël Salaün
commit 15383a0d63dbcd63dc7e8d9ec1bf3a0f7ebf64ac upstream. Some fixes may require user space to check if they are applied on the running kernel before using a specific feature. For instance, this applies when a restriction was previously too restrictive and is now getting relaxed (e.g. for compatibility reasons). However, non-visible changes for legitimate use (e.g. security fixes) do not require an erratum. Because fixes are backported down to a specific Landlock ABI, we need a way to avoid cherry-pick conflicts. The solution is to only update a file related to the lower ABI impacted by this issue. All the ABI files are then used to create a bitmask of fixes. The new errata interface is similar to the one used to get the supported Landlock ABI version, but it returns a bitmask instead because the order of fixes may not match the order of versions, and not all fixes may apply to all versions. The actual errata will come with dedicated commits. The description is not actually used in the code but serves as documentation. Create the landlock_abi_version symbol and use its value to check errata consistency. Update test_base's create_ruleset_checks_ordering tests and add errata tests. This commit is backportable down to the first version of Landlock. Fixes: 3532b0b4352c ("landlock: Enable user space to infer supported features") Cc: Günther Noack <gnoack@google.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250318161443.279194-3-mic@digikod.net Signed-off-by: Mickaël Salaün <mic@digikod.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25nfs: add missing selections of CONFIG_CRC32Eric Biggers
[ Upstream commit cd35b6cb46649750b7dbd0df0e2d767415d8917b ] nfs.ko, nfsd.ko, and lockd.ko all use crc32_le(), which is available only when CONFIG_CRC32 is enabled. But the only NFS kconfig option that selected CONFIG_CRC32 was CONFIG_NFS_DEBUG, which is client-specific and did not actually guard the use of crc32_le() even on the client. The code worked around this bug by only actually calling crc32_le() when CONFIG_CRC32 is built-in, instead hard-coding '0' in other cases. This avoided randconfig build errors, and in real kernels the fallback code was unlikely to be reached since CONFIG_CRC32 is 'default y'. But, this really needs to just be done properly, especially now that I'm planning to update CONFIG_CRC32 to not be 'default y'. Therefore, make CONFIG_NFS_FS, CONFIG_NFSD, and CONFIG_LOCKD select CONFIG_CRC32. Then remove the fallback code that becomes unnecessary, as well as the selection of CONFIG_CRC32 from CONFIG_NFS_DEBUG. Fixes: 1264a2f053a3 ("NFS: refactor code for calculating the crc32 hash of a filehandle") Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25nfs: move nfs_fhandle_hash to common include fileJeff Layton
[ Upstream commit e59fb6749ed833deee5b3cfd7e89925296d41f49 ] lockd needs to be able to hash filehandles for tracepoints. Move the nfs_fhandle_hash() helper to a common nfs include file. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Stable-dep-of: cd35b6cb4664 ("nfs: add missing selections of CONFIG_CRC32") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25writeback: fix false warning in inode_to_wb()Andreas Gruenbacher
commit 9e888998ea4d22257b07ce911576509486fa0667 upstream. inode_to_wb() is used also for filesystems that don't support cgroup writeback. For these filesystems inode->i_wb is stable during the lifetime of the inode (it points to bdi->wb) and there's no need to hold locks protecting the inode->i_wb dereference. Improve the warning in inode_to_wb() to not trigger for these filesystems. Link: https://lkml.kernel.org/r/20250412163914.3773459-3-agruenba@redhat.com Fixes: aaa2cacf8184 ("writeback: add lockdep annotation to inode_to_wb()") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25sctp: detect and prevent references to a freed transport in sendmsgRicardo Cañuelo Navarro
commit f1a69a940de58b16e8249dff26f74c8cc59b32be upstream. sctp_sendmsg() re-uses associations and transports when possible by doing a lookup based on the socket endpoint and the message destination address, and then sctp_sendmsg_to_asoc() sets the selected transport in all the message chunks to be sent. There's a possible race condition if another thread triggers the removal of that selected transport, for instance, by explicitly unbinding an address with setsockopt(SCTP_SOCKOPT_BINDX_REM), after the chunks have been set up and before the message is sent. This can happen if the send buffer is full, during the period when the sender thread temporarily releases the socket lock in sctp_wait_for_sndbuf(). This causes the access to the transport data in sctp_outq_select_transport(), when the association outqueue is flushed, to result in a use-after-free read. This change avoids this scenario by having sctp_transport_free() signal the freeing of the transport, tagging it as "dead". In order to do this, the patch restores the "dead" bit in struct sctp_transport, which was removed in commit 47faa1e4c50e ("sctp: remove the dead field of sctp_transport"). Then, in the scenario where the sender thread has released the socket lock in sctp_wait_for_sndbuf(), the bit is checked again after re-acquiring the socket lock to detect the deletion. This is done while holding a reference to the transport to prevent it from being freed in the process. If the transport was deleted while the socket lock was relinquished, sctp_sendmsg_to_asoc() will return -EAGAIN to let userspace retry the send. The bug was found by a private syzbot instance (see the error report [1] and the C reproducer that triggers it [2]). Link: https://people.igalia.com/rcn/kernel_logs/20250402__KASAN_slab-use-after-free_Read_in_sctp_outq_select_transport.txt [1] Link: https://people.igalia.com/rcn/kernel_logs/20250402__KASAN_slab-use-after-free_Read_in_sctp_outq_select_transport__repro.c [2] Cc: stable@vger.kernel.org Fixes: df132eff4638 ("sctp: clear the transport of some out_chunk_list chunks in sctp_assoc_rm_peer") Suggested-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Ricardo Cañuelo Navarro <rcn@igalia.com> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20250404-kasan_slab-use-after-free_read_in_sctp_outq_select_transport__20250404-v1-1-5ce4a0b78ef2@igalia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25tpm, tpm_tis: Workaround failed command reception on Infineon devicesJonathan McDowell
[ Upstream commit de9e33df7762abbfc2a1568291f2c3a3154c6a9d ] Some Infineon devices have a issue where the status register will get stuck with a quick REQUEST_USE / COMMAND_READY sequence. This is not simply a matter of requiring a longer timeout; the work around is to retry the command submission. Add appropriate logic to do this in the send path. This is fixed in later firmware revisions, but those are not always available, and cannot generally be easily updated from outside a firmware environment. Testing has been performed with a simple repeated loop of doing a TPM2_CC_GET_CAPABILITY for TPM_CAP_PROP_MANUFACTURER using the Go code at: https://the.earth.li/~noodles/tpm-stuff/timeout-reproducer-simple.go It can take several hours to reproduce, and several million operations. Signed-off-by: Jonathan McDowell <noodles@meta.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25drm/amdkfd: clamp queue size to minimumDavid Yat Sin
[ Upstream commit e90711946b53590371ecce32e8fcc381a99d6333 ] If queue size is less than minimum, clamp it to minimum to prevent underflow when writing queue mqd. Signed-off-by: David Yat Sin <David.YatSin@amd.com> Reviewed-by: Jay Cornwall <jay.cornwall@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25xen/mcelog: Add __nonstring annotations for unterminated stringsKees Cook
[ Upstream commit 1c3dfc7c6b0f551fdca3f7c1f1e4c73be8adb17d ] When a character array without a terminating NUL character has a static initializer, GCC 15's -Wunterminated-string-initialization will only warn if the array lacks the "nonstring" attribute[1]. Mark the arrays with __nonstring to and correctly identify the char array as "not a C string" and thereby eliminate the warning. Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117178 [1] Cc: Juergen Gross <jgross@suse.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: xen-devel@lists.xenproject.org Signed-off-by: Kees Cook <kees@kernel.org> Acked-by: Juergen Gross <jgross@suse.com> Message-ID: <20250310222234.work.473-kees@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25rtnl: add helper to check if a notification is neededVictor Nogueira
[ Upstream commit 8439109b76a3c405808383bf9dd532fc4b9c2dbd ] Building on the rtnl_has_listeners helper, add the rtnl_notify_needed helper to check if we can bail out early in the notification routines. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Link: https://lore.kernel.org/r/20231208192847.714940-3-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 369609fc6272 ("tc: Ensure we have enough buffer space when sending filter netlink notifications") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25rtnl: add helper to check if rtnl group has listenersJamal Hadi Salim
[ Upstream commit c5e2a973448d958feb7881e4d875eac59fdeff3d ] As of today, rtnl code creates a new skb and unconditionally fills and broadcasts it to the relevant group. For most operations this is okay and doesn't waste resources in general. When operations are done without the rtnl_lock, as in tc-flower, such skb allocation, message fill and no-op broadcasting can happen in all cores of the system, which contributes to system pressure and wastes precious cpu cycles when no one will receive the built message. Introduce this helper so rtnetlink operations can simply check if someone is listening and then proceed if necessary. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Link: https://lore.kernel.org/r/20231208192847.714940-2-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 369609fc6272 ("tc: Ensure we have enough buffer space when sending filter netlink notifications") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10rcu-tasks: Always inline rcu_irq_work_resched()Josh Poimboeuf
[ Upstream commit 6309a5c43b0dc629851f25b2e5ef8beff61d08e5 ] Thanks to CONFIG_DEBUG_SECTION_MISMATCH, empty functions can be generated out of line. rcu_irq_work_resched() can be called from noinstr code, so make sure it's always inlined. Fixes: 564506495ca9 ("rcu/context-tracking: Move deferred nocb resched to context tracking") Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/e84f15f013c07e4c410d972e75620c53b62c1b3e.1743481539.git.jpoimboe@kernel.org Closes: https://lore.kernel.org/d1eca076-fdde-484a-b33e-70e0d167c36d@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10context_tracking: Always inline ct_{nmi,irq}_{enter,exit}()Josh Poimboeuf
[ Upstream commit 9ac50f7311dc8b39e355582f14c1e82da47a8196 ] Thanks to CONFIG_DEBUG_SECTION_MISMATCH, empty functions can be generated out of line. These can be called from noinstr code, so make sure they're always inlined. Fixes the following warnings: vmlinux.o: warning: objtool: irqentry_nmi_enter+0xa2: call to ct_nmi_enter() leaves .noinstr.text section vmlinux.o: warning: objtool: irqentry_nmi_exit+0x16: call to ct_nmi_exit() leaves .noinstr.text section vmlinux.o: warning: objtool: irqentry_exit+0x78: call to ct_irq_exit() leaves .noinstr.text section Fixes: 6f0e6c1598b1 ("context_tracking: Take IRQ eqs entrypoints over RCU") Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/8509bce3f536bcd4ae7af3a2cf6930d48c5e631a.1743481539.git.jpoimboe@kernel.org Closes: https://lore.kernel.org/d1eca076-fdde-484a-b33e-70e0d167c36d@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10sched/smt: Always inline sched_smt_active()Josh Poimboeuf
[ Upstream commit 09f37f2d7b21ff35b8b533f9ab8cfad2fe8f72f6 ] sched_smt_active() can be called from noinstr code, so it should always be inlined. The CONFIG_SCHED_SMT version already has __always_inline. Do the same for its !CONFIG_SCHED_SMT counterpart. Fixes the following warning: vmlinux.o: error: objtool: intel_idle_ibrs+0x13: call to sched_smt_active() leaves .noinstr.text section Fixes: 321a874a7ef8 ("sched/smt: Expose sched_smt_present static key") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/1d03907b0a247cf7fb5c1d518de378864f603060.1743481539.git.jpoimboe@kernel.org Closes: https://lore.kernel.org/r/202503311434.lyw2Tveh-lkp@intel.com/ Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10coresight-etm4x: add isb() before reading the TRCSTATRYuanfang Zhang
[ Upstream commit 4ff6039ffb79a4a8a44b63810a8a2f2b43264856 ] As recommended by section 4.3.7 ("Synchronization when using system instructions to progrom the trace unit") of ARM IHI 0064H.b, the self-hosted trace analyzer must perform a Context synchronization event between writing to the TRCPRGCTLR and reading the TRCSTATR. Additionally, add an ISB between the each read of TRCSTATR on coresight_timeout() when using system instructions to program the trace unit. Fixes: 1ab3bb9df5e3 ("coresight: etm4x: Add necessary synchronization for sysreg access") Signed-off-by: Yuanfang Zhang <quic_yuanfang@quicinc.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20250116-etm_sync-v4-1-39f2b05e9514@quicinc.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10RDMA/core: Don't expose hw_counters outside of init net namespaceRoman Gushchin
[ Upstream commit a1ecb30f90856b0be4168ad51b8875148e285c1f ] Commit 467f432a521a ("RDMA/core: Split port and device counter sysfs attributes") accidentally almost exposed hw counters to non-init net namespaces. It didn't expose them fully, as an attempt to read any of those counters leads to a crash like this one: [42021.807566] BUG: kernel NULL pointer dereference, address: 0000000000000028 [42021.814463] #PF: supervisor read access in kernel mode [42021.819549] #PF: error_code(0x0000) - not-present page [42021.824636] PGD 0 P4D 0 [42021.827145] Oops: 0000 [#1] SMP PTI [42021.830598] CPU: 82 PID: 2843922 Comm: switchto-defaul Kdump: loaded Tainted: G S W I XXX [42021.841697] Hardware name: XXX [42021.849619] RIP: 0010:hw_stat_device_show+0x1e/0x40 [ib_core] [42021.855362] Code: 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 49 89 d0 4c 8b 5e 20 48 8b 8f b8 04 00 00 48 81 c7 f0 fa ff ff <48> 8b 41 28 48 29 ce 48 83 c6 d0 48 c1 ee 04 69 d6 ab aa aa aa 48 [42021.873931] RSP: 0018:ffff97fe90f03da0 EFLAGS: 00010287 [42021.879108] RAX: ffff9406988a8c60 RBX: ffff940e1072d438 RCX: 0000000000000000 [42021.886169] RDX: ffff94085f1aa000 RSI: ffff93c6cbbdbcb0 RDI: ffff940c7517aef0 [42021.893230] RBP: ffff97fe90f03e70 R08: ffff94085f1aa000 R09: 0000000000000000 [42021.900294] R10: ffff94085f1aa000 R11: ffffffffc0775680 R12: ffffffff87ca2530 [42021.907355] R13: ffff940651602840 R14: ffff93c6cbbdbcb0 R15: ffff94085f1aa000 [42021.914418] FS: 00007fda1a3b9700(0000) GS:ffff94453fb80000(0000) knlGS:0000000000000000 [42021.922423] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [42021.928130] CR2: 0000000000000028 CR3: 00000042dcfb8003 CR4: 00000000003726f0 [42021.935194] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [42021.942257] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [42021.949324] Call Trace: [42021.951756] <TASK> [42021.953842] [<ffffffff86c58674>] ? show_regs+0x64/0x70 [42021.959030] [<ffffffff86c58468>] ? __die+0x78/0xc0 [42021.963874] [<ffffffff86c9ef75>] ? page_fault_oops+0x2b5/0x3b0 [42021.969749] [<ffffffff87674b92>] ? exc_page_fault+0x1a2/0x3c0 [42021.975549] [<ffffffff87801326>] ? asm_exc_page_fault+0x26/0x30 [42021.981517] [<ffffffffc0775680>] ? __pfx_show_hw_stats+0x10/0x10 [ib_core] [42021.988482] [<ffffffffc077564e>] ? hw_stat_device_show+0x1e/0x40 [ib_core] [42021.995438] [<ffffffff86ac7f8e>] dev_attr_show+0x1e/0x50 [42022.000803] [<ffffffff86a3eeb1>] sysfs_kf_seq_show+0x81/0xe0 [42022.006508] [<ffffffff86a11134>] seq_read_iter+0xf4/0x410 [42022.011954] [<ffffffff869f4b2e>] vfs_read+0x16e/0x2f0 [42022.017058] [<ffffffff869f50ee>] ksys_read+0x6e/0xe0 [42022.022073] [<ffffffff8766f1ca>] do_syscall_64+0x6a/0xa0 [42022.027441] [<ffffffff8780013b>] entry_SYSCALL_64_after_hwframe+0x78/0xe2 The problem can be reproduced using the following steps: ip netns add foo ip netns exec foo bash cat /sys/class/infiniband/mlx4_0/hw_counters/* The panic occurs because of casting the device pointer into an ib_device pointer using container_of() in hw_stat_device_show() is wrong and leads to a memory corruption. However the real problem is that hw counters should never been exposed outside of the non-init net namespace. Fix this by saving the index of the corresponding attribute group (it might be 1 or 2 depending on the presence of driver-specific attributes) and zeroing the pointer to hw_counters group for compat devices during the initialization. With this fix applied hw_counters are not available in a non-init net namespace: find /sys/class/infiniband/mlx4_0/ -name hw_counters /sys/class/infiniband/mlx4_0/ports/1/hw_counters /sys/class/infiniband/mlx4_0/ports/2/hw_counters /sys/class/infiniband/mlx4_0/hw_counters ip netns add foo ip netns exec foo bash find /sys/class/infiniband/mlx4_0/ -name hw_counters Fixes: 467f432a521a ("RDMA/core: Split port and device counter sysfs attributes") Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Maher Sanalla <msanalla@nvidia.com> Cc: linux-rdma@vger.kernel.org Cc: linux-kernel@vger.kernel.org Link: https://patch.msgid.link/20250227165420.3430301-1-roman.gushchin@linux.dev Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10of: property: Increase NR_FWNODE_REFERENCE_ARGSZijun Hu
[ Upstream commit eb50844d728f11e87491f7c7af15a4a737f1159d ] Currently, the following two macros have different values: // The maximal argument count for firmware node reference #define NR_FWNODE_REFERENCE_ARGS 8 // The maximal argument count for DT node reference #define MAX_PHANDLE_ARGS 16 It may cause firmware node reference's argument count out of range if directly assign DT node reference's argument count to firmware's. drivers/of/property.c:of_fwnode_get_reference_args() is doing the direct assignment, so may cause firmware's argument count @args->nargs got out of range, namely, in [9, 16]. Fix by increasing NR_FWNODE_REFERENCE_ARGS to 16 to meet DT requirement. Will align both macros later to avoid such inconsistency. Fixes: 3e3119d3088f ("device property: Introduce fwnode_property_get_reference_args") Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com> Link: https://lore.kernel.org/r/20250225-fix_arg_count-v4-1-13cdc519eb31@quicinc.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10drm/dp_mst: Fix drm RAD printWayne Lin
[ Upstream commit 6bbce873a9c97cb12f5455c497be279ac58e707f ] [Why] The RAD of sideband message printed today is incorrect. For RAD stored within MST branch - If MST branch LCT is 1, it's RAD array is untouched and remained as 0. - If MST branch LCT is larger than 1, use nibble to store the up facing port number in cascaded sequence as illustrated below: u8 RAD[0] = (LCT_2_UFP << 4) | LCT_3_UFP RAD[1] = (LCT_4_UFP << 4) | LCT_5_UFP ... In drm_dp_mst_rad_to_str(), it wrongly to use BIT_MASK(4) to fetch the port number of one nibble. [How] Adjust the code by: - RAD array items are valuable only for LCT >= 1. - Use 0xF as the mask to replace BIT_MASK(4) V2: - Document how RAD is constructed (Imre) V3: - Adjust the comment for rad[] so kdoc formats it properly (Lyude) Fixes: 2f015ec6eab6 ("drm/dp_mst: Add sideband down request tracing + selftests") Cc: Imre Deak <imre.deak@intel.com> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: Harry Wentland <hwentlan@amd.com> Cc: Lyude Paul <lyude@redhat.com> Reviewed-by: Lyude Paul <lyude@redhat.com> Signed-off-by: Wayne Lin <Wayne.Lin@amd.com> Signed-off-by: Lyude Paul <lyude@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250113091100.3314533-2-Wayne.Lin@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10lockdep: Don't disable interrupts on RT in disable_irq_nosync_lockdep.*()Sebastian Andrzej Siewior
[ Upstream commit 87886b32d669abc11c7be95ef44099215e4f5788 ] disable_irq_nosync_lockdep() disables interrupts with lockdep enabled to avoid false positive reports by lockdep that a certain lock has not been acquired with disabled interrupts. The user of this macros expects that a lock can be acquried without disabling interrupts because the IRQ line triggering the interrupt is disabled. This triggers a warning on PREEMPT_RT because after disable_irq_nosync_lockdep.*() the following spinlock_t now is acquired with disabled interrupts. On PREEMPT_RT there is no difference between spin_lock() and spin_lock_irq() so avoiding disabling interrupts in this case works for the two remaining callers as of today. Don't disable interrupts on PREEMPT_RT in disable_irq_nosync_lockdep.*(). Closes: https://lore.kernel.org/760e34f9-6034-40e0-82a5-ee9becd24438@roeck-us.net Fixes: e8106b941ceab ("[PATCH] lockdep: core, add enable/disable_irq_irqsave/irqrestore() APIs") Reported-by: Guenter Roeck <linux@roeck-us.net> Suggested-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20250212103619.2560503-2-bigeasy@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10PM: sleep: Adjust check before setting power.must_resumeRafael J. Wysocki
[ Upstream commit eeb87d17aceab7803a5a5bcb6cf2817b745157cf ] The check before setting power.must_resume in device_suspend_noirq() does not take power.child_count into account, but it should do that, so use pm_runtime_need_not_resume() in it for this purpose and adjust the comment next to it accordingly. Fixes: 107d47b2b95e ("PM: sleep: core: Simplify the SMART_SUSPEND flag handling") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://patch.msgid.link/3353728.44csPzL39Z@rjwysocki.net Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-28proc: fix UAF in proc_get_inode()Ye Bin
commit 654b33ada4ab5e926cd9c570196fefa7bec7c1df upstream. Fix race between rmmod and /proc/XXX's inode instantiation. The bug is that pde->proc_ops don't belong to /proc, it belongs to a module, therefore dereferencing it after /proc entry has been registered is a bug unless use_pde/unuse_pde() pair has been used. use_pde/unuse_pde can be avoided (2 atomic ops!) because pde->proc_ops never changes so information necessary for inode instantiation can be saved _before_ proc_register() in PDE itself and used later, avoiding pde->proc_ops->... dereference. rmmod lookup sys_delete_module proc_lookup_de pde_get(de); proc_get_inode(dir->i_sb, de); mod->exit() proc_remove remove_proc_subtree proc_entry_rundown(de); free_module(mod); if (S_ISREG(inode->i_mode)) if (de->proc_ops->proc_read_iter) --> As module is already freed, will trigger UAF BUG: unable to handle page fault for address: fffffbfff80a702b PGD 817fc4067 P4D 817fc4067 PUD 817fc0067 PMD 102ef4067 PTE 0 Oops: Oops: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 26 UID: 0 PID: 2667 Comm: ls Tainted: G Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) RIP: 0010:proc_get_inode+0x302/0x6e0 RSP: 0018:ffff88811c837998 EFLAGS: 00010a06 RAX: dffffc0000000000 RBX: ffffffffc0538140 RCX: 0000000000000007 RDX: 1ffffffff80a702b RSI: 0000000000000001 RDI: ffffffffc0538158 RBP: ffff8881299a6000 R08: 0000000067bbe1e5 R09: 1ffff11023906f20 R10: ffffffffb560ca07 R11: ffffffffb2b43a58 R12: ffff888105bb78f0 R13: ffff888100518048 R14: ffff8881299a6004 R15: 0000000000000001 FS: 00007f95b9686840(0000) GS:ffff8883af100000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffbfff80a702b CR3: 0000000117dd2000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> proc_lookup_de+0x11f/0x2e0 __lookup_slow+0x188/0x350 walk_component+0x2ab/0x4f0 path_lookupat+0x120/0x660 filename_lookup+0x1ce/0x560 vfs_statx+0xac/0x150 __do_sys_newstat+0x96/0x110 do_syscall_64+0x5f/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e [adobriyan@gmail.com: don't do 2 atomic ops on the common path] Link: https://lkml.kernel.org/r/3d25ded0-1739-447e-812b-e34da7990dcf@p183 Fixes: 778f3dd5a13c ("Fix procfs compat_ioctl regression") Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David S. Miller <davem@davemloft.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-28ASoC: ops: Consistently treat platform_max as control valueCharles Keepax
[ Upstream commit 0eba2a7e858907a746ba69cd002eb9eb4dbd7bf3 ] This reverts commit 9bdd10d57a88 ("ASoC: ops: Shift tested values in snd_soc_put_volsw() by +min"), and makes some additional related updates. There are two ways the platform_max could be interpreted; the maximum register value, or the maximum value the control can be set to. The patch moved from treating the value as a control value to a register one. When the patch was applied it was technically correct as snd_soc_limit_volume() also used the register interpretation. However, even then most of the other usages treated platform_max as a control value, and snd_soc_limit_volume() has since been updated to also do so in commit fb9ad24485087 ("ASoC: ops: add correct range check for limiting volume"). That patch however, missed updating snd_soc_put_volsw() back to the control interpretation, and fixing snd_soc_info_volsw_range(). The control interpretation makes more sense as limiting is typically done from the machine driver, so it is appropriate to use the customer facing representation rather than the internal codec representation. Update all the code to consistently use this interpretation of platform_max. Finally, also add some comments to the soc_mixer_control struct to hopefully avoid further patches switching between the two approaches. Fixes: fb9ad24485087 ("ASoC: ops: add correct range check for limiting volume") Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Link: https://patch.msgid.link/20250228151456.3703342-1-ckeepax@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-28io_uring: get rid of remap_pfn_range() for mapping rings/sqesJens Axboe
Commit 3ab1db3c6039e02a9deb9d5091d28d559917a645 upstream. Rather than use remap_pfn_range() for this and manually free later, switch to using vm_insert_pages() and have it Just Work. If possible, allocate a single compound page that covers the range that is needed. If that works, then we can just use page_address() on that page. If we fail to get a compound page, allocate single pages and use vmap() to map them into the kernel virtual address space. This just covers the rings/sqes, the other remaining user of the mmap remap_pfn_range() user will be converted separately. Once that is done, we can kill the old alloc/free code. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-28fuse: don't truncate cached, mutated symlinkMiklos Szeredi
[ Upstream commit b4c173dfbb6c78568578ff18f9e8822d7bd0e31b ] Fuse allows the value of a symlink to change and this property is exploited by some filesystems (e.g. CVMFS). It has been observed, that sometimes after changing the symlink contents, the value is truncated to the old size. This is caused by fuse_getattr() racing with fuse_reverse_inval_inode(). fuse_reverse_inval_inode() updates the fuse_inode's attr_version, which results in fuse_change_attributes() exiting before updating the cached attributes This is okay, as the cached attributes remain invalid and the next call to fuse_change_attributes() will likely update the inode with the correct values. The reason this causes problems is that cached symlinks will be returned through page_get_link(), which truncates the symlink to inode->i_size. This is correct for filesystems that don't mutate symlinks, but in this case it causes bad behavior. The solution is to just remove this truncation. This can cause a regression in a filesystem that relies on supplying a symlink larger than the file size, but this is unlikely. If that happens we'd need to make this behavior conditional. Reported-by: Laura Promberger <laura.promberger@cern.ch> Tested-by: Sam Lewis <samclewis@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/r/20250220100258.793363-1-mszeredi@redhat.com Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-28nvme-tcp: add basic support for the C2HTermReq PDUMaurizio Lombardi
[ Upstream commit 84e009042d0f3dfe91bec60bcd208ee3f866cbcd ] Previously, the NVMe/TCP host driver did not handle the C2HTermReq PDU, instead printing "unsupported pdu type (3)" when received. This patch adds support for processing the C2HTermReq PDU, allowing the driver to print the Fatal Error Status field. Example of output: nvme nvme4: Received C2HTermReq (FES = Invalid PDU Header Field) Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-28Revert "Bluetooth: hci_core: Fix sleeping function called from invalid context"Luiz Augusto von Dentz
[ Upstream commit ab6ab707a4d060a51c45fc13e3b2228d5f7c0b87 ] This reverts commit 4d94f05558271654670d18c26c912da0c1c15549 which has problems (see [1]) and is no longer needed since 581dd2dc168f ("Bluetooth: hci_event: Fix using rcu_read_(un)lock while iterating") has reworked the code where the original bug has been found. [1] Link: https://lore.kernel.org/linux-bluetooth/877c55ci1r.wl-tiwai@suse.de/T/#t Fixes: 4d94f0555827 ("Bluetooth: hci_core: Fix sleeping function called from invalid context") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-28clockevents/drivers/i8253: Fix stop sequence for timer 0David Woodhouse
commit 531b2ca0a940ac9db03f246c8b77c4201de72b00 upstream. According to the data sheet, writing the MODE register should stop the counter (and thus the interrupts). This appears to work on real hardware, at least modern Intel and AMD systems. It should also work on Hyper-V. However, on some buggy virtual machines the mode change doesn't have any effect until the counter is subsequently loaded (or perhaps when the IRQ next fires). So, set MODE 0 and then load the counter, to ensure that those buggy VMs do the right thing and the interrupts stop. And then write MODE 0 *again* to stop the counter on compliant implementations too. Apparently, Hyper-V keeps firing the IRQ *repeatedly* even in mode zero when it should only happen once, but the second MODE write stops that too. Userspace test program (mostly written by tglx): ===== #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <stdint.h> #include <sys/io.h> static __always_inline void __out##bwl(type value, uint16_t port) \ { \ asm volatile("out" #bwl " %" #bw "0, %w1" \ : : "a"(value), "Nd"(port)); \ } \ \ static __always_inline type __in##bwl(uint16_t port) \ { \ type value; \ asm volatile("in" #bwl " %w1, %" #bw "0" \ : "=a"(value) : "Nd"(port)); \ return value; \ } BUILDIO(b, b, uint8_t) #define inb __inb #define outb __outb #define PIT_MODE 0x43 #define PIT_CH0 0x40 #define PIT_CH2 0x42 static int is8254; static void dump_pit(void) { if (is8254) { // Latch and output counter and status outb(0xC2, PIT_MODE); printf("%02x %02x %02x\n", inb(PIT_CH0), inb(PIT_CH0), inb(PIT_CH0)); } else { // Latch and output counter outb(0x0, PIT_MODE); printf("%02x %02x\n", inb(PIT_CH0), inb(PIT_CH0)); } } int main(int argc, char* argv[]) { int nr_counts = 2; if (argc > 1) nr_counts = atoi(argv[1]); if (argc > 2) is8254 = 1; if (ioperm(0x40, 4, 1) != 0) return 1; dump_pit(); printf("Set oneshot\n"); outb(0x38, PIT_MODE); outb(0x00, PIT_CH0); outb(0x0F, PIT_CH0); dump_pit(); usleep(1000); dump_pit(); printf("Set periodic\n"); outb(0x34, PIT_MODE); outb(0x00, PIT_CH0); outb(0x0F, PIT_CH0); dump_pit(); usleep(1000); dump_pit(); dump_pit(); usleep(100000); dump_pit(); usleep(100000); dump_pit(); printf("Set stop (%d counter writes)\n", nr_counts); outb(0x30, PIT_MODE); while (nr_counts--) outb(0xFF, PIT_CH0); dump_pit(); usleep(100000); dump_pit(); usleep(100000); dump_pit(); printf("Set MODE 0\n"); outb(0x30, PIT_MODE); dump_pit(); usleep(100000); dump_pit(); usleep(100000); dump_pit(); return 0; } ===== Suggested-by: Sean Christopherson <seanjc@google.com> Co-developed-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Michael Kelley <mhkelley@outlook.com> Link: https://lore.kernel.org/all/20240802135555.564941-2-dwmw2@infradead.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07ptrace: Introduce exception_ip arch hookJiaxun Yang
commit 11ba1728be3edb6928791f4c622f154ebe228ae6 upstream. On architectures with delay slot, architecture level instruction pointer (or program counter) in pt_regs may differ from where exception was triggered. Introduce exception_ip hook to invoke architecture code and determine actual instruction pointer to the exception. Link: https://lore.kernel.org/lkml/00d1b813-c55f-4365-8d81-d70258e10b16@app.fastmail.com/ Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Salvatore Bonaccorso <carnil@debian.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07vmlinux.lds: Ensure that const vars with relocations are mapped R/OArd Biesheuvel
commit 68f3ea7ee199ef77551e090dfef5a49046ea8443 upstream. In the kernel, there are architectures (x86, arm64) that perform boot-time relocation (for KASLR) without relying on PIE codegen. In this case, all const global objects are emitted into .rodata, including const objects with fields that will be fixed up by the boot-time relocation code. This implies that .rodata (and .text in some cases) need to be writable at boot, but they will usually be mapped read-only as soon as the boot completes. When using PIE codegen, the compiler will emit const global objects into .data.rel.ro rather than .rodata if the object contains fields that need such fixups at boot-time. This permits the linker to annotate such regions as requiring read-write access only at load time, but not at execution time (in user space), while keeping .rodata truly const (in user space, this is important for reducing the CoW footprint of dynamic executables). This distinction does not matter for the kernel, but it does imply that const data will end up in writable memory if the .data.rel.ro sections are not treated in a special way, as they will end up in the writable .data segment by default. So emit .data.rel.ro into the .rodata segment. Cc: stable@vger.kernel.org Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250221135704.431269-5-ardb+git@google.com Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07mm: Don't pin ZERO_PAGE in pin_user_pages()David Howells
[ Upstream commit c8070b78751955e59b42457b974bea4a4fe00187 ] Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer to it from the page tables and make unpin_user_page*() correspondingly ignore a ZERO_PAGE when unpinning. We don't want to risk overrunning a zero page's refcount as we're only allowed ~2 million pins on it - something that userspace can conceivably trigger. Add a pair of functions to test whether a page or a folio is a ZERO_PAGE. Signed-off-by: David Howells <dhowells@redhat.com> cc: Christoph Hellwig <hch@infradead.org> cc: David Hildenbrand <david@redhat.com> cc: Lorenzo Stoakes <lstoakes@gmail.com> cc: Andrew Morton <akpm@linux-foundation.org> cc: Jens Axboe <axboe@kernel.dk> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Matthew Wilcox <willy@infradead.org> cc: Jan Kara <jack@suse.cz> cc: Jeff Layton <jlayton@kernel.org> cc: Jason Gunthorpe <jgg@nvidia.com> cc: Logan Gunthorpe <logang@deltatee.com> cc: Hillf Danton <hdanton@sina.com> cc: Christian Brauner <brauner@kernel.org> cc: Linus Torvalds <torvalds@linux-foundation.org> cc: linux-fsdevel@vger.kernel.org cc: linux-block@vger.kernel.org cc: linux-kernel@vger.kernel.org cc: linux-mm@kvack.org Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20230526214142.958751-2-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: bddf10d26e6e ("uprobes: Reject the shared zeropage in uprobe_write_opcode()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-07include: net: add static inline dst_dev_overhead() to dst.hJustin Iurman
[ Upstream commit 0600cf40e9b36fe17f9c9f04d4f9cef249eaa5e7 ] Add static inline dst_dev_overhead() function to include/net/dst.h. This helper function is used by ioam6_iptunnel, rpl_iptunnel and seg6_iptunnel to get the dev's overhead based on a cache entry (dst_entry). If the cache is empty, the default and generic value skb->mac_len is returned. Otherwise, LL_RESERVED_SPACE() over dst's dev is returned. Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Cc: Alexander Lobakin <aleksander.lobakin@intel.com> Cc: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Stable-dep-of: c64a0727f9b1 ("net: ipv6: fix dst ref loop on input in seg6 lwt") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-07ipv4: Convert ip_route_input() to dscp_t.Guillaume Nault
[ Upstream commit 7e863e5db6185b1add0df4cb01b31a4ed1c4b738 ] Pass a dscp_t variable to ip_route_input(), instead of a plain u8, to prevent accidental setting of ECN bits in ->flowi4_tos. Callers of ip_route_input() to consider are: * input_action_end_dx4_finish() and input_action_end_dt4() in net/ipv6/seg6_local.c. These functions set the tos parameter to 0, which is already a valid dscp_t value, so they don't need to be adjusted for the new prototype. * icmp_route_lookup(), which already has a dscp_t variable to pass as parameter. We just need to remove the inet_dscp_to_dsfield() conversion. * br_nf_pre_routing_finish(), ip_options_rcv_srr() and ip4ip6_err(), which get the DSCP directly from IPv4 headers. Define a helper to read the .tos field of struct iphdr as dscp_t, so that these function don't have to do the conversion manually. While there, declare *iph as const in br_nf_pre_routing_finish(), declare its local variables in reverse-christmas-tree order and move the "err = ip_route_input()" assignment out of the conditional to avoid checkpatch warning. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/e9d40781d64d3d69f4c79ac8a008b8d67a033e8d.1727807926.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 27843ce6ba3d ("ipvlan: ensure network headers are in skb linear part") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-07net/ipv4: add tracepoint for icmp_sendPeilin He
[ Upstream commit db3efdcf70c752e8a8deb16071d8e693c3ef8746 ] Introduce a tracepoint for icmp_send, which can help users to get more detail information conveniently when icmp abnormal events happen. 1. Giving an usecase example: ============================= When an application experiences packet loss due to an unreachable UDP destination port, the kernel will send an exception message through the icmp_send function. By adding a trace point for icmp_send, developers or system administrators can obtain detailed information about the UDP packet loss, including the type, code, source address, destination address, source port, and destination port. This facilitates the trouble-shooting of UDP packet loss issues especially for those network-service applications. 2. Operation Instructions: ========================== Switch to the tracing directory. cd /sys/kernel/tracing Filter for destination port unreachable. echo "type==3 && code==3" > events/icmp/icmp_send/filter Enable trace event. echo 1 > events/icmp/icmp_send/enable 3. Result View: ================ udp_client_erro-11370 [002] ...s.12 124.728002: icmp_send: icmp_send: type=3, code=3. From 127.0.0.1:41895 to 127.0.0.1:6666 ulen=23 skbaddr=00000000589b167a Signed-off-by: Peilin He <he.peilin@zte.com.cn> Signed-off-by: xu xin <xu.xin16@zte.com.cn> Reviewed-by: Yunkai Zhang <zhang.yunkai@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: Liu Chun <liu.chun2@zte.com.cn> Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: 27843ce6ba3d ("ipvlan: ensure network headers are in skb linear part") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-07SUNRPC: Prevent looping due to rpc_signal_task() racesTrond Myklebust
[ Upstream commit 5bbd6e863b15a85221e49b9bdb2d5d8f0bb91f3d ] If rpc_signal_task() is called while a task is in an rpc_call_done() callback function, and the latter calls rpc_restart_call(), the task can end up looping due to the RPC_TASK_SIGNALLED flag being set without the tk_rpc_status being set. Removing the redundant mechanism for signalling the task fixes the looping behaviour. Reported-by: Li Lingfeng <lilingfeng3@huawei.com> Fixes: 39494194f93b ("SUNRPC: Fix races with rpc_killall_tasks()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-07SUNRPC: convert RPC_TASK_* constants to enumStephen Brennan
[ Upstream commit 0b108e83795c9c23101f584ef7e3ab4f1f120ef0 ] The RPC_TASK_* constants are defined as macros, which means that most kernel builds will not contain their definitions in the debuginfo. However, it's quite useful for debuggers to be able to view the task state constant and interpret it correctly. Conversion to an enum will ensure the constants are present in debuginfo and can be interpreted by debuggers without needing to hard-code them and track their changes. Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com> Stable-dep-of: 5bbd6e863b15 ("SUNRPC: Prevent looping due to rpc_signal_task() races") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-07netfilter: allow exp not to be removed in nf_ct_find_expectationXin Long
commit 4914109a8e1e494c6aa9852f9e84ec77a5fc643f upstream. Currently nf_conntrack_in() calling nf_ct_find_expectation() will remove the exp from the hash table. However, in some scenario, we expect the exp not to be removed when the created ct will not be confirmed, like in OVS and TC conntrack in the following patches. This patch allows exp not to be removed by setting IPS_CONFIRMED in the status of the tmpl. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Aaron Conole <aconole@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07bpf: Fix wrong copied_seq calculationJiayuan Chen
[ Upstream commit 36b62df5683c315ba58c950f1a9c771c796c30ec ] 'sk->copied_seq' was updated in the tcp_eat_skb() function when the action of a BPF program was SK_REDIRECT. For other actions, like SK_PASS, the update logic for 'sk->copied_seq' was moved to tcp_bpf_recvmsg_parser() to ensure the accuracy of the 'fionread' feature. It works for a single stream_verdict scenario, as it also modified sk_data_ready->sk_psock_verdict_data_ready->tcp_read_skb to remove updating 'sk->copied_seq'. However, for programs where both stream_parser and stream_verdict are active (strparser purpose), tcp_read_sock() was used instead of tcp_read_skb() (sk_data_ready->strp_data_ready->tcp_read_sock). tcp_read_sock() now still updates 'sk->copied_seq', leading to duplicate updates. In summary, for strparser + SK_PASS, copied_seq is redundantly calculated in both tcp_read_sock() and tcp_bpf_recvmsg_parser(). The issue causes incorrect copied_seq calculations, which prevent correct data reads from the recv() interface in user-land. We do not want to add new proto_ops to implement a new version of tcp_read_sock, as this would introduce code complexity [1]. We could have added noack and copied_seq to desc, and then called ops->read_sock. However, unfortunately, other modules didn’t fully initialize desc to zero. So, for now, we are directly calling tcp_read_sock_noack() in tcp_bpf.c. [1]: https://lore.kernel.org/bpf/20241218053408.437295-1-mrpre@163.com Fixes: e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq") Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Jiayuan Chen <mrpre@163.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://patch.msgid.link/20250122100917.49845-3-mrpre@163.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-07strparser: Add read_sock callbackJiayuan Chen
[ Upstream commit 0532a79efd68a4d9686b0385e4993af4b130ff82 ] Added a new read_sock handler, allowing users to customize read operations instead of relying on the native socket's read_sock. Signed-off-by: Jiayuan Chen <mrpre@163.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://patch.msgid.link/20250122100917.49845-2-mrpre@163.com Stable-dep-of: 36b62df5683c ("bpf: Fix wrong copied_seq calculation") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-07tcp: drop secpath at the same time as we currently drop dstSabrina Dubroca
[ Upstream commit 9b6412e6979f6f9e0632075f8f008937b5cd4efd ] Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while running tests that boil down to: - create a pair of netns - run a basic TCP test over ipcomp6 - delete the pair of netns The xfrm_state found on spi_byaddr was not deleted at the time we delete the netns, because we still have a reference on it. This lingering reference comes from a secpath (which holds a ref on the xfrm_state), which is still attached to an skb. This skb is not leaked, it ends up on sk_receive_queue and then gets defer-free'd by skb_attempt_defer_free. The problem happens when we defer freeing an skb (push it on one CPU's defer_list), and don't flush that list before the netns is deleted. In that case, we still have a reference on the xfrm_state that we don't expect at this point. We already drop the skb's dst in the TCP receive path when it's no longer needed, so let's also drop the secpath. At this point, tcp_filter has already called into the LSM hooks that may require the secpath, so it should not be needed anymore. However, in some of those places, the MPTCP extension has just been attached to the skb, so we cannot simply drop all extensions. Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/5055ba8f8f72bdcb602faa299faca73c280b7735.1739743613.git.sd@queasysnail.net Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-07net: Add non-RCU dev_getbyhwaddr() helperBreno Leitao
[ Upstream commit 4b5a28b38c4a0106c64416a1b2042405166b26ce ] Add dedicated helper for finding devices by hardware address when holding rtnl_lock, similar to existing dev_getbyhwaddr_rcu(). This prevents PROVE_LOCKING warnings when rtnl_lock is held but RCU read lock is not. Extract common address comparison logic into dev_addr_cmp(). The context about this change could be found in the following discussion: Link: https://lore.kernel.org/all/20250206-scarlet-ermine-of-improvement-1fcac5@leitao/ Cc: kuniyu@amazon.com Cc: ushankar@purestorage.com Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250218-arm_fix_selftest-v5-1-d3d6892db9e1@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 4eae0ee0f1e6 ("arp: switch to dev_getbyhwaddr() in arp_req_set_public()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-07mm: update mark_victim tracepoints fieldsCarlos Galo
[ Upstream commit 72ba14deb40a9e9668ec5e66a341ed657e5215c2 ] The current implementation of the mark_victim tracepoint provides only the process ID (pid) of the victim process. This limitation poses challenges for userspace tools requiring real-time OOM analysis and intervention. Although this information is available from the kernel logs, it’s not the appropriate format to provide OOM notifications. In Android, BPF programs are used with the mark_victim trace events to notify userspace of an OOM kill. For consistency, update the trace event to include the same information about the OOMed victim as the kernel logs. - UID In Android each installed application has a unique UID. Including the `uid` assists in correlating OOM events with specific apps. - Process Name (comm) Enables identification of the affected process. - OOM Score Will allow userspace to get additional insight of the relative kill priority of the OOM victim. In Android, the oom_score_adj is used to categorize app state (foreground, background, etc.), which aids in analyzing user-perceptible impacts of OOM events [1]. - Total VM, RSS Stats, and pgtables Amount of memory used by the victim that will, potentially, be freed up by killing it. [1] https://cs.android.com/android/platform/superproject/main/+/246dc8fc95b6d93afcba5c6d6c133307abb3ac2e:frameworks/base/services/core/java/com/android/server/am/ProcessList.java;l=188-283 Signed-off-by: Carlos Galo <carlosgalo@google.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: ade81479c7dd ("memcg: fix soft lockup in the OOM process") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-21x86/i8253: Disable PIT timer 0 when not in useDavid Woodhouse
commit 70e6b7d9ae3c63df90a7bba7700e8d5c300c3c60 upstream. Leaving the PIT interrupt running can cause noticeable steal time for virtual guests. The VMM generally has a timer which toggles the IRQ input to the PIC and I/O APIC, which takes CPU time away from the guest. Even on real hardware, running the counter may use power needlessly (albeit not much). Make sure it's turned off if it isn't going to be used. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Michael Kelley <mhkelley@outlook.com> Link: https://lore.kernel.org/all/20240802135555.564941-1-dwmw2@infradead.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>