summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2024-03-06netfilter: bridge: confirm multicast packets before passing them up the stackFlorian Westphal
[ Upstream commit 62e7151ae3eb465e0ab52a20c941ff33bb6332e9 ] conntrack nf_confirm logic cannot handle cloned skbs referencing the same nf_conn entry, which will happen for multicast (broadcast) frames on bridges. Example: macvlan0 | br0 / \ ethX ethY ethX (or Y) receives a L2 multicast or broadcast packet containing an IP packet, flow is not yet in conntrack table. 1. skb passes through bridge and fake-ip (br_netfilter)Prerouting. -> skb->_nfct now references a unconfirmed entry 2. skb is broad/mcast packet. bridge now passes clones out on each bridge interface. 3. skb gets passed up the stack. 4. In macvlan case, macvlan driver retains clone(s) of the mcast skb and schedules a work queue to send them out on the lower devices. The clone skb->_nfct is not a copy, it is the same entry as the original skb. The macvlan rx handler then returns RX_HANDLER_PASS. 5. Normal conntrack hooks (in NF_INET_LOCAL_IN) confirm the orig skb. The Macvlan broadcast worker and normal confirm path will race. This race will not happen if step 2 already confirmed a clone. In that case later steps perform skb_clone() with skb->_nfct already confirmed (in hash table). This works fine. But such confirmation won't happen when eb/ip/nftables rules dropped the packets before they reached the nf_confirm step in postrouting. Pablo points out that nf_conntrack_bridge doesn't allow use of stateful nat, so we can safely discard the nf_conn entry and let inet call conntrack again. This doesn't work for bridge netfilter: skb could have a nat transformation. Also bridge nf prevents re-invocation of inet prerouting via 'sabotage_in' hook. Work around this problem by explicit confirmation of the entry at LOCAL_IN time, before upper layer has a chance to clone the unconfirmed entry. The downside is that this disables NAT and conntrack helpers. Alternative fix would be to add locking to all code parts that deal with unconfirmed packets, but even if that could be done in a sane way this opens up other problems, for example: -m physdev --physdev-out eth0 -j SNAT --snat-to 1.2.3.4 -m physdev --physdev-out eth1 -j SNAT --snat-to 1.2.3.5 For multicast case, only one of such conflicting mappings will be created, conntrack only handles 1:1 NAT mappings. Users should set create a setup that explicitly marks such traffic NOTRACK (conntrack bypass) to avoid this, but we cannot auto-bypass them, ruleset might have accept rules for untracked traffic already, so user-visible behaviour would change. Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217777 Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-06netfilter: let reset rules clean out conntrack entriesFlorian Westphal
[ Upstream commit 2954fe60e33da0f4de4d81a4c95c7dddb517d00c ] iptables/nftables support responding to tcp packets with tcp resets. The generated tcp reset packet passes through both output and postrouting netfilter hooks, but conntrack will never see them because the generated skb has its ->nfct pointer copied over from the packet that triggered the reset rule. If the reset rule is used for established connections, this may result in the conntrack entry to be around for a very long time (default timeout is 5 days). One way to avoid this would be to not copy the nf_conn pointer so that the rest packet passes through conntrack too. Problem is that output rules might not have the same conntrack zone setup as the prerouting ones, so its possible that the reset skb won't find the correct entry. Generating a template entry for the skb seems error prone as well. Add an explicit "closing" function that switches a confirmed conntrack entry to closed state and wire this up for tcp. If the entry isn't confirmed, no action is needed because the conntrack entry will never be committed to the table. Reported-by: Russel King <linux@armlinux.org.uk> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Stable-dep-of: 62e7151ae3eb ("netfilter: bridge: confirm multicast packets before passing them up the stack") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-06netfilter: make function op structures constFlorian Westphal
[ Upstream commit 285c8a7a58158cb1805c97ff03875df2ba2ea1fe ] No functional changes, these structures should be const. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Stable-dep-of: 62e7151ae3eb ("netfilter: bridge: confirm multicast packets before passing them up the stack") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-06netfilter: core: move ip_ct_attach indirection to struct nf_ct_hookFlorian Westphal
[ Upstream commit 3fce16493dc1aa2c9af3d7e7bd360dfe203a3e6a ] ip_ct_attach predates struct nf_ct_hook, we can place it there and remove the exported symbol. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Stable-dep-of: 62e7151ae3eb ("netfilter: bridge: confirm multicast packets before passing them up the stack") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-01fs/aio: Restrict kiocb_set_cancel_fn() to I/O submitted via libaioBart Van Assche
commit b820de741ae48ccf50dd95e297889c286ff4f760 upstream. If kiocb_set_cancel_fn() is called for I/O submitted via io_uring, the following kernel warning appears: WARNING: CPU: 3 PID: 368 at fs/aio.c:598 kiocb_set_cancel_fn+0x9c/0xa8 Call trace: kiocb_set_cancel_fn+0x9c/0xa8 ffs_epfile_read_iter+0x144/0x1d0 io_read+0x19c/0x498 io_issue_sqe+0x118/0x27c io_submit_sqes+0x25c/0x5fc __arm64_sys_io_uring_enter+0x104/0xab0 invoke_syscall+0x58/0x11c el0_svc_common+0xb4/0xf4 do_el0_svc+0x2c/0xb0 el0_svc+0x2c/0xa4 el0t_64_sync_handler+0x68/0xb4 el0t_64_sync+0x1a4/0x1a8 Fix this by setting the IOCB_AIO_RW flag for read and write I/O that is submitted by libaio. Suggested-by: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Cc: Avi Kivity <avi@scylladb.com> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: stable@vger.kernel.org Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240215204739.2677806-2-bvanassche@acm.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-03-01net: dev: Convert sa_data to flexible array in struct sockaddrKees Cook
[ Upstream commit b5f0de6df6dce8d641ef58ef7012f3304dffb9a1 ] One of the worst offenders of "fake flexible arrays" is struct sockaddr, as it is the classic example of why GCC and Clang have been traditionally forced to treat all trailing arrays as fake flexible arrays: in the distant misty past, sa_data became too small, and code started just treating it as a flexible array, even though it was fixed-size. The special case by the compiler is specifically that sizeof(sa->sa_data) and FORTIFY_SOURCE (which uses __builtin_object_size(sa->sa_data, 1)) do not agree (14 and -1 respectively), which makes FORTIFY_SOURCE treat it as a flexible array. However, the coming -fstrict-flex-arrays compiler flag will remove these special cases so that FORTIFY_SOURCE can gain coverage over all the trailing arrays in the kernel that are _not_ supposed to be treated as a flexible array. To deal with this change, convert sa_data to a true flexible array. To keep the structure size the same, move sa_data into a union with a newly introduced sa_data_min with the original size. The result is that FORTIFY_SOURCE can continue to have no idea how large sa_data may actually be, but anything using sizeof(sa->sa_data) must switch to sizeof(sa->sa_data_min). Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: Dylan Yudaken <dylany@fb.com> Cc: Yajun Deng <yajun.deng@linux.dev> Cc: Petr Machata <petrm@nvidia.com> Cc: Hangbin Liu <liuhangbin@gmail.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: syzbot <syzkaller@googlegroups.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221018095503.never.671-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: a7d6027790ac ("arp: Prevent overflow in arp_req_get().") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-01ata: libahci_platform: Introduce reset assertion/deassertion methodsSerge Semin
[ Upstream commit f67f12ff57bcfcd7d64280f748787793217faeaf ] Currently the ACHI-platform library supports only the assert and deassert reset signals and ignores the platforms with self-deasserting reset lines. That prone to having the platforms with self-deasserting reset method misbehaviour when it comes to resuming from sleep state after the clocks have been fully disabled. For such cases the controller needs to be fully reset all over after the reference clocks are enabled and stable, otherwise the controller state machine might be in an undetermined state. The best solution would be to auto-detect which reset method is supported by the particular platform and use it implicitly in the framework of the ahci_platform_enable_resources()/ahci_platform_disable_resources() methods. Alas it can't be implemented due to the AHCI-platform library already supporting the shared reset control lines. As [1] says in such case we have to use only one of the next methods: + reset_control_assert()/reset_control_deassert(); + reset_control_reset()/reset_control_rearm(). If the driver had an exclusive control over the reset lines we could have been able to manipulate the lines with no much limitation and just used the combination of the methods above to cover all the possible reset-control cases. Since the shared reset control has already been advertised and couldn't be changed with no risk to breaking the platforms relying on it, we have no choice but to make the platform drivers to determine which reset methods the platform reset system supports. In order to implement both types of reset control support we suggest to introduce the new AHCI-platform flag: AHCI_PLATFORM_RST_TRIGGER, which when passed to the ahci_platform_get_resources() method together with the AHCI_PLATFORM_GET_RESETS flag will indicate that the reset lines are self-deasserting thus the reset_control_reset()/reset_control_rearm() will be used to control the reset state. Otherwise the reset_control_deassert()/reset_control_assert() methods will be utilized. [1] Documentation/driver-api/reset.rst Signed-off-by: Serge Semin <Sergey.Semin@baikalelectronics.ru> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Stable-dep-of: 26c8404e162b ("ata: ahci_ceva: fix error handling for Xilinx GT PHY support") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-01posix-timers: Ensure timer ID search-loop limit is validThomas Gleixner
[ Upstream commit 8ce8849dd1e78dadcee0ec9acbd259d239b7069f ] posix_timer_add() tries to allocate a posix timer ID by starting from the cached ID which was stored by the last successful allocation. This is done in a loop searching the ID space for a free slot one by one. The loop has to terminate when the search wrapped around to the starting point. But that's racy vs. establishing the starting point. That is read out lockless, which leads to the following problem: CPU0 CPU1 posix_timer_add() start = sig->posix_timer_id; lock(hash_lock); ... posix_timer_add() if (++sig->posix_timer_id < 0) start = sig->posix_timer_id; sig->posix_timer_id = 0; So CPU1 can observe a negative start value, i.e. -1, and the loop break never happens because the condition can never be true: if (sig->posix_timer_id == start) break; While this is unlikely to ever turn into an endless loop as the ID space is huge (INT_MAX), the racy read of the start value caught the attention of KCSAN and Dmitry unearthed that incorrectness. Rewrite it so that all id operations are under the hash lock. Reported-by: syzbot+5c54bd3eb218bb595aa9@syzkaller.appspotmail.com Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/87bkhzdn6g.ffs@tglx Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-01PM: core: Remove static qualifier in DEFINE_SIMPLE_DEV_PM_OPS macroPaul Cercueil
[ Upstream commit 52cc1d7f9786d2be44a3ab9b5b48416a7618e713 ] Keep this macro in line with the other ones. This makes it possible to use them in the cases where the underlying dev_pm_ops structure is exported. Restore the "static" qualifier in the two drivers where the DEFINE_SIMPLE_DEV_PM_OPS macro was used. Signed-off-by: Paul Cercueil <paul@crapouillou.net> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Stable-dep-of: 18ab69c8ca56 ("Input: iqs269a - do not poll during suspend or resume") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-01PM: core: Add new *_PM_OPS macros, deprecate old onesPaul Cercueil
[ Upstream commit 1a3c7bb088266fa2db017be299f91f1c1894c857 ] This commit introduces the following macros: SYSTEM_SLEEP_PM_OPS() LATE_SYSTEM_SLEEP_PM_OPS() NOIRQ_SYSTEM_SLEEP_PM_OPS() RUNTIME_PM_OPS() These new macros are very similar to their SET_*_PM_OPS() equivalent. They however differ in the fact that the callbacks they set will always be seen as referenced by the compiler. This means that the callback functions don't need to be wrapped with a #ifdef CONFIG_PM guard, or tagged with __maybe_unused, to prevent the compiler from complaining about unused static symbols. The compiler will then simply evaluate at compile time whether or not these symbols are dead code. The callbacks that are only useful with CONFIG_PM_SLEEP is enabled, are now also wrapped with a new pm_sleep_ptr() macro, which is inspired from pm_ptr(). This is needed for drivers that use different callbacks for sleep and runtime PM, to handle the case where CONFIG_PM is set and CONFIG_PM_SLEEP is not. This commit also deprecates the following macros: SIMPLE_DEV_PM_OPS() UNIVERSAL_DEV_PM_OPS() And introduces the following macros: DEFINE_SIMPLE_DEV_PM_OPS() DEFINE_UNIVERSAL_DEV_PM_OPS() These macros are similar to the functions they were created to replace, with the following differences: - They use the new macros introduced above, and as such always reference the provided callback functions. - They are not tagged with __maybe_unused. They are meant to be used with pm_ptr() or pm_sleep_ptr() for DEFINE_UNIVERSAL_DEV_PM_OPS() and DEFINE_SIMPLE_DEV_PM_OPS() respectively. - They declare the symbol static, since every driver seems to do that anyway; and if a non-static use-case is needed an indirection pointer could be used. The point of this change, is to progressively switch from a code model where PM callbacks are all protected behind CONFIG_PM guards, to a code model where the PM callbacks are always seen by the compiler, but discarded if not used. Signed-off-by: Paul Cercueil <paul@crapouillou.net> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Stable-dep-of: 18ab69c8ca56 ("Input: iqs269a - do not poll during suspend or resume") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-01PM: core: Redefine pm_ptr() macroPaul Cercueil
[ Upstream commit c06ef740d401d0f4ab188882bf6f8d9cf0f75eaf ] The pm_ptr() macro was previously conditionally defined, according to the value of the CONFIG_PM option. This meant that the pointed structure was either referenced (if CONFIG_PM was set), or never referenced (if CONFIG_PM was not set), causing it to be detected as unused by the compiler. This worked fine, but required the __maybe_unused compiler attribute to be used to every symbol pointed to by a pointer wrapped with pm_ptr(). We can do better. With this change, the pm_ptr() is now defined the same, independently of the value of CONFIG_PM. It now uses the (?:) ternary operator to conditionally resolve to its argument. Since the condition is known at compile time, the compiler will then choose to discard the unused symbols, which won't need to be tagged with __maybe_unused anymore. This pm_ptr() macro is usually used with pointers to dev_pm_ops structures created with SIMPLE_DEV_PM_OPS() or similar macros. These do use a __maybe_unused flag, which is now useless with this change, so it later can be removed. However in the meantime it causes no harm, and all the drivers still compile fine with the new pm_ptr() macro. Signed-off-by: Paul Cercueil <paul@crapouillou.net> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Stable-dep-of: 18ab69c8ca56 ("Input: iqs269a - do not poll during suspend or resume") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-01kernel/sched: Remove dl_boosted flag commentHui Su
[ Upstream commit 0e3872499de1a1230cef5221607d71aa09264bd5 ] since commit 2279f540ea7d ("sched/deadline: Fix priority inheritance with multiple scheduling classes"), we should not keep it here. Signed-off-by: Hui Su <suhui_kernel@163.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lore.kernel.org/r/20220107095254.GA49258@localhost.localdomain Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-01clk: linux/clk-provider.h: fix kernel-doc warnings and typosRandy Dunlap
[ Upstream commit 84aefafe6b294041b7fa0757414c4a29c1bdeea2 ] Fix spelling of "Structure". Fix multiple kernel-doc warnings: clk-provider.h:269: warning: Function parameter or member 'recalc_rate' not described in 'clk_ops' clk-provider.h:468: warning: Function parameter or member 'parent_data' not described in 'clk_hw_register_fixed_rate_with_accuracy_parent_data' clk-provider.h:468: warning: Excess function parameter 'parent_name' description in 'clk_hw_register_fixed_rate_with_accuracy_parent_data' clk-provider.h:482: warning: Function parameter or member 'parent_data' not described in 'clk_hw_register_fixed_rate_parent_accuracy' clk-provider.h:482: warning: Excess function parameter 'parent_name' description in 'clk_hw_register_fixed_rate_parent_accuracy' clk-provider.h:687: warning: Function parameter or member 'flags' not described in 'clk_divider' clk-provider.h:1164: warning: Function parameter or member 'flags' not described in 'clk_fractional_divider' clk-provider.h:1164: warning: Function parameter or member 'approximation' not described in 'clk_fractional_divider' clk-provider.h:1213: warning: Function parameter or member 'flags' not described in 'clk_multiplier' Fixes: 9fba738a53dd ("clk: add duty cycle support") Fixes: b2476490ef11 ("clk: introduce the common clock framework") Fixes: 2d34f09e79c9 ("clk: fixed-rate: Add support for specifying parents via DT/pointers") Fixes: f5290d8e4f0c ("clk: asm9260: use parent index to link the reference clock") Fixes: 9d9f78ed9af0 ("clk: basic clock hardware types") Fixes: e2d0e90fae82 ("clk: new basic clk type for fractional divider") Fixes: f2e0a53271a4 ("clk: Add a basic multiplier clock") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Michael Turquette <mturquette@baylibre.com> Cc: Stephen Boyd <sboyd@kernel.org> Cc: linux-clk@vger.kernel.org Link: https://lore.kernel.org/r/20230930221428.18463-1-rdunlap@infradead.org Signed-off-by: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-01bpf: Remove trace_printk_lockJiri Olsa
commit e2bb9e01d589f7fa82573aedd2765ff9b277816a upstream. Both bpf_trace_printk and bpf_trace_vprintk helpers use static buffer guarded with trace_printk_lock spin lock. The spin lock contention causes issues with bpf programs attached to contention_begin tracepoint [1][2]. Andrii suggested we could get rid of the contention by using trylock, but we could actually get rid of the spinlock completely by using percpu buffers the same way as for bin_args in bpf_bprintf_prepare function. Adding new return 'buf' argument to struct bpf_bprintf_data and making bpf_bprintf_prepare to return also the buffer for printk helpers. [1] https://lore.kernel.org/bpf/CACkBjsakT_yWxnSWr4r-0TpPvbKm9-OBmVUhJb7hV3hY8fdCkw@mail.gmail.com/ [2] https://lore.kernel.org/bpf/CACkBjsaCsTovQHFfkqJKto6S4Z8d02ud1D7MPESrHa1cVNNTrw@mail.gmail.com/ Reported-by: Hao Sun <sunhao.th@gmail.com> Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221215214430.1336195-4-jolsa@kernel.org [cascardo: there is no bpf_trace_vprintk in 5.15] Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-03-01bpf: Do cleanup in bpf_bprintf_cleanup only when neededJiri Olsa
commit f19a4050455aad847fb93f18dc1fe502eb60f989 upstream. Currently we always cleanup/decrement bpf_bprintf_nest_level variable in bpf_bprintf_cleanup if it's > 0. There's possible scenario where this could cause a problem, when bpf_bprintf_prepare does not get bin_args buffer (because num_args is 0) and following bpf_bprintf_cleanup call decrements bpf_bprintf_nest_level variable, like: in task context: bpf_bprintf_prepare(num_args != 0) increments 'bpf_bprintf_nest_level = 1' -> first irq : bpf_bprintf_prepare(num_args == 0) bpf_bprintf_cleanup decrements 'bpf_bprintf_nest_level = 0' -> second irq: bpf_bprintf_prepare(num_args != 0) bpf_bprintf_nest_level = 1 gets same buffer as task context above Adding check to bpf_bprintf_cleanup and doing the real cleanup only if we got bin_args data in the first place. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221215214430.1336195-3-jolsa@kernel.org [cascardo: there is no bpf_trace_vprintk in 5.15] Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-03-01bpf: Add struct for bin_args arg in bpf_bprintf_prepareJiri Olsa
commit 78aa1cc9404399a15d2a1205329c6a06236f5378 upstream. Adding struct bpf_bprintf_data to hold bin_args argument for bpf_bprintf_prepare function. We will add another return argument to bpf_bprintf_prepare and pass the struct to bpf_bprintf_cleanup for proper cleanup in following changes. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221215214430.1336195-2-jolsa@kernel.org [cascardo: there is no bpf_trace_vprintk in 5.15] Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-03-01bpf: Merge printk and seq_printf VARARG max macrosDave Marchevsky
commit 335ff4990cf3bfa42d8846f9b3d8c09456f51801 upstream. MAX_SNPRINTF_VARARGS and MAX_SEQ_PRINTF_VARARGS are used by bpf helpers bpf_snprintf and bpf_seq_printf to limit their varargs. Both call into bpf_bprintf_prepare for print formatting logic and have convenience macros in libbpf (BPF_SNPRINTF, BPF_SEQ_PRINTF) which use the same helper macros to convert varargs to a byte array. Changing shared functionality to support more varargs for either bpf helper would affect the other as well, so let's combine the _VARARGS macros to make this more obvious. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210917182911.2426606-2-davemarchevsky@fb.com Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23PM: runtime: Have devm_pm_runtime_enable() handle ↵Douglas Anderson
pm_runtime_dont_use_autosuspend() [ Upstream commit b4060db9251f919506e4d672737c6b8ab9a84701 ] The PM Runtime docs say: Drivers in ->remove() callback should undo the runtime PM changes done in ->probe(). Usually this means calling pm_runtime_disable(), pm_runtime_dont_use_autosuspend() etc. >From grepping code, it's clear that many people aren't aware of the need to call pm_runtime_dont_use_autosuspend(). When brainstorming solutions, one idea that came up was to leverage the new-ish devm_pm_runtime_enable() function. The idea here is that: * When the devm action is called we know that the driver is being removed. It's the perfect time to undo the use_autosuspend. * The code of pm_runtime_dont_use_autosuspend() already handles the case of being called when autosuspend wasn't enabled. Suggested-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Stable-dep-of: 3d07a411b4fa ("drm/msm/dsi: Use pm_runtime_resume_and_get to prevent refcnt leaks") Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23netfilter: ipset: fix performance regression in swap operationJozsef Kadlecsik
commit 97f7cf1cd80eeed3b7c808b7c12463295c751001 upstream. The patch "netfilter: ipset: fix race condition between swap/destroy and kernel side add/del/test", commit 28628fa9 fixes a race condition. But the synchronize_rcu() added to the swap function unnecessarily slows it down: it can safely be moved to destroy and use call_rcu() instead. Eric Dumazet pointed out that simply calling the destroy functions as rcu callback does not work: sets with timeout use garbage collectors which need cancelling at destroy which can wait. Therefore the destroy functions are split into two: cancelling garbage collectors safely at executing the command received by netlink and moving the remaining part only into the rcu callback. Link: https://lore.kernel.org/lkml/C0829B10-EAA6-4809-874E-E1E9C05A8D84@automattic.com/ Fixes: 28628fa952fe ("netfilter: ipset: fix race condition between swap/destroy and kernel side add/del/test") Reported-by: Ale Crismani <ale.crismani@automattic.com> Reported-by: David Wang <00107082@163.com> Tested-by: David Wang <00107082@163.com> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23fbdev: Fix incorrect page mapping clearance at fb_deferred_io_release()Takashi Iwai
[ Upstream commit fe9ae05cfbe587dda724fcf537c00bc2f287da62 ] The recent fix for the deferred I/O by the commit 3efc61d95259 ("fbdev: Fix invalid page access after closing deferred I/O devices") caused a regression when the same fb device is opened/closed while it's being used. It resulted in a frozen screen even if something is redrawn there after the close. The breakage is because the patch was made under a wrong assumption of a single open; in the current code, fb_deferred_io_release() cleans up the page mapping of the pageref list and it calls cancel_delayed_work_sync() unconditionally, where both are no correct behavior for multiple opens. This patch adds a refcount for the opens of the device, and applies the cleanup only when all files get closed. As both fb_deferred_io_open() and _close() are called always in the fb_info lock (mutex), it's safe to use the normal int for the refcounting. Also, a useless BUG_ON() is dropped. Fixes: 3efc61d95259 ("fbdev: Fix invalid page access after closing deferred I/O devices") Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de> Reviewed-by: Patrik Jakobsson <patrik.r.jakobsson@gmail.com> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20230308105012.1845-1-tiwai@suse.de Stable-dep-of: 33cd6ea9c067 ("fbdev: flush deferred IO before closing") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23fbdev: Fix invalid page access after closing deferred I/O devicesTakashi Iwai
[ Upstream commit 3efc61d95259956db25347e2a9562c3e54546e20 ] When a fbdev with deferred I/O is once opened and closed, the dirty pages still remain queued in the pageref list, and eventually later those may be processed in the delayed work. This may lead to a corruption of pages, hitting an Oops. This patch makes sure to cancel the delayed work and clean up the pageref list at closing the device for addressing the bug. A part of the cleanup code is factored out as a new helper function that is called from the common fb_release(). Reviewed-by: Patrik Jakobsson <patrik.r.jakobsson@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de> Tested-by: Miko Larsson <mikoxyzzz@gmail.com> Fixes: 56c134f7f1b5 ("fbdev: Track deferred-I/O pages in pageref struct") Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20230129082856.22113-1-tiwai@suse.de Stable-dep-of: 33cd6ea9c067 ("fbdev: flush deferred IO before closing") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23fbdev: Rename pagelist to pagereflist for deferred I/OThomas Zimmermann
[ Upstream commit e80eec1b871a2acb8f5c92db4c237e9ae6dd322b ] Rename various instances of pagelist to pagereflist. The list now stores pageref structures, so the new name is more appropriate. In their write-back helpers, several fbdev drivers refer to the pageref list in struct fb_deferred_io instead of using the one supplied as argument to the function. Convert them over to the supplied one. It's the same instance, so no change of behavior occurs. v4: * fix commit message (Javier) Suggested-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Javier Martinez Canillas <javierm@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220429100834.18898-5-tzimmermann@suse.de Stable-dep-of: 33cd6ea9c067 ("fbdev: flush deferred IO before closing") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23fbdev: Track deferred-I/O pages in pageref structThomas Zimmermann
[ Upstream commit 56c134f7f1b58be08bdb0ca8372474a4a5165f31 ] Store the per-page state for fbdev's deferred I/O in struct fb_deferred_io_pageref. Maintain a list of pagerefs for the pages that have to be written back to video memory. Update all affected drivers. As with pages before, fbdev acquires a pageref when an mmaped page of the framebuffer is being written to. It holds the pageref in a list of all currently written pagerefs until it flushes the written pages to video memory. Writeback occurs periodically. After writeback fbdev releases all pagerefs and builds up a new dirty list until the next writeback occurs. Using pagerefs has a number of benefits. For pages of the framebuffer, the deferred I/O code used struct page.lru as an entry into the list of dirty pages. The lru field is owned by the page cache, which makes deferred I/O incompatible with some memory pages (e.g., most notably DRM's GEM SHMEM allocator). struct fb_deferred_io_pageref now provides an entry into a list of dirty framebuffer pages, freeing lru for use with the page cache. Drivers also assumed that struct page.index is the page offset into the framebuffer. This is not true for DRM buffers, which are located at various offset within a mapped area. struct fb_deferred_io_pageref explicitly stores an offset into the framebuffer. struct page.index is now only the page offset into the mapped area. These changes will allow DRM to use fbdev deferred I/O without an intermediate shadow buffer. v3: * use pageref->offset for sorting * fix grammar in comment v2: * minor fixes in commit message Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Javier Martinez Canillas <javierm@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220429100834.18898-3-tzimmermann@suse.de Stable-dep-of: 33cd6ea9c067 ("fbdev: flush deferred IO before closing") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23fbdev: Don't sort deferred-I/O pages by defaultThomas Zimmermann
[ Upstream commit 8c30e2d81bfddc5ab9f6b04db1c0f7d6ca7bdf46 ] Fbdev's deferred I/O sorts all dirty pages by default, which incurs a significant overhead. Make the sorting step optional and update the few drivers that require it. Use a FIFO list by default. Most fbdev drivers with deferred I/O build a bounding rectangle around the dirty pages or simply flush the whole screen. The only two affected DRM drivers, generic fbdev and vmwgfx, both use a bounding rectangle. In those cases, the exact order of the pages doesn't matter. The other drivers look at the page index or handle pages one-by-one. The patch sets the sort_pagelist flag for those, even though some of them would probably work correctly without sorting. Driver maintainers should update their driver accordingly. Sorting pages by memory offset for deferred I/O performs an implicit bubble-sort step on the list of dirty pages. The algorithm goes through the list of dirty pages and inserts each new page according to its index field. Even worse, list traversal always starts at the first entry. As video memory is most likely updated scanline by scanline, the algorithm traverses through the complete list for each updated page. For example, with 1024x768x32bpp each page covers exactly one scanline. Writing a single screen update from top to bottom requires updating 768 pages. With an average list length of 384 entries, a screen update creates (768 * 384 =) 294912 compare operation. Fix this by making the sorting step opt-in and update the few drivers that require it. All other drivers work with unsorted page lists. Pages are appended to the list. Therefore, in the common case of writing the framebuffer top to bottom, pages are still sorted by offset, which may have a positive effect on performance. Playing a video [1] in mplayer's benchmark mode shows the difference (i7-4790, FullHD, simpledrm, kernel with debugging). mplayer -benchmark -nosound -vo fbdev ./big_buck_bunny_720p_stereo.ogg With sorted page lists: BENCHMARKs: VC: 32.960s VO: 73.068s A: 0.000s Sys: 2.413s = 108.441s BENCHMARK%: VC: 30.3947% VO: 67.3802% A: 0.0000% Sys: 2.2251% = 100.0000% With unsorted page lists: BENCHMARKs: VC: 31.005s VO: 42.889s A: 0.000s Sys: 2.256s = 76.150s BENCHMARK%: VC: 40.7156% VO: 56.3219% A: 0.0000% Sys: 2.9625% = 100.0000% VC shows the overhead of video decoding, VO shows the overhead of the video output. Using unsorted page lists reduces the benchmark's run time by ~32s/~25%. v2: * Make sorted pagelists the special case (Sam) * Comment on drivers' use of pagelist (Sam) * Warn about the overhead in comment Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Acked-by: Sam Ravnborg <sam@ravnborg.org> Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://download.blender.org/peach/bigbuckbunny_movies/big_buck_bunny_720p_stereo.ogg # [1] Link: https://patchwork.freedesktop.org/patch/msgid/20220211094640.21632-3-tzimmermann@suse.de Stable-dep-of: 33cd6ea9c067 ("fbdev: flush deferred IO before closing") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23dma-buf: add dma_fence_timestamp helperChristian König
[ Upstream commit b83ce9cb4a465b8f9a3fa45561b721a9551f60e3 ] When a fence signals there is a very small race window where the timestamp isn't updated yet. sync_file solves this by busy waiting for the timestamp to appear, but on other ocassions didn't handled this correctly. Provide a dma_fence_timestamp() helper function for this and use it in all appropriate cases. Another alternative would be to grab the spinlock when that happens. v2 by teddy: add a wait parameter to wait for the timestamp to show up, in case the accurate timestamp is needed and/or the timestamp is not based on ktime (e.g. hw timestamp) v3 chk: drop the parameter again for unified handling Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Fixes: 1774baa64f93 ("drm/scheduler: Change scheduled fence track v2") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> CC: stable@vger.kernel.org Link: https://patchwork.freedesktop.org/patch/msgid/20230929104725.2358-1-christian.koenig@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23hrtimer: Report offline hrtimer enqueueFrederic Weisbecker
commit dad6a09f3148257ac1773cd90934d721d68ab595 upstream. The hrtimers migration on CPU-down hotplug process has been moved earlier, before the CPU actually goes to die. This leaves a small window of opportunity to queue an hrtimer in a blind spot, leaving it ignored. For example a practical case has been reported with RCU waking up a SCHED_FIFO task right before the CPUHP_AP_IDLE_DEAD stage, queuing that way a sched/rt timer to the local offline CPU. Make sure such situations never go unnoticed and warn when that happens. Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240129235646.3171983-4-boqun.feng@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23dmaengine: fix is_slave_direction() return false when DMA_DEV_TO_DEVFrank Li
[ Upstream commit a22fe1d6dec7e98535b97249fdc95c2be79120bb ] is_slave_direction() should return true when direction is DMA_DEV_TO_DEV. Fixes: 49920bc66984 ("dmaengine: add new enum dma_transfer_direction") Signed-off-by: Frank Li <Frank.Li@nxp.com> Link: https://lore.kernel.org/r/20240123172842.3764529-1-Frank.Li@nxp.com Signed-off-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23PCI: add INTEL_HDA_ARL to pci_ids.hPierre-Louis Bossart
[ Upstream commit 5ec42bf04d72fd6d0a6855810cc779e0ee31dfd7 ] The PCI ID insertion follows the increasing order in the table, but this hardware follows MTL (MeteorLake). Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Péter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> Acked-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20231204212710.185976-2-pierre-louis.bossart@linux.intel.com Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23bpf: Add map and need_defer parameters to .map_fd_put_ptr()Hou Tao
[ Upstream commit 20c20bd11a0702ce4dc9300c3da58acf551d9725 ] map is the pointer of outer map, and need_defer needs some explanation. need_defer tells the implementation to defer the reference release of the passed element and ensure that the element is still alive before the bpf program, which may manipulate it, exits. The following three cases will invoke map_fd_put_ptr() and different need_defer values will be passed to these callers: 1) release the reference of the old element in the map during map update or map deletion. The release must be deferred, otherwise the bpf program may incur use-after-free problem, so need_defer needs to be true. 2) release the reference of the to-be-added element in the error path of map update. The to-be-added element is not visible to any bpf program, so it is OK to pass false for need_defer parameter. 3) release the references of all elements in the map during map release. Any bpf program which has access to the map must have been exited and released, so need_defer=false will be OK. These two parameters will be used by the following patches to fix the potential use-after-free problem for map-in-map. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231204140425.1480317-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23arch: consolidate arch_irq_work_raise prototypesArnd Bergmann
[ Upstream commit 64bac5ea17d527872121adddfee869c7a0618f8f ] The prototype was hidden in an #ifdef on x86, which causes a warning: kernel/irq_work.c:72:13: error: no previous prototype for 'arch_irq_work_raise' [-Werror=missing-prototypes] Some architectures have a working prototype, while others don't. Fix this by providing it in only one place that is always visible. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Acked-by: Guo Ren <guoren@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23x86/entry/ia32: Ensure s32 is sign extended to s64Richard Palethorpe
commit 56062d60f117dccfb5281869e0ab61e090baf864 upstream. Presently ia32 registers stored in ptregs are unconditionally cast to unsigned int by the ia32 stub. They are then cast to long when passed to __se_sys*, but will not be sign extended. This takes the sign of the syscall argument into account in the ia32 stub. It still casts to unsigned int to avoid implementation specific behavior. However then casts to int or unsigned int as necessary. So that the following cast to long sign extends the value. This fixes the io_pgetevents02 LTP test when compiled with -m32. Presently the systemcall io_pgetevents_time64() unexpectedly accepts -1 for the maximum number of events. It doesn't appear other systemcalls with signed arguments are effected because they all have compat variants defined and wired up. Fixes: ebeb8c82ffaf ("syscalls/x86: Use 'struct pt_regs' based syscall calling for IA32_EMULATION and x32") Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Richard Palethorpe <rpalethorpe@suse.com> Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240110130122.3836513-1-nik.borisov@suse.com Link: https://lore.kernel.org/ltp/20210921130127.24131-1-rpalethorpe@suse.com/ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23mm/sparsemem: fix race in accessing memory_section->usageCharan Teja Kalla
[ Upstream commit 5ec8e8ea8b7783fab150cf86404fc38cb4db8800 ] The below race is observed on a PFN which falls into the device memory region with the system memory configuration where PFN's are such that [ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL]. Since normal zone start and end pfn contains the device memory PFN's as well, the compaction triggered will try on the device memory PFN's too though they end up in NOP(because pfn_to_online_page() returns NULL for ZONE_DEVICE memory sections). When from other core, the section mappings are being removed for the ZONE_DEVICE region, that the PFN in question belongs to, on which compaction is currently being operated is resulting into the kernel crash with CONFIG_SPASEMEM_VMEMAP enabled. The crash logs can be seen at [1]. compact_zone() memunmap_pages ------------- --------------- __pageblock_pfn_to_page ...... (a)pfn_valid(): valid_section()//return true (b)__remove_pages()-> sparse_remove_section()-> section_deactivate(): [Free the array ms->usage and set ms->usage = NULL] pfn_section_valid() [Access ms->usage which is NULL] NOTE: From the above it can be said that the race is reduced to between the pfn_valid()/pfn_section_valid() and the section deactivate with SPASEMEM_VMEMAP enabled. The commit b943f045a9af("mm/sparse: fix kernel crash with pfn_section_valid check") tried to address the same problem by clearing the SECTION_HAS_MEM_MAP with the expectation of valid_section() returns false thus ms->usage is not accessed. Fix this issue by the below steps: a) Clear SECTION_HAS_MEM_MAP before freeing the ->usage. b) RCU protected read side critical section will either return NULL when SECTION_HAS_MEM_MAP is cleared or can successfully access ->usage. c) Free the ->usage with kfree_rcu() and set ms->usage = NULL. No attempt will be made to access ->usage after this as the SECTION_HAS_MEM_MAP is cleared thus valid_section() return false. Thanks to David/Pavan for their inputs on this patch. [1] https://lore.kernel.org/linux-mm/994410bb-89aa-d987-1f50-f514903c55aa@quicinc.com/ On Snapdragon SoC, with the mentioned memory configuration of PFN's as [ZONE_NORMAL ZONE_DEVICE ZONE_NORMAL], we are able to see bunch of issues daily while testing on a device farm. For this particular issue below is the log. Though the below log is not directly pointing to the pfn_section_valid(){ ms->usage;}, when we loaded this dump on T32 lauterbach tool, it is pointing. [ 540.578056] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 540.578068] Mem abort info: [ 540.578070] ESR = 0x0000000096000005 [ 540.578073] EC = 0x25: DABT (current EL), IL = 32 bits [ 540.578077] SET = 0, FnV = 0 [ 540.578080] EA = 0, S1PTW = 0 [ 540.578082] FSC = 0x05: level 1 translation fault [ 540.578085] Data abort info: [ 540.578086] ISV = 0, ISS = 0x00000005 [ 540.578088] CM = 0, WnR = 0 [ 540.579431] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBSBTYPE=--) [ 540.579436] pc : __pageblock_pfn_to_page+0x6c/0x14c [ 540.579454] lr : compact_zone+0x994/0x1058 [ 540.579460] sp : ffffffc03579b510 [ 540.579463] x29: ffffffc03579b510 x28: 0000000000235800 x27:000000000000000c [ 540.579470] x26: 0000000000235c00 x25: 0000000000000068 x24:ffffffc03579b640 [ 540.579477] x23: 0000000000000001 x22: ffffffc03579b660 x21:0000000000000000 [ 540.579483] x20: 0000000000235bff x19: ffffffdebf7e3940 x18:ffffffdebf66d140 [ 540.579489] x17: 00000000739ba063 x16: 00000000739ba063 x15:00000000009f4bff [ 540.579495] x14: 0000008000000000 x13: 0000000000000000 x12:0000000000000001 [ 540.579501] x11: 0000000000000000 x10: 0000000000000000 x9 :ffffff897d2cd440 [ 540.579507] x8 : 0000000000000000 x7 : 0000000000000000 x6 :ffffffc03579b5b4 [ 540.579512] x5 : 0000000000027f25 x4 : ffffffc03579b5b8 x3 :0000000000000001 [ 540.579518] x2 : ffffffdebf7e3940 x1 : 0000000000235c00 x0 :0000000000235800 [ 540.579524] Call trace: [ 540.579527] __pageblock_pfn_to_page+0x6c/0x14c [ 540.579533] compact_zone+0x994/0x1058 [ 540.579536] try_to_compact_pages+0x128/0x378 [ 540.579540] __alloc_pages_direct_compact+0x80/0x2b0 [ 540.579544] __alloc_pages_slowpath+0x5c0/0xe10 [ 540.579547] __alloc_pages+0x250/0x2d0 [ 540.579550] __iommu_dma_alloc_noncontiguous+0x13c/0x3fc [ 540.579561] iommu_dma_alloc+0xa0/0x320 [ 540.579565] dma_alloc_attrs+0xd4/0x108 [quic_charante@quicinc.com: use kfree_rcu() in place of synchronize_rcu(), per David] Link: https://lkml.kernel.org/r/1698403778-20938-1-git-send-email-quic_charante@quicinc.com Link: https://lkml.kernel.org/r/1697202267-23600-1-git-send-email-quic_charante@quicinc.com Fixes: f46edbd1b151 ("mm/sparsemem: add helpers track active portions of a section at boot") Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23mm: use __pfn_to_section() instead of open coding itRolf Eike Beer
[ Upstream commit f1dc0db296bd25960273649fc6ef2ecbf5aaa0e0 ] It is defined in the same file just a few lines above. Link: https://lkml.kernel.org/r/4598487.Rc0NezkW7i@mobilepool36.emlix.com Signed-off-by: Rolf Eike Beer <eb@emlix.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Stable-dep-of: 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing memory_section->usage") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23fs/pipe: move check to pipe_has_watch_queue()Max Kellermann
[ Upstream commit b4bd6b4bac8edd61eb8f7b836969d12c0c6af165 ] This declutters the code by reducing the number of #ifdefs and makes the watch_queue checks simpler. This has no runtime effect; the machine code is identical. Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Message-Id: <20230921075755.1378787-2-max.kellermann@ionos.com> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Stable-dep-of: e95aada4cb93 ("pipe: wakeup wr_wait after setting max_usage") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-23overflow: Allow mixed type argumentsKees Cook
commit d219d2a9a92e39aa92799efe8f2aa21259b6dd82 upstream. When the check_[op]_overflow() helpers were introduced, all arguments were required to be the same type to make the fallback macros simpler. However, now that the fallback macros have been removed[1], it is fine to allow mixed types, which makes using the helpers much more useful, as they can be used to test for type-based overflows (e.g. adding two large ints but storing into a u8), as would be handy in the drm core[2]. Remove the restriction, and add additional self-tests that exercise some of the mixed-type overflow cases, and double-check for accidental macro side-effects. [1] https://git.kernel.org/linus/4eb6bd55cfb22ffc20652732340c4962f3ac9a91 [2] https://lore.kernel.org/lkml/20220824084514.2261614-2-gwan-gyeong.mun@intel.com Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: linux-hardening@vger.kernel.org Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com> Reviewed-by: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Tested-by: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Signed-off-by: Kees Cook <keescook@chromium.org> [ dropped the test portion of the commit as that doesn't apply to 5.15.y - gregkh] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23lsm: new security_file_ioctl_compat() hookAlfred Piccioni
commit f1bb47a31dff6d4b34fb14e99850860ee74bb003 upstream. Some ioctl commands do not require ioctl permission, but are routed to other permissions such as FILE_GETATTR or FILE_SETATTR. This routing is done by comparing the ioctl cmd to a set of 64-bit flags (FS_IOC_*). However, if a 32-bit process is running on a 64-bit kernel, it emits 32-bit flags (FS_IOC32_*) for certain ioctl operations. These flags are being checked erroneously, which leads to these ioctl operations being routed to the ioctl permission, rather than the correct file permissions. This was also noted in a RED-PEN finding from a while back - "/* RED-PEN how should LSM module know it's handling 32bit? */". This patch introduces a new hook, security_file_ioctl_compat(), that is called from the compat ioctl syscall. All current LSMs have been changed to support this hook. Reviewing the three places where we are currently using security_file_ioctl(), it appears that only SELinux needs a dedicated compat change; TOMOYO and SMACK appear to be functional without any change. Cc: stable@vger.kernel.org Fixes: 0b24dcb7f2f7 ("Revert "selinux: simplify ioctl checking"") Signed-off-by: Alfred Piccioni <alpic@google.com> Reviewed-by: Stephen Smalley <stephen.smalley.work@gmail.com> [PM: subject tweak, line length fixes, and alignment corrections] Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23async: Introduce async_schedule_dev_nocall()Rafael J. Wysocki
commit 7d4b5d7a37bdd63a5a3371b988744b060d5bb86f upstream. In preparation for subsequent changes, introduce a specialized variant of async_schedule_dev() that will not invoke the argument function synchronously when it cannot be scheduled for asynchronous execution. The new function, async_schedule_dev_nocall(), will be used for fixing possible deadlocks in the system-wide power management core code. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> for the series. Tested-by: Youngmin Nam <youngmin.nam@samsung.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-01-25iio: adc: ad9467: fix scale settingNuno Sa
[ Upstream commit b73f08bb7fe5a0901646ca5ceaa1e7a2d5ee6293 ] When reading in_voltage_scale we can get something like: root@analog:/sys/bus/iio/devices/iio:device2# cat in_voltage_scale 0.038146 However, when reading the available options: root@analog:/sys/bus/iio/devices/iio:device2# cat in_voltage_scale_available 2000.000000 2100.000006 2200.000007 2300.000008 2400.000009 2500.000010 which does not make sense. Moreover, when trying to set a new scale we get an error because there's no call to __ad9467_get_scale() to give us values as given when reading in_voltage_scale. Fix it by computing the available scales during probe and properly pass the list when .read_available() is called. While at it, change to use .read_available() from iio_info. Also note that to properly fix this, adi-axi-adc.c has to be changed accordingly. Fixes: ad6797120238 ("iio: adc: ad9467: add support AD9467 ADC") Signed-off-by: Nuno Sa <nuno.sa@analog.com> Reviewed-by: David Lechner <dlechner@baylibre.com> Link: https://lore.kernel.org/r/20231207-iio-backend-prep-v2-4-a4a33bc4d70e@analog.com Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-25dma-mapping: Fix build error unused-valueRen Zhijie
commit 50d6281ce9b8412f7ef02d1bc9d23aa62ae0cf98 upstream. If CONFIG_DMA_DECLARE_COHERENT is not set, make ARCH=x86_64 CROSS_COMPILE=x86_64-linux-gnu- will be failed, like this: drivers/remoteproc/remoteproc_core.c: In function ‘rproc_rvdev_release’: ./include/linux/dma-map-ops.h:182:42: error: statement with no effect [-Werror=unused-value] #define dma_release_coherent_memory(dev) (0) ^ drivers/remoteproc/remoteproc_core.c:464:2: note: in expansion of macro ‘dma_release_coherent_memory’ dma_release_coherent_memory(dev); ^~~~~~~~~~~~~~~~~~~~~~~~~~~ cc1: all warnings being treated as errors The return type of function dma_release_coherent_memory in CONFIG_DMA_DECLARE_COHERENT area is void, so in !CONFIG_DMA_DECLARE_COHERENT area it should neither return any value nor be defined as zero. Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: e61c451476e6 ("dma-mapping: Add dma_release_coherent_memory to DMA API") Signed-off-by: Ren Zhijie <renzhijie2@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220630123528.251181-1-renzhijie2@huawei.com Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-01-25clk: fixed-rate: fix clk_hw_register_fixed_rate_with_accuracy_parent_hwThéo Lebrun
[ Upstream commit ee0cf5e07f44a10fce8f1bfa9db226c0b5ecf880 ] Add missing comma and remove extraneous NULL argument. The macro is currently used by no one which explains why the typo slipped by. Fixes: 2d34f09e79c9 ("clk: fixed-rate: Add support for specifying parents via DT/pointers") Signed-off-by: Théo Lebrun <theo.lebrun@bootlin.com> Link: https://lore.kernel.org/r/20231218-mbly-clk-v1-1-44ce54108f06@bootlin.com Signed-off-by: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-25clk: fixed-rate: add devm_clk_hw_register_fixed_rateDmitry Baryshkov
[ Upstream commit 1d7d20658534c7d36fe6f4252f6f1a27d9631a99 ] Add devm_clk_hw_register_fixed_rate(), devres-managed helper to register fixed-rate clock. Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Link: https://lore.kernel.org/r/20220916061740.87167-3-dmitry.baryshkov@linaro.org Signed-off-by: Stephen Boyd <sboyd@kernel.org> Stable-dep-of: ee0cf5e07f44 ("clk: fixed-rate: fix clk_hw_register_fixed_rate_with_accuracy_parent_hw") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-25clk: asm9260: use parent index to link the reference clockDmitry Baryshkov
[ Upstream commit f5290d8e4f0caa81a491448a27dd70e726095d07 ] Rewrite clk-asm9260 to use parent index to use the reference clock. During this rework two helpers are added: - clk_hw_register_mux_table_parent_data() to supplement clk_hw_register_mux_table() but using parent_data instead of parent_names - clk_hw_register_fixed_rate_parent_accuracy() to be used instead of directly calling __clk_hw_register_fixed_rate(). The later function is an internal API, which is better not to be called directly. Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Link: https://lore.kernel.org/r/20220916061740.87167-2-dmitry.baryshkov@linaro.org Signed-off-by: Stephen Boyd <sboyd@kernel.org> Stable-dep-of: ee0cf5e07f44 ("clk: fixed-rate: fix clk_hw_register_fixed_rate_with_accuracy_parent_hw") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-25block: make BLK_DEF_MAX_SECTORS unsignedKeith Busch
[ Upstream commit 0a26f327e46c203229e72c823dfec71a2b405ec5 ] This is used as an unsigned value, so define it that way to avoid having to cast it. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230105205146.3610282-2-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 9a9525de8654 ("null_blk: don't cap max_hw_sectors to BLK_DEF_MAX_SECTORS") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-25dma-mapping: Add dma_release_coherent_memory to DMA APIMark-PK Tsai
[ Upstream commit e61c451476e61450f6771ce03bbc01210a09be16 ] Add dma_release_coherent_memory to DMA API to allow dma user call it to release dev->dma_mem when the device is removed. Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com> Acked-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220422062436.14384-2-mark-pk.tsai@mediatek.com Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Stable-dep-of: b07bc2347672 ("dma-mapping: clear dev->dma_mem to NULL after freeing it") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-25of: Add of_property_present() helperRob Herring
[ Upstream commit 9cbad37ce8122de32a1529e394b468bc101c9e7f ] Add an of_property_present() function similar to fwnode_property_present(). of_property_read_bool() could be used directly, but it is cleaner to not use it on non-boolean properties. Reviewed-by: Frank Rowand <frowand.list@gmail.com> Tested-by: Frank Rowand <frowand.list@gmail.com> Link: https://lore.kernel.org/all/20230215215547.691573-1-robh@kernel.org/ Signed-off-by: Rob Herring <robh@kernel.org> Stable-dep-of: c4a5118a3ae1 ("cpufreq: scmi: process the result of devm_of_clk_add_hw_provider()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-25of: property: define of_property_read_u{8,16,32,64}_array() unconditionallyMichael Walle
[ Upstream commit 2ca42c3ad9ed875b136065b010753a4caaaa1d38 ] We can get rid of all the empty stubs because all these functions call of_property_read_variable_u{8,16,32,64}_array() which already have an empty stub if CONFIG_OF is not defined. Signed-off-by: Michael Walle <michael@walle.cc> Signed-off-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20220118173504.2867523-3-michael@walle.cc Stable-dep-of: c4a5118a3ae1 ("cpufreq: scmi: process the result of devm_of_clk_add_hw_provider()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-15kallsyms: Make module_kallsyms_on_each_symbol generally availableJiri Olsa
commit 73feb8d5fa3b755bb51077c0aabfb6aa556fd498 upstream. Making module_kallsyms_on_each_symbol generally available, so it can be used outside CONFIG_LIVEPATCH option in following changes. Rather than adding another ifdef option let's make the function generally available (when CONFIG_KALLSYMS and CONFIG_MODULES options are defined). Cc: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20221025134148.3300700-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Stable-dep-of: 926fe783c8a6 ("tracing/kprobes: Fix symbol counting logic by looking at modules as well") Signed-off-by: Markus Boehme <markubo@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-01-05bpf: Fix prog_array_map_poke_run map poke updateJiri Olsa
commit 4b7de801606e504e69689df71475d27e35336fb3 upstream. Lee pointed out issue found by syscaller [0] hitting BUG in prog array map poke update in prog_array_map_poke_run function due to error value returned from bpf_arch_text_poke function. There's race window where bpf_arch_text_poke can fail due to missing bpf program kallsym symbols, which is accounted for with check for -EINVAL in that BUG_ON call. The problem is that in such case we won't update the tail call jump and cause imbalance for the next tail call update check which will fail with -EBUSY in bpf_arch_text_poke. I'm hitting following race during the program load: CPU 0 CPU 1 bpf_prog_load bpf_check do_misc_fixups prog_array_map_poke_track map_update_elem bpf_fd_array_map_update_elem prog_array_map_poke_run bpf_arch_text_poke returns -EINVAL bpf_prog_kallsyms_add After bpf_arch_text_poke (CPU 1) fails to update the tail call jump, the next poke update fails on expected jump instruction check in bpf_arch_text_poke with -EBUSY and triggers the BUG_ON in prog_array_map_poke_run. Similar race exists on the program unload. Fixing this by moving the update to bpf_arch_poke_desc_update function which makes sure we call __bpf_arch_text_poke that skips the bpf address check. Each architecture has slightly different approach wrt looking up bpf address in bpf_arch_text_poke, so instead of splitting the function or adding new 'checkip' argument in previous version, it seems best to move the whole map_poke_run update as arch specific code. [0] https://syzkaller.appspot.com/bug?extid=97a4fe20470e9bc30810 Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT") Reported-by: syzbot+97a4fe20470e9bc30810@syzkaller.appspotmail.com Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Cc: Lee Jones <lee@kernel.org> Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20231206083041.1306660-2-jolsa@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-01-05device property: Allow const parameter to dev_fwnode()Andy Shevchenko
commit b295d484b97081feba72b071ffcb72fb4638ccfd upstream. It's not fully correct to take a const parameter pointer to a struct and return a non-const pointer to a member of that struct. Instead, introduce a const version of the dev_fwnode() API which takes and returns const pointers and use it where it's applicable. With this, convert dev_fwnode() to be a macro wrapper on top of const and non-const APIs that chooses one based on the type. Suggested-by: Sakari Ailus <sakari.ailus@linux.intel.com> Fixes: aade55c86033 ("device property: Add const qualifier to device_get_match_data() parameter") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Reviewed-by: Sakari Ailus <sakari.ailus@linux.intel.com> Link: https://lore.kernel.org/r/20221004092129.19412-2-andriy.shevchenko@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-01-05spi: Introduce spi_get_device_match_data() helperAndy Shevchenko
[ Upstream commit aea672d054a21782ed8450c75febb6ba3c208ca4 ] The proposed spi_get_device_match_data() helper is for retrieving a driver data associated with the ID in an ID table. First, it tries to get driver data of the device enumerated by firmware interface (usually Device Tree or ACPI). If none is found it falls back to the SPI ID table matching. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20221020195421.10482-1-andriy.shevchenko@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org> Stable-dep-of: ee4d79055aee ("iio: imu: adis16475: add spi_device_id table") Signed-off-by: Sasha Levin <sashal@kernel.org>