summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2021-01-23net: skbuff: disambiguate argument and member for skb_list_walk_safe helperJason A. Donenfeld
commit 5eee7bd7e245914e4e050c413dfe864e31805207 upstream. This worked before, because we made all callers name their next pointer "next". But in trying to be more "drop-in" ready, the silliness here is revealed. This commit fixes the problem by making the macro argument and the member use different names. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23net: introduce skb_list_walk_safe for skb segment walkingJason A. Donenfeld
commit dcfea72e79b0aa7a057c8f6024169d86a1bbc84b upstream. As part of the continual effort to remove direct usage of skb->next and skb->prev, this patch adds a helper for iterating through the singly-linked variant of skb lists, which are used for lists of GSO packet. The name "skb_list_..." has been chosen to match the existing function, "kfree_skb_list, which also operates on these singly-linked lists, and the "..._walk_safe" part is the same idiom as elsewhere in the kernel. This patch removes the helper from wireguard and puts it into linux/skbuff.h, while making it a bit more robust for general usage. In particular, parenthesis are added around the macro argument usage, and it now accounts for trying to iterate through an already-null skb pointer, which will simply run the iteration zero times. This latter enhancement means it can be used to replace both do { ... } while and while (...) open-coded idioms. This should take care of these three possible usages, which match all current methods of iterations. skb_list_walk_safe(segs, skb, next) { ... } skb_list_walk_safe(skb, skb, next) { ... } skb_list_walk_safe(segs, skb, segs) { ... } Gcc appears to generate efficient code for each of these. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net> [ Just the skbuff.h changes for backporting - gregkh] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23dm integrity: fix flush with external metadata deviceMikulas Patocka
commit 9b5948267adc9e689da609eb61cf7ed49cae5fa8 upstream. With external metadata device, flush requests are not passed down to the data device. Fix this by submitting the flush request in dm_integrity_flush_buffers. In order to not degrade performance, we overlap the data device flush with the metadata device flush. Reported-by: Lukas Straub <lukasstraub2@web.de> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23compiler.h: Raise minimum version of GCC to 5.1 for arm64Will Deacon
commit dca5244d2f5b94f1809f0c02a549edf41ccd5493 upstream. GCC versions >= 4.9 and < 5.1 have been shown to emit memory references beyond the stack pointer, resulting in memory corruption if an interrupt is taken after the stack pointer has been adjusted but before the reference has been executed. This leads to subtle, infrequent data corruption such as the EXT4 problems reported by Russell King at the link below. Life is too short for buggy compilers, so raise the minimum GCC version required by arm64 to 5.1. Reported-by: Russell King <linux@armlinux.org.uk> Suggested-by: Arnd Bergmann <arnd@kernel.org> Signed-off-by: Will Deacon <will@kernel.org> Tested-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Florian Weimer <fweimer@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20210105154726.GD1551@shell.armlinux.org.uk Link: https://lore.kernel.org/r/20210112224832.10980-1-will@kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> [will: backport to 4.19.y/5.4.y] Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19ACPI: scan: add stub acpi_create_platform_device() for !CONFIG_ACPIShawn Guo
[ Upstream commit ee61cfd955a64a58ed35cbcfc54068fcbd486945 ] It adds a stub acpi_create_platform_device() for !CONFIG_ACPI build, so that caller doesn't have to deal with !CONFIG_ACPI build issue. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Shawn Guo <shawn.guo@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-12proc: fix lookup in /proc/net subdirectories after setns(2)Alexey Dobriyan
[ Upstream commit c6c75deda81344c3a95d1d1f606d5cee109e5d54 ] Commit 1fde6f21d90f ("proc: fix /proc/net/* after setns(2)") only forced revalidation of regular files under /proc/net/ However, /proc/net/ is unusual in the sense of /proc/net/foo handlers take netns pointer from parent directory which is old netns. Steps to reproduce: (void)open("/proc/net/sctp/snmp", O_RDONLY); unshare(CLONE_NEWNET); int fd = open("/proc/net/sctp/snmp", O_RDONLY); read(fd, &c, 1); Read will read wrong data from original netns. Patch forces lookup on every directory under /proc/net . Link: https://lkml.kernel.org/r/20201205160916.GA109739@localhost.localdomain Fixes: 1da4d377f943 ("proc: revalidate misc dentries") Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reported-by: "Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@nokia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-09kdev_t: always inline major/minor helper functionsJosh Poimboeuf
commit aa8c7db494d0a83ecae583aa193f1134ef25d506 upstream. Silly GCC doesn't always inline these trivial functions. Fixes the following warning: arch/x86/kernel/sys_ia32.o: warning: objtool: cp_stat64()+0xd8: call to new_encode_dev() with UACCESS enabled Link: https://lkml.kernel.org/r/984353b44a4484d86ba9f73884b7306232e25e30.1608737428.git.jpoimboe@redhat.com Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> [build-tested] Cc: Peter Zijlstra <peterz@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-06of: fix linker-section match-table corruptionJohan Hovold
commit 5812b32e01c6d86ba7a84110702b46d8a8531fe9 upstream. Specify type alignment when declaring linker-section match-table entries to prevent gcc from increasing alignment and corrupting the various tables with padding (e.g. timers, irqchips, clocks, reserved memory). This is specifically needed on x86 where gcc (typically) aligns larger objects like struct of_device_id with static extent on 32-byte boundaries which at best prevents matching on anything but the first entry. Specifying alignment when declaring variables suppresses this optimisation. Here's a 64-bit example where all entries are corrupt as 16 bytes of padding has been inserted before the first entry: ffffffff8266b4b0 D __clk_of_table ffffffff8266b4c0 d __of_table_fixed_factor_clk ffffffff8266b5a0 d __of_table_fixed_clk ffffffff8266b680 d __clk_of_table_sentinel And here's a 32-bit example where the 8-byte-aligned table happens to be placed on a 32-byte boundary so that all but the first entry are corrupt due to the 28 bytes of padding inserted between entries: 812b3ec0 D __irqchip_of_table 812b3ec0 d __of_table_irqchip1 812b3fa0 d __of_table_irqchip2 812b4080 d __of_table_irqchip3 812b4160 d irqchip_of_match_end Verified on x86 using gcc-9.3 and gcc-4.9 (which uses 64-byte alignment), and on arm using gcc-7.2. Note that there are no in-tree users of these tables on x86 currently (even if they are included in the image). Fixes: 54196ccbe0ba ("of: consolidate linker section OF match table declarations") Fixes: f6e916b82022 ("irqchip: add basic infrastructure") Cc: stable <stable@vger.kernel.org> # 3.9 Signed-off-by: Johan Hovold <johan@kernel.org> Link: https://lore.kernel.org/r/20201123102319.8090-2-johan@kernel.org [ johan: adjust context to 5.4 ] Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-06fscrypt: add fscrypt_is_nokey_name()Eric Biggers
commit 159e1de201b6fca10bfec50405a3b53a561096a8 upstream. It's possible to create a duplicate filename in an encrypted directory by creating a file concurrently with adding the encryption key. Specifically, sys_open(O_CREAT) (or sys_mkdir(), sys_mknod(), or sys_symlink()) can lookup the target filename while the directory's encryption key hasn't been added yet, resulting in a negative no-key dentry. The VFS then calls ->create() (or ->mkdir(), ->mknod(), or ->symlink()) because the dentry is negative. Normally, ->create() would return -ENOKEY due to the directory's key being unavailable. However, if the key was added between the dentry lookup and ->create(), then the filesystem will go ahead and try to create the file. If the target filename happens to already exist as a normal name (not a no-key name), a duplicate filename may be added to the directory. In order to fix this, we need to fix the filesystems to prevent ->create(), ->mkdir(), ->mknod(), and ->symlink() on no-key names. (->rename() and ->link() need it too, but those are already handled correctly by fscrypt_prepare_rename() and fscrypt_prepare_link().) In preparation for this, add a helper function fscrypt_is_nokey_name() that filesystems can use to do this check. Use this helper function for the existing checks that fs/crypto/ does for rename and link. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20201118075609.120337-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30fix namespaced fscaps when !CONFIG_SECURITYSerge Hallyn
[ Upstream commit ed9b25d1970a4787ac6a39c2091e63b127ecbfc1 ] Namespaced file capabilities were introduced in 8db6c34f1dbc . When userspace reads an xattr for a namespaced capability, a virtualized representation of it is returned if the caller is in a user namespace owned by the capability's owning rootid. The function which performs this virtualization was not hooked up if CONFIG_SECURITY=n. Therefore in that case the original xattr was shown instead of the virtualized one. To test this using libcap-bin (*1), $ v=$(mktemp) $ unshare -Ur setcap cap_sys_admin-eip $v $ unshare -Ur setcap -v cap_sys_admin-eip $v /tmp/tmp.lSiIFRvt8Y: OK "setcap -v" verifies the values instead of setting them, and will check whether the rootid value is set. Therefore, with this bug un-fixed, and with CONFIG_SECURITY=n, setcap -v will fail: $ v=$(mktemp) $ unshare -Ur setcap cap_sys_admin=eip $v $ unshare -Ur setcap -v cap_sys_admin=eip $v nsowner[got=1000, want=0],/tmp/tmp.HHDiOOl9fY differs in [] Fix this bug by calling cap_inode_getsecurity() in security_inode_getsecurity() instead of returning -EOPNOTSUPP, when CONFIG_SECURITY=n. *1 - note, if libcap is too old for getcap to have the '-n' option, then use verify-caps instead. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209689 Cc: Hervé Guillemet <herve@guillemet.org> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Serge Hallyn <shallyn@cisco.com> Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Signed-off-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30seq_buf: Avoid type mismatch for seq_buf_initArnd Bergmann
[ Upstream commit d9a9280a0d0ae51dc1d4142138b99242b7ec8ac6 ] Building with W=2 prints a number of warnings for one function that has a pointer type mismatch: linux/seq_buf.h: In function 'seq_buf_init': linux/seq_buf.h:35:12: warning: pointer targets in assignment from 'unsigned char *' to 'char *' differ in signedness [-Wpointer-sign] Change the type in the function prototype according to the type in the structure. Link: https://lkml.kernel.org/r/20201026161108.3707783-1-arnd@kernel.org Fixes: 9a7777935c34 ("tracing: Convert seq_buf fields to be like seq_file fields") Reviewed-by: Cezary Rojewski <cezary.rojewski@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30SUNRPC: xprt_load_transport() needs to support the netid "rdma6"Trond Myklebust
[ Upstream commit d5aa6b22e2258f05317313ecc02efbb988ed6d38 ] According to RFC5666, the correct netid for an IPv6 addressed RDMA transport is "rdma6", which we've supported as a mount option since Linux-4.7. The problem is when we try to load the module "xprtrdma6", that will fail, since there is no modulealias of that name. Fixes: 181342c5ebe8 ("xprtrdma: Add rdma6 option to support NFS/RDMA IPv6") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30netfilter: x_tables: Switch synchronization to RCUSubash Abhinov Kasiviswanathan
[ Upstream commit cc00bcaa589914096edef7fb87ca5cee4a166b5c ] When running concurrent iptables rules replacement with data, the per CPU sequence count is checked after the assignment of the new information. The sequence count is used to synchronize with the packet path without the use of any explicit locking. If there are any packets in the packet path using the table information, the sequence count is incremented to an odd value and is incremented to an even after the packet process completion. The new table value assignment is followed by a write memory barrier so every CPU should see the latest value. If the packet path has started with the old table information, the sequence counter will be odd and the iptables replacement will wait till the sequence count is even prior to freeing the old table info. However, this assumes that the new table information assignment and the memory barrier is actually executed prior to the counter check in the replacement thread. If CPU decides to execute the assignment later as there is no user of the table information prior to the sequence check, the packet path in another CPU may use the old table information. The replacement thread would then free the table information under it leading to a use after free in the packet processing context- Unable to handle kernel NULL pointer dereference at virtual address 000000000000008e pc : ip6t_do_table+0x5d0/0x89c lr : ip6t_do_table+0x5b8/0x89c ip6t_do_table+0x5d0/0x89c ip6table_filter_hook+0x24/0x30 nf_hook_slow+0x84/0x120 ip6_input+0x74/0xe0 ip6_rcv_finish+0x7c/0x128 ipv6_rcv+0xac/0xe4 __netif_receive_skb+0x84/0x17c process_backlog+0x15c/0x1b8 napi_poll+0x88/0x284 net_rx_action+0xbc/0x23c __do_softirq+0x20c/0x48c This could be fixed by forcing instruction order after the new table information assignment or by switching to RCU for the synchronization. Fixes: 80055dab5de0 ("netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore") Reported-by: Sean Tranchetti <stranche@codeaurora.org> Reported-by: kernel test robot <lkp@intel.com> Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30USB: UAS: introduce a quirk to set no_write_sameOliver Neukum
commit 8010622c86ca5bb44bc98492f5968726fc7c7a21 upstream. UAS does not share the pessimistic assumption storage is making that devices cannot deal with WRITE_SAME. A few devices supported by UAS, are reported to not deal well with WRITE_SAME. Those need a quirk. Add it to the device that needs it. Reported-by: David C. Partridge <david.partridge@perdrix.co.uk> Signed-off-by: Oliver Neukum <oneukum@suse.com> Cc: stable <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20201209152639.9195-1-oneukum@suse.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30compiler.h: fix barrier_data() on clangArvind Sankar
commit 3347acc6fcd4ee71ad18a9ff9d9dac176b517329 upstream. Commit 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h mutually exclusive") neglected to copy barrier_data() from compiler-gcc.h into compiler-clang.h. The definition in compiler-gcc.h was really to work around clang's more aggressive optimization, so this broke barrier_data() on clang, and consequently memzero_explicit() as well. For example, this results in at least the memzero_explicit() call in lib/crypto/sha256.c:sha256_transform() being optimized away by clang. Fix this by moving the definition of barrier_data() into compiler.h. Also move the gcc/clang definition of barrier() into compiler.h, __memory_barrier() is icc-specific (and barrier() is already defined using it in compiler-intel.h) and doesn't belong in compiler.h. [rdunlap@infradead.org: fix ALPHA builds when SMP is not enabled] Link: https://lkml.kernel.org/r/20201101231835.4589-1-rdunlap@infradead.org Fixes: 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h mutually exclusive") Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201014212631.207844-1-nivedita@alum.mit.edu Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [nd: backport to account for missing commit e506ea451254a ("compiler.h: Split {READ,WRITE}_ONCE definitions out into rwonce.h") commit d08b9f0ca6605 ("scs: Add support for Clang's Shadow Call Stack (SCS)") commit a3f8a30f3f00 ("Compiler Attributes: use feature checks instead of version checks")] Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30kbuild: avoid static_assert for genksymsArnd Bergmann
commit 14dc3983b5dff513a90bd5a8cc90acaf7867c3d0 upstream. genksyms does not know or care about the _Static_assert() built-in, and sometimes falls back to ignoring the later symbols, which causes undefined behavior such as WARNING: modpost: EXPORT symbol "ethtool_set_ethtool_phy_ops" [vmlinux] version generation failed, symbol will not be versioned. ld: net/ethtool/common.o: relocation R_AARCH64_ABS32 against `__crc_ethtool_set_ethtool_phy_ops' can not be used when making a shared object net/ethtool/common.o:(_ftrace_annotated_branch+0x0): dangerous relocation: unsupported relocation Redefine static_assert for genksyms to avoid that. Link: https://lkml.kernel.org/r/20201203230955.1482058-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Ard Biesheuvel <ardb@kernel.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Michal Marek <michal.lkml@markovi.net> Cc: Kees Cook <keescook@chromium.org> Cc: Rikard Falkeborn <rikard.falkeborn@gmail.com> Cc: Marco Elver <elver@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-11spi: Introduce device-managed SPI controller allocationLukas Wunner
[ Upstream commit 5e844cc37a5cbaa460e68f9a989d321d63088a89 ] SPI driver probing currently comprises two steps, whereas removal comprises only one step: spi_alloc_master() spi_register_controller() spi_unregister_controller() That's because spi_unregister_controller() calls device_unregister() instead of device_del(), thereby releasing the reference on the spi_controller which was obtained by spi_alloc_master(). An SPI driver's private data is contained in the same memory allocation as the spi_controller struct. Thus, once spi_unregister_controller() has been called, the private data is inaccessible. But some drivers need to access it after spi_unregister_controller() to perform further teardown steps. Introduce devm_spi_alloc_master() and devm_spi_alloc_slave(), which release a reference on the spi_controller struct only after the driver has unbound, thereby keeping the memory allocation accessible. Change spi_unregister_controller() to not release a reference if the spi_controller was allocated by one of these new devm functions. The present commit is small enough to be backportable to stable. It allows fixing drivers which use the private data in their ->remove() hook after it's been freed. It also allows fixing drivers which neglect to release a reference on the spi_controller in the probe error path. Long-term, most SPI drivers shall be moved over to the devm functions introduced herein. The few that can't shall be changed in a treewide commit to explicitly release the last reference on the controller. That commit shall amend spi_unregister_controller() to no longer release a reference, thereby completing the migration. As a result, the behaviour will be less surprising and more consistent with subsystems such as IIO, which also includes the private data in the allocation of the generic iio_dev struct, but calls device_del() in iio_device_unregister(). Signed-off-by: Lukas Wunner <lukas@wunner.de> Link: https://lore.kernel.org/r/272bae2ef08abd21388c98e23729886663d19192.1605121038.git.lukas@wunner.de Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-11tty: Fix ->session lockingJann Horn
commit c8bcd9c5be24fb9e6132e97da5a35e55a83e36b9 upstream. Currently, locking of ->session is very inconsistent; most places protect it using the legacy tty mutex, but disassociate_ctty(), __do_SAK(), tiocspgrp() and tiocgsid() don't. Two of the writers hold the ctrl_lock (because they already need it for ->pgrp), but __proc_set_tty() doesn't do that yet. On a PREEMPT=y system, an unprivileged user can theoretically abuse this broken locking to read 4 bytes of freed memory via TIOCGSID if tiocgsid() is preempted long enough at the right point. (Other things might also go wrong, especially if root-only ioctls are involved; I'm not sure about that.) Change the locking on ->session such that: - tty_lock() is held by all writers: By making disassociate_ctty() hold it. This should be fine because the same lock can already be taken through the call to tty_vhangup_session(). The tricky part is that we need to shorten the area covered by siglock to be able to take tty_lock() without ugly retry logic; as far as I can tell, this should be fine, since nothing in the signal_struct is touched in the `if (tty)` branch. - ctrl_lock is held by all writers: By changing __proc_set_tty() to hold the lock a little longer. - All readers that aren't holding tty_lock() hold ctrl_lock: By adding locking to tiocgsid() and __do_SAK(), and expanding the area covered by ctrl_lock in tiocspgrp(). Cc: stable@kernel.org Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Jiri Slaby <jirislaby@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-02netfilter: clear skb->next in NF_HOOK_LIST()Cong Wang
NF_HOOK_LIST() uses list_del() to remove skb from the linked list, however, it is not sufficient as skb->next still points to other skb. We should just call skb_list_del_init() to clear skb->next, like the rest places which using skb list. This has been fixed in upstream by commit ca58fbe06c54 ("netfilter: add and use nf_hook_slow_list()"). Fixes: 9f17dbf04ddf ("netfilter: fix use-after-free in NF_HOOK_LIST") Reported-by: liuzx@knownsec.com Tested-by: liuzx@knownsec.com Cc: Florian Westphal <fw@strlen.de> Cc: Edward Cree <ecree@solarflare.com> Cc: stable@vger.kernel.org # between 4.19 and 5.4 Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18random32: make prandom_u32() output unpredictableGeorge Spelvin
commit c51f8f88d705e06bd696d7510aff22b33eb8e638 upstream. Non-cryptographic PRNGs may have great statistical properties, but are usually trivially predictable to someone who knows the algorithm, given a small sample of their output. An LFSR like prandom_u32() is particularly simple, even if the sample is widely scattered bits. It turns out the network stack uses prandom_u32() for some things like random port numbers which it would prefer are *not* trivially predictable. Predictability led to a practical DNS spoofing attack. Oops. This patch replaces the LFSR with a homebrew cryptographic PRNG based on the SipHash round function, which is in turn seeded with 128 bits of strong random key. (The authors of SipHash have *not* been consulted about this abuse of their algorithm.) Speed is prioritized over security; attacks are rare, while performance is always wanted. Replacing all callers of prandom_u32() is the quick fix. Whether to reinstate a weaker PRNG for uses which can tolerate it is an open question. Commit f227e3ec3b5c ("random32: update the net random state on interrupt and activity") was an earlier attempt at a solution. This patch replaces it. Reported-by: Amit Klein <aksecurity@gmail.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Eric Dumazet <edumazet@google.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: tytso@mit.edu Cc: Florian Westphal <fw@strlen.de> Cc: Marc Plumb <lkml.mplumb@gmail.com> Fixes: f227e3ec3b5c ("random32: update the net random state on interrupt and activity") Signed-off-by: George Spelvin <lkml@sdf.org> Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/ [ willy: partial reversal of f227e3ec3b5c; moved SIPROUND definitions to prandom.h for later use; merged George's prandom_seed() proposal; inlined siprand_u32(); replaced the net_rand_state[] array with 4 members to fix a build issue; cosmetic cleanups to make checkpatch happy; fixed RANDOM32_SELFTEST build ] [wt: backported to 4.19 -- various context adjustments] Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18netfilter: use actual socket sk rather than skb sk when routing harderJason A. Donenfeld
commit 46d6c5ae953cc0be38efd0e469284df7c4328cf8 upstream. If netfilter changes the packet mark when mangling, the packet is rerouted using the route_me_harder set of functions. Prior to this commit, there's one big difference between route_me_harder and the ordinary initial routing functions, described in the comment above __ip_queue_xmit(): /* Note: skb->sk can be different from sk, in case of tunnels */ int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl, That function goes on to correctly make use of sk->sk_bound_dev_if, rather than skb->sk->sk_bound_dev_if. And indeed the comment is true: a tunnel will receive a packet in ndo_start_xmit with an initial skb->sk. It will make some transformations to that packet, and then it will send the encapsulated packet out of a *new* socket. That new socket will basically always have a different sk_bound_dev_if (otherwise there'd be a routing loop). So for the purposes of routing the encapsulated packet, the routing information as it pertains to the socket should come from that socket's sk, rather than the packet's original skb->sk. For that reason __ip_queue_xmit() and related functions all do the right thing. One might argue that all tunnels should just call skb_orphan(skb) before transmitting the encapsulated packet into the new socket. But tunnels do *not* do this -- and this is wisely avoided in skb_scrub_packet() too -- because features like TSQ rely on skb->destructor() being called when that buffer space is truely available again. Calling skb_orphan(skb) too early would result in buffers filling up unnecessarily and accounting info being all wrong. Instead, additional routing must take into account the new sk, just as __ip_queue_xmit() notes. So, this commit addresses the problem by fishing the correct sk out of state->sk -- it's already set properly in the call to nf_hook() in __ip_local_out(), which receives the sk as part of its normal functionality. So we make sure to plumb state->sk through the various route_me_harder functions, and then make correct use of it following the example of __ip_queue_xmit(). Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org> [Jason: backported to 4.19 from Sasha's 5.4 backport] Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18can: can_create_echo_skb(): fix echo skb generation: always use skb_clone()Oleksij Rempel
[ Upstream commit 286228d382ba6320f04fa2e7c6fc8d4d92e428f4 ] All user space generated SKBs are owned by a socket (unless injected into the key via AF_PACKET). If a socket is closed, all associated skbs will be cleaned up. This leads to a problem when a CAN driver calls can_put_echo_skb() on a unshared SKB. If the socket is closed prior to the TX complete handler, can_get_echo_skb() and the subsequent delivering of the echo SKB to all registered callbacks, a SKB with a refcount of 0 is delivered. To avoid the problem, in can_get_echo_skb() the original SKB is now always cloned, regardless of shared SKB or not. If the process exists it can now safely discard its SKBs, without disturbing the delivery of the echo SKB. The problem shows up in the j1939 stack, when it clones the incoming skb, which detects the already 0 refcount. We can easily reproduce this with following example: testj1939 -B -r can0: & cansend can0 1823ff40#0123 WARNING: CPU: 0 PID: 293 at lib/refcount.c:25 refcount_warn_saturate+0x108/0x174 refcount_t: addition on 0; use-after-free. Modules linked in: coda_vpu imx_vdoa videobuf2_vmalloc dw_hdmi_ahb_audio vcan CPU: 0 PID: 293 Comm: cansend Not tainted 5.5.0-rc6-00376-g9e20dcb7040d #1 Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) Backtrace: [<c010f570>] (dump_backtrace) from [<c010f90c>] (show_stack+0x20/0x24) [<c010f8ec>] (show_stack) from [<c0c3e1a4>] (dump_stack+0x8c/0xa0) [<c0c3e118>] (dump_stack) from [<c0127fec>] (__warn+0xe0/0x108) [<c0127f0c>] (__warn) from [<c01283c8>] (warn_slowpath_fmt+0xa8/0xcc) [<c0128324>] (warn_slowpath_fmt) from [<c0539c0c>] (refcount_warn_saturate+0x108/0x174) [<c0539b04>] (refcount_warn_saturate) from [<c0ad2cac>] (j1939_can_recv+0x20c/0x210) [<c0ad2aa0>] (j1939_can_recv) from [<c0ac9dc8>] (can_rcv_filter+0xb4/0x268) [<c0ac9d14>] (can_rcv_filter) from [<c0aca2cc>] (can_receive+0xb0/0xe4) [<c0aca21c>] (can_receive) from [<c0aca348>] (can_rcv+0x48/0x98) [<c0aca300>] (can_rcv) from [<c09b1fdc>] (__netif_receive_skb_one_core+0x64/0x88) [<c09b1f78>] (__netif_receive_skb_one_core) from [<c09b2070>] (__netif_receive_skb+0x38/0x94) [<c09b2038>] (__netif_receive_skb) from [<c09b2130>] (netif_receive_skb_internal+0x64/0xf8) [<c09b20cc>] (netif_receive_skb_internal) from [<c09b21f8>] (netif_receive_skb+0x34/0x19c) [<c09b21c4>] (netif_receive_skb) from [<c0791278>] (can_rx_offload_napi_poll+0x58/0xb4) Fixes: 0ae89beb283a ("can: add destructor for self generated skbs") Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: http://lore.kernel.org/r/20200124132656.22156-1-o.rempel@pengutronix.de Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18time: Prevent undefined behaviour in timespec64_to_ns()Zeng Tao
[ Upstream commit cb47755725da7b90fecbb2aa82ac3b24a7adb89b ] UBSAN reports: Undefined behaviour in ./include/linux/time64.h:127:27 signed integer overflow: 17179869187 * 1000000000 cannot be represented in type 'long long int' Call Trace: timespec64_to_ns include/linux/time64.h:127 [inline] set_cpu_itimer+0x65c/0x880 kernel/time/itimer.c:180 do_setitimer+0x8e/0x740 kernel/time/itimer.c:245 __x64_sys_setitimer+0x14c/0x2c0 kernel/time/itimer.c:336 do_syscall_64+0xa1/0x540 arch/x86/entry/common.c:295 Commit bd40a175769d ("y2038: itimer: change implementation to timespec64") replaced the original conversion which handled time clamping correctly with timespec64_to_ns() which has no overflow protection. Fix it in timespec64_to_ns() as this is not necessarily limited to the usage in itimers. [ tglx: Added comment and adjusted the fixes tag ] Fixes: 361a3bf00582 ("time64: Add time64.h header and define struct timespec64") Signed-off-by: Zeng Tao <prime.zeng@hisilicon.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/1598952616-6416-1-git-send-email-prime.zeng@hisilicon.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-10mm: always have io_remap_pfn_range() set pgprot_decrypted()Jason Gunthorpe
commit f8f6ae5d077a9bdaf5cbf2ac960a5d1a04b47482 upstream. The purpose of io_remap_pfn_range() is to map IO memory, such as a memory mapped IO exposed through a PCI BAR. IO devices do not understand encryption, so this memory must always be decrypted. Automatically call pgprot_decrypted() as part of the generic implementation. This fixes a bug where enabling AMD SME causes subsystems, such as RDMA, using io_remap_pfn_range() to expose BAR pages to user space to fail. The CPU will encrypt access to those BAR pages instead of passing unencrypted IO directly to the device. Places not mapping IO should use remap_pfn_range(). Fixes: aca20d546214 ("x86/mm: Add support to make use of Secure Memory Encryption") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: "Dave Young" <dyoung@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Ingo Molnar <mingo@kernel.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Toshimitsu Kani <toshi.kani@hpe.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/0-v1-025d64bdf6c4+e-amd_sme_fix_jgg@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05hil/parisc: Disable HIL driver when it gets stuckHelge Deller
commit 879bc2d27904354b98ca295b6168718e045c4aa2 upstream. When starting a HP machine with HIL driver but without an HIL keyboard or HIL mouse attached, it may happen that data written to the HIL loop gets stuck (e.g. because the transaction queue is full). Usually one will then have to reboot the machine because all you see is and endless output of: Transaction add failed: transaction already queued? In the higher layers hp_sdc_enqueue_transaction() is called to queued up a HIL packet. This function returns an error code, and this patch adds the necessary checks for this return code and disables the HIL driver if further packets can't be sent. Tested on a HP 730 and a HP 715/64 machine. Signed-off-by: Helge Deller <deller@gmx.de> Cc: <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05usb: typec: tcpm: During PR_SWAP, source caps should be sent only after ↵Badhri Jagan Sridharan
tSwapSourceStart [ Upstream commit 6bbe2a90a0bb4af8dd99c3565e907fe9b5e7fd88 ] The patch addresses the compliance test failures while running TD.PD.CP.E3, TD.PD.CP.E4, TD.PD.CP.E5 of the "Deterministic PD Compliance MOI" test plan published in https://www.usb.org/usbc. For a product to be Type-C compliant, it's expected that these tests are run on usb.org certified Type-C compliance tester as mentioned in https://www.usb.org/usbc. The purpose of the tests TD.PD.CP.E3, TD.PD.CP.E4, TD.PD.CP.E5 is to verify the PR_SWAP response of the device. While doing so, the test asserts that Source Capabilities message is NOT received from the test device within tSwapSourceStart min (20 ms) from the time the last bit of GoodCRC corresponding to the RS_RDY message sent by the UUT was sent. If it does then the test fails. This is in line with the requirements from the USB Power Delivery Specification Revision 3.0, Version 1.2: "6.6.8.1 SwapSourceStartTimer The SwapSourceStartTimer Shall be used by the new Source, after a Power Role Swap or Fast Role Swap, to ensure that it does not send Source_Capabilities Message before the new Sink is ready to receive the Source_Capabilities Message. The new Source Shall Not send the Source_Capabilities Message earlier than tSwapSourceStart after the last bit of the EOP of GoodCRC Message sent in response to the PS_RDY Message sent by the new Source indicating that its power supply is ready." The patch makes sure that TCPM does not send the Source_Capabilities Message within tSwapSourceStart(20ms) by transitioning into SRC_STARTUP only after tSwapSourceStart(20ms). Signed-off-by: Badhri Jagan Sridharan <badhri@google.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Link: https://lore.kernel.org/r/20200817183828.1895015-1-badhri@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-05fscrypt: fix race where ->lookup() marks plaintext dentry as ciphertextEric Biggers
commit b01531db6cec2aa330dbc91bfbfaaef4a0d387a4 upstream. ->lookup() in an encrypted directory begins as follows: 1. fscrypt_prepare_lookup(): a. Try to load the directory's encryption key. b. If the key is unavailable, mark the dentry as a ciphertext name via d_flags. 2. fscrypt_setup_filename(): a. Try to load the directory's encryption key. b. If the key is available, encrypt the name (treated as a plaintext name) to get the on-disk name. Otherwise decode the name (treated as a ciphertext name) to get the on-disk name. But if the key is concurrently added, it may be found at (2a) but not at (1a). In this case, the dentry will be wrongly marked as a ciphertext name even though it was actually treated as plaintext. This will cause the dentry to be wrongly invalidated on the next lookup, potentially causing problems. For example, if the racy ->lookup() was part of sys_mount(), then the new mount will be detached when anything tries to access it. This is despite the mountpoint having a plaintext path, which should remain valid now that the key was added. Of course, this is only possible if there's a userspace race. Still, the additional kernel-side race is confusing and unexpected. Close the kernel-side race by changing fscrypt_prepare_lookup() to also set the on-disk filename (step 2b), consistent with the d_flags update. Fixes: 28b4c263961c ("ext4 crypto: revalidate dentry after adding or removing the key") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05fscrypt: fix race allowing rename() and link() of ciphertext dentriesEric Biggers
commit 968dd6d0c6d6b6a989c6ddb9e2584a031b83e7b5 upstream. Close some race conditions where fscrypt allowed rename() and link() on ciphertext dentries that had been looked up just prior to the key being concurrently added. It's better to return -ENOKEY in this case. This avoids doing the nonsensical thing of encrypting the names a second time when searching for the actual on-disk dir entries. It also guarantees that DCACHE_ENCRYPTED_NAME dentries are never rename()d, so the dcache won't have support all possible combinations of moving DCACHE_ENCRYPTED_NAME around during __d_move(). Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05fscrypt: clean up and improve dentry revalidationEric Biggers
commit 6cc248684d3d23bbd073ae2fa73d3416c0558909 upstream. Make various improvements to fscrypt dentry revalidation: - Don't try to handle the case where the per-directory key is removed, as this can't happen without the inode (and dentries) being evicted. - Flag ciphertext dentries rather than plaintext dentries, since it's ciphertext dentries that need the special handling. - Avoid doing unnecessary work for non-ciphertext dentries. - When revalidating ciphertext dentries, try to set up the directory's i_crypt_info to make sure the key is really still absent, rather than invalidating all negative dentries as the previous code did. An old comment suggested we can't do this for locking reasons, but AFAICT this comment was outdated and it actually works fine. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05fscrypt: return -EXDEV for incompatible rename or link into encrypted dirEric Biggers
commit f5e55e777cc93eae1416f0fa4908e8846b6d7825 upstream. Currently, trying to rename or link a regular file, directory, or symlink into an encrypted directory fails with EPERM when the source file is unencrypted or is encrypted with a different encryption policy, and is on the same mountpoint. It is correct for the operation to fail, but the choice of EPERM breaks tools like 'mv' that know to copy rather than rename if they see EXDEV, but don't know what to do with EPERM. Our original motivation for EPERM was to encourage users to securely handle their data. Encrypting files by "moving" them into an encrypted directory can be insecure because the unencrypted data may remain in free space on disk, where it can later be recovered by an attacker. It's much better to encrypt the data from the start, or at least try to securely delete the source data e.g. using the 'shred' program. However, the current behavior hasn't been effective at achieving its goal because users tend to be confused, hack around it, and complain; see e.g. https://github.com/google/fscrypt/issues/76. And in some cases it's actually inconsistent or unnecessary. For example, 'mv'-ing files between differently encrypted directories doesn't work even in cases where it can be secure, such as when in userspace the same passphrase protects both directories. Yet, you *can* already 'mv' unencrypted files into an encrypted directory if the source files are on a different mountpoint, even though doing so is often insecure. There are probably better ways to teach users to securely handle their files. For example, the 'fscrypt' userspace tool could provide a command that migrates unencrypted files into an encrypted directory, acting like 'shred' on the source files and providing appropriate warnings depending on the type of the source filesystem and disk. Receiving errors on unimportant files might also force some users to disable encryption, thus making the behavior counterproductive. It's desirable to make encryption as unobtrusive as possible. Therefore, change the error code from EPERM to EXDEV so that tools looking for EXDEV will fall back to a copy. This, of course, doesn't prevent users from still doing the right things to securely manage their files. Note that this also matches the behavior when a file is renamed between two project quota hierarchies; so there's precedent for using EXDEV for things other than mountpoints. xfstests generic/398 will require an update with this change. [Rewritten from an earlier patch series by Michael Halcrow.] Cc: Michael Halcrow <mhalcrow@google.com> Cc: Joe Richey <joerichey@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05mtd: lpddr: Fix bad logic in print_drs_errorGustavo A. R. Silva
commit 1c9c02bb22684f6949d2e7ddc0a3ff364fd5a6fc upstream. Update logic for broken test. Use a more common logging style. It appears the logic in this function is broken for the consecutive tests of if (prog_status & 0x3) ... else if (prog_status & 0x2) ... else (prog_status & 0x1) ... Likely the first test should be if ((prog_status & 0x3) == 0x3) Found by inspection of include files using printk. Fixes: eb3db27507f7 ("[MTD] LPDDR PFOW definition") Cc: stable@vger.kernel.org Reported-by: Joe Perches <joe@perches.com> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Miquel Raynal <miquel.raynal@bootlin.com> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Link: https://lore.kernel.org/linux-mtd/3fb0e29f5b601db8be2938a01d974b00c8788501.1588016644.git.gustavo@embeddedor.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-30overflow: Include header file with SIZE_MAX declarationLeon Romanovsky
[ Upstream commit a4947e84f23474803b62a2759b5808147e4e15f9 ] The various array_size functions use SIZE_MAX define, but missed limits.h causes to failure to compile code that needs overflow.h. In file included from drivers/infiniband/core/uverbs_std_types_device.c:6: ./include/linux/overflow.h: In function 'array_size': ./include/linux/overflow.h:258:10: error: 'SIZE_MAX' undeclared (first use in this function) 258 | return SIZE_MAX; | ^~~~~~~~ Fixes: 610b15c50e86 ("overflow.h: Add allocation size calculation helpers") Link: https://lore.kernel.org/r/20200913102928.134985-1-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessarySuren Baghdasaryan
[ Upstream commit 67197a4f28d28d0b073ab0427b03cb2ee5382578 ] Currently __set_oom_adj loops through all processes in the system to keep oom_score_adj and oom_score_adj_min in sync between processes sharing their mm. This is done for any task with more that one mm_users, which includes processes with multiple threads (sharing mm and signals). However for such processes the loop is unnecessary because their signal structure is shared as well. Android updates oom_score_adj whenever a tasks changes its role (background/foreground/...) or binds to/unbinds from a service, making it more/less important. Such operation can happen frequently. We noticed that updates to oom_score_adj became more expensive and after further investigation found out that the patch mentioned in "Fixes" introduced a regression. Using Pixel 4 with a typical Android workload, write time to oom_score_adj increased from ~3.57us to ~362us. Moreover this regression linearly depends on the number of multi-threaded processes running on the system. Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK). Change __set_oom_adj to use MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj update should be synchronized between multiple processes. To prevent races between clone() and __set_oom_adj(), when oom_score_adj of the process being cloned might be modified from userspace, we use oom_adj_mutex. Its scope is changed to global. The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for the case of vfork(). To prevent performance regressions of vfork(), we skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is specified. Clearing the MMF_MULTIPROCESS flag (when the last process sharing the mm exits) is left out of this patch to keep it simple and because it is believed that this threading model is rare. Should there ever be a need for optimizing that case as well, it can be done by hooking into the exit path, likely following the mm_update_next_owner pattern. With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being quite rare, the regression is gone after the change is applied. [surenb@google.com: v3] Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com Fixes: 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj") Reported-by: Tim Murray <timmurray@google.com> Suggested-by: Michal Hocko <mhocko@kernel.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Eugene Syromiatnikov <esyr@redhat.com> Cc: Christian Kellner <christian@kellner.me> Cc: Adrian Reber <areber@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Gladkov <gladkov.alexey@gmail.com> Cc: Michel Lespinasse <walken@google.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Bernd Edlinger <bernd.edlinger@hotmail.de> Cc: John Johansen <john.johansen@canonical.com> Cc: Yafang Shao <laoar.shao@gmail.com> Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.com Debugged-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-14mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected ↵Vijay Balakrishna
by khugepaged commit 4aab2be0983031a05cb4a19696c9da5749523426 upstream. When memory is hotplug added or removed the min_free_kbytes should be recalculated based on what is expected by khugepaged. Currently after hotplug, min_free_kbytes will be set to a lower default and higher default set when THP enabled is lost. This change restores min_free_kbytes as expected for THP consumers. [vijayb@linux.microsoft.com: v5] Link: https://lkml.kernel.org/r/1601398153-5517-1-git-send-email-vijayb@linux.microsoft.com Fixes: f000565adb77 ("thp: set recommended min free kbytes") Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Allen Pais <apais@microsoft.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Song Liu <songliubraving@fb.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/1600305709-2319-2-git-send-email-vijayb@linux.microsoft.com Link: https://lkml.kernel.org/r/1600204258-13683-1-git-send-email-vijayb@linux.microsoft.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14Fonts: Support FONT_EXTRA_WORDS macros for built-in fontsPeilin Ye
commit 6735b4632def0640dbdf4eb9f99816aca18c4f16 upstream. syzbot has reported an issue in the framebuffer layer, where a malicious user may overflow our built-in font data buffers. In order to perform a reliable range check, subsystems need to know `FONTDATAMAX` for each built-in font. Unfortunately, our font descriptor, `struct console_font` does not contain `FONTDATAMAX`, and is part of the UAPI, making it infeasible to modify it. For user-provided fonts, the framebuffer layer resolves this issue by reserving four extra words at the beginning of data buffers. Later, whenever a function needs to access them, it simply uses the following macros: Recently we have gathered all the above macros to <linux/font.h>. Let us do the same thing for built-in fonts, prepend four extra words (including `FONTDATAMAX`) to their data buffers, so that subsystems can use these macros for all fonts, no matter built-in or user-provided. This patch depends on patch "fbdev, newport_con: Move FONT_EXTRA_WORDS macros into linux/font.h". Cc: stable@vger.kernel.org Link: https://syzkaller.appspot.com/bug?id=08b8be45afea11888776f897895aef9ad1c3ecfd Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/ef18af00c35fb3cc826048a5f70924ed6ddce95b.1600953813.git.yepeilin.cs@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14fbdev, newport_con: Move FONT_EXTRA_WORDS macros into linux/font.hPeilin Ye
commit bb0890b4cd7f8203e3aa99c6d0f062d6acdaad27 upstream. drivers/video/console/newport_con.c is borrowing FONT_EXTRA_WORDS macros from drivers/video/fbdev/core/fbcon.h. To keep things simple, move all definitions into <linux/font.h>. Since newport_con now uses four extra words, initialize the fourth word in newport_set_font() properly. Cc: stable@vger.kernel.org Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/7fb8bc9b0abc676ada6b7ac0e0bd443499357267.1600953813.git.yepeilin.cs@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-07mm: don't rely on system state to detect hot-plug operationsLaurent Dufour
commit f85086f95fa36194eb0db5cd5c12e56801b98523 upstream. In register_mem_sect_under_node() the system_state's value is checked to detect whether the call is made during boot time or during an hot-plug operation. Unfortunately, that check against SYSTEM_BOOTING is wrong because regular memory is registered at SYSTEM_SCHEDULING state. In addition, memory hot-plug operation can be triggered at this system state by the ACPI [1]. So checking against the system state is not enough. The consequence is that on system with interleaved node's ranges like this: Early memory node ranges node 1: [mem 0x0000000000000000-0x000000011fffffff] node 2: [mem 0x0000000120000000-0x000000014fffffff] node 1: [mem 0x0000000150000000-0x00000001ffffffff] node 0: [mem 0x0000000200000000-0x000000048fffffff] node 2: [mem 0x0000000490000000-0x00000007ffffffff] This can be seen on PowerPC LPAR after multiple memory hot-plug and hot-unplug operations are done. At the next reboot the node's memory ranges can be interleaved and since the call to link_mem_sections() is made in topology_init() while the system is in the SYSTEM_SCHEDULING state, the node's id is not checked, and the sections registered to multiple nodes: $ ls -l /sys/devices/system/memory/memory21/node* total 0 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2 In that case, the system is able to boot but if later one of theses memory blocks is hot-unplugged and then hot-plugged, the sysfs inconsistency is detected and this is triggering a BUG_ON(): kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084! Oops: Exception in kernel mode, sig: 5 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4 CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25 Call Trace: add_memory_resource+0x23c/0x340 (unreliable) __add_memory+0x5c/0xf0 dlpar_add_lmb+0x1b4/0x500 dlpar_memory+0x1f8/0xb80 handle_dlpar_errorlog+0xc0/0x190 dlpar_store+0x198/0x4a0 kobj_attr_store+0x30/0x50 sysfs_kf_write+0x64/0x90 kernfs_fop_write+0x1b0/0x290 vfs_write+0xe8/0x290 ksys_write+0xdc/0x130 system_call_exception+0x160/0x270 system_call_common+0xf0/0x27c This patch addresses the root cause by not relying on the system_state value to detect whether the call is due to a hot-plug operation. An extra parameter is added to link_mem_sections() detailing whether the operation is due to a hot-plug operation. [1] According to Oscar Salvador, using this qemu command line, ACPI memory hotplug operations are raised at SYSTEM_SCHEDULING state: $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \ -m size=$MEM,slots=255,maxmem=4294967296k \ -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \ -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \ -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \ -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \ -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \ -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \ -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \ -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \ Fixes: 4fbce633910e ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()") Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-07mm: replace memmap_context by meminit_contextLaurent Dufour
commit c1d0da83358a2316d9be7f229f26126dbaa07468 upstream. Patch series "mm: fix memory to node bad links in sysfs", v3. Sometimes, firmware may expose interleaved memory layout like this: Early memory node ranges node 1: [mem 0x0000000000000000-0x000000011fffffff] node 2: [mem 0x0000000120000000-0x000000014fffffff] node 1: [mem 0x0000000150000000-0x00000001ffffffff] node 0: [mem 0x0000000200000000-0x000000048fffffff] node 2: [mem 0x0000000490000000-0x00000007ffffffff] In that case, we can see memory blocks assigned to multiple nodes in sysfs: $ ls -l /sys/devices/system/memory/memory21 total 0 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2 -rw-r--r-- 1 root root 65536 Aug 24 05:27 online -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index drwxr-xr-x 2 root root 0 Aug 24 05:27 power -r--r--r-- 1 root root 65536 Aug 24 05:27 removable -rw-r--r-- 1 root root 65536 Aug 24 05:27 state lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones The same applies in the node's directory with a memory21 link in both the node1 and node2's directory. This is wrong but doesn't prevent the system to run. However when later, one of these memory blocks is hot-unplugged and then hot-plugged, the system is detecting an inconsistency in the sysfs layout and a BUG_ON() is raised: kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084! LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4 CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25 Call Trace: add_memory_resource+0x23c/0x340 (unreliable) __add_memory+0x5c/0xf0 dlpar_add_lmb+0x1b4/0x500 dlpar_memory+0x1f8/0xb80 handle_dlpar_errorlog+0xc0/0x190 dlpar_store+0x198/0x4a0 kobj_attr_store+0x30/0x50 sysfs_kf_write+0x64/0x90 kernfs_fop_write+0x1b0/0x290 vfs_write+0xe8/0x290 ksys_write+0xdc/0x130 system_call_exception+0x160/0x270 system_call_common+0xf0/0x27c This has been seen on PowerPC LPAR. The root cause of this issue is that when node's memory is registered, the range used can overlap another node's range, thus the memory block is registered to multiple nodes in sysfs. There are two issues here: (a) The sysfs memory and node's layouts are broken due to these multiple links (b) The link errors in link_mem_sections() should not lead to a system panic. To address (a) register_mem_sect_under_node should not rely on the system state to detect whether the link operation is triggered by a hot plug operation or not. This is addressed by the patches 1 and 2 of this series. Issue (b) will be addressed separately. This patch (of 2): The memmap_context enum is used to detect whether a memory operation is due to a hot-add operation or happening at boot time. Make it general to the hotplug operation and rename it as meminit_context. There is no functional change introduced by this patch Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J . Wysocki" <rafael@kernel.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-07vsock/virtio: add transport parameter to the virtio_transport_reset_no_sock()Stefano Garzarella
[ Upstream commit 4c7246dc45e2706770d5233f7ce1597a07e069ba ] We are going to add 'struct vsock_sock *' parameter to virtio_transport_get_ops(). In some cases, like in the virtio_transport_reset_no_sock(), we don't have any socket assigned to the packet received, so we can't use the virtio_transport_get_ops(). In order to allow virtio_transport_reset_no_sock() to use the '.send_pkt' callback from the 'vhost_transport' or 'virtio_transport', we add the 'struct virtio_transport *' to it and to its caller: virtio_transport_recv_pkt(). We moved the 'vhost_transport' and 'virtio_transport' definition, to pass their address to the virtio_transport_recv_pkt(). Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01ata: make qc_prep return ata_completion_errorsJiri Slaby
commit 95364f36701e62dd50eee91e1303187fd1a9f567 upstream. In case a driver wants to return an error from qc_prep, return enum ata_completion_errors. sata_mv is one of those drivers -- see the next patch. Other drivers return the newly defined AC_ERR_OK. [v2] use enum ata_completion_errors and AC_ERR_OK. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-ide@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-01ata: define AC_ERR_OKJiri Slaby
commit 25937580a5065d6fbd92d9c8ebd47145ad80052e upstream. Since we will return enum ata_completion_errors from qc_prep in the next patch, let's define AC_ERR_OK to mark the OK status. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-ide@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-01NFS: Fix races nfs_page_group_destroy() vs nfs_destroy_unlinked_subrequests()Trond Myklebust
[ Upstream commit 08ca8b21f760c0ed5034a5c122092eec22ccf8f4 ] When a subrequest is being detached from the subgroup, we want to ensure that it is not holding the group lock, or in the process of waiting for the group lock. Fixes: 5b2b5187fa85 ("NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests() race cases") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01PCI: Use ioremap(), not phys_to_virt() for platform ROMMikel Rychliski
[ Upstream commit 72e0ef0e5f067fd991f702f0b2635d911d0cf208 ] On some EFI systems, the video BIOS is provided by the EFI firmware. The boot stub code stores the physical address of the ROM image in pdev->rom. Currently we attempt to access this pointer using phys_to_virt(), which doesn't work with CONFIG_HIGHMEM. On these systems, attempting to load the radeon module on a x86_32 kernel can result in the following: BUG: unable to handle page fault for address: 3e8ed03c #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page *pde = 00000000 Oops: 0000 [#1] PREEMPT SMP CPU: 0 PID: 317 Comm: systemd-udevd Not tainted 5.6.0-rc3-next-20200228 #2 Hardware name: Apple Computer, Inc. MacPro1,1/Mac-F4208DC8, BIOS MP11.88Z.005C.B08.0707021221 07/02/07 EIP: radeon_get_bios+0x5ed/0xe50 [radeon] Code: 00 00 84 c0 0f 85 12 fd ff ff c7 87 64 01 00 00 00 00 00 00 8b 47 08 8b 55 b0 e8 1e 83 e1 d6 85 c0 74 1a 8b 55 c0 85 d2 74 13 <80> 38 55 75 0e 80 78 01 aa 0f 84 a4 03 00 00 8d 74 26 00 68 dc 06 EAX: 3e8ed03c EBX: 00000000 ECX: 3e8ed03c EDX: 00010000 ESI: 00040000 EDI: eec04000 EBP: eef3fc60 ESP: eef3fbe0 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010206 CR0: 80050033 CR2: 3e8ed03c CR3: 2ec77000 CR4: 000006d0 Call Trace: r520_init+0x26/0x240 [radeon] radeon_device_init+0x533/0xa50 [radeon] radeon_driver_load_kms+0x80/0x220 [radeon] drm_dev_register+0xa7/0x180 [drm] radeon_pci_probe+0x10f/0x1a0 [radeon] pci_device_probe+0xd4/0x140 Fix the issue by updating all drivers which can access a platform provided ROM. Instead of calling the helper function pci_platform_rom() which uses phys_to_virt(), call ioremap() directly on the pdev->rom. radeon_read_platform_bios() previously directly accessed an __iomem pointer. Avoid this by calling memcpy_fromio() instead of kmemdup(). pci_platform_rom() now has no remaining callers, so remove it. Link: https://lore.kernel.org/r/20200319021623.5426-1-mikel@mikelr.com Signed-off-by: Mikel Rychliski <mikel@mikelr.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01skbuff: fix a data race in skb_queue_len()Qian Cai
[ Upstream commit 86b18aaa2b5b5bb48e609cd591b3d2d0fdbe0442 ] sk_buff.qlen can be accessed concurrently as noticed by KCSAN, BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_dgram_sendmsg read to 0xffff8a1b1d8a81c0 of 4 bytes by task 5371 on cpu 96: unix_dgram_sendmsg+0x9a9/0xb70 include/linux/skbuff.h:1821 net/unix/af_unix.c:1761 ____sys_sendmsg+0x33e/0x370 ___sys_sendmsg+0xa6/0xf0 __sys_sendmsg+0x69/0xf0 __x64_sys_sendmsg+0x51/0x70 do_syscall_64+0x91/0xb47 entry_SYSCALL_64_after_hwframe+0x49/0xbe write to 0xffff8a1b1d8a81c0 of 4 bytes by task 1 on cpu 99: __skb_try_recv_from_queue+0x327/0x410 include/linux/skbuff.h:2029 __skb_try_recv_datagram+0xbe/0x220 unix_dgram_recvmsg+0xee/0x850 ____sys_recvmsg+0x1fb/0x210 ___sys_recvmsg+0xa2/0xf0 __sys_recvmsg+0x66/0xf0 __x64_sys_recvmsg+0x51/0x70 do_syscall_64+0x91/0xb47 entry_SYSCALL_64_after_hwframe+0x49/0xbe Since only the read is operating as lockless, it could introduce a logic bug in unix_recvq_full() due to the load tearing. Fix it by adding a lockless variant of skb_queue_len() and unix_recvq_full() where READ_ONCE() is on the read while WRITE_ONCE() is on the write similar to the commit d7d16a89350a ("net: add skb_queue_empty_lockless()"). Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01seqlock: Require WRITE_ONCE surrounding raw_seqcount_barrierMarco Elver
[ Upstream commit bf07132f96d426bcbf2098227fb680915cf44498 ] This patch proposes to require marked atomic accesses surrounding raw_write_seqcount_barrier. We reason that otherwise there is no way to guarantee propagation nor atomicity of writes before/after the barrier [1]. For example, consider the compiler tears stores either before or after the barrier; in this case, readers may observe a partial value, and because readers are unaware that writes are going on (writes are not in a seq-writer critical section), will complete the seq-reader critical section while having observed some partial state. [1] https://lwn.net/Articles/793253/ This came up when designing and implementing KCSAN, because KCSAN would flag these accesses as data-races. After careful analysis, our reasoning as above led us to conclude that the best thing to do is to propose an amendment to the raw_seqcount_barrier usage. Signed-off-by: Marco Elver <elver@google.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01debugfs: Fix !DEBUG_FS debugfs_create_automountKusanagi Kouichi
[ Upstream commit 4250b047039d324e0ff65267c8beb5bad5052a86 ] If DEBUG_FS=n, compile fails with the following error: kernel/trace/trace.c: In function 'tracing_init_dentry': kernel/trace/trace.c:8658:9: error: passing argument 3 of 'debugfs_create_automount' from incompatible pointer type [-Werror=incompatible-pointer-types] 8658 | trace_automount, NULL); | ^~~~~~~~~~~~~~~ | | | struct vfsmount * (*)(struct dentry *, void *) In file included from kernel/trace/trace.c:24: ./include/linux/debugfs.h:206:25: note: expected 'struct vfsmount * (*)(void *)' but argument is of type 'struct vfsmount * (*)(struct dentry *, void *)' 206 | struct vfsmount *(*f)(void *), | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~ Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp> Link: https://lore.kernel.org/r/20191121102021787.MLMY.25002.ppp.dion.ne.jp@dmta0003.auone-net.jp Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01mmc: core: Fix size overflow for mmc partitionsBradley Bolen
[ Upstream commit f3d7c2292d104519195fdb11192daec13229c219 ] With large eMMC cards, it is possible to create general purpose partitions that are bigger than 4GB. The size member of the mmc_part struct is only an unsigned int which overflows for gp partitions larger than 4GB. Change this to a u64 to handle the overflow. Signed-off-by: Bradley Bolen <bradleybolen@gmail.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-26net: add __must_check to skb_put_padto()Eric Dumazet
[ Upstream commit 4a009cb04aeca0de60b73f37b102573354214b52 ] skb_put_padto() and __skb_put_padto() callers must check return values or risk use-after-free. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-09-23i2c: algo: pca: Reapply i2c bus settings after resetEvan Nimmo
[ Upstream commit 0a355aeb24081e4538d4d424cd189f16c0bbd983 ] If something goes wrong (such as the SCL being stuck low) then we need to reset the PCA chip. The issue with this is that on reset we lose all config settings and the chip ends up in a disabled state which results in a lock up/high CPU usage. We need to re-apply any configuration that had previously been set and re-enable the chip. Signed-off-by: Evan Nimmo <evan.nimmo@alliedtelesis.co.nz> Reviewed-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Wolfram Sang <wsa@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-17netfilter: conntrack: allow sctp hearbeat after connection re-useFlorian Westphal
[ Upstream commit cc5453a5b7e90c39f713091a7ebc53c1f87d1700 ] If an sctp connection gets re-used, heartbeats are flagged as invalid because their vtag doesn't match. Handle this in a similar way as TCP conntrack when it suspects that the endpoints and conntrack are out-of-sync. When a HEARTBEAT request fails its vtag validation, flag this in the conntrack state and accept the packet. When a HEARTBEAT_ACK is received with an invalid vtag in the reverse direction after we allowed such a HEARTBEAT through, assume we are out-of-sync and re-set the vtag info. v2: remove left-over snippet from an older incarnation that moved new_state/old_state assignments, thats not needed so keep that as-is. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sashal@kernel.org>