| Age | Commit message (Collapse) | Author |
|
[ Upstream commit 1c5c12ee308aacf635c8819cd4baa3bd58f8a8b7 ]
Previously BMC's MAC address is calculated by simply adding 1 to the
last byte of network controller's MAC address, and it produces incorrect
result when network controller's MAC address ends with 0xFF.
The problem can be fixed by calling eth_addr_inc() function to increment
MAC address; besides, the MAC address is also validated before assigning
to BMC.
Fixes: cb10c7c0dfd9 ("net/ncsi: Add NCSI Broadcom OEM command")
Signed-off-by: Tao Ren <taoren@fb.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d3706566ae3d92677b932dd156157fd6c72534b1 upstream.
Syzkaller report this:
BUG: unable to handle kernel paging request at fffffbfff830524b
PGD 237fe8067 P4D 237fe8067 PUD 237e64067 PMD 1c9716067 PTE 0
Oops: 0000 [#1] SMP KASAN PTI
CPU: 1 PID: 4465 Comm: syz-executor.0 Not tainted 5.0.0+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
RIP: 0010:__list_add_valid+0x21/0xe0 lib/list_debug.c:23
Code: 8b 0c 24 e9 17 fd ff ff 90 55 48 89 fd 48 8d 7a 08 53 48 89 d3 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 48 83 ec 08 <80> 3c 02 00 0f 85 8b 00 00 00 48 8b 53 08 48 39 f2 75 35 48 89 f2
RSP: 0018:ffff8881ea2278d0 EFLAGS: 00010282
RAX: dffffc0000000000 RBX: ffffffffc1829250 RCX: 1ffff1103d444ef4
RDX: 1ffffffff830524b RSI: ffffffff85659300 RDI: ffffffffc1829258
RBP: ffffffffc1879250 R08: fffffbfff0acb269 R09: fffffbfff0acb269
R10: ffff8881ea2278f0 R11: fffffbfff0acb268 R12: ffffffffc1829250
R13: dffffc0000000000 R14: 0000000000000008 R15: ffffffffc187c830
FS: 00007fe0361df700(0000) GS:ffff8881f7300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff830524b CR3: 00000001eb39a001 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
__list_add include/linux/list.h:60 [inline]
list_add include/linux/list.h:79 [inline]
proto_register+0x444/0x8f0 net/core/sock.c:3375
nr_proto_init+0x73/0x4b3 [netrom]
? 0xffffffffc1628000
? 0xffffffffc1628000
do_one_initcall+0xbc/0x47d init/main.c:887
do_init_module+0x1b5/0x547 kernel/module.c:3456
load_module+0x6405/0x8c10 kernel/module.c:3804
__do_sys_finit_module+0x162/0x190 kernel/module.c:3898
do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x462e99
Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fe0361dec58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99
RDX: 0000000000000000 RSI: 0000000020000100 RDI: 0000000000000003
RBP: 00007fe0361dec70 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe0361df6bc
R13: 00000000004bcefa R14: 00000000006f6fb0 R15: 0000000000000004
Modules linked in: netrom(+) ax25 fcrypt pcbc af_alg arizona_ldo1 v4l2_common videodev media v4l2_dv_timings hdlc ide_cd_mod snd_soc_sigmadsp_regmap snd_soc_sigmadsp intel_spi_platform intel_spi mtd spi_nor snd_usbmidi_lib usbcore lcd ti_ads7950 hi6421_regulator snd_soc_kbl_rt5663_max98927 snd_soc_hdac_hdmi snd_hda_ext_core snd_hda_core snd_soc_rt5663 snd_soc_core snd_pcm_dmaengine snd_compress snd_soc_rl6231 mac80211 rtc_rc5t583 spi_slave_time leds_pwm hid_gt683r hid industrialio_triggered_buffer kfifo_buf industrialio ir_kbd_i2c rc_core led_class_flash dwc_xlgmac snd_ymfpci gameport snd_mpu401_uart snd_rawmidi snd_ac97_codec snd_pcm ac97_bus snd_opl3_lib snd_timer snd_seq_device snd_hwdep snd soundcore iptable_security iptable_raw iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel hsr veth netdevsim vxcan batman_adv cfg80211 rfkill chnl_net caif nlmon dummy team bonding vcan
bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun joydev mousedev ppdev tpm kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ide_pci_generic piix aesni_intel aes_x86_64 crypto_simd cryptd glue_helper ide_core psmouse input_leds i2c_piix4 serio_raw intel_agp intel_gtt ata_generic agpgart pata_acpi parport_pc rtc_cmos parport floppy sch_fq_codel ip_tables x_tables sha1_ssse3 sha1_generic ipv6 [last unloaded: rxrpc]
Dumping ftrace buffer:
(ftrace buffer empty)
CR2: fffffbfff830524b
---[ end trace 039ab24b305c4b19 ]---
If nr_proto_init failed, it may forget to call proto_unregister,
tiggering this issue.This patch rearrange code of nr_proto_init
to avoid such issues.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bd4264112f93045704731850c5e4d85db981cd85 upstream.
When a driver unloads without unloading TTM we don't correctly
clear the global structures leading to errors on re-init.
Next step should probably be to remove the global structures and
kobjs all together, but this is tricky since we need to maintain
backward compatibility.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Karol Herbst <kherbst@redhat.com>
Tested-by: Karol Herbst <kherbst@redhat.com>
CC: stable@vger.kernel.org # 5.0.x
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b987222654f84f7b4ca95b3a55eca784cb30235b upstream.
This fixes multiple issues in buffer_pipe_buf_ops:
- The ->steal() handler must not return zero unless the pipe buffer has
the only reference to the page. But generic_pipe_buf_steal() assumes
that every reference to the pipe is tracked by the page's refcount,
which isn't true for these buffers - buffer_pipe_buf_get(), which
duplicates a buffer, doesn't touch the page's refcount.
Fix it by using generic_pipe_buf_nosteal(), which refuses every
attempted theft. It should be easy to actually support ->steal, but the
only current users of pipe_buf_steal() are the virtio console and FUSE,
and they also only use it as an optimization. So it's probably not worth
the effort.
- The ->get() and ->release() handlers can be invoked concurrently on pipe
buffers backed by the same struct buffer_ref. Make them safe against
concurrency by using refcount_t.
- The pointers stored in ->private were only zeroed out when the last
reference to the buffer_ref was dropped. As far as I know, this
shouldn't be necessary anyway, but if we do it, let's always do it.
Link: http://lkml.kernel.org/r/20190404215925.253531-1-jannh@google.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
Fixes: 73a757e63114d ("ring-buffer: Return reader page back into existing ring buffer")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 273fe3f1006ea5ebc63d6729e43e8e45e32b256a ]
Set deletion after flush coming in the same batch results in EBUSY. Add
set use counter to track the number of references to this set from
rules. We cannot rely on the list of bindings for this since such list
is still populated from the preparation phase.
Reported-by: Václav Zindulka <vaclav.zindulka@tlapnet.cz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 3ff9c075cc767b3060bdac12da72fc94dd7da1b8 upstream.
Verify the stack frame pointer on kretprobe trampoline handler,
If the stack frame pointer does not match, it skips the wrong
entry and tries to find correct one.
This can happen if user puts the kretprobe on the function
which can be used in the path of ftrace user-function call.
Such functions should not be probed, so this adds a warning
message that reports which function should be blacklisted.
Tested-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/155094059185.6137.15527904013362842072.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
dumping
commit 04f5866e41fb70690e28397487d8bd8eea7d712a upstream.
The core dumping code has always run without holding the mmap_sem for
writing, despite that is the only way to ensure that the entire vma
layout will not change from under it. Only using some signal
serialization on the processes belonging to the mm is not nearly enough.
This was pointed out earlier. For example in Hugh's post from Jul 2017:
https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils
"Not strictly relevant here, but a related note: I was very surprised
to discover, only quite recently, how handle_mm_fault() may be called
without down_read(mmap_sem) - when core dumping. That seems a
misguided optimization to me, which would also be nice to correct"
In particular because the growsdown and growsup can move the
vm_start/vm_end the various loops the core dump does around the vma will
not be consistent if page faults can happen concurrently.
Pretty much all users calling mmget_not_zero()/get_task_mm() and then
taking the mmap_sem had the potential to introduce unexpected side
effects in the core dumping code.
Adding mmap_sem for writing around the ->core_dump invocation is a
viable long term fix, but it requires removing all copy user and page
faults and to replace them with get_dump_page() for all binary formats
which is not suitable as a short term fix.
For the time being this solution manually covers the places that can
confuse the core dump either by altering the vma layout or the vma flags
while it runs. Once ->core_dump runs under mmap_sem for writing the
function mmget_still_valid() can be dropped.
Allowing mmap_sem protected sections to run in parallel with the
coredump provides some minor parallelism advantage to the swapoff code
(which seems to be safe enough by never mangling any vma field and can
keep doing swapins in parallel to the core dumping) and to some other
corner case.
In order to facilitate the backporting I added "Fixes: 86039bd3b4e6"
however the side effect of this same race condition in /proc/pid/mem
should be reproducible since before 2.6.12-rc2 so I couldn't add any
other "Fixes:" because there's no hash beyond the git genesis commit.
Because find_extend_vma() is the only location outside of the process
context that could modify the "mm" structures under mmap_sem for
reading, by adding the mmget_still_valid() check to it, all other cases
that take the mmap_sem for reading don't need the new check after
mmget_not_zero()/get_task_mm(). The expand_stack() in page fault
context also doesn't need the new check, because all tasks under core
dumping are frozen.
Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jann Horn <jannh@google.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 6491d698396fd5da4941980a35ca7c162a672016 ]
This is similar to commit e285d5bfb7e9 ("NFC: Fix the number of pipes")
where we changed NFC_HCI_MAX_PIPES from 127 to 128.
As the comment next to the define explains, the pipe identifier is 7
bits long. The highest possible pipe is 127, but the number of possible
pipes is 128. As the code is now, then there is potential for an
out of bounds array access:
net/nfc/nci/hci.c:297 nci_hci_cmd_received() warn: array off by one?
'ndev->hci_dev->pipes[pipe]' '0-127 == 127'
Fixes: 11f54f228643 ("NFC: nci: Add HCI over NCI protocol support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit b4f47f3848eb70986f75d06112af7b48b7f5f462 ]
Unlike '&&' operator, the '&' does not have short-circuit
evaluation semantics. IOW both sides of the operator always
get evaluated. Fix the wrong operator in
tls_is_sk_tx_device_offloaded(), which would lead to
out-of-bounds access for for non-full sockets.
Fixes: 4799ac81e52a ("tls: Add rx inline crypto offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 35b71a34ada62c9573847a324bf06a133fe11b11 ]
David reports that tls triggers warnings related to
sk->sk_forward_alloc not being zero at destruction time:
WARNING: CPU: 5 PID: 6831 at net/core/stream.c:206 sk_stream_kill_queues+0x103/0x110
WARNING: CPU: 5 PID: 6831 at net/ipv4/af_inet.c:160 inet_sock_destruct+0x15b/0x170
When sender fills up the write buffer and dies from
SIGPIPE. This is due to the device implementation
not cleaning up the partially_sent_record.
This is because commit a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance")
moved the partial record cleanup to the SW-only path.
Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance")
Reported-by: David Beckett <david.beckett@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 8065a779f17e94536a1c4dcee4f9d88011672f97 ]
When a netdev appears through hot plug then gets enslaved by a failover
master that is already up and running, the slave will be opened
right away after getting enslaved. Today there's a race that userspace
(udev) may fail to rename the slave if the kernel (net_failover)
opens the slave earlier than when the userspace rename happens.
Unlike bond or team, the primary slave of failover can't be renamed by
userspace ahead of time, since the kernel initiated auto-enslavement is
unable to, or rather, is never meant to be synchronized with the rename
request from userspace.
As the failover slave interfaces are not designed to be operated
directly by userspace apps: IP configuration, filter rules with
regard to network traffic passing and etc., should all be done on master
interface. In general, userspace apps only care about the
name of master interface, while slave names are less important as long
as admin users can see reliable names that may carry
other information describing the netdev. For e.g., they can infer that
"ens3nsby" is a standby slave of "ens3", while for a
name like "eth0" they can't tell which master it belongs to.
Historically the name of IFF_UP interface can't be changed because
there might be admin script or management software that is already
relying on such behavior and assumes that the slave name can't be
changed once UP. But failover is special: with the in-kernel
auto-enslavement mechanism, the userspace expectation for device
enumeration and bring-up order is already broken. Previously initramfs
and various userspace config tools were modified to bypass failover
slaves because of auto-enslavement and duplicate MAC address. Similarly,
in case that users care about seeing reliable slave name, the new type
of failover slaves needs to be taken care of specifically in userspace
anyway.
It's less risky to lift up the rename restriction on failover slave
which is already UP. Although it's possible this change may potentially
break userspace component (most likely configuration scripts or
management software) that assumes slave name can't be changed while
UP, it's relatively a limited and controllable set among all userspace
components, which can be fixed specifically to listen for the rename
events on failover slaves. Userspace component interacting with slaves
is expected to be changed to operate on failover master interface
instead, as the failover slave is dynamic in nature which may come and
go at any point. The goal is to make the role of failover slaves less
relevant, and userspace components should only deal with failover master
in the long run.
Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit a4046c06be50a4f01d435aa7fe57514818e6cc82 ]
Use offsetof() to calculate offset of a field to take advantage of
compiler built-in version when possible, and avoid UBSAN warning when
compiling with Clang:
UBSAN: Undefined behaviour in mm/swapfile.c:3010:38
member access within null pointer of type 'union swap_header'
CPU: 6 PID: 1833 Comm: swapon Tainted: G S 4.19.23 #43
Call trace:
dump_backtrace+0x0/0x194
show_stack+0x20/0x2c
__dump_stack+0x20/0x28
dump_stack+0x70/0x94
ubsan_epilogue+0x14/0x44
ubsan_type_mismatch_common+0xf4/0xfc
__ubsan_handle_type_mismatch_v1+0x34/0x54
__se_sys_swapon+0x654/0x1084
__arm64_sys_swapon+0x1c/0x24
el0_svc_common+0xa8/0x150
el0_svc_compat_handler+0x2c/0x38
el0_svc_compat+0x8/0x18
Link: http://lkml.kernel.org/r/20190312081902.223764-1-pihsun@chromium.org
Signed-off-by: Pi-Hsun Shih <pihsun@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 930c9f9125c85b5134b3e711bc252ecc094708e3 ]
rxrpc_disconnect_client_call() reads the call's connection ID protocol
value (call->cid) as part of that function's variable declarations. This
is bad because it's not inside the locked section and so may race with
someone granting use of the channel to the call.
This manifests as an assertion failure (see below) where the call in the
presumed channel (0 because call->cid wasn't set when we read it) doesn't
match the call attached to the channel we were actually granted (if 1, 2 or
3).
Fix this by moving the read and dependent calculations inside of the
channel_lock section. Also, only set the channel number and pointer
variables if cid is not zero (ie. unset).
This problem can be induced by injecting an occasional error in
rxrpc_wait_for_channel() before the call to schedule().
Make two further changes also:
(1) Add a trace for wait failure in rxrpc_connect_call().
(2) Drop channel_lock before BUG'ing in the case of the assertion failure.
The failure causes a trace akin to the following:
rxrpc: Assertion failed - 18446612685268945920(0xffff8880beab8c00) == 18446612685268621312(0xffff8880bea69800) is false
------------[ cut here ]------------
kernel BUG at net/rxrpc/conn_client.c:824!
...
RIP: 0010:rxrpc_disconnect_client_call+0x2bf/0x99d
...
Call Trace:
rxrpc_connect_call+0x902/0x9b3
? wake_up_q+0x54/0x54
rxrpc_new_client_call+0x3a0/0x751
? rxrpc_kernel_begin_call+0x141/0x1bc
? afs_alloc_call+0x1b5/0x1b5
rxrpc_kernel_begin_call+0x141/0x1bc
afs_make_call+0x20c/0x525
? afs_alloc_call+0x1b5/0x1b5
? __lock_is_held+0x40/0x71
? lockdep_init_map+0xaf/0x193
? lockdep_init_map+0xaf/0x193
? __lock_is_held+0x40/0x71
? yfs_fs_fetch_data+0x33b/0x34a
yfs_fs_fetch_data+0x33b/0x34a
afs_fetch_data+0xdc/0x3b7
afs_read_dir+0x52d/0x97f
afs_dir_iterate+0xa0/0x661
? iterate_dir+0x63/0x141
iterate_dir+0xa2/0x141
ksys_getdents64+0x9f/0x11b
? filldir+0x111/0x111
? do_syscall_64+0x3e/0x1a0
__x64_sys_getdents64+0x16/0x19
do_syscall_64+0x7d/0x1a0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fixes: 45025bceef17 ("rxrpc: Improve management and caching of client connection objects")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6377f787aeb945cae7abbb6474798de129e1f3ac ]
KASAN report this:
BUG: KASAN: use-after-free in pde_subdir_find+0x12d/0x150 fs/proc/generic.c:71
Read of size 8 at addr ffff8881f41fe5b0 by task syz-executor.0/2806
CPU: 0 PID: 2806 Comm: syz-executor.0 Not tainted 5.0.0-rc7+ #45
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0xfa/0x1ce lib/dump_stack.c:113
print_address_description+0x65/0x270 mm/kasan/report.c:187
kasan_report+0x149/0x18d mm/kasan/report.c:317
pde_subdir_find+0x12d/0x150 fs/proc/generic.c:71
remove_proc_entry+0xe8/0x420 fs/proc/generic.c:667
atalk_proc_exit+0x18/0x820 [appletalk]
atalk_exit+0xf/0x5a [appletalk]
__do_sys_delete_module kernel/module.c:1018 [inline]
__se_sys_delete_module kernel/module.c:961 [inline]
__x64_sys_delete_module+0x3dc/0x5e0 kernel/module.c:961
do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x462e99
Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fb2de6b9c58 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000200001c0
RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb2de6ba6bc
R13: 00000000004bccaa R14: 00000000006f6bc8 R15: 00000000ffffffff
Allocated by task 2806:
set_track mm/kasan/common.c:85 [inline]
__kasan_kmalloc.constprop.3+0xa0/0xd0 mm/kasan/common.c:496
slab_post_alloc_hook mm/slab.h:444 [inline]
slab_alloc_node mm/slub.c:2739 [inline]
slab_alloc mm/slub.c:2747 [inline]
kmem_cache_alloc+0xcf/0x250 mm/slub.c:2752
kmem_cache_zalloc include/linux/slab.h:730 [inline]
__proc_create+0x30f/0xa20 fs/proc/generic.c:408
proc_mkdir_data+0x47/0x190 fs/proc/generic.c:469
0xffffffffc10c01bb
0xffffffffc10c0166
do_one_initcall+0xfa/0x5ca init/main.c:887
do_init_module+0x204/0x5f6 kernel/module.c:3460
load_module+0x66b2/0x8570 kernel/module.c:3808
__do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902
do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Freed by task 2806:
set_track mm/kasan/common.c:85 [inline]
__kasan_slab_free+0x130/0x180 mm/kasan/common.c:458
slab_free_hook mm/slub.c:1409 [inline]
slab_free_freelist_hook mm/slub.c:1436 [inline]
slab_free mm/slub.c:2986 [inline]
kmem_cache_free+0xa6/0x2a0 mm/slub.c:3002
pde_put+0x6e/0x80 fs/proc/generic.c:647
remove_proc_entry+0x1d3/0x420 fs/proc/generic.c:684
0xffffffffc10c031c
0xffffffffc10c0166
do_one_initcall+0xfa/0x5ca init/main.c:887
do_init_module+0x204/0x5f6 kernel/module.c:3460
load_module+0x66b2/0x8570 kernel/module.c:3808
__do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902
do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The buggy address belongs to the object at ffff8881f41fe500
which belongs to the cache proc_dir_entry of size 256
The buggy address is located 176 bytes inside of
256-byte region [ffff8881f41fe500, ffff8881f41fe600)
The buggy address belongs to the page:
page:ffffea0007d07f80 count:1 mapcount:0 mapping:ffff8881f6e69a00 index:0x0
flags: 0x2fffc0000000200(slab)
raw: 02fffc0000000200 dead000000000100 dead000000000200 ffff8881f6e69a00
raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8881f41fe480: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
ffff8881f41fe500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8881f41fe580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8881f41fe600: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
ffff8881f41fe680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
It should check the return value of atalk_proc_init fails,
otherwise atalk_exit will trgger use-after-free in pde_subdir_find
while unload the module.This patch fix error cleanup path of atalk_init
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ffc8599aa9763f39f6736a79da4d1575e7006f9a ]
On machines where the GART aperture is mapped over physical RAM,
/proc/kcore contains the GART aperture range. Accessing the GART range via
/proc/kcore results in a kernel crash.
vmcore used to have the same issue, until it was fixed with commit
2a3e83c6f96c ("x86/gart: Exclude GART aperture from vmcore")', leveraging
existing hook infrastructure in vmcore to let /proc/vmcore return zeroes
when attempting to read the aperture region, and so it won't read from the
actual memory.
Apply the same workaround for kcore. First implement the same hook
infrastructure for kcore, then reuse the hook functions introduced in the
previous vmcore fix. Just with some minor adjustment, rename some functions
for more general usage, and simplify the hook infrastructure a bit as there
is no module usage yet.
Suggested-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Kairui Song <kasong@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jiri Bohac <jbohac@suse.cz>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Dave Young <dyoung@redhat.com>
Link: https://lkml.kernel.org/r/20190308030508.13548-1-kasong@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit cf94db21905333e610e479688add629397a4b384 upstream.
vring_create_virtqueue() allows the caller to specify via the
may_reduce_num parameter whether the vring code is allowed to
allocate a smaller ring than specified.
However, the split ring allocation code tries to allocate a
smaller ring on allocation failure regardless of what the
caller specified. This may cause trouble for e.g. virtio-pci
in legacy mode, which does not support ring resizing. (The
packed ring code does not resize in any case.)
Let's fix this by bailing out immediately in the split ring code
if the requested size cannot be allocated and may_reduce_num has
not been specified.
While at it, fix a typo in the usage instructions.
Fixes: 2a2d1382fe9d ("virtio: Add improved queue allocation API")
Cc: stable@vger.kernel.org # v4.6+
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Jens Freimann <jfreimann@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0b3d6e6f2dd0a7b697b1aa8c167265908940624b upstream.
Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") memcg dirty and writeback counters are managed
as:
1) per-memcg per-cpu values in range of [-32..32]
2) per-memcg atomic counter
When a per-cpu counter cannot fit in [-32..32] it's flushed to the
atomic. Stat readers only check the atomic. Thus readers such as
balance_dirty_pages() may see a nontrivial error margin: 32 pages per
cpu.
Assuming 100 cpus:
4k x86 page_size: 13 MiB error per memcg
64k ppc page_size: 200 MiB error per memcg
Considering that dirty+writeback are used together for some decisions the
errors double.
This inaccuracy can lead to undeserved oom kills. One nasty case is
when all per-cpu counters hold positive values offsetting an atomic
negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
balance_dirty_pages() only consults the atomic and does not consider
throttling the next n_cpu*32 dirty pages. If the file_lru is in the
13..200 MiB range then there's absolutely no dirty throttling, which
burdens vmscan with only dirty+writeback pages thus resorting to oom
kill.
It could be argued that tiny containers are not supported, but it's more
subtle. It's the amount the space available for file lru that matters.
If a container has memory.max-200MiB of non reclaimable memory, then it
will also suffer such oom kills on a 100 cpu machine.
The following test reliably ooms without this patch. This patch avoids
oom kills.
$ cat test
mount -t cgroup2 none /dev/cgroup
cd /dev/cgroup
echo +io +memory > cgroup.subtree_control
mkdir test
cd test
echo 10M > memory.max
(echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
(echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)
$ cat memcg-writeback-stress.c
/*
* Dirty pages from all but one cpu.
* Clean pages from the non dirtying cpu.
* This is to stress per cpu counter imbalance.
* On a 100 cpu machine:
* - per memcg per cpu dirty count is 32 pages for each of 99 cpus
* - per memcg atomic is -99*32 pages
* - thus the complete dirty limit: sum of all counters 0
* - balance_dirty_pages() only sees atomic count -99*32 pages, which
* it max()s to 0.
* - So a workload can dirty -99*32 pages before balance_dirty_pages()
* cares.
*/
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <sched.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysinfo.h>
#include <sys/types.h>
#include <unistd.h>
static char *buf;
static int bufSize;
static void set_affinity(int cpu)
{
cpu_set_t affinity;
CPU_ZERO(&affinity);
CPU_SET(cpu, &affinity);
if (sched_setaffinity(0, sizeof(affinity), &affinity))
err(1, "sched_setaffinity");
}
static void dirty_on(int output_fd, int cpu)
{
int i, wrote;
set_affinity(cpu);
for (i = 0; i < 32; i++) {
for (wrote = 0; wrote < bufSize; ) {
int ret = write(output_fd, buf+wrote, bufSize-wrote);
if (ret == -1)
err(1, "write");
wrote += ret;
}
}
}
int main(int argc, char **argv)
{
int cpu, flush_cpu = 1, output_fd;
const char *output;
if (argc != 2)
errx(1, "usage: output_file");
output = argv[1];
bufSize = getpagesize();
buf = malloc(getpagesize());
if (buf == NULL)
errx(1, "malloc failed");
output_fd = open(output, O_CREAT|O_RDWR);
if (output_fd == -1)
err(1, "open(%s)", output);
for (cpu = 0; cpu < get_nprocs(); cpu++) {
if (cpu != flush_cpu)
dirty_on(output_fd, cpu);
}
set_affinity(flush_cpu);
if (fsync(output_fd))
err(1, "fsync(%s)", output);
if (close(output_fd))
err(1, "close(%s)", output);
free(buf);
}
Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
collect exact per memcg counters. This avoids the aforementioned oom
kills.
This does not affect the overhead of memory.stat, which still reads the
single atomic counter.
Why not use percpu_counter? memcg already handles cpus going offline, so
no need for that overhead from percpu_counter. And the percpu_counter
spinlocks are more heavyweight than is required.
It probably also makes sense to use exact dirty and writeback counters
in memcg oom reports. But that is saved for later.
Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org> [4.16+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6147e136ff5071609b54f18982dea87706288e21 upstream.
clang points out with hundreds of warnings that the bitrev macros have a
problem with constant input:
drivers/hwmon/sht15.c:187:11: error: variable '__x' is uninitialized when used within its own initialization
[-Werror,-Wuninitialized]
u8 crc = bitrev8(data->val_status & 0x0F);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/bitrev.h:102:21: note: expanded from macro 'bitrev8'
__constant_bitrev8(__x) : \
~~~~~~~~~~~~~~~~~~~^~~~
include/linux/bitrev.h:67:11: note: expanded from macro '__constant_bitrev8'
u8 __x = x; \
~~~ ^
Both the bitrev and the __constant_bitrev macros use an internal
variable named __x, which goes horribly wrong when passing one to the
other.
The obvious fix is to rename one of the variables, so this adds an extra
'_'.
It seems we got away with this because
- there are only a few drivers using bitrev macros
- usually there are no constant arguments to those
- when they are constant, they tend to be either 0 or (unsigned)-1
(drivers/isdn/i4l/isdnhdlc.o, drivers/iio/amplifiers/ad8366.c) and
give the correct result by pure chance.
In fact, the only driver that I could find that gets different results
with this is drivers/net/wan/slic_ds26522.c, which in turn is a driver
for fairly rare hardware (adding the maintainer to Cc for testing).
Link: http://lkml.kernel.org/r/20190322140503.123580-1-arnd@arndb.de
Fixes: 556d2f055bf6 ("ARM: 8187/1: add CONFIG_HAVE_ARCH_BITREVERSE to support rbit instruction")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Cc: Zhao Qiang <qiang.zhao@nxp.com>
Cc: Yalin Wang <yalin.wang@sonymobile.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 5f074f3e192f10c9fade898b9b3b8812e3d83342 ]
A recent optimization in Clang (r355672) lowers comparisons of the
return value of memcmp against zero to comparisons of the return value
of bcmp against zero. This helps some platforms that implement bcmp
more efficiently than memcmp. glibc simply aliases bcmp to memcmp, but
an optimized implementation is in the works.
This results in linkage failures for all targets with Clang due to the
undefined symbol. For now, just implement bcmp as a tailcail to memcmp
to unbreak the build. This routine can be further optimized in the
future.
Other ideas discussed:
* A weak alias was discussed, but breaks for architectures that define
their own implementations of memcmp since aliases to declarations are
not permitted (only definitions). Arch-specific memcmp
implementations typically declare memcmp in C headers, but implement
them in assembly.
* -ffreestanding also is used sporadically throughout the kernel.
* -fno-builtin-bcmp doesn't work when doing LTO.
Link: https://bugs.llvm.org/show_bug.cgi?id=41035
Link: https://code.woboq.org/userspace/glibc/string/memcmp.c.html#bcmp
Link: https://github.com/llvm/llvm-project/commit/8e16d73346f8091461319a7dfc4ddd18eedcff13
Link: https://github.com/ClangBuiltLinux/linux/issues/416
Link: http://lkml.kernel.org/r/20190313211335.165605-1-ndesaulniers@google.com
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Reported-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: James Y Knight <jyknight@google.com>
Suggested-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Suggested-by: Nathan Chancellor <natechancellor@gmail.com>
Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 80a2a9026b24c6bd34b8d58256973e22270bedec ]
Refresh tirs is looping over a global list of tirs while netdevs are
adding and removing tirs from that list. That is why a lock is
required.
Fixes: 724b2aa15126 ("net/mlx5e: TIRs management refactoring")
Signed-off-by: Yuval Avnery <yuvalav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8c83f2df9c6578ea4c5b940d8238ad8a41b87e9e ]
Configuration check to accept source route IP options should be made on
the incoming netdevice when the skb->dev is an l3mdev master. The route
lookup for the source route next hop also needs the incoming netdev.
v2->v3:
- Simplify by passing the original netdevice down the stack (per David
Ahern).
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 355b98553789b646ed97ad801a619ff898471b92 ]
net_hash_mix() currently uses kernel address of a struct net,
and is used in many places that could be used to reveal this
address to a patient attacker, thus defeating KASLR, for
the typical case (initial net namespace, &init_net is
not dynamically allocated)
I believe the original implementation tried to avoid spending
too many cycles in this function, but security comes first.
Also provide entropy regardless of CONFIG_NET_NS.
Fixes: 0b4419162aa6 ("netns: introduce the net_hash_mix "salt" for hashes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reported-by: Benny Pinkas <benny@pinkas.net>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b8e204006340b7aaf32bd2b9806c692f6e0cb38a ]
Add .release_ops, that is called in case of error at a later stage in
the expression initialization path, ie. .select_ops() has been already
set up operations and that needs to be undone. This allows us to unwind
.select_ops from the error path, ie. release the dynamic operations for
this extension.
Moreover, allocate one single operation instead of recycling them, this
comes at the cost of consuming a bit more memory per rule, but it
simplifies the infrastructure.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 27da0d2ef998e222a876c0cec72aa7829a626266 ]
A bugfix just broke compilation of appletalk when CONFIG_SYSCTL
is disabled:
In file included from net/appletalk/ddp.c:65:
net/appletalk/ddp.c: In function 'atalk_init':
include/linux/atalk.h:164:34: error: expected expression before 'do'
#define atalk_register_sysctl() do { } while(0)
^~
net/appletalk/ddp.c:1934:7: note: in expansion of macro 'atalk_register_sysctl'
rc = atalk_register_sysctl();
This is easier to avoid by using conventional inline functions
as stubs rather than macros. The header already has inline
functions for other purposes, so I'm changing over all the
macros for consistency.
Fixes: 6377f787aeb9 ("appletalk: Fix use-after-free in atalk_proc_exit")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8e2f311a68494a6677c1724bdcb10bada21af37c ]
Following command:
iptables -D FORWARD -m physdev ...
causes connectivity loss in some setups.
Reason is that iptables userspace will probe kernel for the module revision
of the physdev patch, and physdev has an artificial dependency on
br_netfilter (xt_physdev use makes no sense unless a br_netfilter module
is loaded).
This causes the "phydev" module to be loaded, which in turn enables the
"call-iptables" infrastructure.
bridged packets might then get dropped by the iptables ruleset.
The better fix would be to change the "call-iptables" defaults to 0 and
enforce explicit setting to 1, but that breaks backwards compatibility.
This does the next best thing: add a request_module call to checkentry.
This was a stray '-D ... -m physdev' won't activate br_netfilter
anymore.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
the accounting
[ Upstream commit 51bee5abeab2058ea5813c5615d6197a23dbf041 ]
The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
needs pids_free() to uncharge the pid.
However, ->free() is called from __put_task_struct()->cgroup_free() and this
is too late. Even the trivial program which does
for (;;) {
int pid = fork();
assert(pid >= 0);
if (pid)
wait(NULL);
else
exit(0);
}
can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
implies an RCU gp after the task/pid goes away and before the final put().
Test-case:
mkdir -p /tmp/CG
mount -t cgroup2 none /tmp/CG
echo '+pids' > /tmp/CG/cgroup.subtree_control
mkdir /tmp/CG/PID
echo 2 > /tmp/CG/PID/pids.max
perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
echo $! > /tmp/CG/PID/cgroup.procs
Without this patch the forking process fails soon after migration.
Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
into the new helper, cgroup_release(), called by release_task() which actually
frees the pid(s).
Reported-by: Herton R. Krzesinski <hkrzesin@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 116bfa96a255123ed209da6544f74a4f2eaca5da ]
Compiling with W=1 generates warnings:
CC kernel/bpf/core.o
kernel/bpf/core.c:721:12: warning: no previous prototype for ?bpf_jit_alloc_exec_limit? [-Wmissing-prototypes]
721 | u64 __weak bpf_jit_alloc_exec_limit(void)
| ^~~~~~~~~~~~~~~~~~~~~~~~
kernel/bpf/core.c:757:14: warning: no previous prototype for ?bpf_jit_alloc_exec? [-Wmissing-prototypes]
757 | void *__weak bpf_jit_alloc_exec(unsigned long size)
| ^~~~~~~~~~~~~~~~~~
kernel/bpf/core.c:762:13: warning: no previous prototype for ?bpf_jit_free_exec? [-Wmissing-prototypes]
762 | void __weak bpf_jit_free_exec(void *addr)
| ^~~~~~~~~~~~~~~~~
All three are weak functions that archs can override, provide
proper prototypes for when a new arch provides their own.
Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c546951d9c9300065bad253ecdf1ac59ce9d06c8 ]
move_queued_task() synchronizes with task_rq_lock() as follows:
move_queued_task() task_rq_lock()
[S] ->on_rq = MIGRATING [L] rq = task_rq()
WMB (__set_task_cpu()) ACQUIRE (rq->lock);
[S] ->cpu = new_cpu [L] ->on_rq
where "[L] rq = task_rq()" is ordered before "ACQUIRE (rq->lock)" by an
address dependency and, in turn, "ACQUIRE (rq->lock)" is ordered before
"[L] ->on_rq" by the ACQUIRE itself.
Use READ_ONCE() to load ->cpu in task_rq() (c.f., task_cpu()) to honor
this address dependency. Also, mark the accesses to ->cpu and ->on_rq
with READ_ONCE()/WRITE_ONCE() to comply with the LKMM.
Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/20190121155240.27173-1-andrea.parri@amarulasolutions.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 840018668ce2d96783356204ff282d6c9b0e5f66 ]
When pmu::setup_aux() is called the coresight PMU needs to know which
sink to use for the session by looking up the information in the
event's attr::config2 field.
As such simply replace the cpu information by the complete perf_event
structure and change all affected customers.
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Suzuki Poulouse <suzuki.poulose@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Link: http://lkml.kernel.org/r/20190131184714.20388-2-mathieu.poirier@linaro.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1136b0728969901a091f0471968b2b76ed14d9ad ]
Waiman reported that on large systems with a large amount of interrupts the
readout of /proc/stat takes a long time to sum up the interrupt
statistics. In principle this is not a problem. but for unknown reasons
some enterprise quality software reads /proc/stat with a high frequency.
The reason for this is that interrupt statistics are accounted per cpu. So
the /proc/stat logic has to sum up the interrupt stats for each interrupt.
This can be largely avoided for interrupts which are not marked as
'PER_CPU' interrupts by simply adding a per interrupt summation counter
which is incremented along with the per interrupt per cpu counter.
The PER_CPU interrupts need to avoid that and use only per cpu accounting
because they share the interrupt number and the interrupt descriptor and
concurrent updates would conflict or require unwanted synchronization.
Reported-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: https://lkml.kernel.org/r/20190208135020.925487496@linutronix.de
8<-------------
v2: Undo the unintentional layout change of struct irq_desc.
include/linux/irqdesc.h | 1 +
kernel/irq/chip.c | 12 ++++++++++--
kernel/irq/internals.h | 8 +++++++-
kernel/irq/irqdesc.c | 7 ++++++-
4 files changed, 24 insertions(+), 4 deletions(-)
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 99687cdbb3f6c8e32bcc7f37496e811f30460e48 ]
The percpu members of struct sd_data and s_data are declared as:
struct ... ** __percpu member;
So their type is:
__percpu pointer to pointer to struct ...
But looking at how they're used, their type should be:
pointer to __percpu pointer to struct ...
and they should thus be declared as:
struct ... * __percpu *member;
So fix the placement of '__percpu' in the definition of these
structures.
This addresses a bunch of Sparse's warnings like:
warning: incorrect type in initializer (different address spaces)
expected void const [noderef] <asn:3> *__vpp_verify
got struct sched_domain **
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190118144936.79158-1-luc.vanoostenryck@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5c418dc789a3898717ebf2caa5716ba91a7150b2 ]
The following commit:
a893ea15d764 ("tpm: move tpm_chip definition to include/linux/tpm.h")
introduced a build error when both IMA and EFI are enabled:
In file included from ../security/integrity/ima/ima_fs.c:30:
../security/integrity/ima/ima.h:176:7: error: redeclaration of enumerator "NONE"
What happens is that both headers (ima.h and efi.h) defines the same
'NONE' constant, and it broke when they started getting included from
the same file:
Rework to prefix the EFI enum with 'EFI_*'.
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20190215165551.12220-2-ard.biesheuvel@linaro.org
[ Cleaned up the changelog a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8beb90aaf334a6efa3e924339926b5f93a234dbb ]
commit 1917d42d14b7 ("fcoe: use enum for fip_mode") introduces a separate
enum for the fip_mode that shall be used during initialisation handling
until it is passed to fcoe_ctrl_link_up to set the initial fip_state. That
change was incomplete and gcc quietly converted in various places between
the fip_mode and the fip_state enum values with implicit enum conversions,
which fortunately cannot cause any issues in the actual code's execution.
clang however warns about these implicit enum conversions in the scsi
drivers. This commit consolidates the use of the two enums, guided by
clang's enum-conversion warnings.
This commit now completes the use of the fip_mode: It expects and uses
fip_mode in {bnx2fc,fcoe}_interface_create and fcoe_ctlr_init, and it calls
fcoe_ctrl_set_set() with the correct values in fcoe_ctlr_link_up(). It
also breaks the association between FIP_MODE_AUTO and FIP_ST_AUTO to
indicate these two enums are distinct.
Link: https://github.com/ClangBuiltLinux/linux/issues/151
Fixes: 1917d42d14b7 ("fcoe: use enum for fip_mode")
Reported-by: Dmitry Golovin <dima@golovin.in>
Original-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
CC: Lukas Bulwahn <lukas.bulwahn@gmail.com>
CC: Nick Desaulniers <ndesaulniers@google.com>
CC: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Suggested-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d13501a2bedfbea0983cc868d3f1dc692627f60d ]
Custom approximation of fractional-divider may not need parent clock
rate checking. For example Rockchip SoCs work fine using grand parent
clock rate even if target rate is greater than parent.
This patch checks parent clock rate only if CLK_SET_RATE_PARENT flag
is set.
For detailed example, clock tree of Rockchip I2S audio hardware.
- Clock rate of CPLL is 1.2GHz, GPLL is 491.52MHz.
- i2s1_div is integer divider can divide N (N is 1~128).
Input clock is CPLL or GPLL. Initial divider value is N = 1.
Ex) PLL = CPLL, N = 10, i2s1_div output rate is
CPLL / 10 = 1.2GHz / 10 = 120MHz
- i2s1_frac is fractional divider can divide input to x/y, x and
y are 16bit integer.
CPLL --> | selector | ---> i2s1_div -+--> | selector | --> I2S1 MCLK
GPLL --> | | ,--------------' | |
`--> i2s1_frac ---> | |
Clock mux system try to choose suitable one from i2s1_div and
i2s1_frac for master clock (MCLK) of I2S1.
Bad scenario as follows:
- Try to set MCLK to 8.192MHz (32kHz audio replay)
Candidate setting is
- i2s1_div: GPLL / 60 = 8.192MHz
i2s1_div candidate is exactly same as target clock rate, so mux
choose this clock source. i2s1_div output rate is changed
491.52MHz -> 8.192MHz
- After that try to set to 11.2896MHz (44.1kHz audio replay)
Candidate settings are
- i2s1_div : CPLL / 107 = 11.214945MHz
- i2s1_frac: i2s1_div = 8.192MHz
This is because clk_fd_round_rate() thinks target rate
(11.2896MHz) is higher than parent rate (i2s1_div = 8.192MHz)
and returns parent clock rate.
Above is current upstreamed behavior. Clock mux system choose
i2s1_div, but this clock rate is not acceptable for I2S driver, so
users cannot replay audio.
Expected behavior is:
- Try to set master clock to 11.2896MHz (44.1kHz audio replay)
Candidate settings are
- i2s1_div : CPLL / 107 = 11.214945MHz
- i2s1_frac: i2s1_div * 147/6400 = 11.2896MHz
Change i2s1_div to GPLL / 1 = 491.52MHz at same
time.
If apply this commit, clk_fd_round_rate() calls custom approximate
function of Rockchip even if target rate is higher than parent.
Custom function changes both grand parent (i2s1_div) and parent
(i2s_frac) settings at same time. Clock mux system can choose
i2s1_frac and audio works fine.
Signed-off-by: Katsuhiro Suzuki <katsuhiro@katsuster.net>
Reviewed-by: Heiko Stuebner <heiko@sntech.de>
[sboyd@kernel.org: Make function into a macro instead]
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 93b6604c5a669d84e45fe5129294875bf82eb1ff ]
A previous change allowed I2C client devices to discover new IRQs upon
reprobe by clearing the IRQ in i2c_device_remove. However, if an IRQ was
assigned in i2c_new_device, that information is lost.
For example, the touchscreen and trackpad devices on a Dell Inspiron laptop
are I2C devices whose IRQs are defined by ACPI extended IRQ types. The
client device structures are initialized during an ACPI walk. After
removing the i2c_hid device, modprobe fails.
This change caches the initial IRQ value in i2c_new_device and then resets
the client device IRQ to the initial value in i2c_device_remove.
Fixes: 6f108dd70d30 ("i2c: Clear client->irq in i2c_device_remove")
Signed-off-by: Jim Broadus <jbroadus@gmail.com>
Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Reviewed-by: Charles Keepax <ckeepax@opensource.cirrus.com>
[wsa: this is an easy to backport fix for the regression. We will
refactor the code to handle irq assignments better in general.]
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 500e0b28ecd3c5aade98f3c3a339d18dcb166bb6 ]
We use below condition to check inline_xattr_size boundary:
if (!F2FS_OPTION(sbi).inline_xattr_size ||
F2FS_OPTION(sbi).inline_xattr_size >=
DEF_ADDRS_PER_INODE -
F2FS_TOTAL_EXTRA_ATTR_SIZE -
DEF_INLINE_RESERVED_SIZE -
DEF_MIN_INLINE_SIZE)
There is there problems in that check:
- we should allow inline_xattr_size equaling to min size of inline
{data,dentry} area.
- F2FS_TOTAL_EXTRA_ATTR_SIZE and inline_xattr_size are based on
different size unit, previous one is 4 bytes, latter one is 1 bytes.
- DEF_MIN_INLINE_SIZE only indicate min size of inline data area,
however, we need to consider min size of inline dentry area as well,
minimal inline dentry should at least contain two entries: '.' and
'..', so that min inline_dentry size is 40 bytes.
.bitmap 1 * 1 = 1
.reserved 1 * 1 = 1
.dentry 11 * 2 = 22
.filename 8 * 2 = 16
total 40
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit bcf6f55a0d05eedd8ebb6ecc60ae3f93205ad833 ]
Building little-endian allmodconfig kernels on arm64 started failing
with the generated atomic.h implementation, since we now try to call
kasan helpers from the EFI stub:
aarch64-linux-gnu-ld: drivers/firmware/efi/libstub/arm-stub.stub.o: in function `atomic_set':
include/generated/atomic-instrumented.h:44: undefined reference to `__efistub_kasan_check_write'
I suspect that we get similar problems in other files that explicitly
disable KASAN for some reason but call atomic_t based helper functions.
We can fix this by checking the predefined __SANITIZE_ADDRESS__ macro
that the compiler sets instead of checking CONFIG_KASAN, but this in
turn requires a small hack in mm/kasan/common.c so we do see the extern
declaration there instead of the inline function.
Link: http://lkml.kernel.org/r/20181211133453.2835077-1-arnd@arndb.de
Fixes: b1864b828644 ("locking/atomics: build atomic headers as required")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: Anders Roxell <anders.roxell@linaro.org>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>,
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 62461ac2e5b6520b6d65fc6d7d7b4b8df4b848d8 ]
The percpu member of this structure is declared as:
struct ... ** __percpu member;
So its type is:
__percpu pointer to pointer to struct ...
But looking at how it's used, its type should be:
pointer to __percpu pointer to struct ...
and it should thus be declared as:
struct ... * __percpu *member;
So fix the placement of '__percpu' in the definition of this
structures.
This silents a few Sparse's warnings like:
warning: incorrect type in initializer (different address spaces)
expected void const [noderef] <asn:3> *__vpp_verify
got struct sched_domain **
Link: http://lkml.kernel.org/r/20190118144902.79065-1-luc.vanoostenryck@gmail.com
Fixes: 017c59c042d01 ("relay: Use per CPU constructs for the relay channel buffer pointers")
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Cc: Jens Axboe <axboe@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 31b265b3baaf55f209229888b7ffea523ddab366 ]
As reported back in 2016-11 [1], the "ftdump" kdb command triggers a
BUG for "sleeping function called from invalid context".
kdb's "ftdump" command wants to call ring_buffer_read_prepare() in
atomic context. A very simple solution for this is to add allocation
flags to ring_buffer_read_prepare() so kdb can call it without
triggering the allocation error. This patch does that.
Note that in the original email thread about this, it was suggested
that perhaps the solution for kdb was to either preallocate the buffer
ahead of time or create our own iterator. I'm hoping that this
alternative of adding allocation flags to ring_buffer_read_prepare()
can be considered since it means I don't need to duplicate more of the
core trace code into "trace_kdb.c" (for either creating my own
iterator or re-preparing a ring allocator whose memory was already
allocated).
NOTE: another option for kdb is to actually figure out how to make it
reuse the existing ftrace_dump() function and totally eliminate the
duplication. This sounds very appealing and actually works (the "sr
z" command can be seen to properly dump the ftrace buffer). The
downside here is that ftrace_dump() fully consumes the trace buffer.
Unless that is changed I'd rather not use it because it means "ftdump
| grep xyz" won't be very useful to search the ftrace buffer since it
will throw away the whole trace on the first grep. A future patch to
dump only the last few lines of the buffer will also be hard to
implement.
[1] https://lkml.kernel.org/r/20161117191605.GA21459@google.com
Link: http://lkml.kernel.org/r/20190308193205.213659-1-dianders@chromium.org
Reported-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 6d6ea1e967a246f12cfe2f5fb743b70b2e608d4a upstream.
Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
v6.
This is a followup to the discussion in [1], [2].
IOMMUs using ARMv7 short-descriptor format require page tables (level 1
and 2) to be allocated within the first 4GB of RAM, even on 64-bit
systems.
For L1 tables that are bigger than a page, we can just use
__get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
use GFP_DMA).
For L2 tables that only take 1KB, it would be a waste to allocate a full
page, so we considered 3 approaches:
1. This series, adding support for GFP_DMA32 slab caches.
2. genalloc, which requires pre-allocating the maximum number of L2 page
tables (4096, so 4MB of memory).
3. page_frag, which is not very memory-efficient as it is unable to reuse
freed fragments until the whole page is freed. [3]
This series is the most memory-efficient approach.
stable@ note:
We confirmed that this is a regression, and IOMMU errors happen on 4.19
and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
most likely starts from commit ad67f5a6545f ("arm64: replace ZONE_DMA
with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
platforms (and maybe others?).
[1] https://lists.linuxfoundation.org/pipermail/iommu/2018-November/030876.html
[2] https://lists.linuxfoundation.org/pipermail/iommu/2018-December/031696.html
[3] https://patchwork.codeaurora.org/patch/671639/
This patch (of 3):
IOMMUs using ARMv7 short-descriptor format require page tables to be
allocated within the first 4GB of RAM, even on 64-bit systems. On arm64,
this is done by passing GFP_DMA32 flag to memory allocation functions.
For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
a full page using get_free_pages, so we considered 3 approaches:
1. This patch, adding support for GFP_DMA32 slab caches.
2. genalloc, which requires pre-allocating the maximum number of L2
page tables (4096, so 4MB of memory).
3. page_frag, which is not very memory-efficient as it is unable
to reuse freed fragments until the whole page is freed.
This change makes it possible to create a custom cache in DMA32 zone using
kmem_cache_create, then allocate memory using kmem_cache_alloc.
We do not create a DMA32 kmalloc cache array, as there are currently no
users of kmalloc(..., GFP_DMA32). These calls will continue to trigger a
warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.
This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and
unnecessary).
Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org
Signed-off-by: Nicolas Boichat <drinkcat@chromium.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Sasha Levin <Alexander.Levin@microsoft.com>
Cc: Huaisheng Ye <yehs1@lenovo.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Yong Wu <yong.wu@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Tomasz Figa <tfiga@google.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9b7ea46a82b31c74a37e6ff1c2a1df7d53e392ab upstream.
Commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded
memory to zones until online") introduced move_pfn_range_to_zone() which
calls memmap_init_zone() during onlining a memory block.
memmap_init_zone() will reset pagetype flags and makes migrate type to
be MOVABLE.
However, in __offline_pages(), it also call undo_isolate_page_range()
after offline_isolated_pages() to do the same thing. Due to commit
2ce13640b3f4 ("mm: __first_valid_page skip over offline pages") changed
__first_valid_page() to skip offline pages, undo_isolate_page_range()
here just waste CPU cycles looping around the offlining PFN range while
doing nothing, because __first_valid_page() will return NULL as
offline_isolated_pages() has already marked all memory sections within
the pfn range as offline via offline_mem_sections().
Also, after calling the "useless" undo_isolate_page_range() here, it
reaches the point of no returning by notifying MEM_OFFLINE. Those pages
will be marked as MIGRATE_MOVABLE again once onlining. The only thing
left to do is to decrease the number of isolated pageblocks zone counter
which would make some paths of the page allocation slower that the above
commit introduced.
Even if alloc_contig_range() can be used to isolate 16GB-hugetlb pages
on ppc64, an "int" should still be enough to represent the number of
pageblocks there. Fix an incorrect comment along the way.
[cai@lca.pw: v4]
Link: http://lkml.kernel.org/r/20190314150641.59358-1-cai@lca.pw
Link: http://lkml.kernel.org/r/20190313143133.46200-1-cai@lca.pw
Fixes: 2ce13640b3f4 ("mm: __first_valid_page skip over offline pages")
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 7f07e5f1f778605e98cf2156d4db1ff3a3a1a74a ]
With a recent link mode advertisement code update this helper
providing local pause capability translation used for flow
control link mode negotiation got broken.
For eth drivers using this helper, the issue is apparent only
if either PAUSE or ASYM_PAUSE is being advertised.
Fixes: 3c1bcc8614db ("net: ethernet: Convert phydev advertize and supported from u32 to link mode")
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 273160ffc6b993c7c91627f5a84799c66dfe4dee ]
sctp_hdr(skb) only works when skb->transport_header is set properly.
But in Netfilter, skb->transport_header for ipv6 is not guaranteed
to be right value for sctphdr. It would cause to fail to check the
checksum for sctp packets.
So fix it by using offset, which is always right in all places.
v1->v2:
- Fix the changelog.
Fixes: e6d8b64b34aa ("net: sctp: fix and consolidate SCTP checksumming code")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit a4dc6a49156b1f8d6e17251ffda17c9e6a5db78a ]
When using fanouts with AF_PACKET, the demux functions such as
fanout_demux_cpu will return an index in the fanout socket array, which
corresponds to the selected socket.
The ordering of this array depends on the order the sockets were added
to a given fanout group, so for FANOUT_CPU this means sockets are bound
to cpus in the order they are configured, which is OK.
However, when stopping then restarting the interface these sockets are
bound to, the sockets are reassigned to the fanout group in the reverse
order, due to the fact that they were inserted at the head of the
interface's AF_PACKET socket list.
This means that traffic that was directed to the first socket in the
fanout group is now directed to the last one after an interface restart.
In the case of FANOUT_CPU, traffic from CPU0 will be directed to the
socket that used to receive traffic from the last CPU after an interface
restart.
This commit introduces a helper to add a socket at the tail of a list,
then uses it to register AF_PACKET sockets.
Note that this changes the order in which sockets are listed in /proc and
with sock_diag.
Fixes: dc99f600698d ("packet: Add fanout support")
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 40ba1d9b4d19796afc9b7ece872f5f3e8f5e2c13 ]
The abort path can cause a double-free of an anonymous set.
Added-and-to-be-aborted rule looks like this:
udp dport { 137, 138 } drop
The to-be-aborted transaction list looks like this:
newset
newsetelem
newsetelem
rule
This gets walked in reverse order, so first pass disables the rule, the
set elements, then the set.
After synchronize_rcu(), we then destroy those in same order: rule, set
element, set element, newset.
Problem is that the anonymous set has already been bound to the rule, so
the rule (lookup expression destructor) already frees the set, when then
cause use-after-free when trying to delete the elements from this set,
then try to free the set again when handling the newset expression.
Rule releases the bound set in first place from the abort path, this
causes the use-after-free on set element removal when undoing the new
element transactions. To handle this, skip new element transaction if
set is bound from the abort path.
This is still causes the use-after-free on set element removal. To
handle this, remove transaction from the list when the set is already
bound.
Joint work with Florian Westphal.
Fixes: f6ac85858976 ("netfilter: nf_tables: unbind set in rule from commit path")
Bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1325
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 84c4e1f89fefe70554da0ab33be72c9be7994379 upstream.
Al Viro root-caused a race where the IOCB_CMD_POLL handling of
fget/fput() could cause us to access the file pointer after it had
already been freed:
"In more details - normally IOCB_CMD_POLL handling looks so:
1) io_submit(2) allocates aio_kiocb instance and passes it to
aio_poll()
2) aio_poll() resolves the descriptor to struct file by req->file =
fget(iocb->aio_fildes)
3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
aio_kiocb to 2 (bumps by 1, that is).
4) aio_poll() calls vfs_poll(). After sanity checks (basically,
"poll_wait() had been called and only once") it locks the queue.
That's what the extra reference to iocb had been for - we know we
can safely access it.
5) With queue locked, we check if ->woken has already been set to
true (by aio_poll_wake()) and, if it had been, we unlock the
queue, drop a reference to aio_kiocb and bugger off - at that
point it's a responsibility to aio_poll_wake() and the stuff
called/scheduled by it. That code will drop the reference to file
in req->file, along with the other reference to our aio_kiocb.
6) otherwise, we see whether we need to wait. If we do, we unlock the
queue, drop one reference to aio_kiocb and go away - eventual
wakeup (or cancel) will deal with the reference to file and with
the other reference to aio_kiocb
7) otherwise we remove ourselves from waitqueue (still under the
queue lock), so that wakeup won't get us. No async activity will
be happening, so we can safely drop req->file and iocb ourselves.
If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
won't get freed under us, so we can do all the checks and locking
safely. And we don't touch ->file if we detect that case.
However, vfs_poll() most certainly *does* touch the file it had been
given. So wakeup coming while we are still in ->poll() might end up
doing fput() on that file. That case is not too rare, and usually we
are saved by the still present reference from descriptor table - that
fput() is not the final one.
But if another thread closes that descriptor right after our fget()
and wakeup does happen before ->poll() returns, we are in trouble -
final fput() done while we are in the middle of a method:
Al also wrote a patch to take an extra reference to the file descriptor
to fix this, but I instead suggested we just streamline the whole file
pointer handling by submit_io() so that the generic aio submission code
simply keeps the file pointer around until the aio has completed.
Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL")
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bb229bbb3bf63d23128e851a1f3b85c083178fa1 upstream.
Because map updates are distributed lazily, an OSD may not know about
the new blacklist for quite some time after "osd blacklist add" command
is completed. This makes it possible for a blacklisted but still alive
client to overwrite a post-blacklist update, resulting in data
corruption.
Waiting for latest osdmap in ceph_monc_blacklist_add() and thus using
the post-blacklist epoch for all post-blacklist requests ensures that
all such requests "wait" for the blacklist to come into force on their
respective OSDs.
Cc: stable@vger.kernel.org
Fixes: 6305a3b41515 ("libceph: support for blacklisting clients")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 152482580a1b0accb60676063a1ac57b2d12daf6 upstream.
kvm_arch_memslots_updated() is at this point in time an x86-specific
hook for handling MMIO generation wraparound. x86 stashes 19 bits of
the memslots generation number in its MMIO sptes in order to avoid
full page fault walks for repeat faults on emulated MMIO addresses.
Because only 19 bits are used, wrapping the MMIO generation number is
possible, if unlikely. kvm_arch_memslots_updated() alerts x86 that
the generation has changed so that it can invalidate all MMIO sptes in
case the effective MMIO generation has wrapped so as to avoid using a
stale spte, e.g. a (very) old spte that was created with generation==0.
Given that the purpose of kvm_arch_memslots_updated() is to prevent
consuming stale entries, it needs to be called before the new generation
is propagated to memslots. Invalidating the MMIO sptes after updating
memslots means that there is a window where a vCPU could dereference
the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
spte that was created with (pre-wrap) generation==0.
Fixes: e59dbe09f8e6 ("KVM: Introduce kvm_arch_memslots_updated()")
Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f76a16adc485699f95bb71fce114f97c832fe664 upstream.
The .orc_unwind section is a packed array of 6-byte structs. It's
currently aligned to 6 bytes, which is causing warnings in the LLD
linker.
Six isn't a power of two, so it's not a valid alignment value. The
actual alignment doesn't matter much because it's an array of packed
structs. An alignment of two is sufficient. In reality it always gets
aligned to four bytes because it comes immediately after the
4-byte-aligned .orc_unwind_ip section.
Fixes: ee9f8fce9964 ("x86/unwind: Add the ORC unwinder")
Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Reported-by: Dmitry Golovin <dima@golovin.in>
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://github.com/ClangBuiltLinux/linux/issues/218
Link: https://lkml.kernel.org/r/d55027ee95fe73e952dcd8be90aebd31b0095c45.1551892041.git.jpoimboe@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0bdb50c531f7377a9da80d3ce2d61f389c84cb30 upstream.
A dm-raid array with devices larger than 4GB won't assemble on
a 32 bit host since _check_data_dev_sectors() was added in 4.16.
This is because to_sector() treats its argument as an "unsigned long"
which is 32bits (4GB) on a 32bit host. Using "unsigned long long"
is more correct.
Kernels as early as 4.2 can have other problems due to to_sector()
being used on the size of a device.
Fixes: 0cf4503174c1 ("dm raid: add support for the MD RAID0 personality")
cc: stable@vger.kernel.org (v4.2+)
Reported-and-tested-by: Guillaume Perréal <gperreal@free.fr>
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|