| Age | Commit message (Collapse) | Author |
|
commit 63530aba7826a0f8e129874df9c4d264f9db3f9e upstream.
syzbot found that ax25 routes where not properly protected
against concurrent use [1].
In this particular report the bug happened while
copying ax25->digipeat.
Fix this problem by making sure we call ax25_get_route()
while ax25_route_lock is held, so that no modification
could happen while using the route.
The current two ax25_get_route() callers do not sleep,
so this change should be fine.
Once we do that, ax25_get_route() no longer needs to
grab a reference on the found route.
[1]
ax25_connect(): syz-executor0 uses autobind, please contact jreuter@yaina.de
BUG: KASAN: use-after-free in memcpy include/linux/string.h:352 [inline]
BUG: KASAN: use-after-free in kmemdup+0x42/0x60 mm/util.c:113
Read of size 66 at addr ffff888066641a80 by task syz-executor2/531
ax25_connect(): syz-executor0 uses autobind, please contact jreuter@yaina.de
CPU: 1 PID: 531 Comm: syz-executor2 Not tainted 5.0.0-rc2+ #10
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1db/0x2d0 lib/dump_stack.c:113
print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
check_memory_region_inline mm/kasan/generic.c:185 [inline]
check_memory_region+0x123/0x190 mm/kasan/generic.c:191
memcpy+0x24/0x50 mm/kasan/common.c:130
memcpy include/linux/string.h:352 [inline]
kmemdup+0x42/0x60 mm/util.c:113
kmemdup include/linux/string.h:425 [inline]
ax25_rt_autobind+0x25d/0x750 net/ax25/ax25_route.c:424
ax25_connect.cold+0x30/0xa4 net/ax25/af_ax25.c:1224
__sys_connect+0x357/0x490 net/socket.c:1664
__do_sys_connect net/socket.c:1675 [inline]
__se_sys_connect net/socket.c:1672 [inline]
__x64_sys_connect+0x73/0xb0 net/socket.c:1672
do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x458099
Code: 6d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f870ee22c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458099
RDX: 0000000000000048 RSI: 0000000020000080 RDI: 0000000000000005
RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
ax25_connect(): syz-executor4 uses autobind, please contact jreuter@yaina.de
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f870ee236d4
R13: 00000000004be48e R14: 00000000004ce9a8 R15: 00000000ffffffff
Allocated by task 526:
save_stack+0x45/0xd0 mm/kasan/common.c:73
set_track mm/kasan/common.c:85 [inline]
__kasan_kmalloc mm/kasan/common.c:496 [inline]
__kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:469
kasan_kmalloc+0x9/0x10 mm/kasan/common.c:504
ax25_connect(): syz-executor5 uses autobind, please contact jreuter@yaina.de
kmem_cache_alloc_trace+0x151/0x760 mm/slab.c:3609
kmalloc include/linux/slab.h:545 [inline]
ax25_rt_add net/ax25/ax25_route.c:95 [inline]
ax25_rt_ioctl+0x3b9/0x1270 net/ax25/ax25_route.c:233
ax25_ioctl+0x322/0x10b0 net/ax25/af_ax25.c:1763
sock_do_ioctl+0xe2/0x400 net/socket.c:950
sock_ioctl+0x32f/0x6c0 net/socket.c:1074
vfs_ioctl fs/ioctl.c:46 [inline]
file_ioctl fs/ioctl.c:509 [inline]
do_vfs_ioctl+0x107b/0x17d0 fs/ioctl.c:696
ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
__do_sys_ioctl fs/ioctl.c:720 [inline]
__se_sys_ioctl fs/ioctl.c:718 [inline]
__x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
ax25_connect(): syz-executor5 uses autobind, please contact jreuter@yaina.de
Freed by task 550:
save_stack+0x45/0xd0 mm/kasan/common.c:73
set_track mm/kasan/common.c:85 [inline]
__kasan_slab_free+0x102/0x150 mm/kasan/common.c:458
kasan_slab_free+0xe/0x10 mm/kasan/common.c:466
__cache_free mm/slab.c:3487 [inline]
kfree+0xcf/0x230 mm/slab.c:3806
ax25_rt_add net/ax25/ax25_route.c:92 [inline]
ax25_rt_ioctl+0x304/0x1270 net/ax25/ax25_route.c:233
ax25_ioctl+0x322/0x10b0 net/ax25/af_ax25.c:1763
sock_do_ioctl+0xe2/0x400 net/socket.c:950
sock_ioctl+0x32f/0x6c0 net/socket.c:1074
vfs_ioctl fs/ioctl.c:46 [inline]
file_ioctl fs/ioctl.c:509 [inline]
do_vfs_ioctl+0x107b/0x17d0 fs/ioctl.c:696
ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
__do_sys_ioctl fs/ioctl.c:720 [inline]
__se_sys_ioctl fs/ioctl.c:718 [inline]
__x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The buggy address belongs to the object at ffff888066641a80
which belongs to the cache kmalloc-96 of size 96
The buggy address is located 0 bytes inside of
96-byte region [ffff888066641a80, ffff888066641ae0)
The buggy address belongs to the page:
page:ffffea0001999040 count:1 mapcount:0 mapping:ffff88812c3f04c0 index:0x0
flags: 0x1fffc0000000200(slab)
ax25_connect(): syz-executor4 uses autobind, please contact jreuter@yaina.de
raw: 01fffc0000000200 ffffea0001817948 ffffea0002341dc8 ffff88812c3f04c0
raw: 0000000000000000 ffff888066641000 0000000100000020 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888066641980: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
ffff888066641a00: 00 00 00 00 00 00 00 00 02 fc fc fc fc fc fc fc
>ffff888066641a80: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
^
ffff888066641b00: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
ffff888066641b80: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Commit 582a32e708823e5957fd73ccd78dc4a9e49d21ea upstream.
This reverts commit eff896288872d687d9662000ec9ae11b6d61766f, which
deferred the processing of persistent memory reservations to a point
where the memory may have already been allocated and overwritten,
defeating the purpose.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20190215123333.21209-3-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
Commit 8a5b403d71affa098009cc3dff1b2c45113021ad upstream.
In the irqchip and EFI code, we have what basically amounts to a quirk
to work around a peculiarity in the GICv3 architecture, which permits
the system memory address of LPI tables to be programmable only once
after a CPU reset. This means kexec kernels must use the same memory
as the first kernel, and thus ensure that this memory has not been
given out for other purposes by the time the ITS init code runs, which
is not very early for secondary CPUs.
On systems with many CPUs, these reservations could overflow the
memblock reservation table, and this was addressed in commit:
eff896288872 ("efi/arm: Defer persistent reservations until after paging_init()")
However, this turns out to have made things worse, since the allocation
of page tables and heap space for the resized memblock reservation table
itself may overwrite the regions we are attempting to reserve, which may
cause all kinds of corruption, also considering that the ITS will still
be poking bits into that memory in response to incoming MSIs.
So instead, let's grow the static memblock reservation table on such
systems so it can accommodate these reservations at an earlier time.
This will permit us to revert the above commit in a subsequent patch.
[ mingo: Minor cleanups. ]
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20190215123333.21209-2-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[ ardb: Double the size of the slack to account for the lack of an
optimization that was introduced in mainline after the release
of v4.20. ]
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8681ef1f3d295bd3600315325f3b3396d76d02f6 ]
Fixes: 3b89ea9c5902 ("net: Fix for_each_netdev_feature on Big endian")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3b89ea9c5902acccdbbdec307c85edd1bf52515e ]
The features attribute is of type u64 and stored in the native endianes on
the system. The for_each_set_bit() macro takes a pointer to a 32 bit array
and goes over the bits in this area. On little Endian systems this also
works with an u64 as the most significant bit is on the highest address,
but on big endian the words are swapped. When we expect bit 15 here we get
bit 47 (15 + 32).
This patch converts it more or less to its own for_each_set_bit()
implementation which works on 64 bit integers directly. This is then
completely in host endianness and should work like expected.
Fixes: fd867d51f ("net/core: generic support for disabling netdev features down stack")
Signed-off-by: Hauke Mehrtens <hauke.mehrtens@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c09551c6ff7fe16a79a42133bcecba5fc2fc3291 ]
According to the algorithm described in the comment block at the
beginning of ip_rt_send_redirect, the host should try to send
'ip_rt_redirect_number' ICMP redirect packets with an exponential
backoff and then stop sending them at all assuming that the destination
ignores redirects.
If the device has previously sent some ICMP error packets that are
rate-limited (e.g TTL expired) and continues to receive traffic,
the redirect packets will never be transmitted. This happens since
peer->rate_tokens will be typically greater than 'ip_rt_redirect_number'
and so it will never be reset even if the redirect silence timeout
(ip_rt_redirect_silence) has elapsed without receiving any packet
requiring redirects.
Fix it by using a dedicated counter for the number of ICMP redirect
packets that has been sent by the host
I have not been able to identify a given commit that introduced the
issue since ip_rt_send_redirect implements the same rate-limiting
algorithm from commit 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit dcf6e2e38a1c7ccbc535de5e1d9b14998847499d upstream.
The kblockd workqueue is created with the WQ_MEM_RECLAIM flag set.
This generates a rescuer thread for that queue that will trigger when
the CPU is under heavy load and collect the uncompleted work.
In the case of mmc, this creates the possibility of a deadlock when
there are multiple partitions on the device as other blk-mq work is
also run on the same queue. For example:
- worker 0 claims the mmc host to work on partition 1
- worker 1 attempts to claim the host for partition 2 but has to wait
for worker 0 to finish
- worker 0 schedules complete_work to release the host
- rescuer thread is triggered after time-out and collects the dangling
work
- rescuer thread attempts to complete the work in order starting with
claim host
- the task to release host is now blocked by a task to claim it and
will never be called
The above results in multiple hung tasks that lead to failures to
mount partitions.
Handling complete_work on a separate workqueue avoids this by keeping
the work completion tasks separate from the other blk-mq work. This
allows the host to be released without getting blocked by other tasks
attempting to claim the host.
Signed-off-by: Zachary Hays <zhays@lexmark.com>
Fixes: 81196976ed94 ("mmc: block: Add blk-mq support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 81ec3f3c4c4d78f2d3b6689c9816bfbdf7417dbb upstream.
Vince (and later on Ravi) reported crashes in the BTS code during
fuzzing with the following backtrace:
general protection fault: 0000 [#1] SMP PTI
...
RIP: 0010:perf_prepare_sample+0x8f/0x510
...
Call Trace:
<IRQ>
? intel_pmu_drain_bts_buffer+0x194/0x230
intel_pmu_drain_bts_buffer+0x160/0x230
? tick_nohz_irq_exit+0x31/0x40
? smp_call_function_single_interrupt+0x48/0xe0
? call_function_single_interrupt+0xf/0x20
? call_function_single_interrupt+0xa/0x20
? x86_schedule_events+0x1a0/0x2f0
? x86_pmu_commit_txn+0xb4/0x100
? find_busiest_group+0x47/0x5d0
? perf_event_set_state.part.42+0x12/0x50
? perf_mux_hrtimer_restart+0x40/0xb0
intel_pmu_disable_event+0xae/0x100
? intel_pmu_disable_event+0xae/0x100
x86_pmu_stop+0x7a/0xb0
x86_pmu_del+0x57/0x120
event_sched_out.isra.101+0x83/0x180
group_sched_out.part.103+0x57/0xe0
ctx_sched_out+0x188/0x240
ctx_resched+0xa8/0xd0
__perf_event_enable+0x193/0x1e0
event_function+0x8e/0xc0
remote_function+0x41/0x50
flush_smp_call_function_queue+0x68/0x100
generic_smp_call_function_single_interrupt+0x13/0x30
smp_call_function_single_interrupt+0x3e/0xe0
call_function_single_interrupt+0xf/0x20
</IRQ>
The reason is that while event init code does several checks
for BTS events and prevents several unwanted config bits for
BTS event (like precise_ip), the PERF_EVENT_IOC_PERIOD allows
to create BTS event without those checks being done.
Following sequence will cause the crash:
If we create an 'almost' BTS event with precise_ip and callchains,
and it into a BTS event it will crash the perf_prepare_sample()
function because precise_ip events are expected to come
in with callchain data initialized, but that's not the
case for intel_pmu_drain_bts_buffer() caller.
Adding a check_period callback to be called before the period
is changed via PERF_EVENT_IOC_PERIOD. It will deny the change
if the event would become BTS. Plus adding also the limit_period
check as well.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20190204123532.GA4794@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b284909abad48b07d3071a9fc9b5692b3e64914b upstream.
With the following commit:
73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS")
... the hotplug code attempted to detect when SMT was disabled by BIOS,
in which case it reported SMT as permanently disabled. However, that
code broke a virt hotplug scenario, where the guest is booted with only
primary CPU threads, and a sibling is brought online later.
The problem is that there doesn't seem to be a way to reliably
distinguish between the HW "SMT disabled by BIOS" case and the virt
"sibling not yet brought online" case. So the above-mentioned commit
was a bit misguided, as it permanently disabled SMT for both cases,
preventing future virt sibling hotplugs.
Going back and reviewing the original problems which were attempted to
be solved by that commit, when SMT was disabled in BIOS:
1) /sys/devices/system/cpu/smt/control showed "on" instead of
"notsupported"; and
2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning.
I'd propose that we instead consider #1 above to not actually be a
problem. Because, at least in the virt case, it's possible that SMT
wasn't disabled by BIOS and a sibling thread could be brought online
later. So it makes sense to just always default the smt control to "on"
to allow for that possibility (assuming cpuid indicates that the CPU
supports SMT).
The real problem is #2, which has a simple fix: change vmx_vm_init() to
query the actual current SMT state -- i.e., whether any siblings are
currently online -- instead of looking at the SMT "control" sysfs value.
So fix it by:
a) reverting the original "fix" and its followup fix:
73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS")
bc2d8d262cba ("cpu/hotplug: Fix SMT supported evaluation")
and
b) changing vmx_vm_init() to query the actual current SMT state --
instead of the sysfs control value -- to determine whether the L1TF
warning is needed. This also requires the 'sched_smt_present'
variable to exported, instead of 'cpu_smt_control'.
Fixes: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS")
Reported-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joe Mario <jmario@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 13054abbaa4f1fd4e6f3b4b63439ec033b4c8035 upstream.
Ring buffer implementation in hid_debug_event() and hid_debug_events_read()
is strange allowing lost or corrupted data. After commit 717adfdaf147
("HID: debug: check length before copy_to_user()") it is possible to enter
an infinite loop in hid_debug_events_read() by providing 0 as count, this
locks up a system. Fix this by rewriting the ring buffer implementation
with kfifo and simplify the code.
This fixes CVE-2019-3819.
v2: fix an execution logic and add a comment
v3: use __set_current_state() instead of set_current_state()
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1669187
Cc: stable@vger.kernel.org # v4.18+
Fixes: cd667ce24796 ("HID: use debugfs for events/reports dumping")
Fixes: 717adfdaf147 ("HID: debug: check length before copy_to_user()")
Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 305a0ade180981686eec1f92aa6252a7c6ebb1cf upstream.
In the current code, the codec registration may happen both at the
codec bind time and the end of the controller probe time. In a rare
occasion, they race with each other, leading to Oops due to the still
uninitialized card device.
This patch introduces a simple flag to prevent the codec registration
at the codec bind time as long as the controller probe is going on.
The controller probe invokes snd_card_register() that does the whole
registration task, and we don't need to register each piece
beforehand.
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4f2ab5e1d13d6aa77c55f4914659784efd776eb4 upstream.
It is normal user behaviour to start, stop, then start a stream
again without closing it. Currently this works for compressed
playback streams but not capture ones.
The states on a compressed capture stream go directly from OPEN to
PREPARED, unlike a playback stream which moves to SETUP and waits
for a write of data before moving to PREPARED. Currently however,
when a stop is sent the state is set to SETUP for both types of
streams. This leaves a capture stream in the situation where a new
start can't be sent as that requires the state to be PREPARED and
a new set_params can't be sent as that requires the state to be
OPEN. The only option being to close the stream, and then reopen.
Correct this issues by allowing snd_compr_drain_notify to set the
state depending on the stream direction, as we already do in
set_params.
Fixes: 49bb6402f1aa ("ALSA: compress_core: Add support for capture streams")
Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 7a86dab8cf2f0fdf508f3555dddfc236623bff60 ]
Since the offset is added directly to the hva from the
gfn_to_hva_cache, a negative offset could result in an out of bounds
write. The existing BUG_ON only checks for addresses beyond the end of
the gfn_to_hva_cache, not for addresses before the start of the
gfn_to_hva_cache.
Note that all current call sites have non-negative offsets.
Fixes: 4ec6e8636256 ("kvm: Introduce kvm_write_guest_offset_cached()")
Reported-by: Cfir Cohen <cfir@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Cfir Cohen <cfir@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 552de91068828daef50a227a665068cf8dde835e ]
When a skb verdict program is in-use and either another BPF program
redirects to that socket or the new SK_PASS support is used the
data_ready callback does not wake up application. Instead because
the stream parser/verdict is using the sk data_ready callback we wake
up the stream parser/verdict block.
Fix this by adding a helper to check if the stream parser block is
enabled on the sk and if so call the saved pointer which is the
upper layers wake up function.
This fixes application stalls observed when an application is waiting
for data in a blocking read().
Fixes: d829e9c4112b ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a52c5a16cf19d8a85831bb1b915a221dd4ffae3c ]
There are several warnings from Clang about no case statement matching
the constant 0:
In file included from drivers/block/drbd/drbd_receiver.c:48:
In file included from drivers/block/drbd/drbd_int.h:48:
In file included from ./include/linux/drbd_genl_api.h:54:
In file included from ./include/linux/genl_magic_struct.h:236:
./include/linux/drbd_genl.h:321:1: warning: no case matching constant
switch condition '0'
GENL_struct(DRBD_NLA_HELPER, 24, drbd_helper_info,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./include/linux/genl_magic_struct.h:220:10: note: expanded from macro
'GENL_struct'
switch (0) {
^
Silence this warning by adding a 'case 0:' statement. Additionally,
adjust the alignment of the statements in the ct_assert_unique macro to
avoid a checkpatch warning.
This solution was originally sent by Arnd Bergmann with a default case
statement: https://lore.kernel.org/patchwork/patch/756723/
Link: https://github.com/ClangBuiltLinux/linux/issues/43
Suggested-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 01b671170d7f82b959dad6d5dbb44d7a915e647d ]
When the rdma device is getting removed, get resource info can race with
device removal, as below:
CPU-0 CPU-1
-------- --------
rdma_nl_rcv_msg()
nldev_res_get_cq_dumpit()
mutex_lock(device_lock);
get device reference
mutex_unlock(device_lock); [..]
ib_unregister_device()
/* Valid reference to
* device->dev exists.
*/
ib_dealloc_device()
[..]
provider->fill_res_entry();
Even though device object is not freed, fill_res_entry() can get called on
device which doesn't have a driver anymore. Kernel core device reference
count is not sufficient, as this only keeps the structure valid, and
doesn't guarantee the driver is still loaded.
Similar race can occur with device renaming and device removal, where
device_rename() tries to rename a unregistered device. While this is fine
for devices of a class which are not net namespace aware, but it is
incorrect for net namespace aware class coming in subsequent series. If a
class is net namespace aware, then the below [1] call trace is observed in
above situation.
Therefore, to avoid the race, keep a reference count and let device
unregistration wait until all netlink users drop the reference.
[1] Call trace:
kernfs: ns required in 'infiniband' for 'mlx5_0'
WARNING: CPU: 18 PID: 44270 at fs/kernfs/dir.c:842 kernfs_find_ns+0x104/0x120
libahci i2c_core mlxfw libata dca [last unloaded: devlink]
RIP: 0010:kernfs_find_ns+0x104/0x120
Call Trace:
kernfs_find_and_get_ns+0x2e/0x50
sysfs_rename_link_ns+0x40/0xb0
device_rename+0xb2/0xf0
ib_device_rename+0xb3/0x100 [ib_core]
nldev_set_doit+0x165/0x190 [ib_core]
rdma_nl_rcv_msg+0x249/0x250 [ib_core]
? netlink_deliver_tap+0x8f/0x3e0
rdma_nl_rcv+0xd6/0x120 [ib_core]
netlink_unicast+0x17c/0x230
netlink_sendmsg+0x2f0/0x3e0
sock_sendmsg+0x30/0x40
__sys_sendto+0xdc/0x160
Fixes: da5c85078215 ("RDMA/nldev: add driver-specific resource tracking")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1e86ace4c140fd5a693e266c9b23409358f25381 ]
Currently the cpu affinity hint mask for completion EQs is stored and
read from the wrong place, since reading and storing is done from the
same index, there is no actual issue with that, but internal irq_info
for completion EQs stars at MLX5_EQ_VEC_COMP_BASE offset in irq_info
array, this patch changes the code to use the correct offset to store
and read the IRQ affinity hint.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 18534df419041e6c1f4b41af56ee7d41f757815c ]
gpiod_request_commit() copies the pointer to the label passed as
an argument only to be used later. But there's a chance the caller
could immediately free the passed string(e.g., local variable).
This could trigger a use after free when we use gpio label(e.g.,
gpiochip_unlock_as_irq(), gpiochip_is_requested()).
To be on the safe side: duplicate the string with kstrdup_const()
so that if an unaware user passes an address to a stack-allocated
buffer, we won't get the arbitrary label.
Also fix gpiod_set_consumer_name().
Signed-off-by: Muchun Song <smuchun@gmail.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 144552c786925314c1e7cb8f91a71dae1aca8798 upstream.
Add checks:
- attempted kfree due to refcount reaching zero before overlay
is removed
- properties linked to an overlay node when the node is removed
- node refcount > one during node removal in a changeset destroy,
if the node was created by the changeset
After applying this patch, several validation warnings will be
reported from the devicetree unittest during boot due to
pre-existing devicetree bugs. The warnings will be similar to:
OF: ERROR: of_node_release(), unexpected properties in /testcase-data/overlay-node/test-bus/test-unittest11
OF: ERROR: memory leak, expected refcount 1 instead of 2, of_node_get()/of_node_put() unbalanced - destroy cset entry: attach overlay node /testcase-data-2/substation@100/
hvac-medium-2
Tested-by: Alan Tull <atull@kernel.org>
Signed-off-by: Frank Rowand <frank.rowand@sony.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9bcdeb51bd7d2ae9fe65ea4d60643d2aeef5bfe3 upstream.
Arkadiusz reported that enabling memcg's group oom killing causes
strange memcg statistics where there is no task in a memcg despite the
number of tasks in that memcg is not 0. It turned out that there is a
bug in wake_oom_reaper() which allows enqueuing same task twice which
makes impossible to decrease the number of tasks in that memcg due to a
refcount leak.
This bug existed since the OOM reaper became invokable from
task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
T1@P1 |T2@P1 |T3@P1 |OOM reaper
----------+----------+----------+------------
# Processing an OOM victim in a different memcg domain.
try_charge()
mem_cgroup_out_of_memory()
mutex_lock(&oom_lock)
try_charge()
mem_cgroup_out_of_memory()
mutex_lock(&oom_lock)
try_charge()
mem_cgroup_out_of_memory()
mutex_lock(&oom_lock)
out_of_memory()
oom_kill_process(P1)
do_send_sig_info(SIGKILL, @P1)
mark_oom_victim(T1@P1)
wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
mutex_unlock(&oom_lock)
out_of_memory()
mark_oom_victim(T2@P1)
wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
mutex_unlock(&oom_lock)
out_of_memory()
mark_oom_victim(T1@P1)
wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
mutex_unlock(&oom_lock)
# Completed processing an OOM victim in a different memcg domain.
spin_lock(&oom_reaper_lock)
# T1P1 is dequeued.
spin_unlock(&oom_reaper_lock)
but memcg's group oom killing made it easier to trigger this bug by
calling wake_oom_reaper() on the same task from one out_of_memory()
request.
Fix this bug using an approach used by commit 855b018325737f76 ("oom,
oom_reaper: disable oom_reaper for oom_kill_allocating_task"). As a
side effect of this patch, this patch also avoids enqueuing multiple
threads sharing memory via task_will_free_mem(current) path.
Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Aleksa Sarai <asarai@suse.de>
Cc: Jay Kamat <jgkamat@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 32eb67b93c9e3cd62cb423e30b090cdd4aa8d275 ]
aead_request_set_crypt takes an iv pointer, and we change the iv
soon after setting it. Some async crypto algorithms don't save the iv,
so we need to save it in the tls_rec for async requests.
Found by hardcoding x64 aesni to use async crypto manager (to test the async
codepath), however I don't think this combination can happen in the wild.
Presumably other hardware offloads will need this fix, but there have been
no user reports.
Fixes: a42055e8d2c30 ("Add support for async encryption of records...")
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d5256083f62e2720f75bb3c5a928a0afe47d6bc3 ]
While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
I ran into the issue that while l3 mode is working fine, l3s mode
does not have any connectivity to kube-apiserver and hence all pods
end up in Error state as well. The ipvlan master device sits on
top of a bond device and hostns traffic to kube-apiserver (also running
in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
where the latter is the address of the bond0. While in l3 mode, a
curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
works fine from hostns, neither of them do in case of l3s. In the
latter only a curl to https://127.0.0.1:37573 appeared to work where
for local addresses of bond0 I saw kernel suddenly starting to emit
ARP requests to query HW address of bond0 which remained unanswered
and neighbor entries in INCOMPLETE state. These ARP requests only
happen while in l3s.
Debugging this further, I found the issue is that l3s mode is piggy-
backing on l3 master device, and in this case local routes are using
l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev
if relevant") and 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be
a loopback"). I found that reverting them back into using the
net->loopback_dev fixed ipvlan l3s connectivity and got everything
working for the CNI.
Now judging from 4fbae7d83c98 ("ipvlan: Introduce l3s mode") and the
l3mdev paper in [0] the only sole reason why ipvlan l3s is relying
on l3 master device is to get the l3mdev_ip_rcv() receive hook for
setting the dst entry of the input route without adding its own
ipvlan specific hacks into the receive path, however, any l3 domain
semantics beyond just that are breaking l3s operation. Note that
ipvlan also has the ability to dynamically switch its internal
operation from l3 to l3s for all ports via ipvlan_set_port_mode()
at runtime. In any case, l3 vs l3s soley distinguishes itself by
'de-confusing' netfilter through switching skb->dev to ipvlan slave
device late in NF_INET_LOCAL_IN before handing the skb to L4.
Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which,
if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
without any additional l3mdev semantics on top. This should also have
minimal impact since dev->priv_flags is already hot in cache. With
this set, l3s mode is working fine and I also get things like
masquerading pod traffic on the ipvlan master properly working.
[0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf
Fixes: f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
Fixes: 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback")
Fixes: 4fbae7d83c98 ("ipvlan: Introduce l3s mode")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: David Ahern <dsa@cumulusnetworks.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Martynas Pumputis <m@lambda.lt>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 141e5dcaa7356077028b4cd48ec351a38c70e5e5 upstream.
Arnd Bergmann pointed out that CONFIG_* cannot be used in a uapi header.
Override with an equivalent conditional.
Fixes: 2e746942ebac ("Input: input_event - provide override for sparc64")
Fixes: 152194fe9c3f ("Input: extend usable life of event timestamps to 2106 on 32 bit systems")
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 4d3c5c69191f98c7f7e699ff08d2fd96d7070ddb ]
Commit d86adf482b84 ("scsi: storvsc: Enable multi-queue support") removed
the usage of the API in Jan 2017, and the API is not used since then.
netvsc and storvsc have their own algorithms to determine the outgoing
channel, so this API is useless.
And the API is potentially unsafe, because it reads primary->num_sc without
any lock held. This can be risky considering the RESCIND-OFFER message.
Let's remove the API.
Cc: Long Li <longli@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ commit d3bd7413e0ca40b60cf60d4003246d067cafdeda upstream ]
While 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer
arithmetic") took care of rejecting alu op on pointer when e.g. pointer
came from two different map values with different map properties such as
value size, Jann reported that a case was not covered yet when a given
alu op is used in both "ptr_reg += reg" and "numeric_reg += reg" from
different branches where we would incorrectly try to sanitize based
on the pointer's limit. Catch this corner case and reject the program
instead.
Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ commit 979d63d50c0c0f7bc537bf821e056cc9fe5abd38 upstream ]
Jann reported that the original commit back in b2157399cc98
("bpf: prevent out-of-bounds speculation") was not sufficient
to stop CPU from speculating out of bounds memory access:
While b2157399cc98 only focussed on masking array map access
for unprivileged users for tail calls and data access such
that the user provided index gets sanitized from BPF program
and syscall side, there is still a more generic form affected
from BPF programs that applies to most maps that hold user
data in relation to dynamic map access when dealing with
unknown scalars or "slow" known scalars as access offset, for
example:
- Load a map value pointer into R6
- Load an index into R7
- Do a slow computation (e.g. with a memory dependency) that
loads a limit into R8 (e.g. load the limit from a map for
high latency, then mask it to make the verifier happy)
- Exit if R7 >= R8 (mispredicted branch)
- Load R0 = R6[R7]
- Load R0 = R6[R0]
For unknown scalars there are two options in the BPF verifier
where we could derive knowledge from in order to guarantee
safe access to the memory: i) While </>/<=/>= variants won't
allow to derive any lower or upper bounds from the unknown
scalar where it would be safe to add it to the map value
pointer, it is possible through ==/!= test however. ii) another
option is to transform the unknown scalar into a known scalar,
for example, through ALU ops combination such as R &= <imm>
followed by R |= <imm> or any similar combination where the
original information from the unknown scalar would be destroyed
entirely leaving R with a constant. The initial slow load still
precedes the latter ALU ops on that register, so the CPU
executes speculatively from that point. Once we have the known
scalar, any compare operation would work then. A third option
only involving registers with known scalars could be crafted
as described in [0] where a CPU port (e.g. Slow Int unit)
would be filled with many dependent computations such that
the subsequent condition depending on its outcome has to wait
for evaluation on its execution port and thereby executing
speculatively if the speculated code can be scheduled on a
different execution port, or any other form of mistraining
as described in [1], for example. Given this is not limited
to only unknown scalars, not only map but also stack access
is affected since both is accessible for unprivileged users
and could potentially be used for out of bounds access under
speculation.
In order to prevent any of these cases, the verifier is now
sanitizing pointer arithmetic on the offset such that any
out of bounds speculation would be masked in a way where the
pointer arithmetic result in the destination register will
stay unchanged, meaning offset masked into zero similar as
in array_index_nospec() case. With regards to implementation,
there are three options that were considered: i) new insn
for sanitation, ii) push/pop insn and sanitation as inlined
BPF, iii) reuse of ax register and sanitation as inlined BPF.
Option i) has the downside that we end up using from reserved
bits in the opcode space, but also that we would require
each JIT to emit masking as native arch opcodes meaning
mitigation would have slow adoption till everyone implements
it eventually which is counter-productive. Option ii) and iii)
have both in common that a temporary register is needed in
order to implement the sanitation as inlined BPF since we
are not allowed to modify the source register. While a push /
pop insn in ii) would be useful to have in any case, it
requires once again that every JIT needs to implement it
first. While possible, amount of changes needed would also
be unsuitable for a -stable patch. Therefore, the path which
has fewer changes, less BPF instructions for the mitigation
and does not require anything to be changed in the JITs is
option iii) which this work is pursuing. The ax register is
already mapped to a register in all JITs (modulo arm32 where
it's mapped to stack as various other BPF registers there)
and used in constant blinding for JITs-only so far. It can
be reused for verifier rewrites under certain constraints.
The interpreter's tmp "register" has therefore been remapped
into extending the register set with hidden ax register and
reusing that for a number of instructions that needed the
prior temporary variable internally (e.g. div, mod). This
allows for zero increase in stack space usage in the interpreter,
and enables (restricted) generic use in rewrites otherwise as
long as such a patchlet does not make use of these instructions.
The sanitation mask is dynamic and relative to the offset the
map value or stack pointer currently holds.
There are various cases that need to be taken under consideration
for the masking, e.g. such operation could look as follows:
ptr += val or val += ptr or ptr -= val. Thus, the value to be
sanitized could reside either in source or in destination
register, and the limit is different depending on whether
the ALU op is addition or subtraction and depending on the
current known and bounded offset. The limit is derived as
follows: limit := max_value_size - (smin_value + off). For
subtraction: limit := umax_value + off. This holds because
we do not allow any pointer arithmetic that would
temporarily go out of bounds or would have an unknown
value with mixed signed bounds where it is unclear at
verification time whether the actual runtime value would
be either negative or positive. For example, we have a
derived map pointer value with constant offset and bounded
one, so limit based on smin_value works because the verifier
requires that statically analyzed arithmetic on the pointer
must be in bounds, and thus it checks if resulting
smin_value + off and umax_value + off is still within map
value bounds at time of arithmetic in addition to time of
access. Similarly, for the case of stack access we derive
the limit as follows: MAX_BPF_STACK + off for subtraction
and -off for the case of addition where off := ptr_reg->off +
ptr_reg->var_off.value. Subtraction is a special case for
the masking which can be in form of ptr += -val, ptr -= -val,
or ptr -= val. In the first two cases where we know that
the value is negative, we need to temporarily negate the
value in order to do the sanitation on a positive value
where we later swap the ALU op, and restore original source
register if the value was in source.
The sanitation of pointer arithmetic alone is still not fully
sufficient as is, since a scenario like the following could
happen ...
PTR += 0x1000 (e.g. K-based imm)
PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
PTR += 0x1000
PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
[...]
... which under speculation could end up as ...
PTR += 0x1000
PTR -= 0 [ truncated by mitigation ]
PTR += 0x1000
PTR -= 0 [ truncated by mitigation ]
[...]
... and therefore still access out of bounds. To prevent such
case, the verifier is also analyzing safety for potential out
of bounds access under speculative execution. Meaning, it is
also simulating pointer access under truncation. We therefore
"branch off" and push the current verification state after the
ALU operation with known 0 to the verification stack for later
analysis. Given the current path analysis succeeded it is
likely that the one under speculation can be pruned. In any
case, it is also subject to existing complexity limits and
therefore anything beyond this point will be rejected. In
terms of pruning, it needs to be ensured that the verification
state from speculative execution simulation must never prune
a non-speculative execution path, therefore, we mark verifier
state accordingly at the time of push_stack(). If verifier
detects out of bounds access under speculative execution from
one of the possible paths that includes a truncation, it will
reject such program.
Given we mask every reg-based pointer arithmetic for
unprivileged programs, we've been looking into how it could
affect real-world programs in terms of size increase. As the
majority of programs are targeted for privileged-only use
case, we've unconditionally enabled masking (with its alu
restrictions on top of it) for privileged programs for the
sake of testing in order to check i) whether they get rejected
in its current form, and ii) by how much the number of
instructions and size will increase. We've tested this by
using Katran, Cilium and test_l4lb from the kernel selftests.
For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o
and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb
we've used test_l4lb.o as well as test_l4lb_noinline.o. We
found that none of the programs got rejected by the verifier
with this change, and that impact is rather minimal to none.
balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
7,797 bytes JITed before and after the change. Most complex
program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
and 18,538 bytes JITed before and after and none of the other
tail call programs in bpf_lxc.o had any changes either. For
the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
after the change. Other programs from that object file had
similar small increase. Both test_l4lb.o had no change and
remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
JITed and for test_l4lb_noinline.o constant at 5,080 bytes
(634 insns) xlated and 3,313 bytes JITed. This can be explained
in that LLVM typically optimizes stack based pointer arithmetic
by using K-based operations and that use of dynamic map access
is not overly frequent. However, in future we may decide to
optimize the algorithm further under known guarantees from
branch and value speculation. Latter seems also unclear in
terms of prediction heuristics that today's CPUs apply as well
as whether there could be collisions in e.g. the predictor's
Value History/Pattern Table for triggering out of bounds access,
thus masking is performed unconditionally at this point but could
be subject to relaxation later on. We were generally also
brainstorming various other approaches for mitigation, but the
blocker was always lack of available registers at runtime and/or
overhead for runtime tracking of limits belonging to a specific
pointer. Thus, we found this to be minimally intrusive under
given constraints.
With that in place, a simple example with sanitized access on
unprivileged load at post-verification time looks as follows:
# bpftool prog dump xlated id 282
[...]
28: (79) r1 = *(u64 *)(r7 +0)
29: (79) r2 = *(u64 *)(r7 +8)
30: (57) r1 &= 15
31: (79) r3 = *(u64 *)(r0 +4608)
32: (57) r3 &= 1
33: (47) r3 |= 1
34: (2d) if r2 > r3 goto pc+19
35: (b4) (u32) r11 = (u32) 20479 |
36: (1f) r11 -= r2 | Dynamic sanitation for pointer
37: (4f) r11 |= r2 | arithmetic with registers
38: (87) r11 = -r11 | containing bounded or known
39: (c7) r11 s>>= 63 | scalars in order to prevent
40: (5f) r11 &= r2 | out of bounds speculation.
41: (0f) r4 += r11 |
42: (71) r4 = *(u8 *)(r4 +0)
43: (6f) r4 <<= r1
[...]
For the case where the scalar sits in the destination register
as opposed to the source register, the following code is emitted
for the above example:
[...]
16: (b4) (u32) r11 = (u32) 20479
17: (1f) r11 -= r2
18: (4f) r11 |= r2
19: (87) r11 = -r11
20: (c7) r11 s>>= 63
21: (5f) r2 &= r11
22: (0f) r2 += r0
23: (61) r0 = *(u32 *)(r2 +0)
[...]
JIT blinding example with non-conflicting use of r10:
[...]
d5: je 0x0000000000000106 _
d7: mov 0x0(%rax),%edi |
da: mov $0xf153246,%r10d | Index load from map value and
e0: xor $0xf153259,%r10 | (const blinded) mask with 0x1f.
e7: and %r10,%rdi |_
ea: mov $0x2f,%r10d |
f0: sub %rdi,%r10 | Sanitized addition. Both use r10
f3: or %rdi,%r10 | but do not interfere with each
f6: neg %r10 | other. (Neither do these instructions
f9: sar $0x3f,%r10 | interfere with the use of ax as temp
fd: and %r10,%rdi | in interpreter.)
100: add %rax,%rdi |_
103: mov 0x0(%rdi),%eax
[...]
Tested that it fixes Jann's reproducer, and also checked that test_verifier
and test_progs suite with interpreter, JIT and JIT with hardening enabled
on x86-64 and arm64 runs successfully.
[0] Speculose: Analyzing the Security Implications of Speculative
Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
https://arxiv.org/pdf/1801.04084.pdf
[1] A Systematic Evaluation of Transient Execution Attacks and
Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
Dmitry Evtyushkin, Daniel Gruss,
https://arxiv.org/pdf/1811.05441.pdf
Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ commit 9b73bfdd08e73231d6a90ae6db4b46b3fbf56c30 upstream ]
Right now we are using BPF ax register in JIT for constant blinding as
well as in interpreter as temporary variable. Verifier will not be able
to use it simply because its use will get overridden from the former in
bpf_jit_blind_insn(). However, it can be made to work in that blinding
will be skipped if there is prior use in either source or destination
register on the instruction. Taking constraints of ax into account, the
verifier is then open to use it in rewrites under some constraints. Note,
ax register already has mappings in every eBPF JIT.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ commit 144cd91c4c2bced6eb8a7e25e590f6618a11e854 upstream ]
This change moves the on-stack 64 bit tmp variable in ___bpf_prog_run()
into the hidden ax register. The latter is currently only used in JITs
for constant blinding as a temporary scratch register, meaning the BPF
interpreter will never see the use of ax. Therefore it is safe to use
it for the cases where tmp has been used earlier. This is needed to later
on allow restricted hidden use of ax in both interpreter and JITs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ commit c08435ec7f2bc8f4109401f696fd55159b4b40cb upstream ]
Move prev_insn_idx and insn_idx from the do_check() function into
the verifier environment, so they can be read inside the various
helper functions for handling the instructions. It's easier to put
this into the environment rather than changing all call-sites only
to pass it along. insn_idx is useful in particular since this later
on allows to hold state in env->insn_aux_data[env->insn_idx].
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 2e746942ebacf1565caa72cf980745e5ce297c48 upstream.
The usec part of the timeval is defined as
__kernel_suseconds_t tv_usec; /* microseconds */
Arnd noticed that sparc64 is the only architecture that defines
__kernel_suseconds_t as int rather than long.
This breaks the current y2038 fix for kernel as we only access and define
the timeval struct for non-kernel use cases. But, this was hidden by an
another typo in the use of __KERNEL__ qualifier.
Fix the typo, and provide an override for sparc64.
Fixes: 152194fe9c3f ("Input: extend usable life of event timestamps to 2106 on 32 bit systems")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ba50bf1ce9a51fc97db58b96d01306aa70bc3979 upstream.
fc96df16a1ce is good and can already fix the "return stack garbage" issue,
but let's also improve hv_ringbuffer_get_debuginfo(), which would silently
return stack garbage, if people forget to check channel->state or
ring_info->ring_buffer, when using the function in the future.
Having an error check in the function would eliminate the potential risk.
Add a Fixes tag to indicate the patch depdendency.
Fixes: fc96df16a1ce ("Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels")
Cc: stable@vger.kernel.org
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit f97f4dd8b3bb9d0993d2491e0f22024c68109184 ]
IPv4 routing tables are flushed in two cases:
1. In response to events in the netdev and inetaddr notification chains
2. When a network namespace is being dismantled
In both cases only routes associated with a dead nexthop group are
flushed. However, a nexthop group will only be marked as dead in case it
is populated with actual nexthops using a nexthop device. This is not
the case when the route in question is an error route (e.g.,
'blackhole', 'unreachable').
Therefore, when a network namespace is being dismantled such routes are
not flushed and leaked [1].
To reproduce:
# ip netns add blue
# ip -n blue route add unreachable 192.0.2.0/24
# ip netns del blue
Fix this by not skipping error routes that are not marked with
RTNH_F_DEAD when flushing the routing tables.
To prevent the flushing of such routes in case #1, add a parameter to
fib_table_flush() that indicates if the table is flushed as part of
namespace dismantle or not.
Note that this problem does not exist in IPv6 since error routes are
associated with the loopback device.
[1]
unreferenced object 0xffff888066650338 (size 56):
comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 b0 1c 62 61 80 88 ff ff ..........ba....
e8 8b a1 64 80 88 ff ff 00 07 00 08 fe 00 00 00 ...d............
backtrace:
[<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
[<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
[<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
[<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
[<0000000014f62875>] netlink_sendmsg+0x929/0xe10
[<00000000bac9d967>] sock_sendmsg+0xc8/0x110
[<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
[<000000002e94f880>] __sys_sendmsg+0xf7/0x250
[<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
[<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[<000000003a8b605b>] 0xffffffffffffffff
unreferenced object 0xffff888061621c88 (size 48):
comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
hex dump (first 32 bytes):
6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
6b 6b 6b 6b 6b 6b 6b 6b d8 8e 26 5f 80 88 ff ff kkkkkkkk..&_....
backtrace:
[<00000000733609e3>] fib_table_insert+0x978/0x1500
[<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
[<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
[<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
[<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
[<0000000014f62875>] netlink_sendmsg+0x929/0xe10
[<00000000bac9d967>] sock_sendmsg+0xc8/0x110
[<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
[<000000002e94f880>] __sys_sendmsg+0xf7/0x250
[<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
[<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[<000000003a8b605b>] 0xffffffffffffffff
Fixes: 8cced9eff1d4 ("[NETNS]: Enable routing configuration in non-initial namespace.")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 3e64cf7a435ed0500e3adaa8aada2272d3ae8abc ]
Since phy driver features became a link_mode bitmap, phy drivers that
don't have a list of features configured will cause the kernel to crash
when probed.
Prevent the phy driver from registering if the features field is missing.
Fixes: 719655a14971 ("net: phy: Replace phy driver features u32 with link_mode bitmap")
Reported-by: Scott Wood <oss@buserror.net>
Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 6c57f0458022298e4da1729c67bd33ce41c14e7a ]
In certain cases, pskb_trim_rcsum() may change skb pointers.
Reinitialize header pointers afterwards to avoid potential
use-after-frees. Add a note in the documentation of
pskb_trim_rcsum(). Found by KASAN.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit fed84c78527009d4f799a3ed9a566502fa026d82 ]
Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and
Huawei TaiShan 2280 aarch64 servers).
After calling start_kernel()->setup_arch()->kasan_init(), kmemleak early
log buffer went from something like 280 to 260000 which caused kmemleak
disabled and crash dump memory reservation failed. The multitude of
kmemleak_alloc() calls is from nested loops while KASAN is setting up full
memory mappings, so let early kmemleak allocations skip those
memblock_alloc_internal() calls came from kasan_init() given that those
early KASAN memory mappings should not reference to other memory. Hence,
no kmemleak false positives.
kasan_init
kasan_map_populate [1]
kasan_pgd_populate [2]
kasan_pud_populate [3]
kasan_pmd_populate [4]
kasan_pte_populate [5]
kasan_alloc_zeroed_page
memblock_alloc_try_nid
memblock_alloc_internal
kmemleak_alloc
[1] for_each_memblock(memory, reg)
[2] while (pgdp++, addr = next, addr != end)
[3] while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)))
[4] while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)))
[5] while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)))
Link: http://lkml.kernel.org/r/1543442925-17794-1-git-send-email-cai@gmx.us
Signed-off-by: Qian Cai <cai@gmx.us>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 66f71da9dd38af17dc17209cdde7987d4679a699 ]
Since a2468cc9bfdf ("swap: choose swap device according to numa node"),
avail_lists field of swap_info_struct is changed to an array with
MAX_NUMNODES elements. This made swap_info_struct size increased to 40KiB
and needs an order-4 page to hold it.
This is not optimal in that:
1 Most systems have way less than MAX_NUMNODES(1024) nodes so it
is a waste of memory;
2 It could cause swapon failure if the swap device is swapped on
after system has been running for a while, due to no order-4
page is available as pointed out by Vasily Averin.
Solve the above two issues by using nr_node_ids(which is the actual
possible node number the running system has) for avail_lists instead of
MAX_NUMNODES.
nr_node_ids is unknown at compile time so can't be directly used when
declaring this array. What I did here is to declare avail_lists as zero
element array and allocate space for it when allocating space for
swap_info_struct. The reason why keep using array but not pointer is
plist_for_each_entry needs the field to be part of the struct, so pointer
will not work.
This patch is on top of Vasily Averin's fix commit. I think the use of
kvzalloc for swap_info_struct is still needed in case nr_node_ids is
really big on some systems.
Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ad669505c4e9db9af9faeb5c51aa399326a80d91 ]
A session must only be released after all code that accesses the session
structure has finished. Make sure that this is the case by introducing a
new command counter per session that is only decremented after the
.release_cmd() callback has finished. This patch fixes the following crash:
BUG: KASAN: use-after-free in do_raw_spin_lock+0x1c/0x130
Read of size 4 at addr ffff8801534b16e4 by task rmdir/14805
CPU: 16 PID: 14805 Comm: rmdir Not tainted 4.18.0-rc2-dbg+ #5
Call Trace:
dump_stack+0xa4/0xf5
print_address_description+0x6f/0x270
kasan_report+0x241/0x360
__asan_load4+0x78/0x80
do_raw_spin_lock+0x1c/0x130
_raw_spin_lock_irqsave+0x52/0x60
srpt_set_ch_state+0x27/0x70 [ib_srpt]
srpt_disconnect_ch+0x1b/0xc0 [ib_srpt]
srpt_close_session+0xa8/0x260 [ib_srpt]
target_shutdown_sessions+0x170/0x180 [target_core_mod]
core_tpg_del_initiator_node_acl+0xf3/0x200 [target_core_mod]
target_fabric_nacl_base_release+0x25/0x30 [target_core_mod]
config_item_release+0x9c/0x110 [configfs]
config_item_put+0x26/0x30 [configfs]
configfs_rmdir+0x3b8/0x510 [configfs]
vfs_rmdir+0xb3/0x1e0
do_rmdir+0x262/0x2c0
do_syscall_64+0x77/0x230
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Disseldorp <ddiss@suse.de>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 46f53a65d2de3e1591636c22b626b09d8684fd71 ]
Currently BPF verifier allows narrow loads for a context field only with
offset zero. E.g. if there is a __u32 field then only the following
loads are permitted:
* off=0, size=1 (narrow);
* off=0, size=2 (narrow);
* off=0, size=4 (full).
On the other hand LLVM can generate a load with offset different than
zero that make sense from program logic point of view, but verifier
doesn't accept it.
E.g. tools/testing/selftests/bpf/sendmsg4_prog.c has code:
#define DST_IP4 0xC0A801FEU /* 192.168.1.254 */
...
if ((ctx->user_ip4 >> 24) == (bpf_htonl(DST_IP4) >> 24) &&
where ctx is struct bpf_sock_addr.
Some versions of LLVM can produce the following byte code for it:
8: 71 12 07 00 00 00 00 00 r2 = *(u8 *)(r1 + 7)
9: 67 02 00 00 18 00 00 00 r2 <<= 24
10: 18 03 00 00 00 00 00 fe 00 00 00 00 00 00 00 00 r3 = 4261412864 ll
12: 5d 32 07 00 00 00 00 00 if r2 != r3 goto +7 <LBB0_6>
where `*(u8 *)(r1 + 7)` means narrow load for ctx->user_ip4 with size=1
and offset=3 (7 - sizeof(ctx->user_family) = 3). This load is currently
rejected by verifier.
Verifier code that rejects such loads is in bpf_ctx_narrow_access_ok()
what means any is_valid_access implementation, that uses the function,
works this way, e.g. bpf_skb_is_valid_access() for __sk_buff or
sock_addr_is_valid_access() for bpf_sock_addr.
The patch makes such loads supported. Offset can be in [0; size_default)
but has to be multiple of load size. E.g. for __u32 field the following
loads are supported now:
* off=0, size=1 (narrow);
* off=1, size=1 (narrow);
* off=2, size=1 (narrow);
* off=3, size=1 (narrow);
* off=0, size=2 (narrow);
* off=2, size=2 (narrow);
* off=0, size=4 (full).
Reported-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 347a28b586802d09604a149c1a1f6de5dccbe6fa ]
This happened while running in qemu-system-aarch64, the AMBA PL011 UART
driver when enabling CONFIG_DEBUG_TEST_DRIVER_REMOVE.
arch_initcall(pl011_init) came before subsys_initcall(default_bdi_init),
devtmpfs' handle_remove() crashes because the reference count is a NULL
pointer only because wb->bdi hasn't been initialized yet.
Rework so that wb_put have an extra check if wb->bdi before decrement
wb->refcnt and also add a WARN_ON_ONCE to get a warning if it happens again
in other drivers.
Fixes: 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks")
Co-developed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 23b5f73266e59a598c1e5dd435d87651b5a7626b ]
During HARD_RESET the data link is disconnected.
For self powered device, the spec is advising against doing that.
>From USB_PD_R3_0
7.1.5 Response to Hard Resets
Device operation during and after a Hard Reset is defined as follows:
Self-powered devices Should Not disconnect from USB during a Hard Reset
(see Section 9.1.2).
Bus powered devices will disconnect from USB during a Hard Reset due to the
loss of their power source.
Tackle this by letting TCPM know whether the device is self or bus powered.
This overcomes unnecessary port disconnections from hard reset.
Also, speeds up the enumeration time when connected to Type-A ports.
Signed-off-by: Badhri Jagan Sridharan <badhri@google.com>
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
---------
Version history:
V3:
Rebase on top of usb-next
V2:
Based on feedback from heikki.krogerus@linux.intel.com
- self_powered added to the struct tcpm_port which is populated from
a. "connector" node of the device tree in tcpm_fw_get_caps()
b. "self_powered" node of the tcpc_config in tcpm_copy_caps
Based on feedbase from linux@roeck-us.net
- Code was refactored
- SRC_HARD_RESET_VBUS_OFF sets the link state to false based
on self_powered flag
V1 located here:
https://lkml.org/lkml/2018/9/13/94
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 576ce4075bfa0f03e0e91a89eecc539b3b828b08 ]
gcc notices that without either the ac97 bus or the pdata, we never
initialize the regmap pointer, which leads to an uninitialized variable
access:
sound/soc/codecs/wm9712.c: In function 'wm9712_soc_probe':
sound/soc/codecs/wm9712.c:666:2: error: 'regmap' may be used uninitialized in this function [-Werror=maybe-uninitialized]
Since that configuration is invalid, it's better to return an error
here. I tried to avoid adding complexity to the conditions, and turned
the #ifdef into a regular if(IS_ENABLED()) check for readability.
This in turn requires moving some header file declarations out of
an #ifdef.
The same code is used in three drivers, all of which I'm changing
the same way.
Fixes: 2ed1a8e0ce8d ("ASoC: wm9712: add ac97 new bus support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 94a2c3a32b62e868dc1e3d854326745a7f1b8c7a upstream.
We recently got a stack by syzkaller like this:
BUG: sleeping function called from invalid context at mm/slab.h:361
in_atomic(): 1, irqs_disabled(): 0, pid: 6644, name: blkid
INFO: lockdep is turned off.
CPU: 1 PID: 6644 Comm: blkid Not tainted 4.4.163-514.55.6.9.x86_64+ #76
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
0000000000000000 5ba6a6b879e50c00 ffff8801f6b07b10 ffffffff81cb2194
0000000041b58ab3 ffffffff833c7745 ffffffff81cb2080 5ba6a6b879e50c00
0000000000000000 0000000000000001 0000000000000004 0000000000000000
Call Trace:
<IRQ> [<ffffffff81cb2194>] __dump_stack lib/dump_stack.c:15 [inline]
<IRQ> [<ffffffff81cb2194>] dump_stack+0x114/0x1a0 lib/dump_stack.c:51
[<ffffffff8129a981>] ___might_sleep+0x291/0x490 kernel/sched/core.c:7675
[<ffffffff8129ac33>] __might_sleep+0xb3/0x270 kernel/sched/core.c:7637
[<ffffffff81794c13>] slab_pre_alloc_hook mm/slab.h:361 [inline]
[<ffffffff81794c13>] slab_alloc_node mm/slub.c:2610 [inline]
[<ffffffff81794c13>] slab_alloc mm/slub.c:2692 [inline]
[<ffffffff81794c13>] kmem_cache_alloc_trace+0x2c3/0x5c0 mm/slub.c:2709
[<ffffffff81cbe9a7>] kmalloc include/linux/slab.h:479 [inline]
[<ffffffff81cbe9a7>] kzalloc include/linux/slab.h:623 [inline]
[<ffffffff81cbe9a7>] kobject_uevent_env+0x2c7/0x1150 lib/kobject_uevent.c:227
[<ffffffff81cbf84f>] kobject_uevent+0x1f/0x30 lib/kobject_uevent.c:374
[<ffffffff81cbb5b9>] kobject_cleanup lib/kobject.c:633 [inline]
[<ffffffff81cbb5b9>] kobject_release+0x229/0x440 lib/kobject.c:675
[<ffffffff81cbb0a2>] kref_sub include/linux/kref.h:73 [inline]
[<ffffffff81cbb0a2>] kref_put include/linux/kref.h:98 [inline]
[<ffffffff81cbb0a2>] kobject_put+0x72/0xd0 lib/kobject.c:692
[<ffffffff8216f095>] put_device+0x25/0x30 drivers/base/core.c:1237
[<ffffffff81c4cc34>] delete_partition_rcu_cb+0x1d4/0x2f0 block/partition-generic.c:232
[<ffffffff813c08bc>] __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
[<ffffffff813c08bc>] rcu_do_batch kernel/rcu/tree.c:2705 [inline]
[<ffffffff813c08bc>] invoke_rcu_callbacks kernel/rcu/tree.c:2973 [inline]
[<ffffffff813c08bc>] __rcu_process_callbacks kernel/rcu/tree.c:2940 [inline]
[<ffffffff813c08bc>] rcu_process_callbacks+0x59c/0x1c70 kernel/rcu/tree.c:2957
[<ffffffff8120f509>] __do_softirq+0x299/0xe20 kernel/softirq.c:273
[<ffffffff81210496>] invoke_softirq kernel/softirq.c:350 [inline]
[<ffffffff81210496>] irq_exit+0x216/0x2c0 kernel/softirq.c:391
[<ffffffff82c2cd7b>] exiting_irq arch/x86/include/asm/apic.h:652 [inline]
[<ffffffff82c2cd7b>] smp_apic_timer_interrupt+0x8b/0xc0 arch/x86/kernel/apic/apic.c:926
[<ffffffff82c2bc25>] apic_timer_interrupt+0xa5/0xb0 arch/x86/entry/entry_64.S:746
<EOI> [<ffffffff814cbf40>] ? audit_kill_trees+0x180/0x180
[<ffffffff8187d2f7>] fd_install+0x57/0x80 fs/file.c:626
[<ffffffff8180989e>] do_sys_open+0x45e/0x550 fs/open.c:1043
[<ffffffff818099c2>] SYSC_open fs/open.c:1055 [inline]
[<ffffffff818099c2>] SyS_open+0x32/0x40 fs/open.c:1050
[<ffffffff82c299e1>] entry_SYSCALL_64_fastpath+0x1e/0x9a
In softirq context, we call rcu callback function delete_partition_rcu_cb(),
which may allocate memory by kzalloc with GFP_KERNEL flag. If the
allocation cannot be satisfied, it may sleep. However, That is not allowed
in softirq contex.
Although we found this problem on linux 4.4, the latest kernel version
seems to have this problem as well. And it is very similar to the
previous one:
https://lkml.org/lkml/2018/7/9/391
Fix it by using RCU workqueue, which allows sleep.
Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6325e01b6cdf4636b721cf7259c1616e3cf28ce2 upstream.
Since the IB_WR_REG_MR opcode value changed, let's set the PVRDMA device
opcodes explicitly.
Reported-by: Ruishuang Wang <ruishuangw@vmware.com>
Fixes: 9a59739bd01f ("IB/rxe: Revise the ib_wr_opcode enum")
Cc: stable@vger.kernel.org
Reviewed-by: Bryan Tan <bryantan@vmware.com>
Reviewed-by: Ruishuang Wang <ruishuangw@vmware.com>
Reviewed-by: Vishnu Dasa <vdasa@vmware.com>
Signed-off-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 321c46b91550adc03054125fa7a1639390608e1a upstream.
So far we never had any device registered for the SoC. This resulted in
some small issues that we kept ignoring like:
1) Not working GPIOLIB_IRQCHIP (gpiochip_irqchip_add_key() failing)
2) Lack of proper tree in the /sys/devices/
3) mips_dma_alloc_coherent() silently handling empty coherent_dma_mask
Kernel 4.19 came with a lot of DMA changes and caused a regression on
bcm47xx. Starting with the commit f8c55dc6e828 ("MIPS: use generic dma
noncoherent ops for simple noncoherent platforms") DMA coherent
allocations just fail. Example:
[ 1.114914] bgmac_bcma bcma0:2: Allocation of TX ring 0x200 failed
[ 1.121215] bgmac_bcma bcma0:2: Unable to alloc memory for DMA
[ 1.127626] bgmac_bcma: probe of bcma0:2 failed with error -12
[ 1.133838] bgmac_bcma: Broadcom 47xx GBit MAC driver loaded
The bgmac driver also triggers a WARNING:
[ 0.959486] ------------[ cut here ]------------
[ 0.964387] WARNING: CPU: 0 PID: 1 at ./include/linux/dma-mapping.h:516 bgmac_enet_probe+0x1b4/0x5c4
[ 0.973751] Modules linked in:
[ 0.976913] CPU: 0 PID: 1 Comm: swapper Not tainted 4.19.9 #0
[ 0.982750] Stack : 804a0000 804597c4 00000000 00000000 80458fd8 8381bc2c 838282d4 80481a47
[ 0.991367] 8042e3ec 00000001 804d38f0 00000204 83980000 00000065 8381bbe0 6f55b24f
[ 0.999975] 00000000 00000000 80520000 00002018 00000000 00000075 00000007 00000000
[ 1.008583] 00000000 80480000 000ee811 00000000 00000000 00000000 80432c00 80248db8
[ 1.017196] 00000009 00000204 83980000 803ad7b0 00000000 801feeec 00000000 804d0000
[ 1.025804] ...
[ 1.028325] Call Trace:
[ 1.030875] [<8000aef8>] show_stack+0x58/0x100
[ 1.035513] [<8001f8b4>] __warn+0xe4/0x118
[ 1.039708] [<8001f9a4>] warn_slowpath_null+0x48/0x64
[ 1.044935] [<80248db8>] bgmac_enet_probe+0x1b4/0x5c4
[ 1.050101] [<802498e0>] bgmac_probe+0x558/0x590
[ 1.054906] [<80252fd0>] bcma_device_probe+0x38/0x70
[ 1.060017] [<8020e1e8>] really_probe+0x170/0x2e8
[ 1.064891] [<8020e714>] __driver_attach+0xa4/0xec
[ 1.069784] [<8020c1e0>] bus_for_each_dev+0x58/0xb0
[ 1.074833] [<8020d590>] bus_add_driver+0xf8/0x218
[ 1.079731] [<8020ef24>] driver_register+0xcc/0x11c
[ 1.084804] [<804b54cc>] bgmac_init+0x1c/0x44
[ 1.089258] [<8000121c>] do_one_initcall+0x7c/0x1a0
[ 1.094343] [<804a1d34>] kernel_init_freeable+0x150/0x218
[ 1.099886] [<803a082c>] kernel_init+0x10/0x104
[ 1.104583] [<80005878>] ret_from_kernel_thread+0x14/0x1c
[ 1.110107] ---[ end trace f441c0d873d1fb5b ]---
This patch setups a "struct device" (and passes it to the bcma) which
allows fixing all the mentioned problems. It'll also require a tiny bcma
patch which will follow through the wireless tree & its maintainer.
Fixes: f8c55dc6e828 ("MIPS: use generic dma noncoherent ops for simple noncoherent platforms")
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: Paul Burton <paul.burton@mips.com>
Acked-by: Hauke Mehrtens <hauke@hauke-m.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: linux-wireless@vger.kernel.org
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit f275ee0fa3a06eb87edc229749cf1eb18f0663fa ]
Commit 65cab850f0ee ("net: Allow class-e address assignment via ifconfig
ioctl") modified the IN_BADCLASS macro a bit, but unfortunatly one too
many '(' characters were added to the line, making any code that used
it, not build properly.
Also, the macro now compares an unsigned with a signed value, which
isn't ok, so fix that up by making both types match properly.
Reported-by: Christopher Ferris <cferris@google.com>
Fixes: 65cab850f0ee ("net: Allow class-e address assignment via ifconfig ioctl")
Cc: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 9e857a40dc4eba15a739b4194d7db873d82c28a0 ]
The bcm87xx and micrel driver has PHYs which are missing the .features
value. Add them. The bcm87xx is a 10G FEC only PHY. Add the needed
features definition of this PHY.
Fixes: 719655a14971 ("net: phy: Replace phy driver features u32 with link_mode bitmap")
Reported-by: Scott Wood <oss@buserror.net>
Reported-by: Camelia Groza <camelia.groza@nxp.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c80f10bc973af2ace6b1414724eeff61eaa71837 upstream.
Instead of removing a empty list node that might be reintroduced soon
thereafter, tentatively place the empty list node on the list passed to
tree_nodes_free(), then re-check if the list is empty again before erasing
it from the tree.
[ Florian: rebase on top of pending nf_conncount fixes ]
Fixes: 5c789e131cbb9 ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit df4a902509766897f7371fdfa4c3bf8bc321b55d upstream.
'lookup' is always followed by 'add'.
Merge both and make the list-walk part of nf_conncount_add().
This also avoids one unneeded unlock/re-lock pair.
Extra care needs to be taken in count_tree, as we only hold rcu
read lock, i.e. we can only insert to an existing tree node after
acquiring its lock and making sure it has a nonzero count.
As a zero count should be rare, just fall back to insert_tree()
(which acquires tree lock).
This issue and its solution were pointed out by Shawn Bohrer
during patch review.
Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d4b09acf924b84bae77cad090a9d108e70b43643 upstream.
if node have NFSv41+ mounts inside several net namespaces
it can lead to use-after-free in svc_process_common()
svc_process_common()
/* Setup reply header */
rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr(rqstp); <<< HERE
svc_process_common() can use incorrect rqstp->rq_xprt,
its caller function bc_svc_process() takes it from serv->sv_bc_xprt.
The problem is that serv is global structure but sv_bc_xprt
is assigned per-netnamespace.
According to Trond, the whole "let's set up rqstp->rq_xprt
for the back channel" is nothing but a giant hack in order
to work around the fact that svc_process_common() uses it
to find the xpt_ops, and perform a couple of (meaningless
for the back channel) tests of xpt_flags.
All we really need in svc_process_common() is to be able to run
rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr()
Bruce J Fields points that this xpo_prep_reply_hdr() call
is an awfully roundabout way just to do "svc_putnl(resv, 0);"
in the tcp case.
This patch does not initialiuze rqstp->rq_xprt in bc_svc_process(),
now it calls svc_process_common() with rqstp->rq_xprt = NULL.
To adjust reply header svc_process_common() just check
rqstp->rq_prot and calls svc_tcp_prep_reply_hdr() for tcp case.
To handle rqstp->rq_xprt = NULL case in functions called from
svc_process_common() patch intruduces net namespace pointer
svc_rqst->rq_bc_net and adjust SVC_NET() definition.
Some other function was also adopted to properly handle described case.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Cc: stable@vger.kernel.org
Fixes: 23c20ecd4475 ("NFS: callback up - users counting cleanup")
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
v2: added lost extern svc_tcp_prep_reply_hdr()
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e4f358916d528d479c3c12bd2fd03f2d5a576380 upstream.
Commit
4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support")
replaced the RETPOLINE define with CONFIG_RETPOLINE checks. Remove the
remaining pieces.
[ bp: Massage commit message. ]
Fixes: 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support")
Signed-off-by: WANG Chao <chao.wang@ucloud.cn>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: linux-kbuild@vger.kernel.org
Cc: srinivas.eeda@oracle.com
Cc: stable <stable@vger.kernel.org>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20181210163725.95977-1-chao.wang@ucloud.cn
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|