| Age | Commit message (Collapse) | Author |
|
commit 4cd2bb12981165f865d2b8ed92b446b52310ef74 upstream.
Since sched_clock_read_begin() and sched_clock_read_retry() are called
by notrace function sched_clock(), they shouldn't be traceable either,
or else ftrace_graph_caller will run into a dead loop on the path
as below (arm for instance):
ftrace_graph_caller()
prepare_ftrace_return()
function_graph_enter()
ftrace_push_return_trace()
trace_clock_local()
sched_clock()
sched_clock_read_begin/retry()
Fixes: 1b86abc1c645 ("sched_clock: Expose struct clock_read_data")
Signed-off-by: Quanyang Wang <quanyang.wang@windriver.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200929082027.16787-1-quanyang.wang@windriver.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit cb47755725da7b90fecbb2aa82ac3b24a7adb89b upstream.
UBSAN reports:
Undefined behaviour in ./include/linux/time64.h:127:27
signed integer overflow:
17179869187 * 1000000000 cannot be represented in type 'long long int'
Call Trace:
timespec64_to_ns include/linux/time64.h:127 [inline]
set_cpu_itimer+0x65c/0x880 kernel/time/itimer.c:180
do_setitimer+0x8e/0x740 kernel/time/itimer.c:245
__x64_sys_setitimer+0x14c/0x2c0 kernel/time/itimer.c:336
do_syscall_64+0xa1/0x540 arch/x86/entry/common.c:295
Commit bd40a175769d ("y2038: itimer: change implementation to timespec64")
replaced the original conversion which handled time clamping correctly with
timespec64_to_ns() which has no overflow protection.
Fix it in timespec64_to_ns() as this is not necessarily limited to the
usage in itimers.
[ tglx: Added comment and adjusted the fixes tag ]
Fixes: 361a3bf00582 ("time64: Add time64.h header and define struct timespec64")
Signed-off-by: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/1598952616-6416-1-git-send-email-prime.zeng@hisilicon.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d1e7c2996e988866e7ceceb4641a0886885b7889 upstream.
Because sugov_update_next_freq() may skip a frequency update even if
the need_freq_update flag has been set for the policy at hand, policy
limits updates may not take effect as expected.
For example, if the intel_pstate driver operates in the passive mode
with HWP enabled, it needs to update the HWP min and max limits when
the policy min and max limits change, respectively, but that may not
happen if the target frequency does not change along with the limit
at hand. In particular, if the policy min is changed first, causing
the target frequency to be adjusted to it, and the policy max limit
is changed later to the same value, the HWP max limit will not be
updated to follow it as expected, because the target frequency is
still equal to the policy min limit and it will not change until
that limit is updated.
To address this issue, modify get_next_freq() to let the driver
callback run if the CPUFREQ_NEED_UPDATE_LIMITS cpufreq driver flag
is set regardless of whether or not the new frequency to set is
equal to the previous one.
Fixes: f6ebbcf08f37 ("cpufreq: intel_pstate: Implement passive mode with HWP enabled")
Reported-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: 1c534352f47f cpufreq: Introduce CPUFREQ_NEED_UPDATE_LIMITS ...
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+: a62f68f5ca53 cpufreq: Introduce cpufreq_driver_test_flags()
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4230e2deaa484b385aa01d598b2aea8e7f2660a6 upstream.
Some architectures assume that the stopped CPUs don't make function calls
to traceable functions when they are in the stopped state. See also commit
cb9d7fd51d9f ("watchdog: Mark watchdog touch functions as notrace").
Violating this assumption causes kernel crashes when switching tracer on
RISC-V.
Mark rcu_momentary_dyntick_idle() and stop_machine_yield() notrace to
prevent this.
Fixes: 4ecf0a43e729 ("processor: get rid of cpu_relax_yield")
Fixes: 366237e7b083 ("stop_machine: Provide RCU quiescent state in multi_cpu_stop()")
Signed-off-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Atish Patra <atish.patra@wdc.com>
Tested-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201021073839.43935-1-zong.li@sifive.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c2f7d08cccf4af2ce6992feaabb9e68e4ae0bff3 upstream.
For all commands except FUTEX_WAIT, the timeout is interpreted as an
absolute value. This absolute value is inside the task's time namespace and
has to be converted to the host's time.
Fixes: 5a590f35add9 ("posix-clocks: Wire up clock_gettime() with timens offsets")
Reported-by: Hans van der Laan <j.h.vanderlaan@student.utwente.nl>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20201015160020.293748-1-avagin@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0a1754b2a97efa644aa6e84d1db5b17c42251483 upstream.
We don't need to check the new buffer size, and the return value
had confused resize_buffer_duplicate_size().
...
ret = ring_buffer_resize(trace_buf->buffer,
per_cpu_ptr(size_buf->data,cpu_id)->entries, cpu_id);
if (ret == 0)
per_cpu_ptr(trace_buf->data, cpu_id)->entries =
per_cpu_ptr(size_buf->data, cpu_id)->entries;
...
Link: https://lkml.kernel.org/r/20201019142242.11560-1-hqjagain@gmail.com
Cc: stable@vger.kernel.org
Fixes: d60da506cbeb3 ("tracing: Add a resize function to make one buffer equivalent to another buffer")
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f747c7e15d7bc71a967a94ceda686cf2460b69e8 upstream.
The rcu_tasks_trace_postgp() function uses for_each_process_thread()
to scan the task list without the benefit of RCU read-side protection,
which can result in use-after-free errors on task_struct structures.
This error was missed because the TRACE01 rcutorture scenario enables
lockdep, but also builds with CONFIG_PREEMPT_NONE=y. In this situation,
preemption is disabled everywhere, so lockdep thinks everywhere can
be a legitimate RCU reader. This commit therefore adds the needed
rcu_read_lock() and rcu_read_unlock().
Note that this bug can occur only after an RCU Tasks Trace CPU stall
warning, which by default only happens after a grace period has extended
for ten minutes (yes, not a typo, minutes).
Fixes: 4593e772b502 ("rcu-tasks: Add stall warnings for RCU Tasks Trace")
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: <bpf@vger.kernel.org>
Cc: <stable@vger.kernel.org> # 5.7.x
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 592031cc10858be4adb10f6c0f2608f6f21824aa upstream.
When rcu_tasks_trace_postgp() function detects an RCU Tasks Trace
CPU stall, it adds all tasks blocking the current grace period to
a list, invoking get_task_struct() on each to prevent them from
being freed while on the list. It then traverses that list,
printing stall-warning messages for each one that is still blocking
the current grace period and removing it from the list. The list
removal invokes the matching put_task_struct().
This of course means that in the admittedly unlikely event that some
task executes its outermost rcu_read_unlock_trace() in the meantime, it
won't be removed from the list and put_task_struct() won't be executing,
resulting in a task_struct leak. This commit therefore makes the list
removal and put_task_struct() unconditional, stopping the leak.
Note further that this bug can occur only after an RCU Tasks Trace CPU
stall warning, which by default only happens after a grace period has
extended for ten minutes (yes, not a typo, minutes).
Fixes: 4593e772b502 ("rcu-tasks: Add stall warnings for RCU Tasks Trace")
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: <bpf@vger.kernel.org>
Cc: <stable@vger.kernel.org> # 5.7.x
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ba3a86e47232ad9f76160929f33ac9c64e4d0567 upstream.
The more intense grace-period processing resulting from the 50x RCU
Tasks Trace grace-period speedups exposed the following race condition:
o Task A running on CPU 0 executes rcu_read_lock_trace(),
entering a read-side critical section.
o When Task A eventually invokes rcu_read_unlock_trace()
to exit its read-side critical section, this function
notes that the ->trc_reader_special.s flag is zero and
and therefore invoke wil set ->trc_reader_nesting to zero
using WRITE_ONCE(). But before that happens...
o The RCU Tasks Trace grace-period kthread running on some other
CPU interrogates Task A, but this fails because this task is
currently running. This kthread therefore sends an IPI to CPU 0.
o CPU 0 receives the IPI, and thus invokes trc_read_check_handler().
Because Task A has not yet cleared its ->trc_reader_nesting
counter, this function sees that Task A is still within its
read-side critical section. This function therefore sets the
->trc_reader_nesting.b.need_qs flag, AKA the .need_qs flag.
Except that Task A has already checked the .need_qs flag, which
is part of the ->trc_reader_special.s flag. The .need_qs flag
therefore remains set until Task A's next rcu_read_unlock_trace().
o Task A now invokes synchronize_rcu_tasks_trace(), which cannot
start a new grace period until the current grace period completes.
And thus cannot return until after that time.
But Task A's .need_qs flag is still set, which prevents the current
grace period from completing. And because Task A is blocked, it
will never execute rcu_read_unlock_trace() until its call to
synchronize_rcu_tasks_trace() returns.
We are therefore deadlocked.
This race is improbable, but 80 hours of rcutorture made it happen twice.
The race was possible before the grace-period speedup, but roughly 50x
less probable. Several thousand hours of rcutorture would have been
necessary to have a reasonable chance of making this happen before this
50x speedup.
This commit therefore eliminates this deadlock by setting
->trc_reader_nesting to a large negative number before checking the
.need_qs and zeroing (or decrementing with respect to its initial
value) ->trc_reader_nesting. For its part, the IPI handler's
trc_read_check_handler() function adds a check for negative values,
deferring evaluation of the task in this case. Taken together, these
changes avoid this deadlock scenario.
Fixes: 276c410448db ("rcu-tasks: Split ->trc_reader_need_end")
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: <bpf@vger.kernel.org>
Cc: <stable@vger.kernel.org> # 5.7.x
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bbeb97464eefc65f506084fd9f18f21653e01137 upstream.
Below race can come, if trace_open and resize of
cpu buffer is running parallely on different cpus
CPUX CPUY
ring_buffer_resize
atomic_read(&buffer->resize_disabled)
tracing_open
tracing_reset_online_cpus
ring_buffer_reset_cpu
rb_reset_cpu
rb_update_pages
remove/insert pages
resetting pointer
This race can cause data abort or some times infinte loop in
rb_remove_pages and rb_insert_pages while checking pages
for sanity.
Take buffer lock to fix this.
Link: https://lkml.kernel.org/r/1601976833-24377-1-git-send-email-gkohli@codeaurora.org
Cc: stable@vger.kernel.org
Fixes: b23d7a5f4a07a ("ring-buffer: speed up buffer resets by avoiding synchronize_rcu for each CPU")
Signed-off-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c307459b9d1fcb8bbf3ea5a4162979532322ef77 upstream.
FIRMWARE_PREALLOC_BUFFER is a "how", not a "what", and confuses the LSMs
that are interested in filtering between types of things. The "how"
should be an internal detail made uninteresting to the LSMs.
Fixes: a098ecd2fa7d ("firmware: support loading into a pre-allocated buffer")
Fixes: fd90bc559bfb ("ima: based on policy verify firmware signatures (pre-allocated buffer)")
Fixes: 4f0496d8ffa3 ("ima: based on policy warn about loading firmware (pre-allocated buffer)")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Scott Branden <scott.branden@broadcom.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201002173828.2099543-2-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit dfe719fef03d752f1682fa8aeddf30ba501c8555 upstream.
Currently, init_listener() tries to prevent adding a filter with
SECCOMP_FILTER_FLAG_NEW_LISTENER if one of the existing filters already
has a listener. However, this check happens without holding any lock that
would prevent another thread from concurrently installing a new filter
(potentially with a listener) on top of the ones we already have.
Theoretically, this is also a data race: The plain load from
current->seccomp.filter can race with concurrent writes to the same
location.
Fix it by moving the check into the region that holds the siglock to guard
against concurrent TSYNC.
(The "Fixes" tag points to the commit that introduced the theoretical
data race; concurrent installation of another filter with TSYNC only
became possible later, in commit 51891498f2da ("seccomp: allow TSYNC and
USER_NOTIF together").)
Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
Reviewed-by: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201005014401.490175-1-jannh@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 7c6967326267bd5c0dded0a99541357d70dd11ac ]
Commit 41c48f3a98231 ("bpf: Support access
to bpf map fields") added support to access map fields
with CORE support. For example,
struct bpf_map {
__u32 max_entries;
} __attribute__((preserve_access_index));
struct bpf_array {
struct bpf_map map;
__u32 elem_size;
} __attribute__((preserve_access_index));
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
__uint(max_entries, 4);
__type(key, __u32);
__type(value, __u32);
} m_array SEC(".maps");
SEC("cgroup_skb/egress")
int cg_skb(void *ctx)
{
struct bpf_array *array = (struct bpf_array *)&m_array;
/* .. array->map.max_entries .. */
}
In kernel, bpf_htab has similar structure,
struct bpf_htab {
struct bpf_map map;
...
}
In the above cg_skb(), to access array->map.max_entries, with CORE, the clang will
generate two builtin's.
base = &m_array;
/* access array.map */
map_addr = __builtin_preserve_struct_access_info(base, 0, 0);
/* access array.map.max_entries */
max_entries_addr = __builtin_preserve_struct_access_info(map_addr, 0, 0);
max_entries = *max_entries_addr;
In the current llvm, if two builtin's are in the same function or
in the same function after inlining, the compiler is smart enough to chain
them together and generates like below:
base = &m_array;
max_entries = *(base + reloc_offset); /* reloc_offset = 0 in this case */
and we are fine.
But if we force no inlining for one of functions in test_map_ptr() selftest, e.g.,
check_default(), the above two __builtin_preserve_* will be in two different
functions. In this case, we will have code like:
func check_hash():
reloc_offset_map = 0;
base = &m_array;
map_base = base + reloc_offset_map;
check_default(map_base, ...)
func check_default(map_base, ...):
max_entries = *(map_base + reloc_offset_max_entries);
In kernel, map_ptr (CONST_PTR_TO_MAP) does not allow any arithmetic.
The above "map_base = base + reloc_offset_map" will trigger a verifier failure.
; VERIFY(check_default(&hash->map, map));
0: (18) r7 = 0xffffb4fe8018a004
2: (b4) w1 = 110
3: (63) *(u32 *)(r7 +0) = r1
R1_w=invP110 R7_w=map_value(id=0,off=4,ks=4,vs=8,imm=0) R10=fp0
; VERIFY_TYPE(BPF_MAP_TYPE_HASH, check_hash);
4: (18) r1 = 0xffffb4fe8018a000
6: (b4) w2 = 1
7: (63) *(u32 *)(r1 +0) = r2
R1_w=map_value(id=0,off=0,ks=4,vs=8,imm=0) R2_w=invP1 R7_w=map_value(id=0,off=4,ks=4,vs=8,imm=0) R10=fp0
8: (b7) r2 = 0
9: (18) r8 = 0xffff90bcb500c000
11: (18) r1 = 0xffff90bcb500c000
13: (0f) r1 += r2
R1 pointer arithmetic on map_ptr prohibited
To fix the issue, let us permit map_ptr + 0 arithmetic which will
result in exactly the same map_ptr.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200908175702.2463625-1-yhs@fb.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b18b099e04f450cdc77bec72acefcde7042bd1f3 ]
On my system the kernel processes the "kgdb_earlycon" parameter before
the "kgdbcon" parameter. When we setup "kgdb_earlycon" we'll end up
in kgdb_register_callbacks() and "kgdb_use_con" won't have been set
yet so we'll never get around to starting "kgdbcon". Let's remedy
this by detecting that the IO module was already registered when
setting "kgdb_use_con" and registering the console then.
As part of this, to avoid pre-declaring things, move the handling of
the "kgdbcon" further down in the file.
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20200630151422.1.I4aa062751ff5e281f5116655c976dff545c09a46@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 921c7ebd1337d1a46783d7e15a850e12aed2eaa0 ]
If should_futex_fail() returns true in futex_wake_pi(), then the 'ret'
variable is set to -EFAULT and then immediately overwritten. So the failure
injection is non-functional.
Fix it by actually leaving the function and returning -EFAULT.
The Fixes tag is kinda blury because the initial commit which introduced
failure injection was already sloppy, but the below mentioned commit broke
it completely.
[ tglx: Massaged changelog ]
Fixes: 6b4f4bc9cb22 ("locking/futex: Allow low-level atomic operations to return -EAGAIN")
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200927000858.24219-1-mateusznosek0@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f8e48a3dca060e80f672d398d181db1298fbc86c ]
It is valid (albeit uncommon) to call local_irq_enable() without first
having called local_irq_disable(). In this case we enter
lockdep_hardirqs_on*() with IRQs enabled and trip a preemption warning
for using __this_cpu_read().
Use this_cpu_read() instead to avoid the warning.
Fixes: 4d004099a6 ("lockdep: Fix lockdep recursion")
Reported-by: syzbot+53f8ce8bbc07924b6417@syzkaller.appspotmail.com
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 761a8c58db6bc884994b28cd6d9707b467d680c1 ]
There was a memory corruption bug happening while running the synthetic
event selftests:
kmemleak: Cannot insert 0xffff8c196fa2afe5 into the object search tree (overlaps existing)
CPU: 5 PID: 6866 Comm: ftracetest Tainted: G W 5.9.0-rc5-test+ #577
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
Call Trace:
dump_stack+0x8d/0xc0
create_object.cold+0x3b/0x60
slab_post_alloc_hook+0x57/0x510
? tracing_map_init+0x178/0x340
__kmalloc+0x1b1/0x390
tracing_map_init+0x178/0x340
event_hist_trigger_func+0x523/0xa40
trigger_process_regex+0xc5/0x110
event_trigger_write+0x71/0xd0
vfs_write+0xca/0x210
ksys_write+0x70/0xf0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fef0a63a487
Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007fff76f18398 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000039 RCX: 00007fef0a63a487
RDX: 0000000000000039 RSI: 000055eb3b26d690 RDI: 0000000000000001
RBP: 000055eb3b26d690 R08: 000000000000000a R09: 0000000000000038
R10: 000055eb3b2cdb80 R11: 0000000000000246 R12: 0000000000000039
R13: 00007fef0a70b500 R14: 0000000000000039 R15: 00007fef0a70b700
kmemleak: Kernel memory leak detector disabled
kmemleak: Object 0xffff8c196fa2afe0 (size 8):
kmemleak: comm "ftracetest", pid 6866, jiffies 4295082531
kmemleak: min_count = 1
kmemleak: count = 0
kmemleak: flags = 0x1
kmemleak: checksum = 0
kmemleak: backtrace:
__kmalloc+0x1b1/0x390
tracing_map_init+0x1be/0x340
event_hist_trigger_func+0x523/0xa40
trigger_process_regex+0xc5/0x110
event_trigger_write+0x71/0xd0
vfs_write+0xca/0x210
ksys_write+0x70/0xf0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The cause came down to a use of strcat() that was adding an string that was
shorten, but the strcat() did not take that into account.
strcat() is extremely dangerous as it does not care how big the buffer is.
Replace it with seq_buf operations that prevent the buffer from being
overwritten if what is being written is bigger than the buffer.
Fixes: 10819e25799a ("tracing: Handle synthetic event array field type checking correctly")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Tested-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 0f2122045b946241a9e549c2a76cea54fa58a7ff upstream.
Grab actual references to the files_struct. To avoid circular references
issues due to this, we add a per-task note that keeps track of what
io_uring contexts a task has used. When the tasks execs or exits its
assigned files, we cancel requests based on this tracking.
With that, we can grab proper references to the files table, and no
longer need to rely on stashing away ring_fd and ring_file to check
if the ring_fd may have been closed.
Cc: stable@vger.kernel.org # v5.5+
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit f959dcd6ddfd29235030e8026471ac1b022ad2b0 ]
When booting the kernel v5.9-rc4 on a VM, the kernel would panic when
printing a warning message in swiotlb_map(). The dev->dma_mask must not
be a NULL pointer when calling the dma mapping layer. A NULL pointer
check can potentially avoid the panic.
Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7f6e4312e15a5c370e84eaa685879b6bdcc717e4 ]
Protect against potential stack overflow that might happen when bpf2bpf
calls get combined with tailcalls. Limit the caller's stack depth for
such case down to 256 so that the worst case scenario would result in 8k
stack size (32 which is tailcall limit * 256 = 8k).
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 428805c0c5e76ef643b1fbc893edfb636b3d8aef ]
get_gendisk grabs a reference on the disk and file operation, so this
code will leak both of them while having absolutely no use for the
gendisk itself.
This effectively reverts commit 2df83fa4bce421f ("PM / Hibernate: Use
get_gendisk to verify partition if resume_file is integer format")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 39d8f0d1026a990604770a658708f5845f7dbec0 ]
Recent improvements in LOCKDEP highlighted a potential A-A deadlock with
pcpu_freelist in NMI:
./tools/testing/selftests/bpf/test_progs -t stacktrace_build_id_nmi
[ 18.984807] ================================
[ 18.984807] WARNING: inconsistent lock state
[ 18.984808] 5.9.0-rc6-01771-g1466de1330e1 #2967 Not tainted
[ 18.984809] --------------------------------
[ 18.984809] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[ 18.984810] test_progs/1990 [HC2[2]:SC0[0]:HE0:SE1] takes:
[ 18.984810] ffffe8ffffc219c0 (&head->lock){....}-{2:2}, at: __pcpu_freelist_pop+0xe3/0x180
[ 18.984813] {INITIAL USE} state was registered at:
[ 18.984814] lock_acquire+0x175/0x7c0
[ 18.984814] _raw_spin_lock+0x2c/0x40
[ 18.984815] __pcpu_freelist_pop+0xe3/0x180
[ 18.984815] pcpu_freelist_pop+0x31/0x40
[ 18.984816] htab_map_alloc+0xbbf/0xf40
[ 18.984816] __do_sys_bpf+0x5aa/0x3ed0
[ 18.984817] do_syscall_64+0x2d/0x40
[ 18.984818] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 18.984818] irq event stamp: 12
[...]
[ 18.984822] other info that might help us debug this:
[ 18.984823] Possible unsafe locking scenario:
[ 18.984823]
[ 18.984824] CPU0
[ 18.984824] ----
[ 18.984824] lock(&head->lock);
[ 18.984826] <Interrupt>
[ 18.984826] lock(&head->lock);
[ 18.984827]
[ 18.984828] *** DEADLOCK ***
[ 18.984828]
[ 18.984829] 2 locks held by test_progs/1990:
[...]
[ 18.984838] <NMI>
[ 18.984838] dump_stack+0x9a/0xd0
[ 18.984839] lock_acquire+0x5c9/0x7c0
[ 18.984839] ? lock_release+0x6f0/0x6f0
[ 18.984840] ? __pcpu_freelist_pop+0xe3/0x180
[ 18.984840] _raw_spin_lock+0x2c/0x40
[ 18.984841] ? __pcpu_freelist_pop+0xe3/0x180
[ 18.984841] __pcpu_freelist_pop+0xe3/0x180
[ 18.984842] pcpu_freelist_pop+0x17/0x40
[ 18.984842] ? lock_release+0x6f0/0x6f0
[ 18.984843] __bpf_get_stackid+0x534/0xaf0
[ 18.984843] bpf_prog_1fd9e30e1438d3c5_oncpu+0x73/0x350
[ 18.984844] bpf_overflow_handler+0x12f/0x3f0
This is because pcpu_freelist_head.lock is accessed in both NMI and
non-NMI context. Fix this issue by using raw_spin_trylock() in NMI.
Since NMI interrupts non-NMI context, when NMI context tries to lock the
raw_spinlock, non-NMI context of the same CPU may already have locked a
lock and is blocked from unlocking the lock. For a system with N CPUs,
there could be N NMIs at the same time, and they may block N non-NMI
raw_spinlocks. This is tricky for pcpu_freelist_push(), where unlike
_pop(), failing _push() means leaking memory. This issue is more likely to
trigger in non-SMP system.
Fix this issue with an extra list, pcpu_freelist.extralist. The extralist
is primarily used to take _push() when raw_spin_trylock() failed on all
the per CPU lists. It should be empty most of the time. The following
table summarizes the behavior of pcpu_freelist in NMI and non-NMI:
non-NMI pop(): use _lock(); check per CPU lists first;
if all per CPU lists are empty, check extralist;
if extralist is empty, return NULL.
non-NMI push(): use _lock(); only push to per CPU lists.
NMI pop(): use _trylock(); check per CPU lists first;
if all per CPU lists are locked or empty, check extralist;
if extralist is locked or empty, return NULL.
NMI push(): use _trylock(); check per CPU lists first;
if all per CPU lists are locked; try push to extralist;
if extralist is also locked, keep trying on per CPU lists.
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201005165838.3735218-1-songliubraving@fb.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 70d932985757fbe978024db313001218e9f8fe5c ]
The current notifiers have the following error handling pattern all
over the place:
int err, nr;
err = __foo_notifier_call_chain(&chain, val_up, v, -1, &nr);
if (err & NOTIFIER_STOP_MASK)
__foo_notifier_call_chain(&chain, val_down, v, nr-1, NULL)
And aside from the endless repetition thereof, it is broken. Consider
blocking notifiers; both calls take and drop the rwsem, this means
that the notifier list can change in between the two calls, making @nr
meaningless.
Fix this by replacing all the __foo_notifier_call_chain() functions
with foo_notifier_call_chain_robust() that embeds the above pattern,
but ensures it is inside a single lock region.
Note: I switched atomic_notifier_call_chain_robust() to use
the spinlock, since RCU cannot provide the guarantee
required for the recovery.
Note: software_resume() error handling was broken afaict.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20200818135804.325626653@infradead.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c51f8f88d705e06bd696d7510aff22b33eb8e638 ]
Non-cryptographic PRNGs may have great statistical properties, but
are usually trivially predictable to someone who knows the algorithm,
given a small sample of their output. An LFSR like prandom_u32() is
particularly simple, even if the sample is widely scattered bits.
It turns out the network stack uses prandom_u32() for some things like
random port numbers which it would prefer are *not* trivially predictable.
Predictability led to a practical DNS spoofing attack. Oops.
This patch replaces the LFSR with a homebrew cryptographic PRNG based
on the SipHash round function, which is in turn seeded with 128 bits
of strong random key. (The authors of SipHash have *not* been consulted
about this abuse of their algorithm.) Speed is prioritized over security;
attacks are rare, while performance is always wanted.
Replacing all callers of prandom_u32() is the quick fix.
Whether to reinstate a weaker PRNG for uses which can tolerate it
is an open question.
Commit f227e3ec3b5c ("random32: update the net random state on interrupt
and activity") was an earlier attempt at a solution. This patch replaces
it.
Reported-by: Amit Klein <aksecurity@gmail.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tytso@mit.edu
Cc: Florian Westphal <fw@strlen.de>
Cc: Marc Plumb <lkml.mplumb@gmail.com>
Fixes: f227e3ec3b5c ("random32: update the net random state on interrupt and activity")
Signed-off-by: George Spelvin <lkml@sdf.org>
Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
[ willy: partial reversal of f227e3ec3b5c; moved SIPROUND definitions
to prandom.h for later use; merged George's prandom_seed() proposal;
inlined siprand_u32(); replaced the net_rand_state[] array with 4
members to fix a build issue; cosmetic cleanups to make checkpatch
happy; fixed RANDOM32_SELFTEST build ]
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a73f863af4ce9730795eab7097fb2102e6854365 ]
Commit:
765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds")
made sched features static for !CONFIG_SCHED_DEBUG configurations, but
overlooked the CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL cases.
For the latter echoing changes to /sys/kernel/debug/sched_features has
the nasty effect of effectively changing what sched_features reports,
but without actually changing the scheduler behaviour (since different
translation units get different sysctl_sched_features).
Fix CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL configurations by properly
restructuring ifdefs.
Fixes: 765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds")
Co-developed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Patrick Bellasi <patrick.bellasi@matbug.net>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lore.kernel.org/r/20201013053114.160628-1-juri.lelli@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 93c230e3f5bd6e1d2b2759d582fdfe9c2731473b ]
The commit af7ec1383361 ("bpf: Add bpf_skc_to_tcp6_sock() helper")
introduces RET_PTR_TO_BTF_ID_OR_NULL and
the commit eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()")
introduces RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL.
Note that for RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL, the reg0->type
could become PTR_TO_MEM_OR_NULL which is not covered by
BPF_PROBE_MEM.
The BPF_REG_0 will then hold a _OR_NULL pointer type. This _OR_NULL
pointer type requires the bpf program to explicitly do a NULL check first.
After NULL check, the verifier will mark all registers having
the same reg->id as safe to use. However, the reg->id
is not set for those new _OR_NULL return types. One of the ways
that may be wrong is, checking NULL for one btf_id typed pointer will
end up validating all other btf_id typed pointers because
all of them have id == 0. The later tests will exercise
this path.
To fix it and also avoid similar issue in the future, this patch
moves the id generation logic out of each individual RET type
test in check_helper_call(). Instead, it does one
reg_type_may_be_null() test and then do the id generation
if needed.
This patch also adds a WARN_ON_ONCE in mark_ptr_or_null_reg()
to catch future breakage.
The _OR_NULL pointer usage in the bpf_iter_reg.ctx_arg_info is
fine because it just happens that the existing id generation after
check_ctx_access() has covered it. It is also using the
reg_type_may_be_null() to decide if id generation is needed or not.
Fixes: af7ec1383361 ("bpf: Add bpf_skc_to_tcp6_sock() helper")
Fixes: eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201019194212.1050855-1-kafai@fb.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit fdf09ab887829cd1b671e45d9549f8ec1ffda0fa ]
Corentin hit the following workqueue warning when running with
CRYPTO_MANAGER_EXTRA_TESTS:
WARNING: CPU: 2 PID: 147 at kernel/workqueue.c:1473 __queue_work+0x3b8/0x3d0
Modules linked in: ghash_generic
CPU: 2 PID: 147 Comm: modprobe Not tainted
5.6.0-rc1-next-20200214-00068-g166c9264f0b1-dirty #545
Hardware name: Pine H64 model A (DT)
pc : __queue_work+0x3b8/0x3d0
Call trace:
__queue_work+0x3b8/0x3d0
queue_work_on+0x6c/0x90
do_init_module+0x188/0x1f0
load_module+0x1d00/0x22b0
I wasn't able to reproduce on x86 or rpi 3b+.
This is
WARN_ON(!list_empty(&work->entry))
from __queue_work(), and it happens because the init_free_wq work item
isn't initialized in time for a crypto test that requests the gcm
module. Some crypto tests were recently moved earlier in boot as
explained in commit c4741b230597 ("crypto: run initcalls for generic
implementations earlier"), which went into mainline less than two weeks
before the Fixes commit.
Avoid the warning by statically initializing init_free_wq and the
corresponding llist.
Link: https://lore.kernel.org/lkml/20200217204803.GA13479@Red/
Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag")
Reported-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Tested-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Tested-on: sun50i-h6-pine-h64
Tested-on: imx8mn-ddr4-evk
Tested-on: sun50i-a64-bananapi-m64
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d081a6e353168f15e63eb9e9334757f20343319f ]
Currently using forward search doesn't handle multi-line strings correctly.
The search routine replaces line breaks with \0 during the search and, for
regular searches ("help | grep Common\n"), there is code after the line
has been discarded or printed to replace the break character.
However during a pager search ("help\n" followed by "/Common\n") when the
string is matched we will immediately return to normal output and the code
that should restore the \n becomes unreachable. Fix this by restoring the
replaced character when we disable the search mode and update the comment
accordingly.
Fixes: fb6daa7520f9d ("kdb: Provide forward search at more prompt")
Link: https://lore.kernel.org/r/20200909141708.338273-1-daniel.thompson@linaro.org
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 58db5785b0d76be4582a32a7900acce88e691d36 ]
Currently in the unlikely event that buf fails to be allocated it
is dereferenced a few times. Use the errexit flag to determine if
buf should be written to to avoid the null pointer dereferences.
Addresses-Coverity: ("Dereference after null check")
Fixes: f518f154ecef ("refperf: Dynamically allocate experiment-summary output buffer")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c8fa63714763b7795a3f5fb7ed6d000763e6dccc ]
The conversion of rcu_fwds to dynamic allocation failed to actually
allocate the required structure. This commit therefore allocates it,
frees it, and updates rcu_fwds accordingly. While in the area, it
abstracts the cleanup actions into rcu_torture_fwd_prog_cleanup().
Fixes: 5155be9994e5 ("rcutorture: Dynamically allocate rcu_fwds structure")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9c39245382de4d52a122641952900709d4a9950b ]
On callback overload, it is necessary to quickly detect idle CPUs,
and rcu_gp_fqs_check_wake() checks for this condition. Unfortunately,
the code following the call to this function does not repeat this check,
which means that in reality no actual quiescent-state forcing, instead
only a couple of quick and pointless wakeups at the beginning of the
grace period.
This commit therefore adds a check for the RCU_GP_FLAG_OVLD flag in
the post-wakeup "if" statement in rcu_gp_fqs_loop().
Fixes: 1fca4d12f4637 ("rcu: Expedite first two FQS scans under callback-overload conditions")
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 10819e25799aae564005b6049a45e9808797b3bb ]
Since synthetic event array types are derived from the field name,
there may be a semicolon at the end of the type which should be
stripped off.
If there are more characters following that, normal type string
checking will result in an invalid type.
Without this patch, you can end up with an invalid field type string
that gets displayed in both the synthetic event description and the
event format:
Before:
# echo 'myevent char str[16]; int v' >> synthetic_events
# cat synthetic_events
myevent char[16]; str; int v
name: myevent
ID: 1936
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:char str[16];; offset:8; size:16; signed:1;
field:int v; offset:40; size:4; signed:1;
print fmt: "str=%s, v=%d", REC->str, REC->v
After:
# echo 'myevent char str[16]; int v' >> synthetic_events
# cat synthetic_events
myevent char[16] str; int v
# cat events/synthetic/myevent/format
name: myevent
ID: 1936
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:char str[16]; offset:8; size:16; signed:1;
field:int v; offset:40; size:4; signed:1;
print fmt: "str=%s, v=%d", REC->str, REC->v
Link: https://lkml.kernel.org/r/6587663b56c2d45ab9d8c8472a2110713cdec97d.1602598160.git.zanussi@kernel.org
[ <rostedt@goodmis.org>: wrote parse_synth_field() snippet. ]
Fixes: 4b147936fa50 (tracing: Add support for 'synthetic' events)
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 67197a4f28d28d0b073ab0427b03cb2ee5382578 ]
Currently __set_oom_adj loops through all processes in the system to keep
oom_score_adj and oom_score_adj_min in sync between processes sharing
their mm. This is done for any task with more that one mm_users, which
includes processes with multiple threads (sharing mm and signals).
However for such processes the loop is unnecessary because their signal
structure is shared as well.
Android updates oom_score_adj whenever a tasks changes its role
(background/foreground/...) or binds to/unbinds from a service, making it
more/less important. Such operation can happen frequently. We noticed
that updates to oom_score_adj became more expensive and after further
investigation found out that the patch mentioned in "Fixes" introduced a
regression. Using Pixel 4 with a typical Android workload, write time to
oom_score_adj increased from ~3.57us to ~362us. Moreover this regression
linearly depends on the number of multi-threaded processes running on the
system.
Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with
(CLONE_VM && !CLONE_THREAD && !CLONE_VFORK). Change __set_oom_adj to use
MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj
update should be synchronized between multiple processes. To prevent
races between clone() and __set_oom_adj(), when oom_score_adj of the
process being cloned might be modified from userspace, we use
oom_adj_mutex. Its scope is changed to global.
The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for
the case of vfork(). To prevent performance regressions of vfork(), we
skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is
specified. Clearing the MMF_MULTIPROCESS flag (when the last process
sharing the mm exits) is left out of this patch to keep it simple and
because it is believed that this threading model is rare. Should there
ever be a need for optimizing that case as well, it can be done by hooking
into the exit path, likely following the mm_update_next_owner pattern.
With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being
quite rare, the regression is gone after the change is applied.
[surenb@google.com: v3]
Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com
Fixes: 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
Reported-by: Tim Murray <timmurray@google.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Christian Kellner <christian@kellner.me>
Cc: Adrian Reber <areber@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.com
Debugged-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8fbeb52a598c7ab5aa603d6bb083b8a8d16d607a ]
synth_field_size() returns either a positive size or an error (zero or
a negative value). However, the existing code assumes the only error
value is 0. It doesn't handle negative error codes, as it assigns
directly to field->size (a size_t; unsigned), thereby interpreting the
error code as a valid size instead.
Do the test before assignment to field->size.
[ axelrasmussen@google.com: changelog addition, first paragraph above ]
Link: https://lkml.kernel.org/r/9b6946d9776b2eeb43227678158196de1c3c6e1d.1601848695.git.zanussi@kernel.org
Fixes: 4b147936fa50 (tracing: Add support for 'synthetic' events)
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1af9270e908cd50a4f5d815c9b6f794c7d57ed07 ]
From the checks and commit messages for modify_return, it seems it was
never the intention that it should be possible to attach a tracing program
with expected_attach_type == BPF_MODIFY_RETURN to another BPF program.
However, check_attach_modify_return() will only look at the function name,
so if the target function starts with "security_", the attach will be
allowed even for bpf2bpf attachment.
Fix this oversight by also blocking the modification if a target program is
supplied.
Fixes: 18644cec714a ("bpf: Fix use-after-free in fmod_ret check")
Fixes: 6ba43b761c41 ("bpf: Attachment verification for BPF_MODIFY_RETURN")
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f91072ed1b7283b13ca57fcfbece5a3b92726143 ]
There's a possible race in perf_mmap_close() when checking ring buffer's
mmap_count refcount value. The problem is that the mmap_count check is
not atomic because we call atomic_dec() and atomic_read() separately.
perf_mmap_close:
...
atomic_dec(&rb->mmap_count);
...
if (atomic_read(&rb->mmap_count))
goto out_put;
<ring buffer detach>
free_uid
out_put:
ring_buffer_put(rb); /* could be last */
The race can happen when we have two (or more) events sharing same ring
buffer and they go through atomic_dec() and then they both see 0 as refcount
value later in atomic_read(). Then both will go on and execute code which
is meant to be run just once.
The code that detaches ring buffer is probably fine to be executed more
than once, but the problem is in calling free_uid(), which will later on
demonstrate in related crashes and refcount warnings, like:
refcount_t: addition on 0; use-after-free.
...
RIP: 0010:refcount_warn_saturate+0x6d/0xf
...
Call Trace:
prepare_creds+0x190/0x1e0
copy_creds+0x35/0x172
copy_process+0x471/0x1a80
_do_fork+0x83/0x3a0
__do_sys_wait4+0x83/0x90
__do_sys_clone+0x85/0xa0
do_syscall_64+0x5b/0x1e0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Using atomic decrease and check instead of separated calls.
Tested-by: Michael Petlan <mpetlan@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Wade Mealing <wmealing@redhat.com>
Fixes: 9bb5d40cd93c ("perf: Fix mmap() accounting hole");
Link: https://lore.kernel.org/r/20200916115311.GE2301783@krava
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 4d004099a668c41522242aa146a38cc4eb59cb1e ]
Steve reported that lockdep_assert*irq*(), when nested inside lockdep
itself, will trigger a false-positive.
One example is the stack-trace code, as called from inside lockdep,
triggering tracing, which in turn calls RCU, which then uses
lockdep_assert_irqs_disabled().
Fixes: a21ee6055c30 ("lockdep: Change hardirq{s_enabled,_context} to per-cpu variables")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2bb8945bcc1a768f2bc402a16c9610bba8d5187d ]
Basically print_lock_class_header()'s for loop is out of sync with the
the size of of ->usage_traces[].
Also clean things up a bit while at it, to avoid such mishaps in the future.
Fixes: 23870f122768 ("locking/lockdep: Fix "USED" <- "IN-NMI" inversions")
Reported-by: Qian Cai <cai@redhat.com>
Debugged-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Qian Cai <cai@redhat.com>
Link: https://lkml.kernel.org/r/20200930094937.GE2651@hirez.programming.kicks-ass.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 233e7aca4c8a2c764f556bba9644c36154017e7f ]
Barry Song noted the following
Something is wrong. In find_busiest_group(), we are checking if
src has higher load, however, in task_numa_find_cpu(), we are
checking if dst will have higher load after balancing. It seems
it is not sensible to check src.
It maybe cause wrong imbalance value, for example,
if dst_running = env->dst_stats.nr_running + 1 results in 3 or
above, and src_running = env->src_stats.nr_running - 1 results
in 1;
The current code is thinking imbalance as 0 since src_running is
smaller than 2. This is inconsistent with load balancer.
Basically, in find_busiest_group(), the NUMA imbalance is ignored if moving
a task "from an almost idle domain" to a "domain with spare capacity". This
patch forbids movement "from a misplaced domain" to "an almost idle domain"
as that is closer to what the CPU load balancer expects.
This patch is not a universal win. The old behaviour was intended to allow
a task from an almost idle NUMA node to migrate to its preferred node if
the destination had capacity but there are corner cases. For example,
a NAS compute load could be parallelised to use 1/3rd of available CPUs
but not all those potential tasks are active at all times allowing this
logic to trigger. An obvious example is specjbb 2005 running various
numbers of warehouses on a 2 socket box with 80 cpus.
specjbb
5.9.0-rc4 5.9.0-rc4
vanilla dstbalance-v1r1
Hmean tput-1 46425.00 ( 0.00%) 43394.00 * -6.53%*
Hmean tput-2 98416.00 ( 0.00%) 96031.00 * -2.42%*
Hmean tput-3 150184.00 ( 0.00%) 148783.00 * -0.93%*
Hmean tput-4 200683.00 ( 0.00%) 197906.00 * -1.38%*
Hmean tput-5 236305.00 ( 0.00%) 245549.00 * 3.91%*
Hmean tput-6 281559.00 ( 0.00%) 285692.00 * 1.47%*
Hmean tput-7 338558.00 ( 0.00%) 334467.00 * -1.21%*
Hmean tput-8 340745.00 ( 0.00%) 372501.00 * 9.32%*
Hmean tput-9 424343.00 ( 0.00%) 413006.00 * -2.67%*
Hmean tput-10 421854.00 ( 0.00%) 434261.00 * 2.94%*
Hmean tput-11 493256.00 ( 0.00%) 485330.00 * -1.61%*
Hmean tput-12 549573.00 ( 0.00%) 529959.00 * -3.57%*
Hmean tput-13 593183.00 ( 0.00%) 555010.00 * -6.44%*
Hmean tput-14 588252.00 ( 0.00%) 599166.00 * 1.86%*
Hmean tput-15 623065.00 ( 0.00%) 642713.00 * 3.15%*
Hmean tput-16 703924.00 ( 0.00%) 660758.00 * -6.13%*
Hmean tput-17 666023.00 ( 0.00%) 697675.00 * 4.75%*
Hmean tput-18 761502.00 ( 0.00%) 758360.00 * -0.41%*
Hmean tput-19 796088.00 ( 0.00%) 798368.00 * 0.29%*
Hmean tput-20 733564.00 ( 0.00%) 823086.00 * 12.20%*
Hmean tput-21 840980.00 ( 0.00%) 856711.00 * 1.87%*
Hmean tput-22 804285.00 ( 0.00%) 872238.00 * 8.45%*
Hmean tput-23 795208.00 ( 0.00%) 889374.00 * 11.84%*
Hmean tput-24 848619.00 ( 0.00%) 966783.00 * 13.92%*
Hmean tput-25 750848.00 ( 0.00%) 903790.00 * 20.37%*
Hmean tput-26 780523.00 ( 0.00%) 962254.00 * 23.28%*
Hmean tput-27 1042245.00 ( 0.00%) 991544.00 * -4.86%*
Hmean tput-28 1090580.00 ( 0.00%) 1035926.00 * -5.01%*
Hmean tput-29 999483.00 ( 0.00%) 1082948.00 * 8.35%*
Hmean tput-30 1098663.00 ( 0.00%) 1113427.00 * 1.34%*
Hmean tput-31 1125671.00 ( 0.00%) 1134175.00 * 0.76%*
Hmean tput-32 968167.00 ( 0.00%) 1250286.00 * 29.14%*
Hmean tput-33 1077676.00 ( 0.00%) 1060893.00 * -1.56%*
Hmean tput-34 1090538.00 ( 0.00%) 1090933.00 * 0.04%*
Hmean tput-35 967058.00 ( 0.00%) 1107421.00 * 14.51%*
Hmean tput-36 1051745.00 ( 0.00%) 1210663.00 * 15.11%*
Hmean tput-37 1019465.00 ( 0.00%) 1351446.00 * 32.56%*
Hmean tput-38 1083102.00 ( 0.00%) 1064541.00 * -1.71%*
Hmean tput-39 1232990.00 ( 0.00%) 1303623.00 * 5.73%*
Hmean tput-40 1175542.00 ( 0.00%) 1340943.00 * 14.07%*
Hmean tput-41 1127826.00 ( 0.00%) 1339492.00 * 18.77%*
Hmean tput-42 1198313.00 ( 0.00%) 1411023.00 * 17.75%*
Hmean tput-43 1163733.00 ( 0.00%) 1228253.00 * 5.54%*
Hmean tput-44 1305562.00 ( 0.00%) 1357886.00 * 4.01%*
Hmean tput-45 1326752.00 ( 0.00%) 1406061.00 * 5.98%*
Hmean tput-46 1339424.00 ( 0.00%) 1418451.00 * 5.90%*
Hmean tput-47 1415057.00 ( 0.00%) 1381570.00 * -2.37%*
Hmean tput-48 1392003.00 ( 0.00%) 1421167.00 * 2.10%*
Hmean tput-49 1408374.00 ( 0.00%) 1418659.00 * 0.73%*
Hmean tput-50 1359822.00 ( 0.00%) 1391070.00 * 2.30%*
Hmean tput-51 1414246.00 ( 0.00%) 1392679.00 * -1.52%*
Hmean tput-52 1432352.00 ( 0.00%) 1354020.00 * -5.47%*
Hmean tput-53 1387563.00 ( 0.00%) 1409563.00 * 1.59%*
Hmean tput-54 1406420.00 ( 0.00%) 1388711.00 * -1.26%*
Hmean tput-55 1438804.00 ( 0.00%) 1387472.00 * -3.57%*
Hmean tput-56 1399465.00 ( 0.00%) 1400296.00 * 0.06%*
Hmean tput-57 1428132.00 ( 0.00%) 1396399.00 * -2.22%*
Hmean tput-58 1432385.00 ( 0.00%) 1386253.00 * -3.22%*
Hmean tput-59 1421612.00 ( 0.00%) 1371416.00 * -3.53%*
Hmean tput-60 1429423.00 ( 0.00%) 1389412.00 * -2.80%*
Hmean tput-61 1396230.00 ( 0.00%) 1351122.00 * -3.23%*
Hmean tput-62 1418396.00 ( 0.00%) 1383098.00 * -2.49%*
Hmean tput-63 1409918.00 ( 0.00%) 1374662.00 * -2.50%*
Hmean tput-64 1410236.00 ( 0.00%) 1376216.00 * -2.41%*
Hmean tput-65 1396405.00 ( 0.00%) 1364418.00 * -2.29%*
Hmean tput-66 1395975.00 ( 0.00%) 1357326.00 * -2.77%*
Hmean tput-67 1392986.00 ( 0.00%) 1349642.00 * -3.11%*
Hmean tput-68 1386541.00 ( 0.00%) 1343261.00 * -3.12%*
Hmean tput-69 1374407.00 ( 0.00%) 1342588.00 * -2.32%*
Hmean tput-70 1377513.00 ( 0.00%) 1334654.00 * -3.11%*
Hmean tput-71 1369319.00 ( 0.00%) 1334952.00 * -2.51%*
Hmean tput-72 1354635.00 ( 0.00%) 1329005.00 * -1.89%*
Hmean tput-73 1350933.00 ( 0.00%) 1318942.00 * -2.37%*
Hmean tput-74 1351714.00 ( 0.00%) 1316347.00 * -2.62%*
Hmean tput-75 1352198.00 ( 0.00%) 1309974.00 * -3.12%*
Hmean tput-76 1349490.00 ( 0.00%) 1286064.00 * -4.70%*
Hmean tput-77 1336131.00 ( 0.00%) 1303684.00 * -2.43%*
Hmean tput-78 1308896.00 ( 0.00%) 1271024.00 * -2.89%*
Hmean tput-79 1326703.00 ( 0.00%) 1290862.00 * -2.70%*
Hmean tput-80 1336199.00 ( 0.00%) 1291629.00 * -3.34%*
The performance at the mid-point is better but not universally better. The
patch is a mixed bag depending on the workload, machine and overall
levels of utilisation. Sometimes it's better (sometimes much better),
other times it is worse (sometimes much worse). Given that there isn't a
universally good decision in this section and more people seem to prefer
the patch then it may be best to keep the LB decisions consistent and
revisit imbalance handling when the load balancer code changes settle down.
Jirka Hladky added the following observation.
Our results are mostly in line with what you see. We observe
big gains (20-50%) when the system is loaded to 1/3 of the
maximum capacity and mixed results at the full load - some
workloads benefit from the patch at the full load, others not,
but performance changes at the full load are mostly within the
noise of results (+/-5%). Overall, we think this patch is helpful.
[mgorman@techsingularity.net: Rewrote changelog]
Fixes: fb86f5b211 ("sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity")
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200921221849.GI3179@techsingularity.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit df3cb4ea1fb63ff326488efd671ba3c39034255e ]
We've met problems that occasionally tasks with full cpumask
(e.g. by putting it into a cpuset or setting to full affinity)
were migrated to our isolated cpus in production environment.
After some analysis, we found that it is due to the current
select_idle_smt() not considering the sched_domain mask.
Steps to reproduce on my 31-CPU hyperthreads machine:
1. with boot parameter: "isolcpus=domain,2-31"
(thread lists: 0,16 and 1,17)
2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads"
3. some threads will be migrated to the isolated cpu16~17.
Fix it by checking the valid domain mask in select_idle_smt().
Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings())
Reported-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/1600930127-76857-1-git-send-email-xlpang@linux.alibaba.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit da0777d35f47892f359c3f73ea155870bb595700 ]
In find_energy_efficient_cpu() 'cpu_cap' could be less that 'util'.
It might be because of RT, DL (so higher sched class than CFS), irq or
thermal pressure signal, which reduce the capacity value.
In such situation the result of 'cpu_cap - util' might be negative but
stored in the unsigned long. Then it might be compared with other unsigned
long when uclamp_rq_util_with() reduced the 'util' such that is passes the
fits_capacity() check.
Prevent this situation and make the arithmetic more safe.
Fixes: 1d42509e475cd ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20200810083004.26420-1-lukasz.luba@arm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Ingo Molnar:
"Fix an error handling bug that can cause a lockup if a CPU is offline
(doh ...)"
* tag 'perf-urgent-2020-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix task_function_call() error handling
|
|
The error handling introduced by commit:
2ed6edd33a21 ("perf: Add cond_resched() to task_function_call()")
looses any return value from smp_call_function_single() that is not
{0, -EINVAL}. This is a problem because it will return -EXNIO when the
target CPU is offline. Worse, in that case it'll turn into an infinite
loop.
Fixes: 2ed6edd33a21 ("perf: Add cond_resched() to task_function_call()")
Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Barret Rhoden <brho@google.com>
Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: https://lkml.kernel.org/r/20200827064732.20860-1-kjain@linux.ibm.com
|
|
Pull networking fixes from Jakub Kicinski:
"One more set of fixes from the networking tree:
- add missing input validation in nl80211_del_key(), preventing
out-of-bounds access
- last minute fix / improvement of a MRP netlink (uAPI) interface
introduced in 5.9 (current) release
- fix "unresolved symbol" build error under CONFIG_NET w/o
CONFIG_INET due to missing tcp_timewait_sock and inet_timewait_sock
BTF.
- fix 32 bit sub-register bounds tracking in the bpf verifier for OR
case
- tcp: fix receive window update in tcp_add_backlog()
- openvswitch: handle DNAT tuple collision in conntrack-related code
- r8169: wait for potential PHY reset to finish after applying a FW
file, avoiding unexpected PHY behaviour and failures later on
- mscc: fix tail dropping watermarks for Ocelot switches
- avoid use-after-free in macsec code after a call to the GRO layer
- avoid use-after-free in sctp error paths
- add a device id for Cellient MPL200 WWAN card
- rxrpc fixes:
- fix the xdr encoding of the contents read from an rxrpc key
- fix a BUG() for a unsupported encoding type.
- fix missing _bh lock annotations.
- fix acceptance handling for an incoming call where the incoming
call is encrypted.
- the server token keyring isn't network namespaced - it belongs
to the server, so there's no need. Namespacing it means that
request_key() fails to find it.
- fix a leak of the server keyring"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (21 commits)
net: usb: qmi_wwan: add Cellient MPL200 card
macsec: avoid use-after-free in macsec_handle_frame()
r8169: consider that PHY reset may still be in progress after applying firmware
openvswitch: handle DNAT tuple collision
sctp: fix sctp_auth_init_hmacs() error path
bridge: Netlink interface fix.
net: wireless: nl80211: fix out-of-bounds access in nl80211_del_key()
bpf: Fix scalar32_min_max_or bounds tracking
tcp: fix receive window update in tcp_add_backlog()
net: usb: rtl8150: set random MAC address when set_ethernet_addr() fails
mptcp: more DATA FIN fixes
net: mscc: ocelot: warn when encoding an out-of-bounds watermark value
net: mscc: ocelot: divide watermark value by 60 when writing to SYS_ATOP
net: qrtr: ns: Fix the incorrect usage of rcu_read_lock()
rxrpc: Fix server keyring leak
rxrpc: The server keyring isn't network-namespaced
rxrpc: Fix accept on a connection that need securing
rxrpc: Fix some missing _bh annotations on locking conn->state_lock
rxrpc: Downgrade the BUG() for unsupported token type in rxrpc_read()
rxrpc: Fix rxkad token xdr encoding
...
|
|
Daniel Borkmann says:
====================
pull-request: bpf 2020-10-08
The main changes are:
1) Fix "unresolved symbol" build error under CONFIG_NET w/o CONFIG_INET due
to missing tcp_timewait_sock and inet_timewait_sock BTF, from Yonghong Song.
2) Fix 32 bit sub-register bounds tracking for OR case, from Daniel Borkmann.
====================
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Simon reported an issue with the current scalar32_min_max_or() implementation.
That is, compared to the other 32 bit subreg tracking functions, the code in
scalar32_min_max_or() stands out that it's using the 64 bit registers instead
of 32 bit ones. This leads to bounds tracking issues, for example:
[...]
8: R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=mmmmmmmm
8: (79) r1 = *(u64 *)(r0 +0)
R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=mmmmmmmm
9: R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R1_w=inv(id=0) R10=fp0 fp-8=mmmmmmmm
9: (b7) r0 = 1
10: R0_w=inv1 R1_w=inv(id=0) R10=fp0 fp-8=mmmmmmmm
10: (18) r2 = 0x600000002
12: R0_w=inv1 R1_w=inv(id=0) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
12: (ad) if r1 < r2 goto pc+1
R0_w=inv1 R1_w=inv(id=0,umin_value=25769803778) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
13: R0_w=inv1 R1_w=inv(id=0,umin_value=25769803778) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
13: (95) exit
14: R0_w=inv1 R1_w=inv(id=0,umax_value=25769803777,var_off=(0x0; 0x7ffffffff)) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
14: (25) if r1 > 0x0 goto pc+1
R0_w=inv1 R1_w=inv(id=0,umax_value=0,var_off=(0x0; 0x7fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
15: R0_w=inv1 R1_w=inv(id=0,umax_value=0,var_off=(0x0; 0x7fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
15: (95) exit
16: R0_w=inv1 R1_w=inv(id=0,umin_value=1,umax_value=25769803777,var_off=(0x0; 0x77fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
16: (47) r1 |= 0
17: R0_w=inv1 R1_w=inv(id=0,umin_value=1,umax_value=32212254719,var_off=(0x1; 0x700000000),s32_max_value=1,u32_max_value=1) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
[...]
The bound tests on the map value force the upper unsigned bound to be 25769803777
in 64 bit (0b11000000000000000000000000000000001) and then lower one to be 1. By
using OR they are truncated and thus result in the range [1,1] for the 32 bit reg
tracker. This is incorrect given the only thing we know is that the value must be
positive and thus 2147483647 (0b1111111111111111111111111111111) at max for the
subregs. Fix it by using the {u,s}32_{min,max}_value vars instead. This also makes
sense, for example, for the case where we update dst_reg->s32_{min,max}_value in
the else branch we need to use the newly computed dst_reg->u32_{min,max}_value as
we know that these are positive. Previously, in the else branch the 64 bit values
of umin_value=1 and umax_value=32212254719 were used and latter got truncated to
be 1 as upper bound there. After the fix the subreg range is now correct:
[...]
8: R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=mmmmmmmm
8: (79) r1 = *(u64 *)(r0 +0)
R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=mmmmmmmm
9: R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R1_w=inv(id=0) R10=fp0 fp-8=mmmmmmmm
9: (b7) r0 = 1
10: R0_w=inv1 R1_w=inv(id=0) R10=fp0 fp-8=mmmmmmmm
10: (18) r2 = 0x600000002
12: R0_w=inv1 R1_w=inv(id=0) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
12: (ad) if r1 < r2 goto pc+1
R0_w=inv1 R1_w=inv(id=0,umin_value=25769803778) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
13: R0_w=inv1 R1_w=inv(id=0,umin_value=25769803778) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
13: (95) exit
14: R0_w=inv1 R1_w=inv(id=0,umax_value=25769803777,var_off=(0x0; 0x7ffffffff)) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
14: (25) if r1 > 0x0 goto pc+1
R0_w=inv1 R1_w=inv(id=0,umax_value=0,var_off=(0x0; 0x7fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
15: R0_w=inv1 R1_w=inv(id=0,umax_value=0,var_off=(0x0; 0x7fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
15: (95) exit
16: R0_w=inv1 R1_w=inv(id=0,umin_value=1,umax_value=25769803777,var_off=(0x0; 0x77fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
16: (47) r1 |= 0
17: R0_w=inv1 R1_w=inv(id=0,umin_value=1,umax_value=32212254719,var_off=(0x0; 0x77fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
[...]
Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Reported-by: Simon Scannell <scannell.smn@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
|
|
Kernel threads intentionally do CLONE_FS in order to follow any changes
that 'init' does to set up the root directory (or cwd).
It is admittedly a bit odd, but it avoids the situation where 'init'
does some extensive setup to initialize the system environment, and then
we execute a usermode helper program, and it uses the original FS setup
from boot time that may be very limited and incomplete.
[ Both Al Viro and Eric Biederman point out that 'pivot_root()' will
follow the root regardless, since it fixes up other users of root (see
chroot_fs_refs() for details), but overmounting root and doing a
chroot() would not. ]
However, Vegard Nossum noticed that the CLONE_FS not only means that we
follow the root and current working directories, it also means we share
umask with whatever init changed it to. That wasn't intentional.
Just reset umask to the original default (0022) before actually starting
the usermode helper program.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull networking fixes from David Miller:
1) Make sure SKB control block is in the proper state during IPSEC
ESP-in-TCP encapsulation. From Sabrina Dubroca.
2) Various kinds of attributes were not being cloned properly when we
build new xfrm_state objects from existing ones. Fix from Antony
Antony.
3) Make sure to keep BTF sections, from Tony Ambardar.
4) TX DMA channels need proper locking in lantiq driver, from Hauke
Mehrtens.
5) Honour route MTU during forwarding, always. From Maciej
Żenczykowski.
6) Fix races in kTLS which can result in crashes, from Rohit
Maheshwari.
7) Skip TCP DSACKs with rediculous sequence ranges, from Priyaranjan
Jha.
8) Use correct address family in xfrm state lookups, from Herbert Xu.
9) A bridge FDB flush should not clear out user managed fdb entries
with the ext_learn flag set, from Nikolay Aleksandrov.
10) Fix nested locking of netdev address lists, from Taehee Yoo.
11) Fix handling of 32-bit DATA_FIN values in mptcp, from Mat Martineau.
12) Fix r8169 data corruptions on RTL8402 chips, from Heiner Kallweit.
13) Don't free command entries in mlx5 while comp handler could still be
running, from Eran Ben Elisha.
14) Error flow of request_irq() in mlx5 is busted, due to an off by one
we try to free and IRQ never allocated. From Maor Gottlieb.
15) Fix leak when dumping netlink policies, from Johannes Berg.
16) Sendpage cannot be performed when a page is a slab page, or the page
count is < 1. Some subsystems such as nvme were doing so. Create a
"sendpage_ok()" helper and use it as needed, from Coly Li.
17) Don't leak request socket when using syncookes with mptcp, from
Paolo Abeni.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (111 commits)
net/core: check length before updating Ethertype in skb_mpls_{push,pop}
net: mvneta: fix double free of txq->buf
net_sched: check error pointer in tcf_dump_walker()
net: team: fix memory leak in __team_options_register
net: typhoon: Fix a typo Typoon --> Typhoon
net: hinic: fix DEVLINK build errors
net: stmmac: Modify configuration method of EEE timers
tcp: fix syn cookied MPTCP request socket leak
libceph: use sendpage_ok() in ceph_tcp_sendpage()
scsi: libiscsi: use sendpage_ok() in iscsi_tcp_segment_map()
drbd: code cleanup by using sendpage_ok() to check page for kernel_sendpage()
tcp: use sendpage_ok() to detect misused .sendpage
nvme-tcp: check page by sendpage_ok() before calling kernel_sendpage()
net: add WARN_ONCE in kernel_sendpage() for improper zero-copy send
net: introduce helper sendpage_ok() in include/linux/net.h
net: usb: pegasus: Proper error handing when setting pegasus' MAC address
net: core: document two new elements of struct net_device
netlink: fix policy dump leak
net/mlx5e: Fix race condition on nhe->n pointer in neigh update
net/mlx5e: Fix VLAN create flow
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Two tracing fixes:
- Fix temp buffer accounting that caused a WARNING for
ftrace_dump_on_opps()
- Move the recursion check in one of the function callback helpers to
the beginning of the function, as if the rcu_is_watching() gets
traced, it will cause a recursive loop that will crash the kernel"
* tag 'trace-v5.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Move RCU is watching check after recursion check
tracing: Fix trace_find_next_entry() accounting of temp buffer size
|
|
Alexei Starovoitov says:
====================
pull-request: bpf 2020-09-29
The following pull-request contains BPF updates for your *net* tree.
We've added 7 non-merge commits during the last 14 day(s) which contain
a total of 7 files changed, 28 insertions(+), 8 deletions(-).
The main changes are:
1) fix xdp loading regression in libbpf for old kernels, from Andrii.
2) Do not discard packet when NETDEV_TX_BUSY, from Magnus.
3) Fix corner cases in libbpf related to endianness and kconfig, from Tony.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|