| Age | Commit message (Collapse) | Author |
|
[ Upstream commit f0dce1d9b7c81fc3dc9d0cc0bc7ef9b3eae22584 ]
Use kvmalloc/kvfree for temporary value when manipulating a map via
syscall. kmalloc might not be sufficient for percpu maps where the value
is big (and further multiplied by hundreds of CPUs).
Can be reproduced with netcnt test on qemu with "-smp 255".
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210818235216.1159202-1-sdf@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit fda7a38714f40b635f5502ec4855602c6b33dad2 upstream.
1. The ufd in generic_map_update_batch() should be read from batch.map_fd;
2. A call to fdget() should be followed by a symmetric call to fdput().
Fixes: aa2e93b8e58e ("bpf: Add generic support for update and delete batch ops")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211019032934.1210517-1-xukuohai@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 54713c85f536048e685258f880bf298a74c3620d upstream.
Lorenzo noticed that the code testing for program type compatibility of
tail call maps is potentially racy in that two threads could encounter a
map with an unset type simultaneously and both return true even though they
are inserting incompatible programs.
The race window is quite small, but artificially enlarging it by adding a
usleep_range() inside the check in bpf_prog_array_compatible() makes it
trivial to trigger from userspace with a program that does, essentially:
map_fd = bpf_create_map(BPF_MAP_TYPE_PROG_ARRAY, 4, 4, 2, 0);
pid = fork();
if (pid) {
key = 0;
value = xdp_fd;
} else {
key = 1;
value = tc_fd;
}
err = bpf_map_update_elem(map_fd, &key, &value, 0);
While the race window is small, it has potentially serious ramifications in
that triggering it would allow a BPF program to tail call to a program of a
different type. So let's get rid of it by protecting the update with a
spinlock. The commit in the Fixes tag is the last commit that touches the
code in question.
v2:
- Use a spinlock instead of an atomic variable and cmpxchg() (Alexei)
v3:
- Put lock and the members it protects into an embedded 'owner' struct (Daniel)
Fixes: 3324b584b6f6 ("ebpf: misc core cleanup")
Reported-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211026110019.363464-1-toke@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 04f8ef5643bcd8bcde25dfdebef998aea480b2ba upstream.
When enabling CONFIG_CGROUP_BPF, kmemleak can be observed by running
the command as below:
$mount -t cgroup -o none,name=foo cgroup cgroup/
$umount cgroup/
unreferenced object 0xc3585c40 (size 64):
comm "mount", pid 425, jiffies 4294959825 (age 31.990s)
hex dump (first 32 bytes):
01 00 00 80 84 8c 28 c0 00 00 00 00 00 00 00 00 ......(.........
00 00 00 00 00 00 00 00 6c 43 a0 c3 00 00 00 00 ........lC......
backtrace:
[<e95a2f9e>] cgroup_bpf_inherit+0x44/0x24c
[<1f03679c>] cgroup_setup_root+0x174/0x37c
[<ed4b0ac5>] cgroup1_get_tree+0x2c0/0x4a0
[<f85b12fd>] vfs_get_tree+0x24/0x108
[<f55aec5c>] path_mount+0x384/0x988
[<e2d5e9cd>] do_mount+0x64/0x9c
[<208c9cfe>] sys_mount+0xfc/0x1f4
[<06dd06e0>] ret_fast_syscall+0x0/0x48
[<a8308cb3>] 0xbeb4daa8
This is because that since the commit 2b0d3d3e4fcf ("percpu_ref: reduce
memory footprint of percpu_ref in fast path") root_cgrp->bpf.refcnt.data
is allocated by the function percpu_ref_init in cgroup_bpf_inherit which
is called by cgroup_setup_root when mounting, but not freed along with
root_cgrp when umounting. Adding cgroup_bpf_offline which calls
percpu_ref_kill to cgroup_kill_sb can free root_cgrp->bpf.refcnt.data in
umount path.
This patch also fixes the commit 4bfc0bb2c60e ("bpf: decouple the lifetime
of cgroup_bpf from cgroup itself"). A cgroup_bpf_offline is needed to do a
cleanup that frees the resources which are allocated by cgroup_bpf_inherit
in cgroup_setup_root.
And inside cgroup_bpf_offline, cgroup_get() is at the beginning and
cgroup_put is at the end of cgroup_bpf_release which is called by
cgroup_bpf_offline. So cgroup_bpf_offline can keep the balance of
cgroup's refcount.
Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path")
Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Signed-off-by: Quanyang Wang <quanyang.wang@windriver.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20211018075623.26884-1-quanyang.wang@windriver.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 63acd42c0d4942f74710b11c38602fb14dea7320 ]
Commit f1a0a376ca0c ("sched/core: Initialize the idle task with
preemption disabled") removed the init_idle() call from
idle_thread_get(). This was the sole call-path on hotplug that resets
the Shadow Call Stack (scs) Stack Pointer (sp).
Not resetting the scs-sp leads to scs overflow after enough hotplug
cycles. Therefore add an explicit scs_task_reset() to the hotplug code
to make sure the scs-sp does get reset on hotplug.
Fixes: f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled")
Signed-off-by: Woody Lin <woodylin@google.com>
[peterz: Changelog]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lore.kernel.org/r/20211012083521.973587-1-woodylin@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 15bc01effefe97757ef02ca09e9d1b927ab22725 upstream.
In commit fda31c50292a ("signal: avoid double atomic counter
increments for user accounting") Linus made a clever optimization to
how rlimits and the struct user_struct. Unfortunately that
optimization does not work in the obvious way when moved to nested
rlimits. The problem is that the last decrement of the per user
namespace per user sigpending counter might also be the last decrement
of the sigpending counter in the parent user namespace as well. Which
means that simply freeing the leaf ucount in __free_sigqueue is not
enough.
Maintain the optimization and handle the tricky cases by introducing
inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.
By moving the entire optimization into functions that perform all of
the work it becomes possible to ensure that every level is handled
properly.
The new function inc_rlimit_get_ucounts returns 0 on failure to
increment the ucount. This is different than inc_rlimit_ucounts which
increments the ucounts and returns LONG_MAX if the ucount counter has
exceeded it's maximum or it wrapped (to indicate the counter needs to
decremented).
I wish we had a single user to account all pending signals to across
all of the threads of a process so this complexity was not necessary
Cc: stable@vger.kernel.org
Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
v1: https://lkml.kernel.org/r/87mtnavszx.fsf_-_@disp2133
Link: https://lkml.kernel.org/r/87fssytizw.fsf_-_@disp2133
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Tested-by: Rune Kleveland <rune.kleveland@infomedia.dk>
Tested-by: Yu Zhao <yuzhao@google.com>
Tested-by: Jordan Glover <Golden_Miller83@protonmail.ch>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 34dc2fd6e6908499b669c7b45320cddf38b332e1 upstream.
Instead of leaking the ucounts in new if alloc_ucounts fails, store
the result of alloc_ucounts into a temporary variable, which is later
assigned to new->ucounts.
Cc: stable@vger.kernel.org
Fixes: 905ae01c4ae2 ("Add a reference to ucounts for each cred")
Link: https://lkml.kernel.org/r/87pms2s0v8.fsf_-_@disp2133
Tested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 629715adc62b0ad27ab04d0aa73a71927f886910 upstream.
The purpose of inc_rlimit_ucounts and dec_rlimit_ucounts in commit_creds
is to change which rlimit counter is used to track a process when the
credentials changes.
Use the same test for both to guarantee the tracking is correct.
Cc: stable@vger.kernel.org
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Link: https://lkml.kernel.org/r/87v91us0w4.fsf_-_@disp2133
Tested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5ebcbe342b1c12fae44b4f83cbeae1520e09857e upstream.
Setting cred->ucounts in cred_alloc_blank does not make sense. The
uid and user_ns are deliberately not set in cred_alloc_blank but
instead the setting is delayed until key_change_session_keyring.
So move dealing with ucounts into key_change_session_keyring as well.
Unfortunately that movement of get_ucounts adds a new failure mode to
key_change_session_keyring. I do not see anything stopping the parent
process from calling setuid and changing the relevant part of it's
cred while keyctl_session_to_parent is running making it fundamentally
necessary to call get_ucounts in key_change_session_keyring. Which
means that the new failure mode cannot be avoided.
A failure of key_change_session_keyring results in a single threaded
parent keeping it's existing credentials. Which results in the parent
process not being able to access the session keyring and whichever
keys are in the new keyring.
Further get_ucounts is only expected to fail if the number of bits in
the refernece count for the structure is too few.
Since the code has no other way to report the failure of get_ucounts
and because such failures are not expected to be common add a WARN_ONCE
to report this problem to userspace.
Between the WARN_ONCE and the parent process not having access to
the keys in the new session keyring I expect any failure of get_ucounts
will be noticed and reported and we can find another way to handle this
condition. (Possibly by just making ucounts->count an atomic_long_t).
Cc: stable@vger.kernel.org
Fixes: 905ae01c4ae2 ("Add a reference to ucounts for each cred")
Link: https://lkml.kernel.org/r/7k0ias0uf.fsf_-_@disp2133
Tested-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6e3ee990c90494561921c756481d0e2125d8b895 upstream.
Fix possible null-pointer dereference in audit_filter_rules.
audit_filter_rules() error: we previously assumed 'ctx' could be null
Cc: stable@vger.kernel.org
Fixes: bf361231c295 ("audit: add saddr_fam filter field")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ed65df63a39a3f6ed04f7258de8b6789e5021c18 upstream.
While writing an email explaining the "bit = 0" logic for a discussion on
making ftrace_test_recursion_trylock() disable preemption, I discovered a
path that makes the "not do the logic if bit is zero" unsafe.
The recursion logic is done in hot paths like the function tracer. Thus,
any code executed causes noticeable overhead. Thus, tricks are done to try
to limit the amount of code executed. This included the recursion testing
logic.
Having recursion testing is important, as there are many paths that can
end up in an infinite recursion cycle when tracing every function in the
kernel. Thus protection is needed to prevent that from happening.
Because it is OK to recurse due to different running context levels (e.g.
an interrupt preempts a trace, and then a trace occurs in the interrupt
handler), a set of bits are used to know which context one is in (normal,
softirq, irq and NMI). If a recursion occurs in the same level, it is
prevented*.
Then there are infrastructure levels of recursion as well. When more than
one callback is attached to the same function to trace, it calls a loop
function to iterate over all the callbacks. Both the callbacks and the
loop function have recursion protection. The callbacks use the
"ftrace_test_recursion_trylock()" which has a "function" set of context
bits to test, and the loop function calls the internal
trace_test_and_set_recursion() directly, with an "internal" set of bits.
If an architecture does not implement all the features supported by ftrace
then the callbacks are never called directly, and the loop function is
called instead, which will implement the features of ftrace.
Since both the loop function and the callbacks do recursion protection, it
was seemed unnecessary to do it in both locations. Thus, a trick was made
to have the internal set of recursion bits at a more significant bit
location than the function bits. Then, if any of the higher bits were set,
the logic of the function bits could be skipped, as any new recursion
would first have to go through the loop function.
This is true for architectures that do not support all the ftrace
features, because all functions being traced must first go through the
loop function before going to the callbacks. But this is not true for
architectures that support all the ftrace features. That's because the
loop function could be called due to two callbacks attached to the same
function, but then a recursion function inside the callback could be
called that does not share any other callback, and it will be called
directly.
i.e.
traced_function_1: [ more than one callback tracing it ]
call loop_func
loop_func:
trace_recursion set internal bit
call callback
callback:
trace_recursion [ skipped because internal bit is set, return 0 ]
call traced_function_2
traced_function_2: [ only traced by above callback ]
call callback
callback:
trace_recursion [ skipped because internal bit is set, return 0 ]
call traced_function_2
[ wash, rinse, repeat, BOOM! out of shampoo! ]
Thus, the "bit == 0 skip" trick is not safe, unless the loop function is
call for all functions.
Since we want to encourage architectures to implement all ftrace features,
having them slow down due to this extra logic may encourage the
maintainers to update to the latest ftrace features. And because this
logic is only safe for them, remove it completely.
[*] There is on layer of recursion that is allowed, and that is to allow
for the transition between interrupt context (normal -> softirq ->
irq -> NMI), because a trace may occur before the context update is
visible to the trace recursion logic.
Link: https://lore.kernel.org/all/609b565a-ed6e-a1da-f025-166691b5d994@linux.alibaba.com/
Link: https://lkml.kernel.org/r/20211018154412.09fcad3c@gandalf.local.home
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Jisheng Zhang <jszhang@kernel.org>
Cc: =?utf-8?b?546L6LSH?= <yun.wang@linux.alibaba.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: stable@vger.kernel.org
Fixes: edc15cafcbfa3 ("tracing: Avoid unnecessary multiple recursion checks")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 293d92cbbd2418ca2ba43fed07f1b92e884d1c77 ]
The following warning occurred sporadically on s390:
DMA-API: nvme 0006:00:00.0: device driver maps memory from kernel text or rodata [addr=0000000048cc5e2f] [len=131072]
WARNING: CPU: 4 PID: 825 at kernel/dma/debug.c:1083 check_for_illegal_area+0xa8/0x138
It is a false-positive warning, due to broken logic in debug_dma_map_sg().
check_for_illegal_area() checks for overlay of sg elements with kernel text
or rodata. It is called with sg_dma_len(s) instead of s->length as
parameter. After the call to ->map_sg(), sg_dma_len() will contain the
length of possibly combined sg elements in the DMA address space, and not
the individual sg element length, which would be s->length.
The check will then use the physical start address of an sg element, and
add the DMA length for the overlap check, which could result in the false
warning, because the DMA length can be larger than the actual single sg
element length.
In addition, the call to check_for_illegal_area() happens in the iteration
over mapped_ents, which will not include all individual sg elements if
any of them were combined in ->map_sg().
Fix this by using s->length instead of sg_dma_len(s). Also put the call to
check_for_illegal_area() in a separate loop, iterating over all the
individual sg elements ("nents" instead of "mapped_ents").
While at it, as suggested by Robin Murphy, also move check_for_stack()
inside the new loop, as it is similarly concerned with validating the
individual sg elements.
Link: https://lore.kernel.org/lkml/20210705185252.4074653-1-gerald.schaefer@linux.ibm.com
Fixes: 884d05970bfb ("dma-debug: use sg_dma_len accessor")
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 424b650f35c77defbb3cbd6e5221d3697af42250 upstream.
The compiler warns when the data are actually unused:
kernel/trace/trace.c:1712:13: error: ‘trace_create_maxlat_file’ defined but not used [-Werror=unused-function]
1712 | static void trace_create_maxlat_file(struct trace_array *tr,
| ^~~~~~~~~~~~~~~~~~~~~~~~
[Why]
CONFIG_HWLAT_TRACER=n, CONFIG_TRACER_MAX_TRACE=n, CONFIG_OSNOISE_TRACER=y
gcc report warns.
[How]
Now trace_create_maxlat_file will only take effect when
CONFIG_HWLAT_TRACER=y or CONFIG_TRACER_MAX_TRACE=y. In fact, after
adding osnoise trace, it also needs to take effect.
Link: https://lore.kernel.org/all/c1d9e328-ad7c-920b-6c24-9e1598a6421c@infradead.org/
Link: https://lkml.kernel.org/r/20210922025122.3268022-1-liu.yun@linux.dev
Fixes: bce29ac9ce0b ("trace: Add osnoise tracer")
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0d67e332e6df72f43eaa21228daa3a79e23093f3 upstream.
When CONFIG_MODULE_UNLOAD is disabled, the module->exit member
is not defined, causing a build failure:
kernel/module.c:4493:8: error: no member named 'exit' in 'struct module'
mod->exit = *exit;
add an #ifdef block around this.
Fixes: cf68fffb66d6 ("add support for Clang CFI")
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit f792565326825ed806626da50c6f9a928f1079c1 ]
Users of rdpmc rely on the mmapped user page to calculate accurate
time_enabled. Currently, userpage->time_enabled is only updated when the
event is added to the pmu. As a result, inactive event (due to counter
multiplexing) does not have accurate userpage->time_enabled. This can
be reproduced with something like:
/* open 20 task perf_event "cycles", to create multiplexing */
fd = perf_event_open(); /* open task perf_event "cycles" */
userpage = mmap(fd); /* use mmap and rdmpc */
while (true) {
time_enabled_mmap = xxx; /* use logic in perf_event_mmap_page */
time_enabled_read = read(fd).time_enabled;
if (time_enabled_mmap > time_enabled_read)
BUG();
}
Fix this by updating userpage for inactive events in merge_sched_in.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-and-tested-by: Lucian Grijincu <lucian@fb.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210929194313.2398474-1-songliubraving@fb.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 30e29a9a2bc6a4888335a6ede968b75cd329657a ]
In prealloc_elems_and_freelist(), the multiplication to calculate the
size passed to bpf_map_area_alloc() could lead to an integer overflow.
As a result, out-of-bounds write could occur in pcpu_freelist_populate()
as reported by KASAN:
[...]
[ 16.968613] BUG: KASAN: slab-out-of-bounds in pcpu_freelist_populate+0xd9/0x100
[ 16.969408] Write of size 8 at addr ffff888104fc6ea0 by task crash/78
[ 16.970038]
[ 16.970195] CPU: 0 PID: 78 Comm: crash Not tainted 5.15.0-rc2+ #1
[ 16.970878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 16.972026] Call Trace:
[ 16.972306] dump_stack_lvl+0x34/0x44
[ 16.972687] print_address_description.constprop.0+0x21/0x140
[ 16.973297] ? pcpu_freelist_populate+0xd9/0x100
[ 16.973777] ? pcpu_freelist_populate+0xd9/0x100
[ 16.974257] kasan_report.cold+0x7f/0x11b
[ 16.974681] ? pcpu_freelist_populate+0xd9/0x100
[ 16.975190] pcpu_freelist_populate+0xd9/0x100
[ 16.975669] stack_map_alloc+0x209/0x2a0
[ 16.976106] __sys_bpf+0xd83/0x2ce0
[...]
The possibility of this overflow was originally discussed in [0], but
was overlooked.
Fix the integer overflow by changing elem_size to u64 from u32.
[0] https://lore.kernel.org/bpf/728b238e-a481-eb50-98e9-b0f430ab01e7@gmail.com/
Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
Signed-off-by: Tatsuhiko Yasumatsu <th.yasumatsu@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210930135545.173698-1-th.yasumatsu@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 703066188f63d66cc6b9d678e5b5ef1213c5938e ]
This patch null-terminates the temporary buffer in sched_scaling_write()
so kstrtouint() does not return failure and checks the value is valid.
Before:
$ cat /sys/kernel/debug/sched/tunable_scaling
1
$ echo 0 > /sys/kernel/debug/sched/tunable_scaling
-bash: echo: write error: Invalid argument
$ cat /sys/kernel/debug/sched/tunable_scaling
1
After:
$ cat /sys/kernel/debug/sched/tunable_scaling
1
$ echo 0 > /sys/kernel/debug/sched/tunable_scaling
$ cat /sys/kernel/debug/sched/tunable_scaling
0
$ echo 3 > /sys/kernel/debug/sched/tunable_scaling
-bash: echo: write error: Invalid argument
Fixes: 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20210927114635.GH3959@techsingularity.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2630cde26711dab0d0b56a8be1616475be646d13 ]
Since commit a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to
list on unthrottle") we add cfs_rqs with no runnable tasks but not fully
decayed into the load (leaf) list. We may ignore adding some ancestors
and therefore breaking tmp_alone_branch invariant. This broke LTP test
cfs_bandwidth01 and it was partially fixed in commit fdaba61ef8a2
("sched/fair: Ensure that the CFS parent is added after unthrottling").
I noticed the named test still fails even with the fix (but with low
probability, 1 in ~1000 executions of the test). The reason is when
bailing out of unthrottle_cfs_rq early, we may miss adding ancestors of
the unthrottled cfs_rq, thus, not joining tmp_alone_branch properly.
Fix this by adding ancestors if we notice the unthrottled cfs_rq was
added to the load list.
Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Odin Ugedal <odin@uged.al>
Link: https://lore.kernel.org/r/20210917153037.11176-1-mkoutny@suse.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8a98ae12fbefdb583a7696de719a1d57e5e940a2 ]
When introducing CAP_BPF, bpf_jit_charge_modmem() was not changed to treat
programs with CAP_BPF as privileged for the purpose of JIT memory allocation.
This means that a program without CAP_BPF can block a program with CAP_BPF
from loading a program.
Fix this by checking bpf_capable() in bpf_jit_charge_modmem().
Fixes: 2c78ee898d8f ("bpf: Implement CAP_BPF")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210922111153.19843-1-lmb@cloudflare.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 356ed64991c6847a0c4f2e8fa3b1133f7a14f1fc ]
Currently if a function ptr in struct_ops has a return value, its
caller will get a random return value from it, because the return
value of related BPF_PROG_TYPE_STRUCT_OPS prog is just dropped.
So adding a new flag BPF_TRAMP_F_RET_FENTRY_RET to tell bpf trampoline
to save and return the return value of struct_ops prog if ret_size of
the function ptr is greater than 0. Also restricting the flag to be
used alone.
Fixes: 85d33df357b6 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210914023351.3664499-1-houtao1@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e5c6b312ce3cc97e90ea159446e6bfa06645364d ]
The struct sugov_tunables is protected by the kobject, so we can't free
it directly. Otherwise we would get a call trace like this:
ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x30
WARNING: CPU: 3 PID: 720 at lib/debugobjects.c:505 debug_print_object+0xb8/0x100
Modules linked in:
CPU: 3 PID: 720 Comm: a.sh Tainted: G W 5.14.0-rc1-next-20210715-yocto-standard+ #507
Hardware name: Marvell OcteonTX CN96XX board (DT)
pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--)
pc : debug_print_object+0xb8/0x100
lr : debug_print_object+0xb8/0x100
sp : ffff80001ecaf910
x29: ffff80001ecaf910 x28: ffff00011b10b8d0 x27: ffff800011043d80
x26: ffff00011a8f0000 x25: ffff800013cb3ff0 x24: 0000000000000000
x23: ffff80001142aa68 x22: ffff800011043d80 x21: ffff00010de46f20
x20: ffff800013c0c520 x19: ffff800011d8f5b0 x18: 0000000000000010
x17: 6e6968207473696c x16: 5f72656d6974203a x15: 6570797420746365
x14: 6a626f2029302065 x13: 303378302f307830 x12: 2b6e665f72656d69
x11: ffff8000124b1560 x10: ffff800012331520 x9 : ffff8000100ca6b0
x8 : 000000000017ffe8 x7 : c0000000fffeffff x6 : 0000000000000001
x5 : ffff800011d8c000 x4 : ffff800011d8c740 x3 : 0000000000000000
x2 : ffff0001108301c0 x1 : ab3c90eedf9c0f00 x0 : 0000000000000000
Call trace:
debug_print_object+0xb8/0x100
__debug_check_no_obj_freed+0x1c0/0x230
debug_check_no_obj_freed+0x20/0x88
slab_free_freelist_hook+0x154/0x1c8
kfree+0x114/0x5d0
sugov_exit+0xbc/0xc0
cpufreq_exit_governor+0x44/0x90
cpufreq_set_policy+0x268/0x4a8
store_scaling_governor+0xe0/0x128
store+0xc0/0xf0
sysfs_kf_write+0x54/0x80
kernfs_fop_write_iter+0x128/0x1c0
new_sync_write+0xf0/0x190
vfs_write+0x2d4/0x478
ksys_write+0x74/0x100
__arm64_sys_write+0x24/0x30
invoke_syscall.constprop.0+0x54/0xe0
do_el0_svc+0x64/0x158
el0_svc+0x2c/0xb0
el0t_64_sync_handler+0xb0/0xb8
el0t_64_sync+0x198/0x19c
irq event stamp: 5518
hardirqs last enabled at (5517): [<ffff8000100cbd7c>] console_unlock+0x554/0x6c8
hardirqs last disabled at (5518): [<ffff800010fc0638>] el1_dbg+0x28/0xa0
softirqs last enabled at (5504): [<ffff8000100106e0>] __do_softirq+0x4d0/0x6c0
softirqs last disabled at (5483): [<ffff800010049548>] irq_exit+0x1b0/0x1b8
So split the original sugov_tunables_free() into two functions,
sugov_clear_global_tunables() is just used to clear the global_tunables
and the new sugov_tunables_free() is used as kobj_type::release to
release the sugov_tunables safely.
Fixes: 9bdcb44e391d ("cpufreq: schedutil: New governor based on scheduler utilization data")
Cc: 4.7+ <stable@vger.kernel.org> # 4.7+
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0e6491b559704da720f6da09dd0a52c4df44c514 ]
Commit 7661809d493b ("mm: don't allow oversized kvmalloc() calls") add the
oversize check. When the allocation is larger than what kmalloc() supports,
the following warning triggered:
WARNING: CPU: 0 PID: 8408 at mm/util.c:597 kvmalloc_node+0x108/0x110 mm/util.c:597
Modules linked in:
CPU: 0 PID: 8408 Comm: syz-executor221 Not tainted 5.14.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:kvmalloc_node+0x108/0x110 mm/util.c:597
Call Trace:
kvmalloc include/linux/mm.h:806 [inline]
kvmalloc_array include/linux/mm.h:824 [inline]
kvcalloc include/linux/mm.h:829 [inline]
check_btf_line kernel/bpf/verifier.c:9925 [inline]
check_btf_info kernel/bpf/verifier.c:10049 [inline]
bpf_check+0xd634/0x150d0 kernel/bpf/verifier.c:13759
bpf_prog_load kernel/bpf/syscall.c:2301 [inline]
__sys_bpf+0x11181/0x126e0 kernel/bpf/syscall.c:4587
__do_sys_bpf kernel/bpf/syscall.c:4691 [inline]
__se_sys_bpf kernel/bpf/syscall.c:4689 [inline]
__x64_sys_bpf+0x78/0x90 kernel/bpf/syscall.c:4689
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Reported-by: syzbot+f3e749d4c662818ae439@syzkaller.appspotmail.com
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210911005557.45518-1-cuibixuan@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 510e1a724ab1bf38150be2c1acabb303f98d0047 ]
For some drivers, that use the DMA API. This error message can be reached
several millions of times per second, causing spam to the kernel's printk
buffer and bringing the CPU usage up to 100% (so, it should be rate
limited). However, since there is at least one driver that is in the
mainline and suffers from the error condition, it is more useful to
err_printk() here instead of just rate limiting the error message (in hopes
that it will make it easier for other drivers that suffer from this issue
to be spotted).
Link: https://lkml.kernel.org/r/fd67fbac-64bf-f0ea-01e1-5938ccfab9d0@arm.com
Reported-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Hamza Mahfooz <someguy@effective-light.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5afedf670caf30a2b5a52da96eb7eac7dee6a9c9 ]
There is an use-after-free problem triggered by following process:
P1(sda) P2(sdb)
echo 0 > /sys/block/sdb/trace/enable
blk_trace_remove_queue
synchronize_rcu
blk_trace_free
relay_close
rcu_read_lock
__blk_add_trace
trace_note_tsk
(Iterate running_trace_list)
relay_close_buf
relay_destroy_buf
kfree(buf)
trace_note(sdb's bt)
relay_reserve
buf->offset <- nullptr deference (use-after-free) !!!
rcu_read_unlock
[ 502.714379] BUG: kernel NULL pointer dereference, address:
0000000000000010
[ 502.715260] #PF: supervisor read access in kernel mode
[ 502.715903] #PF: error_code(0x0000) - not-present page
[ 502.716546] PGD 103984067 P4D 103984067 PUD 17592b067 PMD 0
[ 502.717252] Oops: 0000 [#1] SMP
[ 502.720308] RIP: 0010:trace_note.isra.0+0x86/0x360
[ 502.732872] Call Trace:
[ 502.733193] __blk_add_trace.cold+0x137/0x1a3
[ 502.733734] blk_add_trace_rq+0x7b/0xd0
[ 502.734207] blk_add_trace_rq_issue+0x54/0xa0
[ 502.734755] blk_mq_start_request+0xde/0x1b0
[ 502.735287] scsi_queue_rq+0x528/0x1140
...
[ 502.742704] sg_new_write.isra.0+0x16e/0x3e0
[ 502.747501] sg_ioctl+0x466/0x1100
Reproduce method:
ioctl(/dev/sda, BLKTRACESETUP, blk_user_trace_setup[buf_size=127])
ioctl(/dev/sda, BLKTRACESTART)
ioctl(/dev/sdb, BLKTRACESETUP, blk_user_trace_setup[buf_size=127])
ioctl(/dev/sdb, BLKTRACESTART)
echo 0 > /sys/block/sdb/trace/enable &
// Add delay(mdelay/msleep) before kernel enters blk_trace_free()
ioctl$SG_IO(/dev/sda, SG_IO, ...)
// Enters trace_note_tsk() after blk_trace_free() returned
// Use mdelay in rcu region rather than msleep(which may schedule out)
Remove blk_trace from running_list before calling blk_trace_free() by
sysfs if blk_trace is at Blktrace_running state.
Fixes: c71a896154119f ("blktrace: add ftrace plugin")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Link: https://lore.kernel.org/r/20210923134921.109194-1-chengzhihao1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 8646e53633f314e4d746a988240d3b951a92f94a upstream.
Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to
transferring to a KVM guest, which is roughly equivalent to an exit to
userspace and processes many of the same pending actions. While the task
cannot be in an rseq critical section as the KVM path is reachable only
by via ioctl(KVM_RUN), the side effects that apply to rseq outside of a
critical section still apply, e.g. the current CPU needs to be updated if
the task is migrated.
Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults
and other badness in userspace VMMs that use rseq in combination with KVM,
e.g. due to the CPU ID being stale after task migration.
Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function")
Reported-by: Peter Foley <pefoley@google.com>
Bisected-by: Doug Evans <dje@google.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210901203030.1292304-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 9848417926353daa59d2b05eb26e185063dbac6e ]
The intel powerclamp driver will setup a per-CPU worker with RT
priority. The worker will then invoke play_idle() in which it remains in
the idle poll loop until it is stopped by the timer it started earlier.
That timer needs to expire in hard interrupt context on PREEMPT_RT.
Otherwise the timer will expire in ksoftirqd as a SOFT timer but that task
won't be scheduled on the CPU because its priority is lower than the
priority of the worker which is in the idle loop.
Always expire the idle timer in hard interrupt context.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210906113034.jgfxrjdvxnjqgtmc@linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit cfd799837dbc48499abb05d1891b3d9992354d3a ]
Since the commit e5efaeb8a8f5 ("bootconfig: Support mixing
a value and subkeys under a key") allows to co-exist a value
node and key nodes under a node, xbc_node_for_each_child()
is not only returning key node but also a value node.
In the boot-time tracing using xbc_node_for_each_child() to
iterate the events, groups and instances, but those must be
key nodes. Thus it must use xbc_node_for_each_subkey().
Link: https://lkml.kernel.org/r/163112988361.74896.2267026262061819145.stgit@devnote2
Fixes: e5efaeb8a8f5 ("bootconfig: Support mixing a value and subkeys under a key")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 2d186afd04d669fe9c48b994c41a7405a3c9f16d upstream.
Syzbot reported shift-out-of-bounds bug in profile_init().
The problem was in incorrect prof_shift. Since prof_shift value comes from
userspace we need to clamp this value into [0, BITS_PER_LONG -1]
boundaries.
Second possible shiht-out-of-bounds was found by Tetsuo:
sample_step local variable in read_profile() had "unsigned int" type,
but prof_shift allows to make a BITS_PER_LONG shift. So, to prevent
possible shiht-out-of-bounds sample_step type was changed to
"unsigned long".
Also, "unsigned short int" will be sufficient for storing
[0, BITS_PER_LONG] value, that's why there is no need for
"unsigned long" prof_shift.
Link: https://lkml.kernel.org/r/20210813140022.5011-1-paskripkin@gmail.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-and-tested-by: syzbot+e68c89a9510c159d9684@syzkaller.appspotmail.com
Suggested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e1fbbd073137a9d63279f6bf363151a938347640 upstream.
Keno Fischer reported that when a binray loaded via ld-linux-x the
prctl(PR_SET_MM_MAP) doesn't allow to setup brk value because it lays
before mm:end_data.
For example a test program shows
| # ~/t
|
| start_code 401000
| end_code 401a15
| start_stack 7ffce4577dd0
| start_data 403e10
| end_data 40408c
| start_brk b5b000
| sbrk(0) b5b000
and when executed via ld-linux
| # /lib64/ld-linux-x86-64.so.2 ~/t
|
| start_code 7fc25b0a4000
| end_code 7fc25b0c4524
| start_stack 7fffcc6b2400
| start_data 7fc25b0ce4c0
| end_data 7fc25b0cff98
| start_brk 55555710c000
| sbrk(0) 55555710c000
This of course prevent criu from restoring such programs. Looking into
how kernel operates with brk/start_brk inside brk() syscall I don't see
any problem if we allow to setup brk/start_brk without checking for
end_data. Even if someone pass some weird address here on a purpose then
the worst possible result will be an unexpected unmapping of existing vma
(own vma, since prctl works with the callers memory) but test for
RLIMIT_DATA is still valid and a user won't be able to gain more memory in
case of expanding VMAs via new values shipped with prctl call.
Link: https://lkml.kernel.org/r/20210121221207.GB2174@grain
Fixes: bbdc6076d2e5 ("binfmt_elf: move brk out of mmap when doing direct loader exec")
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reported-by: Keno Fischer <keno@juliacomputing.com>
Acked-by: Andrey Vagin <avagin@gmail.com>
Tested-by: Andrey Vagin <avagin@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 6fe7c745f2acb73e4cc961d7f91125eef5a8861f ]
Fixes a build error when CONFIG_HIST_TRIGGERS=n with boot-time
tracing. Since the trigger_process_regex() is defined only
when CONFIG_HIST_TRIGGERS=y, if it is disabled, the 'actions'
event option also must be disabled.
Link: https://lkml.kernel.org/r/162856123376.203126.582144262622247352.stgit@devnote2
Fixes: 81a59555ff15 ("tracing/boot: Add per-event settings")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8e242060c6a4947e8ae7d29794af6a581db08841 ]
Since kprobe_events and uprobe_events only check whether the
other same-type probe event has the same name or not, if the
user gives the same name of the existing tracepoint event (or
the other type of probe events), it silently fails to create
the tracefs entry (but registered.) as below.
/sys/kernel/tracing # ls events/task/task_rename
enable filter format hist id trigger
/sys/kernel/tracing # echo p:task/task_rename vfs_read >> kprobe_events
[ 113.048508] Could not create tracefs 'task_rename' directory
/sys/kernel/tracing # cat kprobe_events
p:task/task_rename vfs_read
To fix this issue, check whether the existing events have the
same name or not in trace_probe_register_event_call(). If exists,
it rejects to register the new event.
Link: https://lkml.kernel.org/r/162936876189.187130.17558311387542061930.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit b89a05b21f46150ac10a962aa50109250b56b03b upstream.
In perf_event_addr_filters_apply, the task associated with
the event (event->ctx->task) is read using READ_ONCE at the beginning
of the function, checked, and then re-read from event->ctx->task,
voiding all guarantees of the checks. Reuse the value that was read by
READ_ONCE to ensure the consistency of the task struct throughout the
function.
Fixes: 375637bc52495 ("perf/core: Introduce address range filtering")
Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210906015310.12802-1-baptiste.lepers@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4b6b08f2e45edda4c067ac40833e3c1f84383c0b upstream.
When start_kthread() return error, the cpus_read_unlock() need
to be called.
Link: https://lkml.kernel.org/r/20210831022919.27630-1-qiang.zhang@windriver.com
Cc: <stable@vger.kernel.org>
Fixes: c8895e271f79 ("trace/osnoise: Support hotplug operations")
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Qiang.Zhang <qiang.zhang@windriver.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fab827dbee8c2e06ca4ba000fa6c48bcf9054aba upstream.
Commit 5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg")
enabled memcg accounting for pids allocated from init_pid_ns.pid_cachep,
but forgot to adjust the setting for nested pid namespaces. As a result,
pid memory is not accounted exactly where it is really needed, inside
memcg-limited containers with their own pid namespaces.
Pid was one the first kernel objects enabled for memcg accounting.
init_pid_ns.pid_cachep marked by SLAB_ACCOUNT and we can expect that any
new pids in the system are memcg-accounted.
Though recently I've noticed that it is wrong. nested pid namespaces
creates own slab caches for pid objects, nested pids have increased size
because contain id both for all parent and for own pid namespaces. The
problem is that these slab caches are _NOT_ marked by SLAB_ACCOUNT, as a
result any pids allocated in nested pid namespaces are not
memcg-accounted.
Pid struct in nested pid namespace consumes up to 500 bytes memory, 100000
such objects gives us up to ~50Mb unaccounted memory, this allow container
to exceed assigned memcg limits.
Link: https://lkml.kernel.org/r/8b6de616-fd1a-02c6-cbdb-976ecdcfa604@virtuozzo.com
Fixes: 5d097056c9a0 ("kmemcg: account certain kmem allocations to memcg")
Cc: stable@vger.kernel.org
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 13db8c50477d83ad3e3b9b0ae247e5cd833a7ae4 upstream.
After fork, the child process will get incorrect (2x) hugetlb_usage. If
a process uses 5 2MB hugetlb pages in an anonymous mapping,
HugetlbPages: 10240 kB
and then forks, the child will show,
HugetlbPages: 20480 kB
The reason for double the amount is because hugetlb_usage will be copied
from the parent and then increased when we copy page tables from parent
to child. Child will have 2x actual usage.
Fix this by adding hugetlb_count_init in mm_init.
Link: https://lkml.kernel.org/r/20210826071742.877-1-liuzixian4@huawei.com
Fixes: 5d317b2b6536 ("mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status")
Signed-off-by: Liu Zixian <liuzixian4@huawei.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
console_unlock()
commit 11e4b63abbe23872b45f325a7c6c8b7f9ff42cad upstream.
The standard printk() tries to flush the message to the console
immediately. It tries to take the console lock. If the lock is
already taken then the current owner is responsible for flushing
even the new message.
There is a small race window between checking whether a new message is
available and releasing the console lock. It is solved by re-checking
the state after releasing the console lock. If the check is positive
then console_unlock() tries to take the lock again and process the new
message as well.
The commit 996e966640ddea7b535c ("printk: remove logbuf_lock") causes that
console_seq is not longer read atomically. As a result, the re-check might
be done with an inconsistent 64-bit index.
Solve it by using the last sequence number that has been checked under
the console lock. In the worst case, it will take the lock again only
to realized that the new message has already been proceed. But it
was possible even before.
The variable next_seq is marked as __maybe_unused to call down compiler
warning when CONFIG_PRINTK is not defined.
Fixes: commit 996e966640ddea7b535c ("printk: remove logbuf_lock")
Reported-by: kernel test robot <lkp@intel.com> # unused next_seq warning
Cc: stable@vger.kernel.org # 5.13
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://lore.kernel.org/r/20210702150657.26760-1-pmladek@suse.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit b41cda03765580caf7723b8c1b672d191c71013f ]
RT mutexes belong to the LD_WAIT_SLEEP class. Make them so.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210815211302.031014562@linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f728c4a9e8405caae69d4bc1232c54ff57b5d20f ]
In error handling branch "if (WARN_ON(node == NUMA_NO_NODE))", the
previously allocated memories are not released. Doing this before
allocating memory eliminates memory leaks.
tj: Note that the condition only occurs when the arch code is pretty broken
and the WARN_ON might as well be BUG_ON().
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit fed31a4dd3adb5455df7c704de2abb639a1dc1c0 ]
This commit fixes several typos where CONFIG_TASKS_RCU_TRACE should
instead be CONFIG_TASKS_TRACE_RCU. Among other things, these typos
could cause CONFIG_TASKS_TRACE_RCU_READ_MB=y kernels to suffer from
memory-ordering bugs that could result in false-positive quiescent
states and too-short grace periods.
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 173735c346c412d9f084825ecb04f24ada0e2986 ]
Due to link order, dma_debug_init is called before debugfs has a chance
to initialize (via debugfs_init which also happens in the core initcall
stage), so the directories for dma-debug are never created.
Decouple dma_debug_fs_init from dma_debug_init and defer its init until
core_initcall_sync (after debugfs has been initialized) while letting
dma-debug initialization occur as soon as possible to catch any early
mappings, as suggested in [1].
[1] https://lore.kernel.org/linux-iommu/YIgGa6yF%2Fadg8OSN@kroah.com/
Fixes: 15b28bbcd567 ("dma-debug: move initialization to common code")
Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 868ad33bfa3bf39960982682ad3a0f8ebda1656e upstream.
sched_setscheduler() and rt_mutex_setprio() invoke the run-queue balance
callback after changing priorities or the scheduling class of a task. The
run-queue for which the callback is invoked can be local or remote.
That's not a problem for the regular rq::push_work which is serialized with
a busy flag in the run-queue struct, but for the balance_push() work which
is only valid to be invoked on the outgoing CPU that's wrong. It not only
triggers the debug warning, but also leaves the per CPU variable push_work
unprotected, which can result in double enqueues on the stop machine list.
Remove the warning and validate that the function is invoked on the
outgoing CPU.
Fixes: ae7927023243 ("sched: Optimize finish_lock_switch()")
Reported-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/87zgt1hdw7.ffs@tglx
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This reverts commit 564005805aadec9cb7e5dc4e14071b8f87cd6b58 which is
commit 406dd42bd1ba0c01babf9cde169bb319e52f6147 upstream.
It is reported to cause regressions. A proposed fix has been posted,
but it is not in a released kernel yet. So just revert this from the
stable release so that the bug is fixed. If it's really needed we can
add it back in in a future release.
Link: https://lore.kernel.org/r/87ilz1pwaq.fsf@wylie.me.uk
Reported-by: "Alan J. Wylie" <alan@wylie.me.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d7af7e497f0308bc97809cc48b58e8e0f13887e1 ]
Fix a verifier bug found by smatch static checker in [0].
This problem has never been seen in prod to my best knowledge. Fixing it
still seems to be a good idea since it's hard to say for sure whether
it's possible or not to have a scenario where a combination of
convert_ctx_access() and a narrow load would lead to an out of bound
write.
When narrow load is handled, one or two new instructions are added to
insn_buf array, but before it was only checked that
cnt >= ARRAY_SIZE(insn_buf)
And it's safe to add a new instruction to insn_buf[cnt++] only once. The
second try will lead to out of bound write. And this is what can happen
if `shift` is set.
Fix it by making sure that if the BPF_RSH instruction has to be added in
addition to BPF_AND then there is enough space for two more instructions
in insn_buf.
The full report [0] is below:
kernel/bpf/verifier.c:12304 convert_ctx_accesses() warn: offset 'cnt' incremented past end of array
kernel/bpf/verifier.c:12311 convert_ctx_accesses() warn: offset 'cnt' incremented past end of array
kernel/bpf/verifier.c
12282
12283 insn->off = off & ~(size_default - 1);
12284 insn->code = BPF_LDX | BPF_MEM | size_code;
12285 }
12286
12287 target_size = 0;
12288 cnt = convert_ctx_access(type, insn, insn_buf, env->prog,
12289 &target_size);
12290 if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf) ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Bounds check.
12291 (ctx_field_size && !target_size)) {
12292 verbose(env, "bpf verifier is misconfigured\n");
12293 return -EINVAL;
12294 }
12295
12296 if (is_narrower_load && size < target_size) {
12297 u8 shift = bpf_ctx_narrow_access_offset(
12298 off, size, size_default) * 8;
12299 if (ctx_field_size <= 4) {
12300 if (shift)
12301 insn_buf[cnt++] = BPF_ALU32_IMM(BPF_RSH,
^^^^^
increment beyond end of array
12302 insn->dst_reg,
12303 shift);
--> 12304 insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg,
^^^^^
out of bounds write
12305 (1 << size * 8) - 1);
12306 } else {
12307 if (shift)
12308 insn_buf[cnt++] = BPF_ALU64_IMM(BPF_RSH,
12309 insn->dst_reg,
12310 shift);
12311 insn_buf[cnt++] = BPF_ALU64_IMM(BPF_AND, insn->dst_reg,
^^^^^^^^^^^^^^^
Same.
12312 (1ULL << size * 8) - 1);
12313 }
12314 }
12315
12316 new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
12317 if (!new_prog)
12318 return -ENOMEM;
12319
12320 delta += cnt - 1;
12321
12322 /* keep walking new program and skip insns we just inserted */
12323 env->prog = new_prog;
12324 insn = new_prog->insnsi + i + delta;
12325 }
12326
12327 return 0;
12328 }
[0] https://lore.kernel.org/bpf/20210817050843.GA21456@kili/
v1->v2:
- clarify that problem was only seen by static checker but not in prod;
Fixes: 46f53a65d2de ("bpf: Allow narrow loads with offset > 0")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210820163935.1902398-1-rdna@fb.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b2f6662ac08d0e7c25574ce53623c71bdae9dd78 ]
Invoking atomic_notifier_chain_notify() requires acquiring a spinlock_t,
which can block under CONFIG_PREEMPT_RT. Notifications for members of the
cpu_pm notification chain will be issued by the idle task, which can never
block.
Making *all* atomic_notifiers use a raw_spinlock is too big of a hammer, as
only notifications issued by the idle task are problematic.
Special-case cpu_pm_notifier_chain by kludging a raw_notifier and
raw_spinlock_t together, matching the atomic_notifier behavior with a
raw_spinlock_t.
Fixes: 70d932985757 ("notifier: Fix broken error handling pattern")
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6ba34d3c73674e46d9e126e4f0cee79e5ef2481c ]
The cpuset fields that manage partition root state do not strictly
follow the cpuset locking rule that update to cpuset has to be done
with both the callback_lock and cpuset_mutex held. This is now fixed
by making sure that the locking rule is upheld.
Fixes: 3881b86128d0 ("cpuset: Add an error state to cpuset.sched.partition")
Fixes: 4b842da276a8 ("cpuset: Make CPU hotplug work with partition")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0f3adb8a1e5f36e792598c1d77a2cfac9c90a4f9 ]
Use more descriptive variable names for update_prstate(), remove
unnecessary code and fix some typos. There is no functional change.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7fcc17d0cb12938d2b3507973a6f93fc9ed2c7a1 ]
The Energy Model (EM) provides useful information about device power in
each performance state to other subsystems like: Energy Aware Scheduler
(EAS). The energy calculation in EAS does arithmetic operation based on
the EM em_cpu_energy(). Current implementation of that function uses
em_perf_state::cost as a pre-computed cost coefficient equal to:
cost = power * max_frequency / frequency.
The 'power' is expressed in milli-Watts (or in abstract scale).
There are corner cases when the EAS energy calculation for two Performance
Domains (PDs) return the same value. The EAS compares these values to
choose smaller one. It might happen that this values are equal due to
rounding error. In such scenario, we need better resolution, e.g. 1000
times better. To provide this possibility increase the resolution in the
em_perf_state::cost for 64-bit architectures. The cost of increasing
resolution on 32-bit is pretty high (64-bit division) and is not justified
since there are no new 32bit big.LITTLE EAS systems expected which would
benefit from this higher resolution.
This patch allows to avoid the rounding to milli-Watt errors, which might
occur in EAS energy estimation for each PD. The rounding error is common
for small tasks which have small utilization value.
There are two places in the code where it makes a difference:
1. In the find_energy_efficient_cpu() where we are searching for
best_delta. We might suffer there when two PDs return the same result,
like in the example below.
Scenario:
Low utilized system e.g. ~200 sum_util for PD0 and ~220 for PD1. There
are quite a few small tasks ~10-15 util. These tasks would suffer for
the rounding error. These utilization values are typical when running games
on Android. One of our partners has reported 5..10mA less battery drain
when running with increased resolution.
Some details:
We have two PDs: PD0 (big) and PD1 (little)
Let's compare w/o patch set ('old') and w/ patch set ('new')
We are comparing energy w/ task and w/o task placed in the PDs
a) 'old' w/o patch set, PD0
task_util = 13
cost = 480
sum_util_w/o_task = 215
sum_util_w_task = 228
scale_cpu = 1024
energy_w/o_task = 480 * 215 / 1024 = 100.78 => 100
energy_w_task = 480 * 228 / 1024 = 106.87 => 106
energy_diff = 106 - 100 = 6
(this is equal to 'old' PD1's energy_diff in 'c)')
b) 'new' w/ patch set, PD0
task_util = 13
cost = 480 * 1000 = 480000
sum_util_w/o_task = 215
sum_util_w_task = 228
energy_w/o_task = 480000 * 215 / 1024 = 100781
energy_w_task = 480000 * 228 / 1024 = 106875
energy_diff = 106875 - 100781 = 6094
(this is not equal to 'new' PD1's energy_diff in 'd)')
c) 'old' w/o patch set, PD1
task_util = 13
cost = 160
sum_util_w/o_task = 283
sum_util_w_task = 293
scale_cpu = 355
energy_w/o_task = 160 * 283 / 355 = 127.55 => 127
energy_w_task = 160 * 296 / 355 = 133.41 => 133
energy_diff = 133 - 127 = 6
(this is equal to 'old' PD0's energy_diff in 'a)')
d) 'new' w/ patch set, PD1
task_util = 13
cost = 160 * 1000 = 160000
sum_util_w/o_task = 283
sum_util_w_task = 293
scale_cpu = 355
energy_w/o_task = 160000 * 283 / 355 = 127549
energy_w_task = 160000 * 296 / 355 = 133408
energy_diff = 133408 - 127549 = 5859
(this is not equal to 'new' PD0's energy_diff in 'b)')
2. Difference in the 6% energy margin filter at the end of
find_energy_efficient_cpu(). With this patch the margin comparison also
has better resolution, so it's possible to have better task placement
thanks to that.
Fixes: 27871f7a8a341ef ("PM: Introduce an Energy Model management framework")
Reported-by: CCJ Yeh <CCj.Yeh@mediatek.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 15d428e6fe77fffc3f4fff923336036f5496ef17 ]
In cpuset_hotplug_workfn(), the detection of whether the cpu list
has been changed is done by comparing the effective cpus of the top
cpuset with the cpu_active_mask. However, in the rare case that just
all the CPUs in the subparts_cpus are offlined, the detection fails
and the partition states are not updated correctly. Fix it by forcing
the cpus_updated flag to true in this particular case.
Fixes: 4b842da276a8 ("cpuset: Make CPU hotplug work with partition")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 75f0fc7b48ad45a2e5736bcf8de26c8872fe8695 ]
In bpf_patch_insn_data(), we first use the bpf_patch_insn_single() to
insert new instructions, then use adjust_insn_aux_data() to adjust
insn_aux_data. If the old env->prog have no enough room for new inserted
instructions, we use bpf_prog_realloc to construct new_prog and free the
old env->prog.
There have two errors here. First, if adjust_insn_aux_data() return
ENOMEM, we should free the new_prog. Second, if adjust_insn_aux_data()
return ENOMEM, bpf_patch_insn_data() will return NULL, and env->prog has
been freed in bpf_prog_realloc, but we will use it in bpf_check().
So in this patch, we make the adjust_insn_aux_data() never fails. In
bpf_patch_insn_data(), we first pre-malloc memory for the new
insn_aux_data, then call bpf_patch_insn_single() to insert new
instructions, at last call adjust_insn_aux_data() to adjust
insn_aux_data.
Fixes: 8041902dae52 ("bpf: adjust insn_aux_data when patching insns")
Signed-off-by: He Fengqing <hefengqing@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210714101815.164322-1-hefengqing@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 290fdc4b7ef14e33d0e30058042b0e9bfd02b89b ]
Return a negative error code from the error handling case instead of 0, as
done elsewhere in this function.
Fixes: f52da98d900e ("genirq/timings: Add selftest for irqs circular buffer")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210811093333.2376-1-thunder.leizhen@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|