| Age | Commit message (Collapse) | Author |
|
commit d1e7fd6462ca9fc76650fbe6ca800e35b24267da upstream.
Replace the 32bit exec_id with a 64bit exec_id to make it impossible
to wrap the exec_id counter. With care an attacker can cause exec_id
wrap and send arbitrary signals to a newly exec'd parent. This
bypasses the signal sending checks if the parent changes their
credentials during exec.
The severity of this problem can been seen that in my limited testing
of a 32bit exec_id it can take as little as 19s to exec 65536 times.
Which means that it can take as little as 14 days to wrap a 32bit
exec_id. Adam Zabrocki has succeeded wrapping the self_exe_id in 7
days. Even my slower timing is in the uptime of a typical server.
Which means self_exec_id is simply a speed bump today, and if exec
gets noticably faster self_exec_id won't even be a speed bump.
Extending self_exec_id to 64bits introduces a problem on 32bit
architectures where reading self_exec_id is no longer atomic and can
take two read instructions. Which means that is is possible to hit
a window where the read value of exec_id does not match the written
value. So with very lucky timing after this change this still
remains expoiltable.
I have updated the update of exec_id on exec to use WRITE_ONCE
and the read of exec_id in do_notify_parent to use READ_ONCE
to make it clear that there is no locking between these two
locations.
Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
Fixes: 2.3.23pre2
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
[bwh: Backported to 3.16:
- Use ACCESS_ONCE()
- Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit febac332a819f0e764aa4da62757ba21d18c182b upstream.
Kernel crashes inside QEMU/KVM are observed:
kernel BUG at kernel/time/timer.c:1154!
BUG_ON(timer_pending(timer) || !timer->function) in add_timer_on().
At the same time another cpu got:
general protection fault: 0000 [#1] SMP PTI of poinson pointer 0xdead000000000200 in:
__hlist_del at include/linux/list.h:681
(inlined by) detach_timer at kernel/time/timer.c:818
(inlined by) expire_timers at kernel/time/timer.c:1355
(inlined by) __run_timers at kernel/time/timer.c:1686
(inlined by) run_timer_softirq at kernel/time/timer.c:1699
Unfortunately kernel logs are badly scrambled, stacktraces are lost.
Printing the timer->function before the BUG_ON() pointed to
clocksource_watchdog().
The execution of clocksource_watchdog() can race with a sequence of
clocksource_stop_watchdog() .. clocksource_start_watchdog():
expire_timers()
detach_timer(timer, true);
timer->entry.pprev = NULL;
raw_spin_unlock_irq(&base->lock);
call_timer_fn
clocksource_watchdog()
clocksource_watchdog_kthread() or
clocksource_unbind()
spin_lock_irqsave(&watchdog_lock, flags);
clocksource_stop_watchdog();
del_timer(&watchdog_timer);
watchdog_running = 0;
spin_unlock_irqrestore(&watchdog_lock, flags);
spin_lock_irqsave(&watchdog_lock, flags);
clocksource_start_watchdog();
add_timer_on(&watchdog_timer, ...);
watchdog_running = 1;
spin_unlock_irqrestore(&watchdog_lock, flags);
spin_lock(&watchdog_lock);
add_timer_on(&watchdog_timer, ...);
BUG_ON(timer_pending(timer) || !timer->function);
timer_pending() -> true
BUG()
I.e. inside clocksource_watchdog() watchdog_timer could be already armed.
Check timer_pending() before calling add_timer_on(). This is sufficient as
all operations are synchronized by watchdog_lock.
Fixes: 75c5158f70c0 ("timekeeping: Update clocksource with stop_machine")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/158048693917.4378.13823603769948933793.stgit@buzz
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit afccc00f75bbbee4e4ae833a96c2d29a7259c693 upstream.
tracing_stat_init() was always returning '0', even on the error paths. It
now returns -ENODEV if tracing_init_dentry() fails or -ENOMEM if it fails
to created the 'trace_stat' debugfs directory.
Link: http://lkml.kernel.org/r/1410299381-20108-1-git-send-email-luis.henriques@canonical.com
Fixes: ed6f1c996bfe4 ("tracing: Check return value of tracing_init_dentry()")
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
[ Pulled from the archeological digging of my INBOX ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit dfb6cd1e654315168e36d947471bd2a0ccd834ae upstream.
Looking through old emails in my INBOX, I came across a patch from Luis
Henriques that attempted to fix a race of two stat tracers registering the
same stat trace (extremely unlikely, as this is done in the kernel, and
probably doesn't even exist). The submitted patch wasn't quite right as it
needed to deal with clean up a bit better (if two stat tracers were the
same, it would have the same files).
But to make the code cleaner, all we needed to do is to keep the
all_stat_sessions_mutex held for most of the registering function.
Link: http://lkml.kernel.org/r/1410299375-20068-1-git-send-email-luis.henriques@canonical.com
Fixes: 002bb86d8d42f ("tracing/ftrace: separate events tracing and stats tracing engine")
Reported-by: Luis Henriques <luis.henriques@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 38228e8848cd7dd86ccb90406af32de0cad24be3 upstream.
lockdep complains when padata's paths to update cpumasks via CPU hotplug
and sysfs are both taken:
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo ff > /sys/kernel/pcrypt/pencrypt/parallel_cpumask
======================================================
WARNING: possible circular locking dependency detected
5.4.0-rc8-padata-cpuhp-v3+ #1 Not tainted
------------------------------------------------------
bash/205 is trying to acquire lock:
ffffffff8286bcd0 (cpu_hotplug_lock.rw_sem){++++}, at: padata_set_cpumask+0x2b/0x120
but task is already holding lock:
ffff8880001abfa0 (&pinst->lock){+.+.}, at: padata_set_cpumask+0x26/0x120
which lock already depends on the new lock.
padata doesn't take cpu_hotplug_lock and pinst->lock in a consistent
order. Which should be first? CPU hotplug calls into padata with
cpu_hotplug_lock already held, so it should have priority.
Fixes: 6751fb3c0e0c ("padata: Use get_online_cpus/put_online_cpus")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 065cf577135a4977931c7a1e1edf442bfd9773dd upstream.
With the removal of the padata timer, padata_do_serial no longer
needs special CPU handling, so remove it.
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 07928d9bfc81640bab36f5190e8725894d93b659 upstream.
The function padata_flush_queues is fundamentally broken because
it cannot force padata users to complete the request that is
underway. IOW padata has to passively wait for the completion
of any outstanding work.
As it stands flushing is used in two places. Its use in padata_stop
is simply unnecessary because nothing depends on the queues to
be flushed afterwards.
The other use in padata_replace is more substantial as we depend
on it to free the old pd structure. This patch instead uses the
pd->refcnt to dynamically free the pd structure once all requests
are complete.
Fixes: 2b73b07ab8a4 ("padata: Flush the padata queues actively")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit ec9c7d19336ee98ecba8de80128aa405c45feebb upstream.
Exercising CPU hotplug on a 5.2 kernel with recent padata fixes from
cryptodev-2.6.git in an 8-CPU kvm guest...
# modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo c > /sys/kernel/pcrypt/pencrypt/parallel_cpumask
# modprobe tcrypt mode=215
...caused the following crash:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 2 PID: 134 Comm: kworker/2:2 Not tainted 5.2.0-padata-base+ #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-<snip>
Workqueue: pencrypt padata_parallel_worker
RIP: 0010:padata_reorder+0xcb/0x180
...
Call Trace:
padata_do_serial+0x57/0x60
pcrypt_aead_enc+0x3a/0x50 [pcrypt]
padata_parallel_worker+0x9b/0xe0
process_one_work+0x1b5/0x3f0
worker_thread+0x4a/0x3c0
...
In padata_alloc_pd, pd->cpu is set using the user-supplied cpumask
instead of the effective cpumask, and in this case cpumask_first picked
an offline CPU.
The offline CPU's reorder->list.next is NULL in padata_reorder because
the list wasn't initialized in padata_init_pqueues, which only operates
on CPUs in the effective mask.
Fix by using the effective mask in padata_alloc_pd.
Fixes: 6fc4dbcf0276 ("padata: Replace delayed timer with immediate workqueue in padata_reorder")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 6fc4dbcf0276279d488c5fbbfabe94734134f4fa upstream.
The function padata_reorder will use a timer when it cannot progress
while completed jobs are outstanding (pd->reorder_objects > 0). This
is suboptimal as if we do end up using the timer then it would have
introduced a gratuitous delay of one second.
In fact we can easily distinguish between whether completed jobs
are outstanding and whether we can make progress. All we have to
do is look at the next pqueue list.
This patch does that by replacing pd->processed with pd->cpu so
that the next pqueue is more accessible.
A work queue is used instead of the original try_again to avoid
hogging the CPU.
Note that we don't bother removing the work queue in
padata_flush_queues because the whole premise is broken. You
cannot flush async crypto requests so it makes no sense to even
try. A subsequent patch will fix it by replacing it with a ref
counting scheme.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
[bwh: Backported to 3.16: Deleted code used the old timer API here]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 350ef88e7e922354f82a931897ad4a4ce6c686ff upstream.
If the algorithm we're parallelizing is asynchronous we might change
CPUs between padata_do_parallel() and padata_do_serial(). However, we
don't expect this to happen as we need to enqueue the padata object into
the per-cpu reorder queue we took it from, i.e. the same-cpu's parallel
queue.
Ensure we're not switching CPUs for a given padata object by tracking
the CPU within the padata object. If the serial callback gets called on
the wrong CPU, defer invoking padata_reorder() via a kernel worker on
the CPU we're expected to run on.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit cf5868c8a22dc2854b96e9569064bb92365549ca upstream.
The reorder timer function runs on the CPU where the timer interrupt was
handled which is not necessarily one of the CPUs of the 'pcpu' CPU mask
set.
Ensure the padata_reorder() callback runs on the correct CPU, which is
one in the 'pcpu' CPU mask set and, preferrably, the next expected one.
Do so by comparing the current CPU with the expected target CPU. If they
match, call padata_reorder() right away. If they differ, schedule a work
item on the target CPU that does the padata_reorder() call for us.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 1bd845bcb41d5b7f83745e0cb99273eb376f2ec5 upstream.
The parallel queue per-cpu data structure gets initialized only for CPUs
in the 'pcpu' CPU mask set. This is not sufficient as the reorder timer
may run on a different CPU and might wrongly decide it's the target CPU
for the next reorder item as per-cpu memory gets memset(0) and we might
be waiting for the first CPU in cpumask.pcpu, i.e. cpu_index 0.
Make the '__this_cpu_read(pd->pqueue->cpu_index) == next_queue->cpu_index'
compare in padata_get_next() fail in this case by initializing the
cpu_index member of all per-cpu parallel queues. Use -1 for unused ones.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 69b348449bda0f9588737539cfe135774c9939a7 upstream.
Per Dan's static checker warning, the code that returns NULL was removed
in 2010, so this patch updates the comments and fixes the code
assumptions.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit de5540d088fe97ad583cc7d396586437b32149a5 upstream.
Under extremely heavy uses of padata, crashes occur, and with list
debugging turned on, this happens instead:
[87487.298728] WARNING: CPU: 1 PID: 882 at lib/list_debug.c:33
__list_add+0xae/0x130
[87487.301868] list_add corruption. prev->next should be next
(ffffb17abfc043d0), but was ffff8dba70872c80. (prev=ffff8dba70872b00).
[87487.339011] [<ffffffff9a53d075>] dump_stack+0x68/0xa3
[87487.342198] [<ffffffff99e119a1>] ? console_unlock+0x281/0x6d0
[87487.345364] [<ffffffff99d6b91f>] __warn+0xff/0x140
[87487.348513] [<ffffffff99d6b9aa>] warn_slowpath_fmt+0x4a/0x50
[87487.351659] [<ffffffff9a58b5de>] __list_add+0xae/0x130
[87487.354772] [<ffffffff9add5094>] ? _raw_spin_lock+0x64/0x70
[87487.357915] [<ffffffff99eefd66>] padata_reorder+0x1e6/0x420
[87487.361084] [<ffffffff99ef0055>] padata_do_serial+0xa5/0x120
padata_reorder calls list_add_tail with the list to which its adding
locked, which seems correct:
spin_lock(&squeue->serial.lock);
list_add_tail(&padata->list, &squeue->serial.list);
spin_unlock(&squeue->serial.lock);
This therefore leaves only place where such inconsistency could occur:
if padata->list is added at the same time on two different threads.
This pdata pointer comes from the function call to
padata_get_next(pd), which has in it the following block:
next_queue = per_cpu_ptr(pd->pqueue, cpu);
padata = NULL;
reorder = &next_queue->reorder;
if (!list_empty(&reorder->list)) {
padata = list_entry(reorder->list.next,
struct padata_priv, list);
spin_lock(&reorder->lock);
list_del_init(&padata->list);
atomic_dec(&pd->reorder_objects);
spin_unlock(&reorder->lock);
pd->processed++;
goto out;
}
out:
return padata;
I strongly suspect that the problem here is that two threads can race
on reorder list. Even though the deletion is locked, call to
list_entry is not locked, which means it's feasible that two threads
pick up the same padata object and subsequently call list_add_tail on
them at the same time. The fix is thus be hoist that lock outside of
that block.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 119a0798dc42ed4c4f96d39b8b676efcea73aec6 upstream.
Remove the unused but set variable pinst in padata_parallel_worker to
fix the following warning when building with 'W=1':
kernel/padata.c: In function ‘padata_parallel_worker’:
kernel/padata.c:68:26: warning: variable ‘pinst’ set but not used [-Wunused-but-set-variable]
Also remove the now unused variable pd which is only used to set pinst.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 8d67743653dce5a0e7aa500fcccb237cde7ad88e upstream.
The recent futex inode life time fix changed the ordering of the futex key
union struct members, but forgot to adjust the hash function accordingly,
As a result the hashing omits the leading 64bit and even hashes beyond the
futex key causing a bad hash distribution which led to a ~100% performance
regression.
Hand in the futex key pointer instead of a random struct member and make
the size calculation based of the struct offset.
Fixes: 8019ad13ef7f ("futex: Fix inode life-time issue")
Reported-by: Rong Chen <rong.a.chen@intel.com>
Decoded-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Rong Chen <rong.a.chen@intel.com>
Link: https://lkml.kernel.org/r/87h7yy90ve.fsf@nanos.tec.linutronix.de
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Jann Horn <jannh@google.com>
|
|
commit 8019ad13ef7f64be44d4f892af9c840179009254 upstream.
As reported by Jann, ihold() does not in fact guarantee inode
persistence. And instead of making it so, replace the usage of inode
pointers with a per boot, machine wide, unique inode identifier.
This sequence number is global, but shared (file backed) futexes are
rare enough that this should not become a performance issue.
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.16: Use atomic64_cmpxchg() instead of the
_relaxed() variant which we don't have]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit a33121e5487b424339636b25c35d3a180eaa5f5e upstream.
In a case when a ptp chardev (like /dev/ptp0) is open but an underlying
device is removed, closing this file leads to a race. This reproduces
easily in a kvm virtual machine:
ts# cat openptp0.c
int main() { ... fp = fopen("/dev/ptp0", "r"); ... sleep(10); }
ts# uname -r
5.5.0-rc3-46cf053e
ts# cat /proc/cmdline
... slub_debug=FZP
ts# modprobe ptp_kvm
ts# ./openptp0 &
[1] 670
opened /dev/ptp0, sleeping 10s...
ts# rmmod ptp_kvm
ts# ls /dev/ptp*
ls: cannot access '/dev/ptp*': No such file or directory
ts# ...woken up
[ 48.010809] general protection fault: 0000 [#1] SMP
[ 48.012502] CPU: 6 PID: 658 Comm: openptp0 Not tainted 5.5.0-rc3-46cf053e #25
[ 48.014624] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
[ 48.016270] RIP: 0010:module_put.part.0+0x7/0x80
[ 48.017939] RSP: 0018:ffffb3850073be00 EFLAGS: 00010202
[ 48.018339] RAX: 000000006b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: ffff89a476c00ad0
[ 48.018936] RDX: fffff65a08d3ea08 RSI: 0000000000000247 RDI: 6b6b6b6b6b6b6b6b
[ 48.019470] ... ^^^ a slub poison
[ 48.023854] Call Trace:
[ 48.024050] __fput+0x21f/0x240
[ 48.024288] task_work_run+0x79/0x90
[ 48.024555] do_exit+0x2af/0xab0
[ 48.024799] ? vfs_write+0x16a/0x190
[ 48.025082] do_group_exit+0x35/0x90
[ 48.025387] __x64_sys_exit_group+0xf/0x10
[ 48.025737] do_syscall_64+0x3d/0x130
[ 48.026056] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 48.026479] RIP: 0033:0x7f53b12082f6
[ 48.026792] ...
[ 48.030945] Modules linked in: ptp i6300esb watchdog [last unloaded: ptp_kvm]
[ 48.045001] Fixing recursive fault but reboot is needed!
This happens in:
static void __fput(struct file *file)
{ ...
if (file->f_op->release)
file->f_op->release(inode, file); <<< cdev is kfree'd here
if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
!(mode & FMODE_PATH))) {
cdev_put(inode->i_cdev); <<< cdev fields are accessed here
Namely:
__fput()
posix_clock_release()
kref_put(&clk->kref, delete_clock) <<< the last reference
delete_clock()
delete_ptp_clock()
kfree(ptp) <<< cdev is embedded in ptp
cdev_put
module_put(p->owner) <<< *p is kfree'd, bang!
Here cdev is embedded in posix_clock which is embedded in ptp_clock.
The race happens because ptp_clock's lifetime is controlled by two
refcounts: kref and cdev.kobj in posix_clock. This is wrong.
Make ptp_clock's sysfs device a parent of cdev with cdev_device_add()
created especially for such cases. This way the parent device with its
ptp_clock is not released until all references to the cdev are released.
This adds a requirement that an initialized but not exposed struct
device should be provided to posix_clock_register() by a caller instead
of a simple dev_t.
This approach was adopted from the commit 72139dfa2464 ("watchdog: Fix
the race between the release of watchdog_core_data and cdev"). See
details of the implementation in the commit 233ed09d7fda ("chardev: add
helper function to register char devs with a struct device").
Link: https://lore.kernel.org/linux-fsdevel/20191125125342.6189-1-vdronov@redhat.com/T/#u
Analyzed-by: Stephen Johnston <sjohnsto@redhat.com>
Analyzed-by: Vern Lovejoy <vlovejoy@redhat.com>
Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 153031a301bb07194e9c37466cfce8eacb977621 upstream.
There was a recent change in blktrace.c that added a RCU protection to
`q->blk_trace` in order to fix a use-after-free issue during access.
However the change missed an edge case that can lead to dereferencing of
`bt` pointer even when it's NULL:
Coverity static analyzer marked this as a FORWARD_NULL issue with CID
1460458.
```
/kernel/trace/blktrace.c: 1904 in sysfs_blk_trace_attr_store()
1898 ret = 0;
1899 if (bt == NULL)
1900 ret = blk_trace_setup_queue(q, bdev);
1901
1902 if (ret == 0) {
1903 if (attr == &dev_attr_act_mask)
>>> CID 1460458: Null pointer dereferences (FORWARD_NULL)
>>> Dereferencing null pointer "bt".
1904 bt->act_mask = value;
1905 else if (attr == &dev_attr_pid)
1906 bt->pid = value;
1907 else if (attr == &dev_attr_start_lba)
1908 bt->start_lba = value;
1909 else if (attr == &dev_attr_end_lba)
```
Added a reassignment with RCU annotation to fix the issue.
Fixes: c780e86dd48 ("blktrace: Protect q->blk_trace with RCU")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Cengiz Can <cengiz@kernel.wtf>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit c780e86dd48ef6467a1146cf7d0fe1e05a635039 upstream.
KASAN is reporting that __blk_add_trace() has a use-after-free issue
when accessing q->blk_trace. Indeed the switching of block tracing (and
thus eventual freeing of q->blk_trace) is completely unsynchronized with
the currently running tracing and thus it can happen that the blk_trace
structure is being freed just while __blk_add_trace() works on it.
Protect accesses to q->blk_trace by RCU during tracing and make sure we
wait for the end of RCU grace period when shutting down tracing. Luckily
that is rare enough event that we can afford that. Note that postponing
the freeing of blk_trace to an RCU callback should better be avoided as
it could have unexpected user visible side-effects as debugfs files
would be still existing for a short while block tracing has been shut
down.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=205711
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reported-by: Tristan Madani <tristmd@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[bwh: Backported to 3.16:
- Drop changes in blk_trace_note_message_enabled(), blk_trace_bio_get_cgid()
- Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit cdea01b2bf98affb7e9c44530108a4a28535eee8 upstream.
This is really about simplifying the double xchg patterns into
a single cmpxchg, with the same logic. Other than the immediate
cleanup, there are some subtleties this change deals with:
(i) While the load of the old bt is fully ordered wrt everything,
ie:
old_bt = xchg(&q->blk_trace, bt); [barrier]
if (old_bt)
(void) xchg(&q->blk_trace, old_bt); [barrier]
blk_trace could still be changed between the xchg and the old_bt
load. Note that this description is merely theoretical and afaict
very small, but doing everything in a single context with cmpxchg
closes this potential race.
(ii) Ordering guarantees are obviously kept with cmpxchg.
(iii) Gets rid of the hacky-by-nature (void)xchg pattern.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
eviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 30aba6656f61ed44cba445a3c0d38b296fa9e8f5 upstream.
Disallows open of FIFOs or regular files not owned by the user in world
writable sticky directories, unless the owner is the same as that of the
directory or the file is opened without the O_CREAT flag. The purpose
is to make data spoofing attacks harder. This protection can be turned
on and off separately for FIFOs and regular files via sysctl, just like
the symlinks/hardlinks protection. This patch is based on Openwall's
"HARDEN_FIFO" feature by Solar Designer.
This is a brief list of old vulnerabilities that could have been prevented
by this feature, some of them even allow for privilege escalation:
CVE-2000-1134
CVE-2007-3852
CVE-2008-0525
CVE-2009-0416
CVE-2011-4834
CVE-2015-1838
CVE-2015-7442
CVE-2016-7489
This list is not meant to be complete. It's difficult to track down all
vulnerabilities of this kind because they were often reported without any
mention of this particular attack vector. In fact, before
hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
vehicle to exploit them.
[s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
[keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
[keescook@chromium.org: adjust commit subjet]
Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Solar Designer <solar@openwall.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
sched_migrate_task fail
commit 50f9ad607ea891a9308e67b81f774c71736d1098 upstream.
In the function, if register_trace_sched_migrate_task() returns error,
sched_switch/sched_wakeup_new/sched_wakeup won't unregister. That is
why fail_deprobe_sched_switch was added.
Link: http://lkml.kernel.org/r/20191231133530.2794-1-pilgrimtao@gmail.com
Fixes: 478142c39c8c2 ("tracing: do not grab lock in wakeup latency function tracing")
Signed-off-by: Kaitao Cheng <pilgrimtao@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit e31f7939c1c27faa5d0e3f14519eaf7c89e8a69d upstream.
The ftrace_profile->counter is unsigned long and
do_div truncates it to 32 bits, which means it can test
non-zero and be truncated to zero for division.
Fix this issue by using div64_ul() instead.
Link: http://lkml.kernel.org/r/20200103030248.14516-1-wenyang@linux.alibaba.com
Fixes: e330b3bcd8319 ("tracing: Show sample std dev in function profiling")
Fixes: 34886c8bc590f ("tracing: add average time in function to function profiler")
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit b8299d362d0837ae39e87e9019ebe6b736e0f035 upstream.
On some archs with some configurations, MCOUNT_INSN_SIZE is not defined, and
this makes the stack tracer fail to compile. Just define it to zero in this
case.
Link: https://lore.kernel.org/r/202001020219.zvE3vsty%lkp@intel.com
Fixes: 4df297129f622 ("tracing: Remove most or all of stack tracer stack size from stack_max_size")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 0b8d616fb5a8ffa307b1d3af37f55c15dae14f28 upstream.
When assiging and testing taskstats in taskstats_exit() there's a race
when setting up and reading sig->stats when a thread-group with more
than one thread exits:
write to 0xffff8881157bbe10 of 8 bytes by task 7951 on cpu 0:
taskstats_tgid_alloc kernel/taskstats.c:567 [inline]
taskstats_exit+0x6b7/0x717 kernel/taskstats.c:596
do_exit+0x2c2/0x18e0 kernel/exit.c:864
do_group_exit+0xb4/0x1c0 kernel/exit.c:983
get_signal+0x2a2/0x1320 kernel/signal.c:2734
do_signal+0x3b/0xc00 arch/x86/kernel/signal.c:815
exit_to_usermode_loop+0x250/0x2c0 arch/x86/entry/common.c:159
prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
syscall_return_slowpath arch/x86/entry/common.c:274 [inline]
do_syscall_64+0x2d7/0x2f0 arch/x86/entry/common.c:299
entry_SYSCALL_64_after_hwframe+0x44/0xa9
read to 0xffff8881157bbe10 of 8 bytes by task 7949 on cpu 1:
taskstats_tgid_alloc kernel/taskstats.c:559 [inline]
taskstats_exit+0xb2/0x717 kernel/taskstats.c:596
do_exit+0x2c2/0x18e0 kernel/exit.c:864
do_group_exit+0xb4/0x1c0 kernel/exit.c:983
__do_sys_exit_group kernel/exit.c:994 [inline]
__se_sys_exit_group kernel/exit.c:992 [inline]
__x64_sys_exit_group+0x2e/0x30 kernel/exit.c:992
do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fix this by using smp_load_acquire() and smp_store_release().
Reported-by: syzbot+c5d03165a1bd1dead0c1@syzkaller.appspotmail.com
Fixes: 34ec12349c8a ("taskstats: cleanup ->signal->stats allocation")
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Marco Elver <elver@google.com>
Reviewed-by: Will Deacon <will@kernel.org>
Reviewed-by: Andrea Parri <parri.andrea@gmail.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://lore.kernel.org/r/20191009114809.8643-1-christian.brauner@ubuntu.com
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 398ca17fb54b212cdc9da7ff4a17a35c48dd2103 upstream.
The field has no value because all clock bases have the same
resolution. The resolution only changes when we switch to high
resolution timer mode. We can evaluate that from a single static
variable as well. In the !HIGHRES case its simply a constant.
Export the variable, so we can simplify the usage sites.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203500.645454122@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[bwh: Backported to 3.16 as dependency of commit 552263456215
"powerpc: Fix vDSO clock_getres()":
- Adjust filename, context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit ca16d5bee59807bf04deaab0a8eccecd5061528c upstream.
Robust futexes utilize the robust_list mechanism to allow the kernel to
release futexes which are held when a task exits. The exit can be voluntary
or caused by a signal or fault. This prevents that waiters block forever.
The futex operations in user space store a pointer to the futex they are
either locking or unlocking in the op_pending member of the per task robust
list.
After a lock operation has succeeded the futex is queued in the robust list
linked list and the op_pending pointer is cleared.
After an unlock operation has succeeded the futex is removed from the
robust list linked list and the op_pending pointer is cleared.
The robust list exit code checks for the pending operation and any futex
which is queued in the linked list. It carefully checks whether the futex
value is the TID of the exiting task. If so, it sets the OWNER_DIED bit and
tries to wake up a potential waiter.
This is race free for the lock operation but unlock has two race scenarios
where waiters might not be woken up. These issues can be observed with
regular robust pthread mutexes. PI aware pthread mutexes are not affected.
(1) Unlocking task is killed after unlocking the futex value in user space
before being able to wake a waiter.
pthread_mutex_unlock()
|
V
atomic_exchange_rel (&mutex->__data.__lock, 0)
<------------------------killed
lll_futex_wake () |
|
|(__lock = 0)
|(enter kernel)
|
V
do_exit()
exit_mm()
mm_release()
exit_robust_list()
handle_futex_death()
|
|(__lock = 0)
|(uval = 0)
|
V
if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
return 0;
The sanity check which ensures that the user space futex is owned by
the exiting task prevents the wakeup of waiters which in consequence
block infinitely.
(2) Waiting task is killed after a wakeup and before it can acquire the
futex in user space.
OWNER WAITER
futex_wait()
pthread_mutex_unlock() |
| |
|(__lock = 0) |
| |
V |
futex_wake() ------------> wakeup()
|
|(return to userspace)
|(__lock = 0)
|
V
oldval = mutex->__data.__lock
<-----------------killed
atomic_compare_and_exchange_val_acq (&mutex->__data.__lock, |
id | assume_other_futex_waiters, 0) |
|
|
(enter kernel)|
|
V
do_exit()
|
|
V
handle_futex_death()
|
|(__lock = 0)
|(uval = 0)
|
V
if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
return 0;
The sanity check which ensures that the user space futex is owned
by the exiting task prevents the wakeup of waiters, which seems to
be correct as the exiting task does not own the futex value, but
the consequence is that other waiters wont be woken up and block
infinitely.
In both scenarios the following conditions are true:
- task->robust_list->list_op_pending != NULL
- user space futex value == 0
- Regular futex (not PI)
If these conditions are met then it is reasonably safe to wake up a
potential waiter in order to prevent the above problems.
As this might be a false positive it can cause spurious wakeups, but the
waiter side has to handle other types of unrelated wakeups, e.g. signals
gracefully anyway. So such a spurious wakeup will not affect the
correctness of these operations.
This workaround must not touch the user space futex value and cannot set
the OWNER_DIED bit because the lock value is 0, i.e. uncontended. Setting
OWNER_DIED in this case would result in inconsistent state and subsequently
in malfunction of the owner died handling in user space.
The rest of the user space state is still consistent as no other task can
observe the list_op_pending entry in the exiting tasks robust list.
The eventually woken up waiter will observe the uncontended lock value and
take it over.
[ tglx: Massaged changelog and comment. Made the return explicit and not
depend on the subsequent check and added constants to hand into
handle_futex_death() instead of plain numbers. Fixed a few coding
style issues. ]
Fixes: 0771dfefc9e5 ("[PATCH] lightweight robust futexes: core")
Signed-off-by: Yang Tao <yang.tao172@zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1573010582-35297-1-git-send-email-wang.yi59@zte.com.cn
Link: https://lkml.kernel.org/r/20191106224555.943191378@linutronix.de
[bwh: Backported to 3.16: Implementation is split between futex.c and
futex_compat.c, with common definitions in <linux/futex.h>]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit def98c84b6cdf2eeea19ec5736e90e316df5206b upstream.
Before actually destrying a workqueue, destroy_workqueue() checks
whether it's actually idle. If it isn't, it prints out a bunch of
warning messages and leaves the workqueue dangling. It unfortunately
has a couple issues.
* Mayday list queueing increments pwq's refcnts which gets detected as
busy and fails the sanity checks. However, because mayday list
queueing is asynchronous, this condition can happen without any
actual work items left in the workqueue.
* Sanity check failure leaves the sysfs interface behind too which can
lead to init failure of newer instances of the workqueue.
This patch fixes the above two by
* If a workqueue has a rescuer, disable and kill the rescuer before
sanity checks. Disabling and killing is guaranteed to flush the
existing mayday list.
* Remove sysfs interface before sanity checks.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marcin Pawlowski <mpawlowski@fb.com>
Reported-by: "Williams, Gerald S" <gerald.s.williams@intel.com>
[bwh: Backported to 3.16: destroy_workqueue() also freed wq->rescuer itself]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
This reverts commit eb29ee5a3873134917a760bf9c416da0a089a0be, which
was commit 512ac999d2755d2b7109e996a76b6fb8b888631d upstream. This
introduced a regression and doesn't seem to have been suitable for
older stable branches. (It has been fixed differently upstream.)
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit f6cf0545ec697ddc278b7457b7d0c0d86a2ea88e upstream.
Some architectures require code written to memory as if it were data to be
'cleaned' from any data caches before the processor can fetch them as new
instructions.
During resume from hibernate, the snapshot code copies some pages directly,
meaning these architectures do not get a chance to perform their cache
maintenance. Modify the read and decompress code to call
flush_icache_range() on all pages that are restored, so that the restored
in-place pages are guaranteed to be executable on these architectures.
Signed-off-by: James Morse <james.morse@arm.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Rafael J. Wysocki <rjw@rjwysocki.net>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
[will: make clean_pages_on_* static and remove initialisers]
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 343df3c79c62b644ce6ff5dff96c9e0be1ecb242 upstream.
Stop abusing struct page functionality and the swap end_io handler, and
instead add a modified version of the blk-lib.c bio_batch helpers.
Also move the block I/O code into swap.c as they are directly tied into
each other.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Tested-by: Ming Lin <mlin@kernel.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Rafael J. Wysocki <rjw@rjwysocki.net>
Signed-off-by: Jens Axboe <axboe@fb.com>
[bwh: Backported to 3.16 as dependency of commit f6cf0545ec69
"PM / Hibernate: Call flush_icache_range() on pages restored in-place":
- Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 0722069a5374b904ec1a67f91249f90e1cfae259 upstream.
When printing multiple uprobe arguments as strings the output for the
earlier arguments would also include all later string arguments.
This is best explained in an example:
Consider adding a uprobe to a function receiving two strings as
parameters which is at offset 0xa0 in strlib.so and we want to print
both parameters when the uprobe is hit (on x86_64):
$ echo 'p:func /lib/strlib.so:0xa0 +0(%di):string +0(%si):string' > \
/sys/kernel/debug/tracing/uprobe_events
When the function is called as func("foo", "bar") and we hit the probe,
the trace file shows a line like the following:
[...] func: (0x7f7e683706a0) arg1="foobar" arg2="bar"
Note the extra "bar" printed as part of arg1. This behaviour stacks up
for additional string arguments.
The strings are stored in a dynamically growing part of the uprobe
buffer by fetch_store_string() after copying them from userspace via
strncpy_from_user(). The return value of strncpy_from_user() is then
directly used as the required size for the string. However, this does
not take the terminating null byte into account as the documentation
for strncpy_from_user() cleary states that it "[...] returns the
length of the string (not including the trailing NUL)" even though the
null byte will be copied to the destination.
Therefore, subsequent calls to fetch_store_string() will overwrite
the terminating null byte of the most recently fetched string with
the first character of the current string, leading to the
"accumulation" of strings in earlier arguments in the output.
Fix this by incrementing the return value of strncpy_from_user() by
one if we did not hit the maximum buffer size.
Link: http://lkml.kernel.org/r/20190116141629.5752-1-andreas.ziegler@fau.de
Cc: Ingo Molnar <mingo@redhat.com>
Fixes: 5baaa59ef09e ("tracing/probes: Implement 'memory' fetch method for uprobes")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 194c2c74f5532e62c218adeb8e2b683119503907 upstream.
As instances may have different tracers available, we need to look at the
trace_array descriptor that shows the list of the available tracers for the
instance. But there's a race between opening the file and an admin
deleting the instance. The trace_array_get() needs to be called before
accessing the trace_array.
Fixes: 607e2ea167e56 ("tracing: Set up infrastructure to allow tracers for instances")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
ratio precision
commit 4929a4e6faa0f13289a67cae98139e727f0d4a97 upstream.
The quota/period ratio is used to ensure a child task group won't get
more bandwidth than the parent task group, and is calculated as:
normalized_cfs_quota() = [(quota_us << 20) / period_us]
If the quota/period ratio was changed during this scaling due to
precision loss, it will cause inconsistency between parent and child
task groups.
See below example:
A userspace container manager (kubelet) does three operations:
1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
2) Create a few children cgroups.
3) Set quota to 1,000us and period to 10,000us on a child cgroup.
These operations are expected to succeed. However, if the scaling of
147/128 happens before step 3, quota and period of the parent cgroup
will be changed:
new_quota: 1148437ns, 1148us
new_period: 11484375ns, 11484us
And when step 3 comes in, the ratio of the child cgroup will be
104857, which will be larger than the parent cgroup ratio (104821),
and will fail.
Scaling them by a factor of 2 will fix the problem.
Tested-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Xuewei Zhang <xueweiz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Phil Auld <pauld@redhat.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 20bb759a66be52cf4a9ddd17fddaf509e11490cd upstream.
Calling 'panic()' on a kernel with CONFIG_PREEMPT=y can leave the
calling CPU in an infinite loop, but with interrupts and preemption
enabled. From this state, userspace can continue to be scheduled,
despite the system being "dead" as far as the kernel is concerned.
This is easily reproducible on arm64 when booting with "nosmp" on the
command line; a couple of shell scripts print out a periodic "Ping"
message whilst another triggers a crash by writing to
/proc/sysrq-trigger:
| sysrq: Trigger a crash
| Kernel panic - not syncing: sysrq triggered crash
| CPU: 0 PID: 1 Comm: init Not tainted 5.2.15 #1
| Hardware name: linux,dummy-virt (DT)
| Call trace:
| dump_backtrace+0x0/0x148
| show_stack+0x14/0x20
| dump_stack+0xa0/0xc4
| panic+0x140/0x32c
| sysrq_handle_reboot+0x0/0x20
| __handle_sysrq+0x124/0x190
| write_sysrq_trigger+0x64/0x88
| proc_reg_write+0x60/0xa8
| __vfs_write+0x18/0x40
| vfs_write+0xa4/0x1b8
| ksys_write+0x64/0xf0
| __arm64_sys_write+0x14/0x20
| el0_svc_common.constprop.0+0xb0/0x168
| el0_svc_handler+0x28/0x78
| el0_svc+0x8/0xc
| Kernel Offset: disabled
| CPU features: 0x0002,24002004
| Memory Limit: none
| ---[ end Kernel panic - not syncing: sysrq triggered crash ]---
| Ping 2!
| Ping 1!
| Ping 1!
| Ping 2!
The issue can also be triggered on x86 kernels if CONFIG_SMP=n,
otherwise local interrupts are disabled in 'smp_send_stop()'.
Disable preemption in 'panic()' before re-enabling interrupts.
Link: http://lkml.kernel.org/r/20191002123538.22609-1-will@kernel.org
Link: https://lore.kernel.org/r/BX1W47JXPMR8.58IYW53H6M5N@dragonstone
Signed-off-by: Will Deacon <will@kernel.org>
Reported-by: Xogium <contact@xogium.me>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Feng Tang <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit b9023b91dd020ad7e093baa5122b6968c48cc9e0 upstream.
When a cpu requests broadcasting, before starting the tick broadcast
hrtimer, bc_set_next() checks if the timer callback (bc_handler) is active
using hrtimer_try_to_cancel(). But hrtimer_try_to_cancel() does not provide
the required synchronization when the callback is active on other core.
The callback could have already executed tick_handle_oneshot_broadcast()
and could have also returned. But still there is a small time window where
the hrtimer_try_to_cancel() returns -1. In that case bc_set_next() returns
without doing anything, but the next_event of the tick broadcast clock
device is already set to a timeout value.
In the race condition diagram below, CPU #1 is running the timer callback
and CPU #2 is entering idle state and so calls bc_set_next().
In the worst case, the next_event will contain an expiry time, but the
hrtimer will not be started which happens when the racing callback returns
HRTIMER_NORESTART. The hrtimer might never recover if all further requests
from the CPUs to subscribe to tick broadcast have timeout greater than the
next_event of tick broadcast clock device. This leads to cascading of
failures and finally noticed as rcu stall warnings
Here is a depiction of the race condition
CPU #1 (Running timer callback) CPU #2 (Enter idle
and subscribe to
tick broadcast)
--------------------- ---------------------
__run_hrtimer() tick_broadcast_enter()
bc_handler() __tick_broadcast_oneshot_control()
tick_handle_oneshot_broadcast()
raw_spin_lock(&tick_broadcast_lock);
dev->next_event = KTIME_MAX; //wait for tick_broadcast_lock
//next_event for tick broadcast clock
set to KTIME_MAX since no other cores
subscribed to tick broadcasting
raw_spin_unlock(&tick_broadcast_lock);
if (dev->next_event == KTIME_MAX)
return HRTIMER_NORESTART
// callback function exits without
restarting the hrtimer //tick_broadcast_lock acquired
raw_spin_lock(&tick_broadcast_lock);
tick_broadcast_set_event()
clockevents_program_event()
dev->next_event = expires;
bc_set_next()
hrtimer_try_to_cancel()
//returns -1 since the timer
callback is active. Exits without
restarting the timer
cpu_base->running = NULL;
The comment that hrtimer cannot be armed from within the callback is
wrong. It is fine to start the hrtimer from within the callback. Also it is
safe to start the hrtimer from the enter/exit idle code while the broadcast
handler is active. The enter/exit idle code and the broadcast handler are
synchronized using tick_broadcast_lock. So there is no need for the
existing try to cancel logic. All this can be removed which will eliminate
the race condition as well.
Fixes: 5d1638acb9f6 ("tick: Introduce hrtimer based broadcast")
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190926135101.12102-2-balasubramani_vivekanandan@mentor.com
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
unused
commit 38d23a6cc16c02f7b0c920266053f340b5601735 upstream.
The hrtimer callback in the hrtimer's tick broadcast code sometimes
incorrectly ends up scheduling events at the current tick causing the
kernel to hang servicing the same hrtimer forever. This typically
happens when a device is swapped out by
tick_install_broadcast_device(), which replaces the event handler with
clock_events_handle_noop() and sets the device mode to
CLOCK_EVT_MODE_UNUSED. If the timer is scheduled when this happens,
the next_event field will not be updated and the hrtimer ends up being
restarted at the current tick. To prevent this from happening, only
try to restart the hrtimer if the broadcast clock event device is in
one of the active modes and try to cancel the timer when entering the
CLOCK_EVT_MODE_UNUSED mode.
Signed-off-by: Andreas Sandberg <andreas.sandberg@arm.com>
Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1429880765-5558-1-git-send-email-andreas.sandberg@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[bwh: Backported to 3.16 as dependency of commit b9023b91dd02
"tick: broadcast-hrtimer: Fix a race in bc_set_next"]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit b8a62f1ff0ccb18fdc25c6150d1cd394610f4753 upstream.
The assignment of bc_moved in the conditional construct relies on the
fact that in the case of hrtimer_start() invocation the return value
is always 0. It took me a while to understand it.
We want to get rid of the hrtimer_start() return value. Open code the
logic which makes it readable as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203503.404751457@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[bwh: Backported to 3.16 to ease backporting commit b9023b91dd02
"tick: broadcast-hrtimer: Fix a race in bc_set_next"]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit cddd02489f52ccf635ed65931214729a23b93cd6 upstream.
In lowres mode, hrtimers are serviced by the tick instead of a clock
event. Now it works well as long as the tick stays periodic but we
must also make sure that the hrtimers are serviced in dynticks mode.
Part of that job consist in kicking a dynticks hrtimer target in order
to make it reconsider the next tick to schedule to correctly handle the
hrtimer's expiring time. And that part isn't handled by the hrtimers
subsystem.
To prepare for fixing this, we need __hrtimer_start_range_ns() to be
able to resolve the CPU target associated to a hrtimer's object
'cpu_base' so that the kick can be centralized there.
So lets store it in the 'struct hrtimer_cpu_base' to resolve the CPU
without overhead. It is set once at CPU's online notification.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1403393357-2070-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[bwh: Backported to 3.16 as dependency of commit b9023b91dd02
"tick: broadcast-hrtimer: Fix a race in bc_set_next":
- Adjust filename, context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit f18ddc13af981ce3c7b7f26925f099e7c6929aba upstream.
ENOTSUPP is not supposed to be returned to userspace. This was found on an
OpenPower machine, where the RTC does not support set_alarm.
On that system, a clock_nanosleep(CLOCK_REALTIME_ALARM, ...) results in
"524 Unknown error 524"
Replace it with EOPNOTSUPP which results in the expected "95 Operation not
supported" error.
Fixes: 1c6b39ad3f01 (alarmtimers: Return -ENOTSUPP if no RTC device is present)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190903171802.28314-1-cascardo@canonical.com
[ pvorel: backport for v3.16, changes also in alarm_timer_{del,set}(), which
were removed in f2c45807d3992fe0f173f34af9c347d907c31686 in v4.13-rc1 ]
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Acked-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit eddf3e9c7c7e4d0707c68d1bb22cc6ec8aef7d4a upstream.
The following crash was observed:
Unable to handle kernel NULL pointer dereference at 0000000000000158
Internal error: Oops: 96000004 [#1] SMP
pc : resend_irqs+0x68/0xb0
lr : resend_irqs+0x64/0xb0
...
Call trace:
resend_irqs+0x68/0xb0
tasklet_action_common.isra.6+0x84/0x138
tasklet_action+0x2c/0x38
__do_softirq+0x120/0x324
run_ksoftirqd+0x44/0x60
smpboot_thread_fn+0x1ac/0x1e8
kthread+0x134/0x138
ret_from_fork+0x10/0x18
The reason for this is that the interrupt resend mechanism happens in soft
interrupt context, which is a asynchronous mechanism versus other
operations on interrupts. free_irq() does not take resend handling into
account. Thus, the irq descriptor might be already freed before the resend
tasklet is executed. resend_irqs() does not check the return value of the
interrupt descriptor lookup and derefences the return value
unconditionally.
1):
__setup_irq
irq_startup
check_irq_resend // activate softirq to handle resend irq
2):
irq_domain_free_irqs
irq_free_descs
free_desc
call_rcu(&desc->rcu, delayed_free_desc)
3):
__do_softirq
tasklet_action
resend_irqs
desc = irq_to_desc(irq)
desc->handle_irq(desc) // desc is NULL --> Ooops
Fix this by adding a NULL pointer check in resend_irqs() before derefencing
the irq descriptor.
Fixes: a4633adcdbc1 ("[PATCH] genirq: add genirq sw IRQ-retrigger")
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Link: https://lkml.kernel.org/r/1630ae13-5c8e-901e-de09-e740b6a426a7@huawei.com
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 5e2d2cc2588bd3307ce3937acbc2ed03c830a861 upstream.
do_sched_cfs_period_timer() will refill cfs_b runtime and call
distribute_cfs_runtime to unthrottle cfs_rq, sometimes cfs_b->runtime
will allocate all quota to one cfs_rq incorrectly, then other cfs_rqs
attached to this cfs_b can't get runtime and will be throttled.
We find that one throttled cfs_rq has non-negative
cfs_rq->runtime_remaining and cause an unexpetced cast from s64 to u64
in snippet:
distribute_cfs_runtime() {
runtime = -cfs_rq->runtime_remaining + 1;
}
The runtime here will change to a large number and consume all
cfs_b->runtime in this cfs_b period.
According to Ben Segall, the throttled cfs_rq can have
account_cfs_rq_runtime called on it because it is throttled before
idle_balance, and the idle_balance calls update_rq_clock to add time
that is accounted to the task.
This commit prevents cfs_rq to be assgined new runtime if it has been
throttled until that distribute_cfs_runtime is called.
Signed-off-by: Liangyan <liangyan.peng@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: shanpeic@linux.alibaba.com
Cc: xlpang@linux.alibaba.com
Fixes: d3d9dc330236 ("sched: Throttle entities exceeding their allowed bandwidth")
Link: https://lkml.kernel.org/r/20190826121633.6538-1-liangyan.peng@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.16: Open-code SCHED_WARN_ON().]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 16d51a590a8ce3befb1308e0e7ab77f3b661af33 upstream.
When going through execve(), zero out the NUMA fault statistics instead of
freeing them.
During execve, the task is reachable through procfs and the scheduler. A
concurrent /proc/*/sched reader can read data from a freed ->numa_faults
allocation (confirmed by KASAN) and write it back to userspace.
I believe that it would also be possible for a use-after-free read to occur
through a race between a NUMA fault and execve(): task_numa_fault() can
lead to task_numa_compare(), which invokes task_weight() on the currently
running task of a different CPU.
Another way to fix this would be to make ->numa_faults RCU-managed or add
extra locking, but it seems easier to wipe the NUMA fault statistics on
execve.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Fixes: 82727018b0d3 ("sched/numa: Call task_numa_free() from do_execve()")
Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.16: adjust filename, context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit cf144f81a99d1a3928f90b0936accfd3f45c9a0a upstream.
Testing padata with the tcrypt module on a 5.2 kernel...
# modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
# modprobe tcrypt mode=211 sec=1
...produces this splat:
INFO: task modprobe:10075 blocked for more than 120 seconds.
Not tainted 5.2.0-base+ #16
modprobe D 0 10075 10064 0x80004080
Call Trace:
? __schedule+0x4dd/0x610
? ring_buffer_unlock_commit+0x23/0x100
schedule+0x6c/0x90
schedule_timeout+0x3b/0x320
? trace_buffer_unlock_commit_regs+0x4f/0x1f0
wait_for_common+0x160/0x1a0
? wake_up_q+0x80/0x80
{ crypto_wait_req } # entries in braces added by hand
{ do_one_aead_op }
{ test_aead_jiffies }
test_aead_speed.constprop.17+0x681/0xf30 [tcrypt]
do_test+0x4053/0x6a2b [tcrypt]
? 0xffffffffa00f4000
tcrypt_mod_init+0x50/0x1000 [tcrypt]
...
The second modprobe command never finishes because in padata_reorder,
CPU0's load of reorder_objects is executed before the unlocking store in
spin_unlock_bh(pd->lock), causing CPU0 to miss CPU1's increment:
CPU0 CPU1
padata_reorder padata_do_serial
LOAD reorder_objects // 0
INC reorder_objects // 1
padata_reorder
TRYLOCK pd->lock // failed
UNLOCK pd->lock
CPU0 deletes the timer before returning from padata_reorder and since no
other job is submitted to padata, modprobe waits indefinitely.
Add a pair of full barriers to guarantee proper ordering:
CPU0 CPU1
padata_reorder padata_do_serial
UNLOCK pd->lock
smp_mb()
LOAD reorder_objects
INC reorder_objects
smp_mb__after_atomic()
padata_reorder
TRYLOCK pd->lock
smp_mb__after_atomic is needed so the read part of the trylock operation
comes after the INC, as Andrea points out. Thanks also to Andrea for
help with writing a litmus test.
Fixes: 16295bec6398 ("padata: Generic parallelization/serialization interface")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit f9070dc94542093fd516ae4ccea17ef46a4362c5 upstream.
The locking in force_sig_info is not prepared to deal with a task that
exits or execs (as sighand may change). The is not a locking problem
in force_sig as force_sig is only built to handle synchronous
exceptions.
Further the function force_sig_info changes the signal state if the
signal is ignored, or blocked or if SIGNAL_UNKILLABLE will prevent the
delivery of the signal. The signal SIGKILL can not be ignored and can
not be blocked and SIGNAL_UNKILLABLE won't prevent it from being
delivered.
So using force_sig rather than send_sig for SIGKILL is confusing
and pointless.
Because it won't impact the sending of the signal and and because
using force_sig is wrong, replace force_sig with send_sig.
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Fixes: cf3f89214ef6 ("pidns: add reboot_pid_ns() to handle the reboot syscall")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 46cc0b44428d0f0e81f11ea98217fc0edfbeab07 upstream.
Current snapshot implementation swaps two ring_buffers even though their
sizes are different from each other, that can cause an inconsistency
between the contents of buffer_size_kb file and the current buffer size.
For example:
# cat buffer_size_kb
7 (expanded: 1408)
# echo 1 > events/enable
# grep bytes per_cpu/cpu0/stats
bytes: 1441020
# echo 1 > snapshot // current:1408, spare:1408
# echo 123 > buffer_size_kb // current:123, spare:1408
# echo 1 > snapshot // current:1408, spare:123
# grep bytes per_cpu/cpu0/stats
bytes: 1443700
# cat buffer_size_kb
123 // != current:1408
And also, a similar per-cpu case hits the following WARNING:
Reproducer:
# echo 1 > per_cpu/cpu0/snapshot
# echo 123 > buffer_size_kb
# echo 1 > per_cpu/cpu0/snapshot
WARNING:
WARNING: CPU: 0 PID: 1946 at kernel/trace/trace.c:1607 update_max_tr_single.part.0+0x2b8/0x380
Modules linked in:
CPU: 0 PID: 1946 Comm: bash Not tainted 5.2.0-rc6 #20
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
RIP: 0010:update_max_tr_single.part.0+0x2b8/0x380
Code: ff e8 dc da f9 ff 0f 0b e9 88 fe ff ff e8 d0 da f9 ff 44 89 ee bf f5 ff ff ff e8 33 dc f9 ff 41 83 fd f5 74 96 e8 b8 da f9 ff <0f> 0b eb 8d e8 af da f9 ff 0f 0b e9 bf fd ff ff e8 a3 da f9 ff 48
RSP: 0018:ffff888063e4fca0 EFLAGS: 00010093
RAX: ffff888066214380 RBX: ffffffff99850fe0 RCX: ffffffff964298a8
RDX: 0000000000000000 RSI: 00000000fffffff5 RDI: 0000000000000005
RBP: 1ffff1100c7c9f96 R08: ffff888066214380 R09: ffffed100c7c9f9b
R10: ffffed100c7c9f9a R11: 0000000000000003 R12: 0000000000000000
R13: 00000000ffffffea R14: ffff888066214380 R15: ffffffff99851060
FS: 00007f9f8173c700(0000) GS:ffff88806d000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000714dc0 CR3: 0000000066fa6000 CR4: 00000000000006f0
Call Trace:
? trace_array_printk_buf+0x140/0x140
? __mutex_lock_slowpath+0x10/0x10
tracing_snapshot_write+0x4c8/0x7f0
? trace_printk_init_buffers+0x60/0x60
? selinux_file_permission+0x3b/0x540
? tracer_preempt_off+0x38/0x506
? trace_printk_init_buffers+0x60/0x60
__vfs_write+0x81/0x100
vfs_write+0x1e1/0x560
ksys_write+0x126/0x250
? __ia32_sys_read+0xb0/0xb0
? do_syscall_64+0x1f/0x390
do_syscall_64+0xc1/0x390
entry_SYSCALL_64_after_hwframe+0x49/0xbe
This patch adds resize_buffer_duplicate_size() to check if there is a
difference between current/spare buffer sizes and resize a spare buffer
if necessary.
Link: http://lkml.kernel.org/r/20190625012910.13109-1-devel@etsukata.com
Fixes: ad909e21bbe69 ("tracing: Add internal tracing_snapshot() functions")
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 1bf72720281770162c87990697eae1ba2f1d917a upstream.
Currently, if the user specifies an unsupported mitigation strategy on the
kernel command line, it will be ignored silently. The code will fall back
to the default strategy, possibly leaving the system more vulnerable than
expected.
This may happen due to e.g. a simple typo, or, for a stable kernel release,
because not all mitigation strategies have been backported.
Inform the user by printing a message.
Fixes: 98af8452945c5565 ("cpu/speculation: Add 'mitigations=' cmdline option")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Link: https://lkml.kernel.org/r/20190516070935.22546-1-geert@linux-m68k.org
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 913a90bc5a3a06b1f04c337320e9aeee2328dd77 upstream.
perf_event_open() limits the sample_period to 63 bits. See:
0819b2e30ccb ("perf: Limit perf_event_attr::sample_period to 63 bits")
Make ioctl() consistent with it.
Also on PowerPC, negative sample_period could cause a recursive
PMIs leading to a hang (reported when running perf-fuzzer).
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: maddy@linux.vnet.ibm.com
Cc: mpe@ellerman.id.au
Fixes: 0819b2e30ccb ("perf: Limit perf_event_attr::sample_period to 63 bits")
Link: https://lkml.kernel.org/r/20190604042953.914-1-ravi.bangoria@linux.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|
|
commit 085ebfe937d7a7a5df1729f35a12d6d655fea68c upstream.
perf_sample_regs_user() uses 'current->mm' to test for the presence of
userspace, but this is insufficient, consider use_mm().
A better test is: '!(current->flags & PF_KTHREAD)', exec() clears
PF_KTHREAD after it sets the new ->mm but before it drops to userspace
for the first time.
Possibly obsoletes: bf05fc25f268 ("powerpc/perf: Fix oops when kthread execs user process")
Reported-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Reported-by: Young Xiao <92siuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 4018994f3d87 ("perf: Add ability to attach user level registers dump to sample")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
|