The CPU hotplug locking was quite messy, with a recursive lock to
handle the fact that both the actual up/down sequence wanted to
protect itself from being re-entered, and the callbacks that it
called also tended to want to protect themselves from CPU events.
This splits the lock into two (one to serialize the whole hotplug
sequence, the other to protect against the CPU present bitmaps
changing). The latter still allows recursive usage because some
subsystems (ondemand policy for cpufreq at least) had already gotten
too used to the lax locking, but the locking mistakes are hopefully
now less fundamental, and we now warn about recursive lock usage
when we see it, in the hope that it can be fixed.
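In rough outline the split looks like this (a simplified sketch; the
owner/depth names are approximate and the real code also handles
interruptible acquisition):

    static DEFINE_MUTEX(cpu_add_remove_lock); /* serializes the whole up/down sequence */
    static DEFINE_MUTEX(cpu_bitmask_lock);    /* guards changes to the CPU bitmaps */

    static struct task_struct *recursive_owner;
    static int recursive_depth;

    void lock_cpu_hotplug(void)
    {
        struct task_struct *tsk = current;

        if (tsk == recursive_owner) {   /* recursion still tolerated... */
            WARN_ON(1);                 /* ...but loudly, so callers get fixed */
            recursive_depth++;
            return;
        }
        mutex_lock(&cpu_bitmask_lock);
        recursive_owner = tsk;
    }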
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
CPUs come online only at init time (unless CONFIG_HOTPLUG_CPU is defined).
So, cpu_notifier functionality needs to be available only at init time.
This patch makes register_cpu_notifier() available only at init time, unless
CONFIG_HOTPLUG_CPU is defined.
This patch exports register_cpu_notifier() and unregister_cpu_notifier() only
if CONFIG_HOTPLUG_CPU is defined.
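The shape of the change, roughly (a sketch; the exact section
annotations differ per config):

    #ifdef CONFIG_HOTPLUG_CPU
    int register_cpu_notifier(struct notifier_block *nb);        /* available any time */
    EXPORT_SYMBOL(register_cpu_notifier);                        /* and to modules */
    #else
    int __init register_cpu_notifier(struct notifier_block *nb); /* init text, freed after boot */
    /* not exported: modules cannot call into freed init code */
    #endif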
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Convert kernel/cpu.c from semaphore to mutex.
I've reviewed all lock_cpu_hotplug() critical sections, and they all seem to
fit mutex semantics.
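The conversion itself is mechanical, e.g.:

    -static DECLARE_MUTEX(cpucontrol);      /* a semaphore, despite the name */
    +static DEFINE_MUTEX(cpucontrol);

    -down(&cpucontrol);
    +mutex_lock(&cpucontrol);

    -up(&cpucontrol);
    +mutex_unlock(&cpucontrol);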
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
The kernel's implementation of notifier chains is unsafe. There is no
protection against entries being added to or removed from a chain while the
chain is in use. The issues were discussed in this thread:
http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2
We noticed that notifier chains in the kernel fall into two basic usage
classes:
"Blocking" chains are always called from a process context
and the callout routines are allowed to sleep;
"Atomic" chains can be called from an atomic context and
the callout routines are not allowed to sleep.
We decided to codify this distinction and make it part of the API. Therefore
this set of patches introduces three new, parallel APIs: one for blocking
notifiers, one for atomic notifiers, and one for "raw" notifiers (which is
really just the old API under a new name). New kinds of data structures are
used for the heads of the chains, and new routines are defined for
registration, unregistration, and calling a chain. The three APIs are
explained in include/linux/notifier.h and their implementation is in
kernel/sys.c.
With atomic and blocking chains, the implementation guarantees that the chain
links will not be corrupted and that chain callers will not get messed up by
entries being added or removed. For raw chains the implementation provides no
guarantees at all; users of this API must provide their own protections. (The
idea was that situations may come up where the assumptions of the atomic and
blocking APIs are not appropriate, so it should be possible for users to
handle these things in their own way.)
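In outline, the three parallel APIs look like this (the chain and
notifier_block names here are hypothetical; see include/linux/notifier.h
for the full set):

    ATOMIC_NOTIFIER_HEAD(my_atomic_chain);     /* spinlock-protected; callouts must not sleep */
    BLOCKING_NOTIFIER_HEAD(my_blocking_chain); /* rwsem-protected; callouts may sleep */
    RAW_NOTIFIER_HEAD(my_raw_chain);           /* no locking; caller provides protection */

    atomic_notifier_chain_register(&my_atomic_chain, &my_nb);
    atomic_notifier_call_chain(&my_atomic_chain, event, data);
    blocking_notifier_chain_register(&my_blocking_chain, &my_nb);
    blocking_notifier_call_chain(&my_blocking_chain, event, data);
    raw_notifier_chain_register(&my_raw_chain, &my_nb);
    raw_notifier_call_chain(&my_raw_chain, event, data);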
There are some limitations, which should not be too hard to live with. For
atomic/blocking chains, registration and unregistration must always be done in
a process context since the chain is protected by a mutex/rwsem. Also, a
callout routine for a non-raw chain must not try to register or unregister
entries on its own chain. (This did happen in a couple of places and the code
had to be changed to avoid it.)
Since atomic chains may be called from within an NMI handler, they cannot use
spinlocks for synchronization. Instead we use RCU. The overhead falls almost
entirely in the unregister routine, which is okay since unregistration is much
less frequent than calling a chain.
Here is the list of chains that we adjusted and their classifications. None
of them use the raw API, so for the moment it is only a placeholder.
ATOMIC CHAINS
-------------
arch/i386/kernel/traps.c: i386die_chain
arch/ia64/kernel/traps.c: ia64die_chain
arch/powerpc/kernel/traps.c: powerpc_die_chain
arch/sparc64/kernel/traps.c: sparc64die_chain
arch/x86_64/kernel/traps.c: die_chain
drivers/char/ipmi/ipmi_si_intf.c: xaction_notifier_list
kernel/panic.c: panic_notifier_list
kernel/profile.c: task_free_notifier
net/bluetooth/hci_core.c: hci_notifier
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_chain
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_expect_chain
net/ipv6/addrconf.c: inet6addr_chain
net/netfilter/nf_conntrack_core.c: nf_conntrack_chain
net/netfilter/nf_conntrack_core.c: nf_conntrack_expect_chain
net/netlink/af_netlink.c: netlink_chain
BLOCKING CHAINS
---------------
arch/powerpc/platforms/pseries/reconfig.c: pSeries_reconfig_chain
arch/s390/kernel/process.c: idle_chain
arch/x86_64/kernel/process.c: idle_notifier
drivers/base/memory.c: memory_chain
drivers/cpufreq/cpufreq.c: cpufreq_policy_notifier_list
drivers/cpufreq/cpufreq.c: cpufreq_transition_notifier_list
drivers/macintosh/adb.c: adb_client_list
drivers/macintosh/via-pmu.c: sleep_notifier_list
drivers/macintosh/via-pmu68k.c: sleep_notifier_list
drivers/macintosh/windfarm_core.c: wf_client_list
drivers/usb/core/notify.c: usb_notifier_list
drivers/video/fbmem.c: fb_notifier_list
kernel/cpu.c: cpu_chain
kernel/module.c: module_notify_list
kernel/profile.c: munmap_notifier
kernel/profile.c: task_exit_notifier
kernel/sys.c: reboot_notifier_list
net/core/dev.c: netdev_chain
net/decnet/dn_dev.c: dnaddr_chain
net/ipv4/devinet.c: inetaddr_chain
It's possible that some of these classifications are wrong. If they are,
please let us know or submit a patch to fix them. Note that any chain that
gets called very frequently should be atomic, because the rwsem read-locking
used for blocking chains is very likely to incur cache misses on SMP systems.
(However, if the chain's callout routines may sleep then the chain cannot be
atomic.)
The patch set was written by Alan Stern and Chandra Seetharaman, incorporating
material written by Keith Owens and suggestions from Paul McKenney and Andrew
Morton.
[jes@sgi.com: restructure the notifier chain initialization macros]
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This changes if() BUG(); constructs to BUG_ON(), which is
cleaner, contains unlikely() and can be better optimized away.
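A typical instance:

    -if (!cpu_online(cpu))
    -        BUG();
    +BUG_ON(!cpu_online(cpu));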
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
|
There are some callers in the cpufreq hotplug notify path where the
lowest-level function calls lock_cpu_hotplug(). The lock is already held
during the cpu_up() and cpu_down() calls when the notify calls are
broadcast to registered clients.
Ideally we could preempt_disable() at the highest caller and make sure we
don't sleep in the path down in the cpufreq->driver_target() calls, but
the calls are so intertwined that they are cumbersome to clean up.
Hence we consistently use lock_cpu_hotplug() and unlock_cpu_hotplug() in
all places.
- Removed export of the cpucontrol semaphore and made it static.
- Removed explicit uses of up/down with lock_cpu_hotplug(),
so we can keep track of the callers in the same thread context and
just keep refcounts without calling a down() that causes a deadlock.
- Removed current_in_hotplug() uses
- Removed PF_HOTPLUG_CPU in sched.h introduced for the current_in_hotplug()
temporary workaround.
Tested with insmod of cpufreq_stats.ko, and logical online/offline
to make sure we don't have any hang situations.
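The refcounting looks roughly like this (a simplified sketch; names
approximate):

    static DECLARE_MUTEX(cpucontrol);
    static struct task_struct *lock_cpu_hotplug_owner;
    static int lock_cpu_hotplug_depth;

    void lock_cpu_hotplug(void)
    {
        if (lock_cpu_hotplug_owner == current) {
            lock_cpu_hotplug_depth++;   /* same thread: just bump the refcount */
            return;                     /* no down(), hence no self-deadlock */
        }
        down(&cpucontrol);
        lock_cpu_hotplug_owner = current;
        lock_cpu_hotplug_depth = 1;
    }

    void unlock_cpu_hotplug(void)
    {
        if (--lock_cpu_hotplug_depth == 0) {
            lock_cpu_hotplug_owner = NULL;
            up(&cpucontrol);
        }
    }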
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Cc: Zwane Mwaikambo <zwane@linuxpower.ca>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
When calling target drivers to set the frequency, we take the cpucontrol
lock. When we modified the code to accommodate CPU hotplug, there was an
attempt to take a double lock of cpucontrol, leading to a deadlock. Since
the current thread context is already holding the cpucontrol lock, we don't
need to make another attempt to acquire it.
Now we leave a trace in current->flags indicating that the current thread
already holds the cpucontrol lock, so we don't attempt to acquire it a
second time.
Thanks to Andrew Morton for the beating:-)
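The flag-based trace is roughly (a sketch; the check lives behind the
current_in_hotplug() helper mentioned in later patches):

    #define current_in_hotplug()  (current->flags & PF_HOTPLUG_CPU)

    void lock_cpu_hotplug(void)
    {
        if (current_in_hotplug())
            return;             /* hotplug thread: cpucontrol already held */
        down(&cpucontrol);
    }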
From: Brice Goglin <Brice.Goglin@ens-lyon.org>
Build fix
(akpm: this patch is still unpleasant. Ashok continues to look for a cleaner
solution, doesn't he? ;))
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Brice Goglin <Brice.Goglin@ens-lyon.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
cpufreq entries in sysfs should only be populated when the CPU is online.
When we boot with maxcpus=x and then bring the other cpus online by echoing
to the sysfs online file, these entries should be created; they should be
destroyed when CPU_DEAD is notified. Same treatment as the cache entries
under sysfs.
We place the processor in the lowest frequency, so hw managed P-State
transitions can still work on the other threads to save power.
The primary goal was just to make these directories appear/disappear
dynamically.
There is one thing in this patch I had to do which I really don't like
myself, but it is probably best if someone handling the cpufreq
infrastructure could give this code the right treatment if this is not
acceptable. I guess it's probably good for the first cut.
- Converting lock_cpu_hotplug()/unlock_cpu_hotplug() to disable/enable
preempt. The locking was smack in the middle of the notification path,
where the hotplug code is already holding the lock. I tried another
solution that avoids taking locks if we know we are in the notification
path, but it was getting very ugly and I decided this was probably good
for this iteration, until someone who understands cpufreq can do a better
job than me.
(akpm: export cpucontrol to GPL modules: drivers/cpufreq/cpufreq_stats.c now
does lock_cpu_hotplug())
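The notifier shape is roughly (a sketch; helper names approximate):

    static int cpufreq_cpu_callback(struct notifier_block *nfb,
                                    unsigned long action, void *hcpu)
    {
        unsigned int cpu = (unsigned long)hcpu;
        struct sys_device *sys_dev = get_cpu_sysdev(cpu);

        switch (action) {
        case CPU_ONLINE:
            cpufreq_add_dev(sys_dev);    /* create the sysfs entries */
            break;
        case CPU_DEAD:
            cpufreq_remove_dev(sys_dev); /* park at lowest freq, remove entries */
            break;
        }
        return NOTIFY_OK;
    }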
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Zwane Mwaikambo <zwane@holomorphy.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
(The i386 CPU hotplug patch provides infrastructure for some work which Pavel
is doing as well as for ACPI S3 (suspend-to-RAM) work which Li Shaohua
<shaohua.li@intel.com> is doing)
The following provides i386 architecture support for safely unregistering and
registering processors during runtime, updated for the current -mm tree. In
order to avoid dumping cpu hotplug code into kernel/irq/* I dropped the
cpu_online check in do_IRQ() by modifying fixup_irqs(). The difference is
that on cpu offline, fixup_irqs() is called before we clear the cpu from
cpu_online_map, followed by a long delay, in order to ensure that we never
have any queued external interrupts on the APICs. There are additional
changes to s390 and ppc64 to account for this change.
1) Add CONFIG_HOTPLUG_CPU
2) disable local APIC timer on dead cpus.
3) Disable preempt around irq balancing to prevent CPUs going down.
4) Print irq stats for all possible cpus.
5) Debugging check for interrupts on offline cpus.
6) Hacky fixup_irqs() to redirect irqs when cpus go off/online.
7) play_dead() for offline cpus to spin inside.
8) Handle offline cpus set in flush_tlb_others().
9) Grab lock earlier in smp_call_function() to prevent CPUs going down.
10) Implement __cpu_disable() and __cpu_die().
11) Enable local interrupts in cpu_enable() after fixup_irqs()
12) Don't fiddle with NMI on dead cpu, but leave intact on other cpus.
13) Program IRQ affinity whilst cpu is still in cpu_online_map on offline.
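Items 2, 6 and 10 combine roughly as follows (a much-simplified sketch
of the i386 __cpu_disable()):

    int __cpu_disable(void)
    {
        cpumask_t map = cpu_online_map;
        int cpu = smp_processor_id();

        if (cpu == 0)
            return -EBUSY;      /* some interrupts can only be serviced by the BSP */

        clear_local_APIC();     /* item 2: stop the local APIC timer */
        cpu_clear(cpu, map);
        fixup_irqs(map);        /* item 6: steer IRQs away while still online */
        cpu_clear(cpu, cpu_online_map);
        return 0;
    }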
Signed-off-by: Zwane Mwaikambo <zwane@linuxpower.ca>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
This patch introduces the concept of (virtual) cputime. Each architecture
can define its method to measure cputime. The main idea is to define a
cputime_t type and a set of operations on it (see asm-generic/cputime.h).
Then use the type for utime, stime, cutime, cstime, it_virt_value,
it_virt_incr, it_prof_value and it_prof_incr and use the cputime operations
for each access to these variables. The default implementation is jiffies
based, and the effect of this patch for architectures which use the default
implementation should be negligible.
There is a second type, cputime64_t, which is necessary for the kernel_stat
cpu statistics. The default cputime_t is 32 bit and based on HZ; this will
overflow after 49.7 days. This is not enough for kernel_stat (imho not
enough for a process either), so it is necessary to have a 64 bit type.
The third thing that gets introduced by this patch is an additional field
for the /proc/stat interface: cpu steal time. An architecture can account
cpu steal time by calls to the account_stealtime function. The cpu which
backs a virtual processor doesn't spend all of its time on the virtual
cpu. To get meaningful cpu usage numbers, this involuntary wait time needs
to be accounted and exported to user space.
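The default, jiffies-based implementation in asm-generic/cputime.h is
essentially a set of trivial typedefs and macros, roughly:

    typedef unsigned long cputime_t;

    #define cputime_zero              (0UL)
    #define cputime_add(__a, __b)     ((__a) + (__b))
    #define cputime_sub(__a, __b)     ((__a) - (__b))
    #define cputime_to_jiffies(__ct)  (__ct)
    #define jiffies_to_cputime(__hz)  (__hz)

    typedef u64 cputime64_t;          /* for kernel_stat: won't wrap at 49.7 days */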
From: Hugh Dickins <hugh@veritas.com>
The p->signal check in account_system_time is insufficient. If the timer
interrupt hits near the end of exit_notify, after EXIT_ZOMBIE has been set,
another cpu may release_task (NULLifying p->signal) in between
account_system_time's check and check_rlimit's dereference. Nor should
account_it_prof risk send_sig. But surely account_user_time is safe?
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Fix (harmless?) smp_processor_id() usage in preemptible section of
cpu_down.
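The hazard, in outline (do_something_with() is hypothetical):

    int cpu = smp_processor_id();   /* running on CPU A...            */
                                    /* ...preempted, resumed on CPU B */
    do_something_with(cpu);         /* 'cpu' is now stale             */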
Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Move hotplug_path[] out of kmod.[ch] to kobject_uevent.[ch] where
it belongs now. At some time in the future we should fix the remaining bad
hotplug calls (no SEQNUM, no netlink uevent):
./drivers/input/input.c (no DEVPATH on some hotplug events!)
./drivers/pnp/pnpbios/core.c
./drivers/s390/crypto/z90main.c
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <greg@kroah.com>
|
|
From: Keshavamurthy Anil S <anil.s.keshavamurthy@intel.com>
Remove cpu_run_sbin_hotplug() - use kobject_hotplug() instead.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <greg@kroah.com>
|
|
Introduce CPU_DOWN_FAILED notifier, so we can cope with a failure after a
CPU_DOWN_PREPARE notice.
This fixes 3/8 "add CPU_DOWN_PREPARE notifier" so that it is actually useful.
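A callback that detaches state at CPU_DOWN_PREPARE can now undo it,
roughly (hypothetical helper names):

    static int my_cpu_callback(struct notifier_block *nb,
                               unsigned long action, void *hcpu)
    {
        long cpu = (long)hcpu;

        switch (action) {
        case CPU_DOWN_PREPARE:
            detach_my_state(cpu);   /* hypothetical: prepare for the CPU to go */
            break;
        case CPU_DOWN_FAILED:
            reattach_my_state(cpu); /* the down was aborted: roll back */
            break;
        }
        return NOTIFY_OK;
    }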
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add a CPU_DOWN_PREPARE hotplug CPU notifier. This is needed so we can detach
all sched-domains before a CPU goes down; thus we can build domains from
online cpumasks, and not have to check for the possibility of a CPU coming up
or going down.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
Add $DEVPATH to the environmental variables during /sbin/hotplug call.
Signed-off-by: Josef 'Jeff' Sipek <jeffpc@optonline.net>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Paul Jackson <pj@sgi.com>
This patch makes cpu_present_map a real map for all configurations, instead of
a constant for non-SMP. It also moves the definition of cpu_present_map out
of kernel/cpu.c into kernel/sched.c, because cpu.c isn't compiled into non-SMP
kernels.
The pattern is that each of the possible, present and online cpu maps are
actual kernel global cpumask_t variables, for all configurations. They are
documented in include/linux/cpumask.h. Some of the UP (NR_CPUS=1) code
cheats, and hardcodes the assumption that the single bit position of these
maps is always set, as an optimization.
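The accessors follow the usual cpumask pattern (cf.
include/linux/cpumask.h):

    extern cpumask_t cpu_possible_map;
    extern cpumask_t cpu_present_map;
    extern cpumask_t cpu_online_map;

    #define cpu_possible(cpu)  cpu_isset((cpu), cpu_possible_map)
    #define cpu_present(cpu)   cpu_isset((cpu), cpu_present_map)
    #define cpu_online(cpu)    cpu_isset((cpu), cpu_online_map)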
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
|
From: Paul Jackson <pj@sgi.com>
With a hotplug capable kernel, there is a requirement to distinguish a
possible CPU from one actually present. The set of possible CPU numbers
doesn't change during a single system boot, but the set of present CPUs
changes as CPUs are physically inserted into or removed from a system. The
cpu_possible_map does not change once initialized at boot, but the
cpu_present_map changes dynamically as CPUs are inserted or removed.
Paul Jackson <pj@sgi.com> provided an expanded explanation:
Ashok's cpu hot plug patch adds a cpu_present_map, resulting in the following
cpu maps being available. All the following maps are fixed size bitmaps of
size NR_CPUS.
#ifdef CONFIG_HOTPLUG_CPU
cpu_possible_map - map with all NR_CPUS bits set
cpu_present_map - map with bit 'cpu' set iff cpu is populated
cpu_online_map - map with bit 'cpu' set iff cpu available to scheduler
#else
cpu_possible_map - map with bit 'cpu' set iff cpu is populated
cpu_present_map - copy of cpu_possible_map
cpu_online_map - map with bit 'cpu' set iff cpu available to scheduler
#endif
In either case, NR_CPUS is fixed at compile time, as the static size of these
bitmaps. The cpu_possible_map is fixed at boot time, as the set of CPU ids
that might ever be plugged in at any time during the life of that system
boot. The cpu_present_map is dynamic(*), representing which CPUs
are currently plugged in. And cpu_online_map is the dynamic subset of
cpu_present_map, indicating those CPUs available for scheduling.
If HOTPLUG is enabled, then cpu_possible_map is forced to have all NR_CPUS
bits set, otherwise it is just the set of CPUs that ACPI reports present at
boot.
If HOTPLUG is enabled, then cpu_present_map varies dynamically, depending on
what ACPI reports as currently plugged in, otherwise cpu_present_map is just a
copy of cpu_possible_map.
(*) Well, cpu_present_map is dynamic in the hotplug case. If not hotplug,
it's the same as cpu_possible_map, hence fixed at boot.
|
|
From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
migrate_all_tasks is currently run with the rest of the machine stopped.
It iterates through the complete task table, turning off the cpu affinity of
any task that it finds affine to the dying cpu. Depending on the task table
size this can take considerable time. All this time the machine is stopped,
doing nothing.
Stopping the machine for such extended periods can be avoided if we do the
task migration in the CPU_DEAD notification, and that's precisely what this
patch does.
The patch puts the idle task at the _front_ of the dying CPU's runqueue at
the highest priority possible. This causes the idle thread to run
_immediately_ after the kstopmachine thread yields. The idle thread notices
that its cpu is offline and dies quickly. Task migration can then be done
at leisure in the CPU_DEAD notification, when the rest of the CPUs are
running.
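In sketch form, the scheduler's hotplug callback gains (signatures
approximate):

    static int migration_call(struct notifier_block *nfb,
                              unsigned long action, void *hcpu)
    {
        int cpu = (long)hcpu;

        switch (action) {
        case CPU_DEAD:
            /* the dead CPU's runqueue is quiescent by now; move its
             * tasks at leisure while the other CPUs keep running */
            migrate_all_tasks(cpu);
            break;
        }
        return NOTIFY_OK;
    }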
Some advantages with this approach are:
- More scalable. Predictable amount of time that the machine is stopped.
- No changes to hot path/core code. We are just exploiting scheduler
rules which run the next high-priority task on the runqueue. Also,
since I put the idle task at the _front_ of the runqueue, there
are no races when an equally high priority task is woken up
and added to the runqueue. It gets in at the back of the runqueue,
_after_ the idle task!
- The cpu_is_offline check that is presently required in try_to_wake_up,
idle_balance and rebalance_tick can be removed, thus speeding them
up a bit.
From: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Rusty mentioned that the unlikely hints against cpu_is_offline are
redundant since the macro already has that hint. The patch below removes
those redundant hints I added.
|
|
Implement cpu_down(): uses stop_machine to freeze the machine, then
uses (arch-specific) __cpu_disable() and migrate_all_tasks().
Whole thing under CONFIG_HOTPLUG_CPU, so doesn't break archs which
don't define that.
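In rough outline (a greatly simplified sketch; locking and error
handling omitted):

    /* runs on the dying CPU with the whole machine stopped */
    static int take_cpu_down(void *unused)
    {
        int err = __cpu_disable();      /* arch-specific: mark the CPU offline */
        if (!err)
            migrate_all_tasks();        /* push every task off this CPU */
        return err;
    }

    int cpu_down(unsigned int cpu)
    {
        struct task_struct *p;

        p = __stop_machine_run(take_cpu_down, NULL, cpu);
        if (IS_ERR(p))
            return PTR_ERR(p);
        __cpu_die(cpu);                 /* arch-specific: wait for it to be gone */
        return 0;
    }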
|
|
The registration and unregistration of CPU notifiers should be done
under the cpucontrol sem. They should also be exported.
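Concretely, something like:

    int register_cpu_notifier(struct notifier_block *nb)
    {
        int ret;

        if ((ret = down_interruptible(&cpucontrol)) != 0)
            return ret;
        ret = notifier_chain_register(&cpu_chain, nb);
        up(&cpucontrol);
        return ret;
    }
    EXPORT_SYMBOL(register_cpu_notifier);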
|
|
From: Jes Sorensen <jes@trained-monkey.org>
I'd like to propose the following for 2.6.1-mm/2.6.2. On systems with a
large number of CPUs the number of printk's flowing by for each CPU
booting starts becoming a real console hog.
The following patch eliminates a couple of them (I already sent a patch to
David for the ia64-specific ones) as well as changing the
"Building zonelist : X" message into "Built Y zonelists". IMHO it doesn't
make any sense to print for each zonelist since it's run in a for loop
running from 0 to Y-1 anyway.
The patch nukes a few new printk's that were introduced with the
scheduler changes to the NUMA code in -mm3; if these are still needed
then I won't fight for that part of the patch.
|
|
From: Jes Sorensen <jes@trained-monkey.org>
The following patch removes a couple of null-ilizers of global variables.
Not a big deal, but every byte helps in the .data segment ;-)
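The point: with the toolchains of the day, an explicit zero initializer
could land the variable in .data, whereas an uninitialized global lives
in the zero-filled .bss for free. Illustratively:

    -static struct notifier_block *cpu_chain = NULL;  /* occupies .data */
    +static struct notifier_block *cpu_chain;         /* .bss, zeroed at boot */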
|
|
Trivial patch: when these were introduced cpu.h didn't exist.
|
|
Patch from Dipankar Sarma <dipankar@in.ibm.com>
This is Manfred's patch which provides a CPU_UP_PREPARE cpu notifier to
allow initialization of per_cpu data just before the cpu becomes fully
functional.
It also provides a facility for the CPU_UP_PREPARE handler to return
NOTIFY_BAD to signify that the CPU is not permitted to come up. If
that happens, a CPU_UP_CANCELED message is passed to all the handlers.
The patch also fixes a bogus NOTIFY_BAD return from the softirq setup
code.
Patch has been acked by Rusty.
We need this mechanism in slab for starting per-cpu timers and for
allocating the per-cpu slab head arrays *before* the CPU has come up
and started using slab.
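A handler using the new facility looks roughly like this (hypothetical
helper names):

    static int my_cpu_callback(struct notifier_block *nb,
                               unsigned long action, void *hcpu)
    {
        long cpu = (long)hcpu;

        switch (action) {
        case CPU_UP_PREPARE:
            if (alloc_my_percpu_state(cpu))  /* before the CPU runs anything */
                return NOTIFY_BAD;           /* veto: CPU_UP_CANCELED follows */
            break;
        case CPU_UP_CANCELED:
            free_my_percpu_state(cpu);       /* some handler vetoed the bring-up */
            break;
        }
        return NOTIFY_OK;
    }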
|
|
This patch alters the boot sequence to "plug in" each CPU, one at a
time. You need the patch for each architecture, as well. The
interface used to be "smp_boot_cpus()", "smp_commence()", and each
arch implemented the "maxcpus" boot arg itself. With this patch,
it is:
smp_prepare_cpus(maxcpus): probe for cpus and set up cpu_possible(cpu).
__cpu_up(cpu): called *after* initcalls, for each cpu where
cpu_possible(cpu) is true.
smp_cpus_done(maxcpus): called after every cpu has been brought up.
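Compressed into one sketch (cf. the smp_init() loop in init/main.c;
parameterized here for clarity, the real smp_init() takes no arguments):

    static void __init smp_init(unsigned int max_cpus)
    {
        unsigned int cpu;

        smp_prepare_cpus(max_cpus);     /* arch probes and sets cpu_possible() */

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
            if (num_online_cpus() >= max_cpus)
                break;
            if (cpu_possible(cpu) && !cpu_online(cpu))
                cpu_up(cpu);            /* ends in arch __cpu_up(cpu) */
        }

        smp_cpus_done(max_cpus);
    }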
|