From f9fa0bc1fabe1d861e46d80ecbe7e85da359195c Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Fri, 8 Apr 2011 10:53:46 -0700 Subject: signal.c: fix erroneous syscall kernel-doc Fix erroneous syscall kernel-doc comments in kernel/signal.c. Reported-by: Matt Fleming Signed-off-by: Randy Dunlap Signed-off-by: Linus Torvalds --- kernel/signal.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index 29e233fd7a0f..7165af5f1b11 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2711,8 +2711,8 @@ out: /** * sys_rt_sigaction - alter an action taken by a process * @sig: signal to be sent - * @act: the thread group ID of the thread - * @oact: the PID of the thread + * @act: new sigaction + * @oact: used to save the previous sigaction * @sigsetsize: size of sigset_t type */ SYSCALL_DEFINE4(rt_sigaction, int, sig, -- cgit v1.2.3 From b0432d8f162c7d5d9537b4cb749d44076b76a783 Mon Sep 17 00:00:00 2001 From: Ken Chen Date: Thu, 7 Apr 2011 17:23:22 -0700 Subject: sched: Fix sched-domain avg_load calculation In function find_busiest_group(), the sched-domain avg_load isn't calculated at all if there is a group imbalance within the domain. This will cause erroneous imbalance calculation. The reason is that calculate_imbalance() sees sds->avg_load = 0 and it will dump entire sds->max_load into imbalance variable, which is used later on to migrate entire load from busiest CPU to the puller CPU. This has two really bad effect: 1. stampede of task migration, and they won't be able to break out of the bad state because of positive feedback loop: large load delta -> heavier load migration -> larger imbalance and the cycle goes on. 2. severe imbalance in CPU queue depth. This causes really long scheduling latency blip which affects badly on application that has tight latency requirement. The fix is to have kernel calculate domain avg_load in both cases. This will ensure that imbalance calculation is always sensible and the target is usually half way between busiest and puller CPU. Signed-off-by: Ken Chen Signed-off-by: Peter Zijlstra Cc: Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.com Signed-off-by: Ingo Molnar --- kernel/sched_fair.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index 7f00772e57c9..60f9d407c5ec 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -3127,6 +3127,8 @@ find_busiest_group(struct sched_domain *sd, int this_cpu, if (!sds.busiest || sds.busiest_nr_running == 0) goto out_balanced; + sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr; + /* * If the busiest group is imbalanced the below checks don't * work because they assumes all things are equal, which typically @@ -3151,7 +3153,6 @@ find_busiest_group(struct sched_domain *sd, int this_cpu, * Don't pull any tasks if this group is already above the domain * average load. */ - sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr; if (sds.this_load >= sds.avg_load) goto out_balanced; -- cgit v1.2.3 From b30aef17f71cf9e24b10c11cbb5e5f0ebe8a85ab Mon Sep 17 00:00:00 2001 From: Ken Chen Date: Fri, 8 Apr 2011 12:20:16 -0700 Subject: sched: Fix erroneous all_pinned logic The scheduler load balancer has specific code to deal with cases of unbalanced system due to lots of unmovable tasks (for example because of hard CPU affinity). In those situation, it excludes the busiest CPU that has pinned tasks for load balance consideration such that it can perform second 2nd load balance pass on the rest of the system. This all works as designed if there is only one cgroup in the system. However, when we have multiple cgroups, this logic has false positives and triggers multiple load balance passes despite there are actually no pinned tasks at all. The reason it has false positives is that the all pinned logic is deep in the lowest function of can_migrate_task() and is too low level: load_balance_fair() iterates each task group and calls balance_tasks() to migrate target load. Along the way, balance_tasks() will also set a all_pinned variable. Given that task-groups are iterated, this all_pinned variable is essentially the status of last group in the scanning process. Task group can have number of reasons that no load being migrated, none due to cpu affinity. However, this status bit is being propagated back up to the higher level load_balance(), which incorrectly think that no tasks were moved. It kick off the all pinned logic and start multiple passes attempt to move load onto puller CPU. To fix this, move the all_pinned aggregation up at the iterator level. This ensures that the status is aggregated over all task-groups, not just last one in the list. Signed-off-by: Ken Chen Cc: stable@kernel.org Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/BANLkTi=ernzNawaR5tJZEsV_QVnfxqXmsQ@mail.gmail.com Signed-off-by: Ingo Molnar --- kernel/sched_fair.c | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index 60f9d407c5ec..6fa833ab2cb8 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -2104,21 +2104,20 @@ balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest, enum cpu_idle_type idle, int *all_pinned, int *this_best_prio, struct cfs_rq *busiest_cfs_rq) { - int loops = 0, pulled = 0, pinned = 0; + int loops = 0, pulled = 0; long rem_load_move = max_load_move; struct task_struct *p, *n; if (max_load_move == 0) goto out; - pinned = 1; - list_for_each_entry_safe(p, n, &busiest_cfs_rq->tasks, se.group_node) { if (loops++ > sysctl_sched_nr_migrate) break; if ((p->se.load.weight >> 1) > rem_load_move || - !can_migrate_task(p, busiest, this_cpu, sd, idle, &pinned)) + !can_migrate_task(p, busiest, this_cpu, sd, idle, + all_pinned)) continue; pull_task(busiest, p, this_rq, this_cpu); @@ -2153,9 +2152,6 @@ out: */ schedstat_add(sd, lb_gained[idle], pulled); - if (all_pinned) - *all_pinned = pinned; - return max_load_move - rem_load_move; } @@ -3341,6 +3337,7 @@ redo: * still unbalanced. ld_moved simply stays zero, so it is * correctly treated as an imbalance. */ + all_pinned = 1; local_irq_save(flags); double_rq_lock(this_rq, busiest); ld_moved = move_tasks(this_rq, this_cpu, busiest, -- cgit v1.2.3 From 1f112cee07b314e244ee9e71d9c1e6950dc13327 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Mon, 11 Apr 2011 22:54:42 +0200 Subject: PM / Hibernate: Introduce CONFIG_HIBERNATE_CALLBACKS Xen save/restore is going to use hibernate device callbacks for quiescing devices and putting them back to normal operations and it would need to select CONFIG_HIBERNATION for this purpose. However, that also would cause the hibernate interfaces for user space to be enabled, which might confuse user space, because the Xen kernels don't support hibernation. Moreover, it would be wasteful, as it would make the Xen kernels include a substantial amount of code that they would never use. To address this issue introduce new power management Kconfig option CONFIG_HIBERNATE_CALLBACKS, such that it will only select the code that is necessary for the hibernate device callbacks to work and make CONFIG_HIBERNATION select it. Then, Xen save/restore will be able to select CONFIG_HIBERNATE_CALLBACKS without dragging the entire hibernate code along with it. Signed-off-by: Rafael J. Wysocki Tested-by: Shriram Rajagopalan --- arch/powerpc/kernel/ibmebus.c | 6 +++--- drivers/amba/bus.c | 6 +++--- drivers/base/platform.c | 6 +++--- drivers/base/power/main.c | 8 ++++---- drivers/pci/pci-driver.c | 6 +++--- drivers/xen/manage.c | 6 +++--- include/linux/suspend.h | 11 +++-------- kernel/power/Kconfig | 6 +++++- 8 files changed, 27 insertions(+), 28 deletions(-) (limited to 'kernel') diff --git a/arch/powerpc/kernel/ibmebus.c b/arch/powerpc/kernel/ibmebus.c index c00d4ca1ee15..28581f1ad2c0 100644 --- a/arch/powerpc/kernel/ibmebus.c +++ b/arch/powerpc/kernel/ibmebus.c @@ -527,7 +527,7 @@ static int ibmebus_bus_pm_resume_noirq(struct device *dev) #endif /* !CONFIG_SUSPEND */ -#ifdef CONFIG_HIBERNATION +#ifdef CONFIG_HIBERNATE_CALLBACKS static int ibmebus_bus_pm_freeze(struct device *dev) { @@ -665,7 +665,7 @@ static int ibmebus_bus_pm_restore_noirq(struct device *dev) return ret; } -#else /* !CONFIG_HIBERNATION */ +#else /* !CONFIG_HIBERNATE_CALLBACKS */ #define ibmebus_bus_pm_freeze NULL #define ibmebus_bus_pm_thaw NULL @@ -676,7 +676,7 @@ static int ibmebus_bus_pm_restore_noirq(struct device *dev) #define ibmebus_bus_pm_poweroff_noirq NULL #define ibmebus_bus_pm_restore_noirq NULL -#endif /* !CONFIG_HIBERNATION */ +#endif /* !CONFIG_HIBERNATE_CALLBACKS */ static struct dev_pm_ops ibmebus_bus_dev_pm_ops = { .prepare = ibmebus_bus_pm_prepare, diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c index 821040503154..7025593a58c8 100644 --- a/drivers/amba/bus.c +++ b/drivers/amba/bus.c @@ -214,7 +214,7 @@ static int amba_pm_resume_noirq(struct device *dev) #endif /* !CONFIG_SUSPEND */ -#ifdef CONFIG_HIBERNATION +#ifdef CONFIG_HIBERNATE_CALLBACKS static int amba_pm_freeze(struct device *dev) { @@ -352,7 +352,7 @@ static int amba_pm_restore_noirq(struct device *dev) return ret; } -#else /* !CONFIG_HIBERNATION */ +#else /* !CONFIG_HIBERNATE_CALLBACKS */ #define amba_pm_freeze NULL #define amba_pm_thaw NULL @@ -363,7 +363,7 @@ static int amba_pm_restore_noirq(struct device *dev) #define amba_pm_poweroff_noirq NULL #define amba_pm_restore_noirq NULL -#endif /* !CONFIG_HIBERNATION */ +#endif /* !CONFIG_HIBERNATE_CALLBACKS */ #ifdef CONFIG_PM diff --git a/drivers/base/platform.c b/drivers/base/platform.c index f051cfff18af..92792313d24a 100644 --- a/drivers/base/platform.c +++ b/drivers/base/platform.c @@ -771,7 +771,7 @@ int __weak platform_pm_resume_noirq(struct device *dev) #endif /* !CONFIG_SUSPEND */ -#ifdef CONFIG_HIBERNATION +#ifdef CONFIG_HIBERNATE_CALLBACKS static int platform_pm_freeze(struct device *dev) { @@ -909,7 +909,7 @@ static int platform_pm_restore_noirq(struct device *dev) return ret; } -#else /* !CONFIG_HIBERNATION */ +#else /* !CONFIG_HIBERNATE_CALLBACKS */ #define platform_pm_freeze NULL #define platform_pm_thaw NULL @@ -920,7 +920,7 @@ static int platform_pm_restore_noirq(struct device *dev) #define platform_pm_poweroff_noirq NULL #define platform_pm_restore_noirq NULL -#endif /* !CONFIG_HIBERNATION */ +#endif /* !CONFIG_HIBERNATE_CALLBACKS */ #ifdef CONFIG_PM_RUNTIME diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index 052dc53eef38..fbc5b6e7c591 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -233,7 +233,7 @@ static int pm_op(struct device *dev, } break; #endif /* CONFIG_SUSPEND */ -#ifdef CONFIG_HIBERNATION +#ifdef CONFIG_HIBERNATE_CALLBACKS case PM_EVENT_FREEZE: case PM_EVENT_QUIESCE: if (ops->freeze) { @@ -260,7 +260,7 @@ static int pm_op(struct device *dev, suspend_report_result(ops->restore, error); } break; -#endif /* CONFIG_HIBERNATION */ +#endif /* CONFIG_HIBERNATE_CALLBACKS */ default: error = -EINVAL; } @@ -308,7 +308,7 @@ static int pm_noirq_op(struct device *dev, } break; #endif /* CONFIG_SUSPEND */ -#ifdef CONFIG_HIBERNATION +#ifdef CONFIG_HIBERNATE_CALLBACKS case PM_EVENT_FREEZE: case PM_EVENT_QUIESCE: if (ops->freeze_noirq) { @@ -335,7 +335,7 @@ static int pm_noirq_op(struct device *dev, suspend_report_result(ops->restore_noirq, error); } break; -#endif /* CONFIG_HIBERNATION */ +#endif /* CONFIG_HIBERNATE_CALLBACKS */ default: error = -EINVAL; } diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index d86ea8b01137..135df164a4c1 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -781,7 +781,7 @@ static int pci_pm_resume(struct device *dev) #endif /* !CONFIG_SUSPEND */ -#ifdef CONFIG_HIBERNATION +#ifdef CONFIG_HIBERNATE_CALLBACKS static int pci_pm_freeze(struct device *dev) { @@ -970,7 +970,7 @@ static int pci_pm_restore(struct device *dev) return error; } -#else /* !CONFIG_HIBERNATION */ +#else /* !CONFIG_HIBERNATE_CALLBACKS */ #define pci_pm_freeze NULL #define pci_pm_freeze_noirq NULL @@ -981,7 +981,7 @@ static int pci_pm_restore(struct device *dev) #define pci_pm_restore NULL #define pci_pm_restore_noirq NULL -#endif /* !CONFIG_HIBERNATION */ +#endif /* !CONFIG_HIBERNATE_CALLBACKS */ #ifdef CONFIG_PM_RUNTIME diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c index 95143dd6904d..1ac94125bf93 100644 --- a/drivers/xen/manage.c +++ b/drivers/xen/manage.c @@ -61,7 +61,7 @@ static void xen_post_suspend(int cancelled) xen_mm_unpin_all(); } -#ifdef CONFIG_HIBERNATION +#ifdef CONFIG_HIBERNATE_CALLBACKS static int xen_suspend(void *data) { struct suspend_info *si = data; @@ -173,7 +173,7 @@ out: #endif shutting_down = SHUTDOWN_INVALID; } -#endif /* CONFIG_HIBERNATION */ +#endif /* CONFIG_HIBERNATE_CALLBACKS */ struct shutdown_handler { const char *command; @@ -202,7 +202,7 @@ static void shutdown_handler(struct xenbus_watch *watch, { "poweroff", do_poweroff }, { "halt", do_poweroff }, { "reboot", do_reboot }, -#ifdef CONFIG_HIBERNATION +#ifdef CONFIG_HIBERNATE_CALLBACKS { "suspend", do_suspend }, #endif {NULL, NULL}, diff --git a/include/linux/suspend.h b/include/linux/suspend.h index 5a89e3612875..083ffea7ba18 100644 --- a/include/linux/suspend.h +++ b/include/linux/suspend.h @@ -249,6 +249,8 @@ extern void hibernation_set_ops(const struct platform_hibernation_ops *ops); extern int hibernate(void); extern bool system_entering_hibernation(void); #else /* CONFIG_HIBERNATION */ +static inline void register_nosave_region(unsigned long b, unsigned long e) {} +static inline void register_nosave_region_late(unsigned long b, unsigned long e) {} static inline int swsusp_page_is_forbidden(struct page *p) { return 0; } static inline void swsusp_set_page_free(struct page *p) {} static inline void swsusp_unset_page_free(struct page *p) {} @@ -297,14 +299,7 @@ static inline bool pm_wakeup_pending(void) { return false; } extern struct mutex pm_mutex; -#ifndef CONFIG_HIBERNATION -static inline void register_nosave_region(unsigned long b, unsigned long e) -{ -} -static inline void register_nosave_region_late(unsigned long b, unsigned long e) -{ -} - +#ifndef CONFIG_HIBERNATE_CALLBACKS static inline void lock_system_sleep(void) {} static inline void unlock_system_sleep(void) {} diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig index 4603f08dc47b..049791468d37 100644 --- a/kernel/power/Kconfig +++ b/kernel/power/Kconfig @@ -18,9 +18,13 @@ config SUSPEND_FREEZER Turning OFF this setting is NOT recommended! If in doubt, say Y. +config HIBERNATE_CALLBACKS + bool + config HIBERNATION bool "Hibernation (aka 'suspend to disk')" depends on SWAP && ARCH_HIBERNATION_POSSIBLE + select HIBERNATE_CALLBACKS select LZO_COMPRESS select LZO_DECOMPRESS ---help--- @@ -85,7 +89,7 @@ config PM_STD_PARTITION config PM_SLEEP def_bool y - depends on SUSPEND || HIBERNATION || XEN_SAVE_RESTORE + depends on SUSPEND || HIBERNATE_CALLBACKS || XEN_SAVE_RESTORE config PM_SLEEP_SMP def_bool y -- cgit v1.2.3 From d419e4c0f7584ffc5c72d9aeeaac485cc756ebcf Mon Sep 17 00:00:00 2001 From: Shriram Rajagopalan Date: Mon, 11 Apr 2011 22:54:48 +0200 Subject: fix XEN_SAVE_RESTORE Kconfig dependencies Make XEN_SAVE_RESTORE select HIBERNATE_CALLBACKS. Remove XEN_SAVE_RESTORE dependency from PM_SLEEP. Signed-off-by: Shriram Rajagopalan Acked-by: Ian Campbell Signed-off-by: Rafael J. Wysocki --- arch/x86/xen/Kconfig | 1 + kernel/power/Kconfig | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig index 1c7121ba18ff..5cc821cb2e09 100644 --- a/arch/x86/xen/Kconfig +++ b/arch/x86/xen/Kconfig @@ -39,6 +39,7 @@ config XEN_MAX_DOMAIN_MEMORY config XEN_SAVE_RESTORE bool depends on XEN + select HIBERNATE_CALLBACKS default y config XEN_DEBUG_FS diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig index 049791468d37..6de9a8fc3417 100644 --- a/kernel/power/Kconfig +++ b/kernel/power/Kconfig @@ -89,7 +89,7 @@ config PM_STD_PARTITION config PM_SLEEP def_bool y - depends on SUSPEND || HIBERNATE_CALLBACKS || XEN_SAVE_RESTORE + depends on SUSPEND || HIBERNATE_CALLBACKS config PM_SLEEP_SMP def_bool y -- cgit v1.2.3 From d9c97833179036408e53ef5f3f5c7eaf781769bc Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Tue, 12 Apr 2011 10:06:33 +0200 Subject: block: remove block_unplug_timer() trace point We no longer have an unplug timer running, so no point in keeping the trace point. Signed-off-by: Jens Axboe --- include/trace/events/block.h | 14 -------------- kernel/trace/blktrace.c | 17 ----------------- 2 files changed, 31 deletions(-) (limited to 'kernel') diff --git a/include/trace/events/block.h b/include/trace/events/block.h index 78f18adb49c8..43a985390bb6 100644 --- a/include/trace/events/block.h +++ b/include/trace/events/block.h @@ -418,20 +418,6 @@ DECLARE_EVENT_CLASS(block_unplug, TP_printk("[%s] %d", __entry->comm, __entry->nr_rq) ); -/** - * block_unplug_timer - timed release of operations requests in queue to device driver - * @q: request queue to unplug - * - * Unplug the request queue @q because a timer expired and allow block - * operation requests to be sent to the device driver. - */ -DEFINE_EVENT(block_unplug, block_unplug_timer, - - TP_PROTO(struct request_queue *q), - - TP_ARGS(q) -); - /** * block_unplug_io - release of operations requests in request queue * @q: request queue to unplug diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c index 7aa40f8e182d..824708cbfb7b 100644 --- a/kernel/trace/blktrace.c +++ b/kernel/trace/blktrace.c @@ -863,19 +863,6 @@ static void blk_add_trace_unplug_io(void *ignore, struct request_queue *q) } } -static void blk_add_trace_unplug_timer(void *ignore, struct request_queue *q) -{ - struct blk_trace *bt = q->blk_trace; - - if (bt) { - unsigned int pdu = q->rq.count[READ] + q->rq.count[WRITE]; - __be64 rpdu = cpu_to_be64(pdu); - - __blk_add_trace(bt, 0, 0, 0, BLK_TA_UNPLUG_TIMER, 0, - sizeof(rpdu), &rpdu); - } -} - static void blk_add_trace_split(void *ignore, struct request_queue *q, struct bio *bio, unsigned int pdu) @@ -1015,8 +1002,6 @@ static void blk_register_tracepoints(void) WARN_ON(ret); ret = register_trace_block_plug(blk_add_trace_plug, NULL); WARN_ON(ret); - ret = register_trace_block_unplug_timer(blk_add_trace_unplug_timer, NULL); - WARN_ON(ret); ret = register_trace_block_unplug_io(blk_add_trace_unplug_io, NULL); WARN_ON(ret); ret = register_trace_block_split(blk_add_trace_split, NULL); @@ -1033,7 +1018,6 @@ static void blk_unregister_tracepoints(void) unregister_trace_block_bio_remap(blk_add_trace_bio_remap, NULL); unregister_trace_block_split(blk_add_trace_split, NULL); unregister_trace_block_unplug_io(blk_add_trace_unplug_io, NULL); - unregister_trace_block_unplug_timer(blk_add_trace_unplug_timer, NULL); unregister_trace_block_plug(blk_add_trace_plug, NULL); unregister_trace_block_sleeprq(blk_add_trace_sleeprq, NULL); unregister_trace_block_getrq(blk_add_trace_getrq, NULL); @@ -1348,7 +1332,6 @@ static const struct { [__BLK_TA_COMPLETE] = {{ "C", "complete" }, blk_log_with_error }, [__BLK_TA_PLUG] = {{ "P", "plug" }, blk_log_plug }, [__BLK_TA_UNPLUG_IO] = {{ "U", "unplug_io" }, blk_log_unplug }, - [__BLK_TA_UNPLUG_TIMER] = {{ "UT", "unplug_timer" }, blk_log_unplug }, [__BLK_TA_INSERT] = {{ "I", "insert" }, blk_log_generic }, [__BLK_TA_SPLIT] = {{ "X", "split" }, blk_log_split }, [__BLK_TA_BOUNCE] = {{ "B", "bounce" }, blk_log_generic }, -- cgit v1.2.3 From 94b5eb28b41cc79d9713696e0005ae167b5afd1b Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Tue, 12 Apr 2011 10:12:19 +0200 Subject: block: fixup block IO unplug trace call It was removed with the on-stack plugging, readd it and track the depth of requests added when flushing the plug. Signed-off-by: Jens Axboe --- block/blk-core.c | 15 +++++++++++++-- include/trace/events/block.h | 11 ++++++----- kernel/trace/blktrace.c | 6 +++--- 3 files changed, 22 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/block/blk-core.c b/block/blk-core.c index eeaca0998df5..d20ce1e849c8 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2668,12 +2668,19 @@ static int plug_rq_cmp(void *priv, struct list_head *a, struct list_head *b) return !(rqa->q <= rqb->q); } +static void queue_unplugged(struct request_queue *q, unsigned int depth) +{ + trace_block_unplug_io(q, depth); + __blk_run_queue(q, false); +} + static void flush_plug_list(struct blk_plug *plug) { struct request_queue *q; unsigned long flags; struct request *rq; LIST_HEAD(list); + unsigned int depth; BUG_ON(plug->magic != PLUG_MAGIC); @@ -2688,6 +2695,7 @@ static void flush_plug_list(struct blk_plug *plug) } q = NULL; + depth = 0; local_irq_save(flags); while (!list_empty(&list)) { rq = list_entry_rq(list.next); @@ -2696,10 +2704,11 @@ static void flush_plug_list(struct blk_plug *plug) BUG_ON(!rq->q); if (rq->q != q) { if (q) { - __blk_run_queue(q, false); + queue_unplugged(q, depth); spin_unlock(q->queue_lock); } q = rq->q; + depth = 0; spin_lock(q->queue_lock); } rq->cmd_flags &= ~REQ_ON_PLUG; @@ -2711,10 +2720,12 @@ static void flush_plug_list(struct blk_plug *plug) __elv_add_request(q, rq, ELEVATOR_INSERT_FLUSH); else __elv_add_request(q, rq, ELEVATOR_INSERT_SORT_MERGE); + + depth++; } if (q) { - __blk_run_queue(q, false); + queue_unplugged(q, depth); spin_unlock(q->queue_lock); } diff --git a/include/trace/events/block.h b/include/trace/events/block.h index 43a985390bb6..006e60b58306 100644 --- a/include/trace/events/block.h +++ b/include/trace/events/block.h @@ -401,9 +401,9 @@ TRACE_EVENT(block_plug, DECLARE_EVENT_CLASS(block_unplug, - TP_PROTO(struct request_queue *q), + TP_PROTO(struct request_queue *q, unsigned int depth), - TP_ARGS(q), + TP_ARGS(q, depth), TP_STRUCT__entry( __field( int, nr_rq ) @@ -411,7 +411,7 @@ DECLARE_EVENT_CLASS(block_unplug, ), TP_fast_assign( - __entry->nr_rq = q->rq.count[READ] + q->rq.count[WRITE]; + __entry->nr_rq = depth; memcpy(__entry->comm, current->comm, TASK_COMM_LEN); ), @@ -421,15 +421,16 @@ DECLARE_EVENT_CLASS(block_unplug, /** * block_unplug_io - release of operations requests in request queue * @q: request queue to unplug + * @depth: number of requests just added to the queue * * Unplug request queue @q because device driver is scheduled to work * on elements in the request queue. */ DEFINE_EVENT(block_unplug, block_unplug_io, - TP_PROTO(struct request_queue *q), + TP_PROTO(struct request_queue *q, unsigned int depth), - TP_ARGS(q) + TP_ARGS(q, depth) ); /** diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c index 824708cbfb7b..3e3970d53d14 100644 --- a/kernel/trace/blktrace.c +++ b/kernel/trace/blktrace.c @@ -850,13 +850,13 @@ static void blk_add_trace_plug(void *ignore, struct request_queue *q) __blk_add_trace(bt, 0, 0, 0, BLK_TA_PLUG, 0, 0, NULL); } -static void blk_add_trace_unplug_io(void *ignore, struct request_queue *q) +static void blk_add_trace_unplug_io(void *ignore, struct request_queue *q, + unsigned int depth) { struct blk_trace *bt = q->blk_trace; if (bt) { - unsigned int pdu = q->rq.count[READ] + q->rq.count[WRITE]; - __be64 rpdu = cpu_to_be64(pdu); + __be64 rpdu = cpu_to_be64(depth); __blk_add_trace(bt, 0, 0, 0, BLK_TA_UNPLUG_IO, 0, sizeof(rpdu), &rpdu); -- cgit v1.2.3 From 6631e635c65dc33cb798cc2f51d0ddd69ada6319 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Wed, 13 Apr 2011 08:08:20 -0700 Subject: block: don't flush plugged IO on forced preemtion scheduling We really only want to unplug the pending IO when the process actually goes to sleep. So move the test for flushing the plug up to the place where we actually deactivate the task - where we have properly checked for preemption and for the process really sleeping. Acked-by: Jens Axboe Acked-by: Peter Zijlstra Signed-off-by: Linus Torvalds --- kernel/sched.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/sched.c b/kernel/sched.c index 48013633d792..a187c3fe027b 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -4111,20 +4111,20 @@ need_resched: try_to_wake_up_local(to_wakeup); } deactivate_task(rq, prev, DEQUEUE_SLEEP); + + /* + * If we are going to sleep and we have plugged IO queued, make + * sure to submit it to avoid deadlocks. + */ + if (blk_needs_flush_plug(prev)) { + raw_spin_unlock(&rq->lock); + blk_flush_plug(prev); + raw_spin_lock(&rq->lock); + } } switch_count = &prev->nvcsw; } - /* - * If we are going to sleep and we have plugged IO queued, make - * sure to submit it to avoid deadlocks. - */ - if (prev->state != TASK_RUNNING && blk_needs_flush_plug(prev)) { - raw_spin_unlock(&rq->lock); - blk_flush_plug(prev); - raw_spin_lock(&rq->lock); - } - pre_schedule(rq, prev); if (unlikely(!rq->nr_running)) -- cgit v1.2.3 From 0cd9c6494ee5c19aef085152bc37f3a4e774a9e1 Mon Sep 17 00:00:00 2001 From: Darren Hart Date: Thu, 14 Apr 2011 15:41:57 -0700 Subject: futex: Set FLAGS_HAS_TIMEOUT during futex_wait restart setup The FLAGS_HAS_TIMEOUT flag was not getting set, causing the restart_block to restart futex_wait() without a timeout after a signal. Commit b41277dc7a18ee332d in 2.6.38 introduced the regression by accidentally removing the the FLAGS_HAS_TIMEOUT assignment from futex_wait() during the setup of the restart block. Restore the originaly behavior. Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=32922 Reported-by: Tim Smith Reported-by: Torsten Hilbrich Signed-off-by: Darren Hart Signed-off-by: Eric Dumazet Cc: Peter Zijlstra Cc: John Kacur Cc: stable@kernel.org Link: http://lkml.kernel.org/r/%3Cdaac0eb3af607f72b9a4d3126b2ba8fb5ed3b883.1302820917.git.dvhart%40linux.intel.com%3E Signed-off-by: Thomas Gleixner --- kernel/futex.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/futex.c b/kernel/futex.c index dfb924ffe65b..fe28dc282eae 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1886,7 +1886,7 @@ retry: restart->futex.val = val; restart->futex.time = abs_time->tv64; restart->futex.bitset = bitset; - restart->futex.flags = flags; + restart->futex.flags = flags | FLAGS_HAS_TIMEOUT; ret = -ERESTART_RESTARTBLOCK; -- cgit v1.2.3 From a237c1c5bc5dc5c76a21be922dca4826f3eca8ca Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Sat, 16 Apr 2011 13:27:55 +0200 Subject: block: let io_schedule() flush the plug inline Linus correctly observes that the most important dispatch cases are now done from kblockd, this isn't ideal for latency reasons. The original reason for switching dispatches out-of-line was to avoid too deep a stack, so by _only_ letting the "accidental" flush directly in schedule() be guarded by offload to kblockd, we should be able to get the best of both worlds. So add a blk_schedule_flush_plug() that offloads to kblockd, and only use that from the schedule() path. Signed-off-by: Jens Axboe --- include/linux/blkdev.h | 13 +++++++++++++ kernel/sched.c | 2 +- 2 files changed, 14 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 1c76506fcf11..ec0357d8c4a5 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -871,6 +871,14 @@ static inline void blk_flush_plug(struct task_struct *tsk) { struct blk_plug *plug = tsk->plug; + if (plug) + blk_flush_plug_list(plug, false); +} + +static inline void blk_schedule_flush_plug(struct task_struct *tsk) +{ + struct blk_plug *plug = tsk->plug; + if (plug) blk_flush_plug_list(plug, true); } @@ -1317,6 +1325,11 @@ static inline void blk_flush_plug(struct task_struct *task) { } +static inline void blk_schedule_flush_plug(struct task_struct *task) +{ +} + + static inline bool blk_needs_flush_plug(struct task_struct *tsk) { return false; diff --git a/kernel/sched.c b/kernel/sched.c index a187c3fe027b..312f8b95c2d4 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -4118,7 +4118,7 @@ need_resched: */ if (blk_needs_flush_plug(prev)) { raw_spin_unlock(&rq->lock); - blk_flush_plug(prev); + blk_schedule_flush_plug(prev); raw_spin_lock(&rq->lock); } } -- cgit v1.2.3 From 49cac01e1fa74174d72adb0e872504a7fefd7c01 Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Sat, 16 Apr 2011 13:51:05 +0200 Subject: block: make unplug timer trace event correspond to the schedule() unplug It's a pretty close match to what we had before - the timer triggering would mean that nobody unplugged the plug in due time, in the new scheme this matches very closely what the schedule() unplug now is. It's essentially the difference between an explicit unplug (IO unplug) or an implicit unplug (timer unplug, we scheduled with pending IO queued). Signed-off-by: Jens Axboe --- block/blk-core.c | 18 ++++++++++++------ include/trace/events/block.h | 13 +++++++------ kernel/trace/blktrace.c | 18 ++++++++++++------ 3 files changed, 31 insertions(+), 18 deletions(-) (limited to 'kernel') diff --git a/block/blk-core.c b/block/blk-core.c index 3c8121072507..78b7b0cb7216 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2662,17 +2662,23 @@ static int plug_rq_cmp(void *priv, struct list_head *a, struct list_head *b) return !(rqa->q <= rqb->q); } +/* + * If 'from_schedule' is true, then postpone the dispatch of requests + * until a safe kblockd context. We due this to avoid accidental big + * additional stack usage in driver dispatch, in places where the originally + * plugger did not intend it. + */ static void queue_unplugged(struct request_queue *q, unsigned int depth, - bool force_kblockd) + bool from_schedule) { - trace_block_unplug_io(q, depth); - __blk_run_queue(q, force_kblockd); + trace_block_unplug(q, depth, !from_schedule); + __blk_run_queue(q, from_schedule); if (q->unplugged_fn) q->unplugged_fn(q); } -void blk_flush_plug_list(struct blk_plug *plug, bool force_kblockd) +void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule) { struct request_queue *q; unsigned long flags; @@ -2707,7 +2713,7 @@ void blk_flush_plug_list(struct blk_plug *plug, bool force_kblockd) BUG_ON(!rq->q); if (rq->q != q) { if (q) { - queue_unplugged(q, depth, force_kblockd); + queue_unplugged(q, depth, from_schedule); spin_unlock(q->queue_lock); } q = rq->q; @@ -2728,7 +2734,7 @@ void blk_flush_plug_list(struct blk_plug *plug, bool force_kblockd) } if (q) { - queue_unplugged(q, depth, force_kblockd); + queue_unplugged(q, depth, from_schedule); spin_unlock(q->queue_lock); } diff --git a/include/trace/events/block.h b/include/trace/events/block.h index 006e60b58306..bf366547da25 100644 --- a/include/trace/events/block.h +++ b/include/trace/events/block.h @@ -401,9 +401,9 @@ TRACE_EVENT(block_plug, DECLARE_EVENT_CLASS(block_unplug, - TP_PROTO(struct request_queue *q, unsigned int depth), + TP_PROTO(struct request_queue *q, unsigned int depth, bool explicit), - TP_ARGS(q, depth), + TP_ARGS(q, depth, explicit), TP_STRUCT__entry( __field( int, nr_rq ) @@ -419,18 +419,19 @@ DECLARE_EVENT_CLASS(block_unplug, ); /** - * block_unplug_io - release of operations requests in request queue + * block_unplug - release of operations requests in request queue * @q: request queue to unplug * @depth: number of requests just added to the queue + * @explicit: whether this was an explicit unplug, or one from schedule() * * Unplug request queue @q because device driver is scheduled to work * on elements in the request queue. */ -DEFINE_EVENT(block_unplug, block_unplug_io, +DEFINE_EVENT(block_unplug, block_unplug, - TP_PROTO(struct request_queue *q, unsigned int depth), + TP_PROTO(struct request_queue *q, unsigned int depth, bool explicit), - TP_ARGS(q, depth) + TP_ARGS(q, depth, explicit) ); /** diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c index 3e3970d53d14..6957aa298dfa 100644 --- a/kernel/trace/blktrace.c +++ b/kernel/trace/blktrace.c @@ -850,16 +850,21 @@ static void blk_add_trace_plug(void *ignore, struct request_queue *q) __blk_add_trace(bt, 0, 0, 0, BLK_TA_PLUG, 0, 0, NULL); } -static void blk_add_trace_unplug_io(void *ignore, struct request_queue *q, - unsigned int depth) +static void blk_add_trace_unplug(void *ignore, struct request_queue *q, + unsigned int depth, bool explicit) { struct blk_trace *bt = q->blk_trace; if (bt) { __be64 rpdu = cpu_to_be64(depth); + u32 what; - __blk_add_trace(bt, 0, 0, 0, BLK_TA_UNPLUG_IO, 0, - sizeof(rpdu), &rpdu); + if (explicit) + what = BLK_TA_UNPLUG_IO; + else + what = BLK_TA_UNPLUG_TIMER; + + __blk_add_trace(bt, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu); } } @@ -1002,7 +1007,7 @@ static void blk_register_tracepoints(void) WARN_ON(ret); ret = register_trace_block_plug(blk_add_trace_plug, NULL); WARN_ON(ret); - ret = register_trace_block_unplug_io(blk_add_trace_unplug_io, NULL); + ret = register_trace_block_unplug(blk_add_trace_unplug, NULL); WARN_ON(ret); ret = register_trace_block_split(blk_add_trace_split, NULL); WARN_ON(ret); @@ -1017,7 +1022,7 @@ static void blk_unregister_tracepoints(void) unregister_trace_block_rq_remap(blk_add_trace_rq_remap, NULL); unregister_trace_block_bio_remap(blk_add_trace_bio_remap, NULL); unregister_trace_block_split(blk_add_trace_split, NULL); - unregister_trace_block_unplug_io(blk_add_trace_unplug_io, NULL); + unregister_trace_block_unplug(blk_add_trace_unplug, NULL); unregister_trace_block_plug(blk_add_trace_plug, NULL); unregister_trace_block_sleeprq(blk_add_trace_sleeprq, NULL); unregister_trace_block_getrq(blk_add_trace_getrq, NULL); @@ -1332,6 +1337,7 @@ static const struct { [__BLK_TA_COMPLETE] = {{ "C", "complete" }, blk_log_with_error }, [__BLK_TA_PLUG] = {{ "P", "plug" }, blk_log_plug }, [__BLK_TA_UNPLUG_IO] = {{ "U", "unplug_io" }, blk_log_unplug }, + [__BLK_TA_UNPLUG_TIMER] = {{ "UT", "unplug_timer" }, blk_log_unplug }, [__BLK_TA_INSERT] = {{ "I", "insert" }, blk_log_generic }, [__BLK_TA_SPLIT] = {{ "X", "split" }, blk_log_split }, [__BLK_TA_BOUNCE] = {{ "B", "bounce" }, blk_log_generic }, -- cgit v1.2.3 From 1791f881435fab951939ad700e947b66c062e083 Mon Sep 17 00:00:00 2001 From: Richard Cochran Date: Wed, 30 Mar 2011 15:24:21 +0200 Subject: posix clocks: Replace mutex with reader/writer semaphore A dynamic posix clock is protected from asynchronous removal by a mutex. However, using a mutex has the unwanted effect that a long running clock operation in one process will unnecessarily block other processes. For example, one process might call read() to get an external time stamp coming in at one pulse per second. A second process calling clock_gettime would have to wait for almost a whole second. This patch fixes the issue by using a reader/writer semaphore instead of a mutex. Signed-off-by: Richard Cochran Cc: John Stultz Link: http://lkml.kernel.org/r/%3C20110330132421.GA31771%40riccoc20.at.omicron.at%3E Signed-off-by: Thomas Gleixner --- include/linux/posix-clock.h | 5 +++-- kernel/time/posix-clock.c | 24 +++++++++--------------- 2 files changed, 12 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/include/linux/posix-clock.h b/include/linux/posix-clock.h index 369e19d3750b..7f1183dcd119 100644 --- a/include/linux/posix-clock.h +++ b/include/linux/posix-clock.h @@ -24,6 +24,7 @@ #include #include #include +#include struct posix_clock; @@ -104,7 +105,7 @@ struct posix_clock_operations { * @ops: Functional interface to the clock * @cdev: Character device instance for this clock * @kref: Reference count. - * @mutex: Protects the 'zombie' field from concurrent access. + * @rwsem: Protects the 'zombie' field from concurrent access. * @zombie: If 'zombie' is true, then the hardware has disappeared. * @release: A function to free the structure when the reference count reaches * zero. May be NULL if structure is statically allocated. @@ -117,7 +118,7 @@ struct posix_clock { struct posix_clock_operations ops; struct cdev cdev; struct kref kref; - struct mutex mutex; + struct rw_semaphore rwsem; bool zombie; void (*release)(struct posix_clock *clk); }; diff --git a/kernel/time/posix-clock.c b/kernel/time/posix-clock.c index 25028dd4fa18..c340ca658f37 100644 --- a/kernel/time/posix-clock.c +++ b/kernel/time/posix-clock.c @@ -19,7 +19,6 @@ */ #include #include -#include #include #include #include @@ -34,19 +33,19 @@ static struct posix_clock *get_posix_clock(struct file *fp) { struct posix_clock *clk = fp->private_data; - mutex_lock(&clk->mutex); + down_read(&clk->rwsem); if (!clk->zombie) return clk; - mutex_unlock(&clk->mutex); + up_read(&clk->rwsem); return NULL; } static void put_posix_clock(struct posix_clock *clk) { - mutex_unlock(&clk->mutex); + up_read(&clk->rwsem); } static ssize_t posix_clock_read(struct file *fp, char __user *buf, @@ -156,7 +155,7 @@ static int posix_clock_open(struct inode *inode, struct file *fp) struct posix_clock *clk = container_of(inode->i_cdev, struct posix_clock, cdev); - mutex_lock(&clk->mutex); + down_read(&clk->rwsem); if (clk->zombie) { err = -ENODEV; @@ -172,7 +171,7 @@ static int posix_clock_open(struct inode *inode, struct file *fp) fp->private_data = clk; } out: - mutex_unlock(&clk->mutex); + up_read(&clk->rwsem); return err; } @@ -211,25 +210,20 @@ int posix_clock_register(struct posix_clock *clk, dev_t devid) int err; kref_init(&clk->kref); - mutex_init(&clk->mutex); + init_rwsem(&clk->rwsem); cdev_init(&clk->cdev, &posix_clock_file_operations); clk->cdev.owner = clk->ops.owner; err = cdev_add(&clk->cdev, devid, 1); - if (err) - goto no_cdev; return err; -no_cdev: - mutex_destroy(&clk->mutex); - return err; } EXPORT_SYMBOL_GPL(posix_clock_register); static void delete_clock(struct kref *kref) { struct posix_clock *clk = container_of(kref, struct posix_clock, kref); - mutex_destroy(&clk->mutex); + if (clk->release) clk->release(clk); } @@ -238,9 +232,9 @@ void posix_clock_unregister(struct posix_clock *clk) { cdev_del(&clk->cdev); - mutex_lock(&clk->mutex); + down_write(&clk->rwsem); clk->zombie = true; - mutex_unlock(&clk->mutex); + up_write(&clk->rwsem); kref_put(&clk->kref, delete_clock); } -- cgit v1.2.3 From c78193e9c7bcbf25b8237ad0dec82f805c4ea69b Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Mon, 18 Apr 2011 10:35:30 -0700 Subject: next_pidmap: fix overflow condition MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit next_pidmap() just quietly accepted whatever 'last' pid that was passed in, which is not all that safe when one of the users is /proc. Admittedly the proc code should do some sanity checking on the range (and that will be the next commit), but that doesn't mean that the helper functions should just do that pidmap pointer arithmetic without checking the range of its arguments. So clamp 'last' to PID_MAX_LIMIT. The fact that we then do "last+1" doesn't really matter, the for-loop does check against the end of the pidmap array properly (it's only the actual pointer arithmetic overflow case we need to worry about, and going one bit beyond isn't going to overflow). [ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ] Reported-by: Tavis Ormandy Analyzed-by: Robert Święcki Cc: Eric W. Biederman Cc: Pavel Emelyanov Signed-off-by: Linus Torvalds --- include/linux/pid.h | 2 +- kernel/pid.c | 5 ++++- 2 files changed, 5 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/include/linux/pid.h b/include/linux/pid.h index 31afb7ecbe1f..cdced84261d7 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -117,7 +117,7 @@ extern struct pid *find_vpid(int nr); */ extern struct pid *find_get_pid(int nr); extern struct pid *find_ge_pid(int nr, struct pid_namespace *); -int next_pidmap(struct pid_namespace *pid_ns, int last); +int next_pidmap(struct pid_namespace *pid_ns, unsigned int last); extern struct pid *alloc_pid(struct pid_namespace *ns); extern void free_pid(struct pid *pid); diff --git a/kernel/pid.c b/kernel/pid.c index 02f221274265..57a8346a270e 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -217,11 +217,14 @@ static int alloc_pidmap(struct pid_namespace *pid_ns) return -1; } -int next_pidmap(struct pid_namespace *pid_ns, int last) +int next_pidmap(struct pid_namespace *pid_ns, unsigned int last) { int offset; struct pidmap *map, *end; + if (last >= PID_MAX_LIMIT) + return -1; + offset = (last + 1) & BITS_PER_PAGE_MASK; map = &pid_ns->pidmap[(last + 1)/BITS_PER_PAGE]; end = &pid_ns->pidmap[PIDMAP_ENTRIES]; -- cgit v1.2.3