From 10fc05d0e551146ad6feb0ab8902d28a2d3c5624 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 7 Oct 2013 11:28:40 +0100 Subject: mm: numa: Document automatic NUMA balancing sysctls Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Johannes Weiner Cc: Srikar Dronamraju Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1381141781-10992-3-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar --- Documentation/sysctl/kernel.txt | 66 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 66 insertions(+) (limited to 'Documentation/sysctl') diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 9d4c1d18ad44..1428c6659254 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -355,6 +355,72 @@ utilize. ============================================================== +numa_balancing + +Enables/disables automatic page fault based NUMA memory +balancing. Memory is moved automatically to nodes +that access it often. + +Enables/disables automatic NUMA memory balancing. On NUMA machines, there +is a performance penalty if remote memory is accessed by a CPU. When this +feature is enabled the kernel samples what task thread is accessing memory +by periodically unmapping pages and later trapping a page fault. At the +time of the page fault, it is determined if the data being accessed should +be migrated to a local memory node. + +The unmapping of pages and trapping faults incur additional overhead that +ideally is offset by improved memory locality but there is no universal +guarantee. If the target workload is already bound to NUMA nodes then this +feature should be disabled. Otherwise, if the system overhead from the +feature is too high then the rate the kernel samples for NUMA hinting +faults may be controlled by the numa_balancing_scan_period_min_ms, +numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset, +numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls. + +============================================================== + +numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, +numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset, +numa_balancing_scan_size_mb + +Automatic NUMA balancing scans tasks address space and unmaps pages to +detect if pages are properly placed or if the data should be migrated to a +memory node local to where the task is running. Every "scan delay" the task +scans the next "scan size" number of pages in its address space. When the +end of the address space is reached the scanner restarts from the beginning. + +In combination, the "scan delay" and "scan size" determine the scan rate. +When "scan delay" decreases, the scan rate increases. The scan delay and +hence the scan rate of every task is adaptive and depends on historical +behaviour. If pages are properly placed then the scan delay increases, +otherwise the scan delay decreases. The "scan size" is not adaptive but +the higher the "scan size", the higher the scan rate. + +Higher scan rates incur higher system overhead as page faults must be +trapped and potentially data must be migrated. However, the higher the scan +rate, the more quickly a tasks memory is migrated to a local node if the +workload pattern changes and minimises performance impact due to remote +memory accesses. These sysctls control the thresholds for scan delays and +the number of pages scanned. + +numa_balancing_scan_period_min_ms is the minimum delay in milliseconds +between scans. 
It effectively controls the maximum scanning rate for +each task. + +numa_balancing_scan_delay_ms is the starting "scan delay" used for a task +when it initially forks. + +numa_balancing_scan_period_max_ms is the maximum delay between scans. It +effectively controls the minimum scanning rate for each task. + +numa_balancing_scan_size_mb is how many megabytes worth of pages are +scanned for a given scan. + +numa_balancing_scan_period_reset is a blunt instrument that controls how +often a tasks scan delay is reset to detect sudden changes in task behaviour. + +============================================================== + osrelease, ostype & version: # cat osrelease -- cgit v1.3 From 598f0ec0bc996e90a806ee9564af919ea5aad401 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 7 Oct 2013 11:28:55 +0100 Subject: sched/numa: Set the scan rate proportional to the memory usage of the task being scanned The NUMA PTE scan rate is controlled with a combination of the numa_balancing_scan_period_min, numa_balancing_scan_period_max and numa_balancing_scan_size. This scan rate is independent of the size of the task and as an aside it is further complicated by the fact that numa_balancing_scan_size controls how many pages are marked pte_numa and not how much virtual memory is scanned. In combination, it is almost impossible to meaningfully tune the min and max scan periods and reasoning about performance is complex when the time to complete a full scan is is partially a function of the tasks memory size. This patch alters the semantic of the min and max tunables to be about tuning the length time it takes to complete a scan of a tasks occupied virtual address space. Conceptually this is a lot easier to understand. There is a "sanity" check to ensure the scan rate is never extremely fast based on the amount of virtual memory that should be scanned in a second. The default of 2.5G seems arbitrary but it is to have the maximum scan rate after the patch roughly match the maximum scan rate before the patch was applied. On a similar note, numa_scan_period is in milliseconds and not jiffies. Properly placed pages slow the scanning rate but adding 10 jiffies to numa_scan_period means that the rate scanning slows depends on HZ which is confusing. Get rid of the jiffies_to_msec conversion and treat it as ms. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Johannes Weiner Cc: Srikar Dronamraju Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1381141781-10992-18-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar --- Documentation/sysctl/kernel.txt | 11 +++--- include/linux/sched.h | 1 + kernel/sched/fair.c | 88 +++++++++++++++++++++++++++++++++++------ 3 files changed, 83 insertions(+), 17 deletions(-) (limited to 'Documentation/sysctl') diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 1428c6659254..8cd7e5fc79da 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -403,15 +403,16 @@ workload pattern changes and minimises performance impact due to remote memory accesses. These sysctls control the thresholds for scan delays and the number of pages scanned. -numa_balancing_scan_period_min_ms is the minimum delay in milliseconds -between scans. It effectively controls the maximum scanning rate for -each task. +numa_balancing_scan_period_min_ms is the minimum time in milliseconds to +scan a tasks virtual memory. It effectively controls the maximum scanning +rate for each task. 
numa_balancing_scan_delay_ms is the starting "scan delay" used for a task when it initially forks. -numa_balancing_scan_period_max_ms is the maximum delay between scans. It -effectively controls the minimum scanning rate for each task. +numa_balancing_scan_period_max_ms is the maximum time in milliseconds to +scan a tasks virtual memory. It effectively controls the minimum scanning +rate for each task. numa_balancing_scan_size_mb is how many megabytes worth of pages are scanned for a given scan. diff --git a/include/linux/sched.h b/include/linux/sched.h index 2ac5285db434..fdcb4c855072 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1339,6 +1339,7 @@ struct task_struct { int numa_scan_seq; int numa_migrate_seq; unsigned int numa_scan_period; + unsigned int numa_scan_period_max; u64 node_stamp; /* migration stamp */ struct callback_head numa_work; #endif /* CONFIG_NUMA_BALANCING */ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0966f0c16f1b..e08d757720de 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -818,11 +818,13 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se) #ifdef CONFIG_NUMA_BALANCING /* - * numa task sample period in ms + * Approximate time to scan a full NUMA task in ms. The task scan period is + * calculated based on the tasks virtual memory size and + * numa_balancing_scan_size. */ -unsigned int sysctl_numa_balancing_scan_period_min = 100; -unsigned int sysctl_numa_balancing_scan_period_max = 100*50; -unsigned int sysctl_numa_balancing_scan_period_reset = 100*600; +unsigned int sysctl_numa_balancing_scan_period_min = 1000; +unsigned int sysctl_numa_balancing_scan_period_max = 60000; +unsigned int sysctl_numa_balancing_scan_period_reset = 60000; /* Portion of address space to scan in MB */ unsigned int sysctl_numa_balancing_scan_size = 256; @@ -830,6 +832,51 @@ unsigned int sysctl_numa_balancing_scan_size = 256; /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */ unsigned int sysctl_numa_balancing_scan_delay = 1000; +static unsigned int task_nr_scan_windows(struct task_struct *p) +{ + unsigned long rss = 0; + unsigned long nr_scan_pages; + + /* + * Calculations based on RSS as non-present and empty pages are skipped + * by the PTE scanner and NUMA hinting faults should be trapped based + * on resident pages + */ + nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT); + rss = get_mm_rss(p->mm); + if (!rss) + rss = nr_scan_pages; + + rss = round_up(rss, nr_scan_pages); + return rss / nr_scan_pages; +} + +/* For sanitys sake, never scan more PTEs than MAX_SCAN_WINDOW MB/sec. 
*/ +#define MAX_SCAN_WINDOW 2560 + +static unsigned int task_scan_min(struct task_struct *p) +{ + unsigned int scan, floor; + unsigned int windows = 1; + + if (sysctl_numa_balancing_scan_size < MAX_SCAN_WINDOW) + windows = MAX_SCAN_WINDOW / sysctl_numa_balancing_scan_size; + floor = 1000 / windows; + + scan = sysctl_numa_balancing_scan_period_min / task_nr_scan_windows(p); + return max_t(unsigned int, floor, scan); +} + +static unsigned int task_scan_max(struct task_struct *p) +{ + unsigned int smin = task_scan_min(p); + unsigned int smax; + + /* Watch for min being lower than max due to floor calculations */ + smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p); + return max(smin, smax); +} + static void task_numa_placement(struct task_struct *p) { int seq; @@ -840,6 +887,7 @@ static void task_numa_placement(struct task_struct *p) if (p->numa_scan_seq == seq) return; p->numa_scan_seq = seq; + p->numa_scan_period_max = task_scan_max(p); /* FIXME: Scheduling placement policy hints go here */ } @@ -860,9 +908,14 @@ void task_numa_fault(int node, int pages, bool migrated) * If pages are properly placed (did not migrate) then scan slower. * This is reset periodically in case of phase changes */ - if (!migrated) - p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max, - p->numa_scan_period + jiffies_to_msecs(10)); + if (!migrated) { + /* Initialise if necessary */ + if (!p->numa_scan_period_max) + p->numa_scan_period_max = task_scan_max(p); + + p->numa_scan_period = min(p->numa_scan_period_max, + p->numa_scan_period + 10); + } task_numa_placement(p); } @@ -884,6 +937,7 @@ void task_numa_work(struct callback_head *work) struct mm_struct *mm = p->mm; struct vm_area_struct *vma; unsigned long start, end; + unsigned long nr_pte_updates = 0; long pages; WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work)); @@ -915,7 +969,7 @@ void task_numa_work(struct callback_head *work) */ migrate = mm->numa_next_reset; if (time_after(now, migrate)) { - p->numa_scan_period = sysctl_numa_balancing_scan_period_min; + p->numa_scan_period = task_scan_min(p); next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset); xchg(&mm->numa_next_reset, next_scan); } @@ -927,8 +981,10 @@ void task_numa_work(struct callback_head *work) if (time_before(now, migrate)) return; - if (p->numa_scan_period == 0) - p->numa_scan_period = sysctl_numa_balancing_scan_period_min; + if (p->numa_scan_period == 0) { + p->numa_scan_period_max = task_scan_max(p); + p->numa_scan_period = task_scan_min(p); + } next_scan = now + msecs_to_jiffies(p->numa_scan_period); if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate) @@ -965,7 +1021,15 @@ void task_numa_work(struct callback_head *work) start = max(start, vma->vm_start); end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE); end = min(end, vma->vm_end); - pages -= change_prot_numa(vma, start, end); + nr_pte_updates += change_prot_numa(vma, start, end); + + /* + * Scan sysctl_numa_balancing_scan_size but ensure that + * at least one PTE is updated so that unused virtual + * address space is quickly skipped. 
+ */ + if (nr_pte_updates) + pages -= (end - start) >> PAGE_SHIFT; start = end; if (pages <= 0) @@ -1012,7 +1076,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr) if (now - curr->node_stamp > period) { if (!curr->node_stamp) - curr->numa_scan_period = sysctl_numa_balancing_scan_period_min; + curr->numa_scan_period = task_scan_min(curr); curr->node_stamp += period; if (!time_before(jiffies, curr->mm->numa_next_scan)) { -- cgit v1.3 From 3a7053b3224f4a8b0e8184166190076593621617 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 7 Oct 2013 11:29:00 +0100 Subject: sched/numa: Favour moving tasks towards the preferred node This patch favours moving tasks towards NUMA node that recorded a higher number of NUMA faults during active load balancing. Ideally this is self-reinforcing as the longer the task runs on that node, the more faults it should incur causing task_numa_placement to keep the task running on that node. In reality a big weakness is that the nodes CPUs can be overloaded and it would be more efficient to queue tasks on an idle node and migrate to the new node. This would require additional smarts in the balancer so for now the balancer will simply prefer to place the task on the preferred node for a PTE scans which is controlled by the numa_balancing_settle_count sysctl. Once the settle_count number of scans has complete the schedule is free to place the task on an alternative node if the load is imbalanced. [srikar@linux.vnet.ibm.com: Fixed statistics] Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Johannes Weiner Cc: Srikar Dronamraju [ Tunable and use higher faults instead of preferred. ] Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1381141781-10992-23-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar --- Documentation/sysctl/kernel.txt | 8 +++++- include/linux/sched.h | 1 + kernel/sched/core.c | 3 +- kernel/sched/fair.c | 63 ++++++++++++++++++++++++++++++++++++++--- kernel/sched/features.h | 7 +++++ kernel/sysctl.c | 7 +++++ 6 files changed, 83 insertions(+), 6 deletions(-) (limited to 'Documentation/sysctl') diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 8cd7e5fc79da..d48bca45b6f2 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -375,7 +375,8 @@ feature should be disabled. Otherwise, if the system overhead from the feature is too high then the rate the kernel samples for NUMA hinting faults may be controlled by the numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset, -numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls. +numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and +numa_balancing_settle_count sysctls. ============================================================== @@ -420,6 +421,11 @@ scanned for a given scan. numa_balancing_scan_period_reset is a blunt instrument that controls how often a tasks scan delay is reset to detect sudden changes in task behaviour. +numa_balancing_settle_count is how many scan periods must complete before +the schedule balancer stops pushing the task towards a preferred node. This +gives the scheduler a chance to place the task on an alternative node if the +preferred node is overloaded. 
+ ============================================================== osrelease, ostype & version: diff --git a/include/linux/sched.h b/include/linux/sched.h index a463bc3ad437..aecdc5a18773 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -777,6 +777,7 @@ enum cpu_idle_type { #define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */ #define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */ #define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */ +#define SD_NUMA 0x4000 /* cross-node balancing */ extern int __weak arch_sd_sibiling_asym_packing(void); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 064a0af44540..b7e6b6f9c5f6 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1631,7 +1631,7 @@ static void __sched_fork(struct task_struct *p) p->node_stamp = 0ULL; p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0; - p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0; + p->numa_migrate_seq = 0; p->numa_scan_period = sysctl_numa_balancing_scan_delay; p->numa_preferred_nid = -1; p->numa_work.next = &p->numa_work; @@ -5656,6 +5656,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu) | 0*SD_SHARE_PKG_RESOURCES | 1*SD_SERIALIZE | 0*SD_PREFER_SIBLING + | 1*SD_NUMA | sd_local_flags(level) , .last_balance = jiffies, diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 3abc651bc38a..6ffddca687fe 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -877,6 +877,15 @@ static unsigned int task_scan_max(struct task_struct *p) return max(smin, smax); } +/* + * Once a preferred node is selected the scheduler balancer will prefer moving + * a task to that node for sysctl_numa_balancing_settle_count number of PTE + * scans. This will give the process the chance to accumulate more faults on + * the preferred node but still allow the scheduler to move the task again if + * the nodes CPUs are overloaded. 
+ */ +unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3; + static void task_numa_placement(struct task_struct *p) { int seq, nid, max_nid = -1; @@ -888,6 +897,7 @@ static void task_numa_placement(struct task_struct *p) if (p->numa_scan_seq == seq) return; p->numa_scan_seq = seq; + p->numa_migrate_seq++; p->numa_scan_period_max = task_scan_max(p); /* Find the node with the highest number of faults */ @@ -907,8 +917,10 @@ static void task_numa_placement(struct task_struct *p) } /* Update the tasks preferred node if necessary */ - if (max_faults && max_nid != p->numa_preferred_nid) + if (max_faults && max_nid != p->numa_preferred_nid) { p->numa_preferred_nid = max_nid; + p->numa_migrate_seq = 0; + } } /* @@ -4071,6 +4083,38 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd) return delta < (s64)sysctl_sched_migration_cost; } +#ifdef CONFIG_NUMA_BALANCING +/* Returns true if the destination node has incurred more faults */ +static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env) +{ + int src_nid, dst_nid; + + if (!sched_feat(NUMA_FAVOUR_HIGHER) || !p->numa_faults || + !(env->sd->flags & SD_NUMA)) { + return false; + } + + src_nid = cpu_to_node(env->src_cpu); + dst_nid = cpu_to_node(env->dst_cpu); + + if (src_nid == dst_nid || + p->numa_migrate_seq >= sysctl_numa_balancing_settle_count) + return false; + + if (dst_nid == p->numa_preferred_nid || + p->numa_faults[dst_nid] > p->numa_faults[src_nid]) + return true; + + return false; +} +#else +static inline bool migrate_improves_locality(struct task_struct *p, + struct lb_env *env) +{ + return false; +} +#endif + /* * can_migrate_task - may task p from runqueue rq be migrated to this_cpu? */ @@ -4128,11 +4172,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) /* * Aggressive migration if: - * 1) task is cache cold, or - * 2) too many balance attempts have failed. + * 1) destination numa is preferred + * 2) task is cache cold, or + * 3) too many balance attempts have failed. */ - tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd); + + if (migrate_improves_locality(p, env)) { +#ifdef CONFIG_SCHEDSTATS + if (tsk_cache_hot) { + schedstat_inc(env->sd, lb_hot_gained[env->idle]); + schedstat_inc(p, se.statistics.nr_forced_migrations); + } +#endif + return 1; + } + if (!tsk_cache_hot || env->sd->nr_balance_failed > env->sd->cache_nice_tries) { diff --git a/kernel/sched/features.h b/kernel/sched/features.h index cba5c616a157..d9278ce2c4b4 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -67,4 +67,11 @@ SCHED_FEAT(LB_MIN, false) */ #ifdef CONFIG_NUMA_BALANCING SCHED_FEAT(NUMA, false) + +/* + * NUMA_FAVOUR_HIGHER will favor moving tasks towards nodes where a + * higher number of hinting faults are recorded during active load + * balancing. 
+ */ +SCHED_FEAT(NUMA_FAVOUR_HIGHER, true) #endif diff --git a/kernel/sysctl.c b/kernel/sysctl.c index b2f06f3c6a3f..42f616a74f40 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -391,6 +391,13 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, + { + .procname = "numa_balancing_settle_count", + .data = &sysctl_numa_balancing_settle_count, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, #endif /* CONFIG_NUMA_BALANCING */ #endif /* CONFIG_SCHED_DEBUG */ { -- cgit v1.3 From 930aa174fcc8b0efaad102fd80f677b92f35eaa2 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 7 Oct 2013 11:29:37 +0100 Subject: sched/numa: Remove the numa_balancing_scan_period_reset sysctl With scan rate adaptions based on whether the workload has properly converged or not there should be no need for the scan period reset hammer. Get rid of it. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Johannes Weiner Cc: Srikar Dronamraju Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1381141781-10992-60-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar --- Documentation/sysctl/kernel.txt | 11 +++-------- include/linux/mm_types.h | 3 --- include/linux/sched/sysctl.h | 1 - kernel/sched/core.c | 1 - kernel/sched/fair.c | 18 +----------------- kernel/sysctl.c | 7 ------- 6 files changed, 4 insertions(+), 37 deletions(-) (limited to 'Documentation/sysctl') diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index d48bca45b6f2..84f17800f8b5 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -374,15 +374,13 @@ guarantee. If the target workload is already bound to NUMA nodes then this feature should be disabled. Otherwise, if the system overhead from the feature is too high then the rate the kernel samples for NUMA hinting faults may be controlled by the numa_balancing_scan_period_min_ms, -numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset, -numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and -numa_balancing_settle_count sysctls. +numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, +numa_balancing_scan_size_mb and numa_balancing_settle_count sysctls. ============================================================== numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, -numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset, -numa_balancing_scan_size_mb +numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb Automatic NUMA balancing scans tasks address space and unmaps pages to detect if pages are properly placed or if the data should be migrated to a @@ -418,9 +416,6 @@ rate for each task. numa_balancing_scan_size_mb is how many megabytes worth of pages are scanned for a given scan. -numa_balancing_scan_period_reset is a blunt instrument that controls how -often a tasks scan delay is reset to detect sudden changes in task behaviour. - numa_balancing_settle_count is how many scan periods must complete before the schedule balancer stops pushing the task towards a preferred node. 
This gives the scheduler a chance to place the task on an alternative node if the diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index a30f9ca66557..a3198e5aaf4e 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -420,9 +420,6 @@ struct mm_struct { */ unsigned long numa_next_scan; - /* numa_next_reset is when the PTE scanner period will be reset */ - unsigned long numa_next_reset; - /* Restart point for scanning and setting pte_numa */ unsigned long numa_scan_offset; diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index bf8086b2506e..10d16c4fbe89 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -47,7 +47,6 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling; extern unsigned int sysctl_numa_balancing_scan_delay; extern unsigned int sysctl_numa_balancing_scan_period_min; extern unsigned int sysctl_numa_balancing_scan_period_max; -extern unsigned int sysctl_numa_balancing_scan_period_reset; extern unsigned int sysctl_numa_balancing_scan_size; extern unsigned int sysctl_numa_balancing_settle_count; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8cfd51f62241..89c5ae836f66 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1721,7 +1721,6 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) #ifdef CONFIG_NUMA_BALANCING if (p->mm && atomic_read(&p->mm->mm_users) == 1) { p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay); - p->mm->numa_next_reset = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset); p->mm->numa_scan_seq = 0; } diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 66237ff8b01e..da6fa22be000 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -826,7 +826,6 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se) */ unsigned int sysctl_numa_balancing_scan_period_min = 1000; unsigned int sysctl_numa_balancing_scan_period_max = 60000; -unsigned int sysctl_numa_balancing_scan_period_reset = 60000; /* Portion of address space to scan in MB */ unsigned int sysctl_numa_balancing_scan_size = 256; @@ -1685,24 +1684,9 @@ void task_numa_work(struct callback_head *work) if (p->flags & PF_EXITING) return; - if (!mm->numa_next_reset || !mm->numa_next_scan) { + if (!mm->numa_next_scan) { mm->numa_next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_delay); - mm->numa_next_reset = now + - msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset); - } - - /* - * Reset the scan period if enough time has gone by. Objective is that - * scanning will be reduced if pages are properly placed. As tasks - * can enter different phases this needs to be re-examined. Lacking - * proper tracking of reference behaviour, this blunt hammer is used. 
- */ - migrate = mm->numa_next_reset; - if (time_after(now, migrate)) { - p->numa_scan_period = task_scan_min(p); - next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset); - xchg(&mm->numa_next_reset, next_scan); } /* diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 42f616a74f40..e509b90a8002 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -370,13 +370,6 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, - { - .procname = "numa_balancing_scan_period_reset", - .data = &sysctl_numa_balancing_scan_period_reset, - .maxlen = sizeof(unsigned int), - .mode = 0644, - .proc_handler = proc_dointvec, - }, { .procname = "numa_balancing_scan_period_max_ms", .data = &sysctl_numa_balancing_scan_period_max, -- cgit v1.3 From de1c9ce6f07fec0381a39a9d0b379ea35aa1167f Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Mon, 7 Oct 2013 11:29:39 +0100 Subject: sched/numa: Skip some page migrations after a shared fault Shared faults can lead to lots of unnecessary page migrations, slowing down the system, and causing private faults to hit the per-pgdat migration ratelimit. This patch adds sysctl numa_balancing_migrate_deferred, which specifies how many shared page migrations to skip unconditionally, after each page migration that is skipped because it is a shared fault. This reduces the number of page migrations back and forth in shared fault situations. It also gives a strong preference to the tasks that are already running where most of the memory is, and to moving the other tasks to near the memory. Testing this with a much higher scan rate than the default still seems to result in fewer page migrations than before. Memory seems to be somewhat better consolidated than previously, with multi-instance specjbb runs on a 4 node system. Signed-off-by: Rik van Riel Signed-off-by: Mel Gorman Cc: Andrea Arcangeli Cc: Johannes Weiner Cc: Srikar Dronamraju Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/1381141781-10992-62-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar --- Documentation/sysctl/kernel.txt | 10 ++++++++- include/linux/sched.h | 5 ++++- kernel/sched/fair.c | 8 +++++++ kernel/sysctl.c | 7 ++++++ mm/mempolicy.c | 48 ++++++++++++++++++++++++++++++++++++++++- 5 files changed, 75 insertions(+), 3 deletions(-) (limited to 'Documentation/sysctl') diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 84f17800f8b5..4273b2d71a27 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -375,7 +375,8 @@ feature should be disabled. Otherwise, if the system overhead from the feature is too high then the rate the kernel samples for NUMA hinting faults may be controlled by the numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, -numa_balancing_scan_size_mb and numa_balancing_settle_count sysctls. +numa_balancing_scan_size_mb, numa_balancing_settle_count sysctls and +numa_balancing_migrate_deferred. ============================================================== @@ -421,6 +422,13 @@ the schedule balancer stops pushing the task towards a preferred node. This gives the scheduler a chance to place the task on an alternative node if the preferred node is overloaded. +numa_balancing_migrate_deferred is how many page migrations get skipped +unconditionally, after a page migration is skipped because a page is shared +with other tasks. 
This reduces page migration overhead, and determines +how much stronger the "move task near its memory" policy scheduler becomes, +versus the "move memory near its task" memory management policy, for workloads +with shared memory. + ============================================================== osrelease, ostype & version: diff --git a/include/linux/sched.h b/include/linux/sched.h index d24f70ffddee..833eed55cf43 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1342,6 +1342,8 @@ struct task_struct { int numa_scan_seq; unsigned int numa_scan_period; unsigned int numa_scan_period_max; + int numa_preferred_nid; + int numa_migrate_deferred; unsigned long numa_migrate_retry; u64 node_stamp; /* migration stamp */ struct callback_head numa_work; @@ -1372,7 +1374,6 @@ struct task_struct { */ unsigned long numa_faults_locality[2]; - int numa_preferred_nid; unsigned long numa_pages_migrated; #endif /* CONFIG_NUMA_BALANCING */ @@ -1469,6 +1470,8 @@ extern void task_numa_fault(int last_node, int node, int pages, int flags); extern pid_t task_numa_group_id(struct task_struct *p); extern void set_numabalancing_state(bool enabled); extern void task_numa_free(struct task_struct *p); + +extern unsigned int sysctl_numa_balancing_migrate_deferred; #else static inline void task_numa_fault(int last_node, int node, int pages, int flags) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8454c38b1b12..e7884dc3416d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -833,6 +833,14 @@ unsigned int sysctl_numa_balancing_scan_size = 256; /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */ unsigned int sysctl_numa_balancing_scan_delay = 1000; +/* + * After skipping a page migration on a shared page, skip N more numa page + * migrations unconditionally. This reduces the number of NUMA migrations + * in shared memory workloads, and has the effect of pulling tasks towards + * where their memory lives, over pulling the memory towards the task. 
+ */ +unsigned int sysctl_numa_balancing_migrate_deferred = 16; + static unsigned int task_nr_scan_windows(struct task_struct *p) { unsigned long rss = 0; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index e509b90a8002..a159e1fd2013 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -391,6 +391,13 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, + { + .procname = "numa_balancing_migrate_deferred", + .data = &sysctl_numa_balancing_migrate_deferred, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, #endif /* CONFIG_NUMA_BALANCING */ #endif /* CONFIG_SCHED_DEBUG */ { diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 2929c24c22b7..71cb253368cb 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2301,6 +2301,35 @@ static void sp_free(struct sp_node *n) kmem_cache_free(sn_cache, n); } +#ifdef CONFIG_NUMA_BALANCING +static bool numa_migrate_deferred(struct task_struct *p, int last_cpupid) +{ + /* Never defer a private fault */ + if (cpupid_match_pid(p, last_cpupid)) + return false; + + if (p->numa_migrate_deferred) { + p->numa_migrate_deferred--; + return true; + } + return false; +} + +static inline void defer_numa_migrate(struct task_struct *p) +{ + p->numa_migrate_deferred = sysctl_numa_balancing_migrate_deferred; +} +#else +static inline bool numa_migrate_deferred(struct task_struct *p, int last_cpupid) +{ + return false; +} + +static inline void defer_numa_migrate(struct task_struct *p) +{ +} +#endif /* CONFIG_NUMA_BALANCING */ + /** * mpol_misplaced - check whether current page node is valid in policy * @@ -2402,7 +2431,24 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long * relation. */ last_cpupid = page_cpupid_xchg_last(page, this_cpupid); - if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) + if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) { + + /* See sysctl_numa_balancing_migrate_deferred comment */ + if (!cpupid_match_pid(current, last_cpupid)) + defer_numa_migrate(current); + + goto out; + } + + /* + * The quadratic filter above reduces extraneous migration + * of shared pages somewhat. This code reduces it even more, + * reducing the overhead of page migrations of shared pages. + * This makes workloads with shared pages rely more on + * "move task near its memory", and less on "move memory + * towards its task", which is exactly what we want. + */ + if (numa_migrate_deferred(current, last_cpupid)) goto out; } -- cgit v1.3 From 715ea41e60277f28f84d6c937737350e00955d56 Mon Sep 17 00:00:00 2001 From: Zheng Liu Date: Tue, 12 Nov 2013 15:08:30 -0800 Subject: mm: improve the description for dirty_background_ratio/dirty_ratio sysctl Now dirty_background_ratio/dirty_ratio contains a percentage of total avaiable memory, which contains free pages and reclaimable pages. The number of these pages is not equal to the number of total system memory. But they are described as a percentage of total system memory in Documentation/sysctl/vm.txt. So we need to fix them to avoid misunderstanding. 
Signed-off-by: Zheng Liu Cc: Rob Landley Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/vm.txt | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) (limited to 'Documentation/sysctl') diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 79a797eb3e87..1fbd4eb7b64a 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -119,8 +119,11 @@ other appears as 0 when read. dirty_background_ratio -Contains, as a percentage of total system memory, the number of pages at which -the background kernel flusher threads will start writing out dirty data. +Contains, as a percentage of total available memory that contains free pages +and reclaimable pages, the number of pages at which the background kernel +flusher threads will start writing out dirty data. + +The total avaiable memory is not equal to total system memory. ============================================================== @@ -151,9 +154,11 @@ interval will be written out next time a flusher thread wakes up. dirty_ratio -Contains, as a percentage of total system memory, the number of pages at which -a process which is generating disk writes will itself start writing out dirty -data. +Contains, as a percentage of total available memory that contains free pages +and reclaimable pages, the number of pages at which a process which is +generating disk writes will itself start writing out dirty data. + +The total avaiable memory is not equal to total system memory. ============================================================== -- cgit v1.3 From 312b4e226951f707e120b95b118cbc14f3d162b2 Mon Sep 17 00:00:00 2001 From: Ryan Mallon Date: Tue, 12 Nov 2013 15:08:51 -0800 Subject: vsprintf: check real user/group id for %pK Some setuid binaries will allow reading of files which have read permission by the real user id. This is problematic with files which use %pK because the file access permission is checked at open() time, but the kptr_restrict setting is checked at read() time. If a setuid binary opens a %pK file as an unprivileged user, and then elevates permissions before reading the file, then kernel pointer values may be leaked. This happens for example with the setuid pppd application on Ubuntu 12.04: $ head -1 /proc/kallsyms 00000000 T startup_32 $ pppd file /proc/kallsyms pppd: In file /proc/kallsyms: unrecognized option 'c1000000' This will only leak the pointer value from the first line, but other setuid binaries may leak more information. Fix this by adding a check that in addition to the current process having CAP_SYSLOG, that effective user and group ids are equal to the real ids. If a setuid binary reads the contents of a file which uses %pK then the pointer values will be printed as NULL if the real user is unprivileged. Update the sysctl documentation to reflect the changes, and also correct the documentation to state the kptr_restrict=0 is the default. This is a only temporary solution to the issue. The correct solution is to do the permission check at open() time on files, and to replace %pK with a function which checks the open() time permission. %pK uses in printk should be removed since no sane permission check can be done, and instead protected by using dmesg_restrict. Signed-off-by: Ryan Mallon Cc: Kees Cook Cc: Alexander Viro Cc: Joe Perches Cc: "Eric W. 
Biederman" Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/kernel.txt | 25 ++++++++++++++++++------- lib/vsprintf.c | 33 ++++++++++++++++++++++++++++++--- 2 files changed, 48 insertions(+), 10 deletions(-) (limited to 'Documentation/sysctl') diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 4273b2d71a27..26b7ee491df8 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -290,13 +290,24 @@ Default value is "/sbin/hotplug". kptr_restrict: This toggle indicates whether restrictions are placed on -exposing kernel addresses via /proc and other interfaces. When -kptr_restrict is set to (0), there are no restrictions. When -kptr_restrict is set to (1), the default, kernel pointers -printed using the %pK format specifier will be replaced with 0's -unless the user has CAP_SYSLOG. When kptr_restrict is set to -(2), kernel pointers printed using %pK will be replaced with 0's -regardless of privileges. +exposing kernel addresses via /proc and other interfaces. + +When kptr_restrict is set to (0), the default, there are no restrictions. + +When kptr_restrict is set to (1), kernel pointers printed using the %pK +format specifier will be replaced with 0's unless the user has CAP_SYSLOG +and effective user and group ids are equal to the real ids. This is +because %pK checks are done at read() time rather than open() time, so +if permissions are elevated between the open() and the read() (e.g via +a setuid binary) then %pK will not leak kernel pointers to unprivileged +users. Note, this is a temporary solution only. The correct long-term +solution is to do the permission checks at open() time. Consider removing +world read permissions from files that use %pK, and using dmesg_restrict +to protect against uses of %pK in dmesg(8) if leaking kernel pointer +values to unprivileged users is a concern. + +When kptr_restrict is set to (2), kernel pointers printed using +%pK will be replaced with 0's regardless of privileges. ============================================================== diff --git a/lib/vsprintf.c b/lib/vsprintf.c index 26559bdb4c49..d76555c45ff4 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include /* for PAGE_SIZE */ @@ -1312,11 +1313,37 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr, spec.field_width = default_width; return string(buf, end, "pK-error", spec); } - if (!((kptr_restrict == 0) || - (kptr_restrict == 1 && - has_capability_noaudit(current, CAP_SYSLOG)))) + + switch (kptr_restrict) { + case 0: + /* Always print %pK values */ + break; + case 1: { + /* + * Only print the real pointer value if the current + * process has CAP_SYSLOG and is running with the + * same credentials it started with. This is because + * access to files is checked at open() time, but %pK + * checks permission at read() time. We don't want to + * leak pointer values if a binary opens a file using + * %pK and then elevates privileges before reading it. + */ + const struct cred *cred = current_cred(); + + if (!has_capability_noaudit(current, CAP_SYSLOG) || + !uid_eq(cred->euid, cred->uid) || + !gid_eq(cred->egid, cred->gid)) + ptr = NULL; + break; + } + case 2: + default: + /* Always print 0's for %pK */ ptr = NULL; + break; + } break; + case 'N': switch (fmt[1]) { case 'F': -- cgit v1.3