author	Andrew Morton <akpm@digeo.com>	2003-06-25 04:21:09 -0700
committer	Greg Kroah-Hartman <greg@kroah.com>	2003-06-25 04:21:09 -0700
commit	325a282480c24b685b780221e3e463c2894f9efa (patch)
tree	f6660c8b467a69883eb5225b92760ab210625bad /kernel
parent	4f758d8f7a892b456a9d6de6b08cba940ad7c12d (diff)
[PATCH] normalise node load for NUMA
From: Andrew Theurer <habanero@us.ibm.com>

This patch ensures that when node loads are compared, the load value is
normalised. Without this, load balancing across nodes with dissimilar cpu
counts can cause unfairness and sometimes lower overall performance.

For example, consider a 2 node system with 4 cpus in the first node and
2 cpus in the second. A workload of 6 running tasks would end up with
3 tasks on each node, leaving one cpu idle in the first node and two tasks
sharing a cpu in the second. With the patch, 4 tasks run in the first node
and 2 in the second.

I ran some kernel compiles comparing this patch on a 2 node 4 cpu/2 cpu
system to show the benefit. Without the patch the elapsed time was
140 seconds; with the patch it was 132 seconds (6% better).

Although nodes with dissimilar cpu counts are not very common, they already
exist. PPC64 systems with partitioning have this layout, and I expect it to
show up more often on ia32 as partitioning becomes more common.
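To illustrate the normalisation the patch performs, here is a standalone
sketch (not kernel code; the node_cpus and node_tasks arrays are hypothetical
values taken from the example above):

#include <stdio.h>

/* Hypothetical 2-node layout from the example: 4 cpus + 2 cpus,
 * with 6 tasks split 3/3 as the unpatched scheduler would do. */
static const int node_cpus[2]  = { 4, 2 };
static const int node_tasks[2] = { 3, 3 };

int main(void)
{
	for (int i = 0; i < 2; i++) {
		int raw  = node_tasks[i];
		/* Same arithmetic as the patch: scale by 10, divide by cpu count. */
		int norm = 10 * node_tasks[i] / node_cpus[i];
		printf("node %d: raw load %d, normalised load %d\n", i, raw, norm);
	}
	/* Prints 7 for the 4-cpu node and 15 for the 2-cpu node: once normalised,
	 * the 2-cpu node looks busier, so new tasks would prefer the 4-cpu node. */
	return 0;
}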
Diffstat (limited to 'kernel')
-rw-r--r--	kernel/sched.c	21
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 6994f9f537ee..1cf66c7f1732 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -784,7 +784,13 @@ static int sched_best_cpu(struct task_struct *p)
 	minload = 10000000;
 	for_each_node_with_cpus(i) {
-		load = atomic_read(&node_nr_running[i]);
+		/*
+		 * Node load is always divided by nr_cpus_node to normalise
+		 * load values in case cpu count differs from node to node.
+		 * We first multiply node_nr_running by 10 to get a little
+		 * better resolution.
+		 */
+		load = 10 * atomic_read(&node_nr_running[i]) / nr_cpus_node(i);
 		if (load < minload) {
 			minload = load;
 			node = i;
@@ -820,19 +826,26 @@ void sched_balance_exec(void)
  * geometrically decaying weight to the load measure:
  *	load_{t} = load_{t-1}/2 + nr_node_running_{t}
  * This way sudden load peaks are flattened out a bit.
+ * Node load is divided by nr_cpus_node() in order to compare nodes
+ * of different cpu count but also [first] multiplied by 10 to
+ * provide better resolution.
  */
 static int find_busiest_node(int this_node)
 {
 	int i, node = -1, load, this_load, maxload;
+	if (!nr_cpus_node(this_node))
+		return node;
 	this_load = maxload = (this_rq()->prev_node_load[this_node] >> 1)
-		+ atomic_read(&node_nr_running[this_node]);
+		+ (10 * atomic_read(&node_nr_running[this_node])
+			/ nr_cpus_node(this_node));
 	this_rq()->prev_node_load[this_node] = this_load;
-	for (i = 0; i < numnodes; i++) {
+	for_each_node_with_cpus(i) {
 		if (i == this_node)
 			continue;
 		load = (this_rq()->prev_node_load[i] >> 1)
-			+ atomic_read(&node_nr_running[i]);
+			+ (10 * atomic_read(&node_nr_running[i])
+				/ nr_cpus_node(i));
 		this_rq()->prev_node_load[i] = load;
 		if (load > maxload && (100*load > NODE_THRESHOLD*this_load)) {
 			maxload = load;
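
A minimal sketch of the decaying, normalised load measure that the patched
find_busiest_node() computes (standalone illustration; node_load(), prev_load,
nr_running and nr_cpus are hypothetical stand-ins for the runqueue's
prev_node_load[] and node_nr_running[] state):

#include <stdio.h>

/*
 * Per-node measure used for comparison:
 *	load_t = load_{t-1}/2 + 10 * nr_node_running / nr_cpus_node
 * prev_load is the value remembered from the previous balancing pass.
 */
static int node_load(int prev_load, int nr_running, int nr_cpus)
{
	if (!nr_cpus)	/* mirrors the patch's guard against cpu-less nodes */
		return 0;
	return (prev_load >> 1) + 10 * nr_running / nr_cpus;
}

int main(void)
{
	/* A 4-cpu node running 4 tasks and a 2-cpu node running 2 tasks
	 * come out equal (10 each) once normalised. */
	printf("%d %d\n", node_load(0, 4, 4), node_load(0, 2, 2));
	return 0;
}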