user/sven/linux.git/include/linux/topology.h, branch v3.2.45

sched: Allow SD_NODES_PER_DOMAIN to be overridden

2011-09-20T05:53:21Z

We want to override the default value of SD_NODES_PER_DOMAIN on ppc64, so move it into linux/topology.h. Signed-off-by: Anton Blanchard Acked-by: Peter Zijlstra Signed-off-by: Benjamin Herrenschmidt

mm: increase RECLAIM_DISTANCE to 30

2011-06-16T03:03:59Z

Recently, Robert Mueller reported (http://lkml.org/lkml/2010/9/12/236) that zone_reclaim_mode doesn't work properly on his new NUMA server (Dual Xeon E5520 + Intel S5520UR MB). He is using Cyrus IMAPd and it's built on a very traditional single-process model. * a master process which reads config files and manages the other process * multiple imapd processes, one per connection * multiple pop3d processes, one per connection * multiple lmtpd processes, one per connection * periodical "cleanup" processes. There are thousands of independent processes. The problem is, recent Intel motherboard turn on zone_reclaim_mode by default and traditional prefork model software don't work well on it. Unfortunatelly, such models are still typical even in the 21st century. We can't ignore them. This patch raises the zone_reclaim_mode threshold to 30. 30 doesn't have any specific meaning. but 20 means that one-hop QPI/Hypertransport and such relatively cheap 2-4 socket machine are often used for traditional servers as above. The intention is that these machines don't use zone_reclaim_mode. Note: ia64 and Power have arch specific RECLAIM_DISTANCE definitions. This patch doesn't change such high-end NUMA machine behavior. Dave Hansen said: : I know specifically of pieces of x86 hardware that set the information : in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode : behavior which that implies. : : They've done performance testing and run very large and scary benchmarks : to make sure that they _want_ this turned on. What this means for them : is that they'll probably be de-optimized, at least on newer versions of : the kernel. : : If you want to do this for particular systems, maybe _that_'s what we : should do. Have a list of specific configurations that need the : defaults overridden either because they're buggy, or they have an : unusual hardware configuration not really reflected in the distance : table. And later said: : The original change in the hardware tables was for the benefit of a : benchmark. Said benchmark isn't going to get run on mainline until the : next batch of enterprise distros drops, at which point the hardware where : this was done will be irrelevant for the benchmark. I'm sure any new : hardware will just set this distance to another yet arbitrary value to : make the kernel do what it wants. :) : : Also, when the hardware got _set_ to this initially, I complained. So, I : guess I'm getting my way now, with this patch. I'm cool with it. Reported-by: Robert Mueller Signed-off-by: KOSAKI Motohiro Acked-by: Christoph Lameter Acked-by: David Rientjes Reviewed-by: KAMEZAWA Hiroyuki Cc: Benjamin Herrenschmidt Cc: "Luck, Tony" Acked-by: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

sched: Add book scheduling domain

2010-09-09T18:41:20Z

On top of the SMT and MC scheduling domains this adds the BOOK scheduling domain. This is useful for NUMA like machines which do not have an interface which tells which piece of memory is attached to which node or where the hardware performs striping. Signed-off-by: Heiko Carstens Signed-off-by: Peter Zijlstra LKML-Reference: <20100831082844.253053798@de.ibm.com> Signed-off-by: Ingo Molnar

topology: alternate fix for ia64 tiger_defconfig build breakage

2010-08-10T03:44:57Z

Define stubs for the numa_*_id() generic percpu related functions for non-NUMA configurations in where the other non-numa stubs live. Fixes ia64 !NUMA build breakage -- e.g., tiger_defconfig Back out now unneeded '#ifndef CONFIG_NUMA' guards from ia64 smpboot.c Signed-off-by: Lee Schermerhorn Tested-by: Tony Luck Acked-by: Tony Luck Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

sched: Fix spelling of sibling

2010-06-29T08:44:29Z

No logic changes, only spelling. Signed-off-by: Michael Neuling Cc: linuxppc-dev@ozlabs.org Cc: David Howells Cc: Peter Zijlstra LKML-Reference: <15249.1277776921@neuling.org> Signed-off-by: Ingo Molnar

sched: Add asymmetric group packing option for sibling domain

2010-06-09T08:34:55Z

Check to see if the group is packed in a sched doman. This is primarily intended to used at the sibling level. Some cores like POWER7 prefer to use lower numbered SMT threads. In the case of POWER7, it can move to lower SMT modes only when higher threads are idle. When in lower SMT modes, the threads will perform better since they share less core resources. Hence when we have idle threads, we want them to be the higher ones. This adds a hook into f_b_g() called check_asym_packing() to check the packing. This packing function is run on idle threads. It checks to see if the busiest CPU in this domain (core in the P7 case) has a higher CPU number than what where the packing function is being run on. If it is, calculate the imbalance and return the higher busier thread as the busiest group to f_b_g(). Here we are assuming a lower CPU number will be equivalent to a lower SMT thread number. It also creates a new SD_ASYM_PACKING flag to enable this feature at any scheduler domain level. It also creates an arch hook to enable this feature at the sibling level. The default function doesn't enable this feature. Based heavily on patch from Peter Zijlstra. Fixes from Srivatsa Vaddagiri. Signed-off-by: Michael Neuling Signed-off-by: Srivatsa Vaddagiri Signed-off-by: Peter Zijlstra Cc: Arjan van de Ven Cc: "H. Peter Anvin" Cc: Thomas Gleixner LKML-Reference: <20100608045702.2936CCC897@localhost.localdomain> Signed-off-by: Ingo Molnar

numa: introduce numa_mem_id()- effective local memory node id

2010-05-27T16:12:57Z

Introduce numa_mem_id(), based on generic percpu variable infrastructure to track "nearest node with memory" for archs that support memoryless nodes. Define API in when CONFIG_HAVE_MEMORYLESS_NODES defined, else stubs. Architectures will define HAVE_MEMORYLESS_NODES if/when they support them. Archs can override definitions of: numa_mem_id() - returns node number of "local memory" node set_numa_mem() - initialize [this cpus'] per cpu variable 'numa_mem' cpu_to_mem() - return numa_mem for specified cpu; may be used as lvalue Generic initialization of 'numa_mem' occurs in __build_all_zonelists(). This will initialize the boot cpu at boot time, and all cpus on change of numa_zonelist_order, or when node or memory hot-plug requires zonelist rebuild. Archs that support memoryless nodes will need to initialize 'numa_mem' for secondary cpus as they're brought on-line. [akpm@linux-foundation.org: fix build] Signed-off-by: Lee Schermerhorn Signed-off-by: Christoph Lameter Cc: Tejun Heo Cc: Mel Gorman Cc: Christoph Lameter Cc: Nick Piggin Cc: David Rientjes Cc: Eric Whitney Cc: KAMEZAWA Hiroyuki Cc: Ingo Molnar Cc: Thomas Gleixner Cc: "H. Peter Anvin" Cc: "Luck, Tony" Cc: Pekka Enberg Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

numa: add generic percpu var numa_node_id() implementation

2010-05-27T16:12:57Z

Rework the generic version of the numa_node_id() function to use the new generic percpu variable infrastructure. Guard the new implementation with a new config option: CONFIG_USE_PERCPU_NUMA_NODE_ID. Archs which support this new implemention will default this option to 'y' when NUMA is configured. This config option could be removed if/when all archs switch over to the generic percpu implementation of numa_node_id(). Arch support involves: 1) converting any existing per cpu variable implementations to use this implementation. x86_64 is an instance of such an arch. 2) archs that don't use a per cpu variable for numa_node_id() will need to initialize the new per cpu variable "numa_node" as cpus are brought on-line. ia64 is an example. 3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g., when NUMA is configured. This is required because I have retained the old implementation by default to allow archs to be modified incrementally, as desired. Subsequent patches will convert x86_64 and ia64 to use this implemenation. Signed-off-by: Lee Schermerhorn Cc: Tejun Heo Cc: Mel Gorman Reviewed-by: Christoph Lameter Cc: Nick Piggin Cc: David Rientjes Cc: Eric Whitney Cc: KAMEZAWA Hiroyuki Cc: Ingo Molnar Cc: Thomas Gleixner Cc: "H. Peter Anvin" Cc: "Luck, Tony" Cc: Pekka Enberg Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

sched: Fix vmark regression on big machines

2010-01-21T12:39:03Z

SD_PREFER_SIBLING is set at the CPU domain level if power saving isn't enabled, leading to many cache misses on large machines as we traverse looking for an idle shared cache to wake to. Change the enabler of select_idle_sibling() to SD_SHARE_PKG_RESOURCES, and enable same at the sibling domain level. Reported-by: Lin Ming Signed-off-by: Mike Galbraith Signed-off-by: Peter Zijlstra LKML-Reference: <1262612696.15495.15.camel@marge.simson.net> Signed-off-by: Ingo Molnar

sched: Disable SD_PREFER_LOCAL for MC/CPU domains

2009-10-14T13:02:34Z

Yanmin reported that both tbench and hackbench were significantly hurt by trying to keep tasks local on these domains, esp on small cache machines. So disable it in order to promote spreading outside of the cache domains. Reported-by: "Zhang, Yanmin" Signed-off-by: Peter Zijlstra CC: Mike Galbraith LKML-Reference: <1255083400.8802.15.camel@laptop> Signed-off-by: Ingo Molnar