<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/linux/topology.h, branch leds/HEAD</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=leds%2FHEAD</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=leds%2FHEAD'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2015-05-27T13:22:15Z</updated>
<entry>
<title>sched/topology: Rename topology_thread_cpumask() to topology_sibling_cpumask()</title>
<updated>2015-05-27T13:22:15Z</updated>
<author>
<name>Bartosz Golaszewski</name>
<email>bgolaszewski@baylibre.com</email>
</author>
<published>2015-05-26T13:11:28Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=06931e62246844c73fba24d7aeb4a5dc897a2739'/>
<id>urn:sha1:06931e62246844c73fba24d7aeb4a5dc897a2739</id>
<content type='text'>
Rename topology_thread_cpumask() to topology_sibling_cpumask()
for more consistency with scheduler code.

Signed-off-by: Bartosz Golaszewski &lt;bgolaszewski@baylibre.com&gt;
Reviewed-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Acked-by: Russell King &lt;rmk+kernel@arm.linux.org.uk&gt;
Acked-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Benoit Cousson &lt;bcousson@baylibre.com&gt;
Cc: Fenghua Yu &lt;fenghua.yu@intel.com&gt;
Cc: Guenter Roeck &lt;linux@roeck-us.net&gt;
Cc: Jean Delvare &lt;jdelvare@suse.de&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Oleg Drokin &lt;oleg.drokin@intel.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Rafael J. Wysocki &lt;rjw@rjwysocki.net&gt;
Cc: Russell King &lt;linux@arm.linux.org.uk&gt;
Cc: Viresh Kumar &lt;viresh.kumar@linaro.org&gt;
Link: http://lkml.kernel.org/r/1432645896-12588-2-git-send-email-bgolaszewski@baylibre.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>topology: add support for node_to_mem_node() to determine the fallback node</title>
<updated>2014-10-10T02:25:51Z</updated>
<author>
<name>Joonsoo Kim</name>
<email>iamjoonsoo.kim@lge.com</email>
</author>
<published>2014-10-09T22:26:13Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ad2c8144418c6a81cefe65379fd47bbe8344cef2'/>
<id>urn:sha1:ad2c8144418c6a81cefe65379fd47bbe8344cef2</id>
<content type='text'>
Anton noticed (http://www.spinics.net/lists/linux-mm/msg67489.html) that
on ppc LPARs with memoryless nodes, a large amount of memory was consumed
by slabs and was marked unreclaimable.  He tracked it down to slab
deactivations in the SLUB core when we allocate remotely, leading to
consistently poor efficiency when memoryless nodes are present.

After much discussion, Joonsoo provided a few patches that help
significantly.  They don't resolve the problem altogether:

 - memory hotplug still needs testing, that is, when a memoryless node
   gains memory, we want to do the right thing
 - there are other reasons for going off-node than memoryless nodes,
   e.g., fully exhausted local nodes

Neither case is resolved with this series, but I don't think that should
block their acceptance, as they can be explored/resolved with follow-on
patches.

The series consists of:

[1/3] topology: add support for node_to_mem_node() to determine the
      fallback node

[2/3] slub: fallback to node_to_mem_node() node if allocating on
      memoryless node

      - Joonsoo's patches to cache the nearest node with memory for each
        NUMA node

[3/3] Partial revert of 81c98869faa5 ("kthread: ensure locality of
      task_struct allocations")

 - At Tejun's request, keep the knowledge of memoryless node fallback
   to the allocator core.

This patch (of 3):

We need to determine the fallback node in the slub allocator if the
allocation target node is a memoryless node.  Without it, SLUB wrongly
selects a node which has no memory and can't use a partial slab because of
the node mismatch.  The introduced function, node_to_mem_node(X), returns
the node Y with memory that is nearest to X.  If X is a memoryless node, it
returns the nearest node with memory; if X is a normal node, it returns X
itself.

We will use this function in the following patch to determine the fallback
node.
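
As a rough illustration of the semantics described above, the lookup can
be modelled in a few lines of Python.  This is not the kernel
implementation (which is C); the has_memory list and the distance table
are hypothetical inputs invented for this sketch.

```python
# Illustrative model only: the names and data layout are assumptions,
# not taken from the kernel sources.
def node_to_mem_node(node, has_memory, distance):
    """Return `node` if it has memory, else the nearest node that does."""
    if has_memory[node]:
        return node
    # Candidate fallback nodes are all nodes that own memory.
    candidates = [n for n in range(len(has_memory)) if has_memory[n]]
    # Pick the candidate with the smallest distance from `node`.
    return min(candidates, key=lambda n: distance[node][n])


# Hypothetical 3-node machine in which node 1 is memoryless.
has_memory = [True, False, True]
distance = [
    [10, 20, 40],
    [20, 10, 15],
    [40, 15, 10],
]

print(node_to_mem_node(0, has_memory, distance))
print(node_to_mem_node(1, has_memory, distance))
```

In this model a normal node maps to itself and the memoryless node maps
to whichever node with memory is closest in the distance table.  The
sketch raises ValueError if no node has memory at all.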

Signed-off-by: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Signed-off-by: Nishanth Aravamudan &lt;nacc@linux.vnet.ibm.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Han Pingtian &lt;hanpt@linux.vnet.ibm.com&gt;
Cc: Pekka Enberg &lt;penberg@kernel.org&gt;
Cc: Paul Mackerras &lt;paulus@samba.org&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: Anton Blanchard &lt;anton@samba.org&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Cc: Wanpeng Li &lt;liwanp@linux.vnet.ibm.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: disable zone_reclaim_mode by default</title>
<updated>2014-06-04T23:53:59Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2014-06-04T23:07:14Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4f9b16a64753d0bb607454347036dc997fd03b82'/>
<id>urn:sha1:4f9b16a64753d0bb607454347036dc997fd03b82</id>
<content type='text'>
When it was introduced, zone_reclaim_mode made sense as NUMA distances
were punishing and workloads were generally partitioned to fit into a NUMA
node.  NUMA machines are now common but few of the workloads are
NUMA-aware and it's routine to see major performance degradation due to
zone_reclaim_mode being enabled but relatively few can identify the
problem.

Those that require zone_reclaim_mode are likely to be able to detect
when it needs to be enabled and tune appropriately so let's have a
sensible default for the bulk of users.

This patch (of 2):

zone_reclaim_mode causes processes to prefer reclaiming memory from
the local node instead of spilling over to other nodes.  This made sense
initially when NUMA machines were almost exclusively HPC and the
workload was partitioned into nodes.  The NUMA penalties were
sufficiently high to justify reclaiming the memory.  On current machines
and workloads it is often the case that zone_reclaim_mode destroys
performance but not all users know how to detect this.  Favour the
common case and disable it by default.  Users that are sophisticated
enough to know they need zone_reclaim_mode will detect it.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reviewed-by: Zhang Yanfei &lt;zhangyanfei@cn.fujitsu.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Reviewed-by: Christoph Lameter &lt;cl@linux.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>sched: Rework sched_domain topology definition</title>
<updated>2014-05-07T11:33:49Z</updated>
<author>
<name>Vincent Guittot</name>
<email>vincent.guittot@linaro.org</email>
</author>
<published>2014-04-11T09:44:37Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=143e1e28cb40bed836b0a06567208bd7347c9672'/>
<id>urn:sha1:143e1e28cb40bed836b0a06567208bd7347c9672</id>
<content type='text'>
We replace the old way to configure the scheduler topology with a new method
which enables a platform to declare additional levels (if needed).

We still have a default topology table definition that can be used by platforms
that don't want more levels than the SMT, MC, CPU and NUMA ones. This table can
be overwritten by an arch which either wants to add a new level where load
balancing makes sense, like BOOK or a powergating level, or wants to change the
flags configuration of some levels.

For each level, we need a function pointer that returns the cpumask for each cpu,
a function pointer that returns the flags for the level, and a name. Only flags
that describe topology can be set by an architecture. The current topology
flags are:

 SD_SHARE_CPUPOWER
 SD_SHARE_PKG_RESOURCES
 SD_NUMA
 SD_ASYM_PACKING

Then, each level must be a subset of the next one. The build sequence of the
sched_domain will take care of removing useless levels like those with 1 CPU
and those with the same CPU span and no more relevant information for
load balancing than their children.
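
The subset rule and the removal of useless levels can be pictured with a
small Python model.  It is only a sketch under simplifying assumptions:
the function name and the example spans are hypothetical, and the
topology flags are ignored, so a level is dropped purely because it has
one CPU or the same span as the level kept below it.

```python
# Illustrative model only; the kernel builds sched domains in C and also
# compares flags before it drops a level.
def prune_levels(levels):
    """levels: list of (name, cpu_span) ordered from smallest to largest."""
    kept = []
    previous_span = None  # span of the level directly below, kept or not
    for name, span in levels:
        span = frozenset(span)
        # Each level must be a subset of the next one.
        if previous_span is not None and not previous_span.issubset(span):
            raise ValueError(f"level {name} does not contain the level below it")
        previous_span = span
        if len(span) == 1:
            continue  # a single CPU leaves nothing to balance
        if kept and kept[-1][1] == span:
            continue  # same span as its child, so nothing new (flags ignored)
        kept.append((name, span))
    return [name for name, _ in kept]


# Hypothetical view from CPU 0: no SMT siblings, MC and CPU levels identical.
levels_for_cpu0 = [
    ("SMT", {0}),
    ("MC", {0, 1, 2, 3}),
    ("CPU", {0, 1, 2, 3}),
    ("NUMA", {0, 1, 2, 3, 4, 5, 6, 7}),
]

print(prune_levels(levels_for_cpu0))
```

In this example only the MC and NUMA levels remain, because SMT spans a
single CPU and CPU repeats the span of MC.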

Signed-off-by: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Tested-by: Dietmar Eggemann &lt;dietmar.eggemann@arm.com&gt;
Reviewed-by: Preeti U Murthy &lt;preeti@linux.vnet.ibm.com&gt;
Reviewed-by: Dietmar Eggemann &lt;dietmar.eggemann@arm.com&gt;
Signed-off-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: Bjorn Helgaas &lt;bhelgaas@google.com&gt;
Cc: Chris Metcalf &lt;cmetcalf@tilera.com&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Cc: David S. Miller &lt;davem@davemloft.net&gt;
Cc: Fenghua Yu &lt;fenghua.yu@intel.com&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: Hanjun Guo &lt;hanjun.guo@linaro.org&gt;
Cc: Heiko Carstens &lt;heiko.carstens@de.ibm.com&gt;
Cc: Jason Low &lt;jason.low2@hp.com&gt;
Cc: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Cc: Martin Schwidefsky &lt;schwidefsky@de.ibm.com&gt;
Cc: Tony Luck &lt;tony.luck@intel.com&gt;
Cc: linux390@de.ibm.com
Cc: linux-ia64@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Link: http://lkml.kernel.org/r/1397209481-28542-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>mm: use raw_cpu ops for determining current NUMA node</title>
<updated>2014-04-07T23:36:13Z</updated>
<author>
<name>Christoph Lameter</name>
<email>cl@linux.com</email>
</author>
<published>2014-04-07T22:39:38Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=dc322a99d31fff5d3f8acfa061ad033953efdebe'/>
<id>urn:sha1:dc322a99d31fff5d3f8acfa061ad033953efdebe</id>
<content type='text'>
With the preempt checking logic for __this_cpu_ops we will get false
positives from locations in the code that use numa_node_id.

Before the __this_cpu ops were introduced there were no checks for
preemption present either.  raw_smp_processor_id() was used.  See

  http://www.spinics.net/lists/linux-numa/msg00641.html

Therefore we need to use raw_cpu_read here to avoid false positives.

Note that this issue has been discussed in prior years.  If the process
changes nodes after retrieving the current numa node then that is
acceptable since most uses of numa_node etc are for optimization and not
for correctness.

There were suggestions to implement a raw_numa_node_id in order to do
preempt checks for numa_node_id as well.  But I think we had better defer
that to another patch since that would mean investigating how
numa_node_id() is used throughout the kernel which would increase the
scope of this patchset significantly.  After all preemption was never
checked before when numa_node_id() was used.

Some sample traces:

__this_cpu_read operation in preemptible [00000000] code: login/1456
caller is __this_cpu_preempt_check+0x2b/0x2d
CPU: 0 PID: 1456 Comm: login Not tainted 3.12.0-rc4-cl-00062-g2fe80d3-dirty #185
Call Trace:
  dump_stack+0x4e/0x82
  check_preemption_disabled+0xc5/0xe0
  __this_cpu_preempt_check+0x2b/0x2d
  get_task_policy+0x1d/0x49
  get_vma_policy+0x14/0x76
  alloc_pages_vma+0x35/0xff
  handle_mm_fault+0x290/0x73b
  __do_page_fault+0x3fe/0x44d
  do_page_fault+0x9/0xc
  page_fault+0x22/0x30
  generic_file_aio_read+0x38e/0x624
  do_sync_read+0x54/0x73
  vfs_read+0x9d/0x12a
  SyS_read+0x47/0x7e
  cstar_dispatch+0x7/0x23

caller is __this_cpu_preempt_check+0x2b/0x2d
CPU: 0 PID: 1456 Comm: login Not tainted 3.12.0-rc4-cl-00062-g2fe80d3-dirty #185
Call Trace:
  dump_stack+0x4e/0x82
  check_preemption_disabled+0xc5/0xe0
  __this_cpu_preempt_check+0x2b/0x2d
  alloc_pages_current+0x8f/0xbc
  __page_cache_alloc+0xb/0xd
  __do_page_cache_readahead+0xf4/0x219
  ra_submit+0x1c/0x20
  ondemand_readahead+0x28c/0x2b4
  page_cache_sync_readahead+0x38/0x3a
  generic_file_aio_read+0x261/0x624
  do_sync_read+0x54/0x73
  vfs_read+0x9d/0x12a
  SyS_read+0x47/0x7e
  cstar_dispatch+0x7/0x23

Signed-off-by: Christoph Lameter &lt;cl@linux.com&gt;
Acked-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Alex Shi &lt;alex.shi@intel.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>sched/balancing: Periodically decay max cost of idle balance</title>
<updated>2013-09-20T10:03:46Z</updated>
<author>
<name>Jason Low</name>
<email>jason.low2@hp.com</email>
</author>
<published>2013-09-13T18:26:53Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f48627e686a69f5215cb0761e731edb3d9859dd9'/>
<id>urn:sha1:f48627e686a69f5215cb0761e731edb3d9859dd9</id>
<content type='text'>
This patch builds on patch 2 and periodically decays that max value to
do idle balancing per sched domain by approximately 1% per second. Also
decay the rq's max_idle_balance_cost value.
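
A decay of roughly 1% per second can be done with integer arithmetic
only.  The Python sketch below is a hedged illustration: the factor
253/256 is an assumption chosen here as a cheap integer approximation
(it is closer to 1.2%), and the function name and the nanosecond unit
are likewise hypothetical.

```python
# Illustrative model only; names, units and the 253/256 factor are
# assumptions made for this sketch.
def decay_max_cost(max_cost_ns, seconds):
    """Apply one multiplicative decay step per elapsed second."""
    for _ in range(seconds):
        # Integer-only scaling by 253/256, i.e. a drop of about 1.2%.
        max_cost_ns = max_cost_ns * 253 // 256
    return max_cost_ns


print(decay_max_cost(256000, 1))
print(decay_max_cost(256000, 60))
```

Without such a decay, one unusually expensive balance attempt would keep
the recorded maximum high indefinitely.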

Signed-off-by: Jason Low &lt;jason.low2@hp.com&gt;
Signed-off-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Link: http://lkml.kernel.org/r/1379096813-3032-4-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/balancing: Consider max cost of idle balance per sched domain</title>
<updated>2013-09-20T10:03:44Z</updated>
<author>
<name>Jason Low</name>
<email>jason.low2@hp.com</email>
</author>
<published>2013-09-13T18:26:52Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=9bd721c55c8a886b938a45198aab0ccb52f1f7fa'/>
<id>urn:sha1:9bd721c55c8a886b938a45198aab0ccb52f1f7fa</id>
<content type='text'>
In this patch, we keep track of the max cost we spend doing idle load balancing
for each sched domain. If the avg time the CPU remains idle is less than the
time we have already spent on idle balancing + the max cost of idle balancing
in the sched domain, then we don't continue to attempt the balance. We also
keep a per rq variable, max_idle_balance_cost, which keeps track of the max
time spent on newidle load balances throughout all its domains so that we can
determine the avg_idle's max value.

By using the max, we avoid overrunning the average. This further reduces the
chance we attempt balancing when the CPU is not idle for longer than the cost
to balance.

Signed-off-by: Jason Low &lt;jason.low2@hp.com&gt;
Signed-off-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Link: http://lkml.kernel.org/r/1379096813-3032-3-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched: recover SD_WAKE_AFFINE in select_task_rq_fair and code clean up</title>
<updated>2012-08-13T17:02:05Z</updated>
<author>
<name>Alex Shi</name>
<email>alex.shi@intel.com</email>
</author>
<published>2012-07-26T00:55:34Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f03542a7019c600163ac4441d8a826c92c1bd510'/>
<id>urn:sha1:f03542a7019c600163ac4441d8a826c92c1bd510</id>
<content type='text'>
Since the power saving code has been removed from sched, the code that
implemented it in this function is out of service and even pollutes other
logic.  For example, 'want_sd' never has a chance to be set to '0', which
removes the effect of SD_WAKE_AFFINE here.

So, clean up the obsolete code, including SD_PREFER_LOCAL.

Signed-off-by: Alex Shi &lt;alex.shi@intel.com&gt;
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Link: http://lkml.kernel.org/r/5028F431.6000306@intel.com
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
</content>
</entry>
<entry>
<title>sched/numa: Add SD_PERFER_SIBLING to CPU domain</title>
<updated>2012-07-26T09:46:58Z</updated>
<author>
<name>Alex Shi</name>
<email>alex.shi@intel.com</email>
</author>
<published>2012-07-20T06:19:50Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6956dc568f34107f1d02b24f87efe7250803fc87'/>
<id>urn:sha1:6956dc568f34107f1d02b24f87efe7250803fc87</id>
<content type='text'>
Commit 8e7fbcbc22c ("sched: Remove stale power aware scheduling remnants
and dysfunctional knobs") removed SD_PERFER_SIBLING from the CPU domain.

On NUMA machines this means that load_balance() doesn't prefer logical
CPUs in the same physical CPU package.

It causes some actual performance regressions on our NUMA machines from
Core2 to NHM and SNB.

Adding this domain flag again recovers the performance drop.

This change doesn't have any bad impact on any of my benchmarks:
 specjbb, kbuild, fio, hackbench .. etc, on all my machines.

Signed-off-by: Alex Shi &lt;alex.shi@intel.com&gt;
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Link: http://lkml.kernel.org/r/1342765190-21540-1-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2012-05-23T01:27:32Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2012-05-23T01:27:32Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=d79ee93de909dfb252279b9a95978bbda9a814a9'/>
<id>urn:sha1:d79ee93de909dfb252279b9a95978bbda9a814a9</id>
<content type='text'>
Pull scheduler changes from Ingo Molnar:
 "The biggest change is the cleanup/simplification of the load-balancer:
  instead of the current practice of architectures twiddling scheduler
  internal data structures and providing the scheduler domains in
  colorfully inconsistent ways, we now have generic scheduler code in
  kernel/sched/core.c:sched_init_numa() that looks at the architecture's
  node_distance() parameters and (while not fully trusting it) deduces a
  NUMA topology from it.

  This inevitably changes balancing behavior - hopefully for the better.

  There are various smaller optimizations, cleanups and fixlets as well"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Taint kernel with TAINT_WARN after sleep-in-atomic bug
  sched: Remove stale power aware scheduling remnants and dysfunctional knobs
  sched/debug: Fix printing large integers on 32-bit platforms
  sched/fair: Improve the -&gt;group_imb logic
  sched/nohz: Fix rq-&gt;cpu_load[] calculations
  sched/numa: Don't scale the imbalance
  sched/fair: Revert sched-domain iteration breakage
  sched/x86: Rewrite set_cpu_sibling_map()
  sched/numa: Fix the new NUMA topology bits
  sched/numa: Rewrite the CONFIG_NUMA sched domain support
  sched/fair: Propagate 'struct lb_env' usage into find_busiest_group
  sched/fair: Add some serialization to the sched_domain load-balance walk
  sched/fair: Let minimally loaded cpu balance the group
  sched: Change rq-&gt;nr_running to unsigned int
  x86/numa: Check for nonsensical topologies on real hw as well
  x86/numa: Hard partition cpu topology masks on node boundaries
  x86/numa: Allow specifying node_distance() for numa=fake
  x86/sched: Make mwait_usable() heed to "idle=" kernel parameters properly
  sched: Update documentation and comments
  sched_rt: Avoid unnecessary dequeue and enqueue of pushable tasks in set_cpus_allowed_rt()
</content>
</entry>
</feed>
