<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/linux/memcontrol.h, branch v3.12.43</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.12.43</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.12.43'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2013-10-17T04:35:53Z</updated>
<entry>
<title>mm: memcg: handle non-error OOM situations more gracefully</title>
<updated>2013-10-17T04:35:53Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2013-10-16T20:46:59Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4942642080ea82d99ab5b653abb9a12b7ba31f4a'/>
<id>urn:sha1:4942642080ea82d99ab5b653abb9a12b7ba31f4a</id>
<content type='text'>
Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full
callstack on OOM") assumed that only a few places that can trigger a
memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
readahead.  But there are many more and it's impractical to annotate
them all.

First of all, we don't want to invoke the OOM killer when the failed
allocation is gracefully handled, so defer the actual kill to the end of
the fault handling as well.  This simplifies the code quite a bit for
added bonus.

Second, since a failed allocation might not be the abrupt end of the
fault, the memcg OOM handler needs to be re-entrant until the fault
finishes for subsequent allocation attempts.  If an allocation is
attempted after the task already OOMed, allow it to bypass the limit so
that it can quickly finish the fault and invoke the OOM killer.

Reported-by: azurIt &lt;azurit@pobox.sk&gt;
Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: &lt;stable@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>revert "memcg, vmscan: integrate soft reclaim tighter with zone shrinking code"</title>
<updated>2013-09-25T00:00:26Z</updated>
<author>
<name>Andrew Morton</name>
<email>akpm@linux-foundation.org</email>
</author>
<published>2013-09-24T22:27:41Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=0608f43da64a1f1c42507304b5f25bc8b1227aa4'/>
<id>urn:sha1:0608f43da64a1f1c42507304b5f25bc8b1227aa4</id>
<content type='text'>
Revert commit 3b38722efd9f ("memcg, vmscan: integrate soft reclaim
tighter with zone shrinking code")

I merged this prematurely - Michal and Johannes still disagree about the
overall design direction and the future remains unclear.

Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>revert "vmscan, memcg: do softlimit reclaim also for targeted reclaim"</title>
<updated>2013-09-25T00:00:26Z</updated>
<author>
<name>Andrew Morton</name>
<email>akpm@linux-foundation.org</email>
</author>
<published>2013-09-24T22:27:38Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b1aff7fcf86c88472b0a70f15d89d7a4adba07bb'/>
<id>urn:sha1:b1aff7fcf86c88472b0a70f15d89d7a4adba07bb</id>
<content type='text'>
Revert commit a5b7c87f9207 ("vmscan, memcg: do softlimit reclaim also
for targeted reclaim")

I merged this prematurely - Michal and Johannes still disagree about the
overall design direction and the future remains unclear.

Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>revert "memcg: enhance memcg iterator to support predicates"</title>
<updated>2013-09-25T00:00:26Z</updated>
<author>
<name>Andrew Morton</name>
<email>akpm@linux-foundation.org</email>
</author>
<published>2013-09-24T22:27:37Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=694fbc0fe78518d06efa63910bf4ecee660e7852'/>
<id>urn:sha1:694fbc0fe78518d06efa63910bf4ecee660e7852</id>
<content type='text'>
Revert commit de57780dc659 ("memcg: enhance memcg iterator to support
predicates")

I merged this prematurely - Michal and Johannes still disagree about the
overall design direction and the future remains unclear.

Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>memcg: add per cgroup writeback pages accounting</title>
<updated>2013-09-12T22:38:02Z</updated>
<author>
<name>Sha Zhengju</name>
<email>handai.szj@taobao.com</email>
</author>
<published>2013-09-12T22:13:53Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3ea67d06e4679a16f69f66f43a8d6ee4778985fc'/>
<id>urn:sha1:3ea67d06e4679a16f69f66f43a8d6ee4778985fc</id>
<content type='text'>
Add memcg routines to count writeback pages, later dirty pages will also
be accounted.

After Kame's commit 89c06bd52fb9 ("memcg: use new logic for page stat
accounting"), we can use 'struct page' flag to test page state instead
of per page_cgroup flag.  But memcg has a feature to move a page from a
cgroup to another one and may have race between "move" and "page stat
accounting".  So in order to avoid the race we have designed a new lock:

         mem_cgroup_begin_update_page_stat()
         modify page information        --&gt;(a)
         mem_cgroup_update_page_stat()  --&gt;(b)
         mem_cgroup_end_update_page_stat()

It requires both (a) and (b)(writeback pages accounting) to be pretected
in mem_cgroup_{begin/end}_update_page_stat().  It's full no-op for
!CONFIG_MEMCG, almost no-op if memcg is disabled (but compiled in), rcu
read lock in the most cases (no task is moving), and spin_lock_irqsave
on top in the slow path.

There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
And the lock order is:
	--&gt; memcg-&gt;move_lock
	  --&gt; mapping-&gt;tree_lock

Signed-off-by: Sha Zhengju &lt;handai.szj@taobao.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Reviewed-by: Greg Thelen &lt;gthelen@google.com&gt;
Cc: Fengguang Wu &lt;fengguang.wu@intel.com&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>memcg: remove MEMCG_NR_FILE_MAPPED</title>
<updated>2013-09-12T22:38:02Z</updated>
<author>
<name>Sha Zhengju</name>
<email>handai.szj@taobao.com</email>
</author>
<published>2013-09-12T22:13:50Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=68b4876d996e8749142b2895bc2e251448996363'/>
<id>urn:sha1:68b4876d996e8749142b2895bc2e251448996363</id>
<content type='text'>
While accounting memcg page stat, it's not worth to use
MEMCG_NR_FILE_MAPPED as an extra layer of indirection because of the
complexity and presumed performance overhead.  We can use
MEM_CGROUP_STAT_FILE_MAPPED directly.

Signed-off-by: Sha Zhengju &lt;handai.szj@taobao.com&gt;
Acked-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Acked-by: Fengguang Wu &lt;fengguang.wu@intel.com&gt;
Reviewed-by: Greg Thelen &lt;gthelen@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: memcg: do not trap chargers with full callstack on OOM</title>
<updated>2013-09-12T22:38:02Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2013-09-12T22:13:44Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3812c8c8f3953921ef18544110dafc3505c1ac62'/>
<id>urn:sha1:3812c8c8f3953921ef18544110dafc3505c1ac62</id>
<content type='text'>
The memcg OOM handling is incredibly fragile and can deadlock.  When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds.  Comparably, any other task
that enters the charge path at this point will go to a waitqueue right
then and there and sleep until the OOM situation is resolved.  The problem
is that these tasks may hold filesystem locks and the mmap_sem; locks that
the selected OOM victim may need to exit.

For example, in one reported case, the task invoking the OOM killer was
about to charge a page cache page during a write(), which holds the
i_mutex.  The OOM killer selected a task that was just entering truncate()
and trying to acquire the i_mutex:

OOM invoking task:
  mem_cgroup_handle_oom+0x241/0x3b0
  mem_cgroup_cache_charge+0xbe/0xe0
  add_to_page_cache_locked+0x4c/0x140
  add_to_page_cache_lru+0x22/0x50
  grab_cache_page_write_begin+0x8b/0xe0
  ext3_write_begin+0x88/0x270
  generic_file_buffered_write+0x116/0x290
  __generic_file_aio_write+0x27c/0x480
  generic_file_aio_write+0x76/0xf0           # takes -&gt;i_mutex
  do_sync_write+0xea/0x130
  vfs_write+0xf3/0x1f0
  sys_write+0x51/0x90
  system_call_fastpath+0x18/0x1d

OOM kill victim:
  do_truncate+0x58/0xa0              # takes i_mutex
  do_last+0x250/0xa30
  path_openat+0xd7/0x440
  do_filp_open+0x49/0xa0
  do_sys_open+0x106/0x240
  sys_open+0x20/0x30
  system_call_fastpath+0x18/0x1d

The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.

A similar scenario can happen when the kernel OOM killer for a memcg is
disabled and a userspace task is in charge of resolving OOM situations.
In this case, ALL tasks that enter the OOM path will be made to sleep on
the OOM waitqueue and wait for userspace to free resources or increase
the group's limit.  But a userspace OOM handler is prone to deadlock
itself on the locks held by the waiting tasks.  For example one of the
sleeping tasks may be stuck in a brk() call with the mmap_sem held for
writing but the userspace handler, in order to pick an optimal victim,
may need to read files from /proc/&lt;pid&gt;, which tries to acquire the same
mmap_sem for reading and deadlocks.

This patch changes the way tasks behave after detecting a memcg OOM and
makes sure nobody loops or sleeps with locks held:

1. When OOMing in a user fault, invoke the OOM killer and restart the
   fault instead of looping on the charge attempt.  This way, the OOM
   victim can not get stuck on locks the looping task may hold.

2. When OOMing in a user fault but somebody else is handling it
   (either the kernel OOM killer or a userspace handler), don't go to
   sleep in the charge context.  Instead, remember the OOMing memcg in
   the task struct and then fully unwind the page fault stack with
   -ENOMEM.  pagefault_out_of_memory() will then call back into the
   memcg code to check if the -ENOMEM came from the memcg, and then
   either put the task to sleep on the memcg's OOM waitqueue or just
   restart the fault.  The OOM victim can no longer get stuck on any
   lock a sleeping task may hold.

Debugged by Michal Hocko.

Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reported-by: azurIt &lt;azurit@pobox.sk&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: memcg: enable memcg OOM killer only for user faults</title>
<updated>2013-09-12T22:38:01Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2013-09-12T22:13:42Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=519e52473ebe9db5cdef44670d5a97f1fd53d721'/>
<id>urn:sha1:519e52473ebe9db5cdef44670d5a97f1fd53d721</id>
<content type='text'>
System calls and kernel faults (uaccess, gup) can handle an out of memory
situation gracefully and just return -ENOMEM.

Enable the memcg OOM killer only for user faults, where it's really the
only option available.

Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: azurIt &lt;azurit@pobox.sk&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>memcg: enhance memcg iterator to support predicates</title>
<updated>2013-09-12T22:38:00Z</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.cz</email>
</author>
<published>2013-09-12T22:13:26Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=de57780dc659f95b17ccb649f003278dde0b5b86'/>
<id>urn:sha1:de57780dc659f95b17ccb649f003278dde0b5b86</id>
<content type='text'>
The caller of the iterator might know that some nodes or even subtrees
should be skipped but there is no way to tell iterators about that so the
only choice left is to let iterators to visit each node and do the
selection outside of the iterating code.  This, however, doesn't scale
well with hierarchies with many groups where only few groups are
interesting.

This patch adds mem_cgroup_iter_cond variant of the iterator with a
callback which gets called for every visited node.  There are three
possible ways how the callback can influence the walk.  Either the node is
visited, it is skipped but the tree walk continues down the tree or the
whole subtree of the current group is skipped.

[hughd@google.com: fix memcg-less page reclaim]
Signed-off-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Balbir Singh &lt;bsingharora@gmail.com&gt;
Cc: Glauber Costa &lt;glommer@openvz.org&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Michel Lespinasse &lt;walken@google.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Ying Han &lt;yinghan@google.com&gt;
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>vmscan, memcg: do softlimit reclaim also for targeted reclaim</title>
<updated>2013-09-12T22:38:00Z</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.cz</email>
</author>
<published>2013-09-12T22:13:25Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a5b7c87f92076352dbff2fe0423ec255e1c9a71b'/>
<id>urn:sha1:a5b7c87f92076352dbff2fe0423ec255e1c9a71b</id>
<content type='text'>
Soft reclaim has been done only for the global reclaim (both background
and direct).  Since "memcg: integrate soft reclaim tighter with zone
shrinking code" there is no reason for this limitation anymore as the soft
limit reclaim doesn't use any special code paths and it is a part of the
zone shrinking code which is used by both global and targeted reclaims.

From the semantic point of view it is natural to consider soft limit
before touching all groups in the hierarchy tree which is touching the
hard limit because soft limit tells us where to push back when there is a
memory pressure.  It is not important whether the pressure comes from the
limit or imbalanced zones.

This patch simply enables soft reclaim unconditionally in
mem_cgroup_should_soft_reclaim so it is enabled for both global and
targeted reclaim paths.  mem_cgroup_soft_reclaim_eligible needs to learn
about the root of the reclaim to know where to stop checking soft limit
state of parents up the hierarchy.  Say we have

A (over soft limit)
 \
  B (below s.l., hit the hard limit)
 / \
C   D (below s.l.)

B is the source of the outside memory pressure now for D but we shouldn't
soft reclaim it because it is behaving well under B subtree and we can
still reclaim from C (pressumably it is over the limit).
mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the
hierarchy at B (root of the memory pressure).

Signed-off-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Reviewed-by: Glauber Costa &lt;glommer@openvz.org&gt;
Reviewed-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Balbir Singh &lt;bsingharora@gmail.com&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Michel Lespinasse &lt;walken@google.com&gt;
Cc: Ying Han &lt;yinghan@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
