user/sven/linux.git/include/linux/memcontrol.h, branch v3.0.85

mm: change isolate mode from #define to bitwise type

2012-08-01T19:27:16Z

commit 4356f21d09283dc6d39a6f7287a65ddab61e2808 upstream. Stable note: Not tracked in Bugzilla. This patch makes later patches easier to apply but has no other impact. Change ISOLATE_XXX macro with bitwise isolate_mode_t type. Normally, macro isn't recommended as it's type-unsafe and making debugging harder as symbol cannot be passed throught to the debugger. Quote from Johannes " Hmm, it would probably be cleaner to fully convert the isolation mode into independent flags. INACTIVE, ACTIVE, BOTH is currently a tri-state among flags, which is a bit ugly." This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h Signed-off-by: Minchan Kim Cc: Johannes Weiner Cc: KAMEZAWA Hiroyuki Cc: KOSAKI Motohiro Cc: Mel Gorman Cc: Rik van Riel Cc: Michal Hocko Cc: Andrea Arcangeli Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Mel Gorman Signed-off-by: Greg Kroah-Hartman

memcg: add mem_cgroup_replace_page_cache() to fix LRU issue

2012-01-26T01:24:43Z

commit ab936cbcd02072a34b60d268f94440fd5cf1970b upstream. Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added a function replace_page_cache_page(). This function replaces a page in the radix-tree with a new page. WHen doing this, memory cgroup needs to fix up the accounting information. memcg need to check PCG_USED bit etc. In some(many?) cases, 'newpage' is on LRU before calling replace_page_cache(). So, memcg's LRU accounting information should be fixed, too. This patch adds mem_cgroup_replace_page_cache() and removes the old hooks. In that function, old pages will be unaccounted without touching res_counter and new page will be accounted to the memcg (of old page). WHen overwriting pc->mem_cgroup of newpage, take zone->lru_lock and avoid races with LRU handling. Background: replace_page_cache_page() is called by FUSE code in its splice() handling. Here, 'newpage' is replacing oldpage but this newpage is not a newly allocated page and may be on LRU. LRU mis-accounting will be critical for memory cgroup because rmdir() checks the whole LRU is empty and there is no account leak. If a page is on the other LRU than it should be, rmdir() will fail. This bug was added in March 2011, but no bug report yet. I guess there are not many people who use memcg and FUSE at the same time with upstream kernels. The result of this bug is that admin cannot destroy a memcg because of account leak. So, no panic, no deadlock. And, even if an active cgroup exist, umount can succseed. So no problem at shutdown. Signed-off-by: KAMEZAWA Hiroyuki Acked-by: Johannes Weiner Acked-by: Michal Hocko Cc: Miklos Szeredi Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman

vmscan,memcg: memcg aware swap token

2011-06-16T03:03:59Z

Currently, memcg reclaim can disable swap token even if the swap token mm doesn't belong in its memory cgroup. It's slightly risky. If an admin creates very small mem-cgroup and silly guy runs contentious heavy memory pressure workload, every tasks are going to lose swap token and then system may become unresponsive. That's bad. This patch adds 'memcg' parameter into disable_swap_token(). and if the parameter doesn't match swap token, VM doesn't disable it. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KOSAKI Motohiro Reviewed-by: KAMEZAWA Hiroyuki Reviewed-by: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

memcg: add the pagefault count into memcg stats

2011-05-27T00:12:36Z

Two new stats in per-memcg memory.stat which tracks the number of page faults and number of major page faults. "pgfault" "pgmajfault" They are different from "pgpgin"/"pgpgout" stat which count number of pages charged/discharged to the cgroup and have no meaning of reading/ writing page to disk. It is valuable to track the two stats for both measuring application's performance as well as the efficiency of the kernel page reclaim path. Counting pagefaults per process is useful, but we also need the aggregated value since processes are monitored and controlled in cgroup basis in memcg. Functional test: check the total number of pgfault/pgmajfault of all memcgs and compare with global vmstat value: $ cat /proc/vmstat | grep fault pgfault 1070751 pgmajfault 553 $ cat /dev/cgroup/memory.stat | grep fault pgfault 1071138 pgmajfault 553 total_pgfault 1071142 total_pgmajfault 553 $ cat /dev/cgroup/A/memory.stat | grep fault pgfault 199 pgmajfault 0 total_pgfault 199 total_pgmajfault 0 Performance test: run page fault test(pft) wit 16 thread on faulting in 15G anon pages in 16G container. There is no regression noticed on the "flt/cpu/s" Sample output from pft: TAG pft:anon-sys-default: Gb Thr CLine User System Wall flt/cpu/s fault/wsec 15 16 1 0.67s 233.41s 14.76s 16798.546 266356.260 +-------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 10 16682.962 17344.027 16913.524 16928.812 166.5362 + 10 16695.568 16923.896 16820.604 16824.652 84.816568 No difference proven at 95.0% confidence [akpm@linux-foundation.org: fix build] [hughd@google.com: shmem fix] Signed-off-by: Ying Han Acked-by: KAMEZAWA Hiroyuki Cc: KOSAKI Motohiro Reviewed-by: Minchan Kim Cc: Daisuke Nishimura Acked-by: Balbir Singh Signed-off-by: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

memcg: rename mem_cgroup_zone_nr_pages() to mem_cgroup_zone_nr_lru_pages()

2011-05-27T00:12:35Z

The caller of the function has been renamed to zone_nr_lru_pages(), and this is just fixing up in the memcg code. The current name is easily to be mis-read as zone's total number of pages. Signed-off-by: Ying Han Acked-by: Johannes Weiner Acked-by: KAMEZAWA Hiroyuki Reviewed-by: Minchan Kim Cc: Balbir Singh Cc: Daisuke Nishimura Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

memcg: reclaim memory from nodes in round-robin order

2011-05-27T00:12:35Z

Presently, memory cgroup's direct reclaim frees memory from the current node. But this has some troubles. Usually when a set of threads works in a cooperative way, they tend to operate on the same node. So if they hit limits under memcg they will reclaim memory from themselves, damaging the active working set. For example, assume 2 node system which has Node 0 and Node 1 and a memcg which has 1G limit. After some work, file cache remains and the usages are Node 0: 1M Node 1: 998M. and run an application on Node 0, it will eat its foot before freeing unnecessary file caches. This patch adds round-robin for NUMA and adds equal pressure to each node. When using cpuset's spread memory feature, this will work very well. But yes, a better algorithm is needed. [akpm@linux-foundation.org: comment editing] [kamezawa.hiroyu@jp.fujitsu.com: fix time comparisons] Signed-off-by: Ying Han Signed-off-by: KAMEZAWA Hiroyuki Cc: Balbir Singh Cc: KOSAKI Motohiro Cc: Daisuke Nishimura Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

memcg: count the soft_limit reclaim in global background reclaim

2011-05-27T00:12:35Z

The global kswapd scans per-zone LRU and reclaims pages regardless of the cgroup. It breaks memory isolation since one cgroup can end up reclaiming pages from another cgroup. Instead we should rely on memcg-aware target reclaim including per-memcg kswapd and soft_limit hierarchical reclaim under memory pressure. In the global background reclaim, we do soft reclaim before scanning the per-zone LRU. However, the return value is ignored. This patch is the first step to skip shrink_zone() if soft_limit reclaim does enough work. This is part of the effort which tries to reduce reclaiming pages in global LRU in memcg. The per-memcg background reclaim patchset further enhances the per-cgroup targetting reclaim, which I should have V4 posted shortly. Try running multiple memory intensive workloads within seperate memcgs. Watch the counters of soft_steal in memory.stat. $ cat /dev/cgroup/A/memory.stat | grep 'soft' soft_steal 240000 soft_scan 240000 total_soft_steal 240000 total_soft_scan 240000 This patch: In the global background reclaim, we do soft reclaim before scanning the per-zone LRU. However, the return value is ignored. We would like to skip shrink_zone() if soft_limit reclaim does enough work. Also, we need to make the memory pressure balanced across per-memcg zones, like the logic vm-core. This patch is the first step where we start with counting the nr_scanned and nr_reclaimed from soft_limit reclaim into the global scan_control. Signed-off-by: Ying Han Cc: KOSAKI Motohiro Cc: Minchan Kim Cc: Rik van Riel Cc: Mel Gorman Cc: KAMEZAWA Hiroyuki Cc: Balbir Singh Acked-by: Daisuke Nishimura Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

memcg: fix mem_cgroup_rotate_reclaimable_page()

2011-04-14T23:06:54Z

commit 3f58a8294333 ("move memcg reclaimable page into tail of inactive list") added inline keyword twice in its prototype. CC arch/x86/kernel/asm-offsets.s In file included from include/linux/swap.h:8, from include/linux/suspend.h:4, from arch/x86/kernel/asm-offsets.c:12: include/linux/memcontrol.h:220: error: duplicate `inline' Signed-off-by: Eric Dumazet Reviewed-by: Minchan Kim Cc: Balbir Singh Cc: KAMEZAWA Hiroyuki Cc: KOSAKI Motohiro Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

memcg: add memcg sanity checks at allocating and freeing pages

2011-03-24T02:46:25Z

Add checks at allocating or freeing a page whether the page is used (iow, charged) from the view point of memcg. This check may be useful in debugging a problem and we did similar checks before the commit 52d4b9ac(memcg: allocate all page_cgroup at boot). This patch adds some overheads at allocating or freeing memory, so it's enabled only when CONFIG_DEBUG_VM is enabled. Signed-off-by: Daisuke Nishimura Signed-off-by: Johannes Weiner Acked-by: KAMEZAWA Hiroyuki Cc: Balbir Singh Cc: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

memcg: move memcg reclaimable page into tail of inactive list

2011-03-23T00:44:03Z

The rotate_reclaimable_page function moves just written out pages, which the VM wanted to reclaim, to the end of the inactive list. That way the VM will find those pages first next time it needs to free memory. This patch applies the rule in memcg. It can help to prevent unnecessary working page eviction of memcg. Signed-off-by: Minchan Kim Acked-by: Balbir Singh Acked-by: KAMEZAWA Hiroyuki Reviewed-by: Rik van Riel Cc: KOSAKI Motohiro Acked-by: Johannes Weiner Cc: Andrea Arcangeli Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds