user/sven/linux.git/include/linux/mmzone.h, branch v3.12.13

mm: vmscan: fix do_try_to_free_pages() livelock

2013-09-11T22:58:01Z

This patch is based on KOSAKI's work and I add a little more description, please refer https://lkml.org/lkml/2012/6/14/74. Currently, I found system can enter a state that there are lots of free pages in a zone but only order-0 and order-1 pages which means the zone is heavily fragmented, then high order allocation could make direct reclaim path's long stall(ex, 60 seconds) especially in no swap and no compaciton enviroment. This problem happened on v3.4, but it seems issue still lives in current tree, the reason is do_try_to_free_pages enter live lock: kswapd will go to sleep if the zones have been fully scanned and are still not balanced. As kswapd thinks there's little point trying all over again to avoid infinite loop. Instead it changes order from high-order to 0-order because kswapd think order-0 is the most important. Look at 73ce02e9 in detail. If watermarks are ok, kswapd will go back to sleep and may leave zone->all_unreclaimable =3D 0. It assume high-order users can still perform direct reclaim if they wish. Direct reclaim continue to reclaim for a high order which is not a COSTLY_ORDER without oom-killer until kswapd turn on zone->all_unreclaimble= . This is because to avoid too early oom-kill. So it means direct_reclaim depends on kswapd to break this loop. In worst case, direct-reclaim may continue to page reclaim forever when kswapd sleeps forever until someone like watchdog detect and finally kill the process. As described in: http://thread.gmane.org/gmane.linux.kernel.mm/103737 We can't turn on zone->all_unreclaimable from direct reclaim path because direct reclaim path don't take any lock and this way is racy. Thus this patch removes zone->all_unreclaimable field completely and recalculates zone reclaimable state every time. Note: we can't take the idea that direct-reclaim see zone->pages_scanned directly and kswapd continue to use zone->all_unreclaimable. Because, it is racy. commit 929bea7c71 (vmscan: all_unreclaimable() use zone->all_unreclaimable as a name) describes the detail. [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()] Cc: Aaditya Kumar Cc: Ying Han Cc: Nick Piggin Acked-by: Rik van Riel Cc: Mel Gorman Cc: KAMEZAWA Hiroyuki Cc: Christoph Lameter Cc: Bob Liu Cc: Neil Zhang Cc: Russell King - ARM Linux Reviewed-by: Michal Hocko Acked-by: Minchan Kim Acked-by: Johannes Weiner Signed-off-by: KOSAKI Motohiro Signed-off-by: Lisa Du Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: page_alloc: fair zone allocator policy

2013-09-11T22:57:23Z

Each zone that holds userspace pages of one workload must be aged at a speed proportional to the zone size. Otherwise, the time an individual page gets to stay in memory depends on the zone it happened to be allocated in. Asymmetry in the zone aging creates rather unpredictable aging behavior and results in the wrong pages being reclaimed, activated etc. But exactly this happens right now because of the way the page allocator and kswapd interact. The page allocator uses per-node lists of all zones in the system, ordered by preference, when allocating a new page. When the first iteration does not yield any results, kswapd is woken up and the allocator retries. Due to the way kswapd reclaims zones below the high watermark while a zone can be allocated from when it is above the low watermark, the allocator may keep kswapd running while kswapd reclaim ensures that the page allocator can keep allocating from the first zone in the zonelist for extended periods of time. Meanwhile the other zones rarely see new allocations and thus get aged much slower in comparison. The result is that the occasional page placed in lower zones gets relatively more time in memory, even gets promoted to the active list after its peers have long been evicted. Meanwhile, the bulk of the working set may be thrashing on the preferred zone even though there may be significant amounts of memory available in the lower zones. Even the most basic test -- repeatedly reading a file slightly bigger than memory -- shows how broken the zone aging is. In this scenario, no single page should be able stay in memory long enough to get referenced twice and activated, but activation happens in spades: $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 0 nr_inactive_file 0 nr_active_file 8 nr_inactive_file 1582 nr_active_file 11994 $ cat data data data data >/dev/null $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 70 nr_inactive_file 258753 nr_active_file 443214 nr_inactive_file 149793 nr_active_file 12021 Fix this with a very simple round robin allocator. Each zone is allowed a batch of allocations that is proportional to the zone's size, after which it is treated as full. The batch counters are reset when all zones have been tried and the allocator enters the slowpath and kicks off kswapd reclaim. Allocation and reclaim is now fairly spread out to all available/allowable zones: $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 0 nr_inactive_file 174 nr_active_file 4865 nr_inactive_file 53 nr_active_file 860 $ cat data data data data >/dev/null $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 0 nr_inactive_file 666622 nr_active_file 4988 nr_inactive_file 190969 nr_active_file 937 When zone_reclaim_mode is enabled, allocations will now spread out to all zones on the local node, not just the first preferred zone (which on a 4G node might be a tiny Normal zone). Signed-off-by: Johannes Weiner Acked-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Paul Bolle Cc: Zlatko Calusic Tested-by: Kevin Hilman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: remove unused functions is_{normal_idx, normal, dma32, dma}

2013-07-09T17:33:22Z

These functions are nowhere used, so remove them. Signed-off-by: Zhang Yanfei Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

sparsemem: add BUILD_BUG_ON when sizeof mem_section is non-power-of-2

2013-07-03T23:07:40Z

Instead of leaving a hidden trap for the next person who comes along and wants to add something to mem_section, add a big fat warning about it needing to be a power-of-2, and insert a BUILD_BUG_ON() in sparse_init() to catch mistakes. Right now non-power-of-2 mem_sections cause a number of WARNs at boot (which don't clearly point to the size of mem_section as an issue), but the system limps on (temporarily, at least). This is based upon Dave Hansen's earlier RFC where he ran into the same issue: "sparsemem: fix boot when SECTIONS_PER_ROOT is not power-of-2" http://lkml.indiana.edu/hypermail/linux/kernel/1205.2/03077.html Signed-off-by: Cody P Schafer Acked-by: Dave Hansen Cc: Jiang Liu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: use a dedicated lock to protect totalram_pages and zone->managed_pages

2013-07-03T23:07:33Z

Currently lock_memory_hotplug()/unlock_memory_hotplug() are used to protect totalram_pages and zone->managed_pages. Other than the memory hotplug driver, totalram_pages and zone->managed_pages may also be modified at runtime by other drivers, such as Xen balloon, virtio_balloon etc. For those cases, memory hotplug lock is a little too heavy, so introduce a dedicated lock to protect totalram_pages and zone->managed_pages. Now we have a simplified locking rules totalram_pages and zone->managed_pages as: 1) no locking for read accesses because they are unsigned long. 2) no locking for write accesses at boot time in single-threaded context. 3) serialize write accesses at runtime by acquiring the dedicated managed_page_count_lock. Also adjust zone->managed_pages when freeing reserved pages into the buddy system, to keep totalram_pages and zone->managed_pages in consistence. [akpm@linux-foundation.org: don't export adjust_managed_page_count to modules (for now)] Signed-off-by: Jiang Liu Cc: Mel Gorman Cc: Michel Lespinasse Cc: Rik van Riel Cc: Minchan Kim Cc: "H. Peter Anvin" Cc: "Michael S. Tsirkin" Cc: Cc: Arnd Bergmann Cc: Catalin Marinas Cc: Chris Metcalf Cc: David Howells Cc: Geert Uytterhoeven Cc: Ingo Molnar Cc: Jeremy Fitzhardinge Cc: Jianguo Wu Cc: Joonsoo Kim Cc: Kamezawa Hiroyuki Cc: Konrad Rzeszutek Wilk Cc: Marek Szyprowski Cc: Rusty Russell Cc: Tang Chen Cc: Tejun Heo Cc: Thomas Gleixner Cc: Wen Congyang Cc: Will Deacon Cc: Yasuaki Ishimatsu Cc: Yinghai Lu Cc: Russell King Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mmzone: note that node_size_lock should be manipulated via pgdat_resize_lock()

2013-07-03T23:07:29Z

Signed-off-by: Cody P Schafer Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: fix comment referring to non-existent size_seqlock, change to span_seqlock

2013-07-03T23:07:29Z

Signed-off-by: Cody P Schafer Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: vmscan: block kswapd if it is encountering pages under writeback

2013-07-03T23:07:28Z

Historically, kswapd used to congestion_wait() at higher priorities if it was not making forward progress. This made no sense as the failure to make progress could be completely independent of IO. It was later replaced by wait_iff_congested() and removed entirely by commit 258401a6 (mm: don't wait on congested zones in balance_pgdat()) as it was duplicating logic in shrink_inactive_list(). This is problematic. If kswapd encounters many pages under writeback and it continues to scan until it reaches the high watermark then it will quickly skip over the pages under writeback and reclaim clean young pages or push applications out to swap. The use of wait_iff_congested() is not suited to kswapd as it will only stall if the underlying BDI is really congested or a direct reclaimer was unable to write to the underlying BDI. kswapd bypasses the BDI congestion as it sets PF_SWAPWRITE but even if this was taken into account then it would cause direct reclaimers to stall on writeback which is not desirable. This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is encountering too many pages under writeback. If this flag is set and kswapd encounters a PageReclaim page under writeback then it'll assume that the LRU lists are being recycled too quickly before IO can complete and block waiting for some IO to complete. Signed-off-by: Mel Gorman Reviewed-by: Michal Hocko Acked-by: Rik van Riel Cc: Johannes Weiner Cc: KAMEZAWA Hiroyuki Cc: Jiri Slaby Cc: Valdis Kletnieks Tested-by: Zlatko Calusic Cc: dormando Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: vmscan: have kswapd writeback pages based on dirty pages encountered, not priority

2013-07-03T23:07:28Z

Currently kswapd queues dirty pages for writeback if scanning at an elevated priority but the priority kswapd scans at is not related to the number of unqueued dirty encountered. Since commit "mm: vmscan: Flatten kswapd priority loop", the priority is related to the size of the LRU and the zone watermark which is no indication as to whether kswapd should write pages or not. This patch tracks if an excessive number of unqueued dirty pages are being encountered at the end of the LRU. If so, it indicates that dirty pages are being recycled before flusher threads can clean them and flags the zone so that kswapd will start writing pages until the zone is balanced. Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Reviewed-by: Michal Hocko Cc: KAMEZAWA Hiroyuki Cc: Rik van Riel Cc: Jiri Slaby Cc: Valdis Kletnieks Tested-by: Zlatko Calusic Cc: dormando Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial

2013-04-30T16:36:50Z

Pull trivial tree updates from Jiri Kosina: "Usual stuff, mostly comment fixes, typo fixes, printk fixes and small code cleanups" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (45 commits) mm: Convert print_symbol to %pSR gfs2: Convert print_symbol to %pSR m32r: Convert print_symbol to %pSR iostats.txt: add easy-to-find description for field 6 x86 cmpxchg.h: fix wrong comment treewide: Fix typo in printk and comments doc: devicetree: Fix various typos docbook: fix 8250 naming in device-drivers pata_pdc2027x: Fix compiler warning treewide: Fix typo in printks mei: Fix comments in drivers/misc/mei treewide: Fix typos in kernel messages pm44xx: Fix comment for "CONFIG_CPU_IDLE" doc: Fix typo "CONFIG_CGROUP_CGROUP_MEMCG_SWAP" mmzone: correct "pags" to "pages" in comment. kernel-parameters: remove outdated 'noresidual' parameter Remove spurious _H suffixes from ifdef comments sound: Remove stray pluses from Kconfig file radio-shark: Fix printk "CONFIG_LED_CLASS" doc: put proper reference to CONFIG_MODULE_SIG_ENFORCE ...