<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/mm/page_alloc.c, branch v3.6.9</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.6.9</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.6.9'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2012-11-26T20:14:18Z</updated>
<entry>
<title>memcg: fix hotplugged memory zone oops</title>
<updated>2012-11-26T20:14:18Z</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2012-11-16T22:14:54Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=dd51fb8944830c19fe9f946e32c95c569985ac85'/>
<id>urn:sha1:dd51fb8944830c19fe9f946e32c95c569985ac85</id>
<content type='text'>
commit bea8c150a7efbc0f204e709b7274fe273f55e0d3 upstream.

When MEMCG is configured on (even when it's disabled by boot option),
when adding or removing a page to/from its lru list, the zone pointer
used for stats updates is nowadays taken from the struct lruvec.  (On
many configurations, calculating zone from page is slower.)

But we have no code to update all the lruvecs (per zone, per memcg) when
a memory node is hotadded.  Here's an extract from the oops which
results when running numactl to bind a program to a newly onlined node:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
  IP:  __mod_zone_page_state+0x9/0x60
  Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
  Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
  Call Trace:
    __pagevec_lru_add_fn+0xdf/0x140
    pagevec_lru_move_fn+0xb1/0x100
    __pagevec_lru_add+0x1c/0x30
    lru_add_drain_cpu+0xa3/0x130
    lru_add_drain+0x2f/0x40
   ...

The natural solution might be to use a memcg callback whenever memory is
hotadded; but that solution has not been scoped out, and it happens that
we do have an easy location at which to update lruvec-&gt;zone.  The lruvec
pointer is discovered either by mem_cgroup_zone_lruvec() or by
mem_cgroup_page_lruvec(), and both of those do know the right zone.

So check and set lruvec-&gt;zone in those; and remove the inadequate
attempt to set lruvec-&gt;zone from lruvec_init(), which is called before
NODE_DATA(node) has been allocated in such cases.

Ah, there was one exceptionr.  For no particularly good reason,
mem_cgroup_force_empty_list() has its own code for deciding lruvec.
Change it to use the standard mem_cgroup_zone_lruvec() and
mem_cgroup_get_lru_size() too.  In fact it was already safe against such
an oops (the lru lists in danger could only be empty), but we're better
proofed against future changes this way.

I've marked this for stable (3.6) since we introduced the problem in 3.5
(now closed to stable); but I have no idea if this is the only fix
needed to get memory hotadd working with memcg in 3.6, and received no
answer when I enquired twice before.

Reported-by: Tang Chen &lt;tangchen@cn.fujitsu.com&gt;
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Konstantin Khlebnikov &lt;khlebnikov@openvz.org&gt;
Cc: Wen Congyang &lt;wency@cn.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm/page_alloc: fix the page address of higher page's buddy calculation</title>
<updated>2012-09-17T22:00:38Z</updated>
<author>
<name>Li Haifeng</name>
<email>omycle@gmail.com</email>
</author>
<published>2012-09-17T21:09:21Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=0ba8f2d59304dfe69b59c034de723ad80f7ab9ac'/>
<id>urn:sha1:0ba8f2d59304dfe69b59c034de723ad80f7ab9ac</id>
<content type='text'>
The heuristic method for buddy has been introduced since commit
43506fad21ca ("mm/page_alloc.c: simplify calculation of combined index
of adjacent buddy lists").  But the page address of higher page's buddy
was wrongly calculated, which will lead page_is_buddy to fail for ever.
IOW, the heuristic method would be disabled with the wrong page address
of higher page's buddy.

Calculating the page address of higher page's buddy should be based
higher_page with the offset between index of higher page and index of
higher page's buddy.

Signed-off-by: Haifeng Li &lt;omycle@gmail.com&gt;
Signed-off-by: Gavin Shan &lt;shangw@linux.vnet.ibm.com&gt;
Reviewed-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: KyongHo Cho &lt;pullip.cho@samsung.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Cc: Johannes Weiner &lt;jweiner@redhat.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[2.6.38+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: compaction: Abort async compaction if locks are contended or taking too long</title>
<updated>2012-08-21T23:45:03Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-08-21T23:16:17Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c67fe3752abe6ab47639e2f9b836900c3dc3da84'/>
<id>urn:sha1:c67fe3752abe6ab47639e2f9b836900c3dc3da84</id>
<content type='text'>
Jim Schutt reported a problem that pointed at compaction contending
heavily on locks.  The workload is straight-forward and in his own words;

	The systems in question have 24 SAS drives spread across 3 HBAs,
	running 24 Ceph OSD instances, one per drive.  FWIW these servers
	are dual-socket Intel 5675 Xeons w/48 GB memory.  I've got ~160
	Ceph Linux clients doing dd simultaneously to a Ceph file system
	backed by 12 of these servers.

Early in the test everything looks fine

  procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
   r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
  31 15          0     287216        576   38606628    0    0     2  1158    2   14   1  3  95  0  0
  27 15          0     225288        576   38583384    0    0    18 2222016 203357 134876  11 56  17 15  0
  28 17          0     219256        576   38544736    0    0    11 2305932 203141 146296  11 49  23 17  0
   6 18          0     215596        576   38552872    0    0     7 2363207 215264 166502  12 45  22 20  0
  22 18          0     226984        576   38596404    0    0     3 2445741 223114 179527  12 43  23 22  0

and then it goes to pot

  procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
   r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
  163  8          0     464308        576   36791368    0    0    11 22210  866  536   3 13  79  4  0
  207 14          0     917752        576   36181928    0    0   712 1345376 134598 47367   7 90   1  2  0
  123 12          0     685516        576   36296148    0    0   429 1386615 158494 60077   8 84   5  3  0
  123 12          0     598572        576   36333728    0    0  1107 1233281 147542 62351   7 84   5  4  0
  622  7          0     660768        576   36118264    0    0   557 1345548 151394 59353   7 85   4  3  0
  223 11          0     283960        576   36463868    0    0    46 1107160 121846 33006   6 93   1  1  0

Note that system CPU usage is very high blocks being written out has
dropped by 42%. He analysed this with perf and found

  perf record -g -a sleep 10
  perf report --sort symbol --call-graph fractal,5
    34.63%  [k] _raw_spin_lock_irqsave
            |
            |--97.30%-- isolate_freepages
            |          compaction_alloc
            |          unmap_and_move
            |          migrate_pages
            |          compact_zone
            |          compact_zone_order
            |          try_to_compact_pages
            |          __alloc_pages_direct_compact
            |          __alloc_pages_slowpath
            |          __alloc_pages_nodemask
            |          alloc_pages_vma
            |          do_huge_pmd_anonymous_page
            |          handle_mm_fault
            |          do_page_fault
            |          page_fault
            |          |
            |          |--87.39%-- skb_copy_datagram_iovec
            |          |          tcp_recvmsg
            |          |          inet_recvmsg
            |          |          sock_recvmsg
            |          |          sys_recvfrom
            |          |          system_call
            |          |          __recv
            |          |          |
            |          |           --100.00%-- (nil)
            |          |
            |           --12.61%-- memcpy
             --2.70%-- [...]

There was other data but primarily it is all showing that compaction is
contended heavily on the zone-&gt;lock and zone-&gt;lru_lock.

commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
while isolating pages for migration] noted that it was possible for
migration to hold the lru_lock for an excessive amount of time. Very
broadly speaking this patch expands the concept.

This patch introduces compact_checklock_irqsave() to check if a lock
is contended or the process needs to be scheduled. If either condition
is true then async compaction is aborted and the caller is informed.
The page allocator will fail a THP allocation if compaction failed due
to contention. This patch also introduces compact_trylock_irqsave()
which will acquire the lock only if it is not contended and the process
does not need to schedule.

Reported-by: Jim Schutt &lt;jaschut@sandia.gov&gt;
Tested-by: Jim Schutt &lt;jaschut@sandia.gov&gt;
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: correct page-&gt;pfmemalloc to fix deactivate_slab regression</title>
<updated>2012-08-21T23:45:03Z</updated>
<author>
<name>Alex Shi</name>
<email>alex.shi@intel.com</email>
</author>
<published>2012-08-21T23:16:08Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b121186ab1b12e2a96a945d88eae0735b4542158'/>
<id>urn:sha1:b121186ab1b12e2a96a945d88eae0735b4542158</id>
<content type='text'>
Commit cfd19c5a9ecf ("mm: only set page-&gt;pfmemalloc when
ALLOC_NO_WATERMARKS was used") tried to narrow down page-&gt;pfmemalloc
setting, but it missed some places the pfmemalloc should be set.

So, in __slab_alloc, the unalignment pfmemalloc and ALLOC_NO_WATERMARKS
cause incorrect deactivate_slab() on our core2 server:

    64.73%           fio  [kernel.kallsyms]     [k] _raw_spin_lock
                     |
                     --- _raw_spin_lock
                        |
                        |---0.34%-- deactivate_slab
                        |          __slab_alloc
                        |          kmem_cache_alloc
                        |          |

That causes our fio sync write performance to have a 40% regression.

Move the checking in get_page_from_freelist() which resolves this issue.

Signed-off-by: Alex Shi &lt;alex.shi@intel.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: David Miller &lt;davem@davemloft.net
Cc: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Tested-by: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Tested-by: Sage Weil &lt;sage@inktank.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: remove node_start_pfn checking in new WARN_ON for now</title>
<updated>2012-08-02T17:37:03Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2012-08-02T17:37:03Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=8783b6e2b2cb726f2734cf208d101f73ac1ba616'/>
<id>urn:sha1:8783b6e2b2cb726f2734cf208d101f73ac1ba616</id>
<content type='text'>
Borislav Petkov reports that the new warning added in commit
88fdf75d1bb5 ("mm: warn if pg_data_t isn't initialized with zero")
triggers for him, and it is the node_start_pfn field that has already
been initialized once.

The call trace looks like this:

  x86_64_start_kernel -&gt;
    x86_64_start_reservations -&gt;
    start_kernel -&gt;
    setup_arch -&gt;
    paging_init -&gt;
    zone_sizes_init -&gt;
    free_area_init_nodes -&gt;
    free_area_init_node

and (with the warning replaced by debug output), Borislav sees

  On node 0 totalpages: 4193848
    DMA zone: 64 pages used for memmap
    DMA zone: 6 pages reserved
    DMA zone: 3890 pages, LIFO batch:0
    DMA32 zone: 16320 pages used for memmap
    DMA32 zone: 798464 pages, LIFO batch:31
    Normal zone: 52736 pages used for memmap
    Normal zone: 3322368 pages, LIFO batch:31
  free_area_init_node: pgdat-&gt;node_start_pfn: 4423680      &lt;----
  On node 1 totalpages: 4194304
    Normal zone: 65536 pages used for memmap
    Normal zone: 4128768 pages, LIFO batch:31
  free_area_init_node: pgdat-&gt;node_start_pfn: 8617984      &lt;----
  On node 2 totalpages: 4194304
    Normal zone: 65536 pages used for memmap
    Normal zone: 4128768 pages, LIFO batch:31
  free_area_init_node: pgdat-&gt;node_start_pfn: 12812288     &lt;----
  On node 3 totalpages: 4194304
    Normal zone: 65536 pages used for memmap
    Normal zone: 4128768 pages, LIFO batch:31

so remove the bogus warning for now to avoid annoying people.  Minchan
Kim is looking at it.

Reported-by: Borislav Petkov &lt;bp@amd64.org&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: remove redundant initialization</title>
<updated>2012-08-01T01:42:50Z</updated>
<author>
<name>Minchan Kim</name>
<email>minchan@kernel.org</email>
</author>
<published>2012-07-31T23:46:16Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6527af5d1bea219d64095a5e30c1b1e0868aae16'/>
<id>urn:sha1:6527af5d1bea219d64095a5e30c1b1e0868aae16</id>
<content type='text'>
pg_data_t is zeroed before reaching free_area_init_core(), so remove the
now unnecessary initializations.

Signed-off-by: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Ralf Baechle &lt;ralf@linux-mips.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: warn if pg_data_t isn't initialized with zero</title>
<updated>2012-08-01T01:42:50Z</updated>
<author>
<name>Minchan Kim</name>
<email>minchan@kernel.org</email>
</author>
<published>2012-07-31T23:46:14Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=88fdf75d1bb51d85ba00c466391770056d44bc03'/>
<id>urn:sha1:88fdf75d1bb51d85ba00c466391770056d44bc03</id>
<content type='text'>
Warn if memory-hotplug/boot code doesn't initialize pg_data_t with zero
when it is allocated.  Arch code and memory hotplug already initiailize
pg_data_t.  So this warning should never happen.  I select fields randomly
near the beginning, middle and end of pg_data_t for checking.

This patch isn't for performance but for removing initialization code
which is necessary to add whenever we adds new field to pg_data_t or zone.

Firstly, Andrew suggested clearing out of pg_data_t in MM core part but
Tejun doesn't like it because in the future, some archs can initialize
some fields in arch code and pass them into general MM part so blindly
clearing it out in mm core part would be very annoying.

Signed-off-by: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Ralf Baechle &lt;ralf@linux-mips.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage</title>
<updated>2012-08-01T01:42:46Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-07-31T23:44:35Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5515061d22f0f9976ae7815864bfd22042d36848'/>
<id>urn:sha1:5515061d22f0f9976ae7815864bfd22042d36848</id>
<content type='text'>
If swap is backed by network storage such as NBD, there is a risk that a
large number of reclaimers can hang the system by consuming all
PF_MEMALLOC reserves.  To avoid these hangs, the administrator must tune
min_free_kbytes in advance which is a bit fragile.

This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
are in use.  If the system is routinely getting throttled the system
administrator can increase min_free_kbytes so degradation is smoother but
the system will keep running.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: David Miller &lt;davem@davemloft.net&gt;
Cc: Neil Brown &lt;neilb@suse.de&gt;
Cc: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Mike Christie &lt;michaelc@cs.wisc.edu&gt;
Cc: Eric B Munson &lt;emunson@mgebm.net&gt;
Cc: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Cc: Sebastian Andrzej Siewior &lt;sebastian@breakpoint.cc&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: ignore mempolicies when using ALLOC_NO_WATERMARK</title>
<updated>2012-08-01T01:42:45Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-07-31T23:44:12Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=183f6371aac2a5496a8ef2b0b0a68562652c3cdb'/>
<id>urn:sha1:183f6371aac2a5496a8ef2b0b0a68562652c3cdb</id>
<content type='text'>
The reserve is proportionally distributed over all !highmem zones in the
system.  So we need to allow an emergency allocation access to all zones.
In order to do that we need to break out of any mempolicy boundaries we
might have.

In my opinion that does not break mempolicies as those are user oriented
and not system oriented.  That is, system allocations are not guaranteed
to be within mempolicy boundaries.  For instance IRQs do not even have a
mempolicy.

So breaking out of mempolicy boundaries for 'rare' emergency allocations,
which are always system allocations (as opposed to user) is ok.

Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: David Miller &lt;davem@davemloft.net&gt;
Cc: Neil Brown &lt;neilb@suse.de&gt;
Cc: Mike Christie &lt;michaelc@cs.wisc.edu&gt;
Cc: Eric B Munson &lt;emunson@mgebm.net&gt;
Cc: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Cc: Sebastian Andrzej Siewior &lt;sebastian@breakpoint.cc&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: only set page-&gt;pfmemalloc when ALLOC_NO_WATERMARKS was used</title>
<updated>2012-08-01T01:42:45Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-07-31T23:44:10Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=cfd19c5a9ecf8e5e38de2603077c4330af21316e'/>
<id>urn:sha1:cfd19c5a9ecf8e5e38de2603077c4330af21316e</id>
<content type='text'>
__alloc_pages_slowpath() is called when the number of free pages is below
the low watermark.  If the caller is entitled to use ALLOC_NO_WATERMARKS
then the page will be marked page-&gt;pfmemalloc.  This protects more pages
than are strictly necessary as we only need to protect pages allocated
below the min watermark (the pfmemalloc reserves).

This patch only sets page-&gt;pfmemalloc when ALLOC_NO_WATERMARKS was
required to allocate the page.

[rientjes@google.com: David noticed the problem during review]
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: David Miller &lt;davem@davemloft.net&gt;
Cc: Neil Brown &lt;neilb@suse.de&gt;
Cc: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Mike Christie &lt;michaelc@cs.wisc.edu&gt;
Cc: Eric B Munson &lt;emunson@mgebm.net&gt;
Cc: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Cc: Sebastian Andrzej Siewior &lt;sebastian@breakpoint.cc&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
