<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/mm/memory.c, branch v3.0.9</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.0.9</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.0.9'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2011-11-11T17:36:29Z</updated>
<entry>
<title>mm: thp: tail page refcounting fix</title>
<updated>2011-11-11T17:36:29Z</updated>
<author>
<name>Andrea Arcangeli</name>
<email>aarcange@redhat.com</email>
</author>
<published>2011-11-02T20:36:59Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=68fe9d9c796303de600dbc622086768ca4d8408b'/>
<id>urn:sha1:68fe9d9c796303de600dbc622086768ca4d8408b</id>
<content type='text'>
commit 70b50f94f1644e2aa7cb374819cfd93f3c28d725 upstream.

Michel, while working on the working set estimation code, noticed that
calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
wasn't safe, if the pfn ended up being a tail page of a transparent
hugepage under splitting by __split_huge_page_refcount().

He then found the problem could also theoretically materialize with
page_cache_get_speculative() during the speculative radix tree lookups
that use get_page_unless_zero() in SMP if the radix tree page is freed
and reallocated and get_user_pages is called on it before
page_cache_get_speculative has a chance to call get_page_unless_zero().

So the best way to fix the problem is to keep page_tail-&gt;_count zero at
all times.  This will guarantee that get_page_unless_zero() can never
succeed on any tail page.  page_tail-&gt;_mapcount is guaranteed zero and
is unused for all tail pages of a compound page, so we can simply
account the tail page references there and transfer them to
page_tail-&gt;_count in __split_huge_page_refcount() (in addition to the
head_page-&gt;_mapcount).
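
For illustration only (not part of the patch): a minimal user-space
sketch of the idea above, using C11 atomics.  The toy_* names are
hypothetical, the counters are plain zero-based integers, and the split
step only moves the tail pins; the kernel's real struct page layout and
the extra head-page accounting are not modelled.

<![CDATA[
```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for struct page; only the two counters matter here. */
struct toy_page {
	atomic_int count;	/* models page->_count */
	atomic_int mapcount;	/* models page->_mapcount, reused for tail pins */
};

/* Take a reference only if the count is not already zero. */
static bool toy_get_page_unless_zero(struct toy_page *page)
{
	int old = atomic_load(&page->count);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&page->count, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Pin a tail page: the pin is recorded on the head's count and on the
 * tail's mapcount, so the tail's count stays zero.
 */
static void toy_get_tail_page(struct toy_page *head, struct toy_page *tail)
{
	atomic_fetch_add(&head->count, 1);
	atomic_fetch_add(&tail->mapcount, 1);
}

/* At split time, move the pins collected in mapcount over to count. */
static void toy_split_transfer(struct toy_page *tail)
{
	int pins = atomic_exchange(&tail->mapcount, 0);

	atomic_fetch_add(&tail->count, pins);
}
```
]]>

In this model a speculative toy_get_page_unless_zero() on a tail page
keeps failing until toy_split_transfer() has run, however many pins the
tail page holds.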

While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages.  That wasn't
entirely safe because the two atomic_inc in get_page weren't atomic as a
pair.  Other get_user_pages users, such as the secondary-MMU page fault
that establishes the shadow pagetables, never call any superfluous get_page
after get_user_pages returns.  It's safer to make get_page universally safe
for tail pages and to use get_page_foll() within follow_page (inside
get_user_pages()).  get_page_foll() is safe to do the refcounting for tail
pages without taking any locks because it is run within PT lock protected
critical sections (PT lock for pte and page_table_lock for
pmd_trans_huge).

The standard get_page() as invoked by direct-io instead will now take
the compound_lock but still only for tail pages.  The direct-io paths
are usually I/O bound and the compound_lock is per THP so very
fine-grained, so there's no risk of scalability issues with it.  A simple
direct-io benchmark with the lockdep prove-locking and spinlock debugging
infrastructure all enabled shows identical performance and no
overhead.  So it's worth it.  Ideally direct-io should stop calling
get_page() on pages returned by get_user_pages().  The spinlock in
get_page() is already optimized away for no-THP builds but doing
get_page() on tail pages returned by GUP is generally a rare operation
and usually only run in I/O paths.

This new refcounting on page_tail-&gt;_mapcount, in addition to avoiding new
RCU critical sections, will also allow the working set estimation code to
work without any further complexity associated with the tail page
refcounting with THP.

Signed-off-by: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Reported-by: Michel Lespinasse &lt;walken@google.com&gt;
Reviewed-by: Michel Lespinasse &lt;walken@google.com&gt;
Reviewed-by: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Cc: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;jweiner@redhat.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>mm/futex: fix futex writes on archs with SW tracking of dirty &amp; young</title>
<updated>2011-08-05T04:58:38Z</updated>
<author>
<name>Benjamin Herrenschmidt</name>
<email>benh@kernel.crashing.org</email>
</author>
<published>2011-07-26T00:12:32Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b045b9a265fb46d8197b7d78aff1a8f6ab8e23df'/>
<id>urn:sha1:b045b9a265fb46d8197b7d78aff1a8f6ab8e23df</id>
<content type='text'>
commit 2efaca927f5cd7ecd0f1554b8f9b6a9a2c329c03 upstream.

I haven't reproduced it myself but the fail scenario is that on such
machines (notably ARM and some embedded powerpc), if you manage to hit
that futex path on a writable page whose dirty bit has gone from the PTE,
you'll livelock inside the kernel from what I can tell.

It will go in a loop of trying the atomic access, failing, trying gup to
"fix it up", getting success from gup, going back to the atomic access,
failing again because dirty wasn't fixed etc...

So I think you essentially hang in the kernel.

The scenario is probably rare'ish because affected architectures are
embedded and tend to not swap much (if at all) so we probably rarely hit
the case where dirty is missing or young is missing, but I think Shan has
a piece of SW that can reliably reproduce it using a shared writable
mapping &amp; fork or something like that.

On archs that use SW tracking of dirty &amp; young, a page without dirty is
effectively mapped read-only and a page without young is inaccessible in
the PTE.

Additionally, some architectures might lazily flush the TLB when relaxing
write protection (by doing only a local flush), and expect a fault to
invalidate the stale entry if it's still present on another processor.

The futex code assumes that if the "in_atomic()" access -EFAULT's, it can
"fix it up" by calling get_user_pages() which would then be equivalent to
taking the fault.

However that isn't the case.  get_user_pages() will not call
handle_mm_fault() in the case where the PTE seems to have the right
permissions, regardless of the dirty and young state.  It will eventually
update those bits ...  in the struct page, but not in the PTE.

Additionally, it will not handle the lazy TLB flushing that can be
required by some architectures in the fault case.

Basically, gup is the wrong interface for the job.  The patch provides a
more appropriate one which boils down to just calling handle_mm_fault()
since what we are trying to do is simulate a real page fault.

The futex code currently attempts to write to user memory within a
pagefault disabled section, and if that fails, tries to fix it up using
get_user_pages().

This doesn't work on archs where the dirty and young bits are maintained
by software, since they will gate access permission in the TLB, and will
not be updated by gup().

In addition, there's an expectation on some archs that a spurious write
fault triggers a local TLB flush, and that is missing from the picture as
well.

I decided that adding those "features" to gup() would be too much for this
already too complex function, and instead added a new simpler
fixup_user_fault() which is essentially a wrapper around handle_mm_fault()
which the futex code can call.
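
For illustration only (not part of the patch): a minimal user-space
sketch of why the retry loop never makes progress with a gup-style
fixup, while a fault-style fixup does.  All toy_* names are hypothetical
and the PTE is reduced to three booleans; the retry cap stands in for
the livelock.

<![CDATA[
```c
#include <stdbool.h>

/* Toy PTE on an arch with software-managed dirty and young bits. */
struct toy_pte {
	bool writable;
	bool dirty;
	bool young;
};

/* A store only goes through when the software bits allow it. */
static bool toy_atomic_write(const struct toy_pte *pte)
{
	return pte->writable && pte->young && pte->dirty;
}

/* gup-style fixup: permissions look right, so report success untouched. */
static bool toy_fixup_gup(struct toy_pte *pte)
{
	return pte->writable;
}

/* Fault-style fixup: behave like a write fault and update the PTE bits. */
static bool toy_fixup_fault(struct toy_pte *pte)
{
	if (!pte->writable)
		return false;
	pte->young = true;
	pte->dirty = true;
	return true;
}

/*
 * Try the write, fix up on failure, retry.  Returns the number of
 * attempts on success, -1 if the fixup itself fails, and 0 if the write
 * still fails after max_tries attempts.
 */
static int toy_futex_write(struct toy_pte *pte,
			   bool (*fixup)(struct toy_pte *), int max_tries)
{
	for (int tries = 1; tries <= max_tries; tries++) {
		if (toy_atomic_write(pte))
			return tries;
		if (!fixup(pte))
			return -1;
	}
	return 0;
}
```
]]>

With a writable, young but clean toy PTE, the gup-style fixup exhausts
every retry, whereas the fault-style fixup lets the second attempt
succeed.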

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout]
Signed-off-by: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Reported-by: Shan Hai &lt;haishan.bai@gmail.com&gt;
Tested-by: Shan Hai &lt;haishan.bai@gmail.com&gt;
Cc: David Laight &lt;David.Laight@ACULAB.COM&gt;
Acked-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Darren Hart &lt;darren.hart@intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>mm: __tlb_remove_page() check the correct batch</title>
<updated>2011-07-09T04:14:43Z</updated>
<author>
<name>Shaohua Li</name>
<email>shaohua.li@intel.com</email>
</author>
<published>2011-07-08T22:39:41Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=0b43c3aab0137595335b08b340a3f3e5af9818a6'/>
<id>urn:sha1:0b43c3aab0137595335b08b340a3f3e5af9818a6</id>
<content type='text'>
__tlb_remove_page() switches to a new batch page, but still checks space
in the old batch.  This check always fails, and causes a forced tlb flush.
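
For illustration only (not part of the patch): a minimal user-space
sketch of the stale-pointer pattern described above.  The toy_* types
and functions are hypothetical and keep only the counters; a return
value of 0 stands for "no space left, caller must flush".

<![CDATA[
```c
#include <stddef.h>

/* Toy batch: only the fill level and capacity are kept. */
struct toy_batch {
	struct toy_batch *next;
	unsigned int nr;
	unsigned int max;
};

struct toy_gather {
	struct toy_batch *active;
};

/* Buggy variant: after switching batches, 'batch' still names the old one. */
static unsigned int toy_remove_page_buggy(struct toy_gather *tlb)
{
	struct toy_batch *batch = tlb->active;

	batch->nr++;
	if (batch->nr == batch->max) {
		if (batch->next == NULL)
			return 0;
		tlb->active = batch->next;
	}
	/* Still computed on the full old batch, so this is 0 after a switch. */
	return batch->max - batch->nr;
}

/* Fixed variant: re-read the active batch before checking for space. */
static unsigned int toy_remove_page_fixed(struct toy_gather *tlb)
{
	struct toy_batch *batch = tlb->active;

	batch->nr++;
	if (batch->nr == batch->max) {
		if (batch->next == NULL)
			return 0;
		tlb->active = batch->next;
		batch = tlb->active;
	}
	return batch->max - batch->nr;
}
```
]]>

With two chained batches of capacity 2, the buggy variant reports no
space on the second call even though the new batch is empty; the fixed
variant reports 2 free slots.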

Signed-off-by: Shaohua Li &lt;shaohua.li@intel.com&gt;
Acked-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: move vmtruncate_range to truncate.c</title>
<updated>2011-06-28T01:00:12Z</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2011-06-27T23:18:01Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5b8ba10198a109f8a02380648c5d29000caa9c55'/>
<id>urn:sha1:5b8ba10198a109f8a02380648c5d29000caa9c55</id>
<content type='text'>
You would expect to find vmtruncate_range() next to vmtruncate() in
mm/truncate.c: move it there.

Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Acked-by: Christoph Hellwig &lt;hch@infradead.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: fix wrong kunmap_atomic() pointer</title>
<updated>2011-06-16T03:04:00Z</updated>
<author>
<name>Steven Rostedt</name>
<email>rostedt@goodmis.org</email>
</author>
<published>2011-06-15T22:08:23Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=5f1a19070b16c20cdc71ed0e981bfa19f8f6a4ee'/>
<id>urn:sha1:5f1a19070b16c20cdc71ed0e981bfa19f8f6a4ee</id>
<content type='text'>
Running a ktest.pl test, I hit the following bug on x86_32:

  ------------[ cut here ]------------
  WARNING: at arch/x86/mm/highmem_32.c:81 __kunmap_atomic+0x64/0xc1()
   Hardware name:
  Modules linked in:
  Pid: 93, comm: sh Not tainted 2.6.39-test+ #1
  Call Trace:
   [&lt;c04450da&gt;] warn_slowpath_common+0x7c/0x91
   [&lt;c042f5df&gt;] ? __kunmap_atomic+0x64/0xc1
   [&lt;c042f5df&gt;] ? __kunmap_atomic+0x64/0xc1
   [&lt;c0445111&gt;] warn_slowpath_null+0x22/0x24
   [&lt;c042f5df&gt;] __kunmap_atomic+0x64/0xc1
   [&lt;c04d4a22&gt;] unmap_vmas+0x43a/0x4e0
   [&lt;c04d9065&gt;] exit_mmap+0x91/0xd2
   [&lt;c0443057&gt;] mmput+0x43/0xad
   [&lt;c0448358&gt;] exit_mm+0x111/0x119
   [&lt;c044855f&gt;] do_exit+0x1ff/0x5fa
   [&lt;c0454ea2&gt;] ? set_current_blocked+0x3c/0x40
   [&lt;c0454f24&gt;] ? sigprocmask+0x7e/0x8e
   [&lt;c0448b55&gt;] do_group_exit+0x65/0x88
   [&lt;c0448b90&gt;] sys_exit_group+0x18/0x1c
   [&lt;c0c3915f&gt;] sysenter_do_call+0x12/0x38
  ---[ end trace 8055f74ea3c0eb62 ]---

Running a ktest.pl git bisect found the culprit: commit e303297e6c3a
("mm: extended batches for generic mmu_gather")

But although this was the commit triggering the bug, it was not the one
originally responsible for the bug.  That was commit d16dfc550f53 ("mm:
mmu_gather rework").

The code in zap_pte_range() has something that looks like the following:

	pte = pte_offset_map_lock(mm, pmd, addr, &amp;ptl);
	do {
		[...]
	} while (pte++, addr += PAGE_SIZE, addr != end);
	pte_unmap_unlock(pte - 1, ptl);

The pte starts off pointing at the first element in the page table
directory that was returned by the pte_offset_map_lock().  When it's done
with the page, pte will be pointing to anything between the next entry and
the first entry of the next page inclusive.  By doing a pte - 1, this puts
the pte back onto the original page, which is all that pte_unmap_unlock()
needs.

In most archs (64 bit), this is not an issue as the pte is ignored in the
pte_unmap_unlock().  But on 32 bit archs, where things may be kmapped, it
is essential that the pte passed to pte_unmap_unlock() resides on the same
page that was given by pte_offset_map_lock().

The problem came in d16dfc55 ("mm: mmu_gather rework"), which introduced
a "break;" from the while loop.  This alone did not seem to easily trigger
the bug.  But the modifications made by e303297e6 caused that "break;" to
be hit on the first iteration, before the pte++.

Because the pte is not incremented, pte_unmap_unlock(pte - 1) now points
to the previous page.  This causes the wrong page to be unmapped, and also
triggers the warning above.

The simple solution is to just save the pointer given by
pte_offset_map_lock() and use it in the unlock.
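
For illustration only (not part of the patch): a minimal user-space
sketch of the pointer arithmetic involved.  The toy_* names, the
4-entry "page" and the int entries are hypothetical stand-ins; the
functions return the pointer that would be handed to the unlock.

<![CDATA[
```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_PTRS_PER_PAGE 4

/* Index of the toy page that holds entry p, counted from base. */
static size_t toy_page_of(const int *base, const int *p)
{
	return (size_t)(p - base) / TOY_PTRS_PER_PAGE;
}

/* Old scheme: walk, then hand "pte - 1" to the unlock. */
static const int *toy_walk_old(const int *start, size_t n, bool break_early)
{
	const int *pte = start;
	size_t left = n;

	do {
		if (break_early)
			break;	/* leaves the loop before pte++ runs */
	} while (pte++, --left != 0);

	return pte - 1;
}

/* Fixed scheme: remember the mapped pointer and hand that to the unlock. */
static const int *toy_walk_fixed(const int *start, size_t n, bool break_early)
{
	const int *start_pte = start;
	const int *pte = start;
	size_t left = n;

	do {
		if (break_early)
			break;
	} while (pte++, --left != 0);

	return start_pte;
}
```
]]>

When the walk starts at the first entry of the second toy page and
breaks on the first iteration, the old scheme yields a pointer into the
first page, while the fixed scheme always stays on the page that was
mapped.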

Signed-off-by: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Acked-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Mel Gorman &lt;mel@csn.ul.ie&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/memory.c: fix kernel-doc notation</title>
<updated>2011-06-16T03:03:59Z</updated>
<author>
<name>Randy Dunlap</name>
<email>randy.dunlap@oracle.com</email>
</author>
<published>2011-06-15T22:08:09Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=0164f69d0cf1a6abbc936851f5b72ece92187cda'/>
<id>urn:sha1:0164f69d0cf1a6abbc936851f5b72ece92187cda</id>
<content type='text'>
Fix new kernel-doc warnings in mm/memory.c:

  Warning(mm/memory.c:1327): No description found for parameter 'tlb'
  Warning(mm/memory.c:1327): Excess function parameter 'tlbp' description in 'unmap_vmas'

Signed-off-by: Randy Dunlap &lt;randy.dunlap@oracle.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>memcg: add the pagefault count into memcg stats</title>
<updated>2011-05-27T00:12:36Z</updated>
<author>
<name>Ying Han</name>
<email>yinghan@google.com</email>
</author>
<published>2011-05-26T23:25:38Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=456f998ec817ebfa254464be4f089542fa390645'/>
<id>urn:sha1:456f998ec817ebfa254464be4f089542fa390645</id>
<content type='text'>
Add two new stats to the per-memcg memory.stat which track the number of
page faults and the number of major page faults.

  "pgfault"
  "pgmajfault"

They are different from the "pgpgin"/"pgpgout" stats, which count the number
of pages charged/discharged to the cgroup and say nothing about reading
or writing pages to disk.

It is valuable to track the two stats for measuring both an application's
performance and the efficiency of the kernel page reclaim path.
Counting pagefaults per process is useful, but we also need the aggregated
value since processes are monitored and controlled on a per-cgroup basis in
memcg.

Functional test: check the total number of pgfault/pgmajfault of all
memcgs and compare with global vmstat value:

  $ cat /proc/vmstat | grep fault
  pgfault 1070751
  pgmajfault 553

  $ cat /dev/cgroup/memory.stat | grep fault
  pgfault 1071138
  pgmajfault 553
  total_pgfault 1071142
  total_pgmajfault 553

  $ cat /dev/cgroup/A/memory.stat | grep fault
  pgfault 199
  pgmajfault 0
  total_pgfault 199
  total_pgmajfault 0

Performance test: run page fault test (pft) with 16 threads faulting in
15G anon pages in a 16G container.  No regression is noticed in
"flt/cpu/s".

Sample output from pft:

  TAG pft:anon-sys-default:
    Gb  Thr CLine   User     System     Wall    flt/cpu/s fault/wsec
    15   16   1     0.67s   233.41s    14.76s   16798.546 266356.260

  +-------------------------------------------------------------------------+
      N           Min           Max        Median           Avg        Stddev
  x  10     16682.962     17344.027     16913.524     16928.812      166.5362
  +  10     16695.568     16923.896     16820.604     16824.652     84.816568
  No difference proven at 95.0% confidence

[akpm@linux-foundation.org: fix build]
[hughd@google.com: shmem fix]
Signed-off-by: Ying Han &lt;yinghan@google.com&gt;
Acked-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Reviewed-by: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Cc: Daisuke Nishimura &lt;nishimura@mxp.nes.nec.co.jp&gt;
Acked-by: Balbir Singh &lt;balbir@linux.vnet.ibm.com&gt;
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: don't access vm_flags as 'int'</title>
<updated>2011-05-26T16:20:31Z</updated>
<author>
<name>KOSAKI Motohiro</name>
<email>kosaki.motohiro@jp.fujitsu.com</email>
</author>
<published>2011-05-26T10:16:19Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ca16d140af91febe25daeb9e032bf8bd46b8c31f'/>
<id>urn:sha1:ca16d140af91febe25daeb9e032bf8bd46b8c31f</id>
<content type='text'>
The type of vma-&gt;vm_flags is 'unsigned long', neither 'int' nor
'unsigned int'. This patch fixes such misuse.

Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
[ Changed to use a typedef - we'll extend it to cover more cases
  later, since there has been discussion about making it a 64-bit
  type..                      - Linus ]
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: uninline large generic tlb.h functions</title>
<updated>2011-05-25T15:39:20Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>a.p.zijlstra@chello.nl</email>
</author>
<published>2011-05-25T00:12:14Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=9547d01bfb9c351dc19067f8a4cea9d3955f4125'/>
<id>urn:sha1:9547d01bfb9c351dc19067f8a4cea9d3955f4125</id>
<content type='text'>
Some of these functions have grown beyond inline sanity, move them
out-of-line.

Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Requested-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Requested-by: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: Convert i_mmap_lock to a mutex</title>
<updated>2011-05-25T15:39:18Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>a.p.zijlstra@chello.nl</email>
</author>
<published>2011-05-25T00:12:06Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3d48ae45e72390ddf8cc5256ac32ed6f7a19cbea'/>
<id>urn:sha1:3d48ae45e72390ddf8cc5256ac32ed6f7a19cbea</id>
<content type='text'>
Straightforward conversion of i_mmap_lock to a mutex.

Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Acked-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Cc: David Miller &lt;davem@davemloft.net&gt;
Cc: Martin Schwidefsky &lt;schwidefsky@de.ibm.com&gt;
Cc: Russell King &lt;rmk@arm.linux.org.uk&gt;
Cc: Paul Mundt &lt;lethal@linux-sh.org&gt;
Cc: Jeff Dike &lt;jdike@addtoit.com&gt;
Cc: Richard Weinberger &lt;richard@nod.at&gt;
Cc: Tony Luck &lt;tony.luck@intel.com&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Mel Gorman &lt;mel@csn.ul.ie&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Nick Piggin &lt;npiggin@kernel.dk&gt;
Cc: Namhyung Kim &lt;namhyung@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
