author    Andrew Morton <akpm@zip.com.au>                      2002-07-18 21:08:35 -0700
committer Linus Torvalds <torvalds@home.transmeta.com>        2002-07-18 21:08:35 -0700
commit    c48c43e6ed41a3bcec0155e8e4b8440a9a769a0a
tree      47b1a23b3a84d14e8d712eda4c14700bf3151ffb /include/linux
parent    b15d45bfd7a7fa994adec7a32673a448b5155cb0
[PATCH] minimal rmap
This is the "minimal rmap" patch, written by Rik, ported to 2.5 by Craig Kulesa.

Basically, before: when the page reclaim code decides that it has scanned too many unreclaimable pages on the LRU it does a scan of process virtual address spaces for pages to add to swapcache. ptes pointing at the page are unmapped as the scan proceeds. When all ptes referring to a page have been unmapped and it has been written to swap, the page is reclaimable.

After: when an anonymous page is encountered on the tail of the LRU we use the rmap to see if it hasn't been referenced lately. If so then add it to swapcache. When the page is again encountered on the LRU, if it is still unreferenced then try to unmap all ptes which refer to it in one hit, and if it is clean (ie: on swap) then free it.

The rest of the VM - list management, the classzone concept, etc - remains unchanged.

There are a number of things which the per-page pte chain could be used for. Bill Irwin has identified the following.

(1) page replacement no longer goes around randomly unmapping things

(2) referenced bits are more accurate because there aren't several ms or even seconds between finding the multiple ptes mapping a page

(3) reduces page replacement from O(total virtually mapped) to O(physical)

(4) enables defragmentation of physical memory

(5) enables cooperative offlining of memory for friendly guest instance behavior in UML and/or LPAR settings

(6) demonstrable benefit in performance of swapping which is common in end-user interactive workstation workloads (I don't like the word "desktop"). cf. Craig Kulesa's post wrt. swapping performance

(7) evidence from 2.4-based rmap trees indicates approximate parity with mainline in kernel compiles with appropriate locking bits

(8) partitioning of physical memory can reduce the complexity of page replacement searches by scanning only the "interesting" zones (implemented and merged in 2.4-based rmap)

(9) partitioning of physical memory can increase the parallelism of page replacement searches by independently processing different zones (implemented, but not merged in 2.4-based rmap)

(10) the reverse mappings may be used for efficiently keeping pte cache attributes coherent

(11) they may be used for virtual cache invalidation (with changes)

(12) the reverse mappings enable proper RSS limit enforcement (implemented and merged in 2.4-based rmap)

The code adds a pointer to struct page, consumes additional storage for the pte chains and adds computational expense to the page reclaim code (I measured it at 3% additional load during streaming I/O). The benefits which we get back for all this are, I must say, theoretical and unproven. If it has real advantages (or, indeed, disadvantages) then why has nobody demonstrated them?

There are a number of things remaining to be done:

1: Demonstrate the above advantages.
2: Make it work with pte-highmem (Bill Irwin is signed up for this)
3: Don't add pte_chains to non-shared pages optimisation (Dave McCracken's patch does this)
4: Move the pte_chains into highmem too (Bill, I guess)
5: per-cpu pte_chain freelists (Rik?)
6: maybe GC the pte_chain backing pages. (Seems unavoidable. Rik?)
7: multithread the page reclaim code. (I have patches.)
8: clustered add-to-swap. Not sure if I buy this. anon pages are often well-ordered-by-virtual-address on the LRU, so it "just works" for benchmarky loads. But there may be some other loads...
9: Fix bad IO latency in page reclaim (I have lame patches)
10: Develop tuning tools, use them.
11: The nightly updatedb run is still evicting everything.
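
To make the hook points concrete, here is a minimal sketch of the intended call pattern. The two callers below are hypothetical (the real call sites live wherever ptes are set up and torn down, e.g. in mm/memory.c); only page_add_rmap() and page_remove_rmap() come from this patch, as declared in the swap.h hunk below.

	/*
	 * Sketch only: pairing pte installation/teardown with the rmap
	 * bookkeeping added by this patch.  Locking and error handling
	 * are omitted.
	 */
	static void establish_mapping(struct page *page, pte_t *ptep, pte_t pteval)
	{
		set_pte(ptep, pteval);		/* install the translation */
		page_add_rmap(page, ptep);	/* link ptep into page->pte_chain */
	}

	static void tear_down_mapping(struct page *page, pte_t *ptep)
	{
		pte_clear(ptep);		/* remove the translation */
		page_remove_rmap(page, ptep);	/* unlink ptep from the chain */
	}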
Diffstat (limited to 'include/linux')
-rw-r--r--  include/linux/mm.h         |  5
-rw-r--r--  include/linux/page-flags.h | 28
-rw-r--r--  include/linux/swap.h       | 14
3 files changed, 46 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 163e19fd7b33..0c0b6d41dbb0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -130,6 +130,9 @@ struct vm_operations_struct {
struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int unused);
};
+/* forward declaration; pte_chain is meant to be internal to rmap.c */
+struct pte_chain;
+
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
@@ -154,6 +157,8 @@ struct page {
updated asynchronously */
struct list_head lru; /* Pageout list, eg. active_list;
protected by pagemap_lru_lock !! */
+ struct pte_chain * pte_chain; /* Reverse pte mapping pointer.
+ * protected by PG_chainlock */
unsigned long private; /* mapping-private opaque data */
/*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 93a6f27cb454..7cdd56c8cc3e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -47,7 +47,7 @@
* locked- and dirty-page accounting. The top eight bits of page->flags are
* used for page->zone, so putting flag bits there doesn't work.
*/
-#define PG_locked 0 /* Page is locked. Don't touch. */
+#define PG_locked 0 /* Page is locked. Don't touch. */
#define PG_error 1
#define PG_referenced 2
#define PG_uptodate 3
@@ -65,6 +65,7 @@
#define PG_private 12 /* Has something at ->private */
#define PG_writeback 13 /* Page is under writeback */
#define PG_nosave 15 /* Used for system suspend/resume */
+#define PG_chainlock 16 /* lock bit for ->pte_chain */
/*
* Global page accounting. One instance per CPU.
@@ -217,6 +218,31 @@ extern void get_page_state(struct page_state *ret);
#define TestClearPageNosave(page) test_and_clear_bit(PG_nosave, &(page)->flags)
/*
+ * inlines for acquisition and release of PG_chainlock
+ */
+static inline void pte_chain_lock(struct page *page)
+{
+ /*
+ * Assuming the lock is uncontended, this never enters
+ * the body of the outer loop. If it is contended, then
+ * within the inner loop a non-atomic test is used to
+ * busywait with less bus contention for a good time to
+ * attempt to acquire the lock bit.
+ */
+ preempt_disable();
+ while (test_and_set_bit(PG_chainlock, &page->flags)) {
+ while (test_bit(PG_chainlock, &page->flags))
+ cpu_relax();
+ }
+}
+
+static inline void pte_chain_unlock(struct page *page)
+{
+ clear_bit(PG_chainlock, &page->flags);
+ preempt_enable();
+}
+
+/*
* The PageSwapCache predicate doesn't use a PG_flag at this time,
* but it may again do so one day.
*/
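
pte_chain_lock() above is a test-and-test-and-set spin on the PG_chainlock bit: the outer test_and_set_bit() attempts the acquisition, while the inner read-only test_bit() loop spins without hammering the bus. A hedged usage sketch follows; page_has_rmap() is a hypothetical helper, and since struct pte_chain stays opaque outside mm/rmap.c, NULL-ness of page->pte_chain is about all a caller could legitimately check.

	/*
	 * Hypothetical helper: any look at page->pte_chain must be
	 * made under PG_chainlock.
	 */
	static inline int page_has_rmap(struct page *page)
	{
		int mapped;

		pte_chain_lock(page);	/* spin until the bit is ours; preemption off */
		mapped = (page->pte_chain != NULL);
		pte_chain_unlock(page);	/* release the bit; preemption back on */
		return mapped;
	}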
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0b448a811a39..8ba0854d69af 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -142,6 +142,19 @@ struct sysinfo;
struct address_space;
struct zone_t;
+/* linux/mm/rmap.c */
+extern int FASTCALL(page_referenced(struct page *));
+extern void FASTCALL(page_add_rmap(struct page *, pte_t *));
+extern void FASTCALL(page_remove_rmap(struct page *, pte_t *));
+extern int FASTCALL(try_to_unmap(struct page *));
+extern int FASTCALL(page_over_rsslimit(struct page *));
+
+/* return values of try_to_unmap */
+#define SWAP_SUCCESS 0
+#define SWAP_AGAIN 1
+#define SWAP_FAIL 2
+#define SWAP_ERROR 3
+
/* linux/mm/swap.c */
extern void FASTCALL(lru_cache_add(struct page *));
extern void FASTCALL(__lru_cache_del(struct page *));
@@ -168,6 +181,7 @@ int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page);
extern void show_swap_cache_info(void);
#endif
extern int add_to_swap_cache(struct page *, swp_entry_t);
+extern int add_to_swap(struct page *);
extern void __delete_from_swap_cache(struct page *page);
extern void delete_from_swap_cache(struct page *page);
extern int move_to_swap_cache(struct page *page, swp_entry_t entry);
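
Tying these declarations back to the changelog's "after" description, here is a hedged sketch of how a reclaim pass might consume them. It is greatly simplified; the real logic belongs in mm/vmscan.c and must also deal with page locking, writeback and reference counts.

	/*
	 * Sketch only: two-pass treatment of an anonymous page found at
	 * the tail of the LRU, per the changelog.
	 */
	static int try_to_reclaim_anon(struct page *page)
	{
		if (page_referenced(page))
			return 0;		/* referenced lately: leave it alone */

		/* First encounter: give the page swap backing, reclaim later. */
		if (!PageSwapCache(page)) {
			add_to_swap(page);	/* may fail if swap is full */
			return 0;
		}

		/* Second encounter: detach every pte via the pte chain. */
		switch (try_to_unmap(page)) {
		case SWAP_SUCCESS:
			return !PageDirty(page);	/* clean (on swap): freeable */
		default:
			return 0;	/* SWAP_AGAIN, SWAP_FAIL or SWAP_ERROR */
		}
	}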