author    Linus Torvalds <torvalds@home.transmeta.com>    2002-06-17 20:48:29 -0700
committer Linus Torvalds <torvalds@home.transmeta.com>    2002-06-17 20:48:29 -0700
commit    1f60ade2a44d22a67c75a165b70d66f9d4e0b76e (patch)
tree      7a8bda4c45fb3e5d255a023b030137e3b6be87ee
parent    8509486ae776be099cbedb6c37c37741ddc20ad8 (diff)
parent    3986594c6167a269053d3d88f17e53e0ca4023f8 (diff)
Merge master.kernel.org:/home/mingo/bk-sched
into home.transmeta.com:/home/torvalds/v2.5/linux
158 files changed, 2799 insertions(+), 2645 deletions(-)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index d636ae84e508..c894fcceb996 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -50,27 +50,27 @@ prototypes:
 	int (*removexattr) (struct dentry *, const char *);
 
 locking rules:
-	all may block
-		BKL	i_sem(inode)
-lookup:		no	yes
-create:		no	yes
-link:		no	yes (both)
-mknod:		no	yes
-symlink:	no	yes
-mkdir:		no	yes
-unlink:		no	yes (both)
-rmdir:		no	yes (both)	(see below)
-rename:		no	yes (all)	(see below)
-readlink:	no	no
-follow_link:	no	no
-truncate:	no	yes (see below)
-setattr:	no	yes
-permission:	yes	no
-getattr:	no	no
-setxattr:	no	yes
-getxattr:	no	yes
-listxattr:	no	yes
-removexattr:	no	yes
+	all may block, none have BKL
+		i_sem(inode)
+lookup:		yes
+create:		yes
+link:		yes (both)
+mknod:		yes
+symlink:	yes
+mkdir:		yes
+unlink:		yes (both)
+rmdir:		yes (both)	(see below)
+rename:		yes (all)	(see below)
+readlink:	no
+follow_link:	no
+truncate:	yes (see below)
+setattr:	yes
+permission:	no
+getattr:	no
+setxattr:	yes
+getxattr:	yes
+listxattr:	yes
+removexattr:	yes
 	Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_sem on
 victim.
 	cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index ef49709ee8ad..85281b6f4ff0 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -81,9 +81,9 @@ can relax your locking.
 [mandatory]
 
 	->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(),
-->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename() and ->readdir()
-are called without BKL now. Grab it on the entry, drop upon return - that
-will guarantee the same locking you used to have. If your method or its
+->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename(), ->permission()
+and ->readdir() are called without BKL now. Grab it on entry, drop upon return
+- that will guarantee the same locking you used to have. If your method or its
 parts do not need BKL - better yet, now you can shift lock_kernel() and
 unlock_kernel() so that they would protect exactly what needs to be
 protected.
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index f93b1544c6b2..57597335536d 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -948,120 +948,43 @@ program to load modules on demand.
 -----------------------------------------------
 
 The files in this directory can be used to tune the operation of the virtual
-memory (VM) subsystem of the Linux kernel. In addition, one of the files
-(bdflush) has some influence on disk usage.
+memory (VM) subsystem of the Linux kernel.
 
-bdflush
--------
-
-This file controls the operation of the bdflush kernel daemon. It currently
-contains nine integer values, six of which are actually used by the kernel.
-They are listed in table 2-2.
-
-
-Table 2-2: Parameters in /proc/sys/vm/bdflush
-..............................................................................
- Value      Meaning
- nfract     Percentage of buffer cache dirty to activate bdflush
- ndirty     Maximum number of dirty blocks to write out per wake-cycle
- nrefill    Number of clean buffers to try to obtain each time we call refill
- nref_dirt  buffer threshold for activating bdflush when trying to refill
-            buffers.
- dummy      Unused
- age_buffer Time for normal buffer to age before we flush it
- age_super  Time for superblock to age before we flush it
- dummy      Unused
- dummy      Unused
-..............................................................................
-
-nfract
-------
-
-This parameter governs the maximum number of dirty buffers in the buffer
-cache. Dirty means that the contents of the buffer still have to be written to
-disk (as opposed to a clean buffer, which can just be forgotten about).
-Setting this to a higher value means that Linux can delay disk writes for a
-long time, but it also means that it will have to do a lot of I/O at once when
-memory becomes short. A lower value will spread out disk I/O more evenly.
-
-ndirty
-------
-
-Ndirty gives the maximum number of dirty buffers that bdflush can write to the
-disk at one time. A high value will mean delayed, bursty I/O, while a small
-value can lead to memory shortage when bdflush isn't woken up often enough.
-
-nrefill
--------
-
-This is the number of buffers that bdflush will add to the list of free
-buffers when refill_freelist() is called. It is necessary to allocate free
-buffers beforehand, since the buffers are often different sizes than the
-memory pages and some bookkeeping needs to be done beforehand. The higher the
-number, the more memory will be wasted and the less often refill_freelist()
-will need to run.
-
-nref_dirt
----------
-
-When refill_freelist() comes across more than nref_dirt dirty buffers, it will
-wake up bdflush.
-
-age_buffer and age_super
------------------------
-
-Finally, the age_buffer and age_super parameters govern the maximum time Linux
-waits before writing out a dirty buffer to disk. The value is expressed in
-jiffies (clockticks), the number of jiffies per second is 100. Age_buffer is
-the maximum age for data blocks, while age_super is for filesystems meta data.
-
-buffermem
----------
-
-The three values in this file control how much memory should be used for
-buffer memory. The percentage is calculated as a percentage of total system
-memory.
-
-The values are:
-
-min_percent
------------
+dirty_background_ratio
+----------------------
 
-This is the minimum percentage of memory that should be spent on buffer
-memory.
+Contains, as a percentage of total system memory, the number of pages at which
+the pdflush background writeback daemon will start writing out dirty data.
 
-borrow_percent
---------------
+dirty_async_ratio
+-----------------
 
-When Linux is short on memory, and the buffer cache uses more than it has been
-allotted, the memory management (MM) subsystem will prune the buffer cache
-more heavily than other memory to compensate.
+Contains, as a percentage of total system memory, the number of pages at which
+a process which is generating disk writes will itself start writing out dirty
+data.
 
-max_percent
------------
+dirty_sync_ratio
+----------------
 
-This is the maximum amount of memory that can be used for buffer memory.
+Contains, as a percentage of total system memory, the number of pages at which
+a process which is generating disk writes will itself start writing out dirty
+data and waiting upon completion of that writeout.
 
-freepages
----------
+dirty_writeback_centisecs
+-------------------------
 
-This file contains three values: min, low and high:
+The pdflush writeback daemons will periodically wake up and write `old' data
+out to disk. This tunable expresses the interval between those wakeups, in
+100'ths of a second.
-min
----
-When the number of free pages in the system reaches this number, only the
-kernel can allocate more memory.
+dirty_expire_centisecs
+----------------------
 
-low
----
-If the number of free pages falls below this point, the kernel starts swapping
-aggressively.
+This tunable is used to define when dirty data is old enough to be eligible
+for writeout by the pdflush daemons. It is expressed in 100'ths of a second.
+Data which has been dirty in-memory for longer than this interval will be
+written out next time a pdflush daemon wakes up.
 
-high
-----
-The kernel tries to keep up to this amount of memory free; if memory falls
-below this point, the kernel starts gently swapping in the hopes that it never
-has to do really aggressive swapping.
 
 kswapd
 ------
@@ -1113,79 +1036,6 @@
 On the other hand, enabling this feature can cause you to run out of memory
 and thrash the system to death, so large and/or important servers will want
 to set this value to 0.
 
-pagecache
----------
-
-This file does exactly the same job as buffermem, only this file controls the
-amount of memory allowed for memory mapping and generic caching of files.
-
-You don't want the minimum level to be too low, otherwise your system might
-thrash when memory is tight or fragmentation is high.
-
-pagetable_cache
----------------
-
-The kernel keeps a number of page tables in a per-processor cache (this helps
-a lot on SMP systems). The cache size for each processor will be between the
-low and the high value.
-
-On a low-memory, single CPU system, you can safely set these values to 0 so
-you don't waste memory. It is used on SMP systems so that the system can
-perform fast pagetable allocations without having to acquire the kernel memory
-lock.
-
-For large systems, the settings are probably fine. For normal systems they
-won't hurt a bit. For small systems ( less than 16MB ram) it might be
-advantageous to set both values to 0.
-
-swapctl
--------
-
-This file contains no less than 8 variables. All of these values are used by
-kswapd.
-
-The first four variables
-* sc_max_page_age,
-* sc_page_advance,
-* sc_page_decline and
-* sc_page_initial_age
-are used to keep track of Linux's page aging. Page aging is a bookkeeping
-method to track which pages of memory are often used, and which pages can be
-swapped out without consequences.
-
-When a page is swapped in, it starts at sc_page_initial_age (default 3) and
-when the page is scanned by kswapd, its age is adjusted according to the
-following scheme:
-
-* If the page was used since the last time we scanned, its age is increased
-  by sc_page_advance (default 3). Where the maximum value is given by
-  sc_max_page_age (default 20).
-* Otherwise (meaning it wasn't used) its age is decreased by sc_page_decline
-  (default 1).
-
-When a page reaches age 0, it's ready to be swapped out.
-
-The variables sc_age_cluster_fract, sc_age_cluster_min, sc_pageout_weight and
-sc_bufferout_weight, can be used to control kswapd's aggressiveness in
-swapping out pages.
-
-Sc_age_cluster_fract is used to calculate how many pages from a process are to
-be scanned by kswapd. The formula used is
-
-(sc_age_cluster_fract divided by 1024) times resident set size
-
-So if you want kswapd to scan the whole process, sc_age_cluster_fract needs to
-have a value of 1024. The minimum number of pages kswapd will scan is
-represented by sc_age_cluster_min, which is done so that kswapd will also scan
-small processes.
-
-The values of sc_pageout_weight and sc_bufferout_weight are used to control
-how many tries kswapd will make in order to swap out one page/buffer. These
-values can be used to fine-tune the ratio between user pages and buffer/cache
-memory. When you find that your Linux system is swapping out too many process
-pages in order to satisfy buffer memory demands, you may want to either
-increase sc_bufferout_weight, or decrease the value of sc_pageout_weight.
-
 2.5 /proc/sys/dev - Device specific parameters
 -----------------------------------------------
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index bf9abe829e40..b8221db90cde 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -9,116 +9,28 @@
 This file contains the documentation for the sysctl files in
 /proc/sys/vm and is valid for Linux kernel version 2.2.
 
 The files in this directory can be used to tune the operation
-of the virtual memory (VM) subsystem of the Linux kernel, and
-one of the files (bdflush) also has a little influence on disk
-usage.
+of the virtual memory (VM) subsystem of the Linux kernel and
+the writeout of dirty data to disk.
 
 Default values and initialization routines for most of these
 files can be found in mm/swap.c.
 
 Currently, these files are in /proc/sys/vm:
-- bdflush
-- buffermem
-- freepages
 - kswapd
 - overcommit_memory
 - page-cluster
-- pagecache
-- pagetable_cache
+- dirty_async_ratio
+- dirty_background_ratio
+- dirty_expire_centisecs
+- dirty_sync_ratio
+- dirty_writeback_centisecs
 
 ==============================================================
 
-bdflush:
-
-This file controls the operation of the bdflush kernel
-daemon. The source code to this struct can be found in
-linux/fs/buffer.c. It currently contains 9 integer values,
-of which 4 are actually used by the kernel.
-
-From linux/fs/buffer.c:
---------------------------------------------------------------
-union bdflush_param {
-	struct {
-		int nfract;	/* Percentage of buffer cache dirty to
-				   activate bdflush */
-		int dummy1;	/* old "ndirty" */
-		int dummy2;	/* old "nrefill" */
-		int dummy3;	/* unused */
-		int interval;	/* jiffies delay between kupdate flushes */
-		int age_buffer;	/* Time for normal buffer to age */
-		int nfract_sync;/* Percentage of buffer cache dirty to
-				   activate bdflush synchronously */
-		int dummy4;	/* unused */
-		int dummy5;	/* unused */
-	} b_un;
-	unsigned int data[N_PARAM];
-} bdf_prm = {{30, 64, 64, 256, 5*HZ, 30*HZ, 60, 0, 0}};
---------------------------------------------------------------
-
-int nfract:
-The first parameter governs the maximum number of dirty
-buffers in the buffer cache. Dirty means that the contents
-of the buffer still have to be written to disk (as opposed
-to a clean buffer, which can just be forgotten about).
-Setting this to a high value means that Linux can delay disk
-writes for a long time, but it also means that it will have
-to do a lot of I/O at once when memory becomes short. A low
-value will spread out disk I/O more evenly, at the cost of
-more frequent I/O operations. The default value is 30%,
-the minimum is 0%, and the maximum is 100%.
-
-int interval:
-The fifth parameter, interval, is the minimum rate at
-which kupdate will wake and flush. The value is expressed in
-jiffies (clockticks), the number of jiffies per second is
-normally 100 (Alpha is 1024). Thus, x*HZ is x seconds. The
-default value is 5 seconds, the minimum is 0 seconds, and the
-maximum is 600 seconds.
-
-int age_buffer:
-The sixth parameter, age_buffer, governs the maximum time
-Linux waits before writing out a dirty buffer to disk. The
-value is in jiffies. The default value is 30 seconds,
-the minimum is 1 second, and the maximum 6,000 seconds.
-
-int nfract_sync:
-The seventh parameter, nfract_sync, governs the percentage
-of buffer cache that is dirty before bdflush activates
-synchronously. This can be viewed as the hard limit before
-bdflush forces buffers to disk. The default is 60%, the
-minimum is 0%, and the maximum is 100%.
-
-==============================================================
-buffermem:
-
-The three values in this file correspond to the values in
-the struct buffer_mem. It controls how much memory should
-be used for buffer memory. The percentage is calculated
-as a percentage of total system memory.
-
-The values are:
-min_percent	-- this is the minimum percentage of memory
-		   that should be spent on buffer memory
-borrow_percent	-- UNUSED
-max_percent	-- UNUSED
-
-==============================================================
-freepages:
+dirty_async_ratio, dirty_background_ratio, dirty_expire_centisecs,
+dirty_sync_ratio dirty_writeback_centisecs:
 
-This file contains the values in the struct freepages. That
-struct contains three members: min, low and high.
-
-The meaning of the numbers is:
-
-freepages.min	When the number of free pages in the system
-		reaches this number, only the kernel can
-		allocate more memory.
-freepages.low	If the number of free pages gets below this
-		point, the kernel starts swapping aggressively.
-freepages.high	The kernel tries to keep up to this amount of
-		memory free; if memory comes below this point,
-		the kernel gently starts swapping in the hopes
-		that it never has to do real aggressive swapping.
+See Documentation/filesystems/proc.txt
 
 ==============================================================
 
@@ -180,38 +92,3 @@
 The number of pages the kernel reads in at once is equal to
 2 ^ page-cluster. Values above 2 ^ 5 don't make much sense for swap
 because we only cluster swap data in 32-page groups.
 
-==============================================================
-
-pagecache:
-
-This file does exactly the same as buffermem, only this
-file controls the struct page_cache, and thus controls
-the amount of memory used for the page cache.
-
-In 2.2, the page cache is used for 3 main purposes:
-- caching read() data from files
-- caching mmap()ed data and executable files
-- swap cache
-
-When your system is both deep in swap and high on cache,
-it probably means that a lot of the swapped data is being
-cached, making for more efficient swapping than possible
-with the 2.0 kernel.
-
-==============================================================
-
-pagetable_cache:
-
-The kernel keeps a number of page tables in a per-processor
-cache (this helps a lot on SMP systems). The cache size for
-each processor will be between the low and the high value.
-
-On a low-memory, single CPU system you can safely set these
-values to 0 so you don't waste the memory. On SMP systems it
-is used so that the system can do fast pagetable allocations
-without having to acquire the kernel memory lock.
-
-For large systems, the settings are probably OK. For normal
-systems they won't hurt a bit. For small systems (<16MB ram)
-it might be advantageous to set both values to 0.
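As an illustration of the new tunables documented above, here is a minimal
userspace sketch (not part of this patch; the chosen values are arbitrary
examples). It tells the kernel to start background writeback once 5% of
memory is dirty and to wake the pdflush daemons every 5 seconds:

	#include <stdio.h>

	static int write_tunable(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f)
			return -1;
		fprintf(f, "%s\n", val);
		return fclose(f);
	}

	int main(void)
	{
		/* start background writeback at 5% dirty memory */
		if (write_tunable("/proc/sys/vm/dirty_background_ratio", "5"))
			return 1;
		/* wake the pdflush daemons every 500 centisecs */
		if (write_tunable("/proc/sys/vm/dirty_writeback_centisecs", "500"))
			return 1;
		return 0;
	}

The same files can be read back to inspect the current settings.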
- diff --git a/arch/alpha/kernel/time.c b/arch/alpha/kernel/time.c index 0be250e543e8..93a569828d70 100644 --- a/arch/alpha/kernel/time.c +++ b/arch/alpha/kernel/time.c @@ -48,6 +48,8 @@ #include "proto.h" #include "irq_impl.h" +u64 jiffies_64; + extern rwlock_t xtime_lock; extern unsigned long wall_jiffies; /* kernel/timer.c */ diff --git a/arch/arm/kernel/time.c b/arch/arm/kernel/time.c index 7c7e03c5b6e9..cd00aacc74a9 100644 --- a/arch/arm/kernel/time.c +++ b/arch/arm/kernel/time.c @@ -32,6 +32,8 @@ #include <asm/irq.h> #include <asm/leds.h> +u64 jiffies_64; + extern rwlock_t xtime_lock; extern unsigned long wall_jiffies; diff --git a/arch/cris/kernel/time.c b/arch/cris/kernel/time.c index 537040f95a6d..1ee0bbfeab7e 100644 --- a/arch/cris/kernel/time.c +++ b/arch/cris/kernel/time.c @@ -44,6 +44,8 @@ #include <asm/svinto.h> +u64 jiffies_64; + static int have_rtc; /* used to remember if we have an RTC or not */ /* define this if you need to use print_timestamp */ diff --git a/arch/i386/kernel/irq.c b/arch/i386/kernel/irq.c index 8608a903f86d..4265cb038a5a 100644 --- a/arch/i386/kernel/irq.c +++ b/arch/i386/kernel/irq.c @@ -360,8 +360,9 @@ void __global_cli(void) __save_flags(flags); if (flags & (1 << EFLAGS_IF_SHIFT)) { - int cpu = smp_processor_id(); + int cpu; __cli(); + cpu = smp_processor_id(); if (!local_irq_count(cpu)) get_irqlock(cpu); } @@ -369,11 +370,12 @@ void __global_cli(void) void __global_sti(void) { - int cpu = smp_processor_id(); + int cpu = get_cpu(); if (!local_irq_count(cpu)) release_irqlock(cpu); __sti(); + put_cpu(); } /* diff --git a/arch/i386/kernel/time.c b/arch/i386/kernel/time.c index 1e1eb0d3a5f7..f56251513581 100644 --- a/arch/i386/kernel/time.c +++ b/arch/i386/kernel/time.c @@ -65,6 +65,7 @@ */ #include <linux/irq.h> +u64 jiffies_64; unsigned long cpu_khz; /* Detected as we calibrate the TSC */ diff --git a/arch/i386/mm/Makefile b/arch/i386/mm/Makefile index 73e25bd3022a..67df8b6f6594 100644 --- a/arch/i386/mm/Makefile +++ b/arch/i386/mm/Makefile @@ -9,6 +9,7 @@ O_TARGET := mm.o -obj-y := init.o fault.o ioremap.o extable.o +obj-y := init.o fault.o ioremap.o extable.o pageattr.o +export-objs := pageattr.o include $(TOPDIR)/Rules.make diff --git a/arch/i386/mm/ioremap.c b/arch/i386/mm/ioremap.c index f81fae4ff7a9..4ba5641b271f 100644 --- a/arch/i386/mm/ioremap.c +++ b/arch/i386/mm/ioremap.c @@ -10,12 +10,13 @@ #include <linux/vmalloc.h> #include <linux/init.h> +#include <linux/slab.h> #include <asm/io.h> #include <asm/pgalloc.h> #include <asm/fixmap.h> #include <asm/cacheflush.h> #include <asm/tlbflush.h> - +#include <asm/pgtable.h> static inline void remap_area_pte(pte_t * pte, unsigned long address, unsigned long size, unsigned long phys_addr, unsigned long flags) @@ -155,6 +156,7 @@ void * __ioremap(unsigned long phys_addr, unsigned long size, unsigned long flag area = get_vm_area(size, VM_IOREMAP); if (!area) return NULL; + area->phys_addr = phys_addr; addr = area->addr; if (remap_area_pages(VMALLOC_VMADDR(addr), phys_addr, size, flags)) { vfree(addr); @@ -163,10 +165,71 @@ void * __ioremap(unsigned long phys_addr, unsigned long size, unsigned long flag return (void *) (offset + (char *)addr); } + +/** + * ioremap_nocache - map bus memory into CPU space + * @offset: bus address of the memory + * @size: size of the resource to map + * + * ioremap_nocache performs a platform specific sequence of operations to + * make bus memory CPU accessible via the readb/readw/readl/writeb/ + * writew/writel functions and the other mmio helpers. 
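+ * For example (hypothetical driver code, not part of this change;
+ * pdev is an already-enabled PCI device):
+ *
+ *	void *regs = ioremap_nocache(pci_resource_start(pdev, 0),
+ *				     pci_resource_len(pdev, 0));
+ *	if (regs)
+ *		status = readl(regs);	/* first register of BAR 0 */
+ *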
The returned + * address is not guaranteed to be usable directly as a virtual + * address. + * + * This version of ioremap ensures that the memory is marked uncachable + * on the CPU as well as honouring existing caching rules from things like + * the PCI bus. Note that there are other caches and buffers on many + * busses. In particular driver authors should read up on PCI writes + * + * It's useful if some control registers are in such an area and + * write combining or read caching is not desirable: + * + * Must be freed with iounmap. + */ + +void *ioremap_nocache (unsigned long phys_addr, unsigned long size) +{ + void *p = __ioremap(phys_addr, size, _PAGE_PCD); + if (!p) + return p; + + if (phys_addr + size < virt_to_phys(high_memory)) { + struct page *ppage = virt_to_page(__va(phys_addr)); + unsigned long npages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT; + + BUG_ON(phys_addr+size > (unsigned long)high_memory); + BUG_ON(phys_addr + size < phys_addr); + + if (change_page_attr(ppage, npages, PAGE_KERNEL_NOCACHE) < 0) { + iounmap(p); + p = NULL; + } + } + + return p; +} + void iounmap(void *addr) { - if (addr > high_memory) - return vfree((void *) (PAGE_MASK & (unsigned long) addr)); + struct vm_struct *p; + if (addr < high_memory) + return; + p = remove_kernel_area(addr); + if (!p) { + printk("__iounmap: bad address %p\n", addr); + return; + } + + BUG_ON(p->phys_addr == 0); /* not allocated with ioremap */ + + vmfree_area_pages(VMALLOC_VMADDR(p->addr), p->size); + if (p->flags && p->phys_addr < virt_to_phys(high_memory)) { + change_page_attr(virt_to_page(__va(p->phys_addr)), + p->size >> PAGE_SHIFT, + PAGE_KERNEL); + } + kfree(p); } void __init *bt_ioremap(unsigned long phys_addr, unsigned long size) diff --git a/arch/i386/mm/pageattr.c b/arch/i386/mm/pageattr.c new file mode 100644 index 000000000000..c5e2374b6bc7 --- /dev/null +++ b/arch/i386/mm/pageattr.c @@ -0,0 +1,197 @@ +/* + * Copyright 2002 Andi Kleen, SuSE Labs. + * Thanks to Ben LaHaise for precious feedback. + */ + +#include <linux/config.h> +#include <linux/mm.h> +#include <linux/sched.h> +#include <linux/highmem.h> +#include <linux/module.h> +#include <linux/slab.h> +#include <asm/uaccess.h> +#include <asm/processor.h> + +static inline pte_t *lookup_address(unsigned long address) +{ + pgd_t *pgd = pgd_offset_k(address); + pmd_t *pmd = pmd_offset(pgd, address); + if (pmd_large(*pmd)) + return (pte_t *)pmd; + return pte_offset_kernel(pmd, address); +} + +static struct page *split_large_page(unsigned long address, pgprot_t prot) +{ + int i; + unsigned long addr; + struct page *base = alloc_pages(GFP_KERNEL, 0); + pte_t *pbase; + if (!base) + return NULL; + address = __pa(address); + addr = address & LARGE_PAGE_MASK; + pbase = (pte_t *)page_address(base); + for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) { + pbase[i] = pfn_pte(addr >> PAGE_SHIFT, + addr == address ? prot : PAGE_KERNEL); + } + return base; +} + +static void flush_kernel_map(void *dummy) +{ + /* Could use CLFLUSH here if the CPU supports it (Hammer,P4) */ + if (boot_cpu_data.x86_model >= 4) + asm volatile("wbinvd":::"memory"); + /* Flush all to work around Errata in early athlons regarding + * large page flushing. 
+ */ + __flush_tlb_all(); +} + +static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte) +{ + set_pte_atomic(kpte, pte); /* change init_mm */ +#ifndef CONFIG_X86_PAE + { + struct list_head *l; + spin_lock(&mmlist_lock); + list_for_each(l, &init_mm.mmlist) { + struct mm_struct *mm = list_entry(l, struct mm_struct, mmlist); + pmd_t *pmd = pmd_offset(pgd_offset(mm, address), address); + set_pte_atomic((pte_t *)pmd, pte); + } + spin_unlock(&mmlist_lock); + } +#endif +} + +/* + * No more special protections in this 2/4MB area - revert to a + * large page again. + */ +static inline void revert_page(struct page *kpte_page, unsigned long address) +{ + pte_t *linear = (pte_t *) + pmd_offset(pgd_offset(&init_mm, address), address); + set_pmd_pte(linear, address, + pfn_pte((__pa(address) & LARGE_PAGE_MASK) >> PAGE_SHIFT, + PAGE_KERNEL_LARGE)); +} + +static int +__change_page_attr(struct page *page, pgprot_t prot, struct page **oldpage) +{ + pte_t *kpte; + unsigned long address; + struct page *kpte_page; + +#ifdef CONFIG_HIGHMEM + if (page >= highmem_start_page) + BUG(); +#endif + address = (unsigned long)page_address(page); + + kpte = lookup_address(address); + kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK); + if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) { + if ((pte_val(*kpte) & _PAGE_PSE) == 0) { + pte_t old = *kpte; + pte_t standard = mk_pte(page, PAGE_KERNEL); + + set_pte_atomic(kpte, mk_pte(page, prot)); + if (pte_same(old,standard)) + atomic_inc(&kpte_page->count); + } else { + struct page *split = split_large_page(address, prot); + if (!split) + return -ENOMEM; + set_pmd_pte(kpte,address,mk_pte(split, PAGE_KERNEL)); + } + } else if ((pte_val(*kpte) & _PAGE_PSE) == 0) { + set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL)); + atomic_dec(&kpte_page->count); + } + + if (cpu_has_pse && (atomic_read(&kpte_page->count) == 1)) { + *oldpage = kpte_page; + revert_page(kpte_page, address); + } + return 0; +} + +static inline void flush_map(void) +{ +#ifdef CONFIG_SMP + smp_call_function(flush_kernel_map, NULL, 1, 1); +#endif + flush_kernel_map(NULL); +} + +struct deferred_page { + struct deferred_page *next; + struct page *fpage; +}; +static struct deferred_page *df_list; /* protected by init_mm.mmap_sem */ + +/* + * Change the page attributes of an page in the linear mapping. + * + * This should be used when a page is mapped with a different caching policy + * than write-back somewhere - some CPUs do not like it when mappings with + * different caching policies exist. This changes the page attributes of the + * in kernel linear mapping too. + * + * The caller needs to ensure that there are no conflicting mappings elsewhere. + * This function only deals with the kernel linear map. + * + * Caller must call global_flush_tlb() after this. 
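+ *
+ * A minimal usage sketch (hypothetical caller, not part of this change;
+ * 'page' is a page in the kernel linear map):
+ *
+ *	change_page_attr(page, 1, PAGE_KERNEL_NOCACHE);
+ *	global_flush_tlb();
+ *	... use the uncached mapping ...
+ *	change_page_attr(page, 1, PAGE_KERNEL);
+ *	global_flush_tlb();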
+ */ +int change_page_attr(struct page *page, int numpages, pgprot_t prot) +{ + int err = 0; + struct page *fpage; + int i; + + down_write(&init_mm.mmap_sem); + for (i = 0; i < numpages; i++, page++) { + fpage = NULL; + err = __change_page_attr(page, prot, &fpage); + if (err) + break; + if (fpage) { + struct deferred_page *df; + df = kmalloc(sizeof(struct deferred_page), GFP_KERNEL); + if (!df) { + flush_map(); + __free_page(fpage); + } else { + df->next = df_list; + df->fpage = fpage; + df_list = df; + } + } + } + up_write(&init_mm.mmap_sem); + return err; +} + +void global_flush_tlb(void) +{ + struct deferred_page *df, *next_df; + + down_read(&init_mm.mmap_sem); + df = xchg(&df_list, NULL); + up_read(&init_mm.mmap_sem); + flush_map(); + for (; df; df = next_df) { + next_df = df->next; + if (df->fpage) + __free_page(df->fpage); + kfree(df); + } +} + +EXPORT_SYMBOL(change_page_attr); +EXPORT_SYMBOL(global_flush_tlb); diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c index dc6500b7a167..1c348cce1fdd 100644 --- a/arch/ia64/kernel/time.c +++ b/arch/ia64/kernel/time.c @@ -27,6 +27,8 @@ extern rwlock_t xtime_lock; extern unsigned long wall_jiffies; extern unsigned long last_time_offset; +u64 jiffies_64; + #ifdef CONFIG_IA64_DEBUG_IRQ unsigned long last_cli_ip; diff --git a/arch/m68k/kernel/time.c b/arch/m68k/kernel/time.c index a845040b339a..54b8f68cf7e0 100644 --- a/arch/m68k/kernel/time.c +++ b/arch/m68k/kernel/time.c @@ -24,6 +24,7 @@ #include <linux/timex.h> +u64 jiffies_64; static inline int set_rtc_mmss(unsigned long nowtime) { diff --git a/arch/mips/kernel/time.c b/arch/mips/kernel/time.c index e548314773de..6ea186b42155 100644 --- a/arch/mips/kernel/time.c +++ b/arch/mips/kernel/time.c @@ -32,6 +32,8 @@ #define USECS_PER_JIFFY (1000000/HZ) #define USECS_PER_JIFFY_FRAC ((1000000ULL << 32) / HZ & 0xffffffff) +u64 jiffies_64; + /* * forward reference */ diff --git a/arch/mips64/kernel/syscall.c b/arch/mips64/kernel/syscall.c index 6daab491059b..053051c63a25 100644 --- a/arch/mips64/kernel/syscall.c +++ b/arch/mips64/kernel/syscall.c @@ -32,6 +32,8 @@ #include <asm/sysmips.h> #include <asm/uaccess.h> +u64 jiffies_64; + extern asmlinkage void syscall_trace(void); asmlinkage int sys_pipe(abi64_no_regargs, struct pt_regs regs) diff --git a/arch/parisc/kernel/time.c b/arch/parisc/kernel/time.c index 7b3de0e0ada3..e028e6f3dbe2 100644 --- a/arch/parisc/kernel/time.c +++ b/arch/parisc/kernel/time.c @@ -30,6 +30,8 @@ #include <linux/timex.h> +u64 jiffies_64; + extern rwlock_t xtime_lock; static int timer_value; diff --git a/arch/ppc/kernel/time.c b/arch/ppc/kernel/time.c index 260345226022..88a4d63ffea0 100644 --- a/arch/ppc/kernel/time.c +++ b/arch/ppc/kernel/time.c @@ -70,6 +70,9 @@ #include <asm/time.h> +/* XXX false sharing with below? 
*/ +u64 jiffies_64; + unsigned long disarm_decr[NR_CPUS]; extern int do_sys_settimeofday(struct timeval *tv, struct timezone *tz); diff --git a/arch/ppc64/kernel/time.c b/arch/ppc64/kernel/time.c index d00224a05633..9cd390d65342 100644 --- a/arch/ppc64/kernel/time.c +++ b/arch/ppc64/kernel/time.c @@ -64,6 +64,8 @@ void smp_local_timer_interrupt(struct pt_regs *); +u64 jiffies_64; + /* keep track of when we need to update the rtc */ time_t last_rtc_update; extern rwlock_t xtime_lock; diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c index 2a135d999830..f09059ee63bd 100644 --- a/arch/s390/kernel/time.c +++ b/arch/s390/kernel/time.c @@ -39,6 +39,8 @@ #define TICK_SIZE tick +u64 jiffies_64; + static ext_int_info_t ext_int_info_timer; static uint64_t init_timer_cc; diff --git a/arch/s390x/kernel/time.c b/arch/s390x/kernel/time.c index e12e41e2eaef..b81dcb9683d7 100644 --- a/arch/s390x/kernel/time.c +++ b/arch/s390x/kernel/time.c @@ -39,6 +39,8 @@ #define TICK_SIZE tick +u64 jiffies_64; + static ext_int_info_t ext_int_info_timer; static uint64_t init_timer_cc; diff --git a/arch/sh/kernel/time.c b/arch/sh/kernel/time.c index 62af96d4fd48..e51e0eb001d6 100644 --- a/arch/sh/kernel/time.c +++ b/arch/sh/kernel/time.c @@ -70,6 +70,8 @@ #endif /* CONFIG_CPU_SUBTYPE_ST40STB1 */ #endif /* __sh3__ or __SH4__ */ +u64 jiffies_64; + extern rwlock_t xtime_lock; extern unsigned long wall_jiffies; #define TICK_SIZE tick diff --git a/arch/sparc/kernel/time.c b/arch/sparc/kernel/time.c index 6e7935ab7c56..90d3e8528358 100644 --- a/arch/sparc/kernel/time.c +++ b/arch/sparc/kernel/time.c @@ -43,6 +43,8 @@ extern rwlock_t xtime_lock; +u64 jiffies_64; + enum sparc_clock_type sp_clock_typ; spinlock_t mostek_lock = SPIN_LOCK_UNLOCKED; unsigned long mstk48t02_regs = 0UL; diff --git a/arch/sparc64/kernel/time.c b/arch/sparc64/kernel/time.c index 852c96d62319..47c794e99f4b 100644 --- a/arch/sparc64/kernel/time.c +++ b/arch/sparc64/kernel/time.c @@ -44,6 +44,8 @@ unsigned long mstk48t02_regs = 0UL; unsigned long ds1287_regs = 0UL; #endif +u64 jiffies_64; + static unsigned long mstk48t08_regs = 0UL; static unsigned long mstk48t59_regs = 0UL; diff --git a/arch/x86_64/Makefile b/arch/x86_64/Makefile index 3968f838fe7c..46fe5228c782 100644 --- a/arch/x86_64/Makefile +++ b/arch/x86_64/Makefile @@ -43,15 +43,9 @@ CFLAGS += -mcmodel=kernel CFLAGS += -pipe # this makes reading assembly source easier CFLAGS += -fno-reorder-blocks -# needed for later gcc 3.1 CFLAGS += -finline-limit=2000 -# needed for earlier gcc 3.1 -#CFLAGS += -fno-strength-reduce #CFLAGS += -g -# prevent gcc from keeping the stack 16 byte aligned (FIXME) -#CFLAGS += -mpreferred-stack-boundary=2 - HEAD := arch/x86_64/kernel/head.o arch/x86_64/kernel/head64.o arch/x86_64/kernel/init_task.o SUBDIRS := arch/x86_64/tools $(SUBDIRS) arch/x86_64/kernel arch/x86_64/mm arch/x86_64/lib diff --git a/arch/x86_64/boot/Makefile b/arch/x86_64/boot/Makefile index a82cabc11223..9549b65aaae7 100644 --- a/arch/x86_64/boot/Makefile +++ b/arch/x86_64/boot/Makefile @@ -21,10 +21,6 @@ ROOT_DEV := CURRENT SVGA_MODE := -DSVGA_MODE=NORMAL_VGA -# If you want the RAM disk device, define this to be the size in blocks. 
- -RAMDISK := -DRAMDISK=512 - # --------------------------------------------------------------------------- BOOT_INCL = $(TOPDIR)/include/linux/config.h \ diff --git a/arch/x86_64/config.in b/arch/x86_64/config.in index 8605598747a8..829a74f439ad 100644 --- a/arch/x86_64/config.in +++ b/arch/x86_64/config.in @@ -47,8 +47,7 @@ define_bool CONFIG_EISA n define_bool CONFIG_X86_IO_APIC y define_bool CONFIG_X86_LOCAL_APIC y -#currently broken: -#bool 'MTRR (Memory Type Range Register) support' CONFIG_MTRR +bool 'MTRR (Memory Type Range Register) support' CONFIG_MTRR bool 'Symmetric multi-processing support' CONFIG_SMP if [ "$CONFIG_SMP" = "n" ]; then bool 'Preemptible Kernel' CONFIG_PREEMPT @@ -226,6 +225,7 @@ if [ "$CONFIG_DEBUG_KERNEL" != "n" ]; then bool ' Spinlock debugging' CONFIG_DEBUG_SPINLOCK bool ' Additional run-time checks' CONFIG_CHECKING bool ' Debug __init statements' CONFIG_INIT_DEBUG + bool ' Spinlock debugging' CONFIG_DEBUG_SPINLOCK fi endmenu diff --git a/arch/x86_64/ia32/Makefile b/arch/x86_64/ia32/Makefile index 45c356b60cb5..00e69a2d0060 100644 --- a/arch/x86_64/ia32/Makefile +++ b/arch/x86_64/ia32/Makefile @@ -9,8 +9,9 @@ export-objs := ia32_ioctl.o sys_ia32.o all: ia32.o O_TARGET := ia32.o -obj-$(CONFIG_IA32_EMULATION) := ia32entry.o sys_ia32.o ia32_ioctl.o ia32_signal.o \ - ia32_binfmt.o fpu32.o socket32.o ptrace32.o +obj-$(CONFIG_IA32_EMULATION) := ia32entry.o sys_ia32.o ia32_ioctl.o \ + ia32_signal.o \ + ia32_binfmt.o fpu32.o socket32.o ptrace32.o ipc32.o clean:: diff --git a/arch/x86_64/ia32/ipc32.c b/arch/x86_64/ia32/ipc32.c new file mode 100644 index 000000000000..2d322dda88ef --- /dev/null +++ b/arch/x86_64/ia32/ipc32.c @@ -0,0 +1,645 @@ +#include <linux/kernel.h> +#include <linux/sched.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/sem.h> +#include <linux/msg.h> +#include <linux/mm.h> +#include <linux/shm.h> +#include <linux/slab.h> +#include <linux/ipc.h> +#include <asm/mman.h> +#include <asm/types.h> +#include <asm/uaccess.h> +#include <asm/semaphore.h> +#include <asm/ipc.h> + +#include <asm/ia32.h> + +/* + * sys32_ipc() is the de-multiplexer for the SysV IPC calls in 32bit emulation.. + * + * This is really horribly ugly. + */ + +struct msgbuf32 { + s32 mtype; + char mtext[1]; +}; + +struct ipc_perm32 { + int key; + __kernel_uid_t32 uid; + __kernel_gid_t32 gid; + __kernel_uid_t32 cuid; + __kernel_gid_t32 cgid; + unsigned short mode; + unsigned short seq; +}; + +struct ipc64_perm32 { + unsigned key; + __kernel_uid32_t32 uid; + __kernel_gid32_t32 gid; + __kernel_uid32_t32 cuid; + __kernel_gid32_t32 cgid; + unsigned short mode; + unsigned short __pad1; + unsigned short seq; + unsigned short __pad2; + unsigned int unused1; + unsigned int unused2; +}; + +struct semid_ds32 { + struct ipc_perm32 sem_perm; /* permissions .. see ipc.h */ + __kernel_time_t32 sem_otime; /* last semop time */ + __kernel_time_t32 sem_ctime; /* last change time */ + u32 sem_base; /* ptr to first semaphore in array */ + u32 sem_pending; /* pending operations to be processed */ + u32 sem_pending_last; /* last pending operation */ + u32 undo; /* undo requests on this array */ + unsigned short sem_nsems; /* no. 
of semaphores in array */ +}; + +struct semid64_ds32 { + struct ipc64_perm32 sem_perm; + __kernel_time_t32 sem_otime; + unsigned int __unused1; + __kernel_time_t32 sem_ctime; + unsigned int __unused2; + unsigned int sem_nsems; + unsigned int __unused3; + unsigned int __unused4; +}; + +struct msqid_ds32 { + struct ipc_perm32 msg_perm; + u32 msg_first; + u32 msg_last; + __kernel_time_t32 msg_stime; + __kernel_time_t32 msg_rtime; + __kernel_time_t32 msg_ctime; + u32 wwait; + u32 rwait; + unsigned short msg_cbytes; + unsigned short msg_qnum; + unsigned short msg_qbytes; + __kernel_ipc_pid_t32 msg_lspid; + __kernel_ipc_pid_t32 msg_lrpid; +}; + +struct msqid64_ds32 { + struct ipc64_perm32 msg_perm; + __kernel_time_t32 msg_stime; + unsigned int __unused1; + __kernel_time_t32 msg_rtime; + unsigned int __unused2; + __kernel_time_t32 msg_ctime; + unsigned int __unused3; + unsigned int msg_cbytes; + unsigned int msg_qnum; + unsigned int msg_qbytes; + __kernel_pid_t32 msg_lspid; + __kernel_pid_t32 msg_lrpid; + unsigned int __unused4; + unsigned int __unused5; +}; + +struct shmid_ds32 { + struct ipc_perm32 shm_perm; + int shm_segsz; + __kernel_time_t32 shm_atime; + __kernel_time_t32 shm_dtime; + __kernel_time_t32 shm_ctime; + __kernel_ipc_pid_t32 shm_cpid; + __kernel_ipc_pid_t32 shm_lpid; + unsigned short shm_nattch; +}; + +struct shmid64_ds32 { + struct ipc64_perm32 shm_perm; + __kernel_size_t32 shm_segsz; + __kernel_time_t32 shm_atime; + unsigned int __unused1; + __kernel_time_t32 shm_dtime; + unsigned int __unused2; + __kernel_time_t32 shm_ctime; + unsigned int __unused3; + __kernel_pid_t32 shm_cpid; + __kernel_pid_t32 shm_lpid; + unsigned int shm_nattch; + unsigned int __unused4; + unsigned int __unused5; +}; + +struct shminfo64_32 { + unsigned int shmmax; + unsigned int shmmin; + unsigned int shmmni; + unsigned int shmseg; + unsigned int shmall; + unsigned int __unused1; + unsigned int __unused2; + unsigned int __unused3; + unsigned int __unused4; +}; + +struct shm_info32 { + int used_ids; + u32 shm_tot, shm_rss, shm_swp; + u32 swap_attempts, swap_successes; +}; + +struct ipc_kludge { + struct msgbuf *msgp; + int msgtyp; +}; + + +#define A(__x) ((unsigned long)(__x)) +#define AA(__x) ((unsigned long)(__x)) + +#define SEMOP 1 +#define SEMGET 2 +#define SEMCTL 3 +#define MSGSND 11 +#define MSGRCV 12 +#define MSGGET 13 +#define MSGCTL 14 +#define SHMAT 21 +#define SHMDT 22 +#define SHMGET 23 +#define SHMCTL 24 + +#define IPCOP_MASK(__x) (1UL << (__x)) + +static int +ipc_parse_version32 (int *cmd) +{ + if (*cmd & IPC_64) { + *cmd ^= IPC_64; + return IPC_64; + } else { + return IPC_OLD; + } +} + +static int +semctl32 (int first, int second, int third, void *uptr) +{ + union semun fourth; + u32 pad; + int err = 0, err2; + struct semid64_ds s; + mm_segment_t old_fs; + int version = ipc_parse_version32(&third); + + if (!uptr) + return -EINVAL; + if (get_user(pad, (u32 *)uptr)) + return -EFAULT; + if (third == SETVAL) + fourth.val = (int)pad; + else + fourth.__pad = (void *)A(pad); + switch (third) { + case IPC_INFO: + case IPC_RMID: + case IPC_SET: + case SEM_INFO: + case GETVAL: + case GETPID: + case GETNCNT: + case GETZCNT: + case GETALL: + case SETVAL: + case SETALL: + err = sys_semctl(first, second, third, fourth); + break; + + case IPC_STAT: + case SEM_STAT: + fourth.__pad = &s; + old_fs = get_fs(); + set_fs(KERNEL_DS); + err = sys_semctl(first, second|IPC_64, third, fourth); + set_fs(old_fs); + + if (version == IPC_64) { + struct semid64_ds32 *usp64 = (struct semid64_ds32 *) A(pad); + + if 
(!access_ok(VERIFY_WRITE, usp64, sizeof(*usp64))) { + err = -EFAULT; + break; + } + err2 = __put_user(s.sem_perm.key, &usp64->sem_perm.key); + err2 |= __put_user(s.sem_perm.uid, &usp64->sem_perm.uid); + err2 |= __put_user(s.sem_perm.gid, &usp64->sem_perm.gid); + err2 |= __put_user(s.sem_perm.cuid, &usp64->sem_perm.cuid); + err2 |= __put_user(s.sem_perm.cgid, &usp64->sem_perm.cgid); + err2 |= __put_user(s.sem_perm.mode, &usp64->sem_perm.mode); + err2 |= __put_user(s.sem_perm.seq, &usp64->sem_perm.seq); + err2 |= __put_user(s.sem_otime, &usp64->sem_otime); + err2 |= __put_user(s.sem_ctime, &usp64->sem_ctime); + err2 |= __put_user(s.sem_nsems, &usp64->sem_nsems); + } else { + struct semid_ds32 *usp32 = (struct semid_ds32 *) A(pad); + + if (!access_ok(VERIFY_WRITE, usp32, sizeof(*usp32))) { + err = -EFAULT; + break; + } + err2 = __put_user(s.sem_perm.key, &usp32->sem_perm.key); + err2 |= __put_user(s.sem_perm.uid, &usp32->sem_perm.uid); + err2 |= __put_user(s.sem_perm.gid, &usp32->sem_perm.gid); + err2 |= __put_user(s.sem_perm.cuid, &usp32->sem_perm.cuid); + err2 |= __put_user(s.sem_perm.cgid, &usp32->sem_perm.cgid); + err2 |= __put_user(s.sem_perm.mode, &usp32->sem_perm.mode); + err2 |= __put_user(s.sem_perm.seq, &usp32->sem_perm.seq); + err2 |= __put_user(s.sem_otime, &usp32->sem_otime); + err2 |= __put_user(s.sem_ctime, &usp32->sem_ctime); + err2 |= __put_user(s.sem_nsems, &usp32->sem_nsems); + } + if (err2) + err = -EFAULT; + break; + } + return err; +} + +static int +do_sys32_msgsnd (int first, int second, int third, void *uptr) +{ + struct msgbuf *p = kmalloc(second + sizeof(struct msgbuf) + 4, GFP_USER); + struct msgbuf32 *up = (struct msgbuf32 *)uptr; + mm_segment_t old_fs; + int err; + + if (!p) + return -ENOMEM; + err = get_user(p->mtype, &up->mtype); + err |= copy_from_user(p->mtext, &up->mtext, second); + if (err) + goto out; + old_fs = get_fs(); + set_fs(KERNEL_DS); + err = sys_msgsnd(first, p, second, third); + set_fs(old_fs); + out: + kfree(p); + return err; +} + +static int +do_sys32_msgrcv (int first, int second, int msgtyp, int third, int version, void *uptr) +{ + struct msgbuf32 *up; + struct msgbuf *p; + mm_segment_t old_fs; + int err; + + if (!version) { + struct ipc_kludge *uipck = (struct ipc_kludge *)uptr; + struct ipc_kludge ipck; + + err = -EINVAL; + if (!uptr) + goto out; + err = -EFAULT; + if (copy_from_user(&ipck, uipck, sizeof(struct ipc_kludge))) + goto out; + uptr = (void *)A(ipck.msgp); + msgtyp = ipck.msgtyp; + } + err = -ENOMEM; + p = kmalloc(second + sizeof(struct msgbuf) + 4, GFP_USER); + if (!p) + goto out; + old_fs = get_fs(); + set_fs(KERNEL_DS); + err = sys_msgrcv(first, p, second + 4, msgtyp, third); + set_fs(old_fs); + if (err < 0) + goto free_then_out; + up = (struct msgbuf32 *)uptr; + if (put_user(p->mtype, &up->mtype) || copy_to_user(&up->mtext, p->mtext, err)) + err = -EFAULT; +free_then_out: + kfree(p); +out: + return err; +} + +static int +msgctl32 (int first, int second, void *uptr) +{ + int err = -EINVAL, err2; + struct msqid_ds m; + struct msqid64_ds m64; + struct msqid_ds32 *up32 = (struct msqid_ds32 *)uptr; + struct msqid64_ds32 *up64 = (struct msqid64_ds32 *)uptr; + mm_segment_t old_fs; + int version = ipc_parse_version32(&second); + + switch (second) { + case IPC_INFO: + case IPC_RMID: + case MSG_INFO: + err = sys_msgctl(first, second, (struct msqid_ds *)uptr); + break; + + case IPC_SET: + if (version == IPC_64) { + err = get_user(m.msg_perm.uid, &up64->msg_perm.uid); + err |= get_user(m.msg_perm.gid, &up64->msg_perm.gid); + err |= 
get_user(m.msg_perm.mode, &up64->msg_perm.mode); + err |= get_user(m.msg_qbytes, &up64->msg_qbytes); + } else { + err = get_user(m.msg_perm.uid, &up32->msg_perm.uid); + err |= get_user(m.msg_perm.gid, &up32->msg_perm.gid); + err |= get_user(m.msg_perm.mode, &up32->msg_perm.mode); + err |= get_user(m.msg_qbytes, &up32->msg_qbytes); + } + if (err) + break; + old_fs = get_fs(); + set_fs(KERNEL_DS); + err = sys_msgctl(first, second, &m); + set_fs(old_fs); + break; + + case IPC_STAT: + case MSG_STAT: + old_fs = get_fs(); + set_fs(KERNEL_DS); + err = sys_msgctl(first, second|IPC_64, (void *) &m64); + set_fs(old_fs); + + if (version == IPC_64) { + if (!access_ok(VERIFY_WRITE, up64, sizeof(*up64))) { + err = -EFAULT; + break; + } + err2 = __put_user(m64.msg_perm.key, &up64->msg_perm.key); + err2 |= __put_user(m64.msg_perm.uid, &up64->msg_perm.uid); + err2 |= __put_user(m64.msg_perm.gid, &up64->msg_perm.gid); + err2 |= __put_user(m64.msg_perm.cuid, &up64->msg_perm.cuid); + err2 |= __put_user(m64.msg_perm.cgid, &up64->msg_perm.cgid); + err2 |= __put_user(m64.msg_perm.mode, &up64->msg_perm.mode); + err2 |= __put_user(m64.msg_perm.seq, &up64->msg_perm.seq); + err2 |= __put_user(m64.msg_stime, &up64->msg_stime); + err2 |= __put_user(m64.msg_rtime, &up64->msg_rtime); + err2 |= __put_user(m64.msg_ctime, &up64->msg_ctime); + err2 |= __put_user(m64.msg_cbytes, &up64->msg_cbytes); + err2 |= __put_user(m64.msg_qnum, &up64->msg_qnum); + err2 |= __put_user(m64.msg_qbytes, &up64->msg_qbytes); + err2 |= __put_user(m64.msg_lspid, &up64->msg_lspid); + err2 |= __put_user(m64.msg_lrpid, &up64->msg_lrpid); + if (err2) + err = -EFAULT; + } else { + if (!access_ok(VERIFY_WRITE, up32, sizeof(*up32))) { + err = -EFAULT; + break; + } + err2 = __put_user(m64.msg_perm.key, &up32->msg_perm.key); + err2 |= __put_user(m64.msg_perm.uid, &up32->msg_perm.uid); + err2 |= __put_user(m64.msg_perm.gid, &up32->msg_perm.gid); + err2 |= __put_user(m64.msg_perm.cuid, &up32->msg_perm.cuid); + err2 |= __put_user(m64.msg_perm.cgid, &up32->msg_perm.cgid); + err2 |= __put_user(m64.msg_perm.mode, &up32->msg_perm.mode); + err2 |= __put_user(m64.msg_perm.seq, &up32->msg_perm.seq); + err2 |= __put_user(m64.msg_stime, &up32->msg_stime); + err2 |= __put_user(m64.msg_rtime, &up32->msg_rtime); + err2 |= __put_user(m64.msg_ctime, &up32->msg_ctime); + err2 |= __put_user(m64.msg_cbytes, &up32->msg_cbytes); + err2 |= __put_user(m64.msg_qnum, &up32->msg_qnum); + err2 |= __put_user(m64.msg_qbytes, &up32->msg_qbytes); + err2 |= __put_user(m64.msg_lspid, &up32->msg_lspid); + err2 |= __put_user(m64.msg_lrpid, &up32->msg_lrpid); + if (err2) + err = -EFAULT; + } + break; + } + return err; +} + +static int +shmat32 (int first, int second, int third, int version, void *uptr) +{ + unsigned long raddr; + u32 *uaddr = (u32 *)A((u32)third); + int err; + + if (version == 1) + return -EINVAL; /* iBCS2 emulator entry point: unsupported */ + err = sys_shmat(first, uptr, second, &raddr); + if (err) + return err; + return put_user(raddr, uaddr); +} + +static int put_shmid64(struct shmid64_ds *s64p, void *uptr, int version) +{ + int err2; +#define s64 (*s64p) + if (version == IPC_64) { + struct shmid64_ds32 *up64 = (struct shmid64_ds32 *)uptr; + + if (!access_ok(VERIFY_WRITE, up64, sizeof(*up64))) + return -EFAULT; + + err2 = __put_user(s64.shm_perm.key, &up64->shm_perm.key); + err2 |= __put_user(s64.shm_perm.uid, &up64->shm_perm.uid); + err2 |= __put_user(s64.shm_perm.gid, &up64->shm_perm.gid); + err2 |= __put_user(s64.shm_perm.cuid, &up64->shm_perm.cuid); + err2 |= 
__put_user(s64.shm_perm.cgid, &up64->shm_perm.cgid); + err2 |= __put_user(s64.shm_perm.mode, &up64->shm_perm.mode); + err2 |= __put_user(s64.shm_perm.seq, &up64->shm_perm.seq); + err2 |= __put_user(s64.shm_atime, &up64->shm_atime); + err2 |= __put_user(s64.shm_dtime, &up64->shm_dtime); + err2 |= __put_user(s64.shm_ctime, &up64->shm_ctime); + err2 |= __put_user(s64.shm_segsz, &up64->shm_segsz); + err2 |= __put_user(s64.shm_nattch, &up64->shm_nattch); + err2 |= __put_user(s64.shm_cpid, &up64->shm_cpid); + err2 |= __put_user(s64.shm_lpid, &up64->shm_lpid); + } else { + struct shmid_ds32 *up32 = (struct shmid_ds32 *)uptr; + + if (!access_ok(VERIFY_WRITE, up32, sizeof(*up32))) + return -EFAULT; + + err2 = __put_user(s64.shm_perm.key, &up32->shm_perm.key); + err2 |= __put_user(s64.shm_perm.uid, &up32->shm_perm.uid); + err2 |= __put_user(s64.shm_perm.gid, &up32->shm_perm.gid); + err2 |= __put_user(s64.shm_perm.cuid, &up32->shm_perm.cuid); + err2 |= __put_user(s64.shm_perm.cgid, &up32->shm_perm.cgid); + err2 |= __put_user(s64.shm_perm.mode, &up32->shm_perm.mode); + err2 |= __put_user(s64.shm_perm.seq, &up32->shm_perm.seq); + err2 |= __put_user(s64.shm_atime, &up32->shm_atime); + err2 |= __put_user(s64.shm_dtime, &up32->shm_dtime); + err2 |= __put_user(s64.shm_ctime, &up32->shm_ctime); + err2 |= __put_user(s64.shm_segsz, &up32->shm_segsz); + err2 |= __put_user(s64.shm_nattch, &up32->shm_nattch); + err2 |= __put_user(s64.shm_cpid, &up32->shm_cpid); + err2 |= __put_user(s64.shm_lpid, &up32->shm_lpid); + } +#undef s64 + return err2 ? -EFAULT : 0; +} +static int +shmctl32 (int first, int second, void *uptr) +{ + int err = -EFAULT, err2; + struct shmid_ds s; + struct shmid64_ds s64; + mm_segment_t old_fs; + struct shm_info32 *uip = (struct shm_info32 *)uptr; + struct shm_info si; + int version = ipc_parse_version32(&second); + struct shminfo64 smi; + struct shminfo *usi32 = (struct shminfo *) uptr; + struct shminfo64_32 *usi64 = (struct shminfo64_32 *) uptr; + + switch (second) { + case IPC_INFO: + old_fs = get_fs(); + set_fs(KERNEL_DS); + err = sys_shmctl(first, second|IPC_64, (struct shmid_ds *)&smi); + set_fs(old_fs); + + if (version == IPC_64) { + if (!access_ok(VERIFY_WRITE, usi64, sizeof(*usi64))) { + err = -EFAULT; + break; + } + err2 = __put_user(smi.shmmax, &usi64->shmmax); + err2 |= __put_user(smi.shmmin, &usi64->shmmin); + err2 |= __put_user(smi.shmmni, &usi64->shmmni); + err2 |= __put_user(smi.shmseg, &usi64->shmseg); + err2 |= __put_user(smi.shmall, &usi64->shmall); + } else { + if (!access_ok(VERIFY_WRITE, usi32, sizeof(*usi32))) { + err = -EFAULT; + break; + } + err2 = __put_user(smi.shmmax, &usi32->shmmax); + err2 |= __put_user(smi.shmmin, &usi32->shmmin); + err2 |= __put_user(smi.shmmni, &usi32->shmmni); + err2 |= __put_user(smi.shmseg, &usi32->shmseg); + err2 |= __put_user(smi.shmall, &usi32->shmall); + } + if (err2) + err = -EFAULT; + break; + + case IPC_RMID: + case SHM_LOCK: + case SHM_UNLOCK: + err = sys_shmctl(first, second, (struct shmid_ds *)uptr); + break; + + case IPC_SET: + if (version == IPC_64) { + struct shmid64_ds32 *up64 = (struct shmid64_ds32 *)uptr; + err = get_user(s.shm_perm.uid, &up64->shm_perm.uid); + err |= get_user(s.shm_perm.gid, &up64->shm_perm.gid); + err |= get_user(s.shm_perm.mode, &up64->shm_perm.mode); + } else { + struct shmid_ds32 *up32 = (struct shmid_ds32 *)uptr; + err = get_user(s.shm_perm.uid, &up32->shm_perm.uid); + err |= get_user(s.shm_perm.gid, &up32->shm_perm.gid); + err |= get_user(s.shm_perm.mode, &up32->shm_perm.mode); + } + if (err) + break; 
+ old_fs = get_fs(); + set_fs(KERNEL_DS); + err = sys_shmctl(first, second, &s); + set_fs(old_fs); + break; + + case IPC_STAT: + case SHM_STAT: + old_fs = get_fs(); + set_fs(KERNEL_DS); + err = sys_shmctl(first, second|IPC_64, (void *) &s64); + set_fs(old_fs); + + if (err < 0) + break; + err2 = put_shmid64(&s64, uptr, version); + if (err2) + err = err2; + break; + + case SHM_INFO: + old_fs = get_fs(); + set_fs(KERNEL_DS); + err = sys_shmctl(first, second, (void *)&si); + set_fs(old_fs); + if (err < 0) + break; + + if (!access_ok(VERIFY_WRITE, uip, sizeof(*uip))) { + err = -EFAULT; + break; + } + err2 = __put_user(si.used_ids, &uip->used_ids); + err2 |= __put_user(si.shm_tot, &uip->shm_tot); + err2 |= __put_user(si.shm_rss, &uip->shm_rss); + err2 |= __put_user(si.shm_swp, &uip->shm_swp); + err2 |= __put_user(si.swap_attempts, &uip->swap_attempts); + err2 |= __put_user(si.swap_successes, &uip->swap_successes); + if (err2) + err = -EFAULT; + break; + + } + return err; +} + +asmlinkage long +sys32_ipc (u32 call, int first, int second, int third, u32 ptr, u32 fifth) +{ + int version; + + version = call >> 16; /* hack for backward compatibility */ + call &= 0xffff; + + switch (call) { + case SEMOP: + /* struct sembuf is the same on 32 and 64bit :)) */ + return sys_semop(first, (struct sembuf *)AA(ptr), second); + case SEMGET: + return sys_semget(first, second, third); + case SEMCTL: + return semctl32(first, second, third, (void *)AA(ptr)); + + case MSGSND: + return do_sys32_msgsnd(first, second, third, (void *)AA(ptr)); + case MSGRCV: + return do_sys32_msgrcv(first, second, fifth, third, version, (void *)AA(ptr)); + case MSGGET: + return sys_msgget((key_t) first, second); + case MSGCTL: + return msgctl32(first, second, (void *)AA(ptr)); + + case SHMAT: + return shmat32(first, second, third, version, (void *)AA(ptr)); + break; + case SHMDT: + return sys_shmdt((char *)AA(ptr)); + case SHMGET: + return sys_shmget(first, second, third); + case SHMCTL: + return shmctl32(first, second, (void *)AA(ptr)); + + default: + return -EINVAL; + } + return -EINVAL; +} + diff --git a/arch/x86_64/ia32/sys_ia32.c b/arch/x86_64/ia32/sys_ia32.c index 35060b86a54a..85aaed5ec40a 100644 --- a/arch/x86_64/ia32/sys_ia32.c +++ b/arch/x86_64/ia32/sys_ia32.c @@ -1119,422 +1119,6 @@ sys32_setrlimit(unsigned int resource, struct rlimit32 *rlim) } /* - * sys32_ipc() is the de-multiplexer for the SysV IPC calls in 32bit emulation.. - * - * This is really horribly ugly. - */ - -struct msgbuf32 { s32 mtype; char mtext[1]; }; - -struct ipc_perm32 -{ - key_t key; - __kernel_uid_t32 uid; - __kernel_gid_t32 gid; - __kernel_uid_t32 cuid; - __kernel_gid_t32 cgid; - __kernel_mode_t32 mode; - unsigned short seq; -}; - -struct semid_ds32 { - struct ipc_perm32 sem_perm; /* permissions .. see ipc.h */ - __kernel_time_t32 sem_otime; /* last semop time */ - __kernel_time_t32 sem_ctime; /* last change time */ - u32 sem_base; /* ptr to first semaphore in array */ - u32 sem_pending; /* pending operations to be processed */ - u32 sem_pending_last; /* last pending operation */ - u32 undo; /* undo requests on this array */ - unsigned short sem_nsems; /* no. 
of semaphores in array */ -}; - -struct msqid_ds32 -{ - struct ipc_perm32 msg_perm; - u32 msg_first; - u32 msg_last; - __kernel_time_t32 msg_stime; - __kernel_time_t32 msg_rtime; - __kernel_time_t32 msg_ctime; - u32 wwait; - u32 rwait; - unsigned short msg_cbytes; - unsigned short msg_qnum; - unsigned short msg_qbytes; - __kernel_ipc_pid_t32 msg_lspid; - __kernel_ipc_pid_t32 msg_lrpid; -}; - -struct shmid_ds32 { - struct ipc_perm32 shm_perm; - int shm_segsz; - __kernel_time_t32 shm_atime; - __kernel_time_t32 shm_dtime; - __kernel_time_t32 shm_ctime; - __kernel_ipc_pid_t32 shm_cpid; - __kernel_ipc_pid_t32 shm_lpid; - unsigned short shm_nattch; -}; - -#define IPCOP_MASK(__x) (1UL << (__x)) - -static int -do_sys32_semctl(int first, int second, int third, void *uptr) -{ - union semun fourth; - u32 pad; - int err; - struct semid64_ds s; - struct semid_ds32 *usp; - mm_segment_t old_fs; - - if (!uptr) - return -EINVAL; - err = -EFAULT; - if (get_user (pad, (u32 *)uptr)) - return err; - if(third == SETVAL) - fourth.val = (int)pad; - else - fourth.__pad = (void *)A(pad); - - switch (third) { - - case IPC_INFO: - case IPC_RMID: - case IPC_SET: - case SEM_INFO: - case GETVAL: - case GETPID: - case GETNCNT: - case GETZCNT: - case GETALL: - case SETVAL: - case SETALL: - err = sys_semctl (first, second, third, fourth); - break; - - case IPC_STAT: - case SEM_STAT: - usp = (struct semid_ds32 *)A(pad); - fourth.__pad = &s; - old_fs = get_fs (); - set_fs (KERNEL_DS); - err = sys_semctl (first, second, third, fourth); - set_fs (old_fs); - if (verify_area(VERIFY_WRITE, usp, sizeof(struct semid_ds32)) || - __put_user(s.sem_perm.key, &usp->sem_perm.key) || - __put_user(s.sem_perm.uid, &usp->sem_perm.uid) || - __put_user(s.sem_perm.gid, &usp->sem_perm.gid) || - __put_user(s.sem_perm.cuid, &usp->sem_perm.cuid) || - __put_user (s.sem_perm.cgid, &usp->sem_perm.cgid) || - __put_user (s.sem_perm.mode, &usp->sem_perm.mode) || - __put_user (s.sem_perm.seq, &usp->sem_perm.seq) || - __put_user (s.sem_otime, &usp->sem_otime) || - __put_user (s.sem_ctime, &usp->sem_ctime) || - __put_user (s.sem_nsems, &usp->sem_nsems)) - return -EFAULT; - break; - - } - - return err; -} - -static int -do_sys32_msgsnd (int first, int second, int third, void *uptr) -{ - struct msgbuf *p = kmalloc (second + sizeof (struct msgbuf) - + 4, GFP_USER); - struct msgbuf32 *up = (struct msgbuf32 *)uptr; - mm_segment_t old_fs; - int err; - - if (!p) - return -ENOMEM; - err = verify_area(VERIFY_READ, up, sizeof(struct msgbuf32)); - if (err) - goto out; - err = __get_user (p->mtype, &up->mtype); - err |= __copy_from_user (p->mtext, &up->mtext, second); - if (err) - goto out; - old_fs = get_fs (); - set_fs (KERNEL_DS); - err = sys_msgsnd (first, p, second, third); - set_fs (old_fs); -out: - kfree (p); - return err; -} - -static int -do_sys32_msgrcv (int first, int second, int msgtyp, int third, - int version, void *uptr) -{ - struct msgbuf32 *up; - struct msgbuf *p; - mm_segment_t old_fs; - int err; - - if (!version) { - struct ipc_kludge *uipck = (struct ipc_kludge *)uptr; - struct ipc_kludge ipck; - - err = -EINVAL; - if (!uptr) - goto out; - err = -EFAULT; - if (copy_from_user (&ipck, uipck, sizeof (struct ipc_kludge))) - goto out; - uptr = (void *)A(ipck.msgp); - msgtyp = ipck.msgtyp; - } - err = -ENOMEM; - p = kmalloc (second + sizeof (struct msgbuf) + 4, GFP_USER); - if (!p) - goto out; - old_fs = get_fs (); - set_fs (KERNEL_DS); - err = sys_msgrcv (first, p, second + 4, msgtyp, third); - set_fs (old_fs); - if (err < 0) - goto free_then_out; - up = 
(struct msgbuf32 *)uptr; - if (verify_area(VERIFY_WRITE, up, sizeof(struct msgbuf32)) || - __put_user (p->mtype, &up->mtype) || - __copy_to_user (&up->mtext, p->mtext, err)) - err = -EFAULT; -free_then_out: - kfree (p); -out: - return err; -} - -static int -do_sys32_msgctl (int first, int second, void *uptr) -{ - int err = -EINVAL; - struct msqid_ds m; - struct msqid64_ds m64; - struct msqid_ds32 *up = (struct msqid_ds32 *)uptr; - mm_segment_t old_fs; - - switch (second) { - - case IPC_INFO: - case IPC_RMID: - case MSG_INFO: - err = sys_msgctl (first, second, (struct msqid_ds *)uptr); - break; - - case IPC_SET: - err = verify_area(VERIFY_READ, up, sizeof(struct msqid_ds32)); - if (err) - break; - err = __get_user (m.msg_perm.uid, &up->msg_perm.uid); - err |= __get_user (m.msg_perm.gid, &up->msg_perm.gid); - err |= __get_user (m.msg_perm.mode, &up->msg_perm.mode); - err |= __get_user (m.msg_qbytes, &up->msg_qbytes); - if (err) - break; - old_fs = get_fs (); - set_fs (KERNEL_DS); - err = sys_msgctl (first, second, &m); - set_fs (old_fs); - break; - - case IPC_STAT: - case MSG_STAT: - old_fs = get_fs (); - set_fs (KERNEL_DS); - err = sys_msgctl (first, second, (void *) &m64); - set_fs (old_fs); - if (verify_area(VERIFY_WRITE, up, sizeof(struct msqid_ds32)) || - __put_user (m64.msg_perm.key, &up->msg_perm.key) || - __put_user(m64.msg_perm.uid, &up->msg_perm.uid) || - __put_user(m64.msg_perm.gid, &up->msg_perm.gid) || - __put_user(m64.msg_perm.cuid, &up->msg_perm.cuid) || - __put_user(m64.msg_perm.cgid, &up->msg_perm.cgid) || - __put_user(m64.msg_perm.mode, &up->msg_perm.mode) || - __put_user(m64.msg_perm.seq, &up->msg_perm.seq) || - __put_user(m64.msg_stime, &up->msg_stime) || - __put_user(m64.msg_rtime, &up->msg_rtime) || - __put_user(m64.msg_ctime, &up->msg_ctime) || - __put_user(m64.msg_cbytes, &up->msg_cbytes) || - __put_user(m64.msg_qnum, &up->msg_qnum) || - __put_user(m64.msg_qbytes, &up->msg_qbytes) || - __put_user(m64.msg_lspid, &up->msg_lspid) || - __put_user(m64.msg_lrpid, &up->msg_lrpid)) - return -EFAULT; - break; - - } - - return err; -} - -static int -do_sys32_shmat (int first, int second, int third, int version, void *uptr) -{ - unsigned long raddr; - u32 *uaddr = (u32 *)A((u32)third); - int err = -EINVAL; - - if (version == 1) - return err; - err = sys_shmat (first, uptr, second, &raddr); - if (err) - return err; - err = put_user (raddr, uaddr); - return err; -} - -static int -do_sys32_shmctl (int first, int second, void *uptr) -{ - int err = -EFAULT; - struct shmid_ds s; - struct shmid64_ds s64; - struct shmid_ds32 *up = (struct shmid_ds32 *)uptr; - mm_segment_t old_fs; - struct shm_info32 { - int used_ids; - u32 shm_tot, shm_rss, shm_swp; - u32 swap_attempts, swap_successes; - } *uip = (struct shm_info32 *)uptr; - struct shm_info si; - - switch (second) { - - case IPC_INFO: - case IPC_RMID: - case SHM_LOCK: - case SHM_UNLOCK: - err = sys_shmctl (first, second, (struct shmid_ds *)uptr); - break; - case IPC_SET: - err = verify_area(VERIFY_READ, up, sizeof(struct shmid_ds32)); - if (err) - break; - err = __get_user (s.shm_perm.uid, &up->shm_perm.uid); - err |= __get_user (s.shm_perm.gid, &up->shm_perm.gid); - err |= __get_user (s.shm_perm.mode, &up->shm_perm.mode); - if (err) - break; - old_fs = get_fs (); - set_fs (KERNEL_DS); - err = sys_shmctl (first, second, &s); - set_fs (old_fs); - break; - - case IPC_STAT: - case SHM_STAT: - old_fs = get_fs (); - set_fs (KERNEL_DS); - err = sys_shmctl (first, second, (void *) &s64); - set_fs (old_fs); - if (err < 0) - break; - if 
(verify_area(VERIFY_WRITE, up, sizeof(struct shmid_ds32)) || - __put_user (s64.shm_perm.key, &up->shm_perm.key) || - __put_user (s64.shm_perm.uid, &up->shm_perm.uid) || - __put_user (s64.shm_perm.gid, &up->shm_perm.gid) || - __put_user (s64.shm_perm.cuid, &up->shm_perm.cuid) || - __put_user (s64.shm_perm.cgid, &up->shm_perm.cgid) || - __put_user (s64.shm_perm.mode, &up->shm_perm.mode) || - __put_user (s64.shm_perm.seq, &up->shm_perm.seq) || - __put_user (s64.shm_atime, &up->shm_atime) || - __put_user (s64.shm_dtime, &up->shm_dtime) || - __put_user (s64.shm_ctime, &up->shm_ctime) || - __put_user (s64.shm_segsz, &up->shm_segsz) || - __put_user (s64.shm_nattch, &up->shm_nattch) || - __put_user (s64.shm_cpid, &up->shm_cpid) || - __put_user (s64.shm_lpid, &up->shm_lpid)) - return -EFAULT; - break; - - case SHM_INFO: - old_fs = get_fs (); - set_fs (KERNEL_DS); - err = sys_shmctl (first, second, (void *)&si); - set_fs (old_fs); - if (err < 0) - break; - if (verify_area(VERIFY_WRITE, uip, sizeof(struct shm_info32)) || - __put_user (si.used_ids, &uip->used_ids) || - __put_user (si.shm_tot, &uip->shm_tot) || - __put_user (si.shm_rss, &uip->shm_rss) || - __put_user (si.shm_swp, &uip->shm_swp) || - __put_user (si.swap_attempts, &uip->swap_attempts) || - __put_user (si.swap_successes, &uip->swap_successes)) - return -EFAULT; - break; - - } - return err; -} - -asmlinkage long -sys32_ipc (u32 call, int first, int second, int third, u32 ptr, u32 fifth) -{ - int version, err; - - version = call >> 16; /* hack for backward compatibility */ - call &= 0xffff; - - switch (call) { - - case SEMOP: - /* struct sembuf is the same on 32 and 64bit :)) */ - err = sys_semop (first, (struct sembuf *)AA(ptr), - second); - break; - case SEMGET: - err = sys_semget (first, second, third); - break; - case SEMCTL: - err = do_sys32_semctl (first, second, third, - (void *)AA(ptr)); - break; - - case MSGSND: - err = do_sys32_msgsnd (first, second, third, - (void *)AA(ptr)); - break; - case MSGRCV: - err = do_sys32_msgrcv (first, second, fifth, third, - version, (void *)AA(ptr)); - break; - case MSGGET: - err = sys_msgget ((key_t) first, second); - break; - case MSGCTL: - err = do_sys32_msgctl (first, second, (void *)AA(ptr)); - break; - - case SHMAT: - err = do_sys32_shmat (first, second, third, - version, (void *)AA(ptr)); - break; - case SHMDT: - err = sys_shmdt ((char *)AA(ptr)); - break; - case SHMGET: - err = sys_shmget (first, second, third); - break; - case SHMCTL: - err = do_sys32_shmctl (first, second, (void *)AA(ptr)); - break; - default: - err = -EINVAL; - break; - } - - return err; -} - -/* * sys_time() can be implemented in user-level using * sys_gettimeofday(). IA64 did this but i386 Linux did not * so we have to implement this system call here. diff --git a/arch/x86_64/kernel/ioport.c b/arch/x86_64/kernel/ioport.c index a0ab1a1ee68e..b8ad4c6d3709 100644 --- a/arch/x86_64/kernel/ioport.c +++ b/arch/x86_64/kernel/ioport.c @@ -14,6 +14,7 @@ #include <linux/smp.h> #include <linux/smp_lock.h> #include <linux/stddef.h> +#include <linux/slab.h> /* Set EXTENT bits starting at BASE in BITMAP to value TURN_ON. */ static void set_bitmap(unsigned long *bitmap, short base, short extent, int new_value) @@ -61,27 +62,19 @@ asmlinkage int sys_ioperm(unsigned long from, unsigned long num, int turn_on) return -EINVAL; if (turn_on && !capable(CAP_SYS_RAWIO)) return -EPERM; - /* - * If it's the first ioperm() call in this thread's lifetime, set the - * IO bitmap up. 
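The do_sys32_*() helpers above all share one marshalling pattern: read the 32-bit user layout, run the native 64-bit syscall on a kernel-resident struct with the usual user-copy checks suspended via set_fs(KERNEL_DS), then write the result back to the 32-bit struct field by field. A minimal sketch of that round trip, assuming the 2.5-era get_fs()/set_fs()/verify_area() interfaces this file already uses; the foo names are illustrative, not part of the patch:

    /* 64-bit kernel view and its 32-bit user-space mirror. */
    struct foo   { long a; long b; };
    struct foo32 { u32 a; u32 b; };

    static int do_sys32_foo(struct foo32 *up)
    {
        struct foo k;
        mm_segment_t old_fs;
        int err;

        old_fs = get_fs();
        set_fs(KERNEL_DS);      /* let sys_foo() accept &k, a kernel pointer */
        err = sys_foo(&k);      /* hypothetical native 64-bit syscall */
        set_fs(old_fs);
        if (err < 0)
            return err;

        /* One verify_area() up front permits the cheaper __put_user()
         * per-field copies used throughout the helpers above. */
        if (verify_area(VERIFY_WRITE, up, sizeof(*up)) ||
            __put_user(k.a, &up->a) ||
            __put_user(k.b, &up->b))
            return -EFAULT;
        return err;
    }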
ioperm() is much less timing critical than clone(), - * this is why we delay this operation until now: - */ - if (!t->ioperm) { - /* - * just in case ... - */ - memset(t->io_bitmap,0xff,(IO_BITMAP_SIZE+1)*4); - t->ioperm = 1; - /* - * this activates it in the TSS - */ + + if (!t->io_bitmap_ptr) { + t->io_bitmap_ptr = kmalloc((IO_BITMAP_SIZE+1)*4, GFP_KERNEL); + if (!t->io_bitmap_ptr) + return -ENOMEM; + memset(t->io_bitmap_ptr,0xff,(IO_BITMAP_SIZE+1)*4); tss->io_map_base = IO_BITMAP_OFFSET; } /* * do it in the per-thread copy and in the TSS ... */ - set_bitmap((unsigned long *) t->io_bitmap, from, num, !turn_on); + set_bitmap((unsigned long *) t->io_bitmap_ptr, from, num, !turn_on); set_bitmap((unsigned long *) tss->io_bitmap, from, num, !turn_on); return 0; diff --git a/arch/x86_64/kernel/mtrr.c b/arch/x86_64/kernel/mtrr.c index 1f36d262b618..b0c43563a30a 100644 --- a/arch/x86_64/kernel/mtrr.c +++ b/arch/x86_64/kernel/mtrr.c @@ -19,10 +19,14 @@ Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. (For earlier history, see arch/i386/kernel/mtrr.c) - September 2001 Dave Jones <davej@suse.de> + v2.00 September 2001 Dave Jones <davej@suse.de> Initial rewrite for x86-64. - + Removal of non-Intel style MTRR code. + v2.01 June 2002 Dave Jones <davej@suse.de> + Removal of redundant abstraction layer. + 64-bit fixes. */ + #include <linux/types.h> #include <linux/errno.h> #include <linux/sched.h> @@ -60,35 +64,19 @@ #include <asm/hardirq.h> #include <linux/irq.h> -#define MTRR_VERSION "2.00 (20020207)" +#define MTRR_VERSION "2.01 (20020605)" #define TRUE 1 #define FALSE 0 -#define MTRRcap_MSR 0x0fe -#define MTRRdefType_MSR 0x2ff - -#define MTRRphysBase_MSR(reg) (0x200 + 2 * (reg)) -#define MTRRphysMask_MSR(reg) (0x200 + 2 * (reg) + 1) +#define MSR_MTRRphysBase(reg) (0x200 + 2 * (reg)) +#define MSR_MTRRphysMask(reg) (0x200 + 2 * (reg) + 1) #define NUM_FIXED_RANGES 88 -#define MTRRfix64K_00000_MSR 0x250 -#define MTRRfix16K_80000_MSR 0x258 -#define MTRRfix16K_A0000_MSR 0x259 -#define MTRRfix4K_C0000_MSR 0x268 -#define MTRRfix4K_C8000_MSR 0x269 -#define MTRRfix4K_D0000_MSR 0x26a -#define MTRRfix4K_D8000_MSR 0x26b -#define MTRRfix4K_E0000_MSR 0x26c -#define MTRRfix4K_E8000_MSR 0x26d -#define MTRRfix4K_F0000_MSR 0x26e -#define MTRRfix4K_F8000_MSR 0x26f -#ifdef CONFIG_SMP #define MTRR_CHANGE_MASK_FIXED 0x01 #define MTRR_CHANGE_MASK_VARIABLE 0x02 #define MTRR_CHANGE_MASK_DEFTYPE 0x04 -#endif typedef u8 mtrr_type; @@ -97,49 +85,43 @@ typedef u8 mtrr_type; #ifdef CONFIG_SMP #define set_mtrr(reg,base,size,type) set_mtrr_smp (reg, base, size, type) #else -#define set_mtrr(reg,base,size,type) (*set_mtrr_up) (reg, base, size, type, \ - TRUE) +#define set_mtrr(reg,base,size,type) set_mtrr_up (reg, base, size, type, TRUE) #endif #if defined(CONFIG_PROC_FS) || defined(CONFIG_DEVFS_FS) #define USERSPACE_INTERFACE #endif -#ifndef USERSPACE_INTERFACE -#define compute_ascii() while (0) -#endif - #ifdef USERSPACE_INTERFACE static char *ascii_buffer; static unsigned int ascii_buf_bytes; -#endif -static unsigned int *usage_table; -static DECLARE_MUTEX (main_lock); - -/* Private functions */ -#ifdef USERSPACE_INTERFACE static void compute_ascii (void); +#else +#define compute_ascii() while (0) #endif +static unsigned int *usage_table; +static DECLARE_MUTEX (mtrr_lock); + struct set_mtrr_context { - unsigned long flags; - unsigned long deftype_lo; - unsigned long deftype_hi; - unsigned long cr4val; + u32 deftype_lo; + u32 deftype_hi; + u64 flags; + u64 cr4val; }; /* Put the processor into a state where MTRRs 
can be safely set */ static void set_mtrr_prepare (struct set_mtrr_context *ctxt) { - unsigned long cr0; + u64 cr0; /* Disable interrupts locally */ __save_flags(ctxt->flags); __cli(); /* Save value of CR4 and clear Page Global Enable (bit 7) */ - if (cpu_has_ge) { + if (cpu_has_pge) { ctxt->cr4val = read_cr4(); write_cr4(ctxt->cr4val & ~(1UL << 7)); } @@ -152,8 +134,8 @@ static void set_mtrr_prepare (struct set_mtrr_context *ctxt) wbinvd(); /* Disable MTRRs, and set the default type to uncached */ - rdmsr(MTRRdefType_MSR, ctxt->deftype_lo, ctxt->deftype_hi); - wrmsr(MTRRdefType_MSR, ctxt->deftype_lo & 0xf300UL, ctxt->deftype_hi); + rdmsr(MSR_MTRRdefType, ctxt->deftype_lo, ctxt->deftype_hi); + wrmsr(MSR_MTRRdefType, ctxt->deftype_lo & 0xf300UL, ctxt->deftype_hi); } @@ -164,7 +146,7 @@ static void set_mtrr_done (struct set_mtrr_context *ctxt) wbinvd(); /* Restore MTRRdefType */ - wrmsr(MTRRdefType_MSR, ctxt->deftype_lo, ctxt->deftype_hi); + wrmsr(MSR_MTRRdefType, ctxt->deftype_lo, ctxt->deftype_hi); /* Enable caches */ write_cr0(read_cr0() & 0xbfffffff); @@ -181,9 +163,9 @@ static void set_mtrr_done (struct set_mtrr_context *ctxt) /* This function returns the number of variable MTRRs */ static unsigned int get_num_var_ranges (void) { - unsigned long config, dummy; + u32 config, dummy; - rdmsr (MTRRcap_MSR, config, dummy); + rdmsr (MSR_MTRRcap, config, dummy); return (config & 0xff); } @@ -191,21 +173,21 @@ static unsigned int get_num_var_ranges (void) /* Returns non-zero if we have the write-combining memory type */ static int have_wrcomb (void) { - unsigned long config, dummy; + u32 config, dummy; - rdmsr (MTRRcap_MSR, config, dummy); + rdmsr (MSR_MTRRcap, config, dummy); return (config & (1 << 10)); } -static u32 size_or_mask, size_and_mask; +static u64 size_or_mask, size_and_mask; -static void get_mtrr (unsigned int reg, unsigned long *base, - unsigned long *size, mtrr_type * type) +static void get_mtrr (unsigned int reg, u64 *base, u32 *size, mtrr_type * type) { - unsigned long mask_lo, mask_hi, base_lo, base_hi; + u32 mask_lo, mask_hi, base_lo, base_hi; + u64 newsize; - rdmsr (MTRRphysMask_MSR (reg), mask_lo, mask_hi); + rdmsr (MSR_MTRRphysMask(reg), mask_lo, mask_hi); if ((mask_lo & 0x800) == 0) { /* Invalid (i.e. free) range */ *base = 0; @@ -214,32 +196,29 @@ static void get_mtrr (unsigned int reg, unsigned long *base, return; } - rdmsr (MTRRphysBase_MSR (reg), base_lo, base_hi); + rdmsr (MSR_MTRRphysBase(reg), base_lo, base_hi); /* Work out the shifted address mask. */ - mask_lo = size_or_mask | mask_hi << (32 - PAGE_SHIFT) - | mask_lo >> PAGE_SHIFT; - - /* This works correctly if size is a power of two, i.e. a - contiguous range. */ - *size = -mask_lo; + newsize = (u64) mask_hi << 32 | (mask_lo & ~0x800); + newsize = ~newsize+1; + *size = (u32) newsize >> PAGE_SHIFT; *base = base_hi << (32 - PAGE_SHIFT) | base_lo >> PAGE_SHIFT; *type = base_lo & 0xff; } -static void set_mtrr_up (unsigned int reg, unsigned long base, - unsigned long size, mtrr_type type, int do_safe) -/* [SUMMARY] Set variable MTRR register on the local CPU. - <reg> The register to set. - <base> The base address of the region. - <size> The size of the region. If this is 0 the region is disabled. - <type> The type of the region. - <do_safe> If TRUE, do the change safely. If FALSE, safety measures should - be done externally. - [RETURNS] Nothing. -*/ +/* + * Set variable MTRR register on the local CPU. + * <reg> The register to set. + * <base> The base address of the region. + * <size> The size of the region. 
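The rewritten get_mtrr() above recovers the region size directly from the 64-bit PhysMask value: for the contiguous, power-of-two ranges the hardware supports, the mask is the two's-complement of the size, so negating it yields the byte count. A sketch of just that decoding step, with the valid bit (bit 11) stripped first; it deliberately ignores the physical-address-width clamping that size_and_mask provides elsewhere:

    /* Decode a variable-range PhysMask lo/hi pair into a size in
     * pages.  Correct only for contiguous (power-of-two) ranges. */
    static u32 mask_to_pages(u32 mask_lo, u32 mask_hi)
    {
        u64 mask  = ((u64) mask_hi << 32) | (mask_lo & ~0x800U);
        u64 bytes = ~mask + 1;          /* -mask == size for 2^n ranges */
        return (u32) (bytes >> PAGE_SHIFT);
    }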
If this is 0 the region is disabled. + * <type> The type of the region. + * <do_safe> If TRUE, do the change safely. If FALSE, safety measures should + * be done externally. + */ +static void set_mtrr_up (unsigned int reg, u64 base, + u32 size, mtrr_type type, int do_safe) { struct set_mtrr_context ctxt; @@ -249,12 +228,12 @@ static void set_mtrr_up (unsigned int reg, unsigned long base, if (size == 0) { /* The invalid bit is kept in the mask, so we simply clear the relevant mask register to disable a range. */ - wrmsr (MTRRphysMask_MSR (reg), 0, 0); + wrmsr (MSR_MTRRphysMask(reg), 0, 0); } else { - wrmsr (MTRRphysBase_MSR (reg), base << PAGE_SHIFT | type, + wrmsr (MSR_MTRRphysBase(reg), base << PAGE_SHIFT | type, (base & size_and_mask) >> (32 - PAGE_SHIFT)); - wrmsr (MTRRphysMask_MSR (reg), -size << PAGE_SHIFT | 0x800, - (-size & size_and_mask) >> (32 - PAGE_SHIFT)); + wrmsr (MSR_MTRRphysMask(reg), (-size-1) << PAGE_SHIFT | 0x800, + ((-size-1) & size_and_mask) >> (32 - PAGE_SHIFT)); } if (do_safe) set_mtrr_done (&ctxt); @@ -264,41 +243,40 @@ static void set_mtrr_up (unsigned int reg, unsigned long base, #ifdef CONFIG_SMP struct mtrr_var_range { - unsigned long base_lo; - unsigned long base_hi; - unsigned long mask_lo; - unsigned long mask_hi; + u32 base_lo; + u32 base_hi; + u32 mask_lo; + u32 mask_hi; }; /* Get the MSR pair relating to a var range */ static void __init get_mtrr_var_range (unsigned int index, struct mtrr_var_range *vr) { - rdmsr (MTRRphysBase_MSR (index), vr->base_lo, vr->base_hi); - rdmsr (MTRRphysMask_MSR (index), vr->mask_lo, vr->mask_hi); + rdmsr (MSR_MTRRphysBase(index), vr->base_lo, vr->base_hi); + rdmsr (MSR_MTRRphysMask(index), vr->mask_lo, vr->mask_hi); } /* Set the MSR pair relating to a var range. Returns TRUE if changes are made */ -static int __init -set_mtrr_var_range_testing (unsigned int index, struct mtrr_var_range *vr) +static int __init set_mtrr_var_range_testing (unsigned int index, + struct mtrr_var_range *vr) { - unsigned int lo, hi; + u32 lo, hi; int changed = FALSE; - rdmsr (MTRRphysBase_MSR (index), lo, hi); - if ((vr->base_lo & 0xfffff0ffUL) != (lo & 0xfffff0ffUL) - || (vr->base_hi & 0xfUL) != (hi & 0xfUL)) { - wrmsr (MTRRphysBase_MSR (index), vr->base_lo, vr->base_hi); + rdmsr (MSR_MTRRphysBase(index), lo, hi); + if ((vr->base_lo & 0xfffff0ff) != (lo & 0xfffff0ff) + || (vr->base_hi & 0x000fffff) != (hi & 0x000fffff)) { + wrmsr (MSR_MTRRphysBase(index), vr->base_lo, vr->base_hi); changed = TRUE; } - rdmsr (MTRRphysMask_MSR (index), lo, hi); - - if ((vr->mask_lo & 0xfffff800UL) != (lo & 0xfffff800UL) - || (vr->mask_hi & 0xfUL) != (hi & 0xfUL)) { - wrmsr (MTRRphysMask_MSR (index), vr->mask_lo, vr->mask_hi); + rdmsr (MSR_MTRRphysMask(index), lo, hi); + if ((vr->mask_lo & 0xfffff800) != (lo & 0xfffff800) + || (vr->mask_hi & 0x000fffff) != (hi & 0x000fffff)) { + wrmsr (MSR_MTRRphysMask(index), vr->mask_lo, vr->mask_hi); changed = TRUE; } return changed; @@ -307,45 +285,50 @@ set_mtrr_var_range_testing (unsigned int index, struct mtrr_var_range *vr) static void __init get_fixed_ranges (mtrr_type * frs) { - unsigned long *p = (unsigned long *) frs; + u32 *p = (u32 *) frs; int i; - rdmsr (MTRRfix64K_00000_MSR, p[0], p[1]); + rdmsr (MSR_MTRRfix64K_00000, p[0], p[1]); for (i = 0; i < 2; i++) - rdmsr (MTRRfix16K_80000_MSR + i, p[2 + i * 2], p[3 + i * 2]); + rdmsr (MSR_MTRRfix16K_80000 + i, p[2 + i * 2], p[3 + i * 2]); for (i = 0; i < 8; i++) - rdmsr (MTRRfix4K_C0000_MSR + i, p[6 + i * 2], p[7 + i * 2]); + rdmsr (MSR_MTRRfix4K_C0000 + i, p[6 + i * 2], p[7 + i 
* 2]); } static int __init set_fixed_ranges_testing (mtrr_type * frs) { - unsigned long *p = (unsigned long *) frs; + u32 *p = (u32 *) frs; int changed = FALSE; int i; - unsigned long lo, hi; + u32 lo, hi; - rdmsr (MTRRfix64K_00000_MSR, lo, hi); + printk (KERN_INFO "mtrr: rdmsr 64K_00000\n"); + rdmsr (MSR_MTRRfix64K_00000, lo, hi); if (p[0] != lo || p[1] != hi) { - wrmsr (MTRRfix64K_00000_MSR, p[0], p[1]); + printk (KERN_INFO "mtrr: Writing %x:%x to 64K MSR. lohi were %x:%x\n", p[0], p[1], lo, hi); + wrmsr (MSR_MTRRfix64K_00000, p[0], p[1]); changed = TRUE; } + printk (KERN_INFO "mtrr: rdmsr 16K_80000\n"); for (i = 0; i < 2; i++) { - rdmsr (MTRRfix16K_80000_MSR + i, lo, hi); + rdmsr (MSR_MTRRfix16K_80000 + i, lo, hi); if (p[2 + i * 2] != lo || p[3 + i * 2] != hi) { - wrmsr (MTRRfix16K_80000_MSR + i, p[2 + i * 2], - p[3 + i * 2]); + printk (KERN_INFO "mtrr: Writing %x:%x to 16K MSR%d. lohi were %x:%x\n", p[2 + i * 2], p[3 + i * 2], i, lo, hi ); + wrmsr (MSR_MTRRfix16K_80000 + i, p[2 + i * 2], p[3 + i * 2]); changed = TRUE; } } + printk (KERN_INFO "mtrr: rdmsr 4K_C0000\n"); for (i = 0; i < 8; i++) { - rdmsr (MTRRfix4K_C0000_MSR + i, lo, hi); + rdmsr (MSR_MTRRfix4K_C0000 + i, lo, hi); + printk (KERN_INFO "mtrr: MTRRfix4K_C0000+%d = %x:%x\n", i, lo, hi); if (p[6 + i * 2] != lo || p[7 + i * 2] != hi) { - wrmsr (MTRRfix4K_C0000_MSR + i, p[6 + i * 2], - p[7 + i * 2]); + printk (KERN_INFO "mtrr: Writing %x:%x to 4K MSR%d. lohi were %x:%x\n", p[6 + i * 2], p[7 + i * 2], i, lo, hi); + wrmsr (MSR_MTRRfix4K_C0000 + i, p[6 + i * 2], p[7 + i * 2]); changed = TRUE; } } @@ -357,8 +340,8 @@ struct mtrr_state { unsigned int num_var_ranges; struct mtrr_var_range *var_ranges; mtrr_type fixed_ranges[NUM_FIXED_RANGES]; - unsigned char enabled; mtrr_type def_type; + unsigned char enabled; }; @@ -367,9 +350,9 @@ static void __init get_mtrr_state (struct mtrr_state *state) { unsigned int nvrs, i; struct mtrr_var_range *vrs; - unsigned long lo, dummy; + u32 lo, dummy; - nvrs = state->num_var_ranges = get_num_var_ranges (); + nvrs = state->num_var_ranges = get_num_var_ranges(); vrs = state->var_ranges = kmalloc (nvrs * sizeof (struct mtrr_var_range), GFP_KERNEL); if (vrs == NULL) @@ -379,7 +362,7 @@ static void __init get_mtrr_state (struct mtrr_state *state) get_mtrr_var_range (i, &vrs[i]); get_fixed_ranges (state->fixed_ranges); - rdmsr (MTRRdefType_MSR, lo, dummy); + rdmsr (MSR_MTRRdefType, lo, dummy); state->def_type = (lo & 0xff); state->enabled = (lo & 0xc00) >> 10; } @@ -393,17 +376,18 @@ static void __init finalize_mtrr_state (struct mtrr_state *state) } -static unsigned long __init set_mtrr_state (struct mtrr_state *state, +/* + * Set the MTRR state for this CPU. + * <state> The MTRR state information to read. + * <ctxt> Some relevant CPU context. + * [NOTE] The CPU must already be in a safe state for MTRR changes. + * [RETURNS] 0 if no changes made, else a mask indication what was changed. + */ +static u64 __init set_mtrr_state (struct mtrr_state *state, struct set_mtrr_context *ctxt) -/* [SUMMARY] Set the MTRR state for this CPU. - <state> The MTRR state information to read. - <ctxt> Some relevant CPU context. - [NOTE] The CPU must already be in a safe state for MTRR changes. - [RETURNS] 0 if no changes made, else a mask indication what was changed. 
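For reference, the index arithmetic in get_fixed_ranges() and set_fixed_ranges_testing() above walks the eleven fixed-range MSRs in architectural order, eight one-byte region types per register; a map of the layout those u32-pair loops assume:

    /* Fixed-range MTRR map: 11 MSRs x 8 types = 88 = NUM_FIXED_RANGES.
     *
     *   MSR_MTRRfix64K_00000        p[0..1]    0x00000-0x7ffff, 64K steps
     *   MSR_MTRRfix16K_80000 + i    p[2..5]    0x80000-0xbffff, 16K steps
     *   MSR_MTRRfix4K_C0000  + i    p[6..21]   0xc0000-0xfffff,  4K steps
     *
     * Each rdmsr()/wrmsr() moves one MSR as a lo:hi pair of u32s, so
     * MSR i of a group lands at p[base + i*2] and p[base + i*2 + 1]. */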
-*/ { unsigned int i; - unsigned long change_mask = 0; + u64 change_mask = 0; for (i = 0; i < state->num_var_ranges; i++) if (set_mtrr_var_range_testing (i, &state->var_ranges[i])) @@ -428,16 +412,16 @@ static volatile int wait_barrier_execute = FALSE; static volatile int wait_barrier_cache_enable = FALSE; struct set_mtrr_data { - unsigned long smp_base; - unsigned long smp_size; + u64 smp_base; + u32 smp_size; unsigned int smp_reg; mtrr_type smp_type; }; +/* + * Synchronisation handler. Executed by "other" CPUs. + */ static void ipi_handler (void *info) -/* [SUMMARY] Synchronisation handler. Executed by "other" CPUs. - [RETURNS] Nothing. -*/ { struct set_mtrr_data *data = info; struct set_mtrr_context ctxt; @@ -449,7 +433,7 @@ static void ipi_handler (void *info) barrier (); /* The master has cleared me to execute */ - (*set_mtrr_up) (data->smp_reg, data->smp_base, data->smp_size, + set_mtrr_up (data->smp_reg, data->smp_base, data->smp_size, data->smp_type, FALSE); /* Notify master CPU that I've executed the function */ @@ -462,8 +446,7 @@ static void ipi_handler (void *info) } -static void set_mtrr_smp (unsigned int reg, unsigned long base, - unsigned long size, mtrr_type type) +static void set_mtrr_smp (unsigned int reg, u64 base, u32 size, mtrr_type type) { struct set_mtrr_data data; struct set_mtrr_context ctxt; @@ -490,7 +473,7 @@ static void set_mtrr_smp (unsigned int reg, unsigned long base, /* Set up for completion wait and then release other CPUs to change MTRRs */ atomic_set (&undone_count, smp_num_cpus - 1); wait_barrier_execute = FALSE; - (*set_mtrr_up) (reg, base, size, type, FALSE); + set_mtrr_up (reg, base, size, type, FALSE); /* Now wait for other CPUs to complete the function */ while (atomic_read (&undone_count) > 0) @@ -505,7 +488,7 @@ static void set_mtrr_smp (unsigned int reg, unsigned long base, /* Some BIOS's are fucked and don't set all MTRRs the same! */ -static void __init mtrr_state_warn (unsigned long mask) +static void __init mtrr_state_warn (u32 mask) { if (!mask) return; @@ -521,7 +504,7 @@ static void __init mtrr_state_warn (unsigned long mask) #endif /* CONFIG_SMP */ -static char inline * attrib_to_str (int x) +static inline char * attrib_to_str (int x) { return (x <= 6) ? mtrr_strings[x] : "?"; } @@ -551,21 +534,20 @@ static void __init init_table (void) } -static int generic_get_free_region (unsigned long base, - unsigned long size) -/* [SUMMARY] Get a free MTRR. - <base> The starting (base) address of the region. - <size> The size (in bytes) of the region. - [RETURNS] The index of the region on success, else -1 on error. +/* + * Get a free MTRR. + * returns the index of the region on success, else -1 on error. */ +static int get_free_region(void) { int i, max; mtrr_type ltype; - unsigned long lbase, lsize; + u64 lbase; + u32 lsize; max = get_num_var_ranges (); for (i = 0; i < max; ++i) { - (*get_mtrr) (i, &lbase, &lsize, <ype); + get_mtrr (i, &lbase, &lsize, <ype); if (lsize == 0) return i; } @@ -573,22 +555,19 @@ static int generic_get_free_region (unsigned long base, } -static int (*get_free_region) (unsigned long base, - unsigned long size) = generic_get_free_region; - /** * mtrr_add_page - Add a memory type region * @base: Physical base address of region in pages (4 KB) * @size: Physical size of region in pages (4 KB) * @type: Type of MTRR desired * @increment: If this is true do usage counting on the region + * Returns The MTRR register on success, else a negative number + * indicating the error code. 
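The validity rules enforced by the search loop below boil down to: disjoint regions pass, partial overlap is always rejected, and full enclosure is accepted only when the types agree; an UNCACHABLE request inside a differently-typed region falls through to a fresh register, since UC wins by MTRR precedence anyway. Condensed from the loop body (mtrr_lock handling and log messages omitted):

    /* New request [base, base+size) against an existing
     * [lbase, lbase+lsize), everything in pages: */
    if (base >= lbase + lsize)
        continue;                   /* disjoint, above: keep looking  */
    if (base < lbase && base + size <= lbase)
        continue;                   /* disjoint, below: keep looking  */
    if (base < lbase || base + size > lbase + lsize)
        return -EINVAL;             /* partial overlap: never allowed */
    if (ltype != type) {
        if (type == MTRR_TYPE_UNCACHABLE)
            continue;               /* UC may nest; allocate a new reg */
        return -EINVAL;             /* genuine type conflict           */
    }
    if (increment)
        ++usage_table[i];
    return i;                       /* enclosed, same type: share slot */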
* - * Memory type region registers control the caching on newer Intel and - * non Intel processors. This function allows drivers to request an - * MTRR is added. The details and hardware specifics of each processor's - * implementation are hidden from the caller, but nevertheless the - * caller should expect to need to provide a power of two size on an - * equivalent power of two boundary. + * Memory type region registers control the caching on newer + * processors. This function allows drivers to request an MTRR is added. + * The caller should expect to need to provide a power of two size on + * an equivalent power of two boundary. * * If the region cannot be added either because all regions are in use * or the CPU cannot support it a negative value is returned. On success @@ -596,42 +575,28 @@ static int (*get_free_region) (unsigned long base, * as a cookie only. * * On a multiprocessor machine the changes are made to all processors. - * This is required on x86 by the Intel processors. * * The available types are * * %MTRR_TYPE_UNCACHABLE - No caching - * * %MTRR_TYPE_WRBACK - Write data back in bursts whenever - * * %MTRR_TYPE_WRCOMB - Write data back soon but allow bursts - * * %MTRR_TYPE_WRTHROUGH - Cache reads but not writes * * BUGS: Needs a quiet flag for the cases where drivers do not mind * failures and do not wish system log messages to be sent. */ -int mtrr_add_page (unsigned long base, unsigned long size, - unsigned int type, char increment) +int mtrr_add_page (u64 base, u32 size, unsigned int type, char increment) { -/* [SUMMARY] Add an MTRR entry. - <base> The starting (base, in pages) address of the region. - <size> The size of the region. (in pages) - <type> The type of the new region. - <increment> If true and the region already exists, the usage count will be - incremented. - [RETURNS] The MTRR register on success, else a negative number indicating - the error code. - [NOTE] This routine uses a spinlock. -*/ int i, max; mtrr_type ltype; - unsigned long lbase, lsize, last; + u64 lbase, last; + u32 lsize; if (base + size < 0x100) { printk (KERN_WARNING - "mtrr: cannot set region below 1 MiB (0x%lx000,0x%lx000)\n", + "mtrr: cannot set region below 1 MiB (0x%lx000,0x%x000)\n", base, size); return -EINVAL; } @@ -644,7 +609,7 @@ int mtrr_add_page (unsigned long base, unsigned long size, if (lbase != last) { printk (KERN_WARNING - "mtrr: base(0x%lx000) is not aligned on a size(0x%lx000) boundary\n", + "mtrr: base(0x%lx000) is not aligned on a size(0x%x000) boundary\n", base, size); return -EINVAL; } @@ -655,7 +620,7 @@ int mtrr_add_page (unsigned long base, unsigned long size, } /* If the type is WC, check that this processor supports it */ - if ((type == MTRR_TYPE_WRCOMB) && !have_wrcomb ()) { + if ((type == MTRR_TYPE_WRCOMB) && !have_wrcomb()) { printk (KERN_WARNING "mtrr: your processor doesn't support write-combining\n"); return -ENOSYS; @@ -669,9 +634,9 @@ int mtrr_add_page (unsigned long base, unsigned long size, increment = increment ? 
1 : 0; max = get_num_var_ranges (); /* Search for existing MTRR */ - down (&main_lock); + down (&mtrr_lock); for (i = 0; i < max; ++i) { - (*get_mtrr) (i, &lbase, &lsize, <ype); + get_mtrr (i, &lbase, &lsize, <ype); if (base >= lbase + lsize) continue; if ((base < lbase) && (base + size <= lbase)) @@ -679,41 +644,41 @@ int mtrr_add_page (unsigned long base, unsigned long size, /* At this point we know there is some kind of overlap/enclosure */ if ((base < lbase) || (base + size > lbase + lsize)) { - up (&main_lock); + up (&mtrr_lock); printk (KERN_WARNING - "mtrr: 0x%lx000,0x%lx000 overlaps existing" - " 0x%lx000,0x%lx000\n", base, size, lbase, - lsize); + "mtrr: 0x%lx000,0x%x000 overlaps existing" + " 0x%lx000,0x%x000\n", base, size, lbase, lsize); return -EINVAL; } /* New region is enclosed by an existing region */ if (ltype != type) { if (type == MTRR_TYPE_UNCACHABLE) continue; - up (&main_lock); + up (&mtrr_lock); printk - ("mtrr: type mismatch for %lx000,%lx000 old: %s new: %s\n", - base, size, attrib_to_str (ltype), + ("mtrr: type mismatch for %lx000,%x000 old: %s new: %s\n", + base, size, + attrib_to_str (ltype), attrib_to_str (type)); return -EINVAL; } if (increment) ++usage_table[i]; compute_ascii (); - up (&main_lock); + up (&mtrr_lock); return i; } /* Search for an empty MTRR */ - i = (*get_free_region) (base, size); + i = get_free_region(); if (i < 0) { - up (&main_lock); + up (&mtrr_lock); printk ("mtrr: no more MTRRs available\n"); return i; } set_mtrr (i, base, size, type); usage_table[i] = 1; compute_ascii (); - up (&main_lock); + up (&mtrr_lock); return i; } @@ -724,13 +689,13 @@ int mtrr_add_page (unsigned long base, unsigned long size, * @size: Physical size of region * @type: Type of MTRR desired * @increment: If this is true do usage counting on the region + * Return the MTRR register on success, else a negative numbe + * indicating the error code. * - * Memory type region registers control the caching on newer Intel and - * non Intel processors. This function allows drivers to request an - * MTRR is added. The details and hardware specifics of each processor's - * implementation are hidden from the caller, but nevertheless the - * caller should expect to need to provide a power of two size on an - * equivalent power of two boundary. + * Memory type region registers control the caching on newer processors. + * This function allows drivers to request an MTRR is added. + * The caller should expect to need to provide a power of two size on + * an equivalent power of two boundary. * * If the region cannot be added either because all regions are in use * or the CPU cannot support it a negative value is returned. On success @@ -743,33 +708,19 @@ int mtrr_add_page (unsigned long base, unsigned long size, * The available types are * * %MTRR_TYPE_UNCACHABLE - No caching - * * %MTRR_TYPE_WRBACK - Write data back in bursts whenever - * * %MTRR_TYPE_WRCOMB - Write data back soon but allow bursts - * * %MTRR_TYPE_WRTHROUGH - Cache reads but not writes * * BUGS: Needs a quiet flag for the cases where drivers do not mind * failures and do not wish system log messages to be sent. */ -int mtrr_add (unsigned long base, unsigned long size, unsigned int type, - char increment) +int mtrr_add (u64 base, u32 size, unsigned int type, char increment) { -/* [SUMMARY] Add an MTRR entry. - <base> The starting (base) address of the region. - <size> The size (in bytes) of the region. - <type> The type of the new region. 
- <increment> If true and the region already exists, the usage count will be - incremented. - [RETURNS] The MTRR register on success, else a negative number indicating - the error code. -*/ - if ((base & (PAGE_SIZE - 1)) || (size & (PAGE_SIZE - 1))) { printk ("mtrr: size and base must be multiples of 4 kiB\n"); - printk ("mtrr: size: 0x%lx base: 0x%lx\n", size, base); + printk ("mtrr: size: 0x%x base: 0x%lx\n", size, base); return -EINVAL; } return mtrr_add_page (base >> PAGE_SHIFT, size >> PAGE_SHIFT, type, @@ -792,55 +743,46 @@ int mtrr_add (unsigned long base, unsigned long size, unsigned int type, * code. */ -int mtrr_del_page (int reg, unsigned long base, unsigned long size) -/* [SUMMARY] Delete MTRR/decrement usage count. - <reg> The register. If this is less than 0 then <<base>> and <<size>> must - be supplied. - <base> The base address of the region. This is ignored if <<reg>> is >= 0. - <size> The size of the region. This is ignored if <<reg>> is >= 0. - [RETURNS] The register on success, else a negative number indicating - the error code. - [NOTE] This routine uses a spinlock. -*/ +int mtrr_del_page (int reg, u64 base, u32 size) { int i, max; mtrr_type ltype; - unsigned long lbase, lsize; + u64 lbase; + u32 lsize; max = get_num_var_ranges (); - down (&main_lock); + down (&mtrr_lock); if (reg < 0) { /* Search for existing MTRR */ for (i = 0; i < max; ++i) { - (*get_mtrr) (i, &lbase, &lsize, <ype); + get_mtrr (i, &lbase, &lsize, <ype); if (lbase == base && lsize == size) { reg = i; break; } } if (reg < 0) { - up (&main_lock); - printk ("mtrr: no MTRR for %lx000,%lx000 found\n", base, - size); + up (&mtrr_lock); + printk ("mtrr: no MTRR for %lx000,%x000 found\n", base, size); return -EINVAL; } } if (reg >= max) { - up (&main_lock); + up (&mtrr_lock); printk ("mtrr: register: %d too big\n", reg); return -EINVAL; } - (*get_mtrr) (reg, &lbase, &lsize, <ype); + get_mtrr (reg, &lbase, &lsize, <ype); if (lsize < 1) { - up (&main_lock); + up (&mtrr_lock); printk ("mtrr: MTRR %d not used\n", reg); return -EINVAL; } if (usage_table[reg] < 1) { - up (&main_lock); + up (&mtrr_lock); printk ("mtrr: reg: %d has count=0\n", reg); return -EINVAL; } @@ -848,7 +790,7 @@ int mtrr_del_page (int reg, unsigned long base, unsigned long size) if (--usage_table[reg] < 1) set_mtrr (reg, 0, 0, 0); compute_ascii (); - up (&main_lock); + up (&mtrr_lock); return reg; } @@ -868,19 +810,11 @@ int mtrr_del_page (int reg, unsigned long base, unsigned long size) * code. */ -int mtrr_del (int reg, unsigned long base, unsigned long size) -/* [SUMMARY] Delete MTRR/decrement usage count. - <reg> The register. If this is less than 0 then <<base>> and <<size>> must - be supplied. - <base> The base address of the region. This is ignored if <<reg>> is >= 0. - <size> The size of the region. This is ignored if <<reg>> is >= 0. - [RETURNS] The register on success, else a negative number indicating - the error code. 
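mtrr_del_page() above takes either an explicit register number or reg < 0 together with a (base, size) pair to search for, and it only clears the hardware range once the usage count maintained by mtrr_add_page() drains. Its deletion path, condensed (mtrr_lock handling and diagnostics omitted):

    if (reg < 0) {
        /* Resolve the register by matching base and size. */
        for (i = 0; i < max; ++i) {
            get_mtrr(i, &lbase, &lsize, &ltype);
            if (lbase == base && lsize == size) {
                reg = i;
                break;
            }
        }
        if (reg < 0)
            return -EINVAL;         /* no such region */
    }
    if (--usage_table[reg] < 1)
        set_mtrr(reg, 0, 0, 0);     /* size 0 disables the range */
    return reg;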
-*/ +int mtrr_del (int reg, u64 base, u32 size) { if ((base & (PAGE_SIZE - 1)) || (size & (PAGE_SIZE - 1))) { printk ("mtrr: size and base must be multiples of 4 kiB\n"); - printk ("mtrr: size: 0x%lx base: 0x%lx\n", size, base); + printk ("mtrr: size: 0x%x base: 0x%lx\n", size, base); return -EINVAL; } return mtrr_del_page (reg, base >> PAGE_SHIFT, size >> PAGE_SHIFT); @@ -889,8 +823,8 @@ int mtrr_del (int reg, unsigned long base, unsigned long size) #ifdef USERSPACE_INTERFACE -static int mtrr_file_add (unsigned long base, unsigned long size, - unsigned int type, char increment, struct file *file, int page) +static int mtrr_file_add (u64 base, u32 size, unsigned int type, + struct file *file, int page) { int reg, max; unsigned int *fcount = file->private_data; @@ -910,7 +844,7 @@ static int mtrr_file_add (unsigned long base, unsigned long size, if ((base & (PAGE_SIZE - 1)) || (size & (PAGE_SIZE - 1))) { printk ("mtrr: size and base must be multiples of 4 kiB\n"); - printk ("mtrr: size: 0x%lx base: 0x%lx\n", size, base); + printk ("mtrr: size: 0x%x base: 0x%lx\n", size, base); return -EINVAL; } base >>= PAGE_SHIFT; @@ -925,7 +859,7 @@ static int mtrr_file_add (unsigned long base, unsigned long size, } -static int mtrr_file_del (unsigned long base, unsigned long size, +static int mtrr_file_del (u64 base, u32 size, struct file *file, int page) { int reg; @@ -935,7 +869,7 @@ static int mtrr_file_del (unsigned long base, unsigned long size, if ((base & (PAGE_SIZE - 1)) || (size & (PAGE_SIZE - 1))) { printk ("mtrr: size and base must be multiples of 4 kiB\n"); - printk ("mtrr: size: 0x%lx base: 0x%lx\n", size, base); + printk ("mtrr: size: 0x%x base: 0x%lx\n", size, base); return -EINVAL; } base >>= PAGE_SHIFT; @@ -977,9 +911,9 @@ static ssize_t mtrr_write (struct file *file, const char *buf, "disable=%d" */ { - int i, err; - unsigned long reg; - unsigned long long base, size; + int i, err, reg; + u64 base; + u32 size; char *ptr; char line[LINE_SIZE]; @@ -1027,7 +961,7 @@ static ssize_t mtrr_write (struct file *file, const char *buf, if ((base & 0xfff) || (size & 0xfff)) { printk ("mtrr: size and base must be multiples of 4 kiB\n"); - printk ("mtrr: size: 0x%Lx base: 0x%Lx\n", size, base); + printk ("mtrr: size: 0x%x base: 0x%lx\n", size, base); return -EINVAL; } @@ -1046,9 +980,7 @@ static ssize_t mtrr_write (struct file *file, const char *buf, continue; base >>= PAGE_SHIFT; size >>= PAGE_SHIFT; - err = - mtrr_add_page ((unsigned long) base, (unsigned long) size, - i, 1); + err = mtrr_add_page ((u64) base, size, i, 1); if (err < 0) return err; return len; @@ -1076,7 +1008,7 @@ static int mtrr_ioctl (struct inode *inode, struct file *file, if (copy_from_user (&sentry, (void *) arg, sizeof sentry)) return -EFAULT; err = - mtrr_file_add (sentry.base, sentry.size, sentry.type, 1, + mtrr_file_add (sentry.base, sentry.size, sentry.type, file, 0); if (err < 0) return err; @@ -1117,7 +1049,7 @@ static int mtrr_ioctl (struct inode *inode, struct file *file, return -EFAULT; if (gentry.regnum >= get_num_var_ranges ()) return -EINVAL; - (*get_mtrr) (gentry.regnum, &gentry.base, &gentry.size, &type); + get_mtrr (gentry.regnum, &gentry.base, &gentry.size, &type); /* Hide entries that go above 4GB */ if (gentry.base + gentry.size > 0x100000 @@ -1139,7 +1071,7 @@ static int mtrr_ioctl (struct inode *inode, struct file *file, if (copy_from_user (&sentry, (void *) arg, sizeof sentry)) return -EFAULT; err = - mtrr_file_add (sentry.base, sentry.size, sentry.type, 1, + mtrr_file_add (sentry.base, sentry.size, 
sentry.type, file, 1); if (err < 0) return err; @@ -1180,7 +1112,7 @@ static int mtrr_ioctl (struct inode *inode, struct file *file, return -EFAULT; if (gentry.regnum >= get_num_var_ranges ()) return -EINVAL; - (*get_mtrr) (gentry.regnum, &gentry.base, &gentry.size, &type); + get_mtrr (gentry.regnum, &gentry.base, &gentry.size, &type); gentry.type = type; if (copy_to_user ((void *) arg, &gentry, sizeof gentry)) @@ -1199,7 +1131,6 @@ static int mtrr_close (struct inode *ino, struct file *file) if (fcount == NULL) return 0; - lock_kernel (); max = get_num_var_ranges (); for (i = 0; i < max; ++i) { while (fcount[i] > 0) { @@ -1208,7 +1139,6 @@ static int mtrr_close (struct inode *ino, struct file *file) --fcount[i]; } } - unlock_kernel (); kfree (fcount); file->private_data = NULL; return 0; @@ -1234,12 +1164,13 @@ static void compute_ascii (void) char factor; int i, max; mtrr_type type; - unsigned long base, size; + u64 base; + u32 size; ascii_buf_bytes = 0; max = get_num_var_ranges (); for (i = 0; i < max; i++) { - (*get_mtrr) (i, &base, &size, &type); + get_mtrr (i, &base, &size, &type); if (size == 0) usage_table[i] = 0; else { @@ -1253,11 +1184,10 @@ static void compute_ascii (void) } sprintf (ascii_buffer + ascii_buf_bytes, - "reg%02i: base=0x%05lx000 (%4liMB), size=%4li%cB: %s, count=%d\n", + "reg%02i: base=0x%05lx000 (%4liMB), size=%4i%cB: %s, count=%d\n", i, base, base >> (20 - PAGE_SHIFT), size, factor, attrib_to_str (type), usage_table[i]); - ascii_buf_bytes += - strlen (ascii_buffer + ascii_buf_bytes); + ascii_buf_bytes += strlen (ascii_buffer + ascii_buf_bytes); } } devfs_set_file_size (devfs_handle, ascii_buf_bytes); @@ -1283,22 +1213,16 @@ static void __init mtrr_setup (void) if ((cpuid_eax (0x80000000) >= 0x80000008)) { u32 phys_addr; phys_addr = cpuid_eax (0x80000008) & 0xff; - size_or_mask = - ~((1 << (phys_addr - PAGE_SHIFT)) - 1); - size_and_mask = ~size_or_mask & 0xfff00000; - } else { - /* FIXME: This is to make it work on Athlon during debugging. */ - size_or_mask = 0xff000000; /* 36 bits */ - size_and_mask = 0x00f00000; + size_or_mask = ~((1 << (phys_addr - PAGE_SHIFT)) - 1); + size_and_mask = ~size_or_mask & 0xfffffffffff00000; } - printk ("mtrr: detected mtrr type: x86-64\n"); } } #ifdef CONFIG_SMP -static volatile unsigned long smp_changes_mask __initdata = 0; +static volatile u32 smp_changes_mask __initdata = 0; static struct mtrr_state smp_mtrr_state __initdata = { 0, 0 }; void __init mtrr_init_boot_cpu (void) @@ -1310,7 +1234,8 @@ void __init mtrr_init_boot_cpu (void) void __init mtrr_init_secondary_cpu (void) { - unsigned long mask, count; + u64 mask; + int count; struct set_mtrr_context ctxt; /* Note that this is not ideal, since the cache is only flushed/disabled @@ -1357,4 +1282,3 @@ int __init mtrr_init (void) init_table (); return 0; } - diff --git a/arch/x86_64/kernel/process.c b/arch/x86_64/kernel/process.c index e233b3557ce5..f00fff0638de 100644 --- a/arch/x86_64/kernel/process.c +++ b/arch/x86_64/kernel/process.c @@ -39,6 +39,7 @@ #include <linux/reboot.h> #include <linux/init.h> #include <linux/ctype.h> +#include <linux/slab.h> #include <asm/uaccess.h> #include <asm/pgtable.h> @@ -320,9 +321,6 @@ void show_regs(struct pt_regs * regs) printk("CR2: %016lx CR3: %016lx CR4: %016lx\n", cr2, cr3, cr4); } -#define __STR(x) #x -#define __STR2(x) __STR(x) - extern void load_gs_index(unsigned); /* @@ -330,7 +328,13 @@ extern void load_gs_index(unsigned); */ void exit_thread(void) { - /* nothing to do ... 
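exit_thread() stops being a no-op because the I/O permission bitmap is now allocated lazily: sys_ioperm() kmalloc()s thread.io_bitmap_ptr on first use, copy_thread() gives children of ioperm() users their own copy, and exit as the third site must free the buffer and invalidate the TSS offset. The lifecycle, condensed from the three hunks in this commit:

    /* 1. First ioperm() call allocates the bitmap (ioport.c above). */
    t->io_bitmap_ptr = kmalloc((IO_BITMAP_SIZE+1)*4, GFP_KERNEL);

    /* 2. fork/clone duplicates it for the child (copy_thread below). */
    memcpy(p->thread.io_bitmap_ptr, me->thread.io_bitmap_ptr,
           (IO_BITMAP_SIZE+1)*4);

    /* 3. exit frees it and points the TSS at nothing, so no stale
     *    I/O permissions survive the task (exit_thread here). */
    kfree(me->thread.io_bitmap_ptr);
    me->thread.io_bitmap_ptr = NULL;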
*/ + struct task_struct *me = current; + if (me->thread.io_bitmap_ptr) { + kfree(me->thread.io_bitmap_ptr); + me->thread.io_bitmap_ptr = NULL; + (init_tss + smp_processor_id())->io_map_base = + INVALID_IO_BITMAP_OFFSET; + } } void flush_thread(void) @@ -392,6 +396,14 @@ int copy_thread(int nr, unsigned long clone_flags, unsigned long rsp, unlazy_fpu(current); p->thread.i387 = current->thread.i387; + if (unlikely(me->thread.io_bitmap_ptr != NULL)) { + p->thread.io_bitmap_ptr = kmalloc((IO_BITMAP_SIZE+1)*4, GFP_KERNEL); + if (!p->thread.io_bitmap_ptr) + return -ENOMEM; + memcpy(p->thread.io_bitmap_ptr, me->thread.io_bitmap_ptr, + (IO_BITMAP_SIZE+1)*4); + } + return 0; } @@ -491,21 +503,14 @@ void __switch_to(struct task_struct *prev_p, struct task_struct *next_p) /* * Handle the IO bitmap */ - if (unlikely(prev->ioperm || next->ioperm)) { - if (next->ioperm) { + if (unlikely(prev->io_bitmap_ptr || next->io_bitmap_ptr)) { + if (next->io_bitmap_ptr) { /* * 4 cachelines copy ... not good, but not that * bad either. Anyone got something better? * This only affects processes which use ioperm(). - * [Putting the TSSs into 4k-tlb mapped regions - * and playing VM tricks to switch the IO bitmap - * is not really acceptable.] - * On x86-64 we could put multiple bitmaps into - * the GDT and just switch offsets - * This would require ugly special cases on overflow - * though -AK */ - memcpy(tss->io_bitmap, next->io_bitmap, + memcpy(tss->io_bitmap, next->io_bitmap_ptr, IO_BITMAP_SIZE*sizeof(u32)); tss->io_map_base = IO_BITMAP_OFFSET; } else { diff --git a/arch/x86_64/kernel/setup64.c b/arch/x86_64/kernel/setup64.c index f6c296dce4b5..66ae787c8d19 100644 --- a/arch/x86_64/kernel/setup64.c +++ b/arch/x86_64/kernel/setup64.c @@ -91,6 +91,9 @@ void pda_init(int cpu) pda->me = pda; pda->cpudata_offset = 0; + pda->active_mm = &init_mm; + pda->mmu_state = 0; + asm volatile("movl %0,%%fs ; movl %0,%%gs" :: "r" (0)); wrmsrl(MSR_GS_BASE, cpu_pda + cpu); } diff --git a/arch/x86_64/kernel/signal.c b/arch/x86_64/kernel/signal.c index 98b653afe853..229592faf805 100644 --- a/arch/x86_64/kernel/signal.c +++ b/arch/x86_64/kernel/signal.c @@ -84,7 +84,6 @@ struct rt_sigframe char *pretcode; struct ucontext uc; struct siginfo info; - struct _fpstate fpstate; }; static int @@ -186,8 +185,7 @@ badframe: */ static int -setup_sigcontext(struct sigcontext *sc, struct _fpstate *fpstate, - struct pt_regs *regs, unsigned long mask) +setup_sigcontext(struct sigcontext *sc, struct pt_regs *regs, unsigned long mask) { int tmp, err = 0; struct task_struct *me = current; @@ -221,20 +219,17 @@ setup_sigcontext(struct sigcontext *sc, struct _fpstate *fpstate, err |= __put_user(mask, &sc->oldmask); err |= __put_user(me->thread.cr2, &sc->cr2); - tmp = save_i387(fpstate); - if (tmp < 0) - err = 1; - else - err |= __put_user(tmp ? fpstate : NULL, &sc->fpstate); - return err; } /* * Determine which stack to use.. 
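With struct _fpstate dropped from struct rt_sigframe, setup_rt_frame() now carves the two objects out of the user stack separately, each sized on its own. A sketch of the resulting arithmetic, using the round_down() helper defined just below:

    /* Signal stack carving when FP state is live (stack grows down):
     *
     *     fp    = get_stack(ka, regs, sizeof(struct _fpstate));
     *     frame = round_down((char *) fp - sizeof(struct rt_sigframe), 16) - 8;
     *
     * round_down(..., 16) keeps the ABI's 16-byte stack alignment;
     * the extra -8 leaves rsp % 16 == 8 at handler entry, the same
     * alignment a call instruction would have produced. */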
*/ -static inline struct rt_sigframe * -get_sigframe(struct k_sigaction *ka, struct pt_regs * regs) + +#define round_down(p, r) ((void *) ((unsigned long)((p) - (r) + 1) & ~((r)-1))) + +static void * +get_stack(struct k_sigaction *ka, struct pt_regs *regs, unsigned long size) { unsigned long rsp; @@ -247,22 +242,34 @@ get_sigframe(struct k_sigaction *ka, struct pt_regs * regs) rsp = current->sas_ss_sp + current->sas_ss_size; } - rsp = (rsp - sizeof(struct _fpstate)) & ~(15UL); - rsp -= offsetof(struct rt_sigframe, fpstate); - - return (struct rt_sigframe *) rsp; + return round_down(rsp - size, 16); } static void setup_rt_frame(int sig, struct k_sigaction *ka, siginfo_t *info, sigset_t *set, struct pt_regs * regs) { - struct rt_sigframe *frame; + struct rt_sigframe *frame = NULL; + struct _fpstate *fp = NULL; int err = 0; - frame = get_sigframe(ka, regs); + if (current->used_math) { + fp = get_stack(ka, regs, sizeof(struct _fpstate)); + frame = round_down((char *)fp - sizeof(struct rt_sigframe), 16) - 8; - if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + if (!access_ok(VERIFY_WRITE, fp, sizeof(struct _fpstate))) { goto give_sigsegv; + } + + if (save_i387(fp) < 0) + err |= -1; + } + + if (!frame) + frame = get_stack(ka, regs, sizeof(struct rt_sigframe)) - 8; + + if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) { + goto give_sigsegv; + } if (ka->sa.sa_flags & SA_SIGINFO) { err |= copy_siginfo_to_user(&frame->info, info); @@ -278,14 +285,10 @@ static void setup_rt_frame(int sig, struct k_sigaction *ka, siginfo_t *info, err |= __put_user(sas_ss_flags(regs->rsp), &frame->uc.uc_stack.ss_flags); err |= __put_user(current->sas_ss_size, &frame->uc.uc_stack.ss_size); - err |= setup_sigcontext(&frame->uc.uc_mcontext, &frame->fpstate, - regs, set->sig[0]); + err |= setup_sigcontext(&frame->uc.uc_mcontext, regs, set->sig[0]); + err |= __put_user(fp, &frame->uc.uc_mcontext.fpstate); err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set)); - if (err) { - goto give_sigsegv; - } - /* Set up to return from userspace. If provided, use a stub already in userspace. */ /* x86-64 should always use SA_RESTORER. */ @@ -297,7 +300,6 @@ static void setup_rt_frame(int sig, struct k_sigaction *ka, siginfo_t *info, } if (err) { - printk("fault 3\n"); goto give_sigsegv; } @@ -305,7 +307,6 @@ static void setup_rt_frame(int sig, struct k_sigaction *ka, siginfo_t *info, printk("%d old rip %lx old rsp %lx old rax %lx\n", current->pid,regs->rip,regs->rsp,regs->rax); #endif - /* Set up registers for signal handler */ { struct exec_domain *ed = current_thread_info()->exec_domain; @@ -320,9 +321,10 @@ static void setup_rt_frame(int sig, struct k_sigaction *ka, siginfo_t *info, next argument after the signal number on the stack. */ regs->rsi = (unsigned long)&frame->info; regs->rdx = (unsigned long)&frame->uc; - regs->rsp = (unsigned long) frame; regs->rip = (unsigned long) ka->sa.sa_handler; + regs->rsp = (unsigned long)frame; + set_fs(USER_DS); regs->eflags &= ~TF_MASK; diff --git a/arch/x86_64/kernel/smp.c b/arch/x86_64/kernel/smp.c index 3d6e8a406b54..f0d99edfec0e 100644 --- a/arch/x86_64/kernel/smp.c +++ b/arch/x86_64/kernel/smp.c @@ -25,8 +25,6 @@ /* The 'big kernel lock' */ spinlock_t kernel_flag __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED; -struct tlb_state cpu_tlbstate[NR_CPUS] = {[0 ... NR_CPUS-1] = { &init_mm, 0 }}; - /* * the following functions deal with sending IPIs between CPUs. 
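The smp.c hunks below retire the global cpu_tlbstate[NR_CPUS] array in favour of the mmu_state/active_mm fields in the x86-64 per-CPU data area (initialized in pda_init() earlier in this commit), turning the TLB hot paths into single GS-relative loads. The before/after access pattern, assuming the read_pda() accessor from the x86-64 pda.h:

    /* Before: a shared array indexed by CPU number. */
    if (cpu_tlbstate[cpu].state == TLBSTATE_LAZY)
        leave_mm(cpu);

    /* After: one GS-relative load; the state is CPU-local by
     * construction, with no index arithmetic and no false sharing
     * of the array's cache lines between CPUs. */
    if (read_pda(mmu_state) == TLBSTATE_LAZY)
        leave_mm(cpu);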
* @@ -147,9 +145,9 @@ static spinlock_t tlbstate_lock = SPIN_LOCK_UNLOCKED; */ static void inline leave_mm (unsigned long cpu) { - if (cpu_tlbstate[cpu].state == TLBSTATE_OK) + if (read_pda(mmu_state) == TLBSTATE_OK) BUG(); - clear_bit(cpu, &cpu_tlbstate[cpu].active_mm->cpu_vm_mask); + clear_bit(cpu, &read_pda(active_mm)->cpu_vm_mask); __flush_tlb(); } @@ -164,18 +162,18 @@ static void inline leave_mm (unsigned long cpu) * the other cpus, but smp_invalidate_interrupt ignore flush ipis * for the wrong mm, and in the worst case we perform a superflous * tlb flush. - * 1a2) set cpu_tlbstate to TLBSTATE_OK + * 1a2) set cpu mmu_state to TLBSTATE_OK * Now the smp_invalidate_interrupt won't call leave_mm if cpu0 * was in lazy tlb mode. - * 1a3) update cpu_tlbstate[].active_mm + * 1a3) update cpu active_mm * Now cpu0 accepts tlb flushes for the new mm. * 1a4) set_bit(cpu, &new_mm->cpu_vm_mask); * Now the other cpus will send tlb flush ipis. * 1a4) change cr3. * 1b) thread switch without mm change - * cpu_tlbstate[].active_mm is correct, cpu0 already handles + * cpu active_mm is correct, cpu0 already handles * flush ipis. - * 1b1) set cpu_tlbstate to TLBSTATE_OK + * 1b1) set cpu mmu_state to TLBSTATE_OK * 1b2) test_and_set the cpu bit in cpu_vm_mask. * Atomically set the bit [other cpus will start sending flush ipis], * and test the bit. @@ -188,7 +186,7 @@ static void inline leave_mm (unsigned long cpu) * runs in kernel space, the cpu could load tlb entries for user space * pages. * - * The good news is that cpu_tlbstate is local to each cpu, no + * The good news is that cpu mmu_state is local to each cpu, no * write/read ordering problems. */ @@ -216,8 +214,8 @@ asmlinkage void smp_invalidate_interrupt (void) * BUG(); */ - if (flush_mm == cpu_tlbstate[cpu].active_mm) { - if (cpu_tlbstate[cpu].state == TLBSTATE_OK) { + if (flush_mm == read_pda(active_mm)) { + if (read_pda(mmu_state) == TLBSTATE_OK) { if (flush_va == FLUSH_ALL) local_flush_tlb(); else @@ -335,7 +333,7 @@ static inline void do_flush_tlb_all_local(void) unsigned long cpu = smp_processor_id(); __flush_tlb_all(); - if (cpu_tlbstate[cpu].state == TLBSTATE_LAZY) + if (read_pda(mmu_state) == TLBSTATE_LAZY) leave_mm(cpu); } diff --git a/arch/x86_64/kernel/vsyscall.c b/arch/x86_64/kernel/vsyscall.c index b292ca527a8a..e576e9f98ec5 100644 --- a/arch/x86_64/kernel/vsyscall.c +++ b/arch/x86_64/kernel/vsyscall.c @@ -47,7 +47,7 @@ #define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr))) -#define NO_VSYSCALL 1 +//#define NO_VSYSCALL 1 #ifdef NO_VSYSCALL #include <asm/unistd.h> diff --git a/arch/x86_64/kernel/x8664_ksyms.c b/arch/x86_64/kernel/x8664_ksyms.c index 9d88edb5c62d..2bbb7d8238b5 100644 --- a/arch/x86_64/kernel/x8664_ksyms.c +++ b/arch/x86_64/kernel/x8664_ksyms.c @@ -189,3 +189,5 @@ EXPORT_SYMBOL_NOVERS(do_softirq_thunk); void out_of_line_bug(void); EXPORT_SYMBOL(out_of_line_bug); + +EXPORT_SYMBOL(init_level4_pgt); diff --git a/arch/x86_64/lib/Makefile b/arch/x86_64/lib/Makefile index 8fbcee522aeb..6791678212ed 100644 --- a/arch/x86_64/lib/Makefile +++ b/arch/x86_64/lib/Makefile @@ -12,7 +12,7 @@ obj-y = csum-partial.o csum-copy.o csum-wrappers.o delay.o \ thunk.o io.o clear_page.o copy_page.o obj-y += memcpy.o obj-y += memmove.o -#obj-y += memset.o +obj-y += memset.o obj-y += copy_user.o export-objs := io.o csum-wrappers.o csum-partial.o diff --git a/arch/x86_64/lib/memset.S b/arch/x86_64/lib/memset.S index 1c5d73cd73b8..44ce1223d832 100644 --- a/arch/x86_64/lib/memset.S +++ b/arch/x86_64/lib/memset.S @@ -1,6 +1,4 
@@ -/* Copyright 2002 Andi Kleen, SuSE Labs */ - - // #define FIX_ALIGNMENT 1 +/* Copyright 2002 Andi Kleen */ /* * ISO C memset - set a memory block to a byte value. @@ -11,51 +9,51 @@ * * rax original destination */ - .globl ____memset + .globl __memset + .globl memset .p2align -____memset: - movq %rdi,%r10 /* save destination for return address */ - movq %rdx,%r11 /* save count */ +memset: +__memset: + movq %rdi,%r10 + movq %rdx,%r11 /* expand byte value */ - movzbl %sil,%ecx /* zero extend char value */ - movabs $0x0101010101010101,%rax /* expansion pattern */ - mul %rcx /* expand with rax, clobbers rdx */ + movzbl %sil,%ecx + movabs $0x0101010101010101,%rax + mul %rcx /* with rax, clobbers rdx */ -#ifdef FIX_ALIGNMENT /* align dst */ movl %edi,%r9d - andl $7,%r9d /* test unaligned bits */ + andl $7,%r9d jnz bad_alignment after_bad_alignment: -#endif - movq %r11,%rcx /* restore count */ - shrq $6,%rcx /* divide by 64 */ - jz handle_tail /* block smaller than 64 bytes? */ - movl $64,%r8d /* CSE loop block size */ + movq %r11,%rcx + movl $64,%r8d + shrq $6,%rcx + jz handle_tail loop_64: - movnti %rax,0*8(%rdi) - movnti %rax,1*8(%rdi) - movnti %rax,2*8(%rdi) - movnti %rax,3*8(%rdi) - movnti %rax,4*8(%rdi) - movnti %rax,5*8(%rdi) - movnti %rax,6*8(%rdi) - movnti %rax,7*8(%rdi) /* clear 64 byte blocks */ - addq %r8,%rdi /* increase pointer by 64 bytes */ - loop loop_64 /* decrement rcx and if not zero loop */ + movnti %rax,(%rdi) + movnti %rax,8(%rdi) + movnti %rax,16(%rdi) + movnti %rax,24(%rdi) + movnti %rax,32(%rdi) + movnti %rax,40(%rdi) + movnti %rax,48(%rdi) + movnti %rax,56(%rdi) + addq %r8,%rdi + loop loop_64 /* Handle tail in loops. The loops should be faster than hard to predict jump tables. */ handle_tail: movl %r11d,%ecx - andl $63,%ecx - shrl $3,%ecx + andl $63&(~7),%ecx jz handle_7 + shrl $3,%ecx loop_8: - movnti %rax,(%rdi) /* long words */ + movnti %rax,(%rdi) addq $8,%rdi loop loop_8 @@ -64,22 +62,20 @@ handle_7: andl $7,%ecx jz ende loop_1: - movb %al,(%rdi) /* bytes */ - incq %rdi + movb %al,(%rdi) + addq $1,%rdi loop loop_1 ende: movq %r10,%rax ret -#ifdef FIX_ALIGNMENT bad_alignment: - andq $-8,%r11 /* shorter than 8 bytes */ - jz handle_7 /* if yes handle it in the tail code */ - movnti %rax,(%rdi) /* unaligned store of 8 bytes */ + cmpq $7,%r11 + jbe handle_7 + movnti %rax,(%rdi) /* unaligned store */ movq $8,%r8 - subq %r9,%r8 /* compute alignment (8-misalignment) */ - addq %r8,%rdi /* fix destination */ - subq %r8,%r11 /* fix count */ + subq %r9,%r8 + addq %r8,%rdi + subq %r8,%r11 jmp after_bad_alignment -#endif diff --git a/drivers/block/DAC960.c b/drivers/block/DAC960.c index d57dc51df3f5..210449ad1715 100644 --- a/drivers/block/DAC960.c +++ b/drivers/block/DAC960.c @@ -28,6 +28,7 @@ #include <linux/types.h> #include <linux/blk.h> #include <linux/blkdev.h> +#include <linux/bio.h> #include <linux/completion.h> #include <linux/delay.h> #include <linux/genhd.h> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c index 9ae961460ff2..e06fd274b653 100644 --- a/drivers/block/cciss.c +++ b/drivers/block/cciss.c @@ -30,6 +30,7 @@ #include <linux/delay.h> #include <linux/major.h> #include <linux/fs.h> +#include <linux/bio.h> #include <linux/blkpg.h> #include <linux/timer.h> #include <linux/proc_fs.h> diff --git a/drivers/block/cpqarray.c b/drivers/block/cpqarray.c index 727cdeb23c0c..fccef1bb792c 100644 --- a/drivers/block/cpqarray.c +++ b/drivers/block/cpqarray.c @@ -24,6 +24,7 @@ #include <linux/version.h> #include <linux/types.h> #include <linux/pci.h> +#include 
<linux/bio.h> #include <linux/kernel.h> #include <linux/slab.h> #include <linux/delay.h> diff --git a/drivers/block/elevator.c b/drivers/block/elevator.c index 189814dbc7d1..cd3a4254e9e3 100644 --- a/drivers/block/elevator.c +++ b/drivers/block/elevator.c @@ -28,6 +28,7 @@ #include <linux/fs.h> #include <linux/blkdev.h> #include <linux/elevator.h> +#include <linux/bio.h> #include <linux/blk.h> #include <linux/config.h> #include <linux/module.h> diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c index 94f42b356556..aff8acff0ef3 100644 --- a/drivers/block/floppy.c +++ b/drivers/block/floppy.c @@ -165,6 +165,7 @@ static int print_unex=1; #include <linux/errno.h> #include <linux/slab.h> #include <linux/mm.h> +#include <linux/bio.h> #include <linux/string.h> #include <linux/fcntl.h> #include <linux/delay.h> diff --git a/drivers/block/ll_rw_blk.c b/drivers/block/ll_rw_blk.c index d53122b1ae46..16abcb3f5481 100644 --- a/drivers/block/ll_rw_blk.c +++ b/drivers/block/ll_rw_blk.c @@ -18,6 +18,7 @@ #include <linux/errno.h> #include <linux/string.h> #include <linux/config.h> +#include <linux/bio.h> #include <linux/mm.h> #include <linux/swap.h> #include <linux/init.h> @@ -2002,8 +2003,8 @@ int __init blk_dev_init(void) queue_nr_requests = (total_ram >> 8) & ~15; /* One per quarter-megabyte */ if (queue_nr_requests < 32) queue_nr_requests = 32; - if (queue_nr_requests > 512) - queue_nr_requests = 512; + if (queue_nr_requests > 256) + queue_nr_requests = 256; /* * Batch frees according to queue length diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 5689de41b771..982604ff6bfd 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -60,6 +60,7 @@ #include <linux/sched.h> #include <linux/fs.h> #include <linux/file.h> +#include <linux/bio.h> #include <linux/stat.h> #include <linux/errno.h> #include <linux/major.h> @@ -168,6 +169,15 @@ static void figure_loop_size(struct loop_device *lo) } +static inline int lo_do_transfer(struct loop_device *lo, int cmd, char *rbuf, + char *lbuf, int size, int rblock) +{ + if (!lo->transfer) + return 0; + + return lo->transfer(lo, cmd, rbuf, lbuf, size, rblock); +} + static int do_lo_send(struct loop_device *lo, struct bio_vec *bvec, int bsize, loff_t pos) { @@ -454,20 +464,43 @@ static struct bio *loop_get_buffer(struct loop_device *lo, struct bio *rbh) out_bh: bio->bi_sector = rbh->bi_sector + (lo->lo_offset >> 9); bio->bi_rw = rbh->bi_rw; - spin_lock_irq(&lo->lo_lock); bio->bi_bdev = lo->lo_device; - spin_unlock_irq(&lo->lo_lock); return bio; } -static int loop_make_request(request_queue_t *q, struct bio *rbh) +static int +bio_transfer(struct loop_device *lo, struct bio *to_bio, + struct bio *from_bio) { - struct bio *bh = NULL; + unsigned long IV = loop_get_iv(lo, from_bio->bi_sector); + struct bio_vec *from_bvec, *to_bvec; + char *vto, *vfrom; + int ret = 0, i; + + __bio_for_each_segment(from_bvec, from_bio, i, 0) { + to_bvec = &to_bio->bi_io_vec[i]; + + kmap(from_bvec->bv_page); + kmap(to_bvec->bv_page); + vfrom = page_address(from_bvec->bv_page) + from_bvec->bv_offset; + vto = page_address(to_bvec->bv_page) + to_bvec->bv_offset; + ret |= lo_do_transfer(lo, bio_data_dir(to_bio), vto, vfrom, + from_bvec->bv_len, IV); + kunmap(from_bvec->bv_page); + kunmap(to_bvec->bv_page); + } + + return ret; +} + +static int loop_make_request(request_queue_t *q, struct bio *old_bio) +{ + struct bio *new_bio = NULL; struct loop_device *lo; unsigned long IV; - int rw = bio_rw(rbh); - int unit = minor(to_kdev_t(rbh->bi_bdev->bd_dev)); + int rw = 
bio_rw(old_bio); + int unit = minor(to_kdev_t(old_bio->bi_bdev->bd_dev)); if (unit >= max_loop) goto out; @@ -489,60 +522,41 @@ static int loop_make_request(request_queue_t *q, struct bio *rbh) goto err; } - blk_queue_bounce(q, &rbh); + blk_queue_bounce(q, &old_bio); /* * file backed, queue for loop_thread to handle */ if (lo->lo_flags & LO_FLAGS_DO_BMAP) { - loop_add_bio(lo, rbh); + loop_add_bio(lo, old_bio); return 0; } /* * piggy old buffer on original, and submit for I/O */ - bh = loop_get_buffer(lo, rbh); - IV = loop_get_iv(lo, rbh->bi_sector); + new_bio = loop_get_buffer(lo, old_bio); + IV = loop_get_iv(lo, old_bio->bi_sector); if (rw == WRITE) { - if (lo_do_transfer(lo, WRITE, bio_data(bh), bio_data(rbh), - bh->bi_size, IV)) + if (bio_transfer(lo, new_bio, old_bio)) goto err; } - generic_make_request(bh); + generic_make_request(new_bio); return 0; err: if (atomic_dec_and_test(&lo->lo_pending)) up(&lo->lo_bh_mutex); - loop_put_buffer(bh); + loop_put_buffer(new_bio); out: - bio_io_error(rbh); + bio_io_error(old_bio); return 0; inactive: spin_unlock_irq(&lo->lo_lock); goto out; } -static int do_bio_blockbacked(struct loop_device *lo, struct bio *bio, - struct bio *rbh) -{ - unsigned long IV = loop_get_iv(lo, rbh->bi_sector); - struct bio_vec *from; - char *vto, *vfrom; - int ret = 0, i; - - bio_for_each_segment(from, rbh, i) { - vfrom = page_address(from->bv_page) + from->bv_offset; - vto = page_address(bio->bi_io_vec[i].bv_page) + bio->bi_io_vec[i].bv_offset; - ret |= lo_do_transfer(lo, bio_data_dir(bio), vto, vfrom, - from->bv_len, IV); - } - - return ret; -} - static inline void loop_handle_bio(struct loop_device *lo, struct bio *bio) { int ret; @@ -556,7 +570,7 @@ static inline void loop_handle_bio(struct loop_device *lo, struct bio *bio) } else { struct bio *rbh = bio->bi_private; - ret = do_bio_blockbacked(lo, bio, rbh); + ret = bio_transfer(lo, bio, rbh); bio_endio(rbh, !ret); loop_put_buffer(bio); @@ -588,10 +602,8 @@ static int loop_thread(void *data) set_user_nice(current, -20); - spin_lock_irq(&lo->lo_lock); lo->lo_state = Lo_bound; atomic_inc(&lo->lo_pending); - spin_unlock_irq(&lo->lo_lock); /* * up sem, we are running diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index 67344c7fcc1a..697e825c3a91 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -39,6 +39,7 @@ #include <linux/init.h> #include <linux/sched.h> #include <linux/fs.h> +#include <linux/bio.h> #include <linux/stat.h> #include <linux/errno.h> #include <linux/file.h> diff --git a/drivers/block/rd.c b/drivers/block/rd.c index 4faf52c7be5c..7b60e75d5584 100644 --- a/drivers/block/rd.c +++ b/drivers/block/rd.c @@ -45,6 +45,8 @@ #include <linux/config.h> #include <linux/string.h> #include <linux/slab.h> +#include <asm/atomic.h> +#include <linux/bio.h> #include <linux/module.h> #include <linux/init.h> #include <linux/devfs_fs_kernel.h> diff --git a/drivers/block/umem.c b/drivers/block/umem.c index 8c61688cab1c..44909021aa06 100644 --- a/drivers/block/umem.c +++ b/drivers/block/umem.c @@ -37,6 +37,7 @@ #include <linux/config.h> #include <linux/sched.h> #include <linux/fs.h> +#include <linux/bio.h> #include <linux/kernel.h> #include <linux/mm.h> #include <linux/mman.h> diff --git a/drivers/char/agp/agp.h b/drivers/char/agp/agp.h index be8178161e80..94e405104df4 100644 --- a/drivers/char/agp/agp.h +++ b/drivers/char/agp/agp.h @@ -118,8 +118,8 @@ struct agp_bridge_data { int (*remove_memory) (agp_memory *, off_t, int); agp_memory *(*alloc_by_type) (size_t, int); void (*free_by_type) (agp_memory *); - 
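The loop.c rework above replaces two copies of the same segment-walking code (one in the write path, one in the removed do_bio_blockbacked()) with a single bio_transfer() used by both, wrapping each segment in kmap()/kunmap() because bio pages may live in highmem. The shape of that loop, with invented names (seg and xfer are illustrative, not kernel API):

```c
/* Illustrative shape of bio_transfer(): walk paired source/destination
 * segments and OR the per-segment results together, so any failing
 * segment shows up in the final return value. */
struct seg {
        void *addr;
        unsigned int len;
};

static int transfer_segs(struct seg *dst, const struct seg *src, int nsegs,
                         int (*xfer)(void *to, const void *from,
                                     unsigned int len))
{
        int i, ret = 0;

        for (i = 0; i < nsegs; i++)
                ret |= xfer(dst[i].addr, src[i].addr, src[i].len);
        return ret;
}
```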
unsigned long (*agp_alloc_page) (void); - void (*agp_destroy_page) (unsigned long); + void *(*agp_alloc_page) (void); + void (*agp_destroy_page) (void *); int (*suspend)(void); void (*resume)(void); diff --git a/drivers/char/agp/agpgart_be.c b/drivers/char/agp/agpgart_be.c index 44cbc013d91c..8ba761695215 100644 --- a/drivers/char/agp/agpgart_be.c +++ b/drivers/char/agp/agpgart_be.c @@ -22,6 +22,8 @@ * OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE * OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. * + * TODO: + * - Allocate more than order 0 pages to avoid too much linear map splitting. */ #include <linux/config.h> #include <linux/version.h> @@ -43,6 +45,7 @@ #include <asm/uaccess.h> #include <asm/io.h> #include <asm/page.h> +#include <asm/agp.h> #include <linux/agp_backend.h> #include "agp.h" @@ -59,56 +62,28 @@ EXPORT_SYMBOL(agp_enable); EXPORT_SYMBOL(agp_backend_acquire); EXPORT_SYMBOL(agp_backend_release); -static void flush_cache(void); - static struct agp_bridge_data agp_bridge; static int agp_try_unsupported __initdata = 0; - -static inline void flush_cache(void) -{ -#if defined(__i386__) || defined(__x86_64__) - asm volatile ("wbinvd":::"memory"); -#elif defined(__alpha__) || defined(__ia64__) || defined(__sparc__) - /* ??? I wonder if we'll really need to flush caches, or if the - core logic can manage to keep the system coherent. The ARM - speaks only of using `cflush' to get things in memory in - preparation for power failure. - - If we do need to call `cflush', we'll need a target page, - as we can only flush one page at a time. - - Ditto for IA-64. --davidm 00/08/07 */ - mb(); -#else -#error "Please define flush_cache." -#endif -} - #ifdef CONFIG_SMP -static atomic_t cpus_waiting; - static void ipi_handler(void *null) { - flush_cache(); - atomic_dec(&cpus_waiting); - while (atomic_read(&cpus_waiting) > 0) - barrier(); + flush_agp_cache(); } static void smp_flush_cache(void) { - atomic_set(&cpus_waiting, num_online_cpus() - 1); - if (smp_call_function(ipi_handler, NULL, 1, 0) != 0) + if (smp_call_function(ipi_handler, NULL, 1, 1) != 0) panic(PFX "timed out waiting for the other CPUs!\n"); - flush_cache(); - while (atomic_read(&cpus_waiting) > 0) - barrier(); + flush_agp_cache(); } #define global_cache_flush smp_flush_cache #else /* CONFIG_SMP */ -#define global_cache_flush flush_cache -#endif /* CONFIG_SMP */ +static void global_cache_flush(void) +{ + flush_agp_cache(); +} +#endif /* !CONFIG_SMP */ int agp_backend_acquire(void) { @@ -208,8 +183,7 @@ void agp_free_memory(agp_memory * curr) if (curr->page_count != 0) { for (i = 0; i < curr->page_count; i++) { curr->memory[i] &= ~(0x00000fff); - agp_bridge.agp_destroy_page((unsigned long) - phys_to_virt(curr->memory[i])); + agp_bridge.agp_destroy_page(phys_to_virt(curr->memory[i])); } } agp_free_key(curr->key); @@ -252,21 +226,22 @@ agp_memory *agp_allocate_memory(size_t page_count, u32 type) MOD_DEC_USE_COUNT; return NULL; } + for (i = 0; i < page_count; i++) { - new->memory[i] = agp_bridge.agp_alloc_page(); + void *addr = agp_bridge.agp_alloc_page(); - if (new->memory[i] == 0) { + if (addr == NULL) { /* Free this structure */ agp_free_memory(new); return NULL; } new->memory[i] = - agp_bridge.mask_memory( - virt_to_phys((void *) new->memory[i]), - type); + agp_bridge.mask_memory(virt_to_phys(addr), type); new->page_count++; } + flush_agp_mappings(); + return new; } @@ -561,6 +536,7 @@ static int agp_generic_create_gatt_table(void) agp_bridge.current_size; break; } + temp = agp_bridge.current_size; } else { 
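One detail of the agpgart hunk deserves a callout: the SMP cache flush. The old code broadcast an IPI and then spun on a hand-maintained cpus_waiting counter; the new code passes wait == 1 to smp_call_function() and lets it synchronize. The new routine again, with explanatory comments added (the code itself is from the patch; PFX is the driver's log prefix macro):

```c
static void ipi_handler(void *null)
{
        flush_agp_cache();      /* per-arch flush, e.g. wbinvd on x86 */
}

static void smp_flush_cache(void)
{
        /* wait == 1: smp_call_function() does not return until the
         * handler has completed on every other CPU, which is exactly
         * what the removed cpus_waiting rendezvous implemented by hand. */
        if (smp_call_function(ipi_handler, NULL, 1, 1) != 0)
                panic(PFX "timed out waiting for the other CPUs!\n");
        flush_agp_cache();      /* the IPI skips the local CPU; flush it here */
}
```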
agp_bridge.aperture_size_idx = i; } @@ -761,7 +737,7 @@ static void agp_generic_free_by_type(agp_memory * curr) * against a maximum value. */ -static unsigned long agp_generic_alloc_page(void) +static void *agp_generic_alloc_page(void) { struct page * page; @@ -769,24 +745,26 @@ static unsigned long agp_generic_alloc_page(void) if (page == NULL) return 0; + map_page_into_agp(page); + get_page(page); SetPageLocked(page); atomic_inc(&agp_bridge.current_memory_agp); - return (unsigned long)page_address(page); + return page_address(page); } -static void agp_generic_destroy_page(unsigned long addr) +static void agp_generic_destroy_page(void *addr) { - void *pt = (void *) addr; struct page *page; - if (pt == NULL) + if (addr == NULL) return; - page = virt_to_page(pt); + page = virt_to_page(addr); + unmap_page_from_agp(page); put_page(page); unlock_page(page); - free_page((unsigned long) pt); + free_page((unsigned long)addr); atomic_dec(&agp_bridge.current_memory_agp); } @@ -993,6 +971,7 @@ static agp_memory *intel_i810_alloc_by_type(size_t pg_count, int type) return new; } if(type == AGP_PHYS_MEMORY) { + void *addr; /* The I810 requires a physical address to program * it's mouse pointer into hardware. However the * Xserver still writes to it through the agp @@ -1007,17 +986,14 @@ static agp_memory *intel_i810_alloc_by_type(size_t pg_count, int type) return NULL; } MOD_INC_USE_COUNT; - new->memory[0] = agp_bridge.agp_alloc_page(); + addr = agp_bridge.agp_alloc_page(); - if (new->memory[0] == 0) { + if (addr == NULL) { /* Free this structure */ agp_free_memory(new); return NULL; } - new->memory[0] = - agp_bridge.mask_memory( - virt_to_phys((void *) new->memory[0]), - type); + new->memory[0] = agp_bridge.mask_memory(virt_to_phys(addr), type); new->page_count = 1; new->num_scratch_pages = 1; new->type = AGP_PHYS_MEMORY; @@ -1032,7 +1008,7 @@ static void intel_i810_free_by_type(agp_memory * curr) { agp_free_key(curr->key); if(curr->type == AGP_PHYS_MEMORY) { - agp_bridge.agp_destroy_page((unsigned long) + agp_bridge.agp_destroy_page( phys_to_virt(curr->memory[0])); vfree(curr->memory); } @@ -1291,7 +1267,7 @@ static agp_memory *intel_i830_alloc_by_type(size_t pg_count,int type) if (type == AGP_DCACHE_MEMORY) return(NULL); if (type == AGP_PHYS_MEMORY) { - unsigned long physical; + void *addr; /* The i830 requires a physical address to program * it's mouse pointer into hardware. However the @@ -1306,19 +1282,18 @@ static agp_memory *intel_i830_alloc_by_type(size_t pg_count,int type) if (nw == NULL) return(NULL); MOD_INC_USE_COUNT; - nw->memory[0] = agp_bridge.agp_alloc_page(); - physical = nw->memory[0]; - if (nw->memory[0] == 0) { + addr = agp_bridge.agp_alloc_page(); + if (addr == NULL) { /* free this structure */ agp_free_memory(nw); return(NULL); } - nw->memory[0] = agp_bridge.mask_memory(virt_to_phys((void *) nw->memory[0]),type); + nw->memory[0] = agp_bridge.mask_memory(virt_to_phys(addr),type); nw->page_count = 1; nw->num_scratch_pages = 1; nw->type = AGP_PHYS_MEMORY; - nw->physical = virt_to_phys((void *) physical); + nw->physical = virt_to_phys(addr); return(nw); } @@ -1849,16 +1824,17 @@ static int intel_i460_remove_memory(agp_memory * mem, off_t pg_start, int type) * Let's just hope nobody counts on the allocated AGP memory being there * before bind time (I don't think current drivers do)... 
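The heart of the interface change shows in agp_generic_alloc_page(): AGP page allocators now hand back a kernel virtual address as a void * instead of an unsigned long, and each page goes through map_page_into_agp() at allocation time, with unmap_page_from_agp() on the way out. Condensed from the two generic routines above, with the atomic accounting of current_memory_agp trimmed for brevity:

```c
/* Condensed sketch of agp_generic_alloc_page()/agp_generic_destroy_page(). */
static void *alloc_agp_page(void)
{
        struct page *page = alloc_page(GFP_KERNEL);

        if (page == NULL)
                return NULL;
        map_page_into_agp(page);        /* adjust linear-map attributes */
        get_page(page);                 /* extra reference held by AGP */
        SetPageLocked(page);            /* keep the VM from reclaiming it */
        return page_address(page);      /* the new void * contract */
}

static void destroy_agp_page(void *addr)
{
        struct page *page = virt_to_page(addr);

        unmap_page_from_agp(page);      /* undo the mapping change */
        put_page(page);
        unlock_page(page);
        free_page((unsigned long)addr);
}
```

Callers such as agp_allocate_memory() then follow each batch of allocations with flush_agp_mappings(), as the hunk above shows.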
*/ -static unsigned long intel_i460_alloc_page(void) +static void * intel_i460_alloc_page(void) { if (intel_i460_cpk) return agp_generic_alloc_page(); /* Returning NULL would cause problems */ - return ~0UL; + /* AK: really dubious code. */ + return (void *)~0UL; } -static void intel_i460_destroy_page(unsigned long page) +static void intel_i460_destroy_page(void *page) { if (intel_i460_cpk) agp_generic_destroy_page(page); @@ -3298,38 +3274,29 @@ static void ali_cache_flush(void) } } -static unsigned long ali_alloc_page(void) +static void *ali_alloc_page(void) { - struct page *page; - u32 temp; + void *adr = agp_generic_alloc_page(); + unsigned temp; - page = alloc_page(GFP_KERNEL); - if (page == NULL) + if (adr == 0) return 0; - get_page(page); - SetPageLocked(page); - atomic_inc(&agp_bridge.current_memory_agp); - - global_cache_flush(); - if (agp_bridge.type == ALI_M1541) { pci_read_config_dword(agp_bridge.dev, ALI_CACHE_FLUSH_CTRL, &temp); pci_write_config_dword(agp_bridge.dev, ALI_CACHE_FLUSH_CTRL, (((temp & ALI_CACHE_FLUSH_ADDR_MASK) | - virt_to_phys(page_address(page))) | + virt_to_phys(adr)) | ALI_CACHE_FLUSH_EN )); } - return (unsigned long)page_address(page); + return adr; } -static void ali_destroy_page(unsigned long addr) +static void ali_destroy_page(void * addr) { u32 temp; - void *pt = (void *) addr; - struct page *page; - if (pt == NULL) + if (addr == NULL) return; global_cache_flush(); @@ -3338,15 +3305,11 @@ static void ali_destroy_page(unsigned long addr) pci_read_config_dword(agp_bridge.dev, ALI_CACHE_FLUSH_CTRL, &temp); pci_write_config_dword(agp_bridge.dev, ALI_CACHE_FLUSH_CTRL, (((temp & ALI_CACHE_FLUSH_ADDR_MASK) | - virt_to_phys((void *)pt)) | + virt_to_phys(addr)) | ALI_CACHE_FLUSH_EN)); } - page = virt_to_page(pt); - put_page(page); - unlock_page(page); - free_page((unsigned long) pt); - atomic_dec(&agp_bridge.current_memory_agp); + agp_generic_destroy_page(addr); } /* Setup function */ @@ -5011,15 +4974,15 @@ static int __init agp_backend_initialize(void) } if (agp_bridge.needs_scratch_page == TRUE) { - agp_bridge.scratch_page = agp_bridge.agp_alloc_page(); + void *addr; + addr = agp_bridge.agp_alloc_page(); - if (agp_bridge.scratch_page == 0) { + if (addr == NULL) { printk(KERN_ERR PFX "unable to get memory for " "scratch page.\n"); return -ENOMEM; } - agp_bridge.scratch_page = - virt_to_phys((void *) agp_bridge.scratch_page); + agp_bridge.scratch_page = virt_to_phys(addr); agp_bridge.scratch_page = agp_bridge.mask_memory(agp_bridge.scratch_page, 0); } @@ -5064,8 +5027,7 @@ static int __init agp_backend_initialize(void) err_out: if (agp_bridge.needs_scratch_page == TRUE) { agp_bridge.scratch_page &= ~(0x00000fff); - agp_bridge.agp_destroy_page((unsigned long) - phys_to_virt(agp_bridge.scratch_page)); + agp_bridge.agp_destroy_page(phys_to_virt(agp_bridge.scratch_page)); } if (got_gatt) agp_bridge.free_gatt_table(); @@ -5084,8 +5046,7 @@ static void agp_backend_cleanup(void) if (agp_bridge.needs_scratch_page == TRUE) { agp_bridge.scratch_page &= ~(0x00000fff); - agp_bridge.agp_destroy_page((unsigned long) - phys_to_virt(agp_bridge.scratch_page)); + agp_bridge.agp_destroy_page(phys_to_virt(agp_bridge.scratch_page)); } } diff --git a/drivers/char/random.c b/drivers/char/random.c index db20dec287d0..9db52acb9ef2 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -252,6 +252,7 @@ #include <linux/poll.h> #include <linux/init.h> #include <linux/fs.h> +#include <linux/tqueue.h> #include <asm/processor.h> #include <asm/uaccess.h> diff --git 
a/drivers/ide/ioctl.c b/drivers/ide/ioctl.c index b986555fd4f3..609ed7dcfa56 100644 --- a/drivers/ide/ioctl.c +++ b/drivers/ide/ioctl.c @@ -345,8 +345,9 @@ int ata_ioctl(struct inode *inode, struct file *file, unsigned int cmd, unsigned if (!arg) { if (ide_spin_wait_hwgroup(drive)) return -EBUSY; - else - return 0; + /* Do nothing, just unlock */ + spin_unlock_irq(drive->channel->lock); + return 0; } return do_cmd_ioctl(drive, arg); diff --git a/drivers/md/linear.c b/drivers/md/linear.c index 118ce821a208..48fb74e50d5c 100644 --- a/drivers/md/linear.c +++ b/drivers/md/linear.c @@ -20,7 +20,7 @@ #include <linux/raid/md.h> #include <linux/slab.h> - +#include <linux/bio.h> #include <linux/raid/linear.h> #define MAJOR_NR MD_MAJOR diff --git a/drivers/md/lvm-snap.c b/drivers/md/lvm-snap.c index c90947fc5f89..46df5c8ff0ef 100644 --- a/drivers/md/lvm-snap.c +++ b/drivers/md/lvm-snap.c @@ -224,7 +224,7 @@ static inline void invalidate_snap_cache(unsigned long start, unsigned long nr, for (i = 0; i < nr; i++) { - bh = get_hash_table(dev, start++, blksize); + bh = find_get_block(dev, start++, blksize); if (bh) bforget(bh); } diff --git a/drivers/md/lvm.c b/drivers/md/lvm.c index dfc256c6a2ec..c44a1b8a74b2 100644 --- a/drivers/md/lvm.c +++ b/drivers/md/lvm.c @@ -209,6 +209,7 @@ #include <linux/hdreg.h> #include <linux/stat.h> #include <linux/fs.h> +#include <linux/bio.h> #include <linux/proc_fs.h> #include <linux/blkdev.h> #include <linux/genhd.h> diff --git a/drivers/md/md.c b/drivers/md/md.c index 21e20ea10be7..d23270322804 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -33,6 +33,7 @@ #include <linux/linkage.h> #include <linux/raid/md.h> #include <linux/sysctl.h> +#include <linux/bio.h> #include <linux/raid/xor.h> #include <linux/devfs_fs_kernel.h> diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c index 46f089ee8481..6db555317b13 100644 --- a/drivers/md/multipath.c +++ b/drivers/md/multipath.c @@ -23,6 +23,7 @@ #include <linux/slab.h> #include <linux/spinlock.h> #include <linux/raid/multipath.h> +#include <linux/bio.h> #include <linux/buffer_head.h> #include <asm/atomic.h> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index 430448c566af..8f149a1efe1b 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -20,6 +20,7 @@ #include <linux/module.h> #include <linux/raid/raid0.h> +#include <linux/bio.h> #define MAJOR_NR MD_MAJOR #define MD_DRIVER diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 43fdb75de0fe..96ad858cf033 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -23,6 +23,7 @@ */ #include <linux/raid/raid1.h> +#include <linux/bio.h> #define MAJOR_NR MD_MAJOR #define MD_DRIVER diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 9402b0c779b9..62873d89e395 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -20,6 +20,7 @@ #include <linux/module.h> #include <linux/slab.h> #include <linux/raid/raid5.h> +#include <linux/bio.h> #include <asm/bitops.h> #include <asm/atomic.h> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index 0260ccf2092a..db4cdb8e3ad4 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -210,3 +210,4 @@ EXPORT_SYMBOL(pci_match_device); EXPORT_SYMBOL(pci_register_driver); EXPORT_SYMBOL(pci_unregister_driver); EXPORT_SYMBOL(pci_dev_driver); +EXPORT_SYMBOL(pci_bus_type); diff --git a/drivers/pcmcia/pci_socket.c b/drivers/pcmcia/pci_socket.c index d30df9b4203a..5a4b78312391 100644 --- a/drivers/pcmcia/pci_socket.c +++ b/drivers/pcmcia/pci_socket.c @@ -20,6 +20,7 @@ #include <linux/init.h> 
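The small ide/ioctl.c change is a lock-balance fix. Judging from the new code alone, ide_spin_wait_hwgroup() returns non-zero without the lock and zero with the channel lock held, so the old bare `return 0` leaked a held spinlock; that reading is an inference from this hunk, not a claim about the rest of the driver. The fixed path, annotated:

```c
if (!arg) {
        if (ide_spin_wait_hwgroup(drive))
                return -EBUSY;                  /* failure: lock not taken */
        /* Do nothing, just unlock */
        spin_unlock_irq(drive->channel->lock);  /* success: we hold it */
        return 0;
}
```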
#include <linux/pci.h> #include <linux/sched.h> +#include <linux/tqueue.h> #include <linux/interrupt.h> #include <pcmcia/ss.h> diff --git a/drivers/pcmcia/yenta.c b/drivers/pcmcia/yenta.c index e5453fb455e2..40b20b945488 100644 --- a/drivers/pcmcia/yenta.c +++ b/drivers/pcmcia/yenta.c @@ -6,6 +6,7 @@ #include <linux/init.h> #include <linux/pci.h> #include <linux/sched.h> +#include <linux/tqueue.h> #include <linux/interrupt.h> #include <linux/delay.h> #include <linux/module.h> diff --git a/drivers/scsi/README.st b/drivers/scsi/README.st index e06a21597910..702a5b178b61 100644 --- a/drivers/scsi/README.st +++ b/drivers/scsi/README.st @@ -2,7 +2,7 @@ This file contains brief information about the SCSI tape driver. The driver is currently maintained by Kai M{kisara (email Kai.Makisara@metla.fi) -Last modified: Tue Jan 22 21:08:57 2002 by makisara +Last modified: Tue Jun 18 18:13:50 2002 by makisara BASICS @@ -105,15 +105,19 @@ The default is BSD semantics. BUFFERING -The driver uses tape buffers allocated either at system initialization -or at run-time when needed. One buffer is used for each open tape -device. The size of the buffers is selectable at compile and/or boot -time. The buffers are used to store the data being transferred to/from -the SCSI adapter. The following buffering options are selectable at -compile time and/or at run time (via ioctl): +The driver uses tape buffers allocated at run-time when needed and it +is freed when the device file is closed. One buffer is used for each +open tape device. + +The size of the buffers is always at least one tape block. In fixed +block mode, the minimum buffer size is defined (in 1024 byte units) by +ST_FIXED_BUFFER_BLOCKS. With small block size this allows buffering of +several blocks and using one SCSI read or write to transfer all of the +blocks. Buffering of data across write calls in fixed block mode is +allowed if ST_BUFFER_WRITES is non-zero. Buffer allocation uses chunks of +memory having sizes 2^n * (page size). Because of this the actual +buffer size may be larger than the minimum allowable buffer size. -Buffering of data across write calls in fixed block mode (define -ST_BUFFER_WRITES). Asynchronous writing. Writing the buffer contents to the tape is started and the write call returns immediately. The status is checked @@ -128,30 +132,6 @@ attempted even if the user does not want to get all of the data at this read command. Should be disabled for those drives that don't like a filemark to truncate a read request or that don't like backspacing. -The buffer size is defined (in 1024 byte units) by ST_BUFFER_BLOCKS or -at boot time. If this size is not large enough, the driver tries to -temporarily enlarge the buffer. Buffer allocation uses chunks of -memory having sizes 2^n * (page size). Because of this the actual -buffer size may be larger than the buffer size specified with -ST_BUFFER_BLOCKS. - -A small number of buffers are allocated at driver initialisation. The -maximum number of these buffers is defined by ST_MAX_BUFFERS. The -maximum can be changed with kernel or module startup options. One -buffer is allocated for each drive detected when the driver is -initialized up to the maximum. - -The driver tries to allocate new buffers at run-time if -necessary. These buffers are freed after use. If the maximum number of -initial buffers is set to zero, all buffer allocation is done at -run-time. The advantage of run-time allocation is that memory is not -wasted for buffers not being used. 
The disadvantage is that there may -not be memory available at the time when a buffer is needed for the -first time (once a buffer is allocated, it is not released). This risk -should not be big if the tape drive is connected to a PCI adapter that -supports scatter/gather (the allocation is not limited to "DMA memory" -and the buffer can be composed of several fragments). - The threshold for triggering asynchronous write in fixed block mode is defined by ST_WRITE_THRESHOLD. This may be optimized for each use pattern. The default triggers asynchronous write after three diff --git a/drivers/scsi/cpqfcTSinit.c b/drivers/scsi/cpqfcTSinit.c index e6f03847c212..f38e377207c7 100644 --- a/drivers/scsi/cpqfcTSinit.c +++ b/drivers/scsi/cpqfcTSinit.c @@ -39,6 +39,7 @@ #include <linux/pci.h> #include <linux/delay.h> #include <linux/timer.h> +#include <linux/init.h> #include <linux/ioport.h> // request_region() prototype #include <linux/vmalloc.h> // ioremap() //#if LINUX_VERSION_CODE >= LinuxVersionCode(2,4,7) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index fc69760ab484..bede96547efb 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -23,6 +23,7 @@ #include <linux/timer.h> #include <linux/string.h> #include <linux/slab.h> +#include <linux/bio.h> #include <linux/ioport.h> #include <linux/kernel.h> #include <linux/stat.h> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 382e04ceace2..63fe305e4342 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -36,6 +36,7 @@ #include <linux/kernel.h> #include <linux/sched.h> #include <linux/mm.h> +#include <linux/bio.h> #include <linux/string.h> #include <linux/hdreg.h> #include <linux/errno.h> diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c index d536f3bc94f6..0e28dc69652b 100644 --- a/drivers/scsi/sr.c +++ b/drivers/scsi/sr.c @@ -39,6 +39,7 @@ #include <linux/kernel.h> #include <linux/sched.h> #include <linux/mm.h> +#include <linux/bio.h> #include <linux/string.h> #include <linux/errno.h> #include <linux/cdrom.h> diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c index f48ac845bc08..7342c3e661f3 100644 --- a/drivers/scsi/st.c +++ b/drivers/scsi/st.c @@ -12,13 +12,13 @@ Copyright 1992 - 2002 Kai Makisara email Kai.Makisara@metla.fi - Last modified: Tue Feb 5 21:25:55 2002 by makisara + Last modified: Sat Jun 15 13:01:56 2002 by makisara Some small formal changes - aeb, 950809 Last modified: 18-JAN-1998 Richard Gooch <rgooch@atnf.csiro.au> Devfs support */ -static char *verstr = "20020205"; +static char *verstr = "20020615"; #include <linux/module.h> @@ -69,7 +69,6 @@ static char *verstr = "20020205"; static int buffer_kbs; static int write_threshold_kbs; -static int max_buffers = (-1); static int max_sg_segs; MODULE_AUTHOR("Kai Makisara"); @@ -80,8 +79,6 @@ MODULE_PARM(buffer_kbs, "i"); MODULE_PARM_DESC(buffer_kbs, "Default driver buffer size (KB; 32)"); MODULE_PARM(write_threshold_kbs, "i"); MODULE_PARM_DESC(write_threshold_kbs, "Asynchronous write threshold (KB; 30)"); -MODULE_PARM(max_buffers, "i"); -MODULE_PARM_DESC(max_buffers, "Maximum number of buffer allocated at initialisation (4)"); MODULE_PARM(max_sg_segs, "i"); MODULE_PARM_DESC(max_sg_segs, "Maximum number of scatter/gather segments to use (32)"); @@ -97,9 +94,6 @@ static struct st_dev_parm { "write_threshold_kbs", &write_threshold_kbs }, { - "max_buffers", &max_buffers - }, - { "max_sg_segs", &max_sg_segs } }; @@ -108,12 +102,12 @@ static struct st_dev_parm { /* The default definitions have been moved to st_options.h */ -#define ST_BUFFER_SIZE 
(ST_BUFFER_BLOCKS * ST_KILOBYTE) +#define ST_FIXED_BUFFER_SIZE (ST_FIXED_BUFFER_BLOCKS * ST_KILOBYTE) #define ST_WRITE_THRESHOLD (ST_WRITE_THRESHOLD_BLOCKS * ST_KILOBYTE) /* The buffer size should fit into the 24 bits for length in the 6-byte SCSI read and write commands. */ -#if ST_BUFFER_SIZE >= (2 << 24 - 1) +#if ST_FIXED_BUFFER_SIZE >= (2 << 24 - 1) #error "Buffer size should not exceed (2 << 24 - 1) bytes!" #endif @@ -121,7 +115,7 @@ DEB( static int debugging = DEBUG; ) #define MAX_RETRIES 0 #define MAX_WRITE_RETRIES 0 -#define MAX_READY_RETRIES 5 +#define MAX_READY_RETRIES 0 #define NO_TAPE NOT_READY #define ST_TIMEOUT (900 * HZ) @@ -137,18 +131,15 @@ DEB( static int debugging = DEBUG; ) #define ST_DEV_ARR_LUMP 6 static rwlock_t st_dev_arr_lock = RW_LOCK_UNLOCKED; -static int st_nbr_buffers; -static ST_buffer **st_buffers = NULL; -static int st_buffer_size = ST_BUFFER_SIZE; +static int st_fixed_buffer_size = ST_FIXED_BUFFER_SIZE; static int st_write_threshold = ST_WRITE_THRESHOLD; -static int st_max_buffers = ST_MAX_BUFFERS; static int st_max_sg_segs = ST_MAX_SG; static Scsi_Tape **scsi_tapes = NULL; static int modes_defined; -static ST_buffer *new_tape_buffer(int, int, int); +static ST_buffer *new_tape_buffer(int, int); static int enlarge_buffer(ST_buffer *, int, int); static void normalize_buffer(ST_buffer *); static int append_to_buffer(const char *, ST_buffer *, int); @@ -914,8 +905,7 @@ static int check_tape(Scsi_Tape *STp, struct file *filp) module count. */ static int st_open(struct inode *inode, struct file *filp) { - int i, need_dma_buffer; - int retval = (-EIO); + int i, retval = (-EIO); Scsi_Tape *STp; ST_partstat *STps; int dev = TAPE_NR(inode->i_rdev); @@ -945,38 +935,15 @@ static int st_open(struct inode *inode, struct file *filp) goto err_out; } - /* Allocate a buffer for this user */ - need_dma_buffer = STp->restr_dma; - write_lock(&st_dev_arr_lock); - for (i = 0; i < st_nbr_buffers; i++) - if (!st_buffers[i]->in_use && - (!need_dma_buffer || st_buffers[i]->dma)) { - STp->buffer = st_buffers[i]; - (STp->buffer)->in_use = 1; - break; - } - write_unlock(&st_dev_arr_lock); - if (i >= st_nbr_buffers) { - STp->buffer = new_tape_buffer(FALSE, need_dma_buffer, TRUE); - if (STp->buffer == NULL) { - printk(KERN_WARNING "st%d: Can't allocate tape buffer.\n", dev); - retval = (-EBUSY); - goto err_out; - } + /* See that we have at least a one page buffer available */ + if (!enlarge_buffer(STp->buffer, PAGE_SIZE, STp->restr_dma)) { + printk(KERN_WARNING "st%d: Can't allocate tape buffer.\n", dev); + retval = (-EOVERFLOW); + goto err_out; } (STp->buffer)->writing = 0; (STp->buffer)->syscall_result = 0; - (STp->buffer)->use_sg = STp->device->host->sg_tablesize; - - /* Compute the usable buffer size for this SCSI adapter */ - if (!(STp->buffer)->use_sg) - (STp->buffer)->buffer_size = (STp->buffer)->sg[0].length; - else { - for (i = 0, (STp->buffer)->buffer_size = 0; i < (STp->buffer)->use_sg && - i < (STp->buffer)->sg_segs; i++) - (STp->buffer)->buffer_size += (STp->buffer)->sg[i].length; - } STp->write_prot = ((filp->f_flags & O_ACCMODE) == O_RDONLY); @@ -999,10 +966,7 @@ static int st_open(struct inode *inode, struct file *filp) return 0; err_out: - if (STp->buffer != NULL) { - (STp->buffer)->in_use = 0; - STp->buffer = NULL; - } + normalize_buffer(STp->buffer); STp->in_use = 0; STp->device->access_count--; if (STp->device->host->hostt->module) @@ -1149,16 +1113,8 @@ static int st_release(struct inode *inode, struct file *filp) if (STp->door_locked == ST_LOCKED_AUTO) 
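A note on the compile-time guard carried over above: 6-byte SCSI READ/WRITE commands encode the transfer length in 24 bits, so the tape buffer must stay below 2^24 bytes. The expression `2 << 24 - 1` reads oddly but is right, because `-` binds tighter than `<<` in C. A standalone, user-space check of that precedence:

```c
/* Illustrative check, not driver code: the guard's bound really is 2^24. */
#include <assert.h>

int main(void)
{
        assert((2 << 24 - 1) == (1 << 24));     /* 24 - 1 evaluates first */
        assert(((1 << 24) - 1) == 0xffffff);    /* largest 24-bit length */
        return 0;
}
```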
st_int_ioctl(STp, MTUNLOCK, 0); - if (STp->buffer != NULL) { - normalize_buffer(STp->buffer); - write_lock(&st_dev_arr_lock); - (STp->buffer)->in_use = 0; - STp->buffer = NULL; - } - else { - write_lock(&st_dev_arr_lock); - } - + normalize_buffer(STp->buffer); + write_lock(&st_dev_arr_lock); STp->in_use = 0; write_unlock(&st_dev_arr_lock); STp->device->access_count--; @@ -1168,31 +1124,11 @@ static int st_release(struct inode *inode, struct file *filp) return result; } - -/* Write command */ -static ssize_t - st_write(struct file *filp, const char *buf, size_t count, loff_t * ppos) +/* The checks common to both reading and writing */ +static ssize_t rw_checks(Scsi_Tape *STp, struct file *filp, size_t count, loff_t *ppos) { - struct inode *inode = filp->f_dentry->d_inode; - ssize_t total; - ssize_t i, do_count, blks, transfer; + int bufsize; ssize_t retval = 0; - int write_threshold; - int doing_write = 0; - unsigned char cmd[MAX_COMMAND_SIZE]; - const char *b_point; - Scsi_Request *SRpnt = NULL; - Scsi_Tape *STp; - ST_mode *STm; - ST_partstat *STps; - int dev = TAPE_NR(inode->i_rdev); - - read_lock(&st_dev_arr_lock); - STp = scsi_tapes[dev]; - read_unlock(&st_dev_arr_lock); - - if (down_interruptible(&STp->lock)) - return -ERESTARTSYS; /* * If we are in the middle of error recovery, don't let anyone @@ -1219,13 +1155,11 @@ static ssize_t goto out; } - STm = &(STp->modes[STp->current_mode]); - if (!STm->defined) { + if (! STp->modes[STp->current_mode].defined) { retval = (-ENXIO); goto out; } - if (count == 0) - goto out; + /* * If there was a bus reset, block further access @@ -1236,30 +1170,20 @@ static ssize_t goto out; } + if (count == 0) + goto out; + DEB( if (!STp->in_use) { + int dev = TAPE_NR(filp->f_dentry->d_inode->i_rdev); printk(ST_DEB_MSG "st%d: Incorrect device.\n", dev); retval = (-EIO); goto out; } ) /* end DEB */ - /* Write must be integral number of blocks */ - if (STp->block_size != 0 && (count % STp->block_size) != 0) { - printk(KERN_WARNING "st%d: Write not multiple of tape block size.\n", - dev); - retval = (-EINVAL); - goto out; - } - if (STp->can_partitions && (retval = update_partition(STp)) < 0) goto out; - STps = &(STp->ps[STp->partition]); - - if (STp->write_prot) { - retval = (-EACCES); - goto out; - } if (STp->block_size == 0) { if (STp->max_block > 0 && @@ -1273,19 +1197,73 @@ static ssize_t goto out; } } - if ((STp->buffer)->buffer_blocks < 1) { - /* Fixed block mode with too small buffer */ - if (!enlarge_buffer(STp->buffer, STp->block_size, STp->restr_dma)) { + else { + /* Fixed block mode with too small buffer? */ + bufsize = STp->block_size > st_fixed_buffer_size ? 
+ STp->block_size : st_fixed_buffer_size; + if ((STp->buffer)->buffer_size < bufsize && + !enlarge_buffer(STp->buffer, bufsize, STp->restr_dma)) { retval = (-EOVERFLOW); goto out; } - (STp->buffer)->buffer_blocks = 1; + (STp->buffer)->buffer_blocks = bufsize / STp->block_size; } if (STp->do_auto_lock && STp->door_locked == ST_UNLOCKED && !st_int_ioctl(STp, MTLOCK, 0)) STp->door_locked = ST_LOCKED_AUTO; + out: + return retval; +} + + +/* Write command */ +static ssize_t + st_write(struct file *filp, const char *buf, size_t count, loff_t * ppos) +{ + struct inode *inode = filp->f_dentry->d_inode; + ssize_t total; + ssize_t i, do_count, blks, transfer; + ssize_t retval; + int write_threshold; + int doing_write = 0; + unsigned char cmd[MAX_COMMAND_SIZE]; + const char *b_point; + Scsi_Request *SRpnt = NULL; + Scsi_Tape *STp; + ST_mode *STm; + ST_partstat *STps; + int dev = TAPE_NR(inode->i_rdev); + + read_lock(&st_dev_arr_lock); + STp = scsi_tapes[dev]; + read_unlock(&st_dev_arr_lock); + + if (down_interruptible(&STp->lock)) + return -ERESTARTSYS; + + retval = rw_checks(STp, filp, count, ppos); + if (retval || count == 0) + goto out; + + /* Write must be integral number of blocks */ + if (STp->block_size != 0 && (count % STp->block_size) != 0) { + printk(KERN_WARNING "st%d: Write not multiple of tape block size.\n", + dev); + retval = (-EINVAL); + goto out; + } + + STm = &(STp->modes[STp->current_mode]); + STps = &(STp->ps[STp->partition]); + + if (STp->write_prot) { + retval = (-EACCES); + goto out; + } + + if (STps->rw == ST_READING) { retval = flush_buffer(STp, 0); if (retval) @@ -1718,77 +1696,17 @@ static ssize_t if (down_interruptible(&STp->lock)) return -ERESTARTSYS; - /* - * If we are in the middle of error recovery, don't let anyone - * else try and use this device. Also, if error recovery fails, it - * may try and take the device offline, in which case all further - * access to the device is prohibited. - */ - if (!scsi_block_when_processing_errors(STp->device)) { - retval = (-ENXIO); - goto out; - } - - if (ppos != &filp->f_pos) { - /* "A request was outside the capabilities of the device." 
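The large st.c reshuffle hoists everything common to reading and writing (error-recovery gating, readiness, partition update, fixed-block buffer sizing, auto-lock) into rw_checks(), so both entry points reduce to the same prologue. Its shape, copied from the st_write() hunk above with the details elided:

```c
/* Shared prologue of st_read()/st_write() after the refactor. */
if (down_interruptible(&STp->lock))
        return -ERESTARTSYS;

retval = rw_checks(STp, filp, count, ppos);
if (retval || count == 0)
        goto out;       /* a check failed, or there is nothing to do */

/* ... direction-specific checks (e.g. the block-size-multiple test for
 * writes) and the transfer itself ... */
```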
*/ - retval = (-ENXIO); + retval = rw_checks(STp, filp, count, ppos); + if (retval || count == 0) goto out; - } - if (STp->ready != ST_READY) { - if (STp->ready == ST_NO_TAPE) - retval = (-ENOMEDIUM); - else - retval = (-EIO); - goto out; - } STm = &(STp->modes[STp->current_mode]); - if (!STm->defined) { - retval = (-ENXIO); - goto out; - } - DEB( - if (!STp->in_use) { - printk(ST_DEB_MSG "st%d: Incorrect device.\n", dev); - retval = (-EIO); - goto out; - } ) /* end DEB */ - - if (STp->can_partitions && - (retval = update_partition(STp)) < 0) - goto out; - - if (STp->block_size == 0) { - if (STp->max_block > 0 && - (count < STp->min_block || count > STp->max_block)) { - retval = (-EINVAL); - goto out; - } - if (count > (STp->buffer)->buffer_size && - !enlarge_buffer(STp->buffer, count, STp->restr_dma)) { - retval = (-EOVERFLOW); - goto out; - } - } - if ((STp->buffer)->buffer_blocks < 1) { - /* Fixed block mode with too small buffer */ - if (!enlarge_buffer(STp->buffer, STp->block_size, STp->restr_dma)) { - retval = (-EOVERFLOW); - goto out; - } - (STp->buffer)->buffer_blocks = 1; - } - if (!(STm->do_read_ahead) && STp->block_size != 0 && (count % STp->block_size) != 0) { retval = (-EINVAL); /* Read must be integral number of blocks */ goto out; } - if (STp->do_auto_lock && STp->door_locked == ST_UNLOCKED && - !st_int_ioctl(STp, MTLOCK, 0)) - STp->door_locked = ST_LOCKED_AUTO; - STps = &(STp->ps[STp->partition]); if (STps->rw == ST_WRITING) { retval = flush_buffer(STp, 0); @@ -1986,7 +1904,7 @@ static int st_set_options(Scsi_Tape *STp, long options) st_log_options(STp, STm, dev); } else if (code == MT_ST_WRITE_THRESHOLD) { value = (options & ~MT_ST_OPTIONS) * ST_KILOBYTE; - if (value < 1 || value > st_buffer_size) { + if (value < 1 || value > st_fixed_buffer_size) { printk(KERN_WARNING "st%d: Write threshold %d too small or too large.\n", dev, value); @@ -2289,8 +2207,10 @@ static int do_load_unload(Scsi_Tape *STp, struct file *filp, int load_code) if (!retval) { /* SCSI command successful */ - if (!load_code) + if (!load_code) { STp->rew_at_close = 0; + STp->ready = ST_NO_TAPE; + } else { STp->rew_at_close = STp->autorew_dev; retval = check_tape(STp, filp); @@ -2619,10 +2539,14 @@ static int st_int_ioctl(Scsi_Tape *STp, unsigned int cmd_in, unsigned long arg) ioctl_result = st_int_ioctl(STp, MTBSF, 1); if (cmd_in == MTSETBLK || cmd_in == SET_DENS_AND_BLK) { + int old_block_size = STp->block_size; STp->block_size = arg & MT_ST_BLKSIZE_MASK; - if (STp->block_size != 0) + if (STp->block_size != 0) { + if (old_block_size == 0) + normalize_buffer(STp->buffer); (STp->buffer)->buffer_blocks = (STp->buffer)->buffer_size / STp->block_size; + } (STp->buffer)->buffer_bytes = (STp->buffer)->read_pointer = 0; if (cmd_in == SET_DENS_AND_BLK) STp->density = arg >> MT_ST_DENSITY_SHIFT; @@ -3372,18 +3296,11 @@ static int st_ioctl(struct inode *inode, struct file *file, /* Try to allocate a new tape buffer. Calling function must not hold dev_arr_lock. 
*/ static ST_buffer * - new_tape_buffer(int from_initialization, int need_dma, int in_use) + new_tape_buffer(int from_initialization, int need_dma) { - int i, priority, b_size, order, got = 0, segs = 0; + int i, priority, got = 0, segs = 0; ST_buffer *tb; - read_lock(&st_dev_arr_lock); - if (st_nbr_buffers >= st_template.dev_max) { - read_unlock(&st_dev_arr_lock); - return NULL; /* Should never happen */ - } - read_unlock(&st_dev_arr_lock); - if (from_initialization) priority = GFP_ATOMIC; else @@ -3391,85 +3308,19 @@ static ST_buffer * i = sizeof(ST_buffer) + (st_max_sg_segs - 1) * sizeof(struct scatterlist); tb = kmalloc(i, priority); - if (tb) { - if (need_dma) - priority |= GFP_DMA; - - /* Try to allocate the first segment up to ST_FIRST_ORDER and the - others big enough to reach the goal */ - for (b_size = PAGE_SIZE, order=0; - b_size < st_buffer_size && order < ST_FIRST_ORDER; - order++, b_size *= 2) - ; - for ( ; b_size >= PAGE_SIZE; order--, b_size /= 2) { - tb->sg[0].page = alloc_pages(priority, order); - tb->sg[0].offset = 0; - if (tb->sg[0].page != NULL) { - tb->sg[0].length = b_size; - break; - } - } - if (tb->sg[segs].page == NULL) { - kfree(tb); - tb = NULL; - } else { /* Got something, continue */ - - for (b_size = PAGE_SIZE, order=0; - st_buffer_size > - tb->sg[0].length + (ST_FIRST_SG - 1) * b_size; - order++, b_size *= 2) - ; - for (segs = 1, got = tb->sg[0].length; - got < st_buffer_size && segs < ST_FIRST_SG;) { - tb->sg[segs].page = alloc_pages(priority, order); - tb->sg[segs].offset = 0; - if (tb->sg[segs].page == NULL) { - if (st_buffer_size - got <= - (ST_FIRST_SG - segs) * b_size / 2) { - b_size /= 2; /* Large enough for the - rest of the buffers */ - order--; - continue; - } - tb->sg_segs = segs; - tb->orig_sg_segs = 0; - DEB(tb->buffer_size = got); - normalize_buffer(tb); - kfree(tb); - tb = NULL; - break; - } - tb->sg[segs].length = b_size; - got += b_size; - segs++; - } - } - } - if (!tb) { - printk(KERN_NOTICE "st: Can't allocate new tape buffer (nbr %d).\n", - st_nbr_buffers); + printk(KERN_NOTICE "st: Can't allocate new tape buffer.\n"); return NULL; } tb->sg_segs = tb->orig_sg_segs = segs; - tb->b_data = page_address(tb->sg[0].page); + if (segs > 0) + tb->b_data = page_address(tb->sg[0].page); - DEBC(printk(ST_DEB_MSG - "st: Allocated tape buffer %d (%d bytes, %d segments, dma: %d, a: %p).\n", - st_nbr_buffers, got, tb->sg_segs, need_dma, tb->b_data); - printk(ST_DEB_MSG - "st: segment sizes: first %d, last %d bytes.\n", - tb->sg[0].length, tb->sg[segs - 1].length); - ) - tb->in_use = in_use; + tb->in_use = TRUE; tb->dma = need_dma; tb->buffer_size = got; tb->writing = 0; - write_lock(&st_dev_arr_lock); - st_buffers[st_nbr_buffers++] = tb; - write_unlock(&st_dev_arr_lock); - return tb; } @@ -3479,6 +3330,9 @@ static int enlarge_buffer(ST_buffer * STbuffer, int new_size, int need_dma) { int segs, nbr, max_segs, b_size, priority, order, got; + if (new_size <= STbuffer->buffer_size) + return TRUE; + normalize_buffer(STbuffer); max_segs = STbuffer->use_sg; @@ -3492,13 +3346,14 @@ static int enlarge_buffer(ST_buffer * STbuffer, int new_size, int need_dma) if (need_dma) priority |= GFP_DMA; for (b_size = PAGE_SIZE, order=0; - b_size * nbr < new_size - STbuffer->buffer_size; + b_size < new_size - STbuffer->buffer_size; order++, b_size *= 2) ; /* empty */ for (segs = STbuffer->sg_segs, got = STbuffer->buffer_size; segs < max_segs && got < new_size;) { STbuffer->sg[segs].page = alloc_pages(priority, order); + /* printk("st: allocated %x, order %d\n", 
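enlarge_buffer() above builds the buffer out of 2^order-page chunks, growing the order until a single chunk covers what is still missing, and falling back to smaller orders when an allocation fails. The order selection on its own, as a sketch matching the patch's for-loop:

```c
/* Sketch of the order computation in enlarge_buffer(): the smallest
 * order such that (page_size << order) >= the bytes still wanted. */
static int size_to_order(unsigned long wanted, unsigned long page_size)
{
        unsigned long b_size = page_size;
        int order = 0;

        while (b_size < wanted) {
                b_size *= 2;
                order++;
        }
        return order;
}
```

This rounding to power-of-two chunks is also why the rewritten README.st warns that the actual buffer may end up larger than the requested minimum.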
STbuffer->sg[segs].page, order); */ STbuffer->sg[segs].offset = 0; if (STbuffer->sg[segs].page == NULL) { if (new_size - got <= (max_segs - segs) * b_size / 2) { @@ -3518,9 +3373,10 @@ static int enlarge_buffer(ST_buffer * STbuffer, int new_size, int need_dma) STbuffer->buffer_size = got; segs++; } + STbuffer->b_data = page_address(STbuffer->sg[0].page); DEBC(printk(ST_DEB_MSG - "st: Succeeded to enlarge buffer to %d bytes (segs %d->%d, %d).\n", - got, STbuffer->orig_sg_segs, STbuffer->sg_segs, b_size)); + "st: Succeeded to enlarge buffer at %p to %d bytes (segs %d->%d, %d).\n", + STbuffer, got, STbuffer->orig_sg_segs, STbuffer->sg_segs, b_size)); return TRUE; } @@ -3535,14 +3391,14 @@ static void normalize_buffer(ST_buffer * STbuffer) for (b_size=PAGE_SIZE, order=0; b_size < STbuffer->sg[i].length; order++, b_size *= 2) ; /* empty */ + /* printk("st: freeing %x, order %d\n", STbuffer->sg[i].page, order); */ __free_pages(STbuffer->sg[i].page, order); STbuffer->buffer_size -= STbuffer->sg[i].length; } DEB( if (debugging && STbuffer->orig_sg_segs < STbuffer->sg_segs) printk(ST_DEB_MSG "st: Buffer at %p normalized to %d bytes (segs %d).\n", - page_address(STbuffer->sg[0].page), STbuffer->buffer_size, - STbuffer->sg_segs); + STbuffer, STbuffer->buffer_size, STbuffer->sg_segs); ) /* end DEB */ STbuffer->sg_segs = STbuffer->orig_sg_segs; } @@ -3619,18 +3475,16 @@ static int from_buffer(ST_buffer * st_bp, char *ubp, int do_count) static void validate_options(void) { if (buffer_kbs > 0) - st_buffer_size = buffer_kbs * ST_KILOBYTE; + st_fixed_buffer_size = buffer_kbs * ST_KILOBYTE; if (write_threshold_kbs > 0) st_write_threshold = write_threshold_kbs * ST_KILOBYTE; else if (buffer_kbs > 0) - st_write_threshold = st_buffer_size - 2048; - if (st_write_threshold > st_buffer_size) { - st_write_threshold = st_buffer_size; + st_write_threshold = st_fixed_buffer_size - 2048; + if (st_write_threshold > st_fixed_buffer_size) { + st_write_threshold = st_fixed_buffer_size; printk(KERN_WARNING "st: write_threshold limited to %d bytes.\n", st_write_threshold); } - if (max_buffers >= 0) - st_max_buffers = max_buffers; if (max_sg_segs >= ST_FIRST_SG) st_max_sg_segs = max_sg_segs; } @@ -3694,7 +3548,8 @@ static int st_attach(Scsi_Device * SDp) Scsi_Tape *tpnt; ST_mode *STm; ST_partstat *STps; - int i, mode, target_nbr, dev_num; + ST_buffer *buffer; + int i, mode, dev_num; char *stp; if (SDp->type != TYPE_TAPE) @@ -3707,6 +3562,12 @@ static int st_attach(Scsi_Device * SDp) return 1; } + buffer = new_tape_buffer(TRUE, (SDp->host)->unchecked_isa_dma); + if (buffer == NULL) { + printk(KERN_ERR "st: Can't allocate new tape buffer. 
Device not attached.\n"); + return 1; + } + write_lock(&st_dev_arr_lock); if (st_template.nr_dev >= st_template.dev_max) { Scsi_Tape **tmp_da; @@ -3745,14 +3606,6 @@ static int st_attach(Scsi_Device * SDp) } scsi_tapes = tmp_da; - memset(tmp_ba, 0, tmp_dev_max * sizeof(ST_buffer *)); - if (st_buffers != NULL) { - memcpy(tmp_ba, st_buffers, - st_template.dev_max * sizeof(ST_buffer *)); - kfree(st_buffers); - } - st_buffers = tmp_ba; - st_template.dev_max = tmp_dev_max; } @@ -3799,6 +3652,9 @@ static int st_attach(Scsi_Device * SDp) else tpnt->tape_type = MT_ISSCSI2; + buffer->use_sg = tpnt->device->host->sg_tablesize; + tpnt->buffer = buffer; + tpnt->inited = 0; tpnt->devt = mk_kdev(SCSI_TAPE_MAJOR, i); tpnt->dirty = 0; @@ -3858,18 +3714,6 @@ static int st_attach(Scsi_Device * SDp) "Attached scsi tape st%d at scsi%d, channel %d, id %d, lun %d\n", dev_num, SDp->host->host_no, SDp->channel, SDp->id, SDp->lun); - /* See if we need to allocate more static buffers */ - target_nbr = st_template.nr_dev; - if (target_nbr > st_max_buffers) - target_nbr = st_max_buffers; - for (i=st_nbr_buffers; i < target_nbr; i++) - if (!new_tape_buffer(TRUE, TRUE, FALSE)) { - printk(KERN_INFO "st: Unable to allocate new static buffer.\n"); - break; - } - /* If the previous allocation fails, we will try again when the buffer is - really needed. */ - return 0; }; @@ -3897,6 +3741,11 @@ static void st_detach(Scsi_Device * SDp) devfs_unregister (tpnt->de_n[mode]); tpnt->de_n[mode] = NULL; } + if (tpnt->buffer) { + tpnt->buffer->orig_sg_segs = 0; + normalize_buffer(tpnt->buffer); + kfree(tpnt->buffer); + } kfree(tpnt); scsi_tapes[i] = 0; SDp->attached--; @@ -3916,10 +3765,10 @@ static int __init init_st(void) validate_options(); printk(KERN_INFO - "st: Version %s, bufsize %d, wrt %d, " - "max init. bufs %d, s/g segs %d\n", - verstr, st_buffer_size, st_write_threshold, - st_max_buffers, st_max_sg_segs); + "st: Version %s, fixed bufsize %d, wrt %d, " + "s/g segs %d\n", + verstr, st_fixed_buffer_size, st_write_threshold, + st_max_sg_segs); if (devfs_register_chrdev(SCSI_TAPE_MAJOR, "st", &st_fops) >= 0) return scsi_register_device(&st_template); @@ -3939,16 +3788,6 @@ static void __exit exit_st(void) if (scsi_tapes[i]) kfree(scsi_tapes[i]); kfree(scsi_tapes); - if (st_buffers != NULL) { - for (i = 0; i < st_nbr_buffers; i++) { - if (st_buffers[i] != NULL) { - st_buffers[i]->orig_sg_segs = 0; - normalize_buffer(st_buffers[i]); - kfree(st_buffers[i]); - } - } - kfree(st_buffers); - } } st_template.dev_max = 0; printk(KERN_INFO "st: Unloaded.\n"); diff --git a/drivers/scsi/st_options.h b/drivers/scsi/st_options.h index 325bd3cb5c1e..2c412f72be13 100644 --- a/drivers/scsi/st_options.h +++ b/drivers/scsi/st_options.h @@ -3,7 +3,7 @@ Copyright 1995-2000 Kai Makisara. - Last modified: Tue Jan 22 21:52:34 2002 by makisara + Last modified: Sun May 5 15:09:56 2002 by makisara */ #ifndef _ST_OPTIONS_H @@ -30,22 +30,17 @@ SENSE. */ #define ST_DEFAULT_BLOCK 0 -/* The tape driver buffer size in kilobytes. Must be non-zero. */ -#define ST_BUFFER_BLOCKS 32 +/* The minimum tape driver buffer size in kilobytes in fixed block mode. + Must be non-zero. */ +#define ST_FIXED_BUFFER_BLOCKS 32 /* The number of kilobytes of data in the buffer that triggers an asynchronous write in fixed block mode. See also ST_ASYNC_WRITES below. */ #define ST_WRITE_THRESHOLD_BLOCKS 30 -/* The maximum number of tape buffers the driver tries to allocate at - driver initialisation. The number is also constrained by the number - of drives detected. 
If more buffers are needed, they are allocated - at run time and freed after use. */ -#define ST_MAX_BUFFERS 4 - /* Maximum number of scatter/gather segments */ -#define ST_MAX_SG 16 +#define ST_MAX_SG 64 /* The number of scatter/gather segments to allocate at first try (must be smaller or equal to the maximum). */ @@ -17,6 +17,7 @@ * */ #include <linux/mm.h> +#include <linux/bio.h> #include <linux/blk.h> #include <linux/slab.h> #include <linux/iobuf.h> @@ -284,8 +285,8 @@ struct bio *bio_copy(struct bio *bio, int gfp_mask, int copy) vto = kmap(bbv->bv_page); } else { local_irq_save(flags); - vfrom = kmap_atomic(bv->bv_page, KM_BIO_IRQ); - vto = kmap_atomic(bbv->bv_page, KM_BIO_IRQ); + vfrom = kmap_atomic(bv->bv_page, KM_BIO_SRC_IRQ); + vto = kmap_atomic(bbv->bv_page, KM_BIO_DST_IRQ); } memcpy(vto + bbv->bv_offset, vfrom + bv->bv_offset, bv->bv_len); @@ -293,8 +294,8 @@ struct bio *bio_copy(struct bio *bio, int gfp_mask, int copy) kunmap(bbv->bv_page); kunmap(bv->bv_page); } else { - kunmap_atomic(vto, KM_BIO_IRQ); - kunmap_atomic(vfrom, KM_BIO_IRQ); + kunmap_atomic(vto, KM_BIO_DST_IRQ); + kunmap_atomic(vfrom, KM_BIO_SRC_IRQ); local_irq_restore(flags); } } diff --git a/fs/buffer.c b/fs/buffer.c index b7e31f59193b..dde8e7d9bae6 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -152,14 +152,16 @@ __set_page_buffers(struct page *page, struct buffer_head *head) { if (page_has_buffers(page)) buffer_error(); - set_page_buffers(page, head); page_cache_get(page); + SetPagePrivate(page); + page->private = (unsigned long)head; } static inline void __clear_page_buffers(struct page *page) { - clear_page_buffers(page); + ClearPagePrivate(page); + page->private = 0; page_cache_release(page); } @@ -376,7 +378,7 @@ out: } /* - * Various filesystems appear to want __get_hash_table to be non-blocking. + * Various filesystems appear to want __find_get_block to be non-blocking. * But it's the page lock which protects the buffers. To get around this, * we get exclusion from try_to_free_buffers with the blockdev mapping's * private_lock. @@ -387,7 +389,7 @@ out: * private_lock is contended then so is mapping->page_lock). */ struct buffer_head * -__get_hash_table(struct block_device *bdev, sector_t block, int unused) +__find_get_block(struct block_device *bdev, sector_t block, int unused) { struct inode *bd_inode = bdev->bd_inode; struct address_space *bd_mapping = bd_inode->i_mapping; @@ -492,7 +494,7 @@ static void free_more_memory(void) } /* - * I/O completion handler for block_read_full_page() and brw_page() - pages + * I/O completion handler for block_read_full_page() - pages * which come unlocked at the end of I/O. */ static void end_buffer_async_read(struct buffer_head *bh, int uptodate) @@ -542,14 +544,6 @@ static void end_buffer_async_read(struct buffer_head *bh, int uptodate) */ if (page_uptodate && !PageError(page)) SetPageUptodate(page); - - /* - * swap page handling is a bit hacky. A standalone completion handler - * for swapout pages would fix that up. swapin can use this function. 
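In the bio_copy() hunk above, the single KM_BIO_IRQ atomic-kmap slot becomes a source/destination pair. Atomic kmaps are fixed per-CPU address windows indexed by slot, so mapping both pages through one slot would have the second kmap_atomic() reuse, and thereby clobber, the first mapping. The corrected pairing, annotated from the patch:

```c
local_irq_save(flags);
vfrom = kmap_atomic(bv->bv_page, KM_BIO_SRC_IRQ);   /* source window   */
vto = kmap_atomic(bbv->bv_page, KM_BIO_DST_IRQ);    /* a distinct slot */
memcpy(vto + bbv->bv_offset, vfrom + bv->bv_offset, bv->bv_len);
kunmap_atomic(vto, KM_BIO_DST_IRQ);                 /* unmap in reverse */
kunmap_atomic(vfrom, KM_BIO_SRC_IRQ);
local_irq_restore(flags);
```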
- */ - if (PageSwapCache(page) && PageWriteback(page)) - end_page_writeback(page); - unlock_page(page); return; @@ -856,8 +850,9 @@ void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode) if (mapping->assoc_mapping != buffer_mapping) BUG(); } - buffer_insert_list(&buffer_mapping->private_lock, - bh, &mapping->private_list); + if (list_empty(&bh->b_assoc_buffers)) + buffer_insert_list(&buffer_mapping->private_lock, + bh, &mapping->private_list); } EXPORT_SYMBOL(mark_buffer_dirty_inode); @@ -952,12 +947,12 @@ void invalidate_inode_buffers(struct inode *inode) * the size of each buffer.. Use the bh->b_this_page linked list to * follow the buffers created. Return NULL if unable to create more * buffers. - * The async flag is used to differentiate async IO (paging, swapping) - * from ordinary buffer allocations, and only async requests are allowed - * to sleep waiting for buffer heads. + * + * The retry flag is used to differentiate async IO (paging, swapping) + * which may not fail from ordinary buffer allocations. */ static struct buffer_head * -create_buffers(struct page * page, unsigned long size, int async) +create_buffers(struct page * page, unsigned long size, int retry) { struct buffer_head *bh, *head; long offset; @@ -966,7 +961,7 @@ try_again: head = NULL; offset = PAGE_SIZE; while ((offset -= size) >= 0) { - bh = alloc_buffer_head(async); + bh = alloc_buffer_head(); if (!bh) goto no_grow; @@ -1003,7 +998,7 @@ no_grow: * become available. But we don't want tasks sleeping with * partially complete buffers, so all were released above. */ - if (!async) + if (!retry) return NULL; /* We're _really_ low on memory. Now we just @@ -1096,7 +1091,7 @@ grow_dev_page(struct block_device *bdev, unsigned long block, /* * Link the page to the buffers and initialise them. Take the - * lock to be atomic wrt __get_hash_table(), which does not + * lock to be atomic wrt __find_get_block(), which does not * run under the page lock. */ spin_lock(&inode->i_mapping->private_lock); @@ -1169,7 +1164,7 @@ __getblk(struct block_device *bdev, sector_t block, int size) for (;;) { struct buffer_head * bh; - bh = __get_hash_table(bdev, block, size); + bh = __find_get_block(bdev, block, size); if (bh) { touch_buffer(bh); return bh; @@ -1218,7 +1213,7 @@ void mark_buffer_dirty(struct buffer_head *bh) { if (!buffer_uptodate(bh)) buffer_error(); - if (!test_set_buffer_dirty(bh)) + if (!buffer_dirty(bh) && !test_set_buffer_dirty(bh)) __set_page_dirty_nobuffers(bh->b_page); } @@ -1243,10 +1238,17 @@ void __brelse(struct buffer_head * buf) * bforget() is like brelse(), except it discards any * potentially dirty data. 
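Two related buffer.c refinements sit in this stretch. mark_buffer_dirty_inode() now links a buffer into the inode's association list only when list_empty() says it is not already there, and mark_buffer_dirty() gained a plain bit test ahead of the atomic one. The latter is the classic check-then-set idiom: the unlocked read makes the already-dirty fast path free of bus-locked operations, while correctness still rests entirely on the atomic op. In general form:

```c
/* General form of the idiom in mark_buffer_dirty() (illustrative). */
if (!test_bit(FLAG, &word) && !test_and_set_bit(FLAG, &word)) {
        /* Only the caller that makes the 0 -> 1 transition gets here;
         * in mark_buffer_dirty() this is where the buffer's page is
         * marked dirty via __set_page_dirty_nobuffers(). */
}
```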
*/ -void __bforget(struct buffer_head * buf) +void __bforget(struct buffer_head *bh) { - clear_buffer_dirty(buf); - __brelse(buf); + clear_buffer_dirty(bh); + if (!list_empty(&bh->b_assoc_buffers)) { + struct address_space *buffer_mapping = bh->b_page->mapping; + + spin_lock(&buffer_mapping->private_lock); + list_del_init(&bh->b_assoc_buffers); + spin_unlock(&buffer_mapping->private_lock); + } + __brelse(bh); } /** @@ -1359,11 +1361,11 @@ int block_invalidatepage(struct page *page, unsigned long offset) { struct buffer_head *head, *bh, *next; unsigned int curr_off = 0; + int ret = 1; - if (!PageLocked(page)) - BUG(); + BUG_ON(!PageLocked(page)); if (!page_has_buffers(page)) - return 1; + goto out; head = page_buffers(page); bh = head; @@ -1385,12 +1387,10 @@ int block_invalidatepage(struct page *page, unsigned long offset) * The get_block cached value has been unconditionally invalidated, * so real IO is not possible anymore. */ - if (offset == 0) { - if (!try_to_release_page(page, 0)) - return 0; - } - - return 1; + if (offset == 0) + ret = try_to_release_page(page, 0); +out: + return ret; } EXPORT_SYMBOL(block_invalidatepage); @@ -1449,7 +1449,7 @@ void unmap_underlying_metadata(struct block_device *bdev, sector_t block) { struct buffer_head *old_bh; - old_bh = __get_hash_table(bdev, block, 0); + old_bh = __find_get_block(bdev, block, 0); if (old_bh) { #if 0 /* This happens. Later. */ if (buffer_dirty(old_bh)) @@ -2266,68 +2266,6 @@ int brw_kiovec(int rw, int nr, struct kiobuf *iovec[], } /* - * Start I/O on a page. - * This function expects the page to be locked and may return - * before I/O is complete. You then have to check page->locked - * and page->uptodate. - * - * FIXME: we need a swapper_inode->get_block function to remove - * some of the bmap kludges and interface ugliness here. - * - * NOTE: unlike file pages, swap pages are locked while under writeout. - * This is to throttle processes which reuse their swapcache pages while - * they are under writeout, and to ensure that there is no I/O going on - * when the page has been successfully locked. Functions such as - * free_swap_and_cache() need to guarantee that there is no I/O in progress - * because they will be freeing up swap blocks, which may then be reused. - * - * Swap pages are also marked PageWriteback when they are being written - * so that memory allocators will throttle on them. - */ -int brw_page(int rw, struct page *page, - struct block_device *bdev, sector_t b[], int size) -{ - struct buffer_head *head, *bh; - - BUG_ON(!PageLocked(page)); - - if (!page_has_buffers(page)) - create_empty_buffers(page, size, 0); - head = bh = page_buffers(page); - - /* Stage 1: lock all the buffers */ - do { - lock_buffer(bh); - bh->b_blocknr = *(b++); - bh->b_bdev = bdev; - set_buffer_mapped(bh); - if (rw == WRITE) { - set_buffer_uptodate(bh); - clear_buffer_dirty(bh); - } - /* - * Swap pages are locked during writeout, so use - * buffer_async_read in strange ways. - */ - mark_buffer_async_read(bh); - bh = bh->b_this_page; - } while (bh != head); - - if (rw == WRITE) { - BUG_ON(PageWriteback(page)); - SetPageWriteback(page); - } - - /* Stage 2: start the IO */ - do { - struct buffer_head *next = bh->b_this_page; - submit_bh(rw, bh); - bh = next; - } while (bh != head); - return 0; -} - -/* * Sanity checks for try_to_free_buffers. 
*/ static void check_ttfb_buffer(struct page *page, struct buffer_head *bh) @@ -2456,7 +2394,7 @@ asmlinkage long sys_bdflush(int func, long data) static kmem_cache_t *bh_cachep; static mempool_t *bh_mempool; -struct buffer_head *alloc_buffer_head(int async) +struct buffer_head *alloc_buffer_head(void) { return mempool_alloc(bh_mempool, GFP_NOFS); } diff --git a/fs/coda/dir.c b/fs/coda/dir.c index 16bd5714cecf..5c581916ecdd 100644 --- a/fs/coda/dir.c +++ b/fs/coda/dir.c @@ -147,21 +147,26 @@ exit: int coda_permission(struct inode *inode, int mask) { - int error; + int error = 0; if (!mask) return 0; + lock_kernel(); + coda_vfs_stat.permission++; if (coda_cache_check(inode, mask)) - return 0; + goto out; error = venus_access(inode->i_sb, coda_i2f(inode), mask); if (!error) coda_cache_enter(inode, mask); + out: + unlock_kernel(); + return error; } diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c index f8f6828d5f59..c5cc2178ad4a 100644 --- a/fs/ext3/balloc.c +++ b/fs/ext3/balloc.c @@ -352,7 +352,7 @@ do_more: #ifdef CONFIG_JBD_DEBUG { struct buffer_head *debug_bh; - debug_bh = sb_get_hash_table(sb, block + i); + debug_bh = sb_find_get_block(sb, block + i); if (debug_bh) { BUFFER_TRACE(debug_bh, "Deleted!"); if (!bh2jh(bitmap_bh)->b_committed_data) @@ -701,7 +701,7 @@ got_block: struct buffer_head *debug_bh; /* Record bitmap buffer state in the newly allocated block */ - debug_bh = sb_get_hash_table(sb, tmp); + debug_bh = sb_find_get_block(sb, tmp); if (debug_bh) { BUFFER_TRACE(debug_bh, "state when allocated"); BUFFER_TRACE2(debug_bh, bh, "bitmap state"); diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index b339c253628e..a9b2c7beb70b 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -1650,7 +1650,7 @@ ext3_clear_blocks(handle_t *handle, struct inode *inode, struct buffer_head *bh, struct buffer_head *bh; *p = 0; - bh = sb_get_hash_table(inode->i_sb, nr); + bh = sb_find_get_block(inode->i_sb, nr); ext3_forget(handle, 0, inode, bh, nr); } } diff --git a/fs/inode.c b/fs/inode.c index bc90e4232713..a3b2cd4e8a3c 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -913,16 +913,6 @@ int bmap(struct inode * inode, int block) return res; } -static inline void do_atime_update(struct inode *inode) -{ - unsigned long time = CURRENT_TIME; - if (inode->i_atime != time) { - inode->i_atime = time; - mark_inode_dirty_sync(inode); - } -} - - /** * update_atime - update the access time * @inode: inode accessed @@ -932,15 +922,19 @@ static inline void do_atime_update(struct inode *inode) * as well as the "noatime" flag and inode specific "noatime" markers. 
*/ -void update_atime (struct inode *inode) +void update_atime(struct inode *inode) { if (inode->i_atime == CURRENT_TIME) return; - if ( IS_NOATIME (inode) ) return; - if ( IS_NODIRATIME (inode) && S_ISDIR (inode->i_mode) ) return; - if ( IS_RDONLY (inode) ) return; - do_atime_update(inode); -} /* End Function update_atime */ + if (IS_NOATIME(inode)) + return; + if (IS_NODIRATIME(inode) && S_ISDIR(inode->i_mode)) + return; + if (IS_RDONLY(inode)) + return; + inode->i_atime = CURRENT_TIME; + mark_inode_dirty_sync(inode); +} int inode_needs_sync(struct inode *inode) { diff --git a/fs/intermezzo/dir.c b/fs/intermezzo/dir.c index c8a8c1988f16..cec0471800f1 100644 --- a/fs/intermezzo/dir.c +++ b/fs/intermezzo/dir.c @@ -785,13 +785,15 @@ int presto_permission(struct inode *inode, int mask) { unsigned short mode = inode->i_mode; struct presto_cache *cache; - int rc; + int rc = 0; + lock_kernel(); ENTRY; + if ( presto_can_ilookup() && !(mask & S_IWOTH)) { CDEBUG(D_CACHE, "ilookup on %ld OK\n", inode->i_ino); - EXIT; - return 0; + EXIT; + goto out; } cache = presto_get_cache(inode); @@ -803,25 +805,22 @@ int presto_permission(struct inode *inode, int mask) if ( S_ISREG(mode) && fiops && fiops->permission ) { EXIT; - return fiops->permission(inode, mask); + rc = fiops->permission(inode, mask); + goto out; } if ( S_ISDIR(mode) && diops && diops->permission ) { EXIT; - return diops->permission(inode, mask); + rc = diops->permission(inode, mask); + goto out; } } - /* The cache filesystem doesn't have its own permission function, - * but we don't want to duplicate the VFS code here. In order - * to avoid looping from permission calling this function again, - * we temporarily override the permission operation while we call - * the VFS permission function. - */ - inode->i_op->permission = NULL; - rc = permission(inode, mask); - inode->i_op->permission = &presto_permission; + rc = vfs_permission(inode, mask); EXIT; + + out: + unlock_kernel(); return rc; } diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c index e4ce53b05a55..2283894a81a6 100644 --- a/fs/jbd/commit.c +++ b/fs/jbd/commit.c @@ -659,6 +659,20 @@ skip_commit: * there's no point in keeping a checkpoint record for * it. */ bh = jh2bh(jh); + + /* A buffer which has been freed while still being + * journaled by a previous transaction may end up still + * being dirty here, but we want to avoid writing back + * that buffer in the future now that the last use has + * been committed. That's not only a performance gain, + * it also stops aliasing problems if the buffer is left + * behind for writeback and gets reallocated for another + * use in a different page. */ + if (buffer_freed(bh)) { + clear_buffer_freed(bh); + clear_buffer_jbddirty(bh); + } + if (buffer_jdirty(bh)) { JBUFFER_TRACE(jh, "add to new checkpointing trans"); __journal_insert_checkpoint(jh, commit_transaction); diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c index 052dd4ef3f01..ade37ad43606 100644 --- a/fs/jbd/journal.c +++ b/fs/jbd/journal.c @@ -463,7 +463,7 @@ int journal_write_metadata_buffer(transaction_t *transaction, * Right, time to make up the new buffer_head. 
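*/

The commit-path hunk above pairs with journal_unmap_buffer() setting
BH_Freed later in this patch: once the old transaction commits, a
buffer freed while still journaled must not be written back, or it
could alias a block that has since been reallocated. A compressed
sketch of the handshake, using the BUFFER_FNS accessors this patch
declares in jbd.h:

/* truncate path: block freed while a prior transaction still owns it */
set_buffer_freed(bh);

/* commit path: the last legitimate use is now on disk */
if (buffer_freed(bh)) {
	clear_buffer_freed(bh);
	clear_buffer_jbddirty(bh);	/* never write this buffer back */
}

/*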
*/ do { - new_bh = alloc_buffer_head(0); + new_bh = alloc_buffer_head(); if (!new_bh) { printk (KERN_NOTICE "%s: ENOMEM at alloc_buffer_head, " "trying again.\n", __FUNCTION__); diff --git a/fs/jbd/revoke.c b/fs/jbd/revoke.c index 7cecb0237988..6a6464533c35 100644 --- a/fs/jbd/revoke.c +++ b/fs/jbd/revoke.c @@ -293,7 +293,7 @@ int journal_revoke(handle_t *handle, unsigned long blocknr, bh = bh_in; if (!bh) { - bh = __get_hash_table(bdev, blocknr, journal->j_blocksize); + bh = __find_get_block(bdev, blocknr, journal->j_blocksize); if (bh) BUFFER_TRACE(bh, "found on hash"); } @@ -303,7 +303,7 @@ int journal_revoke(handle_t *handle, unsigned long blocknr, /* If there is a different buffer_head lying around in * memory anywhere... */ - bh2 = __get_hash_table(bdev, blocknr, journal->j_blocksize); + bh2 = __find_get_block(bdev, blocknr, journal->j_blocksize); if (bh2) { /* ... and it has RevokeValid status... */ if ((bh2 != bh) && @@ -407,7 +407,7 @@ int journal_cancel_revoke(handle_t *handle, struct journal_head *jh) * state machine will get very upset later on. */ if (need_cancel) { struct buffer_head *bh2; - bh2 = __get_hash_table(bh->b_bdev, bh->b_blocknr, bh->b_size); + bh2 = __find_get_block(bh->b_bdev, bh->b_blocknr, bh->b_size); if (bh2) { if (bh2 != bh) clear_bit(BH_Revoked, &bh2->b_state); diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c index 89c625bf9fa8..37c9ed30ebfd 100644 --- a/fs/jbd/transaction.c +++ b/fs/jbd/transaction.c @@ -1601,8 +1601,7 @@ void journal_unfile_buffer(struct journal_head *jh) * * Returns non-zero iff we were able to free the journal_head. */ -static int __journal_try_to_free_buffer(struct buffer_head *bh, - int *locked_or_dirty) +static inline int __journal_try_to_free_buffer(struct buffer_head *bh) { struct journal_head *jh; @@ -1610,12 +1609,7 @@ static int __journal_try_to_free_buffer(struct buffer_head *bh, jh = bh2jh(bh); - if (buffer_locked(bh) || buffer_dirty(bh)) { - *locked_or_dirty = 1; - goto out; - } - - if (!buffer_uptodate(bh)) /* AKPM: why? */ + if (buffer_locked(bh) || buffer_dirty(bh)) goto out; if (jh->b_next_transaction != 0) @@ -1630,8 +1624,7 @@ static int __journal_try_to_free_buffer(struct buffer_head *bh, __journal_remove_journal_head(bh); __brelse(bh); } - } - else if (jh->b_cp_transaction != 0 && jh->b_transaction == 0) { + } else if (jh->b_cp_transaction != 0 && jh->b_transaction == 0) { /* written-back checkpointed metadata buffer */ if (jh->b_jlist == BJ_None) { JBUFFER_TRACE(jh, "remove from checkpoint list"); @@ -1647,10 +1640,8 @@ out: } /* - * journal_try_to_free_buffers(). For all the buffers on this page, - * if they are fully written out ordered data, move them onto BUF_CLEAN - * so try_to_free_buffers() can reap them. Called with lru_list_lock - * not held. Does its own locking. + * journal_try_to_free_buffers(). Try to remove all this page's buffers + * from the journal. * * This complicates JBD locking somewhat. We aren't protected by the * BKL here. We wish to remove the buffer from its committing or @@ -1669,50 +1660,28 @@ out: * journal_try_to_free_buffer() is changing its state. But that * cannot happen because we never reallocate freed data as metadata * while the data is part of a transaction. Yes? - * - * This function returns non-zero if we wish try_to_free_buffers() - * to be called. We do this is the page is releasable by try_to_free_buffers(). - * We also do it if the page has locked or dirty buffers and the caller wants - * us to perform sync or async writeout. 
*/ int journal_try_to_free_buffers(journal_t *journal, - struct page *page, int gfp_mask) + struct page *page, int unused_gfp_mask) { + struct buffer_head *head; struct buffer_head *bh; - struct buffer_head *tmp; - int locked_or_dirty = 0; - int call_ttfb = 1; - int ret; + int ret = 0; J_ASSERT(PageLocked(page)); - bh = page_buffers(page); - tmp = bh; + head = page_buffers(page); + bh = head; spin_lock(&journal_datalist_lock); do { - struct buffer_head *p = tmp; - - tmp = tmp->b_this_page; - if (buffer_jbd(p)) - if (!__journal_try_to_free_buffer(p, &locked_or_dirty)) - call_ttfb = 0; - } while (tmp != bh); + if (buffer_jbd(bh) && !__journal_try_to_free_buffer(bh)) { + spin_unlock(&journal_datalist_lock); + goto busy; + } + } while ((bh = bh->b_this_page) != head); spin_unlock(&journal_datalist_lock); - - if (!(gfp_mask & (__GFP_IO|__GFP_WAIT))) - goto out; - if (!locked_or_dirty) - goto out; - /* - * The VM wants us to do writeout, or to block on IO, or both. - * So we allow try_to_free_buffers to be called even if the page - * still has journalled buffers. - */ - call_ttfb = 1; -out: - ret = 0; - if (call_ttfb) - ret = try_to_free_buffers(page); + ret = try_to_free_buffers(page); +busy: return ret; } @@ -1861,6 +1830,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh) * running transaction if that is set, but nothing * else. */ JBUFFER_TRACE(jh, "on committing transaction"); + set_buffer_freed(bh); if (jh->b_next_transaction) { J_ASSERT(jh->b_next_transaction == journal->j_running_transaction); diff --git a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c index ea37f1c39a64..7790f413096a 100644 --- a/fs/jfs/jfs_logmgr.c +++ b/fs/jfs/jfs_logmgr.c @@ -65,6 +65,7 @@ #include <linux/smp_lock.h> #include <linux/completion.h> #include <linux/buffer_head.h> /* for sync_blockdev() */ +#include <linux/bio.h> #include "jfs_incore.h" #include "jfs_filsys.h" #include "jfs_metapage.h" diff --git a/fs/namei.c b/fs/namei.c index 506f8b5eee6b..8ac8afda4ccb 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -204,13 +204,8 @@ int vfs_permission(struct inode * inode, int mask) int permission(struct inode * inode,int mask) { - if (inode->i_op && inode->i_op->permission) { - int retval; - lock_kernel(); - retval = inode->i_op->permission(inode, mask); - unlock_kernel(); - return retval; - } + if (inode->i_op && inode->i_op->permission) + return inode->i_op->permission(inode, mask); return vfs_permission(inode, mask); } diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 1cbf3a697bda..73d57238a1cc 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -1123,6 +1123,8 @@ nfs_permission(struct inode *inode, int mask) && error != -EACCES) goto out; + lock_kernel(); + error = NFS_PROTO(inode)->access(inode, mask, 0); if (error == -EACCES && NFS_CLIENT(inode)->cl_droppriv && @@ -1130,6 +1132,8 @@ nfs_permission(struct inode *inode, int mask) (current->fsuid != current->uid || current->fsgid != current->gid)) error = NFS_PROTO(inode)->access(inode, mask, 1); + unlock_kernel(); + out: return error; } diff --git a/fs/ntfs/aops.c b/fs/ntfs/aops.c index fbff42392bab..7c20a2949e96 100644 --- a/fs/ntfs/aops.c +++ b/fs/ntfs/aops.c @@ -61,10 +61,10 @@ static void end_buffer_read_file_async(struct buffer_head *bh, int uptodate) if (file_ofs < ni->initialized_size) ofs = ni->initialized_size - file_ofs; - addr = kmap_atomic(page, KM_BIO_IRQ); + addr = kmap_atomic(page, KM_BIO_SRC_IRQ); memset(addr + bh_offset(bh) + ofs, 0, bh->b_size - ofs); flush_dcache_page(page); - kunmap_atomic(addr, KM_BIO_IRQ); + kunmap_atomic(addr, 
KM_BIO_SRC_IRQ); } } else SetPageError(page); @@ -363,10 +363,10 @@ static void end_buffer_read_mftbmp_async(struct buffer_head *bh, int uptodate) if (file_ofs < vol->mftbmp_initialized_size) ofs = vol->mftbmp_initialized_size - file_ofs; - addr = kmap_atomic(page, KM_BIO_IRQ); + addr = kmap_atomic(page, KM_BIO_SRC_IRQ); memset(addr + bh_offset(bh) + ofs, 0, bh->b_size - ofs); flush_dcache_page(page); - kunmap_atomic(addr, KM_BIO_IRQ); + kunmap_atomic(addr, KM_BIO_SRC_IRQ); } } else SetPageError(page); @@ -559,10 +559,10 @@ static void end_buffer_read_mst_async(struct buffer_head *bh, int uptodate) if (file_ofs < ni->initialized_size) ofs = ni->initialized_size - file_ofs; - addr = kmap_atomic(page, KM_BIO_IRQ); + addr = kmap_atomic(page, KM_BIO_SRC_IRQ); memset(addr + bh_offset(bh) + ofs, 0, bh->b_size - ofs); flush_dcache_page(page); - kunmap_atomic(addr, KM_BIO_IRQ); + kunmap_atomic(addr, KM_BIO_SRC_IRQ); } } else SetPageError(page); @@ -593,7 +593,7 @@ static void end_buffer_read_mst_async(struct buffer_head *bh, int uptodate) rec_size = ni->_IDM(index_block_size); recs = PAGE_CACHE_SIZE / rec_size; - addr = kmap_atomic(page, KM_BIO_IRQ); + addr = kmap_atomic(page, KM_BIO_SRC_IRQ); for (i = 0; i < recs; i++) { if (!post_read_mst_fixup((NTFS_RECORD*)(addr + i * rec_size), rec_size)) @@ -607,7 +607,7 @@ static void end_buffer_read_mst_async(struct buffer_head *bh, int uptodate) ni->_IDM(index_block_size_bits)) + i)); } flush_dcache_page(page); - kunmap_atomic(addr, KM_BIO_IRQ); + kunmap_atomic(addr, KM_BIO_SRC_IRQ); if (likely(!nr_err && recs)) SetPageUptodate(page); else { diff --git a/fs/qnx4/fsync.c b/fs/qnx4/fsync.c index 2bb315473ee6..df5bc75d5414 100644 --- a/fs/qnx4/fsync.c +++ b/fs/qnx4/fsync.c @@ -37,7 +37,7 @@ static int sync_block(struct inode *inode, unsigned short *block, int wait) if (!*block) return 0; tmp = *block; - bh = sb_get_hash_table(inode->i_sb, *block); + bh = sb_find_get_block(inode->i_sb, *block); if (!bh) return 0; if (*block != tmp) { diff --git a/fs/reiserfs/fix_node.c b/fs/reiserfs/fix_node.c index 0bdb34c5acf4..1cdcd39a06bd 100644 --- a/fs/reiserfs/fix_node.c +++ b/fs/reiserfs/fix_node.c @@ -920,7 +920,7 @@ static int is_left_neighbor_in_cache( /* Get left neighbor block number. */ n_left_neighbor_blocknr = B_N_CHILD_NUM(p_s_tb->FL[n_h], n_left_neighbor_position); /* Look for the left neighbor in the cache. */ - if ( (left = sb_get_hash_table(p_s_sb, n_left_neighbor_blocknr)) ) { + if ( (left = sb_find_get_block(p_s_sb, n_left_neighbor_blocknr)) ) { RFALSE( buffer_uptodate (left) && ! 
B_IS_IN_TREE(left), "vs-8170: left neighbor (%b %z) is not in the tree", left, left); diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c index c16dbdc12ca6..2cf16631e224 100644 --- a/fs/reiserfs/journal.c +++ b/fs/reiserfs/journal.c @@ -689,7 +689,7 @@ retry: count = 0 ; for (i = 0 ; atomic_read(&(jl->j_commit_left)) > 1 && i < (jl->j_len + 1) ; i++) { /* everything but commit_bh */ bn = SB_ONDISK_JOURNAL_1st_BLOCK(s) + (jl->j_start+i) % SB_ONDISK_JOURNAL_SIZE(s); - tbh = journal_get_hash_table(s, bn) ; + tbh = journal_find_get_block(s, bn) ; /* kill this sanity check */ if (count > (orig_commit_left + 2)) { @@ -718,7 +718,7 @@ reiserfs_panic(s, "journal-539: flush_commit_list: BAD count(%d) > orig_commit_l for (i = 0 ; atomic_read(&(jl->j_commit_left)) > 1 && i < (jl->j_len + 1) ; i++) { /* everything but commit_bh */ bn = SB_ONDISK_JOURNAL_1st_BLOCK(s) + (jl->j_start + i) % SB_ONDISK_JOURNAL_SIZE(s) ; - tbh = journal_get_hash_table(s, bn) ; + tbh = journal_find_get_block(s, bn) ; wait_on_buffer(tbh) ; if (!buffer_uptodate(tbh)) { @@ -2764,7 +2764,7 @@ int journal_mark_freed(struct reiserfs_transaction_handle *th, struct super_bloc int cleaned = 0 ; if (reiserfs_dont_log(th->t_super)) { - bh = sb_get_hash_table(p_s_sb, blocknr) ; + bh = sb_find_get_block(p_s_sb, blocknr) ; if (bh && buffer_dirty (bh)) { printk ("journal_mark_freed(dont_log): dirty buffer on hash list: %lx %ld\n", bh->b_state, blocknr); BUG (); @@ -2772,7 +2772,7 @@ int journal_mark_freed(struct reiserfs_transaction_handle *th, struct super_bloc brelse (bh); return 0 ; } - bh = sb_get_hash_table(p_s_sb, blocknr) ; + bh = sb_find_get_block(p_s_sb, blocknr) ; /* if it is journal new, we just remove it from this transaction */ if (bh && buffer_journal_new(bh)) { mark_buffer_notjournal_new(bh) ; diff --git a/fs/select.c b/fs/select.c index 30c29f1e49f8..6a5909a75677 100644 --- a/fs/select.c +++ b/fs/select.c @@ -12,6 +12,9 @@ * 24 January 2000 * Changed sys_poll()/do_poll() to use PAGE_SIZE chunk-based allocation * of fds to overcome nfds < 16390 descriptors limit (Tigran Aivazian). + * + * Dec 2001 + * Stack allocation and fast path (Andi Kleen) */ #include <linux/slab.h> @@ -26,21 +29,6 @@ #define ROUND_UP(x,y) (((x)+(y)-1)/(y)) #define DEFAULT_POLLMASK (POLLIN | POLLOUT | POLLRDNORM | POLLWRNORM) -struct poll_table_entry { - struct file * filp; - wait_queue_t wait; - wait_queue_head_t * wait_address; -}; - -struct poll_table_page { - struct poll_table_page * next; - struct poll_table_entry * entry; - struct poll_table_entry entries[0]; -}; - -#define POLL_TABLE_FULL(table) \ - ((unsigned long)((table)->entry+1) > PAGE_SIZE + (unsigned long)(table)) - /* * Ok, Peter made a complicated, but straightforward multiple_wait() function. 
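*/

This select.c rewrite moves poll_table_entry and poll_table_page into
<linux/poll.h> (see that hunk later in the patch) and embeds a small
inline table of POLL_INLINE_ENTRIES entries directly in the
poll_table, so polling a handful of descriptors allocates no pages at
all; __pollwait() falls back to __get_free_page() only once
inline_page fills. A sketch of the caller-visible lifecycle, which is
unchanged:

poll_table table;

poll_initwait(&table);	/* table.table == NULL, error == 0 */
/* ... each file->f_op->poll(file, &table) may register a waiter ... */
poll_freewait(&table);	/* frees overflow pages, never inline_page */

/*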
* I have rewritten this, taking some shortcuts: This code may not be easy to @@ -62,30 +50,39 @@ void poll_freewait(poll_table* pt) struct poll_table_page *old; entry = p->entry; - do { + while (entry > p->entries) { entry--; remove_wait_queue(entry->wait_address,&entry->wait); fput(entry->filp); - } while (entry > p->entries); + } old = p; p = p->next; - free_page((unsigned long) old); + if (old != &pt->inline_page) + free_page((unsigned long) old); } } void __pollwait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p) { struct poll_table_page *table = p->table; - - if (!table || POLL_TABLE_FULL(table)) { - struct poll_table_page *new_table; - - new_table = (struct poll_table_page *) __get_free_page(GFP_KERNEL); - if (!new_table) { - p->error = -ENOMEM; - __set_current_state(TASK_RUNNING); - return; + struct poll_table_page *new_table = NULL; + int sz; + + if (!table) { + new_table = &p->inline_page; + } else { + sz = (table == &p->inline_page) ? POLL_INLINE_TABLE_LEN : PAGE_SIZE; + if ((char*)table->entry >= (char*)table + sz) { + new_table = (struct poll_table_page *)__get_free_page(GFP_KERNEL); + if (!new_table) { + p->error = -ENOMEM; + __set_current_state(TASK_RUNNING); + return; + } } + } + + if (new_table) { new_table->entry = new_table->entries; new_table->next = table; p->table = new_table; @@ -113,48 +110,6 @@ void __pollwait(struct file * filp, wait_queue_head_t * wait_address, poll_table #define BITS(fds, n) (*__IN(fds, n)|*__OUT(fds, n)|*__EX(fds, n)) -static int max_select_fd(unsigned long n, fd_set_bits *fds) -{ - unsigned long *open_fds; - unsigned long set; - int max; - - /* handle last in-complete long-word first */ - set = ~(~0UL << (n & (__NFDBITS-1))); - n /= __NFDBITS; - open_fds = current->files->open_fds->fds_bits+n; - max = 0; - if (set) { - set &= BITS(fds, n); - if (set) { - if (!(set & ~*open_fds)) - goto get_max; - return -EBADF; - } - } - while (n) { - open_fds--; - n--; - set = BITS(fds, n); - if (!set) - continue; - if (set & ~*open_fds) - return -EBADF; - if (max) - continue; -get_max: - do { - max++; - set >>= 1; - } while (set); - max += n * __NFDBITS; - } - - return max; -} - -#define BIT(i) (1UL << ((i)&(__NFDBITS-1))) -#define MEM(i,m) ((m)+(unsigned)(i)/__NFDBITS) #define ISSET(i,m) (((i)&*(m)) != 0) #define SET(i,m) (*(m) |= (i)) @@ -165,56 +120,71 @@ get_max: int do_select(int n, fd_set_bits *fds, long *timeout) { poll_table table, *wait; - int retval, i, off; + int retval, off, max, maxoff; long __timeout = *timeout; - read_lock(¤t->files->file_lock); - retval = max_select_fd(n, fds); - read_unlock(¤t->files->file_lock); - - if (retval < 0) - return retval; - n = retval; - poll_initwait(&table); wait = &table; if (!__timeout) wait = NULL; + retval = 0; + maxoff = n/BITS_PER_LONG; + max = 0; for (;;) { set_current_state(TASK_INTERRUPTIBLE); - for (i = 0 ; i < n; i++) { - unsigned long bit = BIT(i); - unsigned long mask; - struct file *file; + for (off = 0; off <= maxoff; off++) { + unsigned long val = BITS(fds, off); - off = i / __NFDBITS; - if (!(bit & BITS(fds, off))) + if (!val) continue; - file = fget(i); - mask = POLLNVAL; - if (file) { - mask = DEFAULT_POLLMASK; - if (file->f_op && file->f_op->poll) - mask = file->f_op->poll(file, wait); - fput(file); - } - if ((mask & POLLIN_SET) && ISSET(bit, __IN(fds,off))) { - SET(bit, __RES_IN(fds,off)); - retval++; - wait = NULL; - } - if ((mask & POLLOUT_SET) && ISSET(bit, __OUT(fds,off))) { - SET(bit, __RES_OUT(fds,off)); - retval++; - wait = NULL; - } - if ((mask & POLLEX_SET) && 
ISSET(bit, __EX(fds,off))) { - SET(bit, __RES_EX(fds,off)); - retval++; - wait = NULL; + while (val) { + int k = ffz(~val); + unsigned long mask, bit; + struct file *file; + + if (k > n%BITS_PER_LONG) + break; + + bit = (1UL << k); + val &= ~bit; + + file = fget((off * BITS_PER_LONG) + k); + mask = POLLNVAL; + if (file) { + mask = DEFAULT_POLLMASK; + if (file->f_op && file->f_op->poll) + mask = file->f_op->poll(file, wait); + fput(file); + } else { + /* This error will shadow all other results. + * This matches previous linux behaviour */ + retval = -EBADF; + goto out; + } + if ((mask & POLLIN_SET) && ISSET(bit, __IN(fds,off))) { + SET(bit, __RES_IN(fds,off)); + retval++; + wait = NULL; + } + if ((mask& POLLOUT_SET) && ISSET(bit,__OUT(fds,off))) { + SET(bit, __RES_OUT(fds,off)); + retval++; + wait = NULL; + } + if ((mask & POLLEX_SET) && ISSET(bit, __EX(fds,off))) { + SET(bit, __RES_EX(fds,off)); + retval++; + wait = NULL; + } + + if (!(val &= ~bit)) + break; } } + + + maxoff = max; wait = NULL; if (retval || !__timeout || signal_pending(current)) break; @@ -224,25 +194,43 @@ int do_select(int n, fd_set_bits *fds, long *timeout) } __timeout = schedule_timeout(__timeout); } + +out: current->state = TASK_RUNNING; poll_freewait(&table); /* - * Up-to-date the caller timeout. + * Update the caller timeout. */ *timeout = __timeout; return retval; } -static void *select_bits_alloc(int size) -{ - return kmalloc(6 * size, GFP_KERNEL); -} +/* + * We do a VERIFY_WRITE here even though we are only reading this time: + * we'll write to it eventually.. + */ -static void select_bits_free(void *bits, int size) +static int get_fd_set(unsigned long nr, void *ufdset, unsigned long *fdset) { - kfree(bits); + unsigned long rounded = FDS_BYTES(nr), mask; + if (ufdset) { + int error = verify_area(VERIFY_WRITE, ufdset, rounded); + if (!error && __copy_from_user(fdset, ufdset, rounded)) + error = -EFAULT; + if (nr % __NFDBITS == 0) + mask = 0; + else { + /* This includes one bit too much according to SU; + but without this some programs hang. */ + mask = ~(~0UL << (nr%__NFDBITS)); + } + fdset[nr/__NFDBITS] &= mask; + return error; + } + memset(fdset, 0, rounded); + return 0; } /* @@ -263,6 +251,7 @@ sys_select(int n, fd_set *inp, fd_set *outp, fd_set *exp, struct timeval *tvp) char *bits; long timeout; int ret, size, max_fdset; + char stack_bits[FDS_BYTES(FAST_SELECT_MAX) * 6]; timeout = MAX_SCHEDULE_TIMEOUT; if (tvp) { @@ -297,11 +286,16 @@ sys_select(int n, fd_set *inp, fd_set *outp, fd_set *exp, struct timeval *tvp) * since we used fdset we need to allocate memory in units of * long-words. 
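*/

The rewritten do_select() loop above scans a whole word of descriptors
at a time: BITS() ORs the in/out/ex words, and set bits are then
peeled off with ffz(~val), find-first-set expressed as find-first-zero
of the complement. A standalone sketch of that idiom; handle_fd() is a
hypothetical callback:

extern void handle_fd(int fd);		/* hypothetical */

static void for_each_set_fd(unsigned long val, int off)
{
	while (val) {
		int k = ffz(~val);	/* index of the lowest set bit */
		unsigned long bit = 1UL << k;

		val &= ~bit;		/* clear it and continue */
		handle_fd(off * BITS_PER_LONG + k);
	}
}

/*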
*/ - ret = -ENOMEM; size = FDS_BYTES(n); - bits = select_bits_alloc(size); - if (!bits) - goto out_nofds; + if (n < FAST_SELECT_MAX) { + bits = stack_bits; + } else { + ret = -ENOMEM; + bits = kmalloc(6*size, GFP_KERNEL); + if (!bits) + goto out_nofds; + } + fds.in = (unsigned long *) bits; fds.out = (unsigned long *) (bits + size); fds.ex = (unsigned long *) (bits + 2*size); @@ -313,9 +307,7 @@ sys_select(int n, fd_set *inp, fd_set *outp, fd_set *exp, struct timeval *tvp) (ret = get_fd_set(n, outp, fds.out)) || (ret = get_fd_set(n, exp, fds.ex))) goto out; - zero_fd_set(n, fds.res_in); - zero_fd_set(n, fds.res_out); - zero_fd_set(n, fds.res_ex); + memset(fds.res_in, 0, 3*size); ret = do_select(n, &fds, &timeout); @@ -326,8 +318,8 @@ sys_select(int n, fd_set *inp, fd_set *outp, fd_set *exp, struct timeval *tvp) usec = timeout % HZ; usec *= (1000000/HZ); } - put_user(sec, &tvp->tv_sec); - put_user(usec, &tvp->tv_usec); + __put_user(sec, &tvp->tv_sec); + __put_user(usec, &tvp->tv_usec); } if (ret < 0) @@ -344,8 +336,10 @@ sys_select(int n, fd_set *inp, fd_set *outp, fd_set *exp, struct timeval *tvp) set_fd_set(n, exp, fds.res_ex); out: - select_bits_free(bits, size); + if (n >= FAST_SELECT_MAX) + kfree(bits); out_nofds: + return ret; } @@ -410,12 +404,42 @@ static int do_poll(unsigned int nfds, unsigned int nchunks, unsigned int nleft, return count; } +static int fast_poll(poll_table *table, poll_table *wait, struct pollfd *ufds, + unsigned int nfds, long timeout) +{ + poll_table *pt = wait; + struct pollfd fds[FAST_POLL_MAX]; + int count, i; + + if (copy_from_user(fds, ufds, nfds * sizeof(struct pollfd))) + return -EFAULT; + for (;;) { + set_current_state(TASK_INTERRUPTIBLE); + count = 0; + do_pollfd(nfds, fds, &pt, &count); + pt = NULL; + if (count || !timeout || signal_pending(current)) + break; + count = wait->error; + if (count) + break; + timeout = schedule_timeout(timeout); + } + current->state = TASK_RUNNING; + for (i = 0; i < nfds; i++) + __put_user(fds[i].revents, &ufds[i].revents); + poll_freewait(table); + if (!count && signal_pending(current)) + return -EINTR; + return count; +} + asmlinkage long sys_poll(struct pollfd * ufds, unsigned int nfds, long timeout) { - int i, j, fdcount, err; + int i, j, err, fdcount; struct pollfd **fds; poll_table table, *wait; - int nchunks, nleft; + int nchunks, nleft; /* Do a sanity check on nfds ... 
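*/

For n below FAST_SELECT_MAX the six bitmaps now live in a stack buffer
rather than a kmalloc() allocation. Either way the block is carved
into six consecutive FDS_BYTES(n) regions, which is what lets the
three result sets be cleared with the single memset above. A sketch of
the layout; the res_* offsets follow the pre-existing select.c
convention and are not shown in this hunk:

size = FDS_BYTES(n);			/* bytes per bitmap, long-aligned */
fds.in      = (unsigned long *) bits;
fds.out     = (unsigned long *)(bits +     size);
fds.ex      = (unsigned long *)(bits + 2 * size);
fds.res_in  = (unsigned long *)(bits + 3 * size);
fds.res_out = (unsigned long *)(bits + 4 * size);
fds.res_ex  = (unsigned long *)(bits + 5 * size);

/*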
*/ if (nfds > NR_OPEN) @@ -429,43 +453,45 @@ asmlinkage long sys_poll(struct pollfd * ufds, unsigned int nfds, long timeout) timeout = MAX_SCHEDULE_TIMEOUT; } + poll_initwait(&table); wait = &table; if (!timeout) wait = NULL; - err = -ENOMEM; - fds = NULL; - if (nfds != 0) { - fds = (struct pollfd **)kmalloc( - (1 + (nfds - 1) / POLLFD_PER_PAGE) * sizeof(struct pollfd *), - GFP_KERNEL); - if (fds == NULL) - goto out; - } + if (nfds < FAST_POLL_MAX) + return fast_poll(&table, wait, ufds, nfds, timeout); + err = -ENOMEM; + fds = (struct pollfd **)kmalloc( + (1 + (nfds - 1) / POLLFD_PER_PAGE) * sizeof(struct pollfd *), + GFP_KERNEL); + if (fds == NULL) + goto out; + nchunks = 0; nleft = nfds; - while (nleft > POLLFD_PER_PAGE) { /* allocate complete PAGE_SIZE chunks */ + while (nleft > POLLFD_PER_PAGE) { fds[nchunks] = (struct pollfd *)__get_free_page(GFP_KERNEL); if (fds[nchunks] == NULL) goto out_fds; nchunks++; nleft -= POLLFD_PER_PAGE; } - if (nleft) { /* allocate last PAGE_SIZE chunk, only nleft elements used */ + if (nleft) { fds[nchunks] = (struct pollfd *)__get_free_page(GFP_KERNEL); if (fds[nchunks] == NULL) goto out_fds; - } - + } + err = -EFAULT; for (i=0; i < nchunks; i++) if (copy_from_user(fds[i], ufds + i*POLLFD_PER_PAGE, PAGE_SIZE)) goto out_fds1; + if (nleft) { if (copy_from_user(fds[nchunks], ufds + nchunks*POLLFD_PER_PAGE, - nleft * sizeof(struct pollfd))) + nleft * sizeof(struct pollfd))) goto out_fds1; } @@ -489,8 +515,7 @@ out_fds1: out_fds: for (i=0; i < nchunks; i++) free_page((unsigned long)(fds[i])); - if (nfds != 0) - kfree(fds); + kfree(fds); out: poll_freewait(&table); return err; diff --git a/fs/ufs/truncate.c b/fs/ufs/truncate.c index f8134d41d98e..6b87c6f26702 100644 --- a/fs/ufs/truncate.c +++ b/fs/ufs/truncate.c @@ -117,7 +117,7 @@ static int ufs_trunc_direct (struct inode * inode) frag1 = ufs_fragnum (frag1); frag2 = ufs_fragnum (frag2); for (j = frag1; j < frag2; j++) { - bh = sb_get_hash_table (sb, tmp + j); + bh = sb_find_get_block (sb, tmp + j); if ((bh && DATA_BUFFER_USED(bh)) || tmp != fs32_to_cpu(sb, *p)) { retry = 1; brelse (bh); @@ -140,7 +140,7 @@ next1: if (!tmp) continue; for (j = 0; j < uspi->s_fpb; j++) { - bh = sb_get_hash_table(sb, tmp + j); + bh = sb_find_get_block(sb, tmp + j); if ((bh && DATA_BUFFER_USED(bh)) || tmp != fs32_to_cpu(sb, *p)) { retry = 1; brelse (bh); @@ -179,7 +179,7 @@ next2:; ufs_panic(sb, "ufs_truncate_direct", "internal error"); frag4 = ufs_fragnum (frag4); for (j = 0; j < frag4; j++) { - bh = sb_get_hash_table (sb, tmp + j); + bh = sb_find_get_block (sb, tmp + j); if ((bh && DATA_BUFFER_USED(bh)) || tmp != fs32_to_cpu(sb, *p)) { retry = 1; brelse (bh); @@ -238,7 +238,7 @@ static int ufs_trunc_indirect (struct inode * inode, unsigned offset, u32 * p) if (!tmp) continue; for (j = 0; j < uspi->s_fpb; j++) { - bh = sb_get_hash_table(sb, tmp + j); + bh = sb_find_get_block(sb, tmp + j); if ((bh && DATA_BUFFER_USED(bh)) || tmp != fs32_to_cpu(sb, *ind)) { retry = 1; brelse (bh); diff --git a/include/asm-alpha/agp.h b/include/asm-alpha/agp.h new file mode 100644 index 000000000000..ba05bdf9a211 --- /dev/null +++ b/include/asm-alpha/agp.h @@ -0,0 +1,11 @@ +#ifndef AGP_H +#define AGP_H 1 + +/* dummy for now */ + +#define map_page_into_agp(page) +#define unmap_page_from_agp(page) +#define flush_agp_mappings() +#define flush_agp_cache() mb() + +#endif diff --git a/include/asm-i386/agp.h b/include/asm-i386/agp.h new file mode 100644 index 000000000000..9ae97c09fb49 --- /dev/null +++ b/include/asm-i386/agp.h @@ -0,0 +1,23 @@ +#ifndef 
AGP_H +#define AGP_H 1 + +#include <asm/pgtable.h> + +/* + * Functions to keep the agpgart mappings coherent with the MMU. + * The GART gives the CPU a physical alias of pages in memory. The alias region is + * mapped uncacheable. Make sure there are no conflicting mappings + * with different cachability attributes for the same page. This avoids + * data corruption on some CPUs. + */ + +#define map_page_into_agp(page) change_page_attr(page, 1, PAGE_KERNEL_NOCACHE) +#define unmap_page_from_agp(page) change_page_attr(page, 1, PAGE_KERNEL) +#define flush_agp_mappings() global_flush_tlb() + +/* Could use CLFLUSH here if the cpu supports it. But then it would + need to be called for each cacheline of the whole page so it may not be + worth it. Would need a page for it. */ +#define flush_agp_cache() asm volatile("wbinvd":::"memory") + +#endif diff --git a/include/asm-i386/cacheflush.h b/include/asm-i386/cacheflush.h index 58d027dfc5ff..319e65a7047f 100644 --- a/include/asm-i386/cacheflush.h +++ b/include/asm-i386/cacheflush.h @@ -15,4 +15,7 @@ #define flush_icache_page(vma,pg) do { } while (0) #define flush_icache_user_range(vma,pg,adr,len) do { } while (0) +void global_flush_tlb(void); +int change_page_attr(struct page *page, int numpages, pgprot_t prot); + #endif /* _I386_CACHEFLUSH_H */ diff --git a/include/asm-i386/io.h b/include/asm-i386/io.h index 44996d06ecc3..9922dd823c9c 100644 --- a/include/asm-i386/io.h +++ b/include/asm-i386/io.h @@ -121,31 +121,7 @@ static inline void * ioremap (unsigned long offset, unsigned long size) return __ioremap(offset, size, 0); } -/** - * ioremap_nocache - map bus memory into CPU space - * @offset: bus address of the memory - * @size: size of the resource to map - * - * ioremap_nocache performs a platform specific sequence of operations to - * make bus memory CPU accessible via the readb/readw/readl/writeb/ - * writew/writel functions and the other mmio helpers. The returned - * address is not guaranteed to be usable directly as a virtual - * address. - * - * This version of ioremap ensures that the memory is marked uncachable - * on the CPU as well as honouring existing caching rules from things like - * the PCI bus. Note that there are other caches and buffers on many - * busses. 
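*/

These i386 hooks keep the CPU's view of an AGP page coherent with the
uncached alias the GART creates, by switching the kernel mapping to
PAGE_KERNEL_NOCACHE. A sketch of how an aperture driver would use
them; the function name is hypothetical and the call ordering is an
assumption, not something this patch specifies:

static void make_page_gart_safe(struct page *page)
{
	map_page_into_agp(page);	/* remap the kernel alias uncacheable */
	flush_agp_mappings();		/* global_flush_tlb() on i386 */
	flush_agp_cache();		/* wbinvd: push stale lines to memory */
	/* ... now safe to point a GART entry at this page ... */
}

/*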
In paticular driver authors should read up on PCI writes - * - * It's useful if some control registers are in such an area and - * write combining or read caching is not desirable: - */ - -static inline void * ioremap_nocache (unsigned long offset, unsigned long size) -{ - return __ioremap(offset, size, _PAGE_PCD); -} - +extern void * ioremap_nocache (unsigned long offset, unsigned long size); extern void iounmap(void *addr); /* diff --git a/include/asm-i386/kmap_types.h b/include/asm-i386/kmap_types.h index 9a12267d3a4f..0ae7bb3c2b8d 100644 --- a/include/asm-i386/kmap_types.h +++ b/include/asm-i386/kmap_types.h @@ -15,10 +15,11 @@ D(1) KM_SKB_SUNRPC_DATA, D(2) KM_SKB_DATA_SOFTIRQ, D(3) KM_USER0, D(4) KM_USER1, -D(5) KM_BIO_IRQ, -D(6) KM_PTE0, -D(7) KM_PTE1, -D(8) KM_TYPE_NR +D(5) KM_BIO_SRC_IRQ, +D(6) KM_BIO_DST_IRQ, +D(7) KM_PTE0, +D(8) KM_PTE1, +D(9) KM_TYPE_NR }; #undef D diff --git a/include/asm-i386/page.h b/include/asm-i386/page.h index 4737ef69ae18..d8e1f404c08b 100644 --- a/include/asm-i386/page.h +++ b/include/asm-i386/page.h @@ -6,6 +6,9 @@ #define PAGE_SIZE (1UL << PAGE_SHIFT) #define PAGE_MASK (~(PAGE_SIZE-1)) +#define LARGE_PAGE_MASK (~(LARGE_PAGE_SIZE-1)) +#define LARGE_PAGE_SIZE (1UL << PMD_SHIFT) + #ifdef __KERNEL__ #ifndef __ASSEMBLY__ diff --git a/include/asm-i386/pgtable-2level.h b/include/asm-i386/pgtable-2level.h index e22db0cc6824..9f8bdc13adac 100644 --- a/include/asm-i386/pgtable-2level.h +++ b/include/asm-i386/pgtable-2level.h @@ -40,6 +40,7 @@ static inline int pgd_present(pgd_t pgd) { return 1; } * hook is made available. */ #define set_pte(pteptr, pteval) (*(pteptr) = pteval) +#define set_pte_atomic(pteptr, pteval) set_pte(pteptr,pteval) /* * (pmds are folded into pgds so this doesnt get actually called, * but the define is needed for a generic inline function.) diff --git a/include/asm-i386/pgtable-3level.h b/include/asm-i386/pgtable-3level.h index bb2eaea63fde..beb0c1bc3d30 100644 --- a/include/asm-i386/pgtable-3level.h +++ b/include/asm-i386/pgtable-3level.h @@ -49,6 +49,8 @@ static inline void set_pte(pte_t *ptep, pte_t pte) smp_wmb(); ptep->pte_low = pte.pte_low; } +#define set_pte_atomic(pteptr,pteval) \ + set_64bit((unsigned long long *)(pteptr),pte_val(pteval)) #define set_pmd(pmdptr,pmdval) \ set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval)) #define set_pgd(pgdptr,pgdval) \ diff --git a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h index f48db2beeeba..71b75fa234af 100644 --- a/include/asm-i386/pgtable.h +++ b/include/asm-i386/pgtable.h @@ -237,6 +237,9 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) #define pmd_page(pmd) \ (mem_map + (pmd_val(pmd) >> PAGE_SHIFT)) +#define pmd_large(pmd) \ + ((pmd_val(pmd) & (_PAGE_PSE|_PAGE_PRESENT)) == (_PAGE_PSE|_PAGE_PRESENT)) + /* to find an entry in a page-table-directory. 
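*/

pmd_large() above is what lets walkers such as change_page_attr() cope
with 4MB PSE mappings: when the PSE bit is set there is no pte level
to descend into. A minimal sketch of the check for a kernel virtual
address, assuming the standard pgd/pmd walk from <asm/pgtable.h>:

static int addr_uses_large_page(unsigned long addr)
{
	pgd_t *pgd = pgd_offset_k(addr);
	pmd_t *pmd = pmd_offset(pgd, addr);

	/* pmd_large() folds in the present check, per its definition */
	return pmd_large(*pmd);
}

/*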
*/ #define pgd_index(address) ((address >> PGDIR_SHIFT) & (PTRS_PER_PGD-1)) diff --git a/include/asm-ia64/agp.h b/include/asm-ia64/agp.h new file mode 100644 index 000000000000..ba05bdf9a211 --- /dev/null +++ b/include/asm-ia64/agp.h @@ -0,0 +1,11 @@ +#ifndef AGP_H +#define AGP_H 1 + +/* dummy for now */ + +#define map_page_into_agp(page) +#define unmap_page_from_agp(page) +#define flush_agp_mappings() +#define flush_agp_cache() mb() + +#endif diff --git a/include/asm-ppc/kmap_types.h b/include/asm-ppc/kmap_types.h index 99fec407abf5..bce7fd8c1ff2 100644 --- a/include/asm-ppc/kmap_types.h +++ b/include/asm-ppc/kmap_types.h @@ -11,7 +11,8 @@ enum km_type { KM_SKB_DATA_SOFTIRQ, KM_USER0, KM_USER1, - KM_BIO_IRQ, + KM_BIO_SRC_IRQ, + KM_BIO_DST_IRQ, KM_PTE0, KM_PTE1, KM_TYPE_NR diff --git a/include/asm-sparc/kmap_types.h b/include/asm-sparc/kmap_types.h index 7e9a5661c698..bab20a2a676b 100644 --- a/include/asm-sparc/kmap_types.h +++ b/include/asm-sparc/kmap_types.h @@ -7,7 +7,8 @@ enum km_type { KM_SKB_DATA_SOFTIRQ, KM_USER0, KM_USER1, - KM_BIO_IRQ, + KM_BIO_SRC_IRQ, + KM_BIO_DST_IRQ, KM_TYPE_NR }; diff --git a/include/asm-sparc64/agp.h b/include/asm-sparc64/agp.h new file mode 100644 index 000000000000..ba05bdf9a211 --- /dev/null +++ b/include/asm-sparc64/agp.h @@ -0,0 +1,11 @@ +#ifndef AGP_H +#define AGP_H 1 + +/* dummy for now */ + +#define map_page_into_agp(page) +#define unmap_page_from_agp(page) +#define flush_agp_mappings() +#define flush_agp_cache() mb() + +#endif diff --git a/include/asm-x86_64/agp.h b/include/asm-x86_64/agp.h new file mode 100644 index 000000000000..8c2fabe80419 --- /dev/null +++ b/include/asm-x86_64/agp.h @@ -0,0 +1,23 @@ +#ifndef AGP_H +#define AGP_H 1 + +#include <asm/cacheflush.h> + +/* + * Functions to keep the agpgart mappings coherent. + * The GART gives the CPU a physical alias of memory. The alias is + * mapped uncacheable. Make sure there are no conflicting mappings + * with different cachability attributes for the same page. + */ + +#define map_page_into_agp(page) \ + change_page_attr(page, __pgprot(__PAGE_KERNEL | _PAGE_PCD)) +#define unmap_page_from_agp(page) change_page_attr(page, PAGE_KERNEL) +#define flush_agp_mappings() global_flush_tlb() + +/* Could use CLFLUSH here if the cpu supports it. But then it would + need to be called for each cacheline of the whole page so it may not be + worth it. Would need a page for it. */ +#define flush_agp_cache() asm volatile("wbinvd":::"memory") + +#endif diff --git a/include/asm-x86_64/cacheflush.h b/include/asm-x86_64/cacheflush.h index 58d027dfc5ff..319e65a7047f 100644 --- a/include/asm-x86_64/cacheflush.h +++ b/include/asm-x86_64/cacheflush.h @@ -15,4 +15,7 @@ #define flush_icache_page(vma,pg) do { } while (0) #define flush_icache_user_range(vma,pg,adr,len) do { } while (0) +void global_flush_tlb(void); +int change_page_attr(struct page *page, int numpages, pgprot_t prot); + #endif /* _I386_CACHEFLUSH_H */ diff --git a/include/asm-x86_64/i387.h b/include/asm-x86_64/i387.h index edb75edb063e..2a0292c00b54 100644 --- a/include/asm-x86_64/i387.h +++ b/include/asm-x86_64/i387.h @@ -16,11 +16,22 @@ #include <asm/processor.h> #include <asm/sigcontext.h> #include <asm/user.h> +#include <asm/thread_info.h> extern void fpu_init(void); extern void init_fpu(void); int save_i387(struct _fpstate *buf); +static inline int need_signal_i387(struct task_struct *me) +{ + if (!me->used_math) + return 0; + me->used_math = 0; + if (!test_thread_flag(TIF_USEDFPU)) + return 0; + return 1; +} + /* * FPU lazy state save handling... 
*/ diff --git a/include/asm-x86_64/ia32.h b/include/asm-x86_64/ia32.h index e57c2e593007..7830bf40cfd4 100644 --- a/include/asm-x86_64/ia32.h +++ b/include/asm-x86_64/ia32.h @@ -18,7 +18,9 @@ typedef int __kernel_clock_t32; typedef int __kernel_pid_t32; typedef unsigned short __kernel_ipc_pid_t32; typedef unsigned short __kernel_uid_t32; +typedef unsigned __kernel_uid32_t32; typedef unsigned short __kernel_gid_t32; +typedef unsigned __kernel_gid32_t32; typedef unsigned short __kernel_dev_t32; typedef unsigned int __kernel_ino_t32; typedef unsigned short __kernel_mode_t32; diff --git a/include/asm-x86_64/ipc.h b/include/asm-x86_64/ipc.h index 49ea4fdc19b4..2ca5773be061 100644 --- a/include/asm-x86_64/ipc.h +++ b/include/asm-x86_64/ipc.h @@ -1,34 +1,6 @@ #ifndef __i386_IPC_H__ #define __i386_IPC_H__ -/* - * These are used to wrap system calls on x86. - * - * See arch/i386/kernel/sys_i386.c for ugly details.. - * - * (on x86-64 only used for 32bit emulation) - */ - -struct ipc_kludge { - struct msgbuf *msgp; - long msgtyp; -}; - -#define SEMOP 1 -#define SEMGET 2 -#define SEMCTL 3 -#define MSGSND 11 -#define MSGRCV 12 -#define MSGGET 13 -#define MSGCTL 14 -#define SHMAT 21 -#define SHMDT 22 -#define SHMGET 23 -#define SHMCTL 24 - -/* Used by the DIPC package, try and avoid reusing it */ -#define DIPC 25 - -#define IPCCALL(version,op) ((version)<<16 | (op)) +/* dummy */ #endif diff --git a/include/asm-x86_64/kmap_types.h b/include/asm-x86_64/kmap_types.h index 7e9a5661c698..bab20a2a676b 100644 --- a/include/asm-x86_64/kmap_types.h +++ b/include/asm-x86_64/kmap_types.h @@ -7,7 +7,8 @@ enum km_type { KM_SKB_DATA_SOFTIRQ, KM_USER0, KM_USER1, - KM_BIO_IRQ, + KM_BIO_SRC_IRQ, + KM_BIO_DST_IRQ, KM_TYPE_NR }; diff --git a/include/asm-x86_64/mmu_context.h b/include/asm-x86_64/mmu_context.h index e9f6d661cf4c..e21f0e6721f8 100644 --- a/include/asm-x86_64/mmu_context.h +++ b/include/asm-x86_64/mmu_context.h @@ -19,8 +19,8 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm); static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk, unsigned cpu) { - if(cpu_tlbstate[cpu].state == TLBSTATE_OK) - cpu_tlbstate[cpu].state = TLBSTATE_LAZY; + if (read_pda(mmu_state) == TLBSTATE_OK) + write_pda(mmu_state, TLBSTATE_LAZY); } #else static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk, unsigned cpu) @@ -35,8 +35,8 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, /* stop flush ipis for the previous mm */ clear_bit(cpu, &prev->cpu_vm_mask); #ifdef CONFIG_SMP - cpu_tlbstate[cpu].state = TLBSTATE_OK; - cpu_tlbstate[cpu].active_mm = next; + write_pda(mmu_state, TLBSTATE_OK); + write_pda(active_mm, next); #endif set_bit(cpu, &next->cpu_vm_mask); /* Re-load page tables */ @@ -48,8 +48,8 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, } #ifdef CONFIG_SMP else { - cpu_tlbstate[cpu].state = TLBSTATE_OK; - if(cpu_tlbstate[cpu].active_mm != next) + write_pda(mmu_state, TLBSTATE_OK); + if (read_pda(active_mm) != next) out_of_line_bug(); if(!test_and_set_bit(cpu, &next->cpu_vm_mask)) { /* We were in lazy tlb mode and leave_mm disabled diff --git a/include/asm-x86_64/msr.h b/include/asm-x86_64/msr.h index 7e522c2f4846..4085cc8c5dbe 100644 --- a/include/asm-x86_64/msr.h +++ b/include/asm-x86_64/msr.h @@ -95,6 +95,7 @@ #define MSR_IA32_PERFCTR0 0xc1 #define MSR_IA32_PERFCTR1 0xc2 +#define MSR_MTRRcap 0x0fe #define MSR_IA32_BBL_CR_CTL 0x119 #define MSR_IA32_MCG_CAP 0x179 @@ -110,6 +111,19 @@ #define 
MSR_IA32_LASTINTFROMIP 0x1dd #define MSR_IA32_LASTINTTOIP 0x1de +#define MSR_MTRRfix64K_00000 0x250 +#define MSR_MTRRfix16K_80000 0x258 +#define MSR_MTRRfix16K_A0000 0x259 +#define MSR_MTRRfix4K_C0000 0x268 +#define MSR_MTRRfix4K_C8000 0x269 +#define MSR_MTRRfix4K_D0000 0x26a +#define MSR_MTRRfix4K_D8000 0x26b +#define MSR_MTRRfix4K_E0000 0x26c +#define MSR_MTRRfix4K_E8000 0x26d +#define MSR_MTRRfix4K_F0000 0x26e +#define MSR_MTRRfix4K_F8000 0x26f +#define MSR_MTRRdefType 0x2ff + #define MSR_IA32_MC0_CTL 0x400 #define MSR_IA32_MC0_STATUS 0x401 #define MSR_IA32_MC0_ADDR 0x402 @@ -171,11 +185,4 @@ #define MSR_IA32_APICBASE_ENABLE (1<<11) #define MSR_IA32_APICBASE_BASE (0xfffff<<12) - -#define MSR_IA32_THERM_CONTROL 0x19a -#define MSR_IA32_THERM_INTERRUPT 0x19b -#define MSR_IA32_THERM_STATUS 0x19c -#define MSR_IA32_MISC_ENABLE 0x1a0 - - #endif diff --git a/include/asm-x86_64/mtrr.h b/include/asm-x86_64/mtrr.h index ff3ea870d0d6..6505d7bd6ece 100644 --- a/include/asm-x86_64/mtrr.h +++ b/include/asm-x86_64/mtrr.h @@ -30,16 +30,16 @@ struct mtrr_sentry { - unsigned long base; /* Base address */ - unsigned long size; /* Size of region */ + __u64 base; /* Base address */ + __u32 size; /* Size of region */ unsigned int type; /* Type of region */ }; struct mtrr_gentry { + __u64 base; /* Base address */ + __u32 size; /* Size of region */ unsigned int regnum; /* Register number */ - unsigned long base; /* Base address */ - unsigned long size; /* Size of region */ unsigned int type; /* Type of region */ }; @@ -81,46 +81,38 @@ static char *mtrr_strings[MTRR_NUM_TYPES] = #ifdef __KERNEL__ /* The following functions are for use by other drivers */ -# ifdef CONFIG_MTRR -extern int mtrr_add (unsigned long base, unsigned long size, - unsigned int type, char increment); -extern int mtrr_add_page (unsigned long base, unsigned long size, - unsigned int type, char increment); -extern int mtrr_del (int reg, unsigned long base, unsigned long size); -extern int mtrr_del_page (int reg, unsigned long base, unsigned long size); -extern void mtrr_centaur_report_mcr(int mcr, u32 lo, u32 hi); -# else -static __inline__ int mtrr_add (unsigned long base, unsigned long size, +#ifdef CONFIG_MTRR +extern int mtrr_add (__u64 base, __u32 size, unsigned int type, char increment); +extern int mtrr_add_page (__u64 base, __u32 size, unsigned int type, char increment); +extern int mtrr_del (int reg, __u64 base, __u32 size); +extern int mtrr_del_page (int reg, __u64 base, __u32 size); +#else +static __inline__ int mtrr_add (__u64 base, __u32 size, unsigned int type, char increment) { return -ENODEV; } -static __inline__ int mtrr_add_page (unsigned long base, unsigned long size, +static __inline__ int mtrr_add_page (__u64 base, __u32 size, unsigned int type, char increment) { return -ENODEV; } -static __inline__ int mtrr_del (int reg, unsigned long base, - unsigned long size) +static __inline__ int mtrr_del (int reg, __u64 base, __u32 size) { return -ENODEV; } -static __inline__ int mtrr_del_page (int reg, unsigned long base, - unsigned long size) +static __inline__ int mtrr_del_page (int reg, __u64 base, __u32 size) { return -ENODEV; } - -static __inline__ void mtrr_centaur_report_mcr(int mcr, u32 lo, u32 hi) {;} - -# endif +#endif /* The following functions are for initialisation: don't use them! 
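*/

Note the retyped interface: base is now a full 64-bit physical address
(__u64) and size a 32-bit byte count (__u32). A sketch of a driver
registering a write-combining framebuffer range under the new
signatures; names, addresses and sizes are illustrative:

#include <linux/kernel.h>
#include <asm/mtrr.h>

static int myfb_setup_wc(void)		/* hypothetical */
{
	int reg = mtrr_add(0xe8000000ULL, 0x800000,
			   MTRR_TYPE_WRCOMB, 1);

	if (reg < 0)
		printk(KERN_WARNING "myfb: no write-combining (%d)\n", reg);
	return reg;
}

/* teardown passes the same pair back: mtrr_del(reg, base, size) */

/*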
*/ extern int mtrr_init (void); -# if defined(CONFIG_SMP) && defined(CONFIG_MTRR) +#if defined(CONFIG_SMP) && defined(CONFIG_MTRR) extern void mtrr_init_boot_cpu (void); extern void mtrr_init_secondary_cpu (void); -# endif +#endif #endif diff --git a/include/asm-x86_64/pda.h b/include/asm-x86_64/pda.h index 7ff508346013..eb38cf70fb90 100644 --- a/include/asm-x86_64/pda.h +++ b/include/asm-x86_64/pda.h @@ -22,6 +22,8 @@ struct x8664_pda { unsigned int __local_bh_count; unsigned int __nmi_count; /* arch dependent */ struct task_struct * __ksoftirqd_task; /* waitqueue is too large */ + struct mm_struct *active_mm; + int mmu_state; } ____cacheline_aligned; #define PDA_STACKOFFSET (5*8) diff --git a/include/asm-x86_64/processor.h b/include/asm-x86_64/processor.h index 4cda0f055a5f..03875338aedf 100644 --- a/include/asm-x86_64/processor.h +++ b/include/asm-x86_64/processor.h @@ -45,21 +45,12 @@ struct cpuinfo_x86 { __u8 x86_vendor; /* CPU vendor */ __u8 x86_model; __u8 x86_mask; - /* We know that wp_works_ok = 1, hlt_works_ok = 1, hard_math = 1, - etc... */ - char wp_works_ok; /* It doesn't on 386's */ - char hlt_works_ok; /* Problems on some 486Dx4's and old 386's */ - char hard_math; - char rfu; int cpuid_level; /* Maximum supported CPUID level, -1=no CPUID */ __u32 x86_capability[NCAPINTS]; char x86_vendor_id[16]; char x86_model_id[64]; int x86_cache_size; /* in KB - valid for CPUS which support this call */ - int fdiv_bug; - int f00f_bug; - int coma_bug; unsigned long loops_per_jiffy; } ____cacheline_aligned; @@ -323,7 +314,7 @@ struct thread_struct { /* IO permissions. the bitmap could be moved into the GDT, that would make switch faster for a limited number of ioperm using tasks. -AK */ int ioperm; - u32 io_bitmap[IO_BITMAP_SIZE+1]; + u32 *io_bitmap_ptr; }; #define INIT_THREAD { \ diff --git a/include/asm-x86_64/spinlock.h b/include/asm-x86_64/spinlock.h index 6f1d71c65a68..a276217b88a3 100644 --- a/include/asm-x86_64/spinlock.h +++ b/include/asm-x86_64/spinlock.h @@ -15,7 +15,7 @@ extern int printk(const char * fmt, ...) typedef struct { volatile unsigned int lock; -#ifdef CONFIG_DEBUG_SPINLOCK +#if SPINLOCK_DEBUG unsigned magic; #endif } spinlock_t; @@ -39,7 +39,7 @@ typedef struct { * We make no fairness assumptions. They have a cost. 
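*/

The pda.h hunk above is what the mmu_context.h changes earlier rely
on: per-CPU TLB state moves out of the old cpu_tlbstate[] array and
into the x86-64 per-processor data area, reached via the gs-relative
read_pda()/write_pda() accessors. The resulting idiom, as used by the
lazy-TLB transitions in this patch:

/* entering lazy TLB mode on this CPU */
if (read_pda(mmu_state) == TLBSTATE_OK)
	write_pda(mmu_state, TLBSTATE_LAZY);

/* switching to mm "next": mark this CPU's state consistent again */
write_pda(mmu_state, TLBSTATE_OK);
write_pda(active_mm, next);

/*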
*/ -#define spin_is_locked(x) (*(volatile char *)(&(x)->lock) <= 0) +#define spin_is_locked(x) (*(volatile signed char *)(&(x)->lock) <= 0) #define spin_unlock_wait(x) do { barrier(); } while(spin_is_locked(x)) #define spin_lock_string \ @@ -62,7 +62,7 @@ typedef struct { static inline int _raw_spin_trylock(spinlock_t *lock) { - char oldval; + signed char oldval; __asm__ __volatile__( "xchgb %b0,%1" :"=q" (oldval), "=m" (lock->lock) diff --git a/include/asm-x86_64/string.h b/include/asm-x86_64/string.h index ec456eadb674..27876b9da06a 100644 --- a/include/asm-x86_64/string.h +++ b/include/asm-x86_64/string.h @@ -40,18 +40,9 @@ extern void *__memcpy(void *to, const void *from, size_t len); __ret = __builtin_memcpy((dst),(src),__len); \ __ret; }) -#if 0 + #define __HAVE_ARCH_MEMSET -extern void *__memset(void *mem, int val, size_t len); -#define memset(dst,val,len) \ - ({ size_t __len = (len); \ - void *__ret; \ - if (__builtin_constant_p(len) && __len >= 64) \ - __ret = __memset((dst),(val),__len); \ - else \ - __ret = __builtin_memset((dst),(val),__len); \ - __ret; }) -#endif +#define memset __builtin_memset #define __HAVE_ARCH_MEMMOVE void * memmove(void * dest,const void *src,size_t count); diff --git a/include/asm-x86_64/suspend.h b/include/asm-x86_64/suspend.h new file mode 100644 index 000000000000..9f065f8fe33d --- /dev/null +++ b/include/asm-x86_64/suspend.h @@ -0,0 +1,6 @@ +#ifndef SUSPEND_H +#define SUSPEND_H 1 + +/* dummy for now */ + +#endif diff --git a/include/asm-x86_64/system.h b/include/asm-x86_64/system.h index 1df84d087823..9d6c6f1f48d5 100644 --- a/include/asm-x86_64/system.h +++ b/include/asm-x86_64/system.h @@ -13,7 +13,10 @@ #define LOCK_PREFIX "" #endif -#define prepare_to_switch() do {} while(0) +#define prepare_arch_schedule(prev) do { } while(0) +#define finish_arch_schedule(prev) do { } while(0) +#define prepare_arch_switch(rq) do { } while(0) +#define finish_arch_switch(rq) spin_unlock_irq(&(rq)->lock) #define __STR(x) #x #define STR(x) __STR(x) @@ -41,7 +44,7 @@ __POP(rax) __POP(r15) __POP(r14) __POP(r13) __POP(r12) __POP(r11) __POP(r10) \ __POP(r9) __POP(r8) -#define switch_to(prev,next) \ +#define switch_to(prev,next,last) \ asm volatile(SAVE_CONTEXT \ "movq %%rsp,%[prevrsp]\n\t" \ "movq %[nextrsp],%%rsp\n\t" \ diff --git a/include/asm-x86_64/timex.h b/include/asm-x86_64/timex.h index b87680d9e51a..98bddc2d805a 100644 --- a/include/asm-x86_64/timex.h +++ b/include/asm-x86_64/timex.h @@ -48,6 +48,4 @@ static inline cycles_t get_cycles (void) extern unsigned int cpu_khz; -#define ARCH_HAS_JIFFIES_64 - #endif diff --git a/include/asm-x86_64/tlbflush.h b/include/asm-x86_64/tlbflush.h index 3f086b2d03b3..2e811ac262af 100644 --- a/include/asm-x86_64/tlbflush.h +++ b/include/asm-x86_64/tlbflush.h @@ -106,15 +106,6 @@ static inline void flush_tlb_range(struct vm_area_struct * vma, unsigned long st #define TLBSTATE_OK 1 #define TLBSTATE_LAZY 2 -struct tlb_state -{ - struct mm_struct *active_mm; - int state; - char __cacheline_padding[24]; -}; -extern struct tlb_state cpu_tlbstate[NR_CPUS]; - - #endif #define flush_tlb_kernel_range(start, end) flush_tlb_all() diff --git a/include/linux/bio.h b/include/linux/bio.h index b244108a27a8..ffc38fca9c1e 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -21,6 +21,8 @@ #define __LINUX_BIO_H #include <linux/kdev_t.h> +#include <linux/highmem.h> + /* Platforms may set this to teach the BIO layer about IOMMU hardware. 
*/ #include <asm/io.h> #ifndef BIO_VMERGE_BOUNDARY @@ -47,9 +49,6 @@ struct bio_vec { unsigned int bv_offset; }; -/* - * weee, c forward decl... - */ struct bio; typedef void (bio_end_io_t) (struct bio *); typedef void (bio_destructor_t) (struct bio *); @@ -206,4 +205,49 @@ extern inline void bio_init(struct bio *); extern int bio_ioctl(kdev_t, unsigned int, unsigned long); +#ifdef CONFIG_HIGHMEM +/* + * remember to add offset! and never ever reenable interrupts between a + * bio_kmap_irq and bio_kunmap_irq!! + * + * This function MUST be inlined - it plays with the CPU interrupt flags. + * Hence the `extern inline'. + */ +extern inline char *bio_kmap_irq(struct bio *bio, unsigned long *flags) +{ + unsigned long addr; + + __save_flags(*flags); + + /* + * could be low + */ + if (!PageHighMem(bio_page(bio))) + return bio_data(bio); + + /* + * it's a highmem page + */ + __cli(); + addr = (unsigned long) kmap_atomic(bio_page(bio), KM_BIO_SRC_IRQ); + + if (addr & ~PAGE_MASK) + BUG(); + + return (char *) addr + bio_offset(bio); +} + +extern inline void bio_kunmap_irq(char *buffer, unsigned long *flags) +{ + unsigned long ptr = (unsigned long) buffer & PAGE_MASK; + + kunmap_atomic((void *) ptr, KM_BIO_SRC_IRQ); + __restore_flags(*flags); +} + +#else +#define bio_kmap_irq(bio, flags) (bio_data(bio)) +#define bio_kunmap_irq(buf, flags) do { *(flags) = 0; } while (0) +#endif + #endif /* __LINUX_BIO_H */ diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index ef86a3ed6e64..c0c099834df2 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -246,12 +246,7 @@ extern unsigned long blk_max_low_pfn, blk_max_pfn; #define BLK_BOUNCE_ISA (ISA_DMA_THRESHOLD) extern int init_emergency_isa_pool(void); -extern void create_bounce(unsigned long pfn, int gfp, struct bio **bio_orig); - -extern inline void blk_queue_bounce(request_queue_t *q, struct bio **bio) -{ - create_bounce(q->bounce_pfn, q->bounce_gfp, bio); -} +void blk_queue_bounce(request_queue_t *q, struct bio **bio); #define rq_for_each_bio(bio, rq) \ if ((rq->bio)) \ diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index 90767fc78617..4fc6bab55825 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -108,12 +108,7 @@ BUFFER_FNS(Async_Read, async_read) BUFFER_FNS(Async_Write, async_write) BUFFER_FNS(Boundary, boundary) -/* - * FIXME: this is used only by bh_kmap, which is used only by RAID5. 
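*/

bio_kmap_irq()/bio_kunmap_irq() above move from highmem.h into bio.h
essentially unchanged (only the kmap slot is renamed KM_BIO_SRC_IRQ),
and keep their hard rule: interrupts stay disabled between the pair
and the same flags cookie must round-trip. A sketch of the only safe
usage pattern; bio, buf and len are caller-supplied, with len no
larger than the current segment:

unsigned long flags;
char *data = bio_kmap_irq(bio, &flags);	/* interrupts now off */

memcpy(buf, data, len);		/* no sleeping, no irq re-enabling here */
bio_kunmap_irq(data, &flags);	/* restores the saved interrupt state */

/*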
- * Move all that stuff into raid5.c - */ #define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK) - #define touch_buffer(bh) mark_page_accessed(bh->b_page) /* If we *know* page->private refers to buffer_heads */ @@ -124,16 +119,6 @@ BUFFER_FNS(Boundary, boundary) ((struct buffer_head *)(page)->private); \ }) #define page_has_buffers(page) PagePrivate(page) -#define set_page_buffers(page, buffers) \ - do { \ - SetPagePrivate(page); \ - page->private = (unsigned long)buffers; \ - } while (0) -#define clear_page_buffers(page) \ - do { \ - ClearPagePrivate(page); \ - page->private = 0; \ - } while (0) #define invalidate_buffers(dev) __invalidate_buffers((dev), 0) #define destroy_buffers(dev) __invalidate_buffers((dev), 1) @@ -175,15 +160,14 @@ int fsync_dev(kdev_t); int fsync_bdev(struct block_device *); int fsync_super(struct super_block *); int fsync_no_super(struct block_device *); -struct buffer_head *__get_hash_table(struct block_device *, sector_t, int); +struct buffer_head *__find_get_block(struct block_device *, sector_t, int); struct buffer_head * __getblk(struct block_device *, sector_t, int); void __brelse(struct buffer_head *); void __bforget(struct buffer_head *); struct buffer_head * __bread(struct block_device *, int, int); void wakeup_bdflush(void); -struct buffer_head *alloc_buffer_head(int async); +struct buffer_head *alloc_buffer_head(void); void free_buffer_head(struct buffer_head * bh); -int brw_page(int, struct page *, struct block_device *, sector_t [], int); void FASTCALL(unlock_buffer(struct buffer_head *bh)); /* @@ -270,9 +254,9 @@ static inline struct buffer_head * sb_getblk(struct super_block *sb, int block) } static inline struct buffer_head * -sb_get_hash_table(struct super_block *sb, int block) +sb_find_get_block(struct super_block *sb, int block) { - return __get_hash_table(sb->s_bdev, block, sb->s_blocksize); + return __find_get_block(sb->s_bdev, block, sb->s_blocksize); } static inline void diff --git a/include/linux/highmem.h b/include/linux/highmem.h index da66723d62c5..68c841afc622 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -2,7 +2,6 @@ #define _LINUX_HIGHMEM_H #include <linux/config.h> -#include <linux/bio.h> #include <linux/fs.h> #include <asm/cacheflush.h> @@ -15,45 +14,8 @@ extern struct page *highmem_start_page; /* declarations for linux/mm/highmem.c */ unsigned int nr_free_highpages(void); -extern void create_bounce(unsigned long pfn, int gfp, struct bio **bio_orig); extern void check_highmem_ptes(void); -/* - * remember to add offset! and never ever reenable interrupts between a - * bio_kmap_irq and bio_kunmap_irq!! 
- */ -static inline char *bio_kmap_irq(struct bio *bio, unsigned long *flags) -{ - unsigned long addr; - - __save_flags(*flags); - - /* - * could be low - */ - if (!PageHighMem(bio_page(bio))) - return bio_data(bio); - - /* - * it's a highmem page - */ - __cli(); - addr = (unsigned long) kmap_atomic(bio_page(bio), KM_BIO_IRQ); - - if (addr & ~PAGE_MASK) - BUG(); - - return (char *) addr + bio_offset(bio); -} - -static inline void bio_kunmap_irq(char *buffer, unsigned long *flags) -{ - unsigned long ptr = (unsigned long) buffer & PAGE_MASK; - - kunmap_atomic((void *) ptr, KM_BIO_IRQ); - __restore_flags(*flags); -} - #else /* CONFIG_HIGHMEM */ static inline unsigned int nr_free_highpages(void) { return 0; } @@ -65,12 +27,6 @@ static inline void *kmap(struct page *page) { return page_address(page); } #define kmap_atomic(page,idx) kmap(page) #define kunmap_atomic(page,idx) kunmap(page) -#define bh_kmap(bh) ((bh)->b_data) -#define bh_kunmap(bh) do { } while (0) - -#define bio_kmap_irq(bio, flags) (bio_data(bio)) -#define bio_kunmap_irq(buf, flags) do { *(flags) = 0; } while (0) - #endif /* CONFIG_HIGHMEM */ /* when CONFIG_HIGHMEM is not set these will be plain clear/copy_page */ diff --git a/include/linux/ide.h b/include/linux/ide.h index e07d0f19fcd1..03c21c567ce4 100644 --- a/include/linux/ide.h +++ b/include/linux/ide.h @@ -15,6 +15,7 @@ #include <linux/devfs_fs_kernel.h> #include <linux/interrupt.h> #include <linux/bitops.h> +#include <linux/bio.h> #include <asm/byteorder.h> #include <asm/hdreg.h> diff --git a/include/linux/jbd.h b/include/linux/jbd.h index 835d38c9dbfc..683c1247fd70 100644 --- a/include/linux/jbd.h +++ b/include/linux/jbd.h @@ -238,6 +238,7 @@ enum jbd_state_bits { BUFFER_FNS(JBD, jbd) BUFFER_FNS(JBDDirty, jbddirty) TAS_BUFFER_FNS(JBDDirty, jbddirty) +BUFFER_FNS(Freed, freed) static inline struct buffer_head *jh2bh(struct journal_head *jh) { diff --git a/include/linux/loop.h b/include/linux/loop.h index d4dc0665a92d..4dfa8b14a586 100644 --- a/include/linux/loop.h +++ b/include/linux/loop.h @@ -62,14 +62,6 @@ typedef int (* transfer_proc_t)(struct loop_device *, int cmd, char *raw_buf, char *loop_buf, int size, int real_block); -static inline int lo_do_transfer(struct loop_device *lo, int cmd, char *rbuf, - char *lbuf, int size, int rblock) -{ - if (!lo->transfer) - return 0; - - return lo->transfer(lo, cmd, rbuf, lbuf, size, rblock); -} #endif /* __KERNEL__ */ /* diff --git a/include/linux/poll.h b/include/linux/poll.h index 796aac51388a..86b1ee2d3eb3 100644 --- a/include/linux/poll.h +++ b/include/linux/poll.h @@ -10,13 +10,32 @@ #include <linux/mm.h> #include <asm/uaccess.h> -struct poll_table_page; +#define POLL_INLINE_BYTES 256 +#define FAST_SELECT_MAX 128 +#define FAST_POLL_MAX 128 +#define POLL_INLINE_ENTRIES (1+(POLL_INLINE_BYTES / sizeof(struct poll_table_entry))) + +struct poll_table_entry { + struct file * filp; + wait_queue_t wait; + wait_queue_head_t * wait_address; +}; + +struct poll_table_page { + struct poll_table_page * next; + struct poll_table_entry * entry; + struct poll_table_entry entries[0]; +}; typedef struct poll_table_struct { int error; struct poll_table_page * table; + struct poll_table_page inline_page; + struct poll_table_entry inline_table[POLL_INLINE_ENTRIES]; } poll_table; +#define POLL_INLINE_TABLE_LEN (sizeof(poll_table) - offsetof(poll_table, inline_page)) + extern void __pollwait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p); static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, 
poll_table *p) @@ -30,6 +49,7 @@ static inline void poll_initwait(poll_table* pt) pt->error = 0; pt->table = NULL; } + extern void poll_freewait(poll_table* pt); @@ -49,27 +69,6 @@ typedef struct { #define FDS_LONGS(nr) (((nr)+FDS_BITPERLONG-1)/FDS_BITPERLONG) #define FDS_BYTES(nr) (FDS_LONGS(nr)*sizeof(long)) -/* - * We do a VERIFY_WRITE here even though we are only reading this time: - * we'll write to it eventually.. - * - * Use "unsigned long" accesses to let user-mode fd_set's be long-aligned. - */ -static inline -int get_fd_set(unsigned long nr, void *ufdset, unsigned long *fdset) -{ - nr = FDS_BYTES(nr); - if (ufdset) { - int error; - error = verify_area(VERIFY_WRITE, ufdset, nr); - if (!error && __copy_from_user(fdset, ufdset, nr)) - error = -EFAULT; - return error; - } - memset(fdset, 0, nr); - return 0; -} - static inline void set_fd_set(unsigned long nr, void *ufdset, unsigned long *fdset) { @@ -77,12 +76,6 @@ void set_fd_set(unsigned long nr, void *ufdset, unsigned long *fdset) __copy_to_user(ufdset, fdset, FDS_BYTES(nr)); } -static inline -void zero_fd_set(unsigned long nr, unsigned long *fdset) -{ - memset(fdset, 0, FDS_BYTES(nr)); -} - extern int do_select(int n, fd_set_bits *fds, long *timeout); #endif /* KERNEL */ diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h index 5c25120581a7..67f7bf471798 100644 --- a/include/linux/raid/raid5.h +++ b/include/linux/raid/raid5.h @@ -3,6 +3,7 @@ #include <linux/raid/md.h> #include <linux/raid/xor.h> +#include <linux/bio.h> /* * diff --git a/include/linux/reiserfs_fs.h b/include/linux/reiserfs_fs.h index 4a3d16d7b8dc..29f6063b3546 100644 --- a/include/linux/reiserfs_fs.h +++ b/include/linux/reiserfs_fs.h @@ -1651,7 +1651,7 @@ extern wait_queue_head_t reiserfs_commit_thread_wait ; #define JOURNAL_BUFFER(j,n) ((j)->j_ap_blocks[((j)->j_start + (n)) % JOURNAL_BLOCK_COUNT]) // We need these to make journal.c code more readable -#define journal_get_hash_table(s, block) __get_hash_table(SB_JOURNAL(s)->j_dev_bd, block, s->s_blocksize) +#define journal_find_get_block(s, block) __find_get_block(SB_JOURNAL(s)->j_dev_bd, block, s->s_blocksize) #define journal_getblk(s, block) __getblk(SB_JOURNAL(s)->j_dev_bd, block, s->s_blocksize) #define journal_bread(s, block) __bread(SB_JOURNAL(s)->j_dev_bd, block, s->s_blocksize) diff --git a/include/linux/sched.h b/include/linux/sched.h index 3b43d3bb1123..9e7d80851c32 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -7,7 +7,6 @@ extern unsigned long event; #include <linux/config.h> #include <linux/capability.h> -#include <linux/tqueue.h> #include <linux/threads.h> #include <linux/kernel.h> #include <linux/types.h> @@ -160,7 +159,6 @@ extern unsigned long cache_decay_ticks; extern signed long FASTCALL(schedule_timeout(signed long timeout)); asmlinkage void schedule(void); -extern int schedule_task(struct tq_struct *task); extern void flush_scheduled_tasks(void); extern int start_context_thread(void); extern int current_is_keventd(void); diff --git a/include/linux/swap.h b/include/linux/swap.h index d0160265e3c5..0b448a811a39 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -5,6 +5,7 @@ #include <linux/kdev_t.h> #include <linux/linkage.h> #include <linux/mmzone.h> +#include <linux/list.h> #include <asm/page.h> #define SWAP_FLAG_PREFER 0x8000 /* set if swap priority specified */ @@ -62,6 +63,21 @@ typedef struct { #ifdef __KERNEL__ /* + * A swap extent maps a range of a swapfile's PAGE_SIZE pages onto a range of + * disk blocks. 
A list of swap extents maps the entire swapfile. (Where the + * term `swapfile' refers to either a blockdevice or an IS_REG file. Apart + * from setup, they're handled identically.) + * + * We always assume that blocks are of size PAGE_SIZE. + */ +struct swap_extent { + struct list_head list; + pgoff_t start_page; + pgoff_t nr_pages; + sector_t start_block; +}; + +/* * Max bad pages in the new format.. */ #define __swapoffset(x) ((unsigned long)&((union swap_header *)0)->x) @@ -83,11 +99,17 @@ enum { /* * The in-memory structure used to track swap areas. + * extent_list.prev points at the lowest-index extent. That list is + * sorted. */ struct swap_info_struct { unsigned int flags; spinlock_t sdev_lock; struct file *swap_file; + struct block_device *bdev; + struct list_head extent_list; + int nr_extents; + struct swap_extent *curr_swap_extent; unsigned old_block_size; unsigned short * swap_map; unsigned int lowest_bit; @@ -134,8 +156,9 @@ extern wait_queue_head_t kswapd_wait; extern int FASTCALL(try_to_free_pages(zone_t *, unsigned int, unsigned int)); /* linux/mm/page_io.c */ -extern void rw_swap_page(int, struct page *); -extern void rw_swap_page_nolock(int, swp_entry_t, char *); +int swap_readpage(struct file *file, struct page *page); +int swap_writepage(struct page *page); +int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page); /* linux/mm/page_alloc.c */ @@ -163,12 +186,13 @@ extern unsigned int nr_swapfiles; extern struct swap_info_struct swap_info[]; extern void si_swapinfo(struct sysinfo *); extern swp_entry_t get_swap_page(void); -extern void get_swaphandle_info(swp_entry_t, unsigned long *, struct inode **); extern int swap_duplicate(swp_entry_t); -extern int swap_count(struct page *); extern int valid_swaphandles(swp_entry_t, unsigned long *); extern void swap_free(swp_entry_t); extern void free_swap_and_cache(swp_entry_t); +sector_t map_swap_page(struct swap_info_struct *p, pgoff_t offset); +struct swap_info_struct *get_swap_info_struct(unsigned type); + struct swap_list_t { int head; /* head of priority-ordered swapfile list */ int next; /* swapfile to be used next */ diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h index a5a6684f9a50..488bc05dbcc1 100644 --- a/include/linux/sysctl.h +++ b/include/linux/sysctl.h @@ -130,16 +130,21 @@ enum /* CTL_VM names: */ enum { - VM_SWAPCTL=1, /* struct: Set vm swapping control */ - VM_SWAPOUT=2, /* int: Linear or sqrt() swapout for hogs */ - VM_FREEPG=3, /* struct: Set free page thresholds */ + VM_UNUSED1=1, /* was: struct: Set vm swapping control */ + VM_UNUSED2=2, /* was: int: Linear or sqrt() swapout for hogs */ + VM_UNUSED3=3, /* was: struct: Set free page thresholds */ VM_BDFLUSH_UNUSED=4, /* Spare */ VM_OVERCOMMIT_MEMORY=5, /* Turn off the virtual memory safety limit */ - VM_BUFFERMEM=6, /* struct: Set buffer memory thresholds */ - VM_PAGECACHE=7, /* struct: Set cache memory thresholds */ + VM_UNUSED4=6, /* was: struct: Set buffer memory thresholds */ + VM_UNUSED5=7, /* was: struct: Set cache memory thresholds */ VM_PAGERDAEMON=8, /* struct: Control kswapd behaviour */ - VM_PGT_CACHE=9, /* struct: Set page table cache parameters */ - VM_PAGE_CLUSTER=10 /* int: set number of pages to swap together */ + VM_UNUSED6=9, /* was: struct: Set page table cache parameters */ + VM_PAGE_CLUSTER=10, /* int: set number of pages to swap together */ + VM_DIRTY_BACKGROUND=11, /* dirty_background_ratio */ + VM_DIRTY_ASYNC=12, /* dirty_async_ratio */ + VM_DIRTY_SYNC=13, /* dirty_sync_ratio */ + VM_DIRTY_WB_CS=14, /*
dirty_writeback_centisecs */ + VM_DIRTY_EXPIRE_CS=15, /* dirty_expire_centisecs */ }; diff --git a/include/linux/timer.h b/include/linux/timer.h index d6f0ce5f8740..6e1e61a4c07b 100644 --- a/include/linux/timer.h +++ b/include/linux/timer.h @@ -25,10 +25,8 @@ extern int del_timer(struct timer_list * timer); #ifdef CONFIG_SMP extern int del_timer_sync(struct timer_list * timer); -extern void sync_timers(void); #else #define del_timer_sync(t) del_timer(t) -#define sync_timers() do { } while (0) #endif /* diff --git a/include/linux/tqueue.h b/include/linux/tqueue.h index 3d3047027229..d4729c518f22 100644 --- a/include/linux/tqueue.h +++ b/include/linux/tqueue.h @@ -110,6 +110,9 @@ static inline int queue_task(struct tq_struct *bh_pointer, task_queue *bh_list) return ret; } +/* Schedule a tq to run in process context */ +extern int schedule_task(struct tq_struct *task); + /* * Call all "bottom halfs" on a given list. */ diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 4051c031a976..9cc67b500368 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -13,6 +13,7 @@ struct vm_struct { unsigned long flags; void * addr; unsigned long size; + unsigned long phys_addr; struct vm_struct * next; }; @@ -23,6 +24,8 @@ extern long vread(char *buf, char *addr, unsigned long count); extern void vmfree_area_pages(unsigned long address, unsigned long size); extern int vmalloc_area_pages(unsigned long address, unsigned long size, int gfp_mask, pgprot_t prot); +extern struct vm_struct *remove_kernel_area(void *addr); + /* * Various ways to allocate pages. */ diff --git a/include/linux/writeback.h b/include/linux/writeback.h index cf706c783eda..a06b0f116ebd 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -45,6 +45,12 @@ static inline void wait_on_inode(struct inode *inode) /* * mm/page-writeback.c */ +extern int dirty_background_ratio; +extern int dirty_async_ratio; +extern int dirty_sync_ratio; +extern int dirty_writeback_centisecs; +extern int dirty_expire_centisecs; + void balance_dirty_pages(struct address_space *mapping); void balance_dirty_pages_ratelimited(struct address_space *mapping); int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0); diff --git a/kernel/context.c b/kernel/context.c index 56bada438f61..c49f914430e0 100644 --- a/kernel/context.c +++ b/kernel/context.c @@ -20,6 +20,7 @@ #include <linux/unistd.h> #include <linux/signal.h> #include <linux/completion.h> +#include <linux/tqueue.h> static DECLARE_TASK_QUEUE(tq_context); static DECLARE_WAIT_QUEUE_HEAD(context_task_wq); diff --git a/kernel/kmod.c b/kernel/kmod.c index a9f0ddb521cc..05388d9557fa 100644 --- a/kernel/kmod.c +++ b/kernel/kmod.c @@ -28,6 +28,7 @@ #include <linux/namespace.h> #include <linux/completion.h> #include <linux/file.h> +#include <linux/tqueue.h> #include <asm/uaccess.h> diff --git a/kernel/ksyms.c b/kernel/ksyms.c index 9391bb0e933d..8b2511787ccb 100644 --- a/kernel/ksyms.c +++ b/kernel/ksyms.c @@ -120,7 +120,7 @@ EXPORT_SYMBOL(vmtruncate); EXPORT_SYMBOL(find_vma); EXPORT_SYMBOL(get_unmapped_area); EXPORT_SYMBOL(init_mm); -EXPORT_SYMBOL(create_bounce); +EXPORT_SYMBOL(blk_queue_bounce); #ifdef CONFIG_HIGHMEM EXPORT_SYMBOL(kmap_high); EXPORT_SYMBOL(kunmap_high); @@ -551,7 +551,7 @@ EXPORT_SYMBOL(file_fsync); EXPORT_SYMBOL(fsync_buffers_list); EXPORT_SYMBOL(clear_inode); EXPORT_SYMBOL(init_special_inode); -EXPORT_SYMBOL(__get_hash_table); +EXPORT_SYMBOL(__find_get_block); EXPORT_SYMBOL(new_inode); EXPORT_SYMBOL(__insert_inode_hash); 
EXPORT_SYMBOL(remove_inode_hash); @@ -559,7 +559,6 @@ EXPORT_SYMBOL(buffer_insert_list); EXPORT_SYMBOL(make_bad_inode); EXPORT_SYMBOL(is_bad_inode); EXPORT_SYMBOL(event); -EXPORT_SYMBOL(brw_page); #ifdef CONFIG_UID16 EXPORT_SYMBOL(overflowuid); diff --git a/kernel/suspend.c b/kernel/suspend.c index 2fcf5db57868..12e5b0f01f57 100644 --- a/kernel/suspend.c +++ b/kernel/suspend.c @@ -320,14 +320,15 @@ static void mark_swapfiles(swp_entry_t prev, int mode) { swp_entry_t entry; union diskpage *cur; - - cur = (union diskpage *)get_free_page(GFP_ATOMIC); - if (!cur) + struct page *page; + + page = alloc_page(GFP_ATOMIC); + if (!page) panic("Out of memory in mark_swapfiles"); + cur = page_address(page); /* XXX: this is dirty hack to get first page of swap file */ entry = swp_entry(root_swap, 0); - lock_page(virt_to_page((unsigned long)cur)); - rw_swap_page_nolock(READ, entry, (char *) cur); + rw_swap_page_sync(READ, entry, page); if (mode == MARK_SWAP_RESUME) { if (!memcmp("SUSP1R",cur->swh.magic.magic,6)) @@ -345,10 +346,8 @@ static void mark_swapfiles(swp_entry_t prev, int mode) cur->link.next = prev; /* prev is the first/last swap page of the resume area */ /* link.next lies *no more* in last 4 bytes of magic */ } - lock_page(virt_to_page((unsigned long)cur)); - rw_swap_page_nolock(WRITE, entry, (char *)cur); - - free_page((unsigned long)cur); + rw_swap_page_sync(WRITE, entry, page); + __free_page(page); } static void read_swapfiles(void) /* This is called before saving image */ @@ -409,6 +408,7 @@ static int write_suspend_image(void) int nr_pgdir_pages = SUSPEND_PD_PAGES(nr_copy_pages); union diskpage *cur, *buffer = (union diskpage *)get_free_page(GFP_ATOMIC); unsigned long address; + struct page *page; PRINTS( "Writing data to swap (%d pages): ", nr_copy_pages ); for (i=0; i<nr_copy_pages; i++) { @@ -421,13 +421,8 @@ static int write_suspend_image(void) panic("\nPage %d: not enough swapspace on suspend device", i ); address = (pagedir_nosave+i)->address; - lock_page(virt_to_page(address)); - { - long dummy1; - struct inode *suspend_file; - get_swaphandle_info(entry, &dummy1, &suspend_file); - } - rw_swap_page_nolock(WRITE, entry, (char *) address); + page = virt_to_page(address); + rw_swap_page_sync(WRITE, entry, page); (pagedir_nosave+i)->swap_address = entry; } PRINTK(" done\n"); @@ -452,8 +447,8 @@ static int write_suspend_image(void) if (PAGE_SIZE % sizeof(struct pbe)) panic("I need PAGE_SIZE to be integer multiple of struct pbe, otherwise next assignment could damage pagedir"); cur->link.next = prev; - lock_page(virt_to_page((unsigned long)cur)); - rw_swap_page_nolock(WRITE, entry, (char *) cur); + page = virt_to_page((unsigned long)cur); + rw_swap_page_sync(WRITE, entry, page); prev = entry; } PRINTK(", header"); @@ -473,8 +468,8 @@ static int write_suspend_image(void) cur->link.next = prev; - lock_page(virt_to_page((unsigned long)cur)); - rw_swap_page_nolock(WRITE, entry, (char *) cur); + page = virt_to_page((unsigned long)cur); + rw_swap_page_sync(WRITE, entry, page); prev = entry; PRINTK( ", signature" ); diff --git a/kernel/sys.c b/kernel/sys.c index 3bd38f344817..2ba72b6c87d4 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -16,6 +16,7 @@ #include <linux/init.h> #include <linux/highuid.h> #include <linux/fs.h> +#include <linux/tqueue.h> #include <linux/device.h> #include <asm/uaccess.h> diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 7eb271716af9..f0c6215b1718 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -31,6 +31,7 @@ #include <linux/init.h> #include <linux/sysrq.h> 
#include <linux/highuid.h> +#include <linux/writeback.h> #include <asm/uaccess.h> @@ -264,6 +265,19 @@ static ctl_table vm_table[] = { &pager_daemon, sizeof(pager_daemon_t), 0644, NULL, &proc_dointvec}, {VM_PAGE_CLUSTER, "page-cluster", &page_cluster, sizeof(int), 0644, NULL, &proc_dointvec}, + {VM_DIRTY_BACKGROUND, "dirty_background_ratio", + &dirty_background_ratio, sizeof(dirty_background_ratio), + 0644, NULL, &proc_dointvec}, + {VM_DIRTY_ASYNC, "dirty_async_ratio", &dirty_async_ratio, + sizeof(dirty_async_ratio), 0644, NULL, &proc_dointvec}, + {VM_DIRTY_SYNC, "dirty_sync_ratio", &dirty_sync_ratio, + sizeof(dirty_sync_ratio), 0644, NULL, &proc_dointvec}, + {VM_DIRTY_WB_CS, "dirty_writeback_centisecs", + &dirty_writeback_centisecs, sizeof(dirty_writeback_centisecs), 0644, + NULL, &proc_dointvec}, + {VM_DIRTY_EXPIRE_CS, "dirty_expire_centisecs", + &dirty_expire_centisecs, sizeof(dirty_expire_centisecs), 0644, + NULL, &proc_dointvec}, {0} }; diff --git a/kernel/timer.c b/kernel/timer.c index 0b7efa84970b..858954c871e1 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -22,6 +22,7 @@ #include <linux/delay.h> #include <linux/smp_lock.h> #include <linux/interrupt.h> +#include <linux/tqueue.h> #include <linux/kernel_stat.h> #include <asm/uaccess.h> @@ -69,11 +70,11 @@ unsigned long event; extern int do_setitimer(int, struct itimerval *, struct itimerval *); /* - * The 64-bit value is not volatile - you MUST NOT read it + * The 64-bit jiffies value is not atomic - you MUST NOT read it * without holding read_lock_irq(&xtime_lock). * jiffies is defined in the linker script... */ -u64 jiffies_64; + unsigned int * prof_buffer; unsigned long prof_len; @@ -231,11 +232,6 @@ int del_timer(struct timer_list * timer) } #ifdef CONFIG_SMP -void sync_timers(void) -{ - spin_unlock_wait(&global_bh_lock); -} - /* * SMP specific function to delete periodic timer. * Caller must disable by some means restarting the timer diff --git a/lib/radix-tree.c b/lib/radix-tree.c index 689a5448ea31..e17cd888fc3d 100644 --- a/lib/radix-tree.c +++ b/lib/radix-tree.c @@ -29,7 +29,7 @@ /* * Radix tree node definition. */ -#define RADIX_TREE_MAP_SHIFT 7 +#define RADIX_TREE_MAP_SHIFT 6 #define RADIX_TREE_MAP_SIZE (1UL << RADIX_TREE_MAP_SHIFT) #define RADIX_TREE_MAP_MASK (RADIX_TREE_MAP_SIZE-1) diff --git a/mm/filemap.c b/mm/filemap.c index 0b6edcc0d0eb..a31fbce9e196 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -445,8 +445,10 @@ int fail_writepage(struct page *page) { /* Only activate on memory-pressure, not fsync.. */ if (current->flags & PF_MEMALLOC) { - activate_page(page); - SetPageReferenced(page); + if (!PageActive(page)) + activate_page(page); + if (!PageReferenced(page)) + SetPageReferenced(page); } /* Set the page dirty again, unlock */ @@ -868,55 +870,35 @@ struct page *grab_cache_page(struct address_space *mapping, unsigned long index) * This is intended for speculative data generators, where the data can * be regenerated if the page couldn't be grabbed. This routine should * be safe to call while holding the lock for another page. + * + * Clear __GFP_FS when allocating the page to avoid recursion into the fs + * and deadlock against the caller's locked page. 
*/ -struct page *grab_cache_page_nowait(struct address_space *mapping, unsigned long index) +struct page * +grab_cache_page_nowait(struct address_space *mapping, unsigned long index) { - struct page *page; - - page = find_get_page(mapping, index); - - if ( page ) { - if ( !TestSetPageLocked(page) ) { - /* Page found and locked */ - /* This test is overly paranoid, but what the heck... */ - if ( unlikely(page->mapping != mapping || page->index != index) ) { - /* Someone reallocated this page under us. */ - unlock_page(page); - page_cache_release(page); - return NULL; - } else { - return page; - } - } else { - /* Page locked by someone else */ - page_cache_release(page); - return NULL; - } - } - - page = page_cache_alloc(mapping); - if (unlikely(!page)) - return NULL; /* Failed to allocate a page */ + struct page *page = find_get_page(mapping, index); - if (unlikely(add_to_page_cache_unique(page, mapping, index))) { - /* - * Someone else grabbed the page already, or - * failed to allocate a radix-tree node - */ + if (page) { + if (!TestSetPageLocked(page)) + return page; page_cache_release(page); return NULL; } - + page = alloc_pages(mapping->gfp_mask & ~__GFP_FS, 0); + if (page && add_to_page_cache_unique(page, mapping, index)) { + page_cache_release(page); + page = NULL; + } return page; } /* * Mark a page as having seen activity. * - * If it was already so marked, move it - * to the active queue and drop the referenced - * bit. Otherwise, just mark it for future - * action.. + * inactive,unreferenced -> inactive,referenced + * inactive,referenced -> active,unreferenced + * active,unreferenced -> active,referenced */ void mark_page_accessed(struct page *page) { @@ -924,10 +906,9 @@ void mark_page_accessed(struct page *page) activate_page(page); ClearPageReferenced(page); return; + } else if (!PageReferenced(page)) { + SetPageReferenced(page); } - - /* Mark the page referenced, AFTER checking for previous usage.. 
 */ - SetPageReferenced(page); } /* @@ -2286,7 +2267,8 @@ generic_file_write(struct file *file, const char *buf, } } kunmap(page); - SetPageReferenced(page); + if (!PageReferenced(page)) + SetPageReferenced(page); unlock_page(page); page_cache_release(page); if (status < 0) diff --git a/mm/highmem.c b/mm/highmem.c index de5ebeb0a167..ae9c5a26376b 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -17,6 +17,7 @@ */ #include <linux/mm.h> +#include <linux/bio.h> #include <linux/pagemap.h> #include <linux/mempool.h> #include <linux/blkdev.h> @@ -347,13 +348,15 @@ static void bounce_end_io_read_isa(struct bio *bio) return __bounce_end_io_read(bio, isa_page_pool); } -void create_bounce(unsigned long pfn, int gfp, struct bio **bio_orig) +void blk_queue_bounce(request_queue_t *q, struct bio **bio_orig) { struct page *page; struct bio *bio = NULL; int i, rw = bio_data_dir(*bio_orig), bio_gfp; struct bio_vec *to, *from; mempool_t *pool; + unsigned long pfn = q->bounce_pfn; + int gfp = q->bounce_gfp; BUG_ON((*bio_orig)->bi_idx); diff --git a/mm/msync.c b/mm/msync.c index 2a2b31de8957..5ea980e6b1dc 100644 --- a/mm/msync.c +++ b/mm/msync.c @@ -169,7 +169,7 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags) { unsigned long end; struct vm_area_struct * vma; - int unmapped_error, error = -EINVAL; + int unmapped_error, error = -ENOMEM; down_read(&current->mm->mmap_sem); if (start & ~PAGE_MASK) @@ -185,18 +185,18 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags) goto out; /* * If the interval [start,end) covers some unmapped address ranges, - * just ignore them, but return -EFAULT at the end. + * just ignore them, but return -ENOMEM at the end. */ vma = find_vma(current->mm, start); unmapped_error = 0; for (;;) { /* Still start < end. */ - error = -EFAULT; + error = -ENOMEM; if (!vma) goto out; /* Here start < vma->vm_end. */ if (start < vma->vm_start) { - unmapped_error = -EFAULT; + unmapped_error = -ENOMEM; start = vma->vm_start; } /* Here vma->vm_start <= start < vma->vm_end. */ @@ -220,5 +220,3 @@ out: up_read(&current->mm->mmap_sem); return error; } - - diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 082e8fb8cb16..6d4555c3fb91 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -26,29 +26,56 @@ * The maximum number of pages to writeout in a single bdflush/kupdate * operation. We do this so we don't hold I_LOCK against an inode for * enormous amounts of time, which would block a userspace task which has - * been forced to throttle against that inode. + * been forced to throttle against that inode. Also, the code reevaluates + * the dirty state each time it has written this many pages. */ #define MAX_WRITEBACK_PAGES 1024 /* - * Memory thresholds, in percentages - * FIXME: expose these via /proc or whatever. + * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited + * will look to see if it needs to force writeback or throttling. Probably + * should be scaled by memory size. + */ +#define RATELIMIT_PAGES 1000 + +/* + * When balance_dirty_pages decides that the caller needs to perform some + * non-background writeback, this is how many pages it will attempt to write. + * It should be somewhat larger than RATELIMIT_PAGES to ensure that reasonably + * large amounts of I/O are submitted.
+ */ +#define SYNC_WRITEBACK_PAGES 1500 + + +/* + * Dirty memory thresholds, in percentages */ /* * Start background writeback (via pdflush) at this level */ -static int dirty_background_ratio = 40; +int dirty_background_ratio = 40; /* * The generator of dirty data starts async writeback at this level */ -static int dirty_async_ratio = 50; +int dirty_async_ratio = 50; /* * The generator of dirty data performs sync writeout at this level */ -static int dirty_sync_ratio = 60; +int dirty_sync_ratio = 60; + +/* + * The interval between `kupdate'-style writebacks. + */ +int dirty_writeback_centisecs = 5 * 100; + +/* + * The largest amount of time for which data is allowed to remain dirty + */ +int dirty_expire_centisecs = 30 * 100; + static void background_writeout(unsigned long _min_pages); @@ -84,12 +111,12 @@ void balance_dirty_pages(struct address_space *mapping) sync_thresh = (dirty_sync_ratio * tot) / 100; if (dirty_and_writeback > sync_thresh) { - int nr_to_write = 1500; + int nr_to_write = SYNC_WRITEBACK_PAGES; writeback_unlocked_inodes(&nr_to_write, WB_SYNC_LAST, NULL); get_page_state(&ps); } else if (dirty_and_writeback > async_thresh) { - int nr_to_write = 1500; + int nr_to_write = SYNC_WRITEBACK_PAGES; writeback_unlocked_inodes(&nr_to_write, WB_SYNC_NONE, NULL); get_page_state(&ps); @@ -118,7 +145,7 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping) int cpu; cpu = get_cpu(); - if (ratelimits[cpu].count++ >= 1000) { + if (ratelimits[cpu].count++ >= RATELIMIT_PAGES) { ratelimits[cpu].count = 0; put_cpu(); balance_dirty_pages(mapping); @@ -162,17 +189,6 @@ void wakeup_bdflush(void) pdflush_operation(background_writeout, ps.nr_dirty); } -/* - * The interval between `kupdate'-style writebacks. - * - * Traditional kupdate writes back data which is 30-35 seconds old. - * This one does that, but it also writes back just 1/6th of the dirty - * data. This is to avoid great I/O storms. - * - * We chunk the writes up and yield, to permit any throttled page-allocators - * to perform their I/O against a large file. - */ -static int wb_writeback_jifs = 5 * HZ; static struct timer_list wb_timer; /* @@ -183,9 +199,9 @@ static struct timer_list wb_timer; * just walks the superblock inode list, writing back any inodes which are * older than a specific point in time. * - * Try to run once per wb_writeback_jifs jiffies. But if a writeback event - * takes longer than a wb_writeback_jifs interval, then leave a one-second - * gap. + * Try to run once per dirty_writeback_centisecs. But if a writeback event + * takes longer than a dirty_writeback_centisecs interval, then leave a + * one-second gap. * * older_than_this takes precedence over nr_to_write. So we'll only write back * all dirty pages if they are all attached to "old" mappings. 
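
The two interval knobs are expressed in centiseconds, and the wb_kupdate() hunk that follows converts them to jiffies as (centisecs * HZ) / 100. A minimal userspace sketch of inspecting the new tunables — the /proc/sys/vm file names are assumed from the ctl_table entries this patch adds to kernel/sysctl.c, and read_vm_knob() is a hypothetical helper, not part of the patch:

#include <stdio.h>

/* Hypothetical helper: read one integer knob from /proc/sys/vm. */
static int read_vm_knob(const char *name, int *val)
{
	char path[128];
	FILE *f;
	int ok;

	snprintf(path, sizeof(path), "/proc/sys/vm/%s", name);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%d", val) == 1) ? 0 : -1;
	fclose(f);
	return ok;
}

int main(void)
{
	int wb_cs, exp_cs;

	if (read_vm_knob("dirty_writeback_centisecs", &wb_cs) == 0 &&
	    read_vm_knob("dirty_expire_centisecs", &exp_cs) == 0)
		printf("kupdate runs every %.1fs; dirty data expires after %.1fs\n",
		       wb_cs / 100.0, exp_cs / 100.0);
	return 0;
}

Writing a new value (e.g. 1000 for a ten-second interval) should take effect when wb_kupdate() next re-arms wb_timer, since the interval is recomputed from dirty_writeback_centisecs on every pass.
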
@@ -201,9 +217,9 @@ static void wb_kupdate(unsigned long arg) sync_supers(); get_page_state(&ps); - oldest_jif = jiffies - 30*HZ; + oldest_jif = jiffies - (dirty_expire_centisecs * HZ) / 100; start_jif = jiffies; - next_jif = start_jif + wb_writeback_jifs; + next_jif = start_jif + (dirty_writeback_centisecs * HZ) / 100; nr_to_write = ps.nr_dirty; writeback_unlocked_inodes(&nr_to_write, WB_SYNC_NONE, &oldest_jif); blk_run_queues(); @@ -223,7 +239,7 @@ static void wb_timer_fn(unsigned long unused) static int __init wb_timer_init(void) { init_timer(&wb_timer); - wb_timer.expires = jiffies + wb_writeback_jifs; + wb_timer.expires = jiffies + (dirty_writeback_centisecs * HZ) / 100; wb_timer.data = 0; wb_timer.function = wb_timer_fn; add_timer(&wb_timer); diff --git a/mm/page_io.c b/mm/page_io.c index 942ea274dccd..3692ead4d94c 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -14,112 +14,163 @@ #include <linux/kernel_stat.h> #include <linux/pagemap.h> #include <linux/swap.h> -#include <linux/swapctl.h> -#include <linux/buffer_head.h> /* for brw_page() */ - +#include <linux/bio.h> +#include <linux/buffer_head.h> #include <asm/pgtable.h> +#include <linux/swapops.h> -/* - * Reads or writes a swap page. - * wait=1: start I/O and wait for completion. wait=0: start asynchronous I/O. - * - * Important prevention of race condition: the caller *must* atomically - * create a unique swap cache entry for this swap page before calling - * rw_swap_page, and must lock that page. By ensuring that there is a - * single page of memory reserved for the swap entry, the normal VM page - * lock on that page also doubles as a lock on swap entries. Having only - * one lock to deal with per swap entry (rather than locking swap and memory - * independently) also makes it easier to make certain swapping operations - * atomic, which is particularly important when we are trying to ensure - * that shared pages stay shared while being swapped. 
- */ +static int +swap_get_block(struct inode *inode, sector_t iblock, + struct buffer_head *bh_result, int create) +{ + struct swap_info_struct *sis; + swp_entry_t entry; -static int rw_swap_page_base(int rw, swp_entry_t entry, struct page *page) + entry.val = iblock; + sis = get_swap_info_struct(swp_type(entry)); + bh_result->b_bdev = sis->bdev; + bh_result->b_blocknr = map_swap_page(sis, swp_offset(entry)); + bh_result->b_size = PAGE_SIZE; + set_buffer_mapped(bh_result); + return 0; +} + +static struct bio * +get_swap_bio(int gfp_flags, struct page *page, bio_end_io_t end_io) { - unsigned long offset; - sector_t zones[PAGE_SIZE/512]; - int zones_used; - int block_size; - struct inode *swapf = 0; - struct block_device *bdev; + struct bio *bio; + struct buffer_head bh; - if (rw == READ) { + bio = bio_alloc(gfp_flags, 1); + if (bio) { + swap_get_block(NULL, page->index, &bh, 1); + bio->bi_sector = bh.b_blocknr * (PAGE_SIZE >> 9); + bio->bi_bdev = bh.b_bdev; + bio->bi_io_vec[0].bv_page = page; + bio->bi_io_vec[0].bv_len = PAGE_SIZE; + bio->bi_io_vec[0].bv_offset = 0; + bio->bi_vcnt = 1; + bio->bi_idx = 0; + bio->bi_size = PAGE_SIZE; + bio->bi_end_io = end_io; + } + return bio; +} + +static void end_swap_bio_write(struct bio *bio) +{ + const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags); + struct page *page = bio->bi_io_vec[0].bv_page; + + if (!uptodate) + SetPageError(page); + end_page_writeback(page); + bio_put(bio); +} + +static void end_swap_bio_read(struct bio *bio) +{ + const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags); + struct page *page = bio->bi_io_vec[0].bv_page; + + if (!uptodate) { + SetPageError(page); ClearPageUptodate(page); - kstat.pswpin++; - } else - kstat.pswpout++; - - get_swaphandle_info(entry, &offset, &swapf); - bdev = swapf->i_bdev; - if (bdev) { - zones[0] = offset; - zones_used = 1; - block_size = PAGE_SIZE; } else { - int i, j; - unsigned int block = offset - << (PAGE_SHIFT - swapf->i_sb->s_blocksize_bits); - - block_size = swapf->i_sb->s_blocksize; - for (i=0, j=0; j< PAGE_SIZE ; i++, j += block_size) - if (!(zones[i] = bmap(swapf,block++))) { - printk("rw_swap_page: bad swap file\n"); - return 0; - } - zones_used = i; - bdev = swapf->i_sb->s_bdev; + SetPageUptodate(page); } + unlock_page(page); + bio_put(bio); +} - /* block_size == PAGE_SIZE/zones_used */ - brw_page(rw, page, bdev, zones, block_size); +/* + * We may have stale swap cache pages in memory: notice + * them here and get rid of the unnecessary final write. + */ +int swap_writepage(struct page *page) +{ + struct bio *bio; + int ret = 0; - /* Note! For consistency we do all of the logic, - * decrementing the page count, and unlocking the page in the - * swap lock map - in the IO completion handler. - */ - return 1; + if (remove_exclusive_swap_page(page)) { + unlock_page(page); + goto out; + } + bio = get_swap_bio(GFP_NOIO, page, end_swap_bio_write); + if (bio == NULL) { + ret = -ENOMEM; + goto out; + } + kstat.pswpout++; + SetPageWriteback(page); + unlock_page(page); + submit_bio(WRITE, bio); +out: + return ret; } +int swap_readpage(struct file *file, struct page *page) +{ + struct bio *bio; + int ret = 0; + + ClearPageUptodate(page); + bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read); + if (bio == NULL) { + ret = -ENOMEM; + goto out; + } + kstat.pswpin++; + submit_bio(READ, bio); +out: + return ret; +} /* - * A simple wrapper so the base function doesn't need to enforce - * that all swap pages go through the swap cache! 
We verify that: - * - the page is locked - * - it's marked as being swap-cache - * - it's associated with the swap inode + * swapper_space doesn't have a real inode, so it gets a special vm_writeback() + * so we don't need swap special cases in generic_vm_writeback(). + * + * Swap pages are PageLocked and PageWriteback while under writeout so that + * memory allocators will throttle against them. */ -void rw_swap_page(int rw, struct page *page) +static int swap_vm_writeback(struct page *page, int *nr_to_write) { - swp_entry_t entry; + struct address_space *mapping = page->mapping; - entry.val = page->index; - - if (!PageLocked(page)) - PAGE_BUG(page); - if (!PageSwapCache(page)) - PAGE_BUG(page); - if (!rw_swap_page_base(rw, entry, page)) - unlock_page(page); + unlock_page(page); + return generic_writepages(mapping, nr_to_write); } +struct address_space_operations swap_aops = { + vm_writeback: swap_vm_writeback, + writepage: swap_writepage, + readpage: swap_readpage, + sync_page: block_sync_page, + set_page_dirty: __set_page_dirty_nobuffers, +}; + /* - * The swap lock map insists that pages be in the page cache! - * Therefore we can't use it. Later when we can remove the need for the - * lock map and we can reduce the number of functions exported. + * A scruffy utility function to read or write an arbitrary swap page + * and wait on the I/O. */ -void rw_swap_page_nolock(int rw, swp_entry_t entry, char *buf) +int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page) { - struct page *page = virt_to_page(buf); - - if (!PageLocked(page)) - PAGE_BUG(page); - if (page->mapping) - PAGE_BUG(page); - /* needs sync_page to wait I/O completation */ + int ret; + + lock_page(page); + + BUG_ON(page->mapping); page->mapping = &swapper_space; - if (rw_swap_page_base(rw, entry, page)) - lock_page(page); - if (page_has_buffers(page) && !try_to_free_buffers(page)) - PAGE_BUG(page); + page->index = entry.val; + + if (rw == READ) { + ret = swap_readpage(NULL, page); + wait_on_page_locked(page); + } else { + ret = swap_writepage(page); + wait_on_page_writeback(page); + } page->mapping = NULL; - unlock_page(page); + if (ret == 0 && (!PageUptodate(page) || PageError(page))) + ret = -EIO; + return ret; } diff --git a/mm/shmem.c b/mm/shmem.c index 9367252b65b0..07bdba83bdf5 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -426,15 +426,22 @@ found: swap_free(entry); ptr[offset] = (swp_entry_t) {0}; - while (inode && move_from_swap_cache(page, idx, inode->i_mapping)) { + while (inode && (PageWriteback(page) || + move_from_swap_cache(page, idx, inode->i_mapping))) { /* * Yield for kswapd, and try again - but we're still * holding the page lock - ugh! fix this up later on. * Beware of inode being unlinked or truncated: just * leave try_to_unuse to delete_from_swap_cache if so. + * + * AKPM: We now wait on writeback too. Note that it's + * the page lock which prevents new writeback from starting. 
*/ spin_unlock(&info->lock); - yield(); + if (PageWriteback(page)) + wait_on_page_writeback(page); + else + yield(); spin_lock(&info->lock); ptr = shmem_swp_entry(info, idx, 0); if (IS_ERR(ptr)) @@ -594,9 +601,14 @@ repeat: } /* We have to do this with page locked to prevent races */ - if (TestSetPageLocked(page)) + if (TestSetPageLocked(page)) goto wait_retry; - + if (PageWriteback(page)) { + spin_unlock(&info->lock); + wait_on_page_writeback(page); + unlock_page(page); + goto repeat; + } error = move_from_swap_cache(page, idx, mapping); if (error < 0) { unlock_page(page); @@ -651,7 +663,7 @@ no_space: return ERR_PTR(-ENOSPC); wait_retry: - spin_unlock (&info->lock); + spin_unlock(&info->lock); wait_on_page_locked(page); page_cache_release(page); goto repeat; diff --git a/mm/swap_state.c b/mm/swap_state.c index 5fe5a4462bbb..4513649a1208 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -14,54 +14,27 @@ #include <linux/init.h> #include <linux/pagemap.h> #include <linux/smp_lock.h> -#include <linux/buffer_head.h> /* block_sync_page()/try_to_free_buffers() */ +#include <linux/buffer_head.h> /* block_sync_page() */ #include <asm/pgtable.h> /* - * We may have stale swap cache pages in memory: notice - * them here and get rid of the unnecessary final write. - */ -static int swap_writepage(struct page *page) -{ - if (remove_exclusive_swap_page(page)) { - unlock_page(page); - return 0; - } - rw_swap_page(WRITE, page); - return 0; -} - -/* - * swapper_space doesn't have a real inode, so it gets a special vm_writeback() - * so we don't need swap special cases in generic_vm_writeback(). - * - * Swap pages are PageLocked and PageWriteback while under writeout so that - * memory allocators will throttle against them. - */ -static int swap_vm_writeback(struct page *page, int *nr_to_write) -{ - struct address_space *mapping = page->mapping; - - unlock_page(page); - return generic_writepages(mapping, nr_to_write); -} - -static struct address_space_operations swap_aops = { - vm_writeback: swap_vm_writeback, - writepage: swap_writepage, - sync_page: block_sync_page, - set_page_dirty: __set_page_dirty_nobuffers, -}; - -/* * swapper_inode doesn't do anything much. It is really only here to * avoid some special-casing in other parts of the kernel. + * + * We set i_size to "infinity" to keep the page I/O functions happy. The swap + * block allocator makes sure that allocations are in-range. A strange + * number is chosen to prevent various arith overflows elsewhere. For example, + * `lblock' in block_read_full_page(). 
*/ static struct inode swapper_inode = { - i_mapping: &swapper_space, + i_mapping: &swapper_space, + i_size: PAGE_SIZE * 0xffffffffLL, + i_blkbits: PAGE_SHIFT, }; +extern struct address_space_operations swap_aops; + struct address_space swapper_space = { page_tree: RADIX_TREE_INIT(GFP_ATOMIC), page_lock: RW_LOCK_UNLOCKED, @@ -131,10 +104,9 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry) */ void __delete_from_swap_cache(struct page *page) { - if (!PageLocked(page)) - BUG(); - if (!PageSwapCache(page)) - BUG(); + BUG_ON(!PageLocked(page)); + BUG_ON(!PageSwapCache(page)); + BUG_ON(PageWriteback(page)); ClearPageDirty(page); __remove_inode_page(page); INC_CACHE_INFO(del_total); @@ -150,14 +122,9 @@ void delete_from_swap_cache(struct page *page) { swp_entry_t entry; - /* - * I/O should have completed and nobody can have a ref against the - * page's buffers - */ BUG_ON(!PageLocked(page)); BUG_ON(PageWriteback(page)); - if (page_has_buffers(page) && !try_to_free_buffers(page)) - BUG(); + BUG_ON(page_has_buffers(page)); entry.val = page->index; @@ -223,16 +190,9 @@ int move_from_swap_cache(struct page *page, unsigned long index, void **pslot; int err; - /* - * Drop the buffers now, before taking the page_lock. Because - * mapping->private_lock nests outside mapping->page_lock. - * This "must" succeed. The page is locked and all I/O has completed - * and nobody else has a ref against its buffers. - */ BUG_ON(!PageLocked(page)); BUG_ON(PageWriteback(page)); - if (page_has_buffers(page) && !try_to_free_buffers(page)) - BUG(); + BUG_ON(page_has_buffers(page)); write_lock(&swapper_space.page_lock); write_lock(&mapping->page_lock); @@ -362,7 +322,7 @@ struct page * read_swap_cache_async(swp_entry_t entry) /* * Initiate read into locked page and return. */ - rw_swap_page(READ, new_page); + swap_readpage(NULL, new_page); return new_page; } } while (err != -ENOENT && err != -ENOMEM); diff --git a/mm/swapfile.c b/mm/swapfile.c index 70a517bbcc16..175c812a63d6 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -16,7 +16,7 @@ #include <linux/namei.h> #include <linux/shm.h> #include <linux/blkdev.h> -#include <linux/buffer_head.h> /* for try_to_free_buffers() */ +#include <linux/buffer_head.h> #include <asm/pgtable.h> #include <linux/swapops.h> @@ -294,11 +294,14 @@ int remove_exclusive_swap_page(struct page *page) struct swap_info_struct * p; swp_entry_t entry; - if (!PageLocked(page)) - BUG(); + BUG_ON(page_has_buffers(page)); + BUG_ON(!PageLocked(page)); + if (!PageSwapCache(page)) return 0; - if (page_count(page) - !!PagePrivate(page) != 2) /* 2: us + cache */ + if (PageWriteback(page)) + return 0; + if (page_count(page) != 2) /* 2: us + cache */ return 0; entry.val = page->index; @@ -311,13 +314,8 @@ int remove_exclusive_swap_page(struct page *page) if (p->swap_map[swp_offset(entry)] == 1) { /* Recheck the page count with the pagecache lock held.. */ write_lock(&swapper_space.page_lock); - if (page_count(page) - !!PagePrivate(page) == 2) { + if ((page_count(page) == 2) && !PageWriteback(page)) { __delete_from_swap_cache(page); - /* - * NOTE: if/when swap gets buffer/page coherency - * like other mappings, we'll need to mark the buffers - * dirty here too. set_page_dirty(). 
- */ SetPageDirty(page); retval = 1; } @@ -326,9 +324,6 @@ int remove_exclusive_swap_page(struct page *page) swap_info_put(p); if (retval) { - BUG_ON(PageWriteback(page)); - if (page_has_buffers(page) && !try_to_free_buffers(page)) - BUG(); swap_free(entry); page_cache_release(page); } @@ -352,9 +347,13 @@ void free_swap_and_cache(swp_entry_t entry) swap_info_put(p); } if (page) { + int one_user; + + BUG_ON(page_has_buffers(page)); page_cache_get(page); + one_user = (page_count(page) == 2); /* Only cache user (+us), or swap space full? Free it! */ - if (page_count(page) - !!PagePrivate(page) == 2 || vm_swap_full()) { + if (!PageWriteback(page) && (one_user || vm_swap_full())) { delete_from_swap_cache(page); SetPageDirty(page); } @@ -606,6 +605,7 @@ static int try_to_unuse(unsigned int type) wait_on_page_locked(page); wait_on_page_writeback(page); lock_page(page); + wait_on_page_writeback(page); /* * Remove all references to entry, without blocking. @@ -685,11 +685,13 @@ static int try_to_unuse(unsigned int type) * Note shmem_unuse already deleted its from swap cache. */ if ((*swap_map > 1) && PageDirty(page) && PageSwapCache(page)) { - rw_swap_page(WRITE, page); + swap_writepage(page); lock_page(page); } - if (PageSwapCache(page)) + if (PageSwapCache(page)) { + wait_on_page_writeback(page); delete_from_swap_cache(page); + } /* * So we could skip searching mms once swap count went @@ -717,6 +719,207 @@ static int try_to_unuse(unsigned int type) return retval; } +/* + * Use this swapdev's extent info to locate the (PAGE_SIZE) block which + * corresponds to page offset `offset'. + */ +sector_t map_swap_page(struct swap_info_struct *sis, pgoff_t offset) +{ + struct swap_extent *se = sis->curr_swap_extent; + struct swap_extent *start_se = se; + + for ( ; ; ) { + struct list_head *lh; + + if (se->start_page <= offset && + offset < (se->start_page + se->nr_pages)) { + return se->start_block + (offset - se->start_page); + } + lh = se->list.prev; + if (lh == &sis->extent_list) + lh = lh->prev; + se = list_entry(lh, struct swap_extent, list); + sis->curr_swap_extent = se; + BUG_ON(se == start_se); /* It *must* be present */ + } +} + +/* + * Free all of a swapdev's extent information + */ +static void destroy_swap_extents(struct swap_info_struct *sis) +{ + while (!list_empty(&sis->extent_list)) { + struct swap_extent *se; + + se = list_entry(sis->extent_list.next, + struct swap_extent, list); + list_del(&se->list); + kfree(se); + } + sis->nr_extents = 0; +} + +/* + * Add a block range (and the corresponding page range) into this swapdev's + * extent list. The extent list is kept sorted in block order. + * + * This function rather assumes that it is called in ascending sector_t order. + * It doesn't look for extent coalescing opportunities. + */ +static int +add_swap_extent(struct swap_info_struct *sis, unsigned long start_page, + unsigned long nr_pages, sector_t start_block) +{ + struct swap_extent *se; + struct swap_extent *new_se; + struct list_head *lh; + + lh = sis->extent_list.next; /* The highest-addressed block */ + while (lh != &sis->extent_list) { + se = list_entry(lh, struct swap_extent, list); + if (se->start_block + se->nr_pages == start_block) { + /* Merge it */ + se->nr_pages += nr_pages; + return 0; + } + lh = lh->next; + } + + /* + * No merge. Insert a new extent, preserving ordering. 
+ */ + new_se = kmalloc(sizeof(*se), GFP_KERNEL); + if (new_se == NULL) + return -ENOMEM; + new_se->start_page = start_page; + new_se->nr_pages = nr_pages; + new_se->start_block = start_block; + + lh = sis->extent_list.prev; /* The lowest block */ + while (lh != &sis->extent_list) { + se = list_entry(lh, struct swap_extent, list); + if (se->start_block > start_block) + break; + lh = lh->prev; + } + list_add_tail(&new_se->list, lh); + sis->nr_extents++; + return 0; +} + +/* + * A `swap extent' is a simple thing which maps a contiguous range of pages + * onto a contiguous range of disk blocks. An ordered list of swap extents + * is built at swapon time and is then used at swap_writepage/swap_readpage + * time for locating where on disk a page belongs. + * + * If the swapfile is an S_ISBLK block device, a single extent is installed. + * This is done so that the main operating code can treat S_ISBLK and S_ISREG + * swap files identically. + * + * Whether the swapdev is an S_ISREG file or an S_ISBLK blockdev, the swap + * extent list operates in PAGE_SIZE disk blocks. Both S_ISREG and S_ISBLK + * swapfiles are handled *identically* after swapon time. + * + * For S_ISREG swapfiles, setup_swap_extents() will walk all the file's blocks + * and will parse them into an ordered extent list, in PAGE_SIZE chunks. If + * some stray blocks are found which do not fall within the PAGE_SIZE alignment + * requirements, they are simply tossed out - we will never use those blocks + * for swapping. + * + * The amount of disk space which a single swap extent represents varies. + * Typically it is in the 1-4 megabyte range. So we can have hundreds of + * extents in the list. To avoid much list walking, we cache the previous + * search location in `curr_swap_extent', and start new searches from there. + * This is extremely effective. The average number of iterations in + * map_swap_page() has been measured at about 0.3 per page. - akpm. + */ +static int setup_swap_extents(struct swap_info_struct *sis) +{ + struct inode *inode; + unsigned blocks_per_page; + unsigned long page_no; + unsigned blkbits; + sector_t probe_block; + sector_t last_block; + int ret; + + inode = sis->swap_file->f_dentry->d_inode; + if (S_ISBLK(inode->i_mode)) { + ret = add_swap_extent(sis, 0, sis->max, 0); + goto done; + } + + blkbits = inode->i_blkbits; + blocks_per_page = PAGE_SIZE >> blkbits; + + /* + * Map all the blocks into the extent list. This code doesn't try + * to be very smart. 
+ */ + probe_block = 0; + page_no = 0; + last_block = inode->i_size >> blkbits; + while ((probe_block + blocks_per_page) <= last_block && + page_no < sis->max) { + unsigned block_in_page; + sector_t first_block; + + first_block = bmap(inode, probe_block); + if (first_block == 0) + goto bad_bmap; + + /* + * It must be PAGE_SIZE aligned on-disk + */ + if (first_block & (blocks_per_page - 1)) { + probe_block++; + goto reprobe; + } + + for (block_in_page = 1; block_in_page < blocks_per_page; + block_in_page++) { + sector_t block; + + block = bmap(inode, probe_block + block_in_page); + if (block == 0) + goto bad_bmap; + if (block != first_block + block_in_page) { + /* Discontiguity */ + probe_block++; + goto reprobe; + } + } + + /* + * We found a PAGE_SIZE-length, PAGE_SIZE-aligned run of blocks + */ + ret = add_swap_extent(sis, page_no, 1, + first_block >> (PAGE_SHIFT - blkbits)); + if (ret) + goto out; + page_no++; + probe_block += blocks_per_page; +reprobe: + continue; + } + ret = 0; + if (page_no == 0) + ret = -EINVAL; + sis->max = page_no; + sis->highest_bit = page_no - 1; +done: + sis->curr_swap_extent = list_entry(sis->extent_list.prev, + struct swap_extent, list); + goto out; +bad_bmap: + printk(KERN_ERR "swapon: swapfile has holes\n"); + ret = -EINVAL; +out: + return ret; +} + asmlinkage long sys_swapoff(const char * specialfile) { struct swap_info_struct * p = NULL; @@ -733,7 +936,6 @@ asmlinkage long sys_swapoff(const char * specialfile) if (err) goto out; - lock_kernel(); prev = -1; swap_list_lock(); for (type = swap_list.head; type >= 0; type = swap_info[type].next) { @@ -763,9 +965,7 @@ asmlinkage long sys_swapoff(const char * specialfile) total_swap_pages -= p->pages; p->flags &= ~SWP_WRITEOK; swap_list_unlock(); - unlock_kernel(); err = try_to_unuse(type); - lock_kernel(); if (err) { /* re-insert swap space back into swap_list */ swap_list_lock(); @@ -791,6 +991,7 @@ asmlinkage long sys_swapoff(const char * specialfile) swap_map = p->swap_map; p->swap_map = NULL; p->flags = 0; + destroy_swap_extents(p); swap_device_unlock(p); swap_list_unlock(); vfree(swap_map); @@ -804,7 +1005,6 @@ asmlinkage long sys_swapoff(const char * specialfile) err = 0; out_dput: - unlock_kernel(); path_release(&nd); out: return err; @@ -858,12 +1058,12 @@ int get_swaparea_info(char *buf) asmlinkage long sys_swapon(const char * specialfile, int swap_flags) { struct swap_info_struct * p; - char *name; + char *name = NULL; struct block_device *bdev = NULL; struct file *swap_file = NULL; struct address_space *mapping; unsigned int type; - int i, j, prev; + int i, prev; int error; static int least_priority = 0; union swap_header *swap_header = 0; @@ -872,10 +1072,10 @@ asmlinkage long sys_swapon(const char * specialfile, int swap_flags) unsigned long maxpages = 1; int swapfilesize; unsigned short *swap_map; - + struct page *page = NULL; + if (!capable(CAP_SYS_ADMIN)) return -EPERM; - lock_kernel(); swap_list_lock(); p = swap_info; for (type = 0 ; type < nr_swapfiles ; type++,p++) @@ -888,7 +1088,9 @@ asmlinkage long sys_swapon(const char * specialfile, int swap_flags) } if (type >= nr_swapfiles) nr_swapfiles = type+1; + INIT_LIST_HEAD(&p->extent_list); p->flags = SWP_USED; + p->nr_extents = 0; p->swap_file = NULL; p->old_block_size = 0; p->swap_map = NULL; @@ -909,7 +1111,6 @@ asmlinkage long sys_swapon(const char * specialfile, int swap_flags) if (IS_ERR(name)) goto bad_swap_2; swap_file = filp_open(name, O_RDWR, 0); - putname(name); error = PTR_ERR(swap_file); if (IS_ERR(swap_file)) { swap_file = NULL; @@ 
-931,8 +1132,12 @@ asmlinkage long sys_swapon(const char * specialfile, int swap_flags) PAGE_SIZE); if (error < 0) goto bad_swap; - } else if (!S_ISREG(swap_file->f_dentry->d_inode->i_mode)) + p->bdev = bdev; + } else if (S_ISREG(swap_file->f_dentry->d_inode->i_mode)) { + p->bdev = swap_file->f_dentry->d_inode->i_sb->s_bdev; + } else { goto bad_swap; + } mapping = swap_file->f_dentry->d_inode->i_mapping; swapfilesize = mapping->host->i_size >> PAGE_SHIFT; @@ -946,15 +1151,20 @@ asmlinkage long sys_swapon(const char * specialfile, int swap_flags) goto bad_swap; } - swap_header = (void *) __get_free_page(GFP_USER); - if (!swap_header) { - printk("Unable to start swapping: out of memory :-)\n"); - error = -ENOMEM; + /* + * Read the swap header. + */ + page = read_cache_page(mapping, 0, + (filler_t *)mapping->a_ops->readpage, swap_file); + if (IS_ERR(page)) { + error = PTR_ERR(page); goto bad_swap; } - - lock_page(virt_to_page(swap_header)); - rw_swap_page_nolock(READ, swp_entry(type,0), (char *) swap_header); + wait_on_page_locked(page); + if (!PageUptodate(page)) + goto bad_swap; + kmap(page); + swap_header = page_address(page); if (!memcmp("SWAP-SPACE",swap_header->magic.magic,10)) swap_header_version = 1; @@ -968,33 +1178,10 @@ asmlinkage long sys_swapon(const char * specialfile, int swap_flags) switch (swap_header_version) { case 1: - memset(((char *) swap_header)+PAGE_SIZE-10,0,10); - j = 0; - p->lowest_bit = 0; - p->highest_bit = 0; - for (i = 1 ; i < 8*PAGE_SIZE ; i++) { - if (test_bit(i,(unsigned long *) swap_header)) { - if (!p->lowest_bit) - p->lowest_bit = i; - p->highest_bit = i; - maxpages = i+1; - j++; - } - } - nr_good_pages = j; - p->swap_map = vmalloc(maxpages * sizeof(short)); - if (!p->swap_map) { - error = -ENOMEM; - goto bad_swap; - } - for (i = 1 ; i < maxpages ; i++) { - if (test_bit(i,(unsigned long *) swap_header)) - p->swap_map[i] = 0; - else - p->swap_map[i] = SWAP_MAP_BAD; - } - break; - + printk(KERN_ERR "version 0 swap is no longer supported. " + "Use mkswap -v1 %s\n", name); + error = -EINVAL; + goto bad_swap; case 2: /* Check the swap header's sub-version and the size of the swap file and bad block lists */ @@ -1050,15 +1237,20 @@ asmlinkage long sys_swapon(const char * specialfile, int swap_flags) goto bad_swap; } p->swap_map[0] = SWAP_MAP_BAD; + p->max = maxpages; + p->pages = nr_good_pages; + + if (setup_swap_extents(p)) + goto bad_swap; + swap_list_lock(); swap_device_lock(p); - p->max = maxpages; p->flags = SWP_ACTIVE; - p->pages = nr_good_pages; nr_swap_pages += nr_good_pages; total_swap_pages += nr_good_pages; - printk(KERN_INFO "Adding Swap: %dk swap-space (priority %d)\n", - nr_good_pages<<(PAGE_SHIFT-10), p->prio); + printk(KERN_INFO "Adding %dk swap on %s. Priority:%d extents:%d\n", + nr_good_pages<<(PAGE_SHIFT-10), name, + p->prio, p->nr_extents); /* insert swap space into swap_list: */ prev = -1; @@ -1092,14 +1284,18 @@ bad_swap_2: if (!(swap_flags & SWAP_FLAG_PREFER)) ++least_priority; swap_list_unlock(); + destroy_swap_extents(p); if (swap_map) vfree(swap_map); if (swap_file && !IS_ERR(swap_file)) filp_close(swap_file, NULL); out: - if (swap_header) - free_page((long) swap_header); - unlock_kernel(); + if (page && !IS_ERR(page)) { + kunmap(page); + page_cache_release(page); + } + if (name) + putname(name); return error; } @@ -1168,78 +1364,10 @@ bad_file: goto out; } -/* - * Page lock needs to be held in all cases to prevent races with - * swap file deletion. 
- */ -int swap_count(struct page *page) +struct swap_info_struct * +get_swap_info_struct(unsigned type) { - struct swap_info_struct * p; - unsigned long offset, type; - swp_entry_t entry; - int retval = 0; - - entry.val = page->index; - if (!entry.val) - goto bad_entry; - type = swp_type(entry); - if (type >= nr_swapfiles) - goto bad_file; - p = type + swap_info; - offset = swp_offset(entry); - if (offset >= p->max) - goto bad_offset; - if (!p->swap_map[offset]) - goto bad_unused; - retval = p->swap_map[offset]; -out: - return retval; - -bad_entry: - printk(KERN_ERR "swap_count: null entry!\n"); - goto out; -bad_file: - printk(KERN_ERR "swap_count: %s%08lx\n", Bad_file, entry.val); - goto out; -bad_offset: - printk(KERN_ERR "swap_count: %s%08lx\n", Bad_offset, entry.val); - goto out; -bad_unused: - printk(KERN_ERR "swap_count: %s%08lx\n", Unused_offset, entry.val); - goto out; -} - -/* - * Prior swap_duplicate protects against swap device deletion. - */ -void get_swaphandle_info(swp_entry_t entry, unsigned long *offset, - struct inode **swapf) -{ - unsigned long type; - struct swap_info_struct *p; - - type = swp_type(entry); - if (type >= nr_swapfiles) { - printk(KERN_ERR "rw_swap_page: %s%08lx\n", Bad_file, entry.val); - return; - } - - p = &swap_info[type]; - *offset = swp_offset(entry); - if (*offset >= p->max && *offset != 0) { - printk(KERN_ERR "rw_swap_page: %s%08lx\n", Bad_offset, entry.val); - return; - } - if (p->swap_map && !p->swap_map[*offset]) { - printk(KERN_ERR "rw_swap_page: %s%08lx\n", Unused_offset, entry.val); - return; - } - if (!(p->flags & SWP_USED)) { - printk(KERN_ERR "rw_swap_page: %s%08lx\n", Unused_file, entry.val); - return; - } - - *swapf = p->swap_file->f_dentry->d_inode; + return &swap_info[type]; } /* diff --git a/mm/vmalloc.c b/mm/vmalloc.c index f95ebed746b0..50cc6d13f0ff 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -195,6 +195,7 @@ struct vm_struct * get_vm_area(unsigned long size, unsigned long flags) if (addr > VMALLOC_END-size) goto out; } + area->phys_addr = 0; area->flags = flags; area->addr = (void *)addr; area->size = size; @@ -209,9 +210,25 @@ out: return NULL; } -void vfree(void * addr) +struct vm_struct *remove_kernel_area(void *addr) { struct vm_struct **p, *tmp; + write_lock(&vmlist_lock); + for (p = &vmlist ; (tmp = *p) ; p = &tmp->next) { + if (tmp->addr == addr) { + *p = tmp->next; + write_unlock(&vmlist_lock); + return tmp; + } + + } + write_unlock(&vmlist_lock); + return NULL; +} + +void vfree(void * addr) +{ + struct vm_struct *tmp; if (!addr) return; @@ -219,17 +236,12 @@ void vfree(void * addr) printk(KERN_ERR "Trying to vfree() bad address (%p)\n", addr); return; } - write_lock(&vmlist_lock); - for (p = &vmlist ; (tmp = *p) ; p = &tmp->next) { - if (tmp->addr == addr) { - *p = tmp->next; + tmp = remove_kernel_area(addr); + if (tmp) { vmfree_area_pages(VMALLOC_VMADDR(tmp->addr), tmp->size); - write_unlock(&vmlist_lock); kfree(tmp); return; } - } - write_unlock(&vmlist_lock); printk(KERN_ERR "Trying to vfree() nonexistent vm area (%p)\n", addr); } diff --git a/mm/vmscan.c b/mm/vmscan.c index 91f180f2b08a..6561f2b71b35 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -392,7 +392,8 @@ shrink_cache(int nr_pages, zone_t *classzone, spin_lock(&pagemap_lru_lock); while (--max_scan >= 0 && (entry = inactive_list.prev) != &inactive_list) { - struct page * page; + struct page *page; + int may_enter_fs; if (need_resched()) { spin_unlock(&pagemap_lru_lock); @@ -427,10 +428,17 @@ shrink_cache(int nr_pages, zone_t *classzone, goto page_mapped; /* + 
* swap activity never enters the filesystem and is safe + * for GFP_NOFS allocations. + */ + may_enter_fs = (gfp_mask & __GFP_FS) || + (PageSwapCache(page) && (gfp_mask & __GFP_IO)); + + /* * IO in progress? Leave it at the back of the list. */ if (unlikely(PageWriteback(page))) { - if (gfp_mask & __GFP_FS) { + if (may_enter_fs) { page_cache_get(page); spin_unlock(&pagemap_lru_lock); wait_on_page_writeback(page); @@ -451,7 +459,7 @@ shrink_cache(int nr_pages, zone_t *classzone, mapping = page->mapping; if (PageDirty(page) && is_page_cache_freeable(page) && - page->mapping && (gfp_mask & __GFP_FS)) { + page->mapping && may_enter_fs) { /* * It is not critical here to write it only if * the page is unmapped because any direct writer @@ -480,6 +488,15 @@ shrink_cache(int nr_pages, zone_t *classzone, * If the page has buffers, try to free the buffer mappings * associated with this page. If we succeed we try to free * the page as well. + * + * We do this even if the page is PageDirty(). + * try_to_release_page() does not perform I/O, but it is + * possible for a page to have PageDirty set, but it is actually + * clean (all its buffers are clean). This happens if the + * buffers were written out directly, with submit_bh(). ext3 + * will do this, as well as the blockdev mapping. + * try_to_release_page() will discover that cleanness and will + * drop the buffers and mark the page clean - it can be freed. */ if (PagePrivate(page)) { spin_unlock(&pagemap_lru_lock); diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 8b1f2a159e19..464a56367e28 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -2419,7 +2419,7 @@ struct ip_rt_acct *ip_rt_acct; /* This code sucks. But you should have seen it before! --RR */ /* IP route accounting ptr for this logical cpu number. */ -#define IP_RT_ACCT_CPU(i) (ip_rt_acct + cpu_logical_map(i) * 256) +#define IP_RT_ACCT_CPU(i) (ip_rt_acct + i * 256) static int ip_rt_acct_read(char *buffer, char **start, off_t offset, int length, int *eof, void *data) @@ -2441,6 +2441,8 @@ static int ip_rt_acct_read(char *buffer, char **start, off_t offset, /* Add the other cpus in, one int at a time */ for (i = 1; i < NR_CPUS; i++) { unsigned int j; + if (!cpu_online(i)) + continue; for (j = 0; j < length/4; j++) ((u32*)buffer)[j] += ((u32*)IP_RT_ACCT_CPU(i))[j]; }
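
A coda on the swap-extent machinery added in mm/swapfile.c above: the lookup rule that map_swap_page() implements is simple enough to model in isolation. The sketch below uses deliberately simplified types — a plain sorted array stands in for the kernel's list_head chain and the cached curr_swap_extent — so it illustrates the mapping rule, not the kernel implementation:

#include <stdio.h>

/* Simplified stand-in for struct swap_extent (no list_head, no sector_t). */
struct extent {
	unsigned long start_page;
	unsigned long nr_pages;
	unsigned long long start_block;
};

/*
 * Map a swapfile page offset to its on-disk block, per the rule in
 * map_swap_page(): find the extent covering `offset' and add the
 * offset's distance into that extent to the extent's start block.
 */
static unsigned long long map_page(const struct extent *ext, int n,
				   unsigned long offset)
{
	int i;

	for (i = 0; i < n; i++) {
		if (offset >= ext[i].start_page &&
		    offset < ext[i].start_page + ext[i].nr_pages)
			return ext[i].start_block +
			       (offset - ext[i].start_page);
	}
	return 0;	/* not reached if the extents cover the swapfile */
}

int main(void)
{
	/* Two discontiguous 4-page runs, as bmap() probing might find. */
	const struct extent ext[] = {
		{ 0, 4, 1000 },
		{ 4, 4, 5000 },
	};

	printf("page 5 -> block %llu\n", map_page(ext, 2, 5)); /* 5001 */
	return 0;
}

The kernel version additionally starts each search from the previously hit extent; the setup_swap_extents() comment reports that this cache cuts the average walk to about 0.3 iterations per page.
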

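Likewise, the two-bit page aging that the mm/filemap.c hunk documents for mark_page_accessed() — inactive,unreferenced -> inactive,referenced -> active,unreferenced -> active,referenced — can be exercised standalone. struct fake_page below is a hypothetical stand-in for struct page's Active and Referenced flag bits:

#include <stdio.h>

/* Hypothetical model of the two flag bits mark_page_accessed() uses. */
struct fake_page {
	int active;
	int referenced;
};

/*
 * inactive,unreferenced -> inactive,referenced
 * inactive,referenced   -> active,unreferenced
 * active,unreferenced   -> active,referenced
 */
static void mark_accessed(struct fake_page *p)
{
	if (p->referenced && !p->active) {
		p->active = 1;		/* promote: activate_page() */
		p->referenced = 0;	/* ClearPageReferenced() */
	} else if (!p->referenced) {
		p->referenced = 1;	/* SetPageReferenced() */
	}
}

int main(void)
{
	struct fake_page p = { 0, 0 };
	int i;

	for (i = 0; i < 3; i++) {
		mark_accessed(&p);
		printf("touch %d: active=%d referenced=%d\n",
		       i + 1, p.active, p.referenced);
	}
	return 0;
}

Two distinct accesses are needed before a page reaches the active list, which — as the documented ladder suggests — keeps a single streaming pass through a file from displacing the active set.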