| Age | Commit message (Collapse) | Author |
|
|
|
Several hash table implementations in the networking code were
remotely exploitable: using carefully chosen forged source addresses,
a remote attacker could make every routing cache entry hash into the
same chain. Netfilter's IP conntrack module and the TCP syn-queue
implementation had identical vulnerabilities and have been fixed too.
The chosen solution to the problem involves using Bob Jenkins'
hash along with a randomly chosen input. For the ipv4 routing
cache we take things one step further and periodically choose a
new random secret. By default this happens every 10 minutes, but
this is configurable by the user via sysctl knobs.
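The idea can be sketched in userspace (a simplified keyed mixer for illustration, NOT the kernel's actual Jenkins hash; `keyed_hash`, `rt_bucket` and the constants are hypothetical names):

```c
#include <stdint.h>
#include <assert.h>

/* Simplified keyed-hash sketch: the bucket depends on a random secret,
 * so an attacker who cannot see the secret cannot force collisions.
 * The mixer below is illustrative only. */
static uint32_t keyed_hash(uint32_t saddr, uint32_t daddr, uint32_t secret)
{
    uint32_t a = saddr ^ secret;
    uint32_t b = daddr + 0x9e3779b9u;   /* golden-ratio constant */
    uint32_t c = secret;

    a -= c;  a ^= c >> 13;
    b -= a;  b ^= a << 8;
    c -= b;  c ^= b >> 13;
    return c;
}

/* Periodic rehashing is then just: pick a fresh random secret on a
 * timer, and every flow moves to an unpredictable new bucket. */
static uint32_t rt_bucket(uint32_t saddr, uint32_t daddr,
                          uint32_t secret, uint32_t nbuckets)
{
    return keyed_hash(saddr, daddr, secret) % nbuckets;
}
```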
|
|
|
|
From: Russell Miller <rmiller@duskglow.com>
A BUG or an oops will often leave a machine in a useless state. There is no
way to remotely recover the machine from that state.
The patch adds a /proc/sys/kernel/panic_on_oops sysctl which, when set, will
cause the x86 kernel to call panic() at the end of the oops handler. If the
user has also set /proc/sys/kernel/panic then a reboot will occur.
The implementation will try to sleep for a while before panicking so the oops
info has a chance of hitting the logs.
The implementation is designed so that other architectures can easily do this
in their oops handlers.
|
|
|
|
Currently it turns off prequeue processing, but more decisions
may be guided by it in the future.
Based upon a patch from Andi Kleen.
|
|
into dyn9-47-18-140.beaverton.ibm.com:/home/sridhar/BK/lksctp-2.5.52
|
|
This allows us to control the aggressiveness of the lower-zone defense
algorithm: the `incremental min'. For workloads which use a
serious amount of mlocked memory, a few megabytes is not enough.
So the `lower_zone_protection' tunable allows the administrator to
increase the amount of protection which lower zones receive against
allocations which _could_ use higher zones.
The default value of lower_zone_protection is zero, giving unchanged
behaviour. We should not normally make large amounts of memory
unavailable for pagecache just in case someone mlocks many hundreds of
megabytes.
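A minimal sketch of the watermark test being described (the flat page counts and the name `zone_allows` are hypothetical; the real code walks the zone fallback list accumulating the incremental min):

```c
#include <assert.h>

/* Sketch of the lower-zone defense: an allocation that _could_ have used
 * a higher zone may only dip into a lower zone while that zone stays
 * above pages_min plus the incremental min, further raised by the
 * lower_zone_protection tunable. All names and units are illustrative. */
static int zone_allows(long free_pages, long pages_min,
                       long incremental_min, long lower_zone_protection)
{
    return free_pages > pages_min + incremental_min + lower_zone_protection;
}
```

With lower_zone_protection at its default of zero the test reduces to the unchanged behaviour; raising it simply makes lower zones harder to steal from.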
|
|
|
|
1. Expiration of SAs; some missing updates of counters.
Question: very strange, the RFC defines use_time as the time of the
first use of an SA, but KAME setkey refers to this as lastuse.
2. Bug fixes for tunnel mode and forwarding.
3. Fix bugs in per-socket policy: policy entries do not leak but are destroyed,
when socket is closed, and are cloned on children of listening sockets.
4. Implemented use policy: i.e. use ipsec if a SA is available,
ignore if it is not.
5. Added sysctl to disable in/out policy on some devices.
It is set on loopback by default.
6. Remove resolved reference from template. It is not used,
but pollutes code.
7. Added all the SASTATEs, now they make sense.
|
|
|
|
|
|
Since /proc/sys/vm/dirty_sync_ratio went away, the name
"dirty_async_ratio" makes no sense.
So rename it to just /proc/sys/vm/dirty_ratio.
|
|
/proc/sys/vm/swappiness controls the VM's tendency to unmap pages and to
swap things out.
100 -> basically current 2.5 behaviour
0 -> not very swappy at all
The mechanism used to control swappiness is reluctance to bring
mapped pages onto the inactive list: prefer to reclaim pagecache
instead.
The control for that mechanism is as follows:
- If there is a large amount of mapped memory in the machine, we
prefer to bring mapped pages onto the inactive list.
- If page reclaim is under distress (more scanning is happening) then
prefer to bring mapped pages onto the inactive list. This is
basically the 2.4 algorithm, really.
- If the /proc/sys/vm/swappiness control is high then prefer to bring
mapped pages onto the inactive list.
The implementation is simple: calculate the above three things as
percentages and add them up. If that's over 100% then start reclaiming
mapped pages.
The `proportion of mapped memory' is downgraded so that we don't swap
just because a lot of memory is mapped into pagetables - we still need
some VM distress before starting to swap that memory out.
For a while I was adding a little bias so that we prefer to unmap
file-backed memory before swapping out anon memory, because usually
file-backed memory can be evicted and re-established with one I/O, not
two. But it was unmapping executable text too easily, so here I just
treat them equally.
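The "calculate three things as percentages and add them up" step can be sketched as follows (the halving of the mapped ratio stands in for the downgrading mentioned above; the name and exact weights are illustrative, not the kernel's code):

```c
#include <assert.h>

/* Sketch of the swap-tendency heuristic: sum the (downgraded) proportion
 * of mapped memory, the reclaim distress level and the sysctl setting.
 * Past 100 we start reclaiming mapped pages too. Illustrative only. */
static int should_reclaim_mapped(int mapped_ratio, int distress,
                                 int swappiness)
{
    int swap_tendency = mapped_ratio / 2 + distress + swappiness;

    return swap_tendency > 100;
}
```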
|
|
To my surprise a lot of big-site/beowulf type people all really want this
diff, which I'd otherwise have filed as 'interesting but not important'
|
|
Motivation for this modification is that on some wireless network
technologies in particular, delay spikes trigger an RTO even though
no packets are lost. The F-RTO sender continues by sending new data after
the RTO retransmission in order to avoid unnecessary retransmissions in that case.
If the sender sees any duplicate acks after the RTO retransmission, it
reverts to traditional slow start retransmissions. If new acks arrive
after forward transmissions, they very likely indicate that the RTO was
indeed spurious and the sender can continue sending new data (because
only one segment was retransmitted).
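The decision after the RTO retransmission can be sketched like this (a hypothetical helper, not the actual TCP code):

```c
#include <assert.h>

enum frto_verdict {
    CONVENTIONAL_RECOVERY,  /* real loss: fall back to slow-start retransmits */
    SPURIOUS_RTO            /* timeout was spurious: keep sending new data */
};

/* Sketch of the F-RTO judgment: duplicate acks after the RTO
 * retransmission mean real loss; acks for new data suggest the
 * timeout was spurious. Illustrative only. */
static enum frto_verdict frto_judge(int saw_dupack, int acked_new_data)
{
    if (saw_dupack)
        return CONVENTIONAL_RECOVERY;
    return acked_new_data ? SPURIOUS_RTO : CONVENTIONAL_RECOVERY;
}
```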
|
|
From Christoph Hellwig, acked by Rohit.
- fix config.in description: we know we're on i386 and we also know
that a feature can only be enabled if the hw supports it, the code
alone is not enough
- the sysctl is VM-related, so move it from /proc/sys/kernel to
/proc/sys/vm
- adapt to standard sysctl names
|
|
This was designed to be a really stern throttling threshold: if dirty
memory reaches this level then perform writeback and actually wait on
it.
It doesn't work, because memory dirtiers are already required to perform
writeback if the amount of dirty AND writeback memory exceeds
dirty_async_ratio.
So kill it, and rely just on the request queues being appropriately
scaled to the machine size (they are).
This is basically what 2.4 does.
|
|
into hera.kernel.org:/home/hch/BK/xfs/linux-2.5
|
|
Rohit Seth's ia32 huge tlb pages patch.
Anton Blanchard took a look at this today; he seemed happy
with it and said he could borrow bits.
|
|
|
|
This is the pid-max patch; the one I sent for 2.5.31 was botched. I
have removed the 'once' debugging stupidity - now PIDs start at 0 again.
Also, for an unknown reason the previous patch missed the hunk with
the declaration of 'DEFAULT_PID_MAX', which made it not compile ...
|
|
|
|
These were totally unused for a long time. It's interesting how
many files include swapctl.h, though..
|
|
|
|
Alan's overcommit patch, brought to 2.5 by Robert Love.
Can't say I've tested its functionality at all, but it doesn't crash,
it has been in -ac and RH kernels for some time, and I haven't observed
any of its functions in profiles.
"So what is strict VM overcommit? We introduce new overcommit
policies that attempt to never succeed an allocation that can not be
fulfilled by the backing store and consequently never OOM. This is
achieved through strict accounting of the committed address space and
a policy to allow/refuse allocations based on that accounting.
In the strictest of modes, it should be impossible to allocate more
memory than available and impossible to OOM. All memory failures
should be pushed down to the allocation routines -- malloc, mmap, etc.
The new modes are available via sysctl (same as before). See
Documentation/vm/overcommit-accounting for more information."
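The strict accounting quoted above can be sketched as a toy model (`vm_enough_memory` echoes the kernel's name, but the body here is a deliberate simplification):

```c
#include <assert.h>

/* Toy model of strict overcommit accounting: every grant is charged
 * against a commit limit, and a request that would exceed the limit is
 * refused up front instead of OOMing later. Simplified for illustration. */
static long committed_pages;

static int vm_enough_memory(long pages, long commit_limit)
{
    if (committed_pages + pages > commit_limit)
        return 0;                   /* refuse: cannot be backed */
    committed_pages += pages;       /* grant and account */
    return 1;
}

static void vm_unacct_memory(long pages)
{
    committed_pages -= pages;       /* e.g. on munmap or exit */
}
```

In the strictest mode every failure surfaces at mmap/malloc time, which is the whole point: no allocation succeeds that the backing store cannot honour.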
|
|
|
|
input subsystem.
This is needed due to the next header file changes.
|
|
Writeback/pdflush cleanup patch from Steven Augart
* Exposes nr_pdflush_threads as /proc/sys/vm/nr_pdflush_threads, read-only.
(I like this - I expect that management of the pdflush thread pool
will be important for many-spindle machines, and this is a neat way
of getting at the info).
* Adds minimum and maximum checking to the five writable pdflush
and fs-writeback parameters.
* Minor indentation fix in sysctl.c
* mm/pdflush.c now includes linux/writeback.h, which prototypes
pdflush_operation. This is so that the compiler can
automatically check that the prototype matches the definition.
* Adds a few comments to existing code.
|
|
Adds five sysctls for tuning the writeback behaviour:
dirty_async_ratio
dirty_background_ratio
dirty_sync_ratio
dirty_expire_centisecs
dirty_writeback_centisecs
these are described in Documentation/filesystems/proc.txt. They are
basically the traditional knobs which we've always had...
We are accreting a ton of obsolete sysctl numbers under /proc/sys/vm/.
I didn't recycle these - just mark them unused and remove the obsolete
documentation.
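The ratio knobs interact roughly as sketched below (simplified to two of the three ratios; the centisecs knobs drive the periodic kupdate-style pass and are not modelled, and all names here are illustrative):

```c
#include <assert.h>

/* Sketch of the threshold policy the ratios above tune: past the
 * background ratio, background writeback is kicked off; past the
 * sync ratio, the dirtying process must perform writeback itself. */
enum wb_action { WB_NONE, WB_BACKGROUND, WB_THROTTLE };

static enum wb_action writeback_action(int dirty_pct,
                                       int background_ratio, int sync_ratio)
{
    if (dirty_pct >= sync_ratio)
        return WB_THROTTLE;
    if (dirty_pct >= background_ratio)
        return WB_BACKGROUND;
    return WB_NONE;
}
```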
|
|
This changes the sysctl interface to use reasonable names in
/proc/sys/fs/quota/
|
|
Transform new quota code to use sysctl instead of /proc/fs.
|
|
[ I reversed the order in which writeback walks the superblock's
dirty inodes. It sped up dbench's unlink phase greatly. I'm
such a sleaze ]
The core writeback patch. Switches file writeback from the dirty
buffer LRU over to address_space.dirty_pages.
- The buffer LRU is removed
- The buffer hash is removed (uses blockdev pagecache lookups)
- The bdflush and kupdate functions are implemented against
address_spaces, via pdflush.
- The relationship between pages and buffers is changed.
- If a page has dirty buffers, it is marked dirty
- If a page is marked dirty, it *may* have dirty buffers.
- A dirty page may be "partially dirty". block_write_full_page
discovers this.
- A bunch of consistency checks of the form
if (!something_which_should_be_true())
buffer_error();
have been introduced. These fog the code up but are important for
ensuring that the new buffer/page code is working correctly.
- New locking (inode.i_bufferlist_lock) is introduced for exclusion
from try_to_free_buffers(). This is needed because set_page_dirty
is called under spinlock, so it cannot lock the page. But it
needs access to page->buffers to set them all dirty.
i_bufferlist_lock is also used to protect inode.i_dirty_buffers.
- fs/inode.c has been split: all the code related to file data writeback
has been moved into fs/fs-writeback.c
- Code related to file data writeback at the address_space level is in
the new mm/page-writeback.c
- try_to_free_buffers() is now non-blocking
- Switches vmscan.c over to understand that all pages with dirty data
are now marked dirty.
- Introduces a new a_op for VM writeback:
->vm_writeback(struct page *page, int *nr_to_write)
this is a bit half-baked at present. The intent is that the address_space
is given the opportunity to perform clustered writeback: to opportunistically
write out disk-contiguous dirty data which may be in other zones, and to
allow delayed-allocate filesystems to get good disk layout.
- Added address_space.io_pages. Pages which are being prepared for
writeback. This is here for two reasons:
1: It will be needed later, when BIOs are assembled direct
against pagecache, bypassing the buffer layer. It avoids a
deadlock which would occur if someone moved the page back onto the
dirty_pages list after it was added to the BIO, but before it was
submitted. (hmm. This may not be a problem with PG_writeback logic).
2: Avoids a livelock which would occur if some other thread is continually
redirtying pages.
- There are two known performance problems in this code:
1: Pages which are locked for writeback cause undesirable
blocking when they are being overwritten. A patch which leaves
pages unlocked during writeback comes later in the series.
2: While inodes are under writeback, they are locked. This
causes namespace lookups against the file to get unnecessarily
blocked in wait_on_inode(). This is a fairly minor problem.
I don't have a fix for this at present - I'll fix this when I
attach dirty address_spaces direct to super_blocks.
- The patch vastly increases the amount of dirty data which the
kernel permits highmem machines to maintain. This is because the
balancing decisions are made against the amount of memory in the
machine, not against the amount of buffercache-allocatable memory.
This may be very wrong, although it works fine for me (2.5 gigs).
We can trivially go back to the old-style throttling with
s/nr_free_pagecache_pages/nr_free_buffer_pages/ in
balance_dirty_pages(). But better would be to allow blockdev
mappings to use highmem (I'm thinking about this one, slowly). And
to move writer-throttling and writeback decisions into the VM (modulo
the file-overwriting problem).
- Drops 24 bytes from struct buffer_head. More to come.
- There's some gunk like super_block.flags:MS_FLUSHING which needs to
be killed. Need a better way of providing collision avoidance
between pdflush threads, to prevent more than one pdflush thread
working a disk at the same time.
The correct way to do that is to put a flag in the request queue to
say "there's a pdflush thread working this disk". This is easy to
do: just generalise the "ra_pages" pointer to point at a struct which
includes ra_pages and the new collision-avoidance flag.
|
|
during connect when the connection will still have a unique
identity. Fixes port space exhaustion, especially in web
caches.
Initial work done by Andi Kleen.
|
|
It is used to differentiate devices by the medium they are attached
to, and it changes proxy_arp behavior: the proxy arp feature is
enabled for packets forwarded between two devices attached to
different media.
|
|
Robert Olsson, and Alexey Kuznetsov. This changeset adds
the framework and implementation, but drivers need to be
ported to NAPI in order to take advantage of the new
facilities. NAPI is fully backwards compatible, current
drivers will continue to work as they always have.
NAPI is a way for dealing with high packet load. It allows
the driver to disable the RX interrupts on the card and enter
a polling mode. Another way to describe NAPI would be as
implicit mitigation. Once the device enters this polling
mode, it will exit back to interrupt based processing when
the receive packet queue is purged.
A full porting and description document is found at:
Documentation/networking/NAPI_HOWTO.txt
and this also makes reference to Usenix papers on the
web and other such resources available on NAPI.
NAPI has been found to not only increase packet processing
rates, it also gives greater fairness to the other interfaces
in the system which are not experiencing high packet load.
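The interrupt-to-polling transition can be modelled in a few lines (a toy queue stands in for the NIC ring; `rx_interrupt` and `napi_poll` are hypothetical names here, see the HOWTO for the real driver API):

```c
#include <assert.h>

/* Toy model of the NAPI pattern: the RX interrupt masks itself and
 * enters polling mode; polling drains the queue within a budget and
 * re-enables interrupts once the queue is purged. */
static int rx_queue;        /* packets waiting on the "card" */
static int irq_enabled = 1;
static int processed;

static void rx_interrupt(void)
{
    irq_enabled = 0;        /* mask RX interrupts, schedule polling */
}

static void napi_poll(int budget)
{
    while (rx_queue > 0 && budget-- > 0) {
        rx_queue--;         /* pull a packet off the ring */
        processed++;
    }
    if (rx_queue == 0)
        irq_enabled = 1;    /* queue purged: back to interrupt mode */
}
```

The budget is what gives other interfaces their fairness: a busy device yields after a fixed amount of work instead of monopolizing the CPU with interrupts.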
|
|
- Alan Cox: more driver merging
- Al Viro: make ext2 group allocation more readable
|
|
- me/Al Viro: fix bdget() oops with block device modules that don't
clean up after they exit
- Alan Cox: continued merging (drivers, license tags)
- David Miller: sparc update, network fixes
- Christoph Hellwig: work around broken drivers that add a gendisk more
than once
- Jakub Jelinek: handle more ELF loading special cases
- Trond Myklebust: NFS client and lockd reclaimer cleanups/fixes
- Greg KH: USB updates
- Mikael Pettersson: separate out local APIC / IO-APIC config options
|
|
- Alan Cox: much more merging
- Pete Zaitcev: ymfpci race fixes
- Andrea Arcangeli: VM race fix and OOM tweak.
- Arjan Van de Ven: merge RH kernel fixes
- Andi Kleen: use more readable 'likely()/unlikely()' instead of __builtin_expect()
- Keith Owens: fix 64-bit ELF types
- Gerd Knorr: mark more broken PCI bridges, update btaudio driver
- Paul Mackerras: powermac driver update
- me: clean up PTRACE_DETACH to use common infrastructure
|
|
- Greg KH: start migration to new "min()/max()"
- Roman Zippel: move affs over to "min()/max()".
- Vojtech Pavlik: VIA update (make sure not to IRQ-unmask a vt82c576)
- Jan Kara: quota bug-fix (don't decrement quota for non-counted inode)
- Anton Altaparmakov: more NTFS updates
- Al Viro: make nosuid/noexec/nodev be per-mount flags, not per-filesystem
- Alan Cox: merge input/joystick layer differences, driver and alpha merge
- Keith Owens: scsi Makefile cleanup
- Trond Myklebust: fix oopsable race in locking code
- Jean Tourrilhes: IrDA update
|
|
- Christoph Hellwig: clean up personality handling a bit
- Robert Love: update sysctl/vm documentation
- make the three-argument (that everybody hates) "min()" be "min_t()",
and introduce a type-anal "min()" that complains about arguments of
different types.
|
|
- Al Viro: block device cleanups
- Marcelo Tosatti: make bounce buffer allocations more robust (it's ok
for them to do IO, just not cause recursive bounce IO. So allow them)
- Anton Altaparmakov: NTFS update (1.1.17)
- Paul Mackerras: PPC update (big re-org)
- Petko Manolov: USB pegasus driver fixes
- David Miller: networking and sparc updates
- Trond Myklebust: Export atomic_dec_and_lock
- OGAWA Hirofumi: find and fix umsdos "filldir" users that were broken
by the 64-bit-cleanups. Fix msdos warnings.
- Al Viro: superblock handling cleanups and race fixes
- Johannes Erdfelt++: USB updates
|
|
- Al Viro: sanity-check user arguments, zero-terminated strings etc.
- Urban Widmark: smbfs update (server/client cache coherency etc)
- Rik van Riel, Marcelo Tosatti: VM updates
- Cort Dougan: PPC updates
- Neil Brown: raid1/5 failed drive fixups, NULL ptr checking, md error cleanup
- Neil Brown: knfsd fix for 64-bit architectures, and filehandle resolver
- Ken Brownfield: workaround for menuconfig CPU selection glitch
- David Miller: sparc64 MM setup fix, arpfilter forward port
- Keith Owens: Remove obsolete IPv6 provider based addressing
- Jari Ruusu: block_write error case cleanup fix
- Jeff Garzik: netdriver update
|
|
- Hui-Fen Hsu: sis900 driver update
- NIIBE Yutaka: Super-H update
- Alan Cox: more resyncs (ARM down, but more to go)
- David Miller: network zerocopy, Sparc sync, qlogic,FC fix, etc.
- David Miller/me: get rid of various drivers hacks to do mmap
alignment behind the back of the VM layer. Create a real
protocol for it.
|
|
- Paul Mackerras: PPC update for thread-safe page table handling
- Ingo Molnar: x86 PAE update for thread-safe page table handling
- Jeff Garzik: network driver updates, i810 rng driver, and
"alloc_etherdev()" network driver insert race condition fix.
- David Miller: UltraSparcIII update, network locking fixes
- Al Viro: fix fs counts on mount failure
|
|
|