<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/linux/writeback.h, branch v2.6.29.2</title>
<subtitle>Linux Kernel</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v2.6.29.2</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v2.6.29.2'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2009-01-06T23:59:09Z</updated>
<entry>
<title>fs: remove WB_SYNC_HOLD</title>
<updated>2009-01-06T23:59:09Z</updated>
<author>
<name>Nick Piggin</name>
<email>npiggin@suse.de</email>
</author>
<published>2009-01-06T22:40:25Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4f5a99d64c17470a784a6c68064207d82e3e74a5'/>
<id>urn:sha1:4f5a99d64c17470a784a6c68064207d82e3e74a5</id>
<content type='text'>
Remove WB_SYNC_HOLD.  The primary motivation is the design of my
anti-starvation code for fsync.  It requires taking an inode lock over the
sync operation, so we could run into lock ordering problems with multiple
inodes.  It is possible to take a single global lock to solve the ordering
problem, but then that would prevent a future nice implementation of "sync
multiple inodes" based on lock order via inode address.

It seems like a backward step to remove this, but actually it is busted
anyway: we can't use the inode lists for a data integrity wait: an inode can
be taken off the dirty lists but still be under writeback.  In order to
satisfy data integrity semantics, we should wait for it to finish
writeback, but if we only search the dirty lists, we'll miss it.
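
A minimal sketch of what a correct data integrity wait would have to do
(hypothetical, not part of this patch; locking and refcounting elided):

	static void wait_sb_inodes_sketch(struct super_block *sb)
	{
		struct inode *inode;

		/* Walk every inode on the superblock, not just the dirty
		 * lists: an inode already under writeback may have been
		 * taken off s_dirty/s_io. */
		list_for_each_entry(inode, &amp;sb-&gt;s_inodes, i_sb_list) {
			if (inode-&gt;i_mapping-&gt;nrpages == 0)
				continue;
			/* Wait for pages already in flight to hit disk. */
			filemap_fdatawait(inode-&gt;i_mapping);
		}
	}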

It would be possible to have a "writeback" list, for sys_sync, I suppose.
But why complicate things by optimising prematurely?  For unmounting, we
could avoid the "livelock avoidance" code, which would be easier, but
again premature IMO.

Fixing the existing data integrity problem will come next.

Signed-off-by: Nick Piggin &lt;npiggin@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: add dirty_background_bytes and dirty_bytes sysctls</title>
<updated>2009-01-06T23:59:03Z</updated>
<author>
<name>David Rientjes</name>
<email>rientjes@google.com</email>
</author>
<published>2009-01-06T22:39:31Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=2da02997e08d3efe8174c7a47696e6f7cbe69ba9'/>
<id>urn:sha1:2da02997e08d3efe8174c7a47696e6f7cbe69ba9</id>
<content type='text'>
This change introduces two new sysctls to /proc/sys/vm:
dirty_background_bytes and dirty_bytes.

dirty_background_bytes is the counterpart to dirty_background_ratio and
dirty_bytes is the counterpart to dirty_ratio.

With growing memory capacities of individual machines, it's no longer
sufficient to specify dirty thresholds as a percentage of the amount of
dirtyable memory over the entire system.

dirty_background_bytes and dirty_bytes specify quantities of memory, in
bytes, that represent the dirty limits for the entire system.  If either
of these values is set, its value represents the amount of dirty memory
that is needed to commence either background or direct writeback.

When a `bytes' or `ratio' file is written, its counterpart becomes a
function of the written value.  For example, if dirty_bytes is written as
8096, it is rounded up to page granularity, so 8K of memory (two 4K pages)
is required to commence direct writeback.  dirty_ratio is then functionally
equivalent to 8K / the amount of dirtyable memory:

	dirtyable_memory = free pages + mapped pages + file cache

	dirty_background_bytes = dirty_background_ratio * dirtyable_memory
		-or-
	dirty_background_ratio = dirty_background_bytes / dirtyable_memory

		AND

	dirty_bytes = dirty_ratio * dirtyable_memory
		-or-
	dirty_ratio = dirty_bytes / dirtyable_memory

Only one of dirty_background_bytes and dirty_background_ratio may be
specified at a time, and only one of dirty_bytes and dirty_ratio may be
specified.  When one sysctl is written, the other appears as 0 when read.

The `bytes' files operate on a page size granularity since dirty limits
are compared with ZVC values, which are in page units.
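
A minimal sketch of the resulting threshold selection (simplified from
get_dirty_limits(); vm_dirty_bytes and vm_dirty_ratio stand for the
sysctl-backed variables):

	unsigned long dirty_thresh_pages(unsigned long dirtyable_pages)
	{
		/* The `bytes' sysctl, when set, takes precedence and is
		 * rounded up to whole pages. */
		if (vm_dirty_bytes)
			return DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
		/* Otherwise fall back to the percentage of dirtyable memory. */
		return (vm_dirty_ratio * dirtyable_pages) / 100;
	}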

Prior to this change, the minimum dirty_ratio was 5 as enforced by
get_dirty_limits(), although /proc/sys/vm/dirty_ratio would show any
user-written value between 0 and 100.  This restriction is maintained, but
dirty_bytes has a lower limit of only one page.

Also prior to this change, dirty_background_ratio could not equal or
exceed dirty_ratio.  This restriction is maintained, and extended to
dirty_background_bytes.  If either background threshold equals or exceeds
the corresponding dirty threshold, it is implicitly set to half the dirty
threshold.

Acked-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Signed-off-by: David Rientjes &lt;rientjes@google.com&gt;
Cc: Andrea Righi &lt;righi.andrea@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: change dirty limit type specifiers to unsigned long</title>
<updated>2009-01-06T23:59:02Z</updated>
<author>
<name>David Rientjes</name>
<email>rientjes@google.com</email>
</author>
<published>2009-01-06T22:39:29Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=364aeb2849789b51bf4b9af2ddd02fee7285c54e'/>
<id>urn:sha1:364aeb2849789b51bf4b9af2ddd02fee7285c54e</id>
<content type='text'>
The background dirty limit and the dirty limit are better declared with
unsigned long type specifiers, since negative writeback thresholds are not
possible.

These values, as returned by get_dirty_limits(), are normally compared
with ZVC values to determine whether writeback shall commence or be
throttled.  Such page counts cannot be negative, so declaring the page
limits as signed is unnecessary.
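
A sketch of the resulting prototype (illustrative; the exact signature in
the tree may differ):

	void get_dirty_limits(unsigned long *pbackground,
			      unsigned long *pdirty,
			      unsigned long *pbdi_dirty,
			      struct backing_dev_info *bdi);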

Acked-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Christoph Lameter &lt;cl@linux-foundation.org&gt;
Signed-off-by: David Rientjes &lt;rientjes@google.com&gt;
Cc: Andrea Righi &lt;righi.andrea@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>vfs: Add no_nrwrite_index_update writeback control flag</title>
<updated>2008-10-16T14:09:17Z</updated>
<author>
<name>Aneesh Kumar K.V</name>
<email>aneesh.kumar@linux.vnet.ibm.com</email>
</author>
<published>2008-10-16T14:09:17Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=17bc6c30cf6bfffd816bdc53682dd46fc34a2cf4'/>
<id>urn:sha1:17bc6c30cf6bfffd816bdc53682dd46fc34a2cf4</id>
<content type='text'>
If no_nrwrite_index_update is set, we don't update nr_to_write and the
address space's writeback_index in write_cache_pages.  This change
enables a file system to skip these updates in write_cache_pages and do
them in its writepages() callback instead.  This patch will be followed
by an ext4 patch that makes use of the new flag.
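
Roughly, the tail of write_cache_pages() becomes the following (a
simplified sketch; done_index and nr_written are illustrative names):

	if (!wbc-&gt;no_nrwrite_index_update) {
		if (wbc-&gt;range_cyclic)
			mapping-&gt;writeback_index = done_index;
		wbc-&gt;nr_to_write -= nr_written;
	}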

Signed-off-by: Aneesh Kumar K.V &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Signed-off-by: "Theodore Ts'o" &lt;tytso@mit.edu&gt;
CC: linux-fsdevel@vger.kernel.org
</content>
</entry>
<entry>
<title>vfs: Remove the range_cont writeback mode.</title>
<updated>2008-10-14T13:21:02Z</updated>
<author>
<name>Aneesh Kumar K.V</name>
<email>aneesh.kumar@linux.vnet.ibm.com</email>
</author>
<published>2008-10-14T13:21:02Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=74baaaaec8b4f22e1ae279f5ecca4ff705b28912'/>
<id>urn:sha1:74baaaaec8b4f22e1ae279f5ecca4ff705b28912</id>
<content type='text'>
Ext4 was the only user of the range_cont writeback mode, and it has since
switched to a different method, so remove the now-unused range_cont mode
from the kernel.

Signed-off-by: Aneesh Kumar K.V &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Signed-off-by: "Theodore Ts'o" &lt;tytso@mit.edu&gt;
CC: linux-fsdevel@vger.kernel.org
</content>
</entry>
<entry>
<title>Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4</title>
<updated>2008-07-15T15:36:38Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2008-07-15T15:36:38Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=8d2567a620ae8c24968a2bdc1c906c724fac1f6a'/>
<id>urn:sha1:8d2567a620ae8c24968a2bdc1c906c724fac1f6a</id>
<content type='text'>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (61 commits)
  ext4: Documentation update for new ordered mode and delayed allocation
  ext4: do not set extents feature from the kernel
  ext4: Don't allow nonextent mount option for large filesystem
  ext4: Enable delalloc by default.
  ext4: delayed allocation i_blocks fix for stat
  ext4: fix delalloc i_disksize early update issue
  ext4: Handle page without buffers in ext4_*_writepage()
  ext4: Add ordered mode support for delalloc
  ext4: Invert lock ordering of page_lock and transaction start in delalloc
  mm: Add range_cont mode for writeback
  ext4: delayed allocation ENOSPC handling
  percpu_counter: new function percpu_counter_sum_and_set
  ext4: Add delayed allocation support in data=writeback mode
  vfs: add hooks for ext4's delayed allocation support
  jbd2: Remove data=ordered mode support using jbd buffer heads
  ext4: Use new framework for data=ordered mode in JBD2
  jbd2: Implement data=ordered mode handling via inodes
  vfs: export filemap_fdatawrite_range()
  ext4: Fix lock inversion in ext4_ext_truncate()
  ext4: Invert the locking order of page_lock and transaction start
  ...
</content>
</entry>
<entry>
<title>mm: Add range_cont mode for writeback</title>
<updated>2008-07-11T23:27:31Z</updated>
<author>
<name>Aneesh Kumar K.V</name>
<email>aneesh.kumar@linux.vnet.ibm.com</email>
</author>
<published>2008-07-11T23:27:31Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=06d6cf6959d22037fcec598f4f954db5db3d7356'/>
<id>urn:sha1:06d6cf6959d22037fcec598f4f954db5db3d7356</id>
<content type='text'>
Filesystems like ext4 need to start a new transaction in their
writepages callback for block allocation. This happens with delayed
allocation, and there is a limit to how many credits we can request
from the journal layer. So we call write_cache_pages multiple
times with wbc-&gt;nr_to_write set to the maximum possible value
limited by the max journal credits available.

Add a new mode to writeback that enables us to handle this
behaviour. In the new mode we update wbc-&gt;range_start
to point to the new offset to be written. The next call to
write_cache_pages will start writeout from the specified
range_start offset. In the new mode we also limit writing
to the specified wbc-&gt;range_end.
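
A sketch of the idea (simplified; `index' is the page index where the
current scan stopped):

	/* At the end of write_cache_pages(): remember where we stopped
	 * so that the next call continues from there. */
	if (wbc-&gt;range_cont)
		wbc-&gt;range_start = (loff_t)index &lt;&lt; PAGE_CACHE_SHIFT;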

Signed-off-by: Aneesh Kumar K.V &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Signed-off-by: Mingming Cao &lt;cmm@us.ibm.com&gt;
Acked-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: "Theodore Ts'o" &lt;tytso@mit.edu&gt;
</content>
</entry>
<entry>
<title>ftrace: limit trace entries</title>
<updated>2008-05-23T20:05:14Z</updated>
<author>
<name>Steven Rostedt</name>
<email>rostedt@goodmis.org</email>
</author>
<published>2008-05-12T19:21:04Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=3eefae994d9224fb7771a3ddb683868363c23510'/>
<id>urn:sha1:3eefae994d9224fb7771a3ddb683868363c23510</id>
<content type='text'>
Currently there is no protection preventing the root user from using up
all of memory for trace buffers.  If the root user allocates too many
entries, the OOM killer might start killing off tasks.

This patch adds an algorithm to check the following condition:

 pages_requested &gt; (freeable_memory + current_trace_buffer_pages) / 4

If the above condition is met, the allocation fails.  This prevents more
than 1/4 of freeable memory from being used by trace buffers.

To determine freeable_memory, I made determine_dirtyable_memory() in
mm/page-writeback.c global.
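
A minimal sketch of the check (the helper name is illustrative):

	/* Allow at most 1/4 of freeable memory for trace buffers. */
	static int trace_buffer_alloc_allowed(unsigned long pages_requested,
					      unsigned long current_buffer_pages)
	{
		unsigned long freeable = determine_dirtyable_memory();

		return pages_requested &lt;=
			(freeable + current_buffer_pages) / 4;
	}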

Special thanks goes to Peter Zijlstra for suggesting the above calculation.

Signed-off-by: Steven Rostedt &lt;srostedt@redhat.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
</content>
</entry>
<entry>
<title>mm: bdi: export BDI attributes in sysfs</title>
<updated>2008-04-30T15:29:49Z</updated>
<author>
<name>Peter Zijlstra</name>
<email>a.p.zijlstra@chello.nl</email>
</author>
<published>2008-04-30T07:54:32Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=cf0ca9fe5dd9e3693d935757a7b2fc50fc576554'/>
<id>urn:sha1:cf0ca9fe5dd9e3693d935757a7b2fc50fc576554</id>
<content type='text'>
Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info object.
This allows us to see and set the various BDI-specific variables.

In particular, this properly exposes the read-ahead window for all relevant
users, and /sys/block/&lt;block&gt;/queue/read_ahead_kb should be deprecated.
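
A sketch of one such attribute (condensed and illustrative; the patch
generates its show functions with a small macro set):

	static ssize_t read_ahead_kb_show(struct device *dev,
					  struct device_attribute *attr,
					  char *page)
	{
		struct backing_dev_info *bdi = dev_get_drvdata(dev);

		/* ra_pages is kept in pages; report it in KB. */
		return snprintf(page, PAGE_SIZE, "%lu\n",
				bdi-&gt;ra_pages &lt;&lt; (PAGE_SHIFT - 10));
	}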

With patient help from Kay Sievers and Greg KH

[mszeredi@suse.cz]

 - split off NFS and FUSE changes into separate patches
 - document new sysfs attributes under Documentation/ABI
 - do bdi_class_init as a core_initcall, otherwise the "default" BDI
   won't be initialized
 - remove bdi_init_fmt macro, it's not used very much

[akpm@linux-foundation.org: fix ia64 warning]
Signed-off-by: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Kay Sievers &lt;kay.sievers@vrfy.org&gt;
Acked-by: Greg KH &lt;greg@kroah.com&gt;
Cc: Trond Myklebust &lt;trond.myklebust@fys.uio.no&gt;
Signed-off-by: Miklos Szeredi &lt;mszeredi@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>writeback: speed up writeback of big dirty files</title>
<updated>2008-02-05T17:44:19Z</updated>
<author>
<name>Fengguang Wu</name>
<email>wfg@mail.ustc.edu.cn</email>
</author>
<published>2008-02-05T06:29:36Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=8bc3be2751b4f74ab90a446da1912fd8204d53f7'/>
<id>urn:sha1:8bc3be2751b4f74ab90a446da1912fd8204d53f7</id>
<content type='text'>
After making a 100M file dirty, the normal behavior is to start
writeback for all the data after a 30s delay.  But sometimes the
following happens instead:

	- after 30s:    ~4M
	- after 5s:     ~4M
	- after 5s:     all remaining 92M

Some analysis shows that the internal io dispatch queues go like this:

		s_io            s_more_io
		-------------------------
	1)	100M,1K         0
	2)	1K              96M
	3)	0               96M
1) initial state with a 100M file and a 1K file

2) 4M written, nr_to_write &lt;= 0, so write more

3) 1K written, nr_to_write &gt; 0, no more writes (BUG)

nr_to_write &gt; 0 in (3) fools the upper layer into thinking that the data
have all been written out.  The big dirty file is actually still sitting in
s_more_io.  We cannot simply splice s_more_io back to s_io as soon as s_io
becomes empty and let the loop in generic_sync_sb_inodes() continue: this
may starve newly expired inodes in s_dirty.  It is also not an option to
draw inodes from both s_more_io and s_dirty and let the loop go on: this
might lead to livelocks, and might also starve other superblocks in sync
time (well, kupdate may still starve some superblocks; that's another bug).

We have to return when a full scan of s_io completes.  So nr_to_write &gt; 0
does not necessarily mean that "all data are written".  This patch
introduces a flag, writeback_control.more_io, to indicate that more io
should be done.  With it the big dirty file no longer has to wait for the
next kupdate invocation 5s later.
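
A sketch of the flag's two ends (simplified from the patch):

	/* In sync_sb_inodes(): after a full scan of s_io, report that
	 * more io is pending rather than claiming completion. */
	if (!list_empty(&amp;sb-&gt;s_more_io))
		wbc-&gt;more_io = 1;

	/* In the pdflush writeout loop: only stop early when nothing
	 * more is pending. */
	if (wbc.nr_to_write &gt; 0 &amp;&amp; !wbc.more_io)
		break;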

In sync_sb_inodes() we only set more_io on super_blocks we actually
visited.  This avoids interaction between two pdflush daemons.

Also, in __sync_single_inode() we don't blindly keep requeuing the io if
the filesystem cannot make progress.  Failing to do so may lead to 100%
iowait.

Tested-by: Mike Snitzer &lt;snitzer@gmail.com&gt;
Signed-off-by: Fengguang Wu &lt;wfg@mail.ustc.edu.cn&gt;
Cc: Michael Rubin &lt;mrubin@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
