<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/trace/events/writeback.h, branch v3.2.40</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.2.40</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v3.2.40'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2012-02-20T20:46:17Z</updated>
<entry>
<title>writeback: fix dereferencing NULL bdi-&gt;dev on trace_writeback_queue</title>
<updated>2012-02-20T20:46:17Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2012-02-05T02:54:03Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=422204b77968331ce85721527f7ece49f72658f2'/>
<id>urn:sha1:422204b77968331ce85721527f7ece49f72658f2</id>
<content type='text'>
commit 977b7e3a52a7421ad33a393a38ece59f3d41c2fa upstream.

When a SD card is hot removed without umount, del_gendisk() will call
bdi_unregister() without destroying/freeing it. This leaves the bdi in
the bdi-&gt;dev = NULL, bdi-&gt;wb.task = NULL, bdi-&gt;bdi_list removed state.

When sync(2) gets the bdi before bdi_unregister() and calls
bdi_queue_work() after the unregister, trace_writeback_queue will be
dereferencing the NULL bdi-&gt;dev. Fix it with a simple test for NULL.

LKML-reference: http://lkml.org/lkml/2012/1/18/346
Reported-by: Rabin Vincent &lt;rabin@rab.in&gt;
Tested-by: Namjae Jeon &lt;linkinjeon@gmail.com&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>writeback: fix NULL bdi-&gt;dev in trace writeback_single_inode</title>
<updated>2012-02-20T20:46:16Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2012-01-17T17:18:56Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=aec14f459cc9d40f9fd4a7aad2be761de084b320'/>
<id>urn:sha1:aec14f459cc9d40f9fd4a7aad2be761de084b320</id>
<content type='text'>
commit 15eb77a07c714ac80201abd0a9568888bcee6276 upstream.

bdi_prune_sb() resets sb-&gt;s_bdi to default_backing_dev_info when the
tearing down the original bdi. Fix trace_writeback_single_inode to
use sb-&gt;s_bdi=default_backing_dev_info rather than bdi-&gt;dev=NULL for a
teared down bdi.

Reported-by: Rabin Vincent &lt;rabin@rab.in&gt;
Tested-by: Rabin Vincent &lt;rabin@rab.in&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>writeback: show writeback reason with __print_symbolic</title>
<updated>2011-12-18T06:20:17Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2011-12-08T22:53:54Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b3bba872ddb0320a7ecb54decae53c13ceb2ed4c'/>
<id>urn:sha1:b3bba872ddb0320a7ecb54decae53c13ceb2ed4c</id>
<content type='text'>
This makes the binary trace understandable by trace-cmd.

CC: Dave Chinner &lt;david@fromorbit.com&gt;
CC: Curt Wohlgemuth &lt;curtw@google.com&gt;
CC: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
<entry>
<title>writeback: Add a 'reason' to wb_writeback_work</title>
<updated>2011-10-30T16:33:36Z</updated>
<author>
<name>Curt Wohlgemuth</name>
<email>curtw@google.com</email>
</author>
<published>2011-10-08T03:54:10Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=0e175a1835ffc979e55787774e58ec79e41957d7'/>
<id>urn:sha1:0e175a1835ffc979e55787774e58ec79e41957d7</id>
<content type='text'>
This creates a new 'reason' field in a wb_writeback_work
structure, which unambiguously identifies who initiates
writeback activity.  A 'wb_reason' enumeration has been
added to writeback.h, to enumerate the possible reasons.

The 'writeback_work_class' and tracepoint event class and
'writeback_queue_io' tracepoints are updated to include the
symbolic 'reason' in all trace events.

And the 'writeback_inodes_sbXXX' family of routines has had
a wb_stats parameter added to them, so callers can specify
why writeback is being started.

Acked-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Curt Wohlgemuth &lt;curtw@google.com&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
<entry>
<title>writeback: send work item to queue_io, move_expired_inodes</title>
<updated>2011-10-30T16:33:27Z</updated>
<author>
<name>Curt Wohlgemuth</name>
<email>curtw@google.com</email>
</author>
<published>2011-10-08T03:51:56Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ad4e38dd6a33bb3a4882c487d7abe621e583b982'/>
<id>urn:sha1:ad4e38dd6a33bb3a4882c487d7abe621e583b982</id>
<content type='text'>
Instead of sending -&gt;older_than_this to queue_io() and
move_expired_inodes(), send the entire wb_writeback_work
structure.  There are other fields of a work item that are
useful in these routines and in tracepoints.

Acked-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Curt Wohlgemuth &lt;curtw@google.com&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
<entry>
<title>writeback: trace event balance_dirty_pages</title>
<updated>2011-10-30T16:29:38Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2010-08-30T05:33:20Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ece13ac31bbe492d940ba0bc4ade2ae1521f46a5'/>
<id>urn:sha1:ece13ac31bbe492d940ba0bc4ade2ae1521f46a5</id>
<content type='text'>
Useful for analyzing the dynamics of the throttling algorithms and
debugging user reported problems.

Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
<entry>
<title>writeback: trace event bdi_dirty_ratelimit</title>
<updated>2011-10-30T16:29:21Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2011-03-02T23:22:49Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=b48c104d2211b0ac881a71f5f76a3816225f8111'/>
<id>urn:sha1:b48c104d2211b0ac881a71f5f76a3816225f8111</id>
<content type='text'>
It helps understand how various throttle bandwidths are updated.

Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
<entry>
<title>writeback: IO-less balance_dirty_pages()</title>
<updated>2011-10-03T13:08:57Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2010-08-28T00:45:12Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=143dfe8611a63030ce0c79419dc362f7838be557'/>
<id>urn:sha1:143dfe8611a63030ce0c79419dc362f7838be557</id>
<content type='text'>
As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the mean while, kick off the
per-bdi flusher thread to do background writeback IO.

RATIONALS
=========

- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled start foreground
  writeback, it leads to N IO submitters from at least N different
  inodes at the same time, end up with N different sets of IO being
  issued with potentially zero locality to each other, resulting in
  much lower elevator sort/merge efficiency and hence we seek the disk
  all over the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

  With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
  from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".

  * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

  * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution by the complex writeback code path

  * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time reduced from 41:42.17 to
    40:53.23.

  * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
    and improves IO throughput from 38MB/s to 42MB/s.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency.
  Because it could lead to more than 1 second user perceivable stalls.
  Even the current 4MB write size may be too large for slow USB sticks.
  The fact that balance_dirty_pages() starts IO on itself couples the
  IO size to wait time, which makes it hard to do suitable IO size while
  keeping the wait time under control.

  Now it's possible to increase writeback chunk size proportional to the
  disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
  the larger writeback size dramatically reduces the seek count to 1/10
  (far beyond my expectation) and improves the write throughput by 24%.

- long block time in balance_dirty_pages() hurts desktop responsiveness

  Many of us may have the experience: it often takes a couple of seconds
  or even long time to stop a heavy writing dd/cp/tar command with
  Ctrl-C or "kill -9".

- IO pipeline broken by bumpy write() progress

  There are a broad class of "loop {read(buf); write(buf);}" applications
  whose read() pipeline will be under-utilized or even come to a stop if
  the write()s have long latencies _or_ don't progress in a constant rate.
  The current threshold based throttling inherently transfers the large
  low level IO completion fluctuations to bumpy application write()s,
  and further deteriorates with increasing number of dirtiers and/or bdi's.

  For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
  the rsync progresses very bumpy in legacy kernel, and throughput is
  improved by 67% by this patchset. (plus the larger write chunk size,
  it will be 93% speedup).

  The new rate based throttling can support 1000+ dd's with excellent
  smoothness, low latency and low overheads.

For the above reasons, it's much better to do IO-less and low latency
pauses in balance_dirty_pages().

Jan Kara, Dave Chinner and me explored the scheme to let
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it's found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait time and jitters.

- NFS may kill large amount of unstable pages with one single COMMIT.
  Because NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So it's not
  likely to optimize away such kind of bursty IO completions, and the
  resulted large (and tiny) stall times in IO completion based throttling.

So here is a pause time oriented approach, which tries to control the
pause time in each balance_dirty_pages() invocations, by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than   4ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times

It can control pause times at will. The default policy (in a followup
patch) will be to do ~10ms pauses in 1-dd case, and increase to ~100ms
in 1000-dd case.

BEHAVIOR CHANGE
===============

(1) dirty threshold

Users will notice that the applications will get throttled once crossing
the global (background + dirty)/2=15% threshold, and then balanced around
17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable
memory in 1-dd case.

Since the task will be soft throttled earlier than before, it may be
perceived by end users as performance "slow down" if his application
happens to dirty more than 15% dirtyable memory.

(2) smoothness/responsiveness

Users will notice a more responsive system during heavy writeback.
"killall dd" will take effect instantly.

Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
<entry>
<title>writeback: show raw dirtied_when in trace writeback_single_inode</title>
<updated>2011-08-31T00:48:15Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2011-08-29T15:52:23Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=c8ad620638f97bdb7c8ef8cc788f08a04b14eadc'/>
<id>urn:sha1:c8ad620638f97bdb7c8ef8cc788f08a04b14eadc</id>
<content type='text'>
Save inode-&gt;dirtied_when in the raw trace output for reliable scripting,
and to also show in formatted output the relative age in seconds for
easy human reading.

CC: Jan Kara &lt;jack@suse.cz&gt;
Acked-by: Christoph Hellwig &lt;hch@infradead.org&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
<entry>
<title>writeback: trace global_dirty_state</title>
<updated>2011-07-10T05:09:03Z</updated>
<author>
<name>Wu Fengguang</name>
<email>fengguang.wu@intel.com</email>
</author>
<published>2010-12-07T04:34:29Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=e1cbe236013c82bcf9a156e98d7b47efb89d2674'/>
<id>urn:sha1:e1cbe236013c82bcf9a156e98d7b47efb89d2674</id>
<content type='text'>
Add trace event balance_dirty_state for showing the global dirty page
counts and thresholds at each global_dirty_limits() invocation.  This
will cover the callers throttle_vm_writeout(), over_bground_thresh()
and each balance_dirty_pages() loop.

Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
</feed>
