<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/fs/pipe.c, branch v5.7</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.7</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v5.7'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2020-04-02T16:35:28Z</updated>
<entry>
<title>mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page()</title>
<updated>2020-04-02T16:35:28Z</updated>
<author>
<name>Roman Gushchin</name>
<email>guro@fb.com</email>
</author>
<published>2020-04-02T04:06:46Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f4b00eab5004e823f28a268580ae4ed16df9fabf'/>
<id>urn:sha1:f4b00eab5004e823f28a268580ae4ed16df9fabf</id>
<content type='text'>
Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page()
to better reflect what they are actually doing:

1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge
   the current memcg

2) set or clear the PageKmemcg flag

Signed-off-by: Roman Gushchin &lt;guro@fb.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Shakeel Butt &lt;shakeelb@google.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@kernel.org&gt;
Cc: Vladimir Davydov &lt;vdavydov.dev@gmail.com&gt;
Link: http://lkml.kernel.org/r/20200109202659.752357-4-guro@fb.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>pipe: make sure to wake up everybody when the last reader/writer closes</title>
<updated>2020-02-18T22:34:36Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-02-18T18:12:58Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=6551d5c56eb0d02db2274d7a8d26c333deba7fd2'/>
<id>urn:sha1:6551d5c56eb0d02db2274d7a8d26c333deba7fd2</id>
<content type='text'>
Andrei Vagin reported that commit 0ddad21d3e99 ("pipe: use exclusive
waits when reading or writing") broke one of the CRIU tests.  He even
has a trivial reproducer:

    #include &lt;unistd.h&gt;
    #include &lt;sys/types.h&gt;
    #include &lt;sys/wait.h&gt;

    int main()
    {
            int p[2];
            pid_t p1, p2;
            int status;

            if (pipe(p) == -1)
                    return 1;

            p1 = fork();
            if (p1 == 0) {
                    close(p[1]);
                    read(p[0], &amp;status, sizeof(status));
                    return 0;
            }
            p2 = fork();
            if (p2 == 0) {
                    close(p[1]);
                    read(p[0], &amp;status, sizeof(status));
                    return 0;
            }
            sleep(1);
            close(p[1]);
            wait(&amp;status);
            wait(&amp;status);

            return 0;
    }

and the problem - once he points it out - is obvious.  We use these nice
exclusive waits, but when the last writer goes away, it then needs to
wake up _every_ reader (and conversely, the last reader disappearing
needs to wake every writer, of course).

In fact, when going through this, we had several small oddities around
how to wake things.  We did in fact wake every reader when we changed
the size of the pipe buffers.  But that's entirely pointless, since that
just acts as a possible source of new space - no new data to read.

And when we change the size of the buffer, we don't need to wake all
writers even when we add space - that case acts just as if somebody made
space by reading, and any writer that finds itself not filling it up
entirely will wake the next one.

On the other hand, on the exit path, we tried to limit the wakeups with
the proper poll keys etc, which is entirely pointless, because at that
point we obviously need to wake up everybody.  So don't do that: just
wake up everybody - but only do that if the counts changed to zero.

So fix those non-IO wakeups to be more proper: space change doesn't add
any new data, but it might make room for writers, so it wakes up a
writer.  And the actual changes to reader/writer counts should wake up
everybody, since everybody is affected (ie readers will all see EOF if
the writers have gone away, and writers will all get EPIPE if all
readers have gone away).

Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Reported-and-tested-by: Andrei Vagin &lt;avagin@gmail.com&gt;
Cc: Josh Triplett &lt;josh@joshtriplett.org&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>pipe: use exclusive waits when reading or writing</title>
<updated>2020-02-08T19:39:19Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-12-09T17:48:27Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=0ddad21d3e99c743a3aa473121dc5561679e26bb'/>
<id>urn:sha1:0ddad21d3e99c743a3aa473121dc5561679e26bb</id>
<content type='text'>
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).

While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.

In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.

This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.

A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):

    #include &lt;unistd.h&gt;

    int main(int argc, char **argv)
    {
        int fd[2], counters[2];

        pipe(fd);
        counters[0] = 0;
        counters[1] = -1;
        write(fd[1], counters, sizeof(counters));

        /* 64 processes */
        fork(); fork(); fork(); fork(); fork(); fork();

        do {
                int i;
                read(fd[0], &amp;i, sizeof(i));
                if (i &lt; 0)
                        continue;
                counters[0] = i+1;
                write(fd[1], counters, (1+(i &amp; 1)) *sizeof(int));
        } while (counters[0] &lt; 1000000);
        return 0;
    }

and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.

But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows

        231,852.37 msec task-clock                #   15.857 CPUs utilized
        11,250,961      context-switches          #    0.049 M/sec
           616,304      cpu-migrations            #    0.003 M/sec
             1,648      page-faults               #    0.007 K/sec
 1,097,903,998,514      cycles                    #    4.735 GHz
   120,781,778,352      instructions              #    0.11  insn per cycle
    27,997,056,043      branches                  #  120.754 M/sec
       283,581,233      branch-misses             #    1.01% of all branches

      14.621273891 seconds time elapsed

       0.018243000 seconds user
       3.611468000 seconds sys

before this commit.

After this commit, I get

          5,229.55 msec task-clock                #    3.072 CPUs utilized
         1,212,233      context-switches          #    0.232 M/sec
           103,951      cpu-migrations            #    0.020 M/sec
             1,328      page-faults               #    0.254 K/sec
    21,307,456,166      cycles                    #    4.074 GHz
    12,947,819,999      instructions              #    0.61  insn per cycle
     2,881,985,678      branches                  #  551.096 M/sec
        64,267,015      branch-misses             #    2.23% of all branches

       1.702148350 seconds time elapsed

       0.004868000 seconds user
       0.110786000 seconds sys

instead. Much better.

[ Note! This kernel improvement seems to be very good at triggering a
  race condition in the make jobserver (in GNU make 4.2.1) for me. It's
  a long known bug that was fixed back in June 2017 by GNU make commit
  b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
  avoid hangs.").

  But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
  so a number of distributions may still have the buggy version. Some
  have backported the fix to their 4.2.1 release, though, and even
  without the fix it's quite timing-dependent whether the bug actually
  is hit. ]

Josh Triplett says:
 "I've been hammering on your pipe fix patch (switching to exclusive
  wait queues) for a month or so, on several different systems, and I've
  run into no issues with it. The patch *substantially* improves
  parallel build times on large (~100 CPU) systems, both with parallel
  make and with other things that use make's pipe-based jobserver.

  All current distributions (including stable and long-term stable
  distributions) have versions of GNU make that no longer have the
  jobserver bug"

Tested-by: Josh Triplett &lt;josh@joshtriplett.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>pipe: fix empty pipe check in pipe_write()</title>
<updated>2019-12-22T17:47:47Z</updated>
<author>
<name>Jan Stancek</name>
<email>jstancek@redhat.com</email>
</author>
<published>2019-12-22T12:33:24Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=0dd1e3773ae8afc4bfdce782bdeffc10f9cae6ec'/>
<id>urn:sha1:0dd1e3773ae8afc4bfdce782bdeffc10f9cae6ec</id>
<content type='text'>
LTP pipeio_1 test is hanging with v5.5-rc2-385-gb8e382a185eb,
with read side observing empty pipe and sleeping and write
side running out of space and then sleeping as well. In this
scenario there are 5 writers and 1 reader.

Problem is that after pipe_write() reacquires pipe lock, it
re-checks for empty pipe with potentially stale 'head' and
doesn't wake up read side anymore. pipe-&gt;tail can advance
beyond 'head', because there are multiple writers.

Use pipe-&gt;head for empty pipe check after reacquiring lock
to observe current state.

Testing: With patch, LTP pipeio_1 ran successfully in loop for 1 hour.
         Without patch it hanged within a minute.

Fixes: 1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup logic")
Reported-by: Rachel Sibley &lt;rasibley@redhat.com&gt;
Signed-off-by: Jan Stancek &lt;jstancek@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>pipe: simplify signal handling in pipe_read() and add comments</title>
<updated>2019-12-11T19:46:19Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-12-11T19:46:19Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=d1c6a2aa02af09238ad09493eb3c8685ccc3fe12'/>
<id>urn:sha1:d1c6a2aa02af09238ad09493eb3c8685ccc3fe12</id>
<content type='text'>
There's no need to separately check for signals while inside the locked
region, since we're going to do "wait_event_interruptible()" right
afterwards anyway, and the error handling is much simpler there.

The check for whether we had already read anything was also redundant,
since we no longer do the odd merging of reads when there are pending
writers.

But perhaps more importantly, this adds commentary about why we still
need to wake up possible writers even though we didn't read any data,
and why we can skip all the finishing touches now if we get a signal (or
had a signal pending) while waiting for more data.

[ This is a split-out cleanup from my "make pipe IO use exclusive wait
  queues" thing, which I can't apply because it triggers a nasty bug in
  the GNU make jobserver   - Linus ]

Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>pipe: don't use 'pipe_wait() for basic pipe IO</title>
<updated>2019-12-07T21:53:09Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-12-07T21:53:09Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=85190d15f4ea88e60047256fecb8216d5323ea47'/>
<id>urn:sha1:85190d15f4ea88e60047256fecb8216d5323ea47</id>
<content type='text'>
pipe_wait() may be simple, but since it relies on the pipe lock, it
means that we have to do the wakeup while holding the lock.  That's
unfortunate, because the very first thing the waked entity will want to
do is to get the pipe lock for itself.

So get rid of the pipe_wait() usage by simply releasing the pipe lock,
doing the wakeup (if required) and then using wait_event_interruptible()
to wait on the right condition instead.

wait_event_interruptible() handles races on its own by comparing the
wakeup condition before and after adding itself to the wait queue, so
you can use an optimistic unlocked condition for it.

Cc: David Howells &lt;dhowells@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>pipe: remove 'waiting_writers' merging logic</title>
<updated>2019-12-07T21:21:01Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-12-07T21:21:01Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a28c8b9db8a1014aa572cd19a3bdb9ddebd3e555'/>
<id>urn:sha1:a28c8b9db8a1014aa572cd19a3bdb9ddebd3e555</id>
<content type='text'>
This code is ancient, and goes back to when we only had a single page
for the pipe buffers.  The exact history is hidden in the mists of time
(ie "before git", and in fact predates the BK repository too).

At that long-ago point in time, it actually helped to try to merge big
back-and-forth pipe reads and writes, and not limit pipe reads to the
single pipe buffer in length just because that was all we had at a time.

However, since then we've expanded the pipe buffers to multiple pages,
and this logic really doesn't seem to make sense.  And a lot of it is
somewhat questionable (ie "hmm, the user asked for a non-blocking read,
but we see that there's a writer pending, so let's wait anyway to get
the extra data that the writer will have").

But more importantly, it makes the "go to sleep" logic much less
obvious, and considering the wakeup issues we've had, I want to make for
less of those kinds of things.

Cc: David Howells &lt;dhowells@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>pipe: fix and clarify pipe read wakeup logic</title>
<updated>2019-12-07T20:54:26Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-12-07T20:54:26Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f467a6a66419a18f782f87507464a5dedc942dec'/>
<id>urn:sha1:f467a6a66419a18f782f87507464a5dedc942dec</id>
<content type='text'>
This is the read side version of the previous commit: it simplifies the
logic to only wake up waiting writers when necessary, and makes sure to
use a synchronous wakeup.  This time not so much for GNU make jobserver
reasons (that pipe never fills up), but simply to get the writer going
quickly again.

A bit less verbose commentary this time, if only because I assume that
the write side commentary isn't going to be ignored if you touch this
code.

Cc: David Howells &lt;dhowells@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>pipe: fix and clarify pipe write wakeup logic</title>
<updated>2019-12-07T20:14:28Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-12-07T20:14:28Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=1b6b26ae7053e4914181eedf70f2d92c12abda8a'/>
<id>urn:sha1:1b6b26ae7053e4914181eedf70f2d92c12abda8a</id>
<content type='text'>
The pipe rework ends up having been extra painful, partly becaused of
actual bugs with ordering and caching of the pipe state, but also
because of subtle performance issues.

In particular, the pipe rework caused the kernel build to inexplicably
slow down.

The reason turns out to be that the GNU make jobserver (which limits the
parallelism of the build) uses a pipe to implement a "token" system: a
parallel submake will read a character from the pipe to get the job
token before starting a new job, and will write a character back to the
pipe when it is done.  The overall job limit is thus easily controlled
by just writing the appropriate number of initial token characters into
the pipe.

But to work well, that really means that the old behavior of write
wakeups being synchronous (WF_SYNC) is very important - when the pipe
writer wakes up a reader, we want the reader to actually get scheduled
immediately.  Otherwise you lose the parallelism of the build.

The pipe rework lost that synchronous wakeup on write, and we had
clearly all forgotten the reasons and rules for it.

This rewrites the pipe write wakeup logic to do the required Wsync
wakeups, but also clarifies the logic and avoids extraneous wakeups.

It also ends up addign a number of comments about what oit does and why,
so that we hopefully don't end up forgetting about this next time we
change this code.

Cc: David Howells &lt;dhowells@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>pipe: fix poll/select race introduced by the pipe rework</title>
<updated>2019-12-07T18:41:17Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2019-12-07T18:41:17Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=ad910e36da4ca3a1bd436989f632d062dda0c921'/>
<id>urn:sha1:ad910e36da4ca3a1bd436989f632d062dda0c921</id>
<content type='text'>
The kernel wait queues have a basic rule to them: you add yourself to
the wait-queue first, and then you check the things that you're going to
wait on.  That avoids the races with the event you're waiting for.

The same goes for poll/select logic: the "poll_wait()" goes first, and
then you check the things you're polling for.

Of course, if you use locking, the ordering doesn't matter since the
lock will serialize with anything that changes the state you're looking
at. That's not the case here, though.

So move the poll_wait() first in pipe_poll(), before you start looking
at the pipe state.

Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length")
Cc: David Howells &lt;dhowells@redhat.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
