<feed xmlns='http://www.w3.org/2005/Atom'>
<title>user/sven/linux.git/include/uapi/linux/io_uring.h, branch v6.7.9</title>
<subtitle>Linux Kernel
</subtitle>
<id>https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.7.9</id>
<link rel='self' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/atom?h=v6.7.9'/>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/'/>
<updated>2023-11-01T21:25:08Z</updated>
<entry>
<title>Merge tag 'io_uring-futex-2023-10-30' of git://git.kernel.dk/linux</title>
<updated>2023-11-01T21:25:08Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2023-11-01T21:25:08Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4de520f1fcefd4ebb7dddcf28bde1b330c2f6b5d'/>
<id>urn:sha1:4de520f1fcefd4ebb7dddcf28bde1b330c2f6b5d</id>
<content type='text'>
Pull io_uring futex support from Jens Axboe:
 "This adds support for using futexes through io_uring - first futex
  wake and wait, and then the vectored variant of waiting, futex waitv.

  For both wait/wake/waitv, we support the bitset variant, as the
  'normal' variants can be easily implemented on top of that.

  PI and requeue are not supported through io_uring, just the above
  mentioned parts. This may change in the future, but in the spirit of
  keeping this small (and based on what people have been asking for),
  this is what we currently have.

  Wake support is pretty straight forward, most of the thought has gone
  into the wait side to avoid needing to offload wait operations to a
  blocking context. Instead, we rely on the usual callbacks to retry and
  post a completion event, when appropriate.

  As far as I can recall, the first request for futex support with
  io_uring came from Andres Freund, working on postgres. His aio rework
  of postgres was one of the early adopters of io_uring, and futex
  support was a natural extension for that. This is relevant from both a
  usability point of view, as well as for effiency and performance. In
  Andres's words, for the former:

     Futex wait support in io_uring makes it a lot easier to avoid
     deadlocks in concurrent programs that have their own buffer pool:
     Obviously pages in the application buffer pool have to be locked
     during IO. If the initiator of IO A needs to wait for a held lock
     B, the holder of lock B might wait for the IO A to complete. The
     ability to wait for a lock and IO completions at the same time
     provides an efficient way to avoid such deadlocks

  and in terms of effiency, even without unlocking the full potential
  yet, Andres says:

     Futex wake support in io_uring is useful because it allows for more
     efficient directed wakeups. For some "locks" postgres has queues
     implemented in userspace, with wakeup logic that cannot easily be
     implemented with FUTEX_WAKE_BITSET on a single "futex word"
     (imagine waiting for journal flushes to have completed up to a
     certain point).

     Thus a "lock release" sometimes need to wake up many processes in a
     row. A quick-and-dirty conversion to doing these wakeups via
     io_uring lead to a 3% throughput increase, with 12% fewer context
     switches, albeit in a fairly extreme workload"

* tag 'io_uring-futex-2023-10-30' of git://git.kernel.dk/linux:
  io_uring: add support for vectored futex waits
  futex: make the vectored futex operations available
  futex: make futex_parse_waitv() available as a helper
  futex: add wake_data to struct futex_q
  io_uring: add support for futex wake and wait
  futex: abstract out a __futex_wake_mark() helper
  futex: factor out the futex wake handling
  futex: move FUTEX2_VALID_MASK to futex.h
</content>
</entry>
<entry>
<title>io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPT</title>
<updated>2023-10-19T22:42:03Z</updated>
<author>
<name>Breno Leitao</name>
<email>leitao@debian.org</email>
</author>
<published>2023-10-16T13:47:48Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=4232c6e349f3a591fd0f432e6b858d32095adce6'/>
<id>urn:sha1:4232c6e349f3a591fd0f432e6b858d32095adce6</id>
<content type='text'>
Add initial support for SOCKET_URING_OP_SETSOCKOPT. This new command is
similar to setsockopt. This implementation leverages the function
do_sock_setsockopt(), which is shared with the setsockopt() system call
path.

Important to say that userspace needs to keep the pointer's memory alive
until the operation is completed. I.e, the memory could not be
deallocated before the CQE is returned to userspace.

Signed-off-by: Breno Leitao &lt;leitao@debian.org&gt;
Reviewed-by: Gabriel Krisman Bertazi &lt;krisman@suse.de&gt;
Link: https://lore.kernel.org/r/20231016134750.1381153-11-leitao@debian.org
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT</title>
<updated>2023-10-19T22:42:03Z</updated>
<author>
<name>Breno Leitao</name>
<email>leitao@debian.org</email>
</author>
<published>2023-10-16T13:47:47Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=a5d2f99aff6b6f9cd6a1ab6907d8be8066114791'/>
<id>urn:sha1:a5d2f99aff6b6f9cd6a1ab6907d8be8066114791</id>
<content type='text'>
Add support for getsockopt command (SOCKET_URING_OP_GETSOCKOPT), where
level is SOL_SOCKET. This is leveraging the sockptr_t infrastructure,
where a sockptr_t is either userspace or kernel space, and handled as
such.

Differently from the getsockopt(2), the optlen field is not a userspace
pointers. In getsockopt(2), userspace provides optlen pointer, which is
overwritten by the kernel.  In this implementation, userspace passes a
u32, and the new value is returned in cqe-&gt;res. I.e., optlen is not a
pointer.

Important to say that userspace needs to keep the pointer alive until
the CQE is completed.

Signed-off-by: Breno Leitao &lt;leitao@debian.org&gt;
Reviewed-by: Gabriel Krisman Bertazi &lt;krisman@suse.de&gt;
Link: https://lore.kernel.org/r/20231016134750.1381153-10-leitao@debian.org
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring: add support for vectored futex waits</title>
<updated>2023-09-29T08:37:08Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2023-06-13T01:04:32Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=8f350194d5cfd7016d4cd44e433df0faa4d4a703'/>
<id>urn:sha1:8f350194d5cfd7016d4cd44e433df0faa4d4a703</id>
<content type='text'>
This adds support for IORING_OP_FUTEX_WAITV, which allows registering a
notification for a number of futexes at once. If one of the futexes are
woken, then the request will complete with the index of the futex that got
woken as the result. This is identical to what the normal vectored futex
waitv operation does.

Use like IORING_OP_FUTEX_WAIT, except sqe-&gt;addr must now contain a
pointer to a struct futex_waitv array, and sqe-&gt;off must now contain the
number of elements in that array. As flags are passed in the futex_vector
array, and likewise for the value and futex address(es), sqe-&gt;addr2
and sqe-&gt;addr3 are also reserved for IORING_OP_FUTEX_WAITV.

For cancelations, FUTEX_WAITV does not rely on the futex_unqueue()
return value as we're dealing with multiple futexes. Instead, a separate
per io_uring request atomic is used to claim ownership of the request.

Waiting on N futexes could be done with IORING_OP_FUTEX_WAIT as well,
but that punts a lot of the work to the application:

1) Application would need to submit N IORING_OP_FUTEX_WAIT requests,
   rather than just a single IORING_OP_FUTEX_WAITV.

2) When one futex is woken, application would need to cancel the
   remaining N-1 requests that didn't trigger.

While this is of course doable, having a single vectored futex wait
makes for much simpler application code.

Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring: add support for futex wake and wait</title>
<updated>2023-09-29T08:36:57Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2023-06-08T17:57:40Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=194bb58c6090e39bd7d9b9c888a079213628e1f6'/>
<id>urn:sha1:194bb58c6090e39bd7d9b9c888a079213628e1f6</id>
<content type='text'>
Add support for FUTEX_WAKE/WAIT primitives.

IORING_OP_FUTEX_WAKE is mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as
it does support passing in a bitset.

Similary, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and
FUTEX_WAIT_BITSET.

For both of them, they are using the futex2 interface.

FUTEX_WAKE is straight forward, as those can always be done directly from
the io_uring submission without needing async handling. For FUTEX_WAIT,
things are a bit more complicated. If the futex isn't ready, then we
rely on a callback via futex_queue-&gt;wake() when someone wakes up the
futex. From that calback, we queue up task_work with the original task,
which will post a CQE and wake it, if necessary.

Cancelations are supported, both from the application point-of-view,
but also to be able to cancel pending waits if the ring exits before
all events have occurred. The return value of futex_unqueue() is used
to gate who wins the potential race between cancelation and futex
wakeups. Whomever gets a 'ret == 1' return from that claims ownership
of the io_uring futex request.

This is just the barebones wait/wake support. PI or REQUEUE support is
not added at this point, unclear if we might look into that later.

Likewise, explicit timeouts are not supported either. It is expected
that users that need timeouts would do so via the usual io_uring
mechanism to do that using linked timeouts.

The SQE format is as follows:

`addr`		Address of futex
`fd`		futex2(2) FUTEX2_* flags
`futex_flags`	io_uring specific command flags. None valid now.
`addr2`		Value of futex
`addr3`		Mask to wake/wait

Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring: retain top 8bits of uring_cmd flags for kernel internal use</title>
<updated>2023-09-28T13:31:41Z</updated>
<author>
<name>Ming Lei</name>
<email>ming.lei@redhat.com</email>
</author>
<published>2023-09-28T12:43:24Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=528ce6781726e022bc5dc84034360e6e8f1b89bd'/>
<id>urn:sha1:528ce6781726e022bc5dc84034360e6e8f1b89bd</id>
<content type='text'>
Retain top 8bits of uring_cmd flags for kernel internal use, so that we
can move IORING_URING_CMD_POLLED out of uapi header.

Reviewed-by: Gabriel Krisman Bertazi &lt;krisman@suse.de&gt;
Reviewed-by: Anuj Gupta &lt;anuj20.g@samsung.com&gt;
Signed-off-by: Ming Lei &lt;ming.lei@redhat.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring: add IORING_OP_WAITID support</title>
<updated>2023-09-21T18:04:45Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2023-07-10T22:14:37Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=f31ecf671ddc498f20219453395794ff2383e06b'/>
<id>urn:sha1:f31ecf671ddc498f20219453395794ff2383e06b</id>
<content type='text'>
This adds support for an async version of waitid(2), in a fully async
version. If an event isn't immediately available, wait for a callback
to trigger a retry.

The format of the sqe is as follows:

sqe-&gt;len		The 'which', the idtype being queried/waited for.
sqe-&gt;fd			The 'pid' (or id) being waited for.
sqe-&gt;file_index		The 'options' being set.
sqe-&gt;addr2		A pointer to siginfo_t, if any, being filled in.

buf_index, add3, and waitid_flags are reserved/unused for now.
waitid_flags will be used for options for this request type. One
interesting use case may be to add multi-shot support, so that the
request stays armed and posts a notification every time a monitored
process state change occurs.

Note that this does not support rusage, on Arnd's recommendation.

See the waitid(2) man page for details on the arguments.

Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring/rw: add support for IORING_OP_READ_MULTISHOT</title>
<updated>2023-09-21T18:02:30Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2023-09-11T19:35:42Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=fc68fcda049108478ee4704d8a3ad3e05cc72fd0'/>
<id>urn:sha1:fc68fcda049108478ee4704d8a3ad3e05cc72fd0</id>
<content type='text'>
This behaves like IORING_OP_READ, except:

1) It only supports pollable files (eg pipes, sockets, etc). Note that
   for sockets, you probably want to use recv/recvmsg with multishot
   instead.

2) It supports multishot mode, meaning it will repeatedly trigger a
   read and fill a buffer when data is available. This allows similar
   use to recv/recvmsg but on non-sockets, where a single request will
   repeatedly post a CQE whenever data is read from it.

3) Because of #2, it must be used with provided buffers. This is
   uniformly true across any request type that supports multishot and
   transfers data, with the reason being that it's obviously not
   possible to pass in a single buffer for the data, as multiple reads
   may very well trigger before an application has a chance to process
   previous CQEs and the data passed from them.

Reviewed-by: Gabriel Krisman Bertazi &lt;krisman@suse.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring: add option to remove SQ indirection</title>
<updated>2023-08-24T23:16:19Z</updated>
<author>
<name>Pavel Begunkov</name>
<email>asml.silence@gmail.com</email>
</author>
<published>2023-08-24T22:53:32Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=2af89abda7d9c2aeb573677e2c498ddb09f8058a'/>
<id>urn:sha1:2af89abda7d9c2aeb573677e2c498ddb09f8058a</id>
<content type='text'>
Not many aware, but io_uring submission queue has two levels. The first
level usually appears as sq_array and stores indexes into the actual SQ.

To my knowledge, no one has ever seriously used it, nor liburing exposes
it to users. Add IORING_SETUP_NO_SQARRAY, when set we don't bother
creating and using the sq_array and SQ heads/tails will be pointing
directly into the SQ. Improves memory footprint, in term of both
allocations as well as cache usage, and also should make io_get_sqe()
less branchy in the end.

Signed-off-by: Pavel Begunkov &lt;asml.silence@gmail.com&gt;
Link: https://lore.kernel.org/r/0ffa3268a5ef61d326201ff43a233315c96312e0.1692916914.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>io_uring: Add io_uring command support for sockets</title>
<updated>2023-08-09T16:46:15Z</updated>
<author>
<name>Breno Leitao</name>
<email>leitao@debian.org</email>
</author>
<published>2023-06-27T13:44:24Z</published>
<link rel='alternate' type='text/html' href='https://git.stealer.net/cgit.cgi/user/sven/linux.git/commit/?id=8e9fad0e70b7b62848e0aeb1a873903b9ce4d7c4'/>
<id>urn:sha1:8e9fad0e70b7b62848e0aeb1a873903b9ce4d7c4</id>
<content type='text'>
Enable io_uring commands on network sockets. Create two new
SOCKET_URING_OP commands that will operate on sockets.

In order to call ioctl on sockets, use the file_operations-&gt;io_uring_cmd
callbacks, and map it to a uring socket function, which handles the
SOCKET_URING_OP accordingly, and calls socket ioctls.

This patches was tested by creating a new test case in liburing.
Link: https://github.com/leitao/liburing/tree/io_uring_cmd

Signed-off-by: Breno Leitao &lt;leitao@debian.org&gt;
Acked-by: Jakub Kicinski &lt;kuba@kernel.org&gt;
Link: https://lore.kernel.org/r/20230627134424.2784797-1-leitao@debian.org
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
</feed>
