Commit graph

1726 commits

Author SHA1 Message Date
Jens Axboe
f660fd2ca1 io_uring: add new helpers for posting overflows
Add two helpers, one for posting overflows for lockless_cq rings, and
one for non-lockless_cq rings. The former can allocate sanely with
GFP_KERNEL, but needs to grab the completion lock for posting, while the
latter must do non-sleeping allocs as it already holds the completion
lock.

While at it, mark the overflow handling functions as __cold as well, as
they should not generally be called during normal operations of the
ring.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-17 18:47:19 -06:00
Jens Axboe
c80bdb1c55 io_uring: pass in struct io_big_cqe to io_alloc_ocqe()
Rather than pass extra1/extra2 separately, just pass in the (now) named
io_big_cqe struct instead. The callers that don't use/support CQE32 will
now just pass a single NULL, rather than two seperate mystery zero
values.

Move the clearing of the big_cqe elements into io_alloc_ocqe() as well,
so it can get moved out of the generic code.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-17 18:47:18 -06:00
Jens Axboe
072d37b52c io_uring: make io_alloc_ocqe() take a struct io_cqe pointer
The number of arguments to io_alloc_ocqe() is a bit unwieldy. Make it
take a struct io_cqe pointer rather than three separate CQE args. One
path already has that readily available, add an io_init_cqe() helper for
the remaining two.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-17 18:46:53 -06:00
Jens Axboe
10f466abc4 io_uring: split alloc and add of overflow
Add a new helper, io_alloc_ocqe(), that simply allocates and fills an
overflow entry. Then it can get done outside of the locking section,
and hence use more appropriate gfp_t allocation flags rather than always
default to GFP_ATOMIC.

Inspired by a previous series from Pavel:

https://lore.kernel.org/io-uring/cover.1747209332.git.asml.silence@gmail.com/

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-17 18:46:50 -06:00
Pavel Begunkov
5288b9e28f io_uring: open code io_req_cqe_overflow()
A preparation patch, just open code io_req_cqe_overflow().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-16 12:38:36 -06:00
Jens Axboe
16256648cd io_uring/fdinfo: get rid of dumping credentials
It's a faily obscure feature, and registered credentials would for that
mostly be a static thing. Don't bother including code to dump the
personalities indices.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-16 12:33:28 -06:00
Jens Axboe
9a10926627 io_uring/fdinfo: only compile if CONFIG_PROC_FS is set
Rather than wrap fdinfo.c in one big if, handle it on the Makefile
side instead. io_uring.c already conditionally sets fops->fdinfo()
anyway.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-16 12:33:02 -06:00
Jens Axboe
3de7361f7c Merge branch 'io_uring-6.15' into for-6.16/io_uring
Merge in 6.15 io_uring fixes, mostly so that the fdinfo changes can
get easily extended without causing merge conflicts.

* io_uring-6.15:
  io_uring/fdinfo: grab ctx->uring_lock around io_uring_show_fdinfo()
  io_uring/memmap: don't use page_address() on a highmem page
  io_uring/uring_cmd: fix hybrid polling initialization issue
  io_uring/sqpoll: Increase task_work submission batch size
  io_uring: ensure deferred completions are flushed for multishot
  io_uring: always arm linked timeouts prior to issue
  io_uring/fdinfo: annotate racy sq/cq head/tail reads
  io_uring: fix 'sync' handling of io_fallback_tw()
  io_uring: don't duplicate flushing in io_req_post_cqe
2025-05-16 12:31:19 -06:00
Jakub Kicinski
bebd7b2626 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc7).

Conflicts:

tools/testing/selftests/drivers/net/hw/ncdevmem.c
  97c4e094a4 ("tests/ncdevmem: Fix double-free of queue array")
  2f1a805f32 ("selftests: ncdevmem: Implement devmem TCP TX")
https://lore.kernel.org/20250514122900.1e77d62d@canb.auug.org.au

Adjacent changes:

net/core/devmem.c
net/core/devmem.h
  0afc44d8cd ("net: devmem: fix kernel panic when netlink socket close after module unload")
  bd61848900 ("net: devmem: Implement TX path")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15 11:28:30 -07:00
Jens Axboe
d871198ee4 io_uring/fdinfo: grab ctx->uring_lock around io_uring_show_fdinfo()
Not everything requires locking in there, which is why the 'has_lock'
variable exists. But enough does that it's a bit unwieldy to manage.
Wrap the whole thing in a ->uring_lock trylock, and just return
with no output if we fail to grab it. The existing trylock() will
already have greatly diminished utility/output for the failure case.

This fixes an issue with reading the SQE fields, if the ring is being
actively resized at the same time.

Reported-by: Jann Horn <jannh@google.com>
Fixes: 79cfe9e59c ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-14 07:15:28 -06:00
Pavel Begunkov
2b61bb1d9a io_uring/kbuf: unify legacy buf provision and removal
Combine IORING_OP_PROVIDE_BUFFERS and IORING_OP_REMOVE_BUFFERS
->issue(), so that we can deduplicate ring locking and list lookups.
This way we further reduce code for legacy provided buffers. Locking is
also separated from buffer related handling, which makes it a bit
simpler with label jumps.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f61af131622ad4337c2fb9f7c453d5b0102c7b90.1747150490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13 14:45:55 -06:00
Pavel Begunkov
c724e80123 io_uring/kbuf: refactor __io_remove_buffers
__io_remove_buffers used for two purposes, the first is removing
buffers for non ring based lists, which implies that it can be called
multiple times for the same list. And the second is for destroying
lists, which is not perfectly reentrable for ring based lists.

It's confusing, so just have a helper for the legacy pbuf buffer
removal, make sure it's not called for ring pbuf, and open code all ring
pbuf destruction into io_put_bl().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0ae416b099d311ad23f285cea02f2c94c8ae9a6c.1747150490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13 14:45:55 -06:00
Pavel Begunkov
52a05d0cf8 io_uring/kbuf: don't compute size twice on prep
The size in prep is calculated by io_provide_buffers_prep(), so remove
the recomputation a few lines after.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7c97206561b74fce245cb22449c6082d2e066844.1747150490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13 14:45:55 -06:00
Pavel Begunkov
4e9fda29d6 io_uring/kbuf: drop extra vars in io_register_pbuf_ring
bl and free_bl variables in io_register_pbuf_ring() always point to the
same list since we started to reallocate the pre-existent list. Drop
free_bl.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d45c3342d74c9030f99376c777a4b3d59089074d.1747150490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13 14:45:55 -06:00
Pavel Begunkov
1724849072 io_uring/kbuf: use mem_is_zero()
Make use of mem_is_zero() for reserved fields checking.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/11fe27b7a831329bcdb4ea087317ef123ba7c171.1747150490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13 14:45:55 -06:00
Pavel Begunkov
475a8d3037 io_uring/kbuf: account ring io_buffer_list memory
Follow the non-ringed pbuf struct io_buffer_list allocations and account
it against the memcg. There is low chance of that being an actual
problem as ring provided buffer should either pin user memory or
allocate it, which is already accounted.

Cc: stable@vger.kernel.org # 6.1
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3985218b50d341273cafff7234e1a7e6d0db9808.1747150490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13 14:45:47 -06:00
Mina Almasry
bd61848900 net: devmem: Implement TX path
Augment dmabuf binding to be able to handle TX. Additional to all the RX
binding, we also create tx_vec needed for the TX path.

Provide API for sendmsg to be able to send dmabufs bound to this device:

- Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
- MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf.

Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY
implementation, while disabling instances where MSG_ZEROCOPY falls back
to copying.

We additionally pipe the binding down to the new
zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems
instead of the traditional page netmems.

We also special case skb_frag_dma_map to return the dma-address of these
dmabuf net_iovs instead of attempting to map pages.

The TX path may release the dmabuf in a context where we cannot wait.
This happens when the user unbinds a TX dmabuf while there are still
references to its netmems in the TX path. In that case, the netmems will
be put_netmem'd from a context where we can't unmap the dmabuf, Resolve
this by making __net_devmem_dmabuf_binding_free schedule_work'd.

Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat
of the implementation came from devmem TCP RFC v1[1], which included the
TX path, but Stan did all the rebasing on top of netmem/net_iov.

Cc: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-5-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13 11:12:48 +02:00
Mina Almasry
03e96b8c11 netmem: add niov->type attribute to distinguish different net_iov types
Later patches in the series adds TX net_iovs where there is no pp
associated, so we can't rely on niov->pp->mp_ops to tell what is the
type of the net_iov.

Add a type enum to the net_iov which tells us the net_iov type.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250508004830.4100853-2-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13 11:12:48 +02:00
Jens Axboe
f446c6311e io_uring/memmap: don't use page_address() on a highmem page
For older/32-bit systems with highmem, don't assume that the pages in
a mapped region are always going to be mapped. If io_region_init_ptr()
finds that the pages are coalescable, also check if the first page is
a HighMem page or not. If it is, fall through to the usual vmap()
mapping rather than attempt to get the unmapped page address.

Cc: stable@vger.kernel.org
Fixes: c4d0ac1c15 ("io_uring/memmap: optimise single folio regions")
Link: https://lore.kernel.org/all/681fe2fb.050a0220.f2294.001a.GAE@google.com/
Reported-by: syzbot+5b8c4abafcb1d791ccfc@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/681fed0a.050a0220.f2294.001c.GAE@google.com/
Reported-by: syzbot+6456a99dfdc2e78c4feb@syzkaller.appspotmail.com
Tested-by: syzbot+6456a99dfdc2e78c4feb@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-12 09:27:41 -06:00
Pavel Begunkov
8fb7aee055 io_uring: drain based on allocates reqs
Don't rely on CQ sequence numbers for draining, as it has become messy
and needs cq_extra adjustments. Instead, base it on the number of
allocated requests and only allow flushing when all requests are in the
drain list.

As a result, cq_extra is gone, no overhead for its accounting in aux cqe
posting, less bloating as it was inlined before, and it's in general
simpler than trying to track where we should bump it and where it should
be put back like in cases of overflow. Also, it'll likely help with
cleaning and unifying some of the CQ posting helpers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/46ece1e34320b046c06fee2498d6b4cd12a700f2.1746788718.git.asml.silence@gmail.com
Link: https://lore.kernel.org/r/24497b04b004bceada496033d3c9d09ff8e81ae9.1746944903.git.asml.silence@gmail.com
[axboe: fold in fix from link2]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-12 07:52:52 -06:00
hexue
63166b815d io_uring/uring_cmd: fix hybrid polling initialization issue
Modify the check for whether the timer is initialized during IO transfer
when passthrough is used with hybrid polling, to ensure that it's always
setup correctly.

Cc: stable@vger.kernel.org
Fixes: 01ee194d1a ("io_uring: add support for hybrid IOPOLL")
Signed-off-by: hexue <xue01.he@samsung.com>
Link: https://lore.kernel.org/r/20250512052025.293031-1-xue01.he@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-12 07:17:02 -06:00
Pavel Begunkov
63de899cb6 io_uring: count allocated requests
Keep track of the number requests a ring currently has allocated (and
not freed), it'll be needed in the next patch.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c8f8308294dc2a1cb8925d984d937d4fc14ab5d4.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:02 -06:00
Pavel Begunkov
b0c8a6401f io_uring: open code io_account_cq_overflow()
io_account_cq_overflow() doesn't help explaining what's going on in
there, and it'll become even smaller with following patches, so open
code it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e4333fa0d371f519e52a71148ebdffed4b8d3aa9.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:02 -06:00
Pavel Begunkov
19a94da447 io_uring: consolidate drain seq checking
We check sequences when queuing drained requests as well when flushing
them. Instead, always queue and immediately try to flush, so that all
seq handling can be kept contained in the flushing code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d4651f742e671af5b3216581e539ea5d31bc7125.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:02 -06:00
Pavel Begunkov
e91e4f692f io_uring: remove drain prealloc checks
Currently io_drain_req() has two steps. The first is fast path checking
sequence numbers. The second is allocations, rechecking and actual
queuing. Further simplify it by removing the first step.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4d06e89ed07611993d7bf89182de2300858379bd.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:01 -06:00
Pavel Begunkov
05b334110f io_uring: simplify drain ret passing
"ret" in io_drain_req() is only used in one place, remove it and pass
-ENOMEM directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ece724b77e66e6caabcc215e0032ee7ff140f289.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:01 -06:00
Pavel Begunkov
fde04c7e27 io_uring: fix spurious drain flushing
io_queue_deferred() is not tolerant to spurious calls not completing
some requests. You can have an inflight drain-marked request and another
request that came after and got queued into the drain list. Now, if
io_queue_deferred() is called before the first request completes, it'll
check the 2nd req with req_need_defer(), find that there is no drain
flag set, and queue it for execution.

To make io_queue_deferred() work, it should at least check sequences for
the first request, and then we need also need to check if there is
another drain request creating another bubble.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/972bde11b7d4ef25b3f5e3fd34f80e4d2aa345b8.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:01 -06:00
Pavel Begunkov
f979c20547 io_uring: account drain memory to cgroup
Account drain allocations against memcg. It's not a big problem as each
such allocation is paired with a request, which is accounted, but it's
nicer to follow the limits more closely.

Cc: stable@vger.kernel.org # 6.1
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f8dfdbd755c41fd9c75d12b858af07dfba5bbb68.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 08:01:01 -06:00
Pavel Begunkov
81a22c86ec io_uring: add lockdep asserts to io_add_aux_cqe
io_add_aux_cqe() can only be called for rings with uring_lock protected
completion queues, add a couple of assertions in regards to that.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c010eab7b94a187c00a9d46d8b67bf7fcad18af4.1746788592.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 07:58:55 -06:00
Pavel Begunkov
28b8cd864d io_uring/net: move CONFIG_NET guards to Makefile
Instruct Makefile to never try to compile net.c without CONFIG_NET and
kill ifdefs in the file.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f466400e20c3f536191bfd559b1f3cd2a2ab5a1e.1746788579.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 07:58:47 -06:00
Long Li
6ae4308116 io_uring: update parameter name in io_pin_pages function declaration
Rename first parameter in io_pin_pages from ubuf to uaddr for consistency
between declaration and implementation.

Signed-off-by: Long Li <leo.lilong@huawei.com>
Link: https://lore.kernel.org/r/20250509063015.3799255-1-leo.lilong@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 07:58:22 -06:00
Gabriel Krisman Bertazi
92835cebab io_uring/sqpoll: Increase task_work submission batch size
Our QA team reported a 10%-23%, throughput reduction on an io_uring
sqpoll testcase doing IO to a null_blk, that I traced back to a
reduction of the device submission queue depth utilization. It turns out
that, after commit af5d68f889 ("io_uring/sqpoll: manage task_work
privately"), we capped the number of task_work entries that can be
completed from a single spin of sqpoll to only 8 entries, before the
sqpoll goes around to (potentially) sleep.  While this cap doesn't drive
the submission side directly, it impacts the completion behavior, which
affects the number of IO queued by fio per sqpoll cycle on the
submission side, and io_uring ends up seeing less ios per sqpoll cycle.
As a result, block layer plugging is less effective, and we see more
time spent inside the block layer in profilings charts, and increased
submission latency measured by fio.

There are other places that have increased overhead once sqpoll sleeps
more often, such as the sqpoll utilization calculation.  But, in this
microbenchmark, those were not representative enough in perf charts, and
their removal didn't yield measurable changes in throughput.  The major
overhead comes from the fact we plug less, and less often, when submitting
to the block layer.

My benchmark is:

fio --ioengine=io_uring --direct=1 --iodepth=128 --runtime=300 --bs=4k \
    --invalidate=1 --time_based  --ramp_time=10 --group_reporting=1 \
    --filename=/dev/nullb0 --name=RandomReads-direct-nullb-sqpoll-4k-1 \
    --rw=randread --numjobs=1 --sqthread_poll

In one machine, tested on top of Linux 6.15-rc1, we have the following
baseline:
  READ: bw=4994MiB/s (5236MB/s), 4994MiB/s-4994MiB/s (5236MB/s-5236MB/s), io=439GiB (471GB), run=90001-90001msec

With this patch:
  READ: bw=5762MiB/s (6042MB/s), 5762MiB/s-5762MiB/s (6042MB/s-6042MB/s), io=506GiB (544GB), run=90001-90001msec

which is a 15% improvement in measured bandwidth.  The average
submission latency is noticeably lowered too.  As measured by
fio:

Baseline:
   lat (usec): min=20, max=241, avg=99.81, stdev=3.38
Patched:
   lat (usec): min=26, max=226, avg=86.48, stdev=4.82

If we look at blktrace, we can also see the plugging behavior is
improved. In the baseline, we end up limited to plugging 8 requests in
the block layer regardless of the device queue depth size, while after
patching we can drive more io, and we manage to utilize the full device
queue.

In the baseline, after a stabilization phase, an ordinary submission
looks like:
  254,0    1    49942     0.016028795  5977  U   N [iou-sqp-5976] 7

After patching, I see consistently more requests per unplug.
  254,0    1     4996     0.001432872  3145  U   N [iou-sqp-3144] 32

Ideally, the cap size would at least be the deep enough to fill the
device queue, but we can't predict that behavior, or assume all IO goes
to a single device, and thus can't guess the ideal batch size.  We also
don't want to let the tw run unbounded, though I'm not sure it would
really be a problem.  Instead, let's just give it a more sensible value
that will allow for more efficient batching.  I've tested with different
cap values, and initially proposed to increase the cap to 1024.  Jens
argued it is too big of a bump and I observed that, with 32, I'm no
longer able to observe this bottleneck in any of my machines.

Fixes: af5d68f889 ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20250508181203.3785544-1-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-09 07:56:53 -06:00
Jens Axboe
687b2bae0e io_uring: ensure deferred completions are flushed for multishot
Multishot normally uses io_req_post_cqe() to post completions, but when
stopping it, it may finish up with a deferred completion. This is fine,
except if another multishot event triggers before the deferred completions
get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
as new multishot completions get posted before the deferred ones are
flushed. This can cause confusion on the application side, if strict
ordering is required for the use case.

When multishot posting via io_req_post_cqe(), flush any pending deferred
completions first, if any.

Cc: stable@vger.kernel.org # 6.1+
Reported-by: Norman Maurer <norman_maurer@apple.com>
Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 07:55:15 -06:00
Pavel Begunkov
35adea1d01 io_uring: move io_req_put_rsrc_nodes()
It'd be nice to hide details of how rsrc nodes are used by a request
from rsrc.c, specifically which request fields store them, and what bits
are signifying if there is a node in a request. It rather belong to
generic request handling, so move the helper to io_uring.c. While doing
so remove clearing of ->buf_node as it's controlled by REQ_F_BUF_NODE
and doesn't require zeroing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bb73fb42baf825edb39344365aff48cdfdd4c692.1746533789.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 10:11:23 -06:00
Pavel Begunkov
9c2ff3f9b5 io_uring: remove io_preinit_req()
Apart from setting ->ctx, io_preinit_req() zeroes a bunch of fields of a
request, from which only ->file_node is mandatory. Remove the function
and zero the entire request on first allocation. With that, we also need
to initialise ->ctx every time, which might be a good thing for
performance as now we're likely overwriting the entire cache line, and
so it can write combined and avoid RMW.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ba5485dc913f1e275862ce88f5169d4ac4a33836.1746533807.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 10:11:23 -06:00
Pavel Begunkov
78967aabf6 io_uring/timeout: don't export link t-out disarm helper
[__]io_disarm_linked_timeout() are only used inside timeout.c. so
confine them inside the file.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1eb200911255e643bf252a8e65fb2c787340cf18.1746533800.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 10:11:23 -06:00
Pavel Begunkov
a5c98e9424 io_uring/zcrx: dmabuf backed zerocopy receive
Add support for dmabuf backed zcrx areas. To use it, the user should
pass IORING_ZCRX_AREA_DMABUF in the struct io_uring_zcrx_area_reg flags
field and pass a dmabuf fd in the dmabuf_fd field.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20bb1890e60a82ec945ab36370d1fd54be414ab6.1746097431.git.asml.silence@gmail.com
Link: https://lore.kernel.org/io-uring/6e37db97303212bbd8955f9501cf99b579f8aece.1746547722.git.asml.silence@gmail.com
[axboe: fold in fixup]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 10:11:00 -06:00
Keith Busch
02040353f4 io_uring: enable per-io write streams
Allow userspace to pass a per-I/O write stream in the SQE:

      __u8 write_stream;

The __u8 type matches the size the filesystems and block layer support.

Application can query the supported values from the block devices
max_write_streams sysfs attribute. Unsupported values are ignored by
file operations that do not support write streams or rejected with an
error by those that support them.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-7-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 07:46:43 -06:00
Jens Axboe
b53e523261 io_uring: always arm linked timeouts prior to issue
There are a few spots where linked timeouts are armed, and not all of
them adhere to the pre-arm, attempt issue, post-arm pattern. This can
be problematic if the linked request returns that it will trigger a
callback later, and does so before the linked timeout is fully armed.

Consolidate all the linked timeout handling into __io_issue_sqe(),
rather than have it spread throughout the various issue entry points.

Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/issues/1390
Reported-by: Chase Hiltz <chase@path.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-04 09:15:58 -06:00
Peter Zijlstra
93f1b6d79a futex: Move futex_queue() into futex_wait_setup()
futex_wait_setup() has a weird calling convention in order to return
hb to use as an argument to futex_queue().

Mostly such that requeue can have an extra test in between.

Reorder code a little to get rid of this and keep the hb usage inside
futex_wait_setup().

[bigeasy: fixes]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-4-bigeasy@linutronix.de
2025-05-03 12:02:05 +02:00
Pavel Begunkov
8a62804248 io_uring/zcrx: split common area map/unmap parts
Extract area type depedent parts of io_zcrx_[un]map_area from the
generic path. It'll be helpful once there are more area memory types
and not only user pages.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/50f6e893e2d20f937e628196cbf528d15f81c289.1746097431.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02 09:24:42 -06:00
Pavel Begunkov
782dfa329a io_uring/zcrx: split out memory holders from area
In the data path users of struct io_zcrx_area don't need to know what
kind of memory it's backed by. Only keep there generic bits in there and
and split out memory type dependent fields into a new structure. It also
logically separates the step that actually imports the memory, e.g.
pinning user pages, from the generic area initialisation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b60fc09c76921bf69e77eb17e07eb4decedb3bf4.1746097431.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02 09:24:42 -06:00
Pavel Begunkov
6c9589aa08 io_uring/zcrx: resolve netdev before area creation
Some area types will require a valid struct device to be created, so
resolve netdev and struct device before creating an area.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ac8c1482be22acfe9ca788d2c3ce31b7451ce488.1746097431.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02 09:24:42 -06:00
Pavel Begunkov
d760d3f59f io_uring/zcrx: improve area validation
dmabuf backed area will be taking an offset instead of addresses, and
io_buffer_validate() is not flexible enough to facilitate it. It also
takes an iovec, which may truncate the u64 length zcrx takes. Add a new
helper function for validation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0b3b735391a0a8f8971bf0121c19765131fddd3b.1746097431.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02 09:24:42 -06:00
Jens Axboe
f024d3a8de io_uring/fdinfo: annotate racy sq/cq head/tail reads
syzbot complains about the cached sq head read, and it's totally right.
But we don't need to care, it's just reading fdinfo, and reading the
CQ or SQ tail/head entries are known racy in that they are just a view
into that very instant and may of course be outdated by the time they
are reported.

Annotate both the SQ head and CQ tail read with data_race() to avoid
this syzbot complaint.

Link: https://lore.kernel.org/io-uring/6811f6dc.050a0220.39e3a1.0d0e.GAE@google.com/
Reported-by: syzbot+3e77fd302e99f5af9394@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-30 07:17:17 -06:00
Pavel Begunkov
91db6edc57 io_uring/cmd: move net cmd into a separate file
We keep socket io_uring command implementation in io_uring/uring_cmd.c.
Separate it from generic command code into a separate file.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/747d0519a2255bd055ae76b691d38d2b4c311001.1745843119.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-28 11:51:31 -06:00
Pavel Begunkov
27d2fed790 io_uring: delete misleading comment in io_fill_cqe_aux()
io_fill_cqe_aux() doesn't overflow completions, however it might fail
them and lets the caller handle it. Remove the comment, which doesn't
make any sense.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/021aa8c1d8f20ef2b66da6aeabb6b511938fd2c5.1745843119.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-28 11:51:31 -06:00
Jens Axboe
edd43f4d6f io_uring: fix 'sync' handling of io_fallback_tw()
A previous commit added a 'sync' parameter to io_fallback_tw(), which if
true, means the caller wants to wait on the fallback thread handling it.
But the logic is somewhat messed up, ensure that ctxs are swapped and
flushed appropriately.

Cc: stable@vger.kernel.org
Fixes: dfbe5561ae ("io_uring: flush offloaded and delayed task_work on exit")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 10:32:43 -06:00
Pavel Begunkov
f6da4fee69 io_uring/eventfd: open code io_eventfd_grab()
io_eventfd_grab() doesn't help wit understanding the path, it'll be
simpler to keep the helper open coded.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5cb53ce3876c2819db9e8055cf41dca4398521db.1745493845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 08:33:54 -06:00
Pavel Begunkov
da01f60f8a io_uring/eventfd: clean up rcu locking
Conditional locking is never welcome if there are better options. Move
rcu locking into io_eventfd_signal(), make it unconditional and use
guards. It also helps with sparse warnings.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/91a925e708ca8a5aa7fee61f96d29b24ea9adeaf.1745493845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-24 08:33:54 -06:00