Mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git (synced 2025-09-18 22:14:16 +00:00)
block-6.16-20250606
-----BEGIN PGP SIGNATURE-----

iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhC7/UQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgps+6D/9BOhkMyMkUF9LAev4PBNE+x3aftjl7Y1AY
EHv2vozb4nDwXIaalG4qGUhprz2+z+hqxYjmnlOAsqbixhcSzKK5z9rjxyDka776
x03vfvKXaXZUG7XN7ENY8sJnLx4QJ0nh4+0gzT9yDyq2vKvPFLEKweNOxKDKCSbE
31vGoLFwjltp74hX+Qrnj1KMaTLgvAaV0eXKWlbX7Iiw6GFVm200zb27gth6U8bV
WQAmjSkFQ0daHtAWmXIVy7hrXiCqe8D6YPKvBXnQ4cfKVbgG0HHDuTmQLpKGzfMi
rr24MU5vZjt6OsYalceiTtifSUcf/I2+iFV7HswOk9kpOY5A2ylsWawRP2mm4PDI
nJE3LaSTRpEvs5kzPJ2kr8Zp4/uvF6ehSq8Y9w52JekmOzxusLcRcswezaO00EI0
32uuK+P505EGTcCBTrEdtaI6k7zzQEeVoIpxqvMhNRG/s5vzvIV3eVrALu2HSDma
P3paEdx7PwJla3ndmdChfh1vUR3TW3gWoZvoNCVmJzNCnLEAScTS2NsiQeEjy8zs
20IGsrRgIqt9KR8GZ2zj1ZOM47Cg0dIU3pbbA2Ja71wx4TYXJCSFFRK7mzDtXYlY
BWOix/Dks8tk118cwuxnT+IiwmWDMbDZKnygh+4tiSyrs0IszeekRADLUu03C0Ve
Dhpljqf3zA==
=gs32
-----END PGP SIGNATURE-----

Merge tag 'block-6.16-20250606' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:

 - NVMe pull request via Christoph:
      - TCP error handling fix (Shin'ichiro Kawasaki)
      - TCP I/O stall handling fixes (Hannes Reinecke)
      - fix command limits status code (Keith Busch)
      - support vectored buffers also for passthrough (Pavel Begunkov)
      - spelling fixes (Yi Zhang)

 - MD pull request via Yu:
      - fix REQ_RAHEAD and REQ_NOWAIT IO err handling for raid1/10
      - fix max_write_behind setting for dm-raid
      - some minor cleanups

 - Integrity data direction fix and cleanup

 - bcache NULL pointer fix

 - Fix for loop missing write start/end handling

 - Decouple hardware queues and IO threads in ublk

 - Slew of ublk selftests additions and updates

* tag 'block-6.16-20250606' of git://git.kernel.dk/linux: (29 commits)
  nvme: spelling fixes
  nvme-tcp: fix I/O stalls on congested sockets
  nvme-tcp: sanitize request list handling
  nvme-tcp: remove tag set when second admin queue config fails
  nvme: enable vectored registered bufs for passthrough cmds
  nvme: fix implicit bool to flags conversion
  nvme: fix command limits status code
  selftests: ublk: kublk: improve behavior on init failure
  block: flip iter directions in blk_rq_integrity_map_user()
  block: drop direction param from bio_integrity_copy_user()
  selftests: ublk: cover PER_IO_DAEMON in more stress tests
  Documentation: ublk: document UBLK_F_PER_IO_DAEMON
  selftests: ublk: add stress test for per io daemons
  selftests: ublk: add functional test for per io daemons
  selftests: ublk: kublk: decouple ublk_queues from ublk server threads
  selftests: ublk: kublk: move per-thread data out of ublk_queue
  selftests: ublk: kublk: lift queue initialization out of thread
  selftests: ublk: kublk: tie sqe allocation to io instead of queue
  selftests: ublk: kublk: plumb q_id in io_uring user_data
  ublk: have a per-io daemon instead of a per-queue daemon
  ...
commit 6d8854216e
48 changed files with 708 additions and 393 deletions
@@ -115,15 +115,15 @@ managing and controlling ublk devices with help of several control commands:

 - ``UBLK_CMD_START_DEV``

-  After the server prepares userspace resources (such as creating per-queue
-  pthread & io_uring for handling ublk IO), this command is sent to the
+  After the server prepares userspace resources (such as creating I/O handler
+  threads & io_uring for handling ublk IO), this command is sent to the
   driver for allocating & exposing ``/dev/ublkb*``. Parameters set via
   ``UBLK_CMD_SET_PARAMS`` are applied for creating the device.

 - ``UBLK_CMD_STOP_DEV``

   Halt IO on ``/dev/ublkb*`` and remove the device. When this command returns,
-  ublk server will release resources (such as destroying per-queue pthread &
+  ublk server will release resources (such as destroying I/O handler threads &
   io_uring).

 - ``UBLK_CMD_DEL_DEV``

@@ -208,15 +208,15 @@ managing and controlling ublk devices with help of several control commands:
   modify how I/O is handled while the ublk server is dying/dead (this is called
   the ``nosrv`` case in the driver code).

-  With just ``UBLK_F_USER_RECOVERY`` set, after one ubq_daemon(ublk server's io
-  handler) is dying, ublk does not delete ``/dev/ublkb*`` during the whole
+  With just ``UBLK_F_USER_RECOVERY`` set, after the ublk server exits,
+  ublk does not delete ``/dev/ublkb*`` during the whole
   recovery stage and ublk device ID is kept. It is ublk server's
   responsibility to recover the device context by its own knowledge.
   Requests which have not been issued to userspace are requeued. Requests
   which have been issued to userspace are aborted.

-  With ``UBLK_F_USER_RECOVERY_REISSUE`` additionally set, after one ubq_daemon
-  (ublk server's io handler) is dying, contrary to ``UBLK_F_USER_RECOVERY``,
+  With ``UBLK_F_USER_RECOVERY_REISSUE`` additionally set, after the ublk server
+  exits, contrary to ``UBLK_F_USER_RECOVERY``,
   requests which have been issued to userspace are requeued and will be
   re-issued to the new process after handling ``UBLK_CMD_END_USER_RECOVERY``.
   ``UBLK_F_USER_RECOVERY_REISSUE`` is designed for backends who tolerate

@@ -241,10 +241,11 @@ can be controlled/accessed just inside this container.
 Data plane
 ----------

-ublk server needs to create per-queue IO pthread & io_uring for handling IO
-commands via io_uring passthrough. The per-queue IO pthread
-focuses on IO handling and shouldn't handle any control & management
-tasks.
+The ublk server should create dedicated threads for handling I/O. Each
+thread should have its own io_uring through which it is notified of new
+I/O, and through which it can complete I/O. These dedicated threads
+should focus on IO handling and shouldn't handle any control &
+management tasks.

 The's IO is assigned by a unique tag, which is 1:1 mapping with IO
 request of ``/dev/ublkb*``.

@@ -265,6 +266,18 @@ with specified IO tag in the command data:
   destined to ``/dev/ublkb*``. This command is sent only once from the server
   IO pthread for ublk driver to setup IO forward environment.

+  Once a thread issues this command against a given (qid,tag) pair, the thread
+  registers itself as that I/O's daemon. In the future, only that I/O's daemon
+  is allowed to issue commands against the I/O. If any other thread attempts
+  to issue a command against a (qid,tag) pair for which the thread is not the
+  daemon, the command will fail. Daemons can be reset only be going through
+  recovery.
+
+  The ability for every (qid,tag) pair to have its own independent daemon task
+  is indicated by the ``UBLK_F_PER_IO_DAEMON`` feature. If this feature is not
+  supported by the driver, daemons must be per-queue instead - i.e. all I/Os
+  associated to a single qid must be handled by the same task.
+
 - ``UBLK_IO_COMMIT_AND_FETCH_REQ``

   When an IO request is destined to ``/dev/ublkb*``, the driver stores
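The per-io daemon described above is established implicitly: whichever thread issues the first FETCH_REQ for a (qid,tag) pair becomes that I/O's daemon. A minimal sketch of that registration step from the server side, assuming liburing, the ublk UAPI header, the character device registered as fixed file 0, and a caller-chosen user_data encoding:

/*
 * Hedged sketch: a per-I/O daemon thread registering itself for one
 * (q_id, tag) pair by issuing UBLK_U_IO_FETCH_REQ through io_uring
 * passthrough. The 16-byte ublksrv_io_cmd payload fits the regular SQE;
 * error handling is abbreviated and the fixed-file slot / user_data
 * layout are illustrative assumptions.
 */
#include <liburing.h>
#include <linux/ublk_cmd.h>
#include <string.h>

static int fetch_one_io(struct io_uring *ring, __u16 q_id, __u16 tag,
			__u64 io_buf_addr)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct ublksrv_io_cmd *cmd;

	if (!sqe)
		return -1;

	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_URING_CMD;
	sqe->fd = 0;				/* fixed-file slot of /dev/ublkcN */
	sqe->flags = IOSQE_FIXED_FILE;
	sqe->cmd_op = UBLK_U_IO_FETCH_REQ;	/* ioctl-encoded command opcode */
	sqe->user_data = ((__u64)q_id << 16) | tag;	/* caller-chosen encoding */

	cmd = (struct ublksrv_io_cmd *)sqe->cmd;
	cmd->q_id = q_id;
	cmd->tag = tag;
	cmd->addr = io_buf_addr;		/* per-io data buffer */

	/* after this submission, the issuing thread is this (q_id, tag)'s daemon */
	return io_uring_submit(ring);
}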
@@ -154,10 +154,9 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
 EXPORT_SYMBOL(bio_integrity_add_page);

 static int bio_integrity_copy_user(struct bio *bio, struct bio_vec *bvec,
-				   int nr_vecs, unsigned int len,
-				   unsigned int direction)
+				   int nr_vecs, unsigned int len)
 {
-	bool write = direction == ITER_SOURCE;
+	bool write = op_is_write(bio_op(bio));
 	struct bio_integrity_payload *bip;
 	struct iov_iter iter;
 	void *buf;

@@ -168,7 +167,7 @@ static int bio_integrity_copy_user(struct bio *bio, struct bio_vec *bvec,
 		return -ENOMEM;

 	if (write) {
-		iov_iter_bvec(&iter, direction, bvec, nr_vecs, len);
+		iov_iter_bvec(&iter, ITER_SOURCE, bvec, nr_vecs, len);
 		if (!copy_from_iter_full(buf, len, &iter)) {
 			ret = -EFAULT;
 			goto free_buf;

@@ -264,7 +263,7 @@ int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
 	struct page *stack_pages[UIO_FASTIOV], **pages = stack_pages;
 	struct bio_vec stack_vec[UIO_FASTIOV], *bvec = stack_vec;
 	size_t offset, bytes = iter->count;
-	unsigned int direction, nr_bvecs;
+	unsigned int nr_bvecs;
 	int ret, nr_vecs;
 	bool copy;

@@ -273,11 +272,6 @@ int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
 	if (bytes >> SECTOR_SHIFT > queue_max_hw_sectors(q))
 		return -E2BIG;

-	if (bio_data_dir(bio) == READ)
-		direction = ITER_DEST;
-	else
-		direction = ITER_SOURCE;
-
 	nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS + 1);
 	if (nr_vecs > BIO_MAX_VECS)
 		return -E2BIG;

@@ -300,8 +294,7 @@ int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter)
 		copy = true;

 	if (copy)
-		ret = bio_integrity_copy_user(bio, bvec, nr_bvecs, bytes,
-					      direction);
+		ret = bio_integrity_copy_user(bio, bvec, nr_bvecs, bytes);
 	else
 		ret = bio_integrity_init_user(bio, bvec, nr_bvecs, bytes);
 	if (ret)
@@ -117,13 +117,8 @@ int blk_rq_integrity_map_user(struct request *rq, void __user *ubuf,
 {
 	int ret;
 	struct iov_iter iter;
-	unsigned int direction;
-
-	if (op_is_write(req_op(rq)))
-		direction = ITER_DEST;
-	else
-		direction = ITER_SOURCE;
-	iov_iter_ubuf(&iter, direction, ubuf, bytes);
+
+	iov_iter_ubuf(&iter, rq_data_dir(rq), ubuf, bytes);
 	ret = bio_integrity_map_user(rq->bio, &iter);
 	if (ret)
 		return ret;
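The two integrity hunks above drop the explicitly passed direction and derive it from the operation itself. The underlying convention is that ITER_SOURCE (== WRITE == 1) means the kernel reads from the user buffer and ITER_DEST (== READ == 0) means it fills that buffer, which is also why rq_data_dir(rq) can be passed straight to iov_iter_ubuf(). A tiny standalone sketch of that mapping, with the enum values mirrored here only for illustration:

/*
 * Hedged illustration of the iov_iter direction convention used above:
 * for a write request the kernel *reads* integrity data from the user
 * buffer (ITER_SOURCE), for a read request it *writes* returned integrity
 * data back to it (ITER_DEST). Not kernel code, just the mapping.
 */
#include <stdbool.h>
#include <stdio.h>

enum iter_dir { ITER_DEST = 0, ITER_SOURCE = 1 };	/* mirrors READ == 0, WRITE == 1 */

static enum iter_dir integrity_copy_dir(bool op_is_write)
{
	/* op_is_write(bio_op(bio)) / rq_data_dir(rq) collapse to this */
	return op_is_write ? ITER_SOURCE : ITER_DEST;
}

int main(void)
{
	printf("write op -> %s\n",
	       integrity_copy_dir(true) == ITER_SOURCE ? "ITER_SOURCE" : "ITER_DEST");
	printf("read op  -> %s\n",
	       integrity_copy_dir(false) == ITER_SOURCE ? "ITER_SOURCE" : "ITER_DEST");
	return 0;
}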
@@ -308,11 +308,14 @@ end_io:
 static void lo_rw_aio_do_completion(struct loop_cmd *cmd)
 {
 	struct request *rq = blk_mq_rq_from_pdu(cmd);
+	struct loop_device *lo = rq->q->queuedata;

 	if (!atomic_dec_and_test(&cmd->ref))
 		return;
 	kfree(cmd->bvec);
 	cmd->bvec = NULL;
+	if (req_op(rq) == REQ_OP_WRITE)
+		file_end_write(lo->lo_backing_file);
 	if (likely(!blk_should_fake_timeout(rq->q)))
 		blk_mq_complete_request(rq);
 }

@@ -387,9 +390,10 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 		cmd->iocb.ki_flags = 0;
 	}

-	if (rw == ITER_SOURCE)
+	if (rw == ITER_SOURCE) {
+		file_start_write(lo->lo_backing_file);
 		ret = file->f_op->write_iter(&cmd->iocb, &iter);
-	else
+	} else
 		ret = file->f_op->read_iter(&cmd->iocb, &iter);

 	lo_rw_aio_do_completion(cmd);
@@ -69,7 +69,8 @@
 		| UBLK_F_USER_RECOVERY_FAIL_IO \
 		| UBLK_F_UPDATE_SIZE \
 		| UBLK_F_AUTO_BUF_REG \
-		| UBLK_F_QUIESCE)
+		| UBLK_F_QUIESCE \
+		| UBLK_F_PER_IO_DAEMON)

 #define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \
 		| UBLK_F_USER_RECOVERY_REISSUE \

@@ -166,6 +167,8 @@ struct ublk_io {
 		/* valid if UBLK_IO_FLAG_OWNED_BY_SRV is set */
 		struct request *req;
 	};
+
+	struct task_struct *task;
 };

 struct ublk_queue {

@@ -173,11 +176,9 @@ struct ublk_queue {
 	int q_depth;

 	unsigned long flags;
-	struct task_struct	*ubq_daemon;
 	struct ublksrv_io_desc *io_cmd_buf;

 	bool force_abort;
 	bool timeout;
 	bool canceling;
 	bool fail_io; /* copy of dev->state == UBLK_S_DEV_FAIL_IO */
 	unsigned short nr_io_ready;	/* how many ios setup */
@@ -1099,11 +1100,6 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
 	return io_uring_cmd_to_pdu(ioucmd, struct ublk_uring_cmd_pdu);
 }

-static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq)
-{
-	return !ubq->ubq_daemon || ubq->ubq_daemon->flags & PF_EXITING;
-}
-
 /* todo: handle partial completion */
 static inline void __ublk_complete_rq(struct request *req)
 {

@@ -1275,13 +1271,13 @@ static void ublk_dispatch_req(struct ublk_queue *ubq,
 	/*
 	 * Task is exiting if either:
 	 *
-	 * (1) current != ubq_daemon.
+	 * (1) current != io->task.
 	 * io_uring_cmd_complete_in_task() tries to run task_work
-	 * in a workqueue if ubq_daemon(cmd's task) is PF_EXITING.
+	 * in a workqueue if cmd's task is PF_EXITING.
 	 *
 	 * (2) current->flags & PF_EXITING.
 	 */
-	if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) {
+	if (unlikely(current != io->task || current->flags & PF_EXITING)) {
 		__ublk_abort_rq(ubq, req);
 		return;
 	}
@@ -1330,24 +1326,22 @@ static void ublk_cmd_list_tw_cb(struct io_uring_cmd *cmd,
 {
 	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
 	struct request *rq = pdu->req_list;
-	struct ublk_queue *ubq = pdu->ubq;
 	struct request *next;

 	do {
 		next = rq->rq_next;
 		rq->rq_next = NULL;
-		ublk_dispatch_req(ubq, rq, issue_flags);
+		ublk_dispatch_req(rq->mq_hctx->driver_data, rq, issue_flags);
 		rq = next;
 	} while (rq);
 }

-static void ublk_queue_cmd_list(struct ublk_queue *ubq, struct rq_list *l)
+static void ublk_queue_cmd_list(struct ublk_io *io, struct rq_list *l)
 {
-	struct request *rq = rq_list_peek(l);
-	struct io_uring_cmd *cmd = ubq->ios[rq->tag].cmd;
+	struct io_uring_cmd *cmd = io->cmd;
 	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);

-	pdu->req_list = rq;
+	pdu->req_list = rq_list_peek(l);
 	rq_list_init(l);
 	io_uring_cmd_complete_in_task(cmd, ublk_cmd_list_tw_cb);
 }

@@ -1355,13 +1349,10 @@ static void ublk_queue_cmd_list(struct ublk_queue *ubq, struct rq_list *l)
 static enum blk_eh_timer_return ublk_timeout(struct request *rq)
 {
 	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
+	struct ublk_io *io = &ubq->ios[rq->tag];

 	if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) {
-		if (!ubq->timeout) {
-			send_sig(SIGKILL, ubq->ubq_daemon, 0);
-			ubq->timeout = true;
-		}
-
+		send_sig(SIGKILL, io->task, 0);
 		return BLK_EH_DONE;
 	}

@@ -1429,24 +1420,25 @@ static void ublk_queue_rqs(struct rq_list *rqlist)
 {
 	struct rq_list requeue_list = { };
 	struct rq_list submit_list = { };
-	struct ublk_queue *ubq = NULL;
+	struct ublk_io *io = NULL;
 	struct request *req;

 	while ((req = rq_list_pop(rqlist))) {
 		struct ublk_queue *this_q = req->mq_hctx->driver_data;
+		struct ublk_io *this_io = &this_q->ios[req->tag];

-		if (ubq && ubq != this_q && !rq_list_empty(&submit_list))
-			ublk_queue_cmd_list(ubq, &submit_list);
-		ubq = this_q;
+		if (io && io->task != this_io->task && !rq_list_empty(&submit_list))
+			ublk_queue_cmd_list(io, &submit_list);
+		io = this_io;

-		if (ublk_prep_req(ubq, req, true) == BLK_STS_OK)
+		if (ublk_prep_req(this_q, req, true) == BLK_STS_OK)
 			rq_list_add_tail(&submit_list, req);
 		else
 			rq_list_add_tail(&requeue_list, req);
 	}

-	if (ubq && !rq_list_empty(&submit_list))
-		ublk_queue_cmd_list(ubq, &submit_list);
+	if (!rq_list_empty(&submit_list))
+		ublk_queue_cmd_list(io, &submit_list);
 	*rqlist = requeue_list;
 }
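ublk_queue_rqs() now batches consecutive requests whose ublk_io entries share a daemon task and flushes the pending batch whenever the owning task changes, plus once at the end. A small userspace sketch of that flush-on-key-change pattern, using illustrative types rather than the driver's:

/*
 * Hedged sketch of the batching pattern above: accumulate items that share
 * the same owner (a thread id standing in for io->task) and flush the batch
 * whenever the owner changes. Types and names are illustrative.
 */
#include <stdio.h>

struct item { int owner; int tag; };

static void flush_batch(const struct item *batch, int n)
{
	if (n)
		printf("dispatch %d request(s) to owner %d\n", n, batch[0].owner);
}

static void queue_items(const struct item *items, int n)
{
	struct item batch[16];
	int batched = 0;

	for (int i = 0; i < n; i++) {
		/* owner changed (or batch full): hand it to its daemon first */
		if (batched && (batch[0].owner != items[i].owner || batched == 16)) {
			flush_batch(batch, batched);
			batched = 0;
		}
		batch[batched++] = items[i];
	}
	flush_batch(batch, batched);	/* trailing batch */
}

int main(void)
{
	struct item items[] = { {1, 0}, {1, 1}, {2, 2}, {2, 3}, {1, 4} };

	queue_items(items, 5);
	return 0;
}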
@@ -1474,17 +1466,6 @@ static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
 	/* All old ioucmds have to be completed */
 	ubq->nr_io_ready = 0;

-	/*
-	 * old daemon is PF_EXITING, put it now
-	 *
-	 * It could be NULL in case of closing one quisced device.
-	 */
-	if (ubq->ubq_daemon)
-		put_task_struct(ubq->ubq_daemon);
-	/* We have to reset it to NULL, otherwise ub won't accept new FETCH_REQ */
-	ubq->ubq_daemon = NULL;
 	ubq->timeout = false;

 	for (i = 0; i < ubq->q_depth; i++) {
 		struct ublk_io *io = &ubq->ios[i];

@@ -1495,6 +1476,17 @@ static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
 		io->flags &= UBLK_IO_FLAG_CANCELED;
 		io->cmd = NULL;
 		io->addr = 0;
+
+		/*
+		 * old task is PF_EXITING, put it now
+		 *
+		 * It could be NULL in case of closing one quiesced
+		 * device.
+		 */
+		if (io->task) {
+			put_task_struct(io->task);
+			io->task = NULL;
+		}
 	}
 }

@@ -1516,7 +1508,7 @@ static void ublk_reset_ch_dev(struct ublk_device *ub)
 	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
 		ublk_queue_reinit(ub, ublk_get_queue(ub, i));

-	/* set to NULL, otherwise new ubq_daemon cannot mmap the io_cmd_buf */
+	/* set to NULL, otherwise new tasks cannot mmap io_cmd_buf */
 	ub->mm = NULL;
 	ub->nr_queues_ready = 0;
 	ub->nr_privileged_daemon = 0;
@ -1783,6 +1775,7 @@ static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
|
|||
struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
|
||||
struct ublk_queue *ubq = pdu->ubq;
|
||||
struct task_struct *task;
|
||||
struct ublk_io *io;
|
||||
|
||||
if (WARN_ON_ONCE(!ubq))
|
||||
return;
|
||||
|
@ -1791,13 +1784,14 @@ static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
|
|||
return;
|
||||
|
||||
task = io_uring_cmd_get_task(cmd);
|
||||
if (WARN_ON_ONCE(task && task != ubq->ubq_daemon))
|
||||
io = &ubq->ios[pdu->tag];
|
||||
if (WARN_ON_ONCE(task && task != io->task))
|
||||
return;
|
||||
|
||||
if (!ubq->canceling)
|
||||
ublk_start_cancel(ubq);
|
||||
|
||||
WARN_ON_ONCE(ubq->ios[pdu->tag].cmd != cmd);
|
||||
WARN_ON_ONCE(io->cmd != cmd);
|
||||
ublk_cancel_cmd(ubq, pdu->tag, issue_flags);
|
||||
}
|
||||
|
||||
|
@ -1930,8 +1924,6 @@ static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq)
|
|||
{
|
||||
ubq->nr_io_ready++;
|
||||
if (ublk_queue_ready(ubq)) {
|
||||
ubq->ubq_daemon = current;
|
||||
get_task_struct(ubq->ubq_daemon);
|
||||
ub->nr_queues_ready++;
|
||||
|
||||
if (capable(CAP_SYS_ADMIN))
|
||||
|
@ -2084,6 +2076,7 @@ static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_queue *ubq,
|
|||
}
|
||||
|
||||
ublk_fill_io_cmd(io, cmd, buf_addr);
|
||||
WRITE_ONCE(io->task, get_task_struct(current));
|
||||
ublk_mark_io_ready(ub, ubq);
|
||||
out:
|
||||
mutex_unlock(&ub->mutex);
|
||||
|
@ -2179,6 +2172,7 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
|
|||
const struct ublksrv_io_cmd *ub_cmd)
|
||||
{
|
||||
struct ublk_device *ub = cmd->file->private_data;
|
||||
struct task_struct *task;
|
||||
struct ublk_queue *ubq;
|
||||
struct ublk_io *io;
|
||||
u32 cmd_op = cmd->cmd_op;
|
||||
|
@ -2193,13 +2187,14 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
|
|||
goto out;
|
||||
|
||||
ubq = ublk_get_queue(ub, ub_cmd->q_id);
|
||||
if (ubq->ubq_daemon && ubq->ubq_daemon != current)
|
||||
goto out;
|
||||
|
||||
if (tag >= ubq->q_depth)
|
||||
goto out;
|
||||
|
||||
io = &ubq->ios[tag];
|
||||
task = READ_ONCE(io->task);
|
||||
if (task && task != current)
|
||||
goto out;
|
||||
|
||||
/* there is pending io cmd, something must be wrong */
|
||||
if (io->flags & UBLK_IO_FLAG_ACTIVE) {
|
||||
|
@ -2449,9 +2444,14 @@ static void ublk_deinit_queue(struct ublk_device *ub, int q_id)
|
|||
{
|
||||
int size = ublk_queue_cmd_buf_size(ub, q_id);
|
||||
struct ublk_queue *ubq = ublk_get_queue(ub, q_id);
|
||||
int i;
|
||||
|
||||
for (i = 0; i < ubq->q_depth; i++) {
|
||||
struct ublk_io *io = &ubq->ios[i];
|
||||
if (io->task)
|
||||
put_task_struct(io->task);
|
||||
}
|
||||
|
||||
if (ubq->ubq_daemon)
|
||||
put_task_struct(ubq->ubq_daemon);
|
||||
if (ubq->io_cmd_buf)
|
||||
free_pages((unsigned long)ubq->io_cmd_buf, get_order(size));
|
||||
}
|
||||
|
@@ -2923,7 +2923,8 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
 	ub->dev_info.flags &= UBLK_F_ALL;

 	ub->dev_info.flags |= UBLK_F_CMD_IOCTL_ENCODE |
-		UBLK_F_URING_CMD_COMP_IN_TASK;
+		UBLK_F_URING_CMD_COMP_IN_TASK |
+		UBLK_F_PER_IO_DAEMON;

 	/* GET_DATA isn't needed any more with USER_COPY or ZERO COPY */
 	if (ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY |
@ -3188,14 +3189,14 @@ static int ublk_ctrl_end_recovery(struct ublk_device *ub,
|
|||
int ublksrv_pid = (int)header->data[0];
|
||||
int ret = -EINVAL;
|
||||
|
||||
pr_devel("%s: Waiting for new ubq_daemons(nr: %d) are ready, dev id %d...\n",
|
||||
__func__, ub->dev_info.nr_hw_queues, header->dev_id);
|
||||
/* wait until new ubq_daemon sending all FETCH_REQ */
|
||||
pr_devel("%s: Waiting for all FETCH_REQs, dev id %d...\n", __func__,
|
||||
header->dev_id);
|
||||
|
||||
if (wait_for_completion_interruptible(&ub->completion))
|
||||
return -EINTR;
|
||||
|
||||
pr_devel("%s: All new ubq_daemons(nr: %d) are ready, dev id %d\n",
|
||||
__func__, ub->dev_info.nr_hw_queues, header->dev_id);
|
||||
pr_devel("%s: All FETCH_REQs received, dev id %d\n", __func__,
|
||||
header->dev_id);
|
||||
|
||||
mutex_lock(&ub->mutex);
|
||||
if (ublk_nosrv_should_stop_dev(ub))
|
||||
|
|
|
@ -89,8 +89,6 @@
|
|||
* Test module load/unload
|
||||
*/
|
||||
|
||||
#define MAX_NEED_GC 64
|
||||
#define MAX_SAVE_PRIO 72
|
||||
#define MAX_GC_TIMES 100
|
||||
#define MIN_GC_NODES 100
|
||||
#define GC_SLEEP_MS 100
|
||||
|
|
|
@ -1733,7 +1733,12 @@ static CLOSURE_CALLBACK(cache_set_flush)
|
|||
mutex_unlock(&b->write_lock);
|
||||
}
|
||||
|
||||
if (ca->alloc_thread)
|
||||
/*
|
||||
* If the register_cache_set() call to bch_cache_set_alloc() failed,
|
||||
* ca has not been assigned a value and return error.
|
||||
* So we need check ca is not NULL during bch_cache_set_unregister().
|
||||
*/
|
||||
if (ca && ca->alloc_thread)
|
||||
kthread_stop(ca->alloc_thread);
|
||||
|
||||
if (c->journal.cur) {
|
||||
|
@@ -2233,15 +2238,47 @@ static int cache_alloc(struct cache *ca)
 	bio_init(&ca->journal.bio, NULL, ca->journal.bio.bi_inline_vecs, 8, 0);

 	/*
-	 * when ca->sb.njournal_buckets is not zero, journal exists,
-	 * and in bch_journal_replay(), tree node may split,
-	 * so bucket of RESERVE_BTREE type is needed,
-	 * the worst situation is all journal buckets are valid journal,
-	 * and all the keys need to replay,
-	 * so the number of RESERVE_BTREE type buckets should be as much
-	 * as journal buckets
+	 * When the cache disk is first registered, ca->sb.njournal_buckets
+	 * is zero, and it is assigned in run_cache_set().
+	 *
+	 * When ca->sb.njournal_buckets is not zero, journal exists,
+	 * and in bch_journal_replay(), tree node may split.
+	 * The worst situation is all journal buckets are valid journal,
+	 * and all the keys need to replay, so the number of RESERVE_BTREE
+	 * type buckets should be as much as journal buckets.
+	 *
+	 * If the number of RESERVE_BTREE type buckets is too few, the
+	 * bch_allocator_thread() may hang up and unable to allocate
+	 * bucket. The situation is roughly as follows:
+	 *
+	 * 1. In bch_data_insert_keys(), if the operation is not op->replace,
+	 *    it will call the bch_journal(), which increments the journal_ref
+	 *    counter. This counter is only decremented after bch_btree_insert
+	 *    completes.
+	 *
+	 * 2. When calling bch_btree_insert, if the btree needs to split,
+	 *    it will call btree_split() and btree_check_reserve() to check
+	 *    whether there are enough reserved buckets in the RESERVE_BTREE
+	 *    slot. If not enough, bcache_btree_root() will repeatedly retry.
+	 *
+	 * 3. Normally, the bch_allocator_thread is responsible for filling
+	 *    the reservation slots from the free_inc bucket list. When the
+	 *    free_inc bucket list is exhausted, the bch_allocator_thread
+	 *    will call invalidate_buckets() until free_inc is refilled.
+	 *    Then bch_allocator_thread calls bch_prio_write() once. and
+	 *    bch_prio_write() will call bch_journal_meta() and waits for
+	 *    the journal write to complete.
+	 *
+	 * 4. During journal_write, journal_write_unlocked() is be called.
+	 *    If journal full occurs, journal_reclaim() and btree_flush_write()
+	 *    will be called sequentially, then retry journal_write.
+	 *
+	 * 5. When 2 and 4 occur together, IO will hung up and cannot recover.
+	 *
+	 * Therefore, reserve more RESERVE_BTREE type buckets.
 	 */
-	btree_buckets = ca->sb.njournal_buckets ?: 8;
+	btree_buckets = clamp_t(size_t, ca->sb.nbuckets >> 7,
+				32, SB_JOURNAL_BUCKETS);
 	free = roundup_pow_of_two(ca->sb.nbuckets) >> 10;
 	if (!free) {
 		ret = -EPERM;
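The new sizing reserves nbuckets/128 RESERVE_BTREE buckets, clamped to [32, SB_JOURNAL_BUCKETS]. A quick standalone sketch of the arithmetic, assuming SB_JOURNAL_BUCKETS is 256 as in the bcache on-disk format header:

/*
 * Hedged sketch of the RESERVE_BTREE sizing above: nbuckets >> 7, clamped
 * to [32, SB_JOURNAL_BUCKETS]. SB_JOURNAL_BUCKETS is assumed to be 256.
 */
#include <stddef.h>
#include <stdio.h>

#define SB_JOURNAL_BUCKETS 256	/* assumption; see the bcache uapi header */

static size_t clamp_size(size_t val, size_t lo, size_t hi)
{
	return val < lo ? lo : (val > hi ? hi : val);	/* clamp_t() equivalent */
}

int main(void)
{
	size_t nbuckets[] = { 1000, 10000, 100000, 1000000 };

	for (int i = 0; i < 4; i++)
		printf("nbuckets=%zu -> btree reserve=%zu buckets\n",
		       nbuckets[i],
		       clamp_size(nbuckets[i] >> 7, 32, SB_JOURNAL_BUCKETS));
	return 0;
}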
@@ -1356,11 +1356,7 @@ static int parse_raid_params(struct raid_set *rs, struct dm_arg_set *as,
 			return -EINVAL;
 		}

-		/*
-		 * In device-mapper, we specify things in sectors, but
-		 * MD records this value in kB
-		 */
-		if (value < 0 || value / 2 > COUNTER_MAX) {
+		if (value < 0) {
 			rs->ti->error = "Max write-behind limit out of range";
 			return -EINVAL;
 		}
@@ -105,9 +105,19 @@
  *
  */

+typedef __u16 bitmap_counter_t;

 #define PAGE_BITS (PAGE_SIZE << 3)
 #define PAGE_BIT_SHIFT (PAGE_SHIFT + 3)

+#define COUNTER_BITS 16
+#define COUNTER_BIT_SHIFT 4
+#define COUNTER_BYTE_SHIFT (COUNTER_BIT_SHIFT - 3)

+#define NEEDED_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 1)))
+#define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2)))
+#define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1)

+#define NEEDED(x) (((bitmap_counter_t) x) & NEEDED_MASK)
+#define RESYNC(x) (((bitmap_counter_t) x) & RESYNC_MASK)
+#define COUNTER(x) (((bitmap_counter_t) x) & COUNTER_MAX)

@@ -789,7 +799,7 @@ static int md_bitmap_new_disk_sb(struct bitmap *bitmap)
 	 * is a good choice? We choose COUNTER_MAX / 2 arbitrarily.
 	 */
 	write_behind = bitmap->mddev->bitmap_info.max_write_behind;
-	if (write_behind > COUNTER_MAX)
+	if (write_behind > COUNTER_MAX / 2)
 		write_behind = COUNTER_MAX / 2;
 	sb->write_behind = cpu_to_le32(write_behind);
 	bitmap->mddev->bitmap_info.max_write_behind = write_behind;
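Together with the dm-raid hunk above, write-behind validation now lives in one place: dm-raid only rejects negative values, and the bitmap superblock code converts sectors to KiB and clamps to COUNTER_MAX / 2. A standalone sketch of that arithmetic:

/*
 * Hedged sketch of the write-behind limit handling described above:
 * dm-raid passes the value in 512-byte sectors, MD stores KiB (value / 2),
 * and the stored value is clamped to COUNTER_MAX / 2.
 */
#include <stdint.h>
#include <stdio.h>

#define COUNTER_BITS	16
#define RESYNC_MASK	((uint16_t)(1 << (COUNTER_BITS - 2)))
#define COUNTER_MAX	((uint16_t)(RESYNC_MASK - 1))	/* 0x3fff */

static uint32_t clamp_write_behind(uint32_t sectors)
{
	uint32_t kib = sectors / 2;		/* sectors -> KiB, as MD records it */

	if (kib > COUNTER_MAX / 2)
		kib = COUNTER_MAX / 2;
	return kib;
}

int main(void)
{
	printf("COUNTER_MAX = %u\n", COUNTER_MAX);
	printf("65536 sectors -> %u KiB stored\n", clamp_write_behind(65536));
	printf("8192 sectors  -> %u KiB stored\n", clamp_write_behind(8192));
	return 0;
}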
@ -1672,13 +1682,13 @@ __acquires(bitmap->lock)
|
|||
&(bitmap->bp[page].map[pageoff]);
|
||||
}
|
||||
|
||||
static int bitmap_startwrite(struct mddev *mddev, sector_t offset,
|
||||
unsigned long sectors)
|
||||
static void bitmap_start_write(struct mddev *mddev, sector_t offset,
|
||||
unsigned long sectors)
|
||||
{
|
||||
struct bitmap *bitmap = mddev->bitmap;
|
||||
|
||||
if (!bitmap)
|
||||
return 0;
|
||||
return;
|
||||
|
||||
while (sectors) {
|
||||
sector_t blocks;
|
||||
|
@ -1688,7 +1698,7 @@ static int bitmap_startwrite(struct mddev *mddev, sector_t offset,
|
|||
bmc = md_bitmap_get_counter(&bitmap->counts, offset, &blocks, 1);
|
||||
if (!bmc) {
|
||||
spin_unlock_irq(&bitmap->counts.lock);
|
||||
return 0;
|
||||
return;
|
||||
}
|
||||
|
||||
if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {
|
||||
|
@ -1724,11 +1734,10 @@ static int bitmap_startwrite(struct mddev *mddev, sector_t offset,
|
|||
else
|
||||
sectors = 0;
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
||||
static void bitmap_endwrite(struct mddev *mddev, sector_t offset,
|
||||
unsigned long sectors)
|
||||
static void bitmap_end_write(struct mddev *mddev, sector_t offset,
|
||||
unsigned long sectors)
|
||||
{
|
||||
struct bitmap *bitmap = mddev->bitmap;
|
||||
|
||||
|
@ -2205,9 +2214,9 @@ static struct bitmap *__bitmap_create(struct mddev *mddev, int slot)
|
|||
return ERR_PTR(err);
|
||||
}
|
||||
|
||||
static int bitmap_create(struct mddev *mddev, int slot)
|
||||
static int bitmap_create(struct mddev *mddev)
|
||||
{
|
||||
struct bitmap *bitmap = __bitmap_create(mddev, slot);
|
||||
struct bitmap *bitmap = __bitmap_create(mddev, -1);
|
||||
|
||||
if (IS_ERR(bitmap))
|
||||
return PTR_ERR(bitmap);
|
||||
|
@ -2670,7 +2679,7 @@ location_store(struct mddev *mddev, const char *buf, size_t len)
|
|||
}
|
||||
|
||||
mddev->bitmap_info.offset = offset;
|
||||
rv = bitmap_create(mddev, -1);
|
||||
rv = bitmap_create(mddev);
|
||||
if (rv)
|
||||
goto out;
|
||||
|
||||
|
@ -3003,8 +3012,8 @@ static struct bitmap_operations bitmap_ops = {
|
|||
.end_behind_write = bitmap_end_behind_write,
|
||||
.wait_behind_writes = bitmap_wait_behind_writes,
|
||||
|
||||
.startwrite = bitmap_startwrite,
|
||||
.endwrite = bitmap_endwrite,
|
||||
.start_write = bitmap_start_write,
|
||||
.end_write = bitmap_end_write,
|
||||
.start_sync = bitmap_start_sync,
|
||||
.end_sync = bitmap_end_sync,
|
||||
.cond_end_sync = bitmap_cond_end_sync,
|
||||
|
|
|
@ -9,15 +9,6 @@
|
|||
|
||||
#define BITMAP_MAGIC 0x6d746962
|
||||
|
||||
typedef __u16 bitmap_counter_t;
|
||||
#define COUNTER_BITS 16
|
||||
#define COUNTER_BIT_SHIFT 4
|
||||
#define COUNTER_BYTE_SHIFT (COUNTER_BIT_SHIFT - 3)
|
||||
|
||||
#define NEEDED_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 1)))
|
||||
#define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2)))
|
||||
#define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1)
|
||||
|
||||
/* use these for bitmap->flags and bitmap->sb->state bit-fields */
|
||||
enum bitmap_state {
|
||||
BITMAP_STALE = 1, /* the bitmap file is out of date or had -EIO */
|
||||
|
@ -72,7 +63,7 @@ struct md_bitmap_stats {
|
|||
|
||||
struct bitmap_operations {
|
||||
bool (*enabled)(struct mddev *mddev);
|
||||
int (*create)(struct mddev *mddev, int slot);
|
||||
int (*create)(struct mddev *mddev);
|
||||
int (*resize)(struct mddev *mddev, sector_t blocks, int chunksize,
|
||||
bool init);
|
||||
|
||||
|
@ -89,10 +80,10 @@ struct bitmap_operations {
|
|||
void (*end_behind_write)(struct mddev *mddev);
|
||||
void (*wait_behind_writes)(struct mddev *mddev);
|
||||
|
||||
int (*startwrite)(struct mddev *mddev, sector_t offset,
|
||||
void (*start_write)(struct mddev *mddev, sector_t offset,
|
||||
unsigned long sectors);
|
||||
void (*end_write)(struct mddev *mddev, sector_t offset,
|
||||
unsigned long sectors);
|
||||
void (*endwrite)(struct mddev *mddev, sector_t offset,
|
||||
unsigned long sectors);
|
||||
bool (*start_sync)(struct mddev *mddev, sector_t offset,
|
||||
sector_t *blocks, bool degraded);
|
||||
void (*end_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks);
|
||||
|
|
|
@ -6225,7 +6225,7 @@ int md_run(struct mddev *mddev)
|
|||
}
|
||||
if (err == 0 && pers->sync_request &&
|
||||
(mddev->bitmap_info.file || mddev->bitmap_info.offset)) {
|
||||
err = mddev->bitmap_ops->create(mddev, -1);
|
||||
err = mddev->bitmap_ops->create(mddev);
|
||||
if (err)
|
||||
pr_warn("%s: failed to create bitmap (%d)\n",
|
||||
mdname(mddev), err);
|
||||
|
@ -7285,7 +7285,7 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
|
|||
err = 0;
|
||||
if (mddev->pers) {
|
||||
if (fd >= 0) {
|
||||
err = mddev->bitmap_ops->create(mddev, -1);
|
||||
err = mddev->bitmap_ops->create(mddev);
|
||||
if (!err)
|
||||
err = mddev->bitmap_ops->load(mddev);
|
||||
|
||||
|
@ -7601,7 +7601,7 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info)
|
|||
mddev->bitmap_info.default_offset;
|
||||
mddev->bitmap_info.space =
|
||||
mddev->bitmap_info.default_space;
|
||||
rv = mddev->bitmap_ops->create(mddev, -1);
|
||||
rv = mddev->bitmap_ops->create(mddev);
|
||||
if (!rv)
|
||||
rv = mddev->bitmap_ops->load(mddev);
|
||||
|
||||
|
@ -8799,14 +8799,14 @@ static void md_bitmap_start(struct mddev *mddev,
|
|||
mddev->pers->bitmap_sector(mddev, &md_io_clone->offset,
|
||||
&md_io_clone->sectors);
|
||||
|
||||
mddev->bitmap_ops->startwrite(mddev, md_io_clone->offset,
|
||||
md_io_clone->sectors);
|
||||
mddev->bitmap_ops->start_write(mddev, md_io_clone->offset,
|
||||
md_io_clone->sectors);
|
||||
}
|
||||
|
||||
static void md_bitmap_end(struct mddev *mddev, struct md_io_clone *md_io_clone)
|
||||
{
|
||||
mddev->bitmap_ops->endwrite(mddev, md_io_clone->offset,
|
||||
md_io_clone->sectors);
|
||||
mddev->bitmap_ops->end_write(mddev, md_io_clone->offset,
|
||||
md_io_clone->sectors);
|
||||
}
|
||||
|
||||
static void md_end_clone_io(struct bio *bio)
|
||||
|
|
|
@@ -293,3 +293,13 @@ static inline bool raid1_should_read_first(struct mddev *mddev,

 	return false;
 }
+
+/*
+ * bio with REQ_RAHEAD or REQ_NOWAIT can fail at anytime, before such IO is
+ * submitted to the underlying disks, hence don't record badblocks or retry
+ * in this case.
+ */
+static inline bool raid1_should_handle_error(struct bio *bio)
+{
+	return !(bio->bi_opf & (REQ_RAHEAD | REQ_NOWAIT));
+}
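The helper encodes the rule from its comment: REQ_RAHEAD and REQ_NOWAIT bios may fail before ever reaching the underlying disks, so such failures should not be recorded as badblocks or retried. A hedged userspace illustration of the REQ_NOWAIT side, where an RWF_NOWAIT read can return EAGAIN at any time and the caller simply retries without the flag (the device path is only an example):

/*
 * Hedged userspace illustration: preadv2(..., RWF_NOWAIT) fails with EAGAIN
 * whenever the read cannot proceed without blocking. Nothing is wrong with
 * the disks; the md/raid1 fix above simply stops treating that as an error.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd = open("/dev/md0", O_RDONLY);	/* example device */

	if (fd < 0)
		return 1;

	ssize_t ret = preadv2(fd, &iov, 1, 0, RWF_NOWAIT);
	if (ret < 0 && errno == EAGAIN) {
		/* would have blocked: retry as an ordinary blocking read */
		ret = preadv2(fd, &iov, 1, 0, 0);
	}
	printf("read %zd bytes\n", ret);
	close(fd);
	return 0;
}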
@ -373,14 +373,16 @@ static void raid1_end_read_request(struct bio *bio)
|
|||
*/
|
||||
update_head_pos(r1_bio->read_disk, r1_bio);
|
||||
|
||||
if (uptodate)
|
||||
if (uptodate) {
|
||||
set_bit(R1BIO_Uptodate, &r1_bio->state);
|
||||
else if (test_bit(FailFast, &rdev->flags) &&
|
||||
test_bit(R1BIO_FailFast, &r1_bio->state))
|
||||
} else if (test_bit(FailFast, &rdev->flags) &&
|
||||
test_bit(R1BIO_FailFast, &r1_bio->state)) {
|
||||
/* This was a fail-fast read so we definitely
|
||||
* want to retry */
|
||||
;
|
||||
else {
|
||||
} else if (!raid1_should_handle_error(bio)) {
|
||||
uptodate = 1;
|
||||
} else {
|
||||
/* If all other devices have failed, we want to return
|
||||
* the error upwards rather than fail the last device.
|
||||
* Here we redefine "uptodate" to mean "Don't want to retry"
|
||||
|
@ -451,16 +453,15 @@ static void raid1_end_write_request(struct bio *bio)
|
|||
struct bio *to_put = NULL;
|
||||
int mirror = find_bio_disk(r1_bio, bio);
|
||||
struct md_rdev *rdev = conf->mirrors[mirror].rdev;
|
||||
bool discard_error;
|
||||
sector_t lo = r1_bio->sector;
|
||||
sector_t hi = r1_bio->sector + r1_bio->sectors;
|
||||
|
||||
discard_error = bio->bi_status && bio_op(bio) == REQ_OP_DISCARD;
|
||||
bool ignore_error = !raid1_should_handle_error(bio) ||
|
||||
(bio->bi_status && bio_op(bio) == REQ_OP_DISCARD);
|
||||
|
||||
/*
|
||||
* 'one mirror IO has finished' event handler:
|
||||
*/
|
||||
if (bio->bi_status && !discard_error) {
|
||||
if (bio->bi_status && !ignore_error) {
|
||||
set_bit(WriteErrorSeen, &rdev->flags);
|
||||
if (!test_and_set_bit(WantReplacement, &rdev->flags))
|
||||
set_bit(MD_RECOVERY_NEEDED, &
|
||||
|
@ -511,7 +512,7 @@ static void raid1_end_write_request(struct bio *bio)
|
|||
|
||||
/* Maybe we can clear some bad blocks. */
|
||||
if (rdev_has_badblock(rdev, r1_bio->sector, r1_bio->sectors) &&
|
||||
!discard_error) {
|
||||
!ignore_error) {
|
||||
r1_bio->bios[mirror] = IO_MADE_GOOD;
|
||||
set_bit(R1BIO_MadeGood, &r1_bio->state);
|
||||
}
|
||||
|
|
|
@ -399,6 +399,8 @@ static void raid10_end_read_request(struct bio *bio)
|
|||
* wait for the 'master' bio.
|
||||
*/
|
||||
set_bit(R10BIO_Uptodate, &r10_bio->state);
|
||||
} else if (!raid1_should_handle_error(bio)) {
|
||||
uptodate = 1;
|
||||
} else {
|
||||
/* If all other devices that store this block have
|
||||
* failed, we want to return the error upwards rather
|
||||
|
@ -456,9 +458,8 @@ static void raid10_end_write_request(struct bio *bio)
|
|||
int slot, repl;
|
||||
struct md_rdev *rdev = NULL;
|
||||
struct bio *to_put = NULL;
|
||||
bool discard_error;
|
||||
|
||||
discard_error = bio->bi_status && bio_op(bio) == REQ_OP_DISCARD;
|
||||
bool ignore_error = !raid1_should_handle_error(bio) ||
|
||||
(bio->bi_status && bio_op(bio) == REQ_OP_DISCARD);
|
||||
|
||||
dev = find_bio_disk(conf, r10_bio, bio, &slot, &repl);
|
||||
|
||||
|
@ -472,7 +473,7 @@ static void raid10_end_write_request(struct bio *bio)
|
|||
/*
|
||||
* this branch is our 'one mirror IO has finished' event handler:
|
||||
*/
|
||||
if (bio->bi_status && !discard_error) {
|
||||
if (bio->bi_status && !ignore_error) {
|
||||
if (repl)
|
||||
/* Never record new bad blocks to replacement,
|
||||
* just fail it.
|
||||
|
@ -527,7 +528,7 @@ static void raid10_end_write_request(struct bio *bio)
|
|||
/* Maybe we can clear some bad blocks. */
|
||||
if (rdev_has_badblock(rdev, r10_bio->devs[slot].addr,
|
||||
r10_bio->sectors) &&
|
||||
!discard_error) {
|
||||
!ignore_error) {
|
||||
bio_put(bio);
|
||||
if (repl)
|
||||
r10_bio->devs[slot].repl_bio = IO_MADE_GOOD;
|
||||
|
|
|
@ -471,7 +471,7 @@ EXPORT_SYMBOL_GPL(nvme_auth_generate_key);
|
|||
* @c1: Value of challenge C1
|
||||
* @c2: Value of challenge C2
|
||||
* @hash_len: Hash length of the hash algorithm
|
||||
* @ret_psk: Pointer too the resulting generated PSK
|
||||
* @ret_psk: Pointer to the resulting generated PSK
|
||||
* @ret_len: length of @ret_psk
|
||||
*
|
||||
* Generate a PSK for TLS as specified in NVMe base specification, section
|
||||
|
@ -759,8 +759,8 @@ int nvme_auth_derive_tls_psk(int hmac_id, u8 *psk, size_t psk_len,
|
|||
goto out_free_prk;
|
||||
|
||||
/*
|
||||
* 2 addtional bytes for the length field from HDKF-Expand-Label,
|
||||
* 2 addtional bytes for the HMAC ID, and one byte for the space
|
||||
* 2 additional bytes for the length field from HDKF-Expand-Label,
|
||||
* 2 additional bytes for the HMAC ID, and one byte for the space
|
||||
* separator.
|
||||
*/
|
||||
info_len = strlen(psk_digest) + strlen(psk_prefix) + 5;
|
||||
|
|
|
@ -106,7 +106,7 @@ config NVME_TCP_TLS
|
|||
help
|
||||
Enables TLS encryption for NVMe TCP using the netlink handshake API.
|
||||
|
||||
The TLS handshake daemon is availble at
|
||||
The TLS handshake daemon is available at
|
||||
https://github.com/oracle/ktls-utils.
|
||||
|
||||
If unsure, say N.
|
||||
|
|
|
@ -145,7 +145,7 @@ static const char * const nvme_statuses[] = {
|
|||
[NVME_SC_BAD_ATTRIBUTES] = "Conflicting Attributes",
|
||||
[NVME_SC_INVALID_PI] = "Invalid Protection Information",
|
||||
[NVME_SC_READ_ONLY] = "Attempted Write to Read Only Range",
|
||||
[NVME_SC_ONCS_NOT_SUPPORTED] = "ONCS Not Supported",
|
||||
[NVME_SC_CMD_SIZE_LIM_EXCEEDED ] = "Command Size Limits Exceeded",
|
||||
[NVME_SC_ZONE_BOUNDARY_ERROR] = "Zoned Boundary Error",
|
||||
[NVME_SC_ZONE_FULL] = "Zone Is Full",
|
||||
[NVME_SC_ZONE_READ_ONLY] = "Zone Is Read Only",
|
||||
|
|
|
@ -290,7 +290,6 @@ static blk_status_t nvme_error_status(u16 status)
|
|||
case NVME_SC_NS_NOT_READY:
|
||||
return BLK_STS_TARGET;
|
||||
case NVME_SC_BAD_ATTRIBUTES:
|
||||
case NVME_SC_ONCS_NOT_SUPPORTED:
|
||||
case NVME_SC_INVALID_OPCODE:
|
||||
case NVME_SC_INVALID_FIELD:
|
||||
case NVME_SC_INVALID_NS:
|
||||
|
@ -1027,7 +1026,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
|
|||
|
||||
if (ns->head->ms) {
|
||||
/*
|
||||
* If formated with metadata, the block layer always provides a
|
||||
* If formatted with metadata, the block layer always provides a
|
||||
* metadata buffer if CONFIG_BLK_DEV_INTEGRITY is enabled. Else
|
||||
* we enable the PRACT bit for protection information or set the
|
||||
* namespace capacity to zero to prevent any I/O.
|
||||
|
|
|
@ -582,7 +582,7 @@ EXPORT_SYMBOL_GPL(nvmf_connect_io_queue);
|
|||
* Do not retry when:
|
||||
*
|
||||
* - the DNR bit is set and the specification states no further connect
|
||||
* attempts with the same set of paramenters should be attempted.
|
||||
* attempts with the same set of parameters should be attempted.
|
||||
*
|
||||
* - when the authentication attempt fails, because the key was invalid.
|
||||
* This error code is set on the host side.
|
||||
|
|
|
@ -80,7 +80,7 @@ enum {
|
|||
* @transport: Holds the fabric transport "technology name" (for a lack of
|
||||
* better description) that will be used by an NVMe controller
|
||||
* being added.
|
||||
* @subsysnqn: Hold the fully qualified NQN subystem name (format defined
|
||||
* @subsysnqn: Hold the fully qualified NQN subsystem name (format defined
|
||||
* in the NVMe specification, "NVMe Qualified Names").
|
||||
* @traddr: The transport-specific TRADDR field for a port on the
|
||||
* subsystem which is adding a controller.
|
||||
|
@ -156,7 +156,7 @@ struct nvmf_ctrl_options {
|
|||
* @create_ctrl(): function pointer that points to a non-NVMe
|
||||
* implementation-specific fabric technology
|
||||
* that would go into starting up that fabric
|
||||
* for the purpose of conneciton to an NVMe controller
|
||||
* for the purpose of connection to an NVMe controller
|
||||
* using that fabric technology.
|
||||
*
|
||||
* Notes:
|
||||
|
@ -165,7 +165,7 @@ struct nvmf_ctrl_options {
|
|||
* 2. create_ctrl() must be defined (even if it does nothing)
|
||||
* 3. struct nvmf_transport_ops must be statically allocated in the
|
||||
* modules .bss section so that a pure module_get on @module
|
||||
* prevents the memory from beeing freed.
|
||||
* prevents the memory from being freed.
|
||||
*/
|
||||
struct nvmf_transport_ops {
|
||||
struct list_head entry;
|
||||
|
|
|
@ -1955,7 +1955,7 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
|
|||
}
|
||||
|
||||
/*
|
||||
* For the linux implementation, if we have an unsuccesful
|
||||
* For the linux implementation, if we have an unsucceesful
|
||||
* status, they blk-mq layer can typically be called with the
|
||||
* non-zero status and the content of the cqe isn't important.
|
||||
*/
|
||||
|
@ -2479,7 +2479,7 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
|
|||
* writing the registers for shutdown and polling (call
|
||||
* nvme_disable_ctrl()). Given a bunch of i/o was potentially
|
||||
* just aborted and we will wait on those contexts, and given
|
||||
* there was no indication of how live the controlelr is on the
|
||||
* there was no indication of how live the controller is on the
|
||||
* link, don't send more io to create more contexts for the
|
||||
* shutdown. Let the controller fail via keepalive failure if
|
||||
* its still present.
|
||||
|
|
|
@@ -493,13 +493,15 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	d.timeout_ms = READ_ONCE(cmd->timeout_ms);

 	if (d.data_len && (ioucmd->flags & IORING_URING_CMD_FIXED)) {
-		/* fixedbufs is only for non-vectored io */
-		if (vec)
-			return -EINVAL;
+		int ddir = nvme_is_write(&c) ? WRITE : READ;

-		ret = io_uring_cmd_import_fixed(d.addr, d.data_len,
-			nvme_is_write(&c) ? WRITE : READ, &iter, ioucmd,
-			issue_flags);
+		if (vec)
+			ret = io_uring_cmd_import_fixed_vec(ioucmd,
+					u64_to_user_ptr(d.addr), d.data_len,
+					ddir, &iter, issue_flags);
+		else
+			ret = io_uring_cmd_import_fixed(d.addr, d.data_len,
+					ddir, &iter, ioucmd, issue_flags);
 		if (ret < 0)
 			return ret;

@@ -521,7 +523,7 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	if (d.data_len) {
 		ret = nvme_map_user_request(req, d.addr, d.data_len,
 			nvme_to_user_ptr(d.metadata), d.metadata_len,
-			map_iter, vec);
+			map_iter, vec ? NVME_IOCTL_VEC : 0);
 		if (ret)
 			goto out_free_req;
 	}
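On the userspace side, the change above allows combining IORING_URING_CMD_FIXED (a registered buffer) with the vectored NVME_URING_CMD_IO_VEC passthrough command. The sketch below shows the shape of such a submission; it assumes an SQE128 ring and that data_len carries the iovec count for the vectored variant, so treat the field semantics as assumptions to verify against the uapi headers rather than as a definitive recipe:

/*
 * Hedged sketch: NVMe read via io_uring passthrough using a registered
 * (fixed) buffer together with the vectored command variant. Requires a
 * ring set up with IORING_SETUP_SQE128/CQE32 and buffers registered via
 * io_uring_register_buffers(); field semantics marked as assumptions.
 */
#include <liburing.h>
#include <linux/nvme_ioctl.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

static int submit_fixed_vec_read(struct io_uring *ring, int nvme_char_fd,
				 struct iovec *vecs, int nr_vecs, __u32 nsid,
				 __u64 slba, __u32 nlb)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct nvme_uring_cmd *cmd;

	if (!sqe)
		return -1;

	io_uring_prep_rw(IORING_OP_URING_CMD, sqe, nvme_char_fd, NULL, 0, 0);
	sqe->cmd_op = NVME_URING_CMD_IO_VEC;		/* vectored passthrough */
	sqe->uring_cmd_flags = IORING_URING_CMD_FIXED;	/* use a registered buffer */
	sqe->buf_index = 0;		/* slot from io_uring_register_buffers() */

	cmd = (struct nvme_uring_cmd *)sqe->cmd;	/* SQE128 command payload */
	memset(cmd, 0, sizeof(*cmd));
	cmd->opcode = 0x02;				/* NVMe Read */
	cmd->nsid = nsid;
	cmd->addr = (__u64)(uintptr_t)vecs;		/* iovec array */
	cmd->data_len = nr_vecs;	/* assumption: iovec count for the _VEC form */
	cmd->cdw10 = slba & 0xffffffff;			/* starting LBA, low */
	cmd->cdw11 = slba >> 32;			/* starting LBA, high */
	cmd->cdw12 = nlb - 1;				/* 0-based block count */

	return io_uring_submit(ring);
}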
@ -727,7 +729,7 @@ int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
|
|||
|
||||
/*
|
||||
* Handle ioctls that apply to the controller instead of the namespace
|
||||
* seperately and drop the ns SRCU reference early. This avoids a
|
||||
* separately and drop the ns SRCU reference early. This avoids a
|
||||
* deadlock when deleting namespaces using the passthrough interface.
|
||||
*/
|
||||
if (is_ctrl_ioctl(cmd))
|
||||
|
|
|
@ -760,7 +760,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
|
|||
* controller's scan_work context. If a path error occurs here, the IO
|
||||
* will wait until a path becomes available or all paths are torn down,
|
||||
* but that action also occurs within scan_work, so it would deadlock.
|
||||
* Defer the partion scan to a different context that does not block
|
||||
* Defer the partition scan to a different context that does not block
|
||||
* scan_work.
|
||||
*/
|
||||
set_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state);
|
||||
|
|
|
@ -523,7 +523,7 @@ static inline bool nvme_ns_head_multipath(struct nvme_ns_head *head)
|
|||
enum nvme_ns_features {
|
||||
NVME_NS_EXT_LBAS = 1 << 0, /* support extended LBA format */
|
||||
NVME_NS_METADATA_SUPPORTED = 1 << 1, /* support getting generated md */
|
||||
NVME_NS_DEAC = 1 << 2, /* DEAC bit in Write Zeores supported */
|
||||
NVME_NS_DEAC = 1 << 2, /* DEAC bit in Write Zeroes supported */
|
||||
};
|
||||
|
||||
struct nvme_ns {
|
||||
|
|
|
@ -3015,7 +3015,7 @@ static void nvme_reset_work(struct work_struct *work)
|
|||
goto out;
|
||||
|
||||
/*
|
||||
* Freeze and update the number of I/O queues as thos might have
|
||||
* Freeze and update the number of I/O queues as those might have
|
||||
* changed. If there are no I/O queues left after this reset, keep the
|
||||
* controller around but remove all namespaces.
|
||||
*/
|
||||
|
@ -3186,7 +3186,7 @@ static unsigned long check_vendor_combination_bug(struct pci_dev *pdev)
|
|||
/*
|
||||
* Exclude some Kingston NV1 and A2000 devices from
|
||||
* NVME_QUIRK_SIMPLE_SUSPEND. Do a full suspend to save a
|
||||
* lot fo energy with s2idle sleep on some TUXEDO platforms.
|
||||
* lot of energy with s2idle sleep on some TUXEDO platforms.
|
||||
*/
|
||||
if (dmi_match(DMI_BOARD_NAME, "NS5X_NS7XAU") ||
|
||||
dmi_match(DMI_BOARD_NAME, "NS5x_7xAU") ||
|
||||
|
|
|
@ -82,8 +82,6 @@ static int nvme_status_to_pr_err(int status)
|
|||
return PR_STS_SUCCESS;
|
||||
case NVME_SC_RESERVATION_CONFLICT:
|
||||
return PR_STS_RESERVATION_CONFLICT;
|
||||
case NVME_SC_ONCS_NOT_SUPPORTED:
|
||||
return -EOPNOTSUPP;
|
||||
case NVME_SC_BAD_ATTRIBUTES:
|
||||
case NVME_SC_INVALID_OPCODE:
|
||||
case NVME_SC_INVALID_FIELD:
|
||||
|
|
|
@ -221,7 +221,7 @@ static struct nvme_rdma_qe *nvme_rdma_alloc_ring(struct ib_device *ibdev,
|
|||
|
||||
/*
|
||||
* Bind the CQEs (post recv buffers) DMA mapping to the RDMA queue
|
||||
* lifetime. It's safe, since any chage in the underlying RDMA device
|
||||
* lifetime. It's safe, since any change in the underlying RDMA device
|
||||
* will issue error recovery and queue re-creation.
|
||||
*/
|
||||
for (i = 0; i < ib_queue_size; i++) {
|
||||
|
@ -800,7 +800,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
|
|||
|
||||
/*
|
||||
* Bind the async event SQE DMA mapping to the admin queue lifetime.
|
||||
* It's safe, since any chage in the underlying RDMA device will issue
|
||||
* It's safe, since any change in the underlying RDMA device will issue
|
||||
* error recovery and queue re-creation.
|
||||
*/
|
||||
error = nvme_rdma_alloc_qe(ctrl->device->dev, &ctrl->async_event_sqe,
|
||||
|
|
|
@ -452,7 +452,8 @@ nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
|
|||
return NULL;
|
||||
}
|
||||
|
||||
list_del(&req->entry);
|
||||
list_del_init(&req->entry);
|
||||
init_llist_node(&req->lentry);
|
||||
return req;
|
||||
}
|
||||
|
||||
|
@ -565,6 +566,8 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
|
|||
req->queue = queue;
|
||||
nvme_req(rq)->ctrl = &ctrl->ctrl;
|
||||
nvme_req(rq)->cmd = &pdu->cmd;
|
||||
init_llist_node(&req->lentry);
|
||||
INIT_LIST_HEAD(&req->entry);
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
@ -769,6 +772,14 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
|
|||
return -EPROTO;
|
||||
}
|
||||
|
||||
if (llist_on_list(&req->lentry) ||
|
||||
!list_empty(&req->entry)) {
|
||||
dev_err(queue->ctrl->ctrl.device,
|
||||
"req %d unexpected r2t while processing request\n",
|
||||
rq->tag);
|
||||
return -EPROTO;
|
||||
}
|
||||
|
||||
req->pdu_len = 0;
|
||||
req->h2cdata_left = r2t_length;
|
||||
req->h2cdata_offset = r2t_offset;
|
||||
|
@@ -1355,7 +1366,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
 	queue->nr_cqe = 0;
 	consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
 	release_sock(sk);
-	return consumed;
+	return consumed == -EAGAIN ? 0 : consumed;
 }

 static void nvme_tcp_io_work(struct work_struct *w)

@@ -1383,6 +1394,11 @@ static void nvme_tcp_io_work(struct work_struct *w)
 		else if (unlikely(result < 0))
 			return;

+		/* did we get some space after spending time in recv? */
+		if (nvme_tcp_queue_has_pending(queue) &&
+		    sk_stream_is_writeable(queue->sock->sk))
+			pending = true;
+
 		if (!pending || !queue->rd_enabled)
 			return;

@ -2350,7 +2366,7 @@ static int nvme_tcp_setup_ctrl(struct nvme_ctrl *ctrl, bool new)
|
|||
nvme_tcp_teardown_admin_queue(ctrl, false);
|
||||
ret = nvme_tcp_configure_admin_queue(ctrl, false);
|
||||
if (ret)
|
||||
return ret;
|
||||
goto destroy_admin;
|
||||
}
|
||||
|
||||
if (ctrl->icdoff) {
|
||||
|
@ -2594,6 +2610,8 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
|
|||
ctrl->async_req.offset = 0;
|
||||
ctrl->async_req.curr_bio = NULL;
|
||||
ctrl->async_req.data_len = 0;
|
||||
init_llist_node(&ctrl->async_req.lentry);
|
||||
INIT_LIST_HEAD(&ctrl->async_req.entry);
|
||||
|
||||
nvme_tcp_queue_request(&ctrl->async_req, true);
|
||||
}
|
||||
|
|
|
@ -1165,7 +1165,7 @@ static void nvmet_execute_identify(struct nvmet_req *req)
|
|||
* A "minimum viable" abort implementation: the command is mandatory in the
|
||||
* spec, but we are not required to do any useful work. We couldn't really
|
||||
* do a useful abort, so don't bother even with waiting for the command
|
||||
* to be exectuted and return immediately telling the command to abort
|
||||
* to be executed and return immediately telling the command to abort
|
||||
* wasn't found.
|
||||
*/
|
||||
static void nvmet_execute_abort(struct nvmet_req *req)
|
||||
|
|
|
@ -62,14 +62,7 @@ inline u16 errno_to_nvme_status(struct nvmet_req *req, int errno)
|
|||
return NVME_SC_LBA_RANGE | NVME_STATUS_DNR;
|
||||
case -EOPNOTSUPP:
|
||||
req->error_loc = offsetof(struct nvme_common_command, opcode);
|
||||
switch (req->cmd->common.opcode) {
|
||||
case nvme_cmd_dsm:
|
||||
case nvme_cmd_write_zeroes:
|
||||
return NVME_SC_ONCS_NOT_SUPPORTED | NVME_STATUS_DNR;
|
||||
default:
|
||||
return NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR;
|
||||
}
|
||||
break;
|
||||
return NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR;
|
||||
case -ENODATA:
|
||||
req->error_loc = offsetof(struct nvme_rw_command, nsid);
|
||||
return NVME_SC_ACCESS_DENIED;
|
||||
|
@ -651,7 +644,7 @@ void nvmet_ns_disable(struct nvmet_ns *ns)
|
|||
* Now that we removed the namespaces from the lookup list, we
|
||||
* can kill the per_cpu ref and wait for any remaining references
|
||||
* to be dropped, as well as a RCU grace period for anyone only
|
||||
* using the namepace under rcu_read_lock(). Note that we can't
|
||||
* using the namespace under rcu_read_lock(). Note that we can't
|
||||
* use call_rcu here as we need to ensure the namespaces have
|
||||
* been fully destroyed before unloading the module.
|
||||
*/
|
||||
|
|
|
@ -1339,7 +1339,7 @@ nvmet_fc_portentry_rebind_tgt(struct nvmet_fc_tgtport *tgtport)
|
|||
/**
|
||||
* nvmet_fc_register_targetport - transport entry point called by an
|
||||
* LLDD to register the existence of a local
|
||||
* NVME subystem FC port.
|
||||
* NVME subsystem FC port.
|
||||
* @pinfo: pointer to information about the port to be registered
|
||||
* @template: LLDD entrypoints and operational parameters for the port
|
||||
* @dev: physical hardware device node port corresponds to. Will be
|
||||
|
|
|
@ -133,7 +133,7 @@ u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts)
|
|||
* Right now there exists M : 1 mapping between block layer error
|
||||
* to the NVMe status code (see nvme_error_status()). For consistency,
|
||||
* when we reverse map we use most appropriate NVMe Status code from
|
||||
* the group of the NVMe staus codes used in the nvme_error_status().
|
||||
* the group of the NVMe status codes used in the nvme_error_status().
|
||||
*/
|
||||
switch (blk_sts) {
|
||||
case BLK_STS_NOSPC:
|
||||
|
@ -145,15 +145,8 @@ u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts)
|
|||
req->error_loc = offsetof(struct nvme_rw_command, slba);
|
||||
break;
|
||||
case BLK_STS_NOTSUPP:
|
||||
status = NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR;
|
||||
req->error_loc = offsetof(struct nvme_common_command, opcode);
|
||||
switch (req->cmd->common.opcode) {
|
||||
case nvme_cmd_dsm:
|
||||
case nvme_cmd_write_zeroes:
|
||||
status = NVME_SC_ONCS_NOT_SUPPORTED | NVME_STATUS_DNR;
|
||||
break;
|
||||
default:
|
||||
status = NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR;
|
||||
}
|
||||
break;
|
||||
case BLK_STS_MEDIUM:
|
||||
status = NVME_SC_ACCESS_DENIED;
|
||||
|
|
|
@ -99,7 +99,7 @@ static u16 nvmet_passthru_override_id_ctrl(struct nvmet_req *req)
|
|||
|
||||
/*
|
||||
* The passthru NVMe driver may have a limit on the number of segments
|
||||
* which depends on the host's memory fragementation. To solve this,
|
||||
* which depends on the host's memory fragmentation. To solve this,
|
||||
* ensure mdts is limited to the pages equal to the number of segments.
|
||||
*/
|
||||
max_hw_sectors = min_not_zero(pctrl->max_segments << PAGE_SECTORS_SHIFT,
|
||||
|
|
|
@ -2171,7 +2171,7 @@ enum {
|
|||
NVME_SC_BAD_ATTRIBUTES = 0x180,
|
||||
NVME_SC_INVALID_PI = 0x181,
|
||||
NVME_SC_READ_ONLY = 0x182,
|
||||
NVME_SC_ONCS_NOT_SUPPORTED = 0x183,
|
||||
NVME_SC_CMD_SIZE_LIM_EXCEEDED = 0x183,
|
||||
|
||||
/*
|
||||
* I/O Command Set Specific - Fabrics commands:
|
||||
|
|
|
@@ -272,6 +272,15 @@
  */
 #define UBLK_F_QUIESCE		(1ULL << 12)

+/*
+ * If this feature is set, ublk_drv supports each (qid,tag) pair having
+ * its own independent daemon task that is responsible for handling it.
+ * If it is not set, daemons are per-queue instead, so for two pairs
+ * (qid1,tag1) and (qid2,tag2), if qid1 == qid2, then the same task must
+ * be responsible for handling (qid1,tag1) and (qid2,tag2).
+ */
+#define UBLK_F_PER_IO_DAEMON (1ULL << 13)
+
 /* device state */
 #define UBLK_S_DEV_DEAD	0
 #define UBLK_S_DEV_LIVE	1
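A ublk server can gate its threading model on this flag. A minimal sketch, assuming the device info (with the effective feature flags) has already been fetched through the control device; names beyond the UAPI struct are illustrative:

/*
 * Hedged sketch: deciding between per-(qid,tag) and per-queue I/O handler
 * threads based on UBLK_F_PER_IO_DAEMON in the device flags.
 */
#include <linux/ublk_cmd.h>
#include <stdbool.h>
#include <stdio.h>

static bool use_per_io_daemons(const struct ublksrv_ctrl_dev_info *info)
{
	if (info->flags & UBLK_F_PER_IO_DAEMON) {
		printf("driver allows an arbitrary task per (qid, tag)\n");
		return true;
	}
	printf("falling back to one daemon thread per queue\n");
	return false;
}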
@@ -19,6 +19,7 @@ TEST_PROGS += test_generic_08.sh
 TEST_PROGS += test_generic_09.sh
 TEST_PROGS += test_generic_10.sh
 TEST_PROGS += test_generic_11.sh
+TEST_PROGS += test_generic_12.sh

 TEST_PROGS += test_null_01.sh
 TEST_PROGS += test_null_02.sh
@@ -46,9 +46,9 @@ static int ublk_fault_inject_queue_io(struct ublk_queue *q, int tag)
 		.tv_nsec = (long long)q->dev->private_data,
 	};

-	ublk_queue_alloc_sqes(q, &sqe, 1);
+	ublk_io_alloc_sqes(ublk_get_io(q, tag), &sqe, 1);
 	io_uring_prep_timeout(sqe, &ts, 1, 0);
-	sqe->user_data = build_user_data(tag, ublksrv_get_op(iod), 0, 1);
+	sqe->user_data = build_user_data(tag, ublksrv_get_op(iod), 0, q->q_id, 1);

 	ublk_queued_tgt_io(q, tag, 1);

@@ -18,11 +18,11 @@ static int loop_queue_flush_io(struct ublk_queue *q, const struct ublksrv_io_des
 	unsigned ublk_op = ublksrv_get_op(iod);
 	struct io_uring_sqe *sqe[1];

-	ublk_queue_alloc_sqes(q, sqe, 1);
+	ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, 1);
 	io_uring_prep_fsync(sqe[0], 1 /*fds[1]*/, IORING_FSYNC_DATASYNC);
 	io_uring_sqe_set_flags(sqe[0], IOSQE_FIXED_FILE);
 	/* bit63 marks us as tgt io */
-	sqe[0]->user_data = build_user_data(tag, ublk_op, 0, 1);
+	sqe[0]->user_data = build_user_data(tag, ublk_op, 0, q->q_id, 1);
 	return 1;
 }

@@ -36,7 +36,7 @@ static int loop_queue_tgt_rw_io(struct ublk_queue *q, const struct ublksrv_io_de
 	void *addr = (zc | auto_zc) ? NULL : (void *)iod->addr;

 	if (!zc || auto_zc) {
-		ublk_queue_alloc_sqes(q, sqe, 1);
+		ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, 1);
 		if (!sqe[0])
 			return -ENOMEM;

@@ -48,26 +48,26 @@ static int loop_queue_tgt_rw_io(struct ublk_queue *q, const struct ublksrv_io_de
 		sqe[0]->buf_index = tag;
 		io_uring_sqe_set_flags(sqe[0], IOSQE_FIXED_FILE);
 		/* bit63 marks us as tgt io */
-		sqe[0]->user_data = build_user_data(tag, ublk_op, 0, 1);
+		sqe[0]->user_data = build_user_data(tag, ublk_op, 0, q->q_id, 1);
 		return 1;
 	}

-	ublk_queue_alloc_sqes(q, sqe, 3);
+	ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, 3);

-	io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, tag);
+	io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
 	sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK;
 	sqe[0]->user_data = build_user_data(tag,
-					    ublk_cmd_op_nr(sqe[0]->cmd_op), 0, 1);
+					    ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1);

 	io_uring_prep_rw(op, sqe[1], 1 /*fds[1]*/, 0,
 			 iod->nr_sectors << 9,
 			 iod->start_sector << 9);
 	sqe[1]->buf_index = tag;
 	sqe[1]->flags |= IOSQE_FIXED_FILE | IOSQE_IO_HARDLINK;
-	sqe[1]->user_data = build_user_data(tag, ublk_op, 0, 1);
+	sqe[1]->user_data = build_user_data(tag, ublk_op, 0, q->q_id, 1);

-	io_uring_prep_buf_unregister(sqe[2], 0, tag, q->q_id, tag);
-	sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, 1);
+	io_uring_prep_buf_unregister(sqe[2], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
+	sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, q->q_id, 1);

 	return 2;
 }
@ -348,8 +348,8 @@ static void ublk_ctrl_dump(struct ublk_dev *dev)
|
|||
|
||||
for (i = 0; i < info->nr_hw_queues; i++) {
|
||||
ublk_print_cpu_set(&affinity[i], buf, sizeof(buf));
|
||||
printf("\tqueue %u: tid %d affinity(%s)\n",
|
||||
i, dev->q[i].tid, buf);
|
||||
printf("\tqueue %u: affinity(%s)\n",
|
||||
i, buf);
|
||||
}
|
||||
free(affinity);
|
||||
}
|
||||
|
@ -412,16 +412,6 @@ static void ublk_queue_deinit(struct ublk_queue *q)
|
|||
int i;
|
||||
int nr_ios = q->q_depth;
|
||||
|
||||
io_uring_unregister_buffers(&q->ring);
|
||||
|
||||
io_uring_unregister_ring_fd(&q->ring);
|
||||
|
||||
if (q->ring.ring_fd > 0) {
|
||||
io_uring_unregister_files(&q->ring);
|
||||
close(q->ring.ring_fd);
|
||||
q->ring.ring_fd = -1;
|
||||
}
|
||||
|
||||
if (q->io_cmd_buf)
|
||||
munmap(q->io_cmd_buf, ublk_queue_cmd_buf_sz(q));
|
||||
|
||||
|
@ -429,20 +419,30 @@ static void ublk_queue_deinit(struct ublk_queue *q)
|
|||
free(q->ios[i].buf_addr);
|
||||
}
|
||||
|
||||
static void ublk_thread_deinit(struct ublk_thread *t)
|
||||
{
|
||||
io_uring_unregister_buffers(&t->ring);
|
||||
|
||||
io_uring_unregister_ring_fd(&t->ring);
|
||||
|
||||
if (t->ring.ring_fd > 0) {
|
||||
io_uring_unregister_files(&t->ring);
|
||||
close(t->ring.ring_fd);
|
||||
t->ring.ring_fd = -1;
|
||||
}
|
||||
}
|
||||
|
||||
static int ublk_queue_init(struct ublk_queue *q, unsigned extra_flags)
|
||||
{
|
||||
struct ublk_dev *dev = q->dev;
|
||||
int depth = dev->dev_info.queue_depth;
|
||||
int i, ret = -1;
|
||||
int i;
|
||||
int cmd_buf_size, io_buf_size;
|
||||
unsigned long off;
|
||||
int ring_depth = dev->tgt.sq_depth, cq_depth = dev->tgt.cq_depth;
|
||||
|
||||
q->tgt_ops = dev->tgt.ops;
|
||||
q->state = 0;
|
||||
q->q_depth = depth;
|
||||
q->cmd_inflight = 0;
|
||||
q->tid = gettid();
|
||||
|
||||
if (dev->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY | UBLK_F_AUTO_BUF_REG)) {
|
||||
q->state |= UBLKSRV_NO_BUF;
|
||||
|
@ -467,6 +467,7 @@ static int ublk_queue_init(struct ublk_queue *q, unsigned extra_flags)
|
|||
for (i = 0; i < q->q_depth; i++) {
|
||||
q->ios[i].buf_addr = NULL;
|
||||
q->ios[i].flags = UBLKSRV_NEED_FETCH_RQ | UBLKSRV_IO_FREE;
|
||||
q->ios[i].tag = i;
|
||||
|
||||
if (q->state & UBLKSRV_NO_BUF)
|
||||
continue;
|
||||
|
@ -479,34 +480,6 @@ static int ublk_queue_init(struct ublk_queue *q, unsigned extra_flags)
|
|||
}
|
||||
}
|
||||
|
||||
ret = ublk_setup_ring(&q->ring, ring_depth, cq_depth,
|
||||
IORING_SETUP_COOP_TASKRUN |
|
||||
IORING_SETUP_SINGLE_ISSUER |
|
||||
IORING_SETUP_DEFER_TASKRUN);
|
||||
if (ret < 0) {
|
||||
ublk_err("ublk dev %d queue %d setup io_uring failed %d\n",
|
||||
q->dev->dev_info.dev_id, q->q_id, ret);
|
||||
goto fail;
|
||||
}
|
||||
|
||||
if (dev->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY | UBLK_F_AUTO_BUF_REG)) {
|
||||
ret = io_uring_register_buffers_sparse(&q->ring, q->q_depth);
|
||||
if (ret) {
|
||||
ublk_err("ublk dev %d queue %d register spare buffers failed %d",
|
||||
dev->dev_info.dev_id, q->q_id, ret);
|
||||
goto fail;
|
||||
}
|
||||
}
|
||||
|
||||
io_uring_register_ring_fd(&q->ring);
|
||||
|
||||
ret = io_uring_register_files(&q->ring, dev->fds, dev->nr_fds);
|
||||
if (ret) {
|
||||
ublk_err("ublk dev %d queue %d register files failed %d\n",
|
||||
q->dev->dev_info.dev_id, q->q_id, ret);
|
||||
goto fail;
|
||||
}
|
||||
|
||||
return 0;
|
||||
fail:
|
||||
ublk_queue_deinit(q);
|
||||
|
@ -515,6 +488,52 @@ static int ublk_queue_init(struct ublk_queue *q, unsigned extra_flags)
|
|||
return -ENOMEM;
|
||||
}
|
||||
|
||||
static int ublk_thread_init(struct ublk_thread *t)
|
||||
{
|
||||
struct ublk_dev *dev = t->dev;
|
||||
int ring_depth = dev->tgt.sq_depth, cq_depth = dev->tgt.cq_depth;
|
||||
int ret;
|
||||
|
||||
ret = ublk_setup_ring(&t->ring, ring_depth, cq_depth,
|
||||
IORING_SETUP_COOP_TASKRUN |
|
||||
IORING_SETUP_SINGLE_ISSUER |
|
||||
IORING_SETUP_DEFER_TASKRUN);
|
||||
if (ret < 0) {
|
||||
ublk_err("ublk dev %d thread %d setup io_uring failed %d\n",
|
||||
dev->dev_info.dev_id, t->idx, ret);
|
||||
goto fail;
|
||||
}
|
||||
|
||||
if (dev->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY | UBLK_F_AUTO_BUF_REG)) {
|
||||
unsigned nr_ios = dev->dev_info.queue_depth * dev->dev_info.nr_hw_queues;
|
||||
unsigned max_nr_ios_per_thread = nr_ios / dev->nthreads;
|
||||
max_nr_ios_per_thread += !!(nr_ios % dev->nthreads);
|
||||
ret = io_uring_register_buffers_sparse(
|
||||
&t->ring, max_nr_ios_per_thread);
|
||||
if (ret) {
|
||||
ublk_err("ublk dev %d thread %d register spare buffers failed %d",
|
||||
dev->dev_info.dev_id, t->idx, ret);
|
||||
goto fail;
|
||||
}
|
||||
}
|
||||
|
||||
io_uring_register_ring_fd(&t->ring);
|
||||
|
||||
ret = io_uring_register_files(&t->ring, dev->fds, dev->nr_fds);
|
||||
if (ret) {
|
||||
ublk_err("ublk dev %d thread %d register files failed %d\n",
|
||||
t->dev->dev_info.dev_id, t->idx, ret);
|
||||
goto fail;
|
||||
}
|
||||
|
||||
return 0;
|
||||
fail:
|
||||
ublk_thread_deinit(t);
|
||||
ublk_err("ublk dev %d thread %d init failed\n",
|
||||
dev->dev_info.dev_id, t->idx);
|
||||
return -ENOMEM;
|
||||
}
|
||||
|
||||
#define WAIT_USEC 100000
|
||||
#define MAX_WAIT_USEC (3 * 1000000)
|
||||
static int ublk_dev_prep(const struct dev_ctx *ctx, struct ublk_dev *dev)
|
||||
|
@ -562,7 +581,7 @@ static void ublk_set_auto_buf_reg(const struct ublk_queue *q,
|
|||
if (q->tgt_ops->buf_index)
|
||||
buf.index = q->tgt_ops->buf_index(q, tag);
|
||||
else
|
||||
buf.index = tag;
|
||||
buf.index = q->ios[tag].buf_index;
|
||||
|
||||
if (q->state & UBLKSRV_AUTO_BUF_REG_FALLBACK)
|
||||
buf.flags = UBLK_AUTO_BUF_REG_FALLBACK;
|
||||
|
@ -570,8 +589,10 @@ static void ublk_set_auto_buf_reg(const struct ublk_queue *q,
|
|||
sqe->addr = ublk_auto_buf_reg_to_sqe_addr(&buf);
|
||||
}
|
||||
|
||||
int ublk_queue_io_cmd(struct ublk_queue *q, struct ublk_io *io, unsigned tag)
|
||||
int ublk_queue_io_cmd(struct ublk_io *io)
|
||||
{
|
||||
struct ublk_thread *t = io->t;
|
||||
struct ublk_queue *q = ublk_io_to_queue(io);
|
||||
struct ublksrv_io_cmd *cmd;
|
||||
struct io_uring_sqe *sqe[1];
|
||||
unsigned int cmd_op = 0;
|
||||
|
@ -596,13 +617,13 @@ int ublk_queue_io_cmd(struct ublk_queue *q, struct ublk_io *io, unsigned tag)
|
|||
else if (io->flags & UBLKSRV_NEED_FETCH_RQ)
|
||||
cmd_op = UBLK_U_IO_FETCH_REQ;
|
||||
|
||||
if (io_uring_sq_space_left(&q->ring) < 1)
|
||||
io_uring_submit(&q->ring);
|
||||
if (io_uring_sq_space_left(&t->ring) < 1)
|
||||
io_uring_submit(&t->ring);
|
||||
|
||||
ublk_queue_alloc_sqes(q, sqe, 1);
|
||||
ublk_io_alloc_sqes(io, sqe, 1);
|
||||
if (!sqe[0]) {
|
||||
ublk_err("%s: run out of sqe %d, tag %d\n",
|
||||
__func__, q->q_id, tag);
|
||||
ublk_err("%s: run out of sqe. thread %u, tag %d\n",
|
||||
__func__, t->idx, io->tag);
|
||||
return -1;
|
||||
}
|
||||
|
||||
|
@ -617,7 +638,7 @@ int ublk_queue_io_cmd(struct ublk_queue *q, struct ublk_io *io, unsigned tag)
|
|||
sqe[0]->opcode = IORING_OP_URING_CMD;
|
||||
sqe[0]->flags = IOSQE_FIXED_FILE;
|
||||
sqe[0]->rw_flags = 0;
|
||||
cmd->tag = tag;
|
||||
cmd->tag = io->tag;
|
||||
cmd->q_id = q->q_id;
|
||||
if (!(q->state & UBLKSRV_NO_BUF))
|
||||
cmd->addr = (__u64) (uintptr_t) io->buf_addr;
|
||||
|
@ -625,37 +646,72 @@ int ublk_queue_io_cmd(struct ublk_queue *q, struct ublk_io *io, unsigned tag)
|
|||
cmd->addr = 0;
|
||||
|
||||
if (q->state & UBLKSRV_AUTO_BUF_REG)
|
||||
ublk_set_auto_buf_reg(q, sqe[0], tag);
|
||||
ublk_set_auto_buf_reg(q, sqe[0], io->tag);
|
||||
|
||||
user_data = build_user_data(tag, _IOC_NR(cmd_op), 0, 0);
|
||||
user_data = build_user_data(io->tag, _IOC_NR(cmd_op), 0, q->q_id, 0);
|
||||
io_uring_sqe_set_data64(sqe[0], user_data);
|
||||
|
||||
io->flags = 0;
|
||||
|
||||
q->cmd_inflight += 1;
|
||||
t->cmd_inflight += 1;
|
||||
|
||||
ublk_dbg(UBLK_DBG_IO_CMD, "%s: (qid %d tag %u cmd_op %u) iof %x stopping %d\n",
|
||||
__func__, q->q_id, tag, cmd_op,
|
||||
io->flags, !!(q->state & UBLKSRV_QUEUE_STOPPING));
|
||||
ublk_dbg(UBLK_DBG_IO_CMD, "%s: (thread %u qid %d tag %u cmd_op %u) iof %x stopping %d\n",
|
||||
__func__, t->idx, q->q_id, io->tag, cmd_op,
|
||||
io->flags, !!(t->state & UBLKSRV_THREAD_STOPPING));
|
||||
return 1;
|
||||
}
|
||||
|
||||
-static void ublk_submit_fetch_commands(struct ublk_queue *q)
+static void ublk_submit_fetch_commands(struct ublk_thread *t)
 {
-	int i = 0;
+	struct ublk_queue *q;
+	struct ublk_io *io;
+	int i = 0, j = 0;
 
-	for (i = 0; i < q->q_depth; i++)
-		ublk_queue_io_cmd(q, &q->ios[i], i);
+	if (t->dev->per_io_tasks) {
+		/*
+		 * Lexicographically order all the (qid,tag) pairs, with
+		 * qid taking priority (so (1,0) > (0,1)). Then make
+		 * this thread the daemon for every Nth entry in this
+		 * list (N is the number of threads), starting at this
+		 * thread's index. This ensures that each queue is
+		 * handled by as many ublk server threads as possible,
+		 * so that load that is concentrated on one or a few
+		 * queues can make use of all ublk server threads.
+		 */
+		const struct ublksrv_ctrl_dev_info *dinfo = &t->dev->dev_info;
+		int nr_ios = dinfo->nr_hw_queues * dinfo->queue_depth;
+		for (i = t->idx; i < nr_ios; i += t->dev->nthreads) {
+			int q_id = i / dinfo->queue_depth;
+			int tag = i % dinfo->queue_depth;
+			q = &t->dev->q[q_id];
+			io = &q->ios[tag];
+			io->t = t;
+			io->buf_index = j++;
+			ublk_queue_io_cmd(io);
+		}
+	} else {
+		/*
+		 * Service exclusively the queue whose q_id matches our
+		 * thread index.
+		 */
+		struct ublk_queue *q = &t->dev->q[t->idx];
+		for (i = 0; i < q->q_depth; i++) {
+			io = &q->ios[i];
+			io->t = t;
+			io->buf_index = i;
+			ublk_queue_io_cmd(io);
+		}
+	}
 }
 
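As a concrete example of the assignment above: with nr_hw_queues=2, queue_depth=4 and nthreads=3, the flattened (qid,tag) list is (0,0)..(0,3),(1,0)..(1,3), so thread 0 serves (0,0),(0,3),(1,2), thread 1 serves (0,1),(1,0),(1,3), and thread 2 serves (0,2),(1,1); every queue ends up spread across all three threads.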
static int ublk_queue_is_idle(struct ublk_queue *q)
|
||||
static int ublk_thread_is_idle(struct ublk_thread *t)
|
||||
{
|
||||
return !io_uring_sq_ready(&q->ring) && !q->io_inflight;
|
||||
return !io_uring_sq_ready(&t->ring) && !t->io_inflight;
|
||||
}
|
||||
|
||||
static int ublk_queue_is_done(struct ublk_queue *q)
|
||||
static int ublk_thread_is_done(struct ublk_thread *t)
|
||||
{
|
||||
return (q->state & UBLKSRV_QUEUE_STOPPING) && ublk_queue_is_idle(q);
|
||||
return (t->state & UBLKSRV_THREAD_STOPPING) && ublk_thread_is_idle(t);
|
||||
}
|
||||
|
||||
static inline void ublksrv_handle_tgt_cqe(struct ublk_queue *q,
|
||||
|
@ -673,14 +729,16 @@ static inline void ublksrv_handle_tgt_cqe(struct ublk_queue *q,
|
|||
q->tgt_ops->tgt_io_done(q, tag, cqe);
|
||||
}
|
||||
|
||||
static void ublk_handle_cqe(struct io_uring *r,
|
||||
static void ublk_handle_cqe(struct ublk_thread *t,
|
||||
struct io_uring_cqe *cqe, void *data)
|
||||
{
|
||||
struct ublk_queue *q = container_of(r, struct ublk_queue, ring);
|
||||
struct ublk_dev *dev = t->dev;
|
||||
unsigned q_id = user_data_to_q_id(cqe->user_data);
|
||||
struct ublk_queue *q = &dev->q[q_id];
|
||||
unsigned tag = user_data_to_tag(cqe->user_data);
|
||||
unsigned cmd_op = user_data_to_op(cqe->user_data);
|
||||
int fetch = (cqe->res != UBLK_IO_RES_ABORT) &&
|
||||
!(q->state & UBLKSRV_QUEUE_STOPPING);
|
||||
!(t->state & UBLKSRV_THREAD_STOPPING);
|
||||
struct ublk_io *io;
|
||||
|
||||
if (cqe->res < 0 && cqe->res != -ENODEV)
|
||||
|
@ -691,7 +749,7 @@ static void ublk_handle_cqe(struct io_uring *r,
|
|||
__func__, cqe->res, q->q_id, tag, cmd_op,
|
||||
is_target_io(cqe->user_data),
|
||||
user_data_to_tgt_data(cqe->user_data),
|
||||
(q->state & UBLKSRV_QUEUE_STOPPING));
|
||||
(t->state & UBLKSRV_THREAD_STOPPING));
|
||||
|
||||
/* Don't retrieve io in case of target io */
|
||||
if (is_target_io(cqe->user_data)) {
|
||||
|
@ -700,10 +758,10 @@ static void ublk_handle_cqe(struct io_uring *r,
|
|||
}
|
||||
|
||||
io = &q->ios[tag];
|
||||
q->cmd_inflight--;
|
||||
t->cmd_inflight--;
|
||||
|
||||
if (!fetch) {
|
||||
q->state |= UBLKSRV_QUEUE_STOPPING;
|
||||
t->state |= UBLKSRV_THREAD_STOPPING;
|
||||
io->flags &= ~UBLKSRV_NEED_FETCH_RQ;
|
||||
}
|
||||
|
||||
|
@ -713,7 +771,7 @@ static void ublk_handle_cqe(struct io_uring *r,
|
|||
q->tgt_ops->queue_io(q, tag);
|
||||
} else if (cqe->res == UBLK_IO_RES_NEED_GET_DATA) {
|
||||
io->flags |= UBLKSRV_NEED_GET_DATA | UBLKSRV_IO_FREE;
|
||||
ublk_queue_io_cmd(q, io, tag);
|
||||
ublk_queue_io_cmd(io);
|
||||
} else {
|
||||
/*
|
||||
* COMMIT_REQ will be completed immediately since no fetching
|
||||
|
@ -727,92 +785,93 @@ static void ublk_handle_cqe(struct io_uring *r,
|
|||
}
|
||||
}
|
||||
|
||||
static int ublk_reap_events_uring(struct io_uring *r)
|
||||
static int ublk_reap_events_uring(struct ublk_thread *t)
|
||||
{
|
||||
struct io_uring_cqe *cqe;
|
||||
unsigned head;
|
||||
int count = 0;
|
||||
|
||||
io_uring_for_each_cqe(r, head, cqe) {
|
||||
ublk_handle_cqe(r, cqe, NULL);
|
||||
io_uring_for_each_cqe(&t->ring, head, cqe) {
|
||||
ublk_handle_cqe(t, cqe, NULL);
|
||||
count += 1;
|
||||
}
|
||||
io_uring_cq_advance(r, count);
|
||||
io_uring_cq_advance(&t->ring, count);
|
||||
|
||||
return count;
|
||||
}
|
||||
|
||||
static int ublk_process_io(struct ublk_queue *q)
|
||||
static int ublk_process_io(struct ublk_thread *t)
|
||||
{
|
||||
int ret, reapped;
|
||||
|
||||
ublk_dbg(UBLK_DBG_QUEUE, "dev%d-q%d: to_submit %d inflight cmd %u stopping %d\n",
|
||||
q->dev->dev_info.dev_id,
|
||||
q->q_id, io_uring_sq_ready(&q->ring),
|
||||
q->cmd_inflight,
|
||||
(q->state & UBLKSRV_QUEUE_STOPPING));
|
||||
ublk_dbg(UBLK_DBG_THREAD, "dev%d-t%u: to_submit %d inflight cmd %u stopping %d\n",
|
||||
t->dev->dev_info.dev_id,
|
||||
t->idx, io_uring_sq_ready(&t->ring),
|
||||
t->cmd_inflight,
|
||||
(t->state & UBLKSRV_THREAD_STOPPING));
|
||||
|
||||
if (ublk_queue_is_done(q))
|
||||
if (ublk_thread_is_done(t))
|
||||
return -ENODEV;
|
||||
|
||||
ret = io_uring_submit_and_wait(&q->ring, 1);
|
||||
reapped = ublk_reap_events_uring(&q->ring);
|
||||
ret = io_uring_submit_and_wait(&t->ring, 1);
|
||||
reapped = ublk_reap_events_uring(t);
|
||||
|
||||
ublk_dbg(UBLK_DBG_QUEUE, "submit result %d, reapped %d stop %d idle %d\n",
|
||||
ret, reapped, (q->state & UBLKSRV_QUEUE_STOPPING),
|
||||
(q->state & UBLKSRV_QUEUE_IDLE));
|
||||
ublk_dbg(UBLK_DBG_THREAD, "submit result %d, reapped %d stop %d idle %d\n",
|
||||
ret, reapped, (t->state & UBLKSRV_THREAD_STOPPING),
|
||||
(t->state & UBLKSRV_THREAD_IDLE));
|
||||
|
||||
return reapped;
|
||||
}
|
||||
|
||||
static void ublk_queue_set_sched_affinity(const struct ublk_queue *q,
|
||||
static void ublk_thread_set_sched_affinity(const struct ublk_thread *t,
|
||||
cpu_set_t *cpuset)
|
||||
{
|
||||
if (sched_setaffinity(0, sizeof(*cpuset), cpuset) < 0)
|
||||
ublk_err("ublk dev %u queue %u set affinity failed",
|
||||
q->dev->dev_info.dev_id, q->q_id);
|
||||
ublk_err("ublk dev %u thread %u set affinity failed",
|
||||
t->dev->dev_info.dev_id, t->idx);
|
||||
}
|
||||
|
||||
struct ublk_queue_info {
|
||||
struct ublk_queue *q;
|
||||
sem_t *queue_sem;
|
||||
struct ublk_thread_info {
|
||||
struct ublk_dev *dev;
|
||||
unsigned idx;
|
||||
sem_t *ready;
|
||||
cpu_set_t *affinity;
|
||||
unsigned char auto_zc_fallback;
|
||||
};
|
||||
|
||||
static void *ublk_io_handler_fn(void *data)
|
||||
{
|
||||
struct ublk_queue_info *info = data;
|
||||
struct ublk_queue *q = info->q;
|
||||
int dev_id = q->dev->dev_info.dev_id;
|
||||
unsigned extra_flags = 0;
|
||||
struct ublk_thread_info *info = data;
|
||||
struct ublk_thread *t = &info->dev->threads[info->idx];
|
||||
int dev_id = info->dev->dev_info.dev_id;
|
||||
int ret;
|
||||
|
||||
if (info->auto_zc_fallback)
|
||||
extra_flags = UBLKSRV_AUTO_BUF_REG_FALLBACK;
|
||||
t->dev = info->dev;
|
||||
t->idx = info->idx;
|
||||
|
||||
ret = ublk_queue_init(q, extra_flags);
|
||||
ret = ublk_thread_init(t);
|
||||
if (ret) {
|
||||
ublk_err("ublk dev %d queue %d init queue failed\n",
|
||||
dev_id, q->q_id);
|
||||
ublk_err("ublk dev %d thread %u init failed\n",
|
||||
dev_id, t->idx);
|
||||
return NULL;
|
||||
}
|
||||
/* IO perf is sensitive with queue pthread affinity on NUMA machine*/
|
||||
ublk_queue_set_sched_affinity(q, info->affinity);
|
||||
sem_post(info->queue_sem);
|
||||
if (info->affinity)
|
||||
ublk_thread_set_sched_affinity(t, info->affinity);
|
||||
sem_post(info->ready);
|
||||
|
||||
ublk_dbg(UBLK_DBG_QUEUE, "tid %d: ublk dev %d queue %d started\n",
|
||||
q->tid, dev_id, q->q_id);
|
||||
ublk_dbg(UBLK_DBG_THREAD, "tid %d: ublk dev %d thread %u started\n",
|
||||
gettid(), dev_id, t->idx);
|
||||
|
||||
/* submit all io commands to ublk driver */
|
||||
ublk_submit_fetch_commands(q);
|
||||
ublk_submit_fetch_commands(t);
|
||||
do {
|
||||
if (ublk_process_io(q) < 0)
|
||||
if (ublk_process_io(t) < 0)
|
||||
break;
|
||||
} while (1);
|
||||
|
||||
ublk_dbg(UBLK_DBG_QUEUE, "ublk dev %d queue %d exited\n", dev_id, q->q_id);
|
||||
ublk_queue_deinit(q);
|
||||
ublk_dbg(UBLK_DBG_THREAD, "tid %d: ublk dev %d thread %d exiting\n",
|
||||
gettid(), dev_id, t->idx);
|
||||
ublk_thread_deinit(t);
|
||||
return NULL;
|
||||
}
|
||||
|
||||
|
@ -855,20 +914,20 @@ static int ublk_send_dev_event(const struct dev_ctx *ctx, struct ublk_dev *dev,
|
|||
static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
|
||||
{
|
||||
const struct ublksrv_ctrl_dev_info *dinfo = &dev->dev_info;
|
||||
struct ublk_queue_info *qinfo;
|
||||
struct ublk_thread_info *tinfo;
|
||||
unsigned extra_flags = 0;
|
||||
cpu_set_t *affinity_buf;
|
||||
void *thread_ret;
|
||||
sem_t queue_sem;
|
||||
sem_t ready;
|
||||
int ret, i;
|
||||
|
||||
ublk_dbg(UBLK_DBG_DEV, "%s enter\n", __func__);
|
||||
|
||||
qinfo = (struct ublk_queue_info *)calloc(sizeof(struct ublk_queue_info),
|
||||
dinfo->nr_hw_queues);
|
||||
if (!qinfo)
|
||||
tinfo = calloc(sizeof(struct ublk_thread_info), dev->nthreads);
|
||||
if (!tinfo)
|
||||
return -ENOMEM;
|
||||
|
||||
sem_init(&queue_sem, 0, 0);
|
||||
sem_init(&ready, 0, 0);
|
||||
ret = ublk_dev_prep(ctx, dev);
|
||||
if (ret)
|
||||
return ret;
|
||||
|
@ -877,22 +936,44 @@ static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
|
|||
if (ret)
|
||||
return ret;
|
||||
|
||||
if (ctx->auto_zc_fallback)
|
||||
extra_flags = UBLKSRV_AUTO_BUF_REG_FALLBACK;
|
||||
|
||||
for (i = 0; i < dinfo->nr_hw_queues; i++) {
|
||||
dev->q[i].dev = dev;
|
||||
dev->q[i].q_id = i;
|
||||
|
||||
qinfo[i].q = &dev->q[i];
|
||||
qinfo[i].queue_sem = &queue_sem;
|
||||
qinfo[i].affinity = &affinity_buf[i];
|
||||
qinfo[i].auto_zc_fallback = ctx->auto_zc_fallback;
|
||||
pthread_create(&dev->q[i].thread, NULL,
|
||||
ublk_io_handler_fn,
|
||||
&qinfo[i]);
|
||||
ret = ublk_queue_init(&dev->q[i], extra_flags);
|
||||
if (ret) {
|
||||
ublk_err("ublk dev %d queue %d init queue failed\n",
|
||||
dinfo->dev_id, i);
|
||||
goto fail;
|
||||
}
|
||||
}
|
||||
|
||||
for (i = 0; i < dinfo->nr_hw_queues; i++)
|
||||
sem_wait(&queue_sem);
|
||||
free(qinfo);
|
||||
for (i = 0; i < dev->nthreads; i++) {
|
||||
tinfo[i].dev = dev;
|
||||
tinfo[i].idx = i;
|
||||
tinfo[i].ready = &ready;
|
||||
|
||||
/*
|
||||
* If threads are not tied 1:1 to queues, setting thread
|
||||
* affinity based on queue affinity makes little sense.
|
||||
* However, thread CPU affinity has significant impact
|
||||
* on performance, so to compare fairly, we'll still set
|
||||
* thread CPU affinity based on queue affinity where
|
||||
* possible.
|
||||
*/
|
||||
if (dev->nthreads == dinfo->nr_hw_queues)
|
||||
tinfo[i].affinity = &affinity_buf[i];
|
||||
pthread_create(&dev->threads[i].thread, NULL,
|
||||
ublk_io_handler_fn,
|
||||
&tinfo[i]);
|
||||
}
|
||||
|
||||
for (i = 0; i < dev->nthreads; i++)
|
||||
sem_wait(&ready);
|
||||
free(tinfo);
|
||||
free(affinity_buf);
|
||||
|
||||
/* everything is fine now, start us */
|
||||
|
@ -914,9 +995,11 @@ static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
|
|||
ublk_send_dev_event(ctx, dev, dev->dev_info.dev_id);
|
||||
|
||||
/* wait until we are terminated */
|
||||
for (i = 0; i < dinfo->nr_hw_queues; i++)
|
||||
pthread_join(dev->q[i].thread, &thread_ret);
|
||||
for (i = 0; i < dev->nthreads; i++)
|
||||
pthread_join(dev->threads[i].thread, &thread_ret);
|
||||
fail:
|
||||
for (i = 0; i < dinfo->nr_hw_queues; i++)
|
||||
ublk_queue_deinit(&dev->q[i]);
|
||||
ublk_dev_unprep(dev);
|
||||
ublk_dbg(UBLK_DBG_DEV, "%s exit\n", __func__);
|
||||
|
||||
|
@ -1022,13 +1105,14 @@ wait:
|
|||
|
||||
static int __cmd_dev_add(const struct dev_ctx *ctx)
|
||||
{
|
||||
unsigned nthreads = ctx->nthreads;
|
||||
unsigned nr_queues = ctx->nr_hw_queues;
|
||||
const char *tgt_type = ctx->tgt_type;
|
||||
unsigned depth = ctx->queue_depth;
|
||||
__u64 features;
|
||||
const struct ublk_tgt_ops *ops;
|
||||
struct ublksrv_ctrl_dev_info *info;
|
||||
struct ublk_dev *dev;
|
||||
struct ublk_dev *dev = NULL;
|
||||
int dev_id = ctx->dev_id;
|
||||
int ret, i;
|
||||
|
||||
|
@ -1036,29 +1120,55 @@ static int __cmd_dev_add(const struct dev_ctx *ctx)
|
|||
if (!ops) {
|
||||
ublk_err("%s: no such tgt type, type %s\n",
|
||||
__func__, tgt_type);
|
||||
return -ENODEV;
|
||||
ret = -ENODEV;
|
||||
goto fail;
|
||||
}
|
||||
|
||||
if (nr_queues > UBLK_MAX_QUEUES || depth > UBLK_QUEUE_DEPTH) {
|
||||
ublk_err("%s: invalid nr_queues or depth queues %u depth %u\n",
|
||||
__func__, nr_queues, depth);
|
||||
return -EINVAL;
|
||||
ret = -EINVAL;
|
||||
goto fail;
|
||||
}
|
||||
|
||||
/* default to 1:1 threads:queues if nthreads is unspecified */
|
||||
if (!nthreads)
|
||||
nthreads = nr_queues;
|
||||
|
||||
if (nthreads > UBLK_MAX_THREADS) {
|
||||
ublk_err("%s: %u is too many threads (max %u)\n",
|
||||
__func__, nthreads, UBLK_MAX_THREADS);
|
||||
ret = -EINVAL;
|
||||
goto fail;
|
||||
}
|
||||
|
||||
if (nthreads != nr_queues && !ctx->per_io_tasks) {
|
||||
ublk_err("%s: threads %u must be same as queues %u if "
|
||||
"not using per_io_tasks\n",
|
||||
__func__, nthreads, nr_queues);
|
||||
ret = -EINVAL;
|
||||
goto fail;
|
||||
}
|
||||
|
||||
dev = ublk_ctrl_init();
|
||||
if (!dev) {
|
||||
ublk_err("%s: can't alloc dev id %d, type %s\n",
|
||||
__func__, dev_id, tgt_type);
|
||||
return -ENOMEM;
|
||||
ret = -ENOMEM;
|
||||
goto fail;
|
||||
}
|
||||
|
||||
/* kernel doesn't support get_features */
|
||||
ret = ublk_ctrl_get_features(dev, &features);
|
||||
if (ret < 0)
|
||||
return -EINVAL;
|
||||
if (ret < 0) {
|
||||
ret = -EINVAL;
|
||||
goto fail;
|
||||
}
|
||||
|
||||
if (!(features & UBLK_F_CMD_IOCTL_ENCODE))
|
||||
return -ENOTSUP;
|
||||
if (!(features & UBLK_F_CMD_IOCTL_ENCODE)) {
|
||||
ret = -ENOTSUP;
|
||||
goto fail;
|
||||
}
|
||||
|
||||
info = &dev->dev_info;
|
||||
info->dev_id = ctx->dev_id;
|
||||
|
@ -1068,6 +1178,8 @@ static int __cmd_dev_add(const struct dev_ctx *ctx)
|
|||
if ((features & UBLK_F_QUIESCE) &&
|
||||
(info->flags & UBLK_F_USER_RECOVERY))
|
||||
info->flags |= UBLK_F_QUIESCE;
|
||||
dev->nthreads = nthreads;
|
||||
dev->per_io_tasks = ctx->per_io_tasks;
|
||||
dev->tgt.ops = ops;
|
||||
dev->tgt.sq_depth = depth;
|
||||
dev->tgt.cq_depth = depth;
|
||||
|
@ -1097,7 +1209,8 @@ static int __cmd_dev_add(const struct dev_ctx *ctx)
|
|||
fail:
|
||||
if (ret < 0)
|
||||
ublk_send_dev_event(ctx, dev, -1);
|
||||
ublk_ctrl_deinit(dev);
|
||||
if (dev)
|
||||
ublk_ctrl_deinit(dev);
|
||||
return ret;
|
||||
}
|
||||
|
||||
|
@ -1159,6 +1272,8 @@ run:
|
|||
shmctl(ctx->_shmid, IPC_RMID, NULL);
|
||||
/* wait for child and detach from it */
|
||||
wait(NULL);
|
||||
if (exit_code == EXIT_FAILURE)
|
||||
ublk_err("%s: command failed\n", __func__);
|
||||
exit(exit_code);
|
||||
} else {
|
||||
exit(EXIT_FAILURE);
|
||||
|
@ -1266,6 +1381,7 @@ static int cmd_dev_get_features(void)
|
|||
[const_ilog2(UBLK_F_UPDATE_SIZE)] = "UPDATE_SIZE",
|
||||
[const_ilog2(UBLK_F_AUTO_BUF_REG)] = "AUTO_BUF_REG",
|
||||
[const_ilog2(UBLK_F_QUIESCE)] = "QUIESCE",
|
||||
[const_ilog2(UBLK_F_PER_IO_DAEMON)] = "PER_IO_DAEMON",
|
||||
};
|
||||
struct ublk_dev *dev;
|
||||
__u64 features = 0;
|
||||
|
@ -1360,8 +1476,10 @@ static void __cmd_create_help(char *exe, bool recovery)
|
|||
exe, recovery ? "recover" : "add");
|
||||
printf("\t[--foreground] [--quiet] [-z] [--auto_zc] [--auto_zc_fallback] [--debug_mask mask] [-r 0|1 ] [-g]\n");
|
||||
printf("\t[-e 0|1 ] [-i 0|1]\n");
|
||||
printf("\t[--nthreads threads] [--per_io_tasks]\n");
|
||||
printf("\t[target options] [backfile1] [backfile2] ...\n");
|
||||
printf("\tdefault: nr_queues=2(max 32), depth=128(max 1024), dev_id=-1(auto allocation)\n");
|
||||
printf("\tdefault: nthreads=nr_queues");
|
||||
|
||||
for (i = 0; i < sizeof(tgt_ops_list) / sizeof(tgt_ops_list[0]); i++) {
|
||||
const struct ublk_tgt_ops *ops = tgt_ops_list[i];
|
||||
|
@ -1418,6 +1536,8 @@ int main(int argc, char *argv[])
|
|||
{ "auto_zc", 0, NULL, 0 },
|
||||
{ "auto_zc_fallback", 0, NULL, 0 },
|
||||
{ "size", 1, NULL, 's'},
|
||||
{ "nthreads", 1, NULL, 0 },
|
||||
{ "per_io_tasks", 0, NULL, 0 },
|
||||
{ 0, 0, 0, 0 }
|
||||
};
|
||||
const struct ublk_tgt_ops *ops = NULL;
|
||||
|
@ -1493,6 +1613,10 @@ int main(int argc, char *argv[])
|
|||
ctx.flags |= UBLK_F_AUTO_BUF_REG;
|
||||
if (!strcmp(longopts[option_idx].name, "auto_zc_fallback"))
|
||||
ctx.auto_zc_fallback = 1;
|
||||
if (!strcmp(longopts[option_idx].name, "nthreads"))
|
||||
ctx.nthreads = strtol(optarg, NULL, 10);
|
||||
if (!strcmp(longopts[option_idx].name, "per_io_tasks"))
|
||||
ctx.per_io_tasks = 1;
|
||||
break;
|
||||
case '?':
|
||||
/*
|
||||
|
|
|
@@ -49,11 +49,14 @@
 #define UBLKSRV_IO_IDLE_SECS	20
 
 #define UBLK_IO_MAX_BYTES	(1 << 20)
-#define UBLK_MAX_QUEUES	32
+#define UBLK_MAX_QUEUES_SHIFT	5
+#define UBLK_MAX_QUEUES	(1 << UBLK_MAX_QUEUES_SHIFT)
+#define UBLK_MAX_THREADS_SHIFT	5
+#define UBLK_MAX_THREADS	(1 << UBLK_MAX_THREADS_SHIFT)
 #define UBLK_QUEUE_DEPTH	1024
 
 #define UBLK_DBG_DEV	(1U << 0)
-#define UBLK_DBG_QUEUE	(1U << 1)
+#define UBLK_DBG_THREAD	(1U << 1)
 #define UBLK_DBG_IO_CMD	(1U << 2)
 #define UBLK_DBG_IO	(1U << 3)
 #define UBLK_DBG_CTRL_CMD	(1U << 4)
@@ -61,6 +64,7 @@
 
 struct ublk_dev;
 struct ublk_queue;
+struct ublk_thread;
 
 struct stripe_ctx {
 	/* stripe */
@@ -76,6 +80,7 @@ struct dev_ctx {
 	char tgt_type[16];
 	unsigned long flags;
 	unsigned nr_hw_queues;
+	unsigned short nthreads;
 	unsigned queue_depth;
 	int dev_id;
 	int nr_files;
@@ -85,6 +90,7 @@ struct dev_ctx {
 	unsigned int fg:1;
 	unsigned int recovery:1;
 	unsigned int auto_zc_fallback:1;
+	unsigned int per_io_tasks:1;
 
 	int _evtfd;
 	int _shmid;
@@ -123,10 +129,14 @@ struct ublk_io {
 	unsigned short flags;
 	unsigned short refs; /* used by target code only */
 
+	int tag;
+
 	int result;
 
+	unsigned short buf_index;
 	unsigned short tgt_ios;
 	void *private_data;
+	struct ublk_thread *t;
 };
 
 struct ublk_tgt_ops {
@@ -165,28 +175,39 @@ struct ublk_tgt {
 struct ublk_queue {
 	int q_id;
 	int q_depth;
-	unsigned int cmd_inflight;
-	unsigned int io_inflight;
 	struct ublk_dev *dev;
 	const struct ublk_tgt_ops *tgt_ops;
 	struct ublksrv_io_desc *io_cmd_buf;
-	struct io_uring ring;
 
 	struct ublk_io ios[UBLK_QUEUE_DEPTH];
-#define UBLKSRV_QUEUE_STOPPING	(1U << 0)
-#define UBLKSRV_QUEUE_IDLE	(1U << 1)
 #define UBLKSRV_NO_BUF	(1U << 2)
 #define UBLKSRV_ZC	(1U << 3)
 #define UBLKSRV_AUTO_BUF_REG	(1U << 4)
 #define UBLKSRV_AUTO_BUF_REG_FALLBACK	(1U << 5)
 	unsigned state;
-	pid_t tid;
 };
 
+struct ublk_thread {
+	struct ublk_dev *dev;
+	struct io_uring ring;
+	unsigned int cmd_inflight;
+	unsigned int io_inflight;
+
+	pthread_t thread;
+	unsigned idx;
+
+#define UBLKSRV_THREAD_STOPPING	(1U << 0)
+#define UBLKSRV_THREAD_IDLE	(1U << 1)
+	unsigned state;
+};
+
 struct ublk_dev {
 	struct ublk_tgt tgt;
 	struct ublksrv_ctrl_dev_info dev_info;
 	struct ublk_queue q[UBLK_MAX_QUEUES];
+	struct ublk_thread threads[UBLK_MAX_THREADS];
+	unsigned nthreads;
+	unsigned per_io_tasks;
 
 	int fds[MAX_BACK_FILES + 1];	/* fds[0] points to /dev/ublkcN */
 	int nr_fds;
@@ -211,7 +232,7 @@ struct ublk_dev {
 
 
 extern unsigned int ublk_dbg_mask;
-extern int ublk_queue_io_cmd(struct ublk_queue *q, struct ublk_io *io, unsigned tag);
+extern int ublk_queue_io_cmd(struct ublk_io *io);
 
 
 static inline int ublk_io_auto_zc_fallback(const struct ublksrv_io_desc *iod)
@@ -225,11 +246,14 @@ static inline int is_target_io(__u64 user_data)
 }
 
 static inline __u64 build_user_data(unsigned tag, unsigned op,
-		unsigned tgt_data, unsigned is_target_io)
+		unsigned tgt_data, unsigned q_id, unsigned is_target_io)
 {
-	assert(!(tag >> 16) && !(op >> 8) && !(tgt_data >> 16));
+	/* we only have 7 bits to encode q_id */
+	_Static_assert(UBLK_MAX_QUEUES_SHIFT <= 7);
+	assert(!(tag >> 16) && !(op >> 8) && !(tgt_data >> 16) && !(q_id >> 7));
 
-	return tag | (op << 16) | (tgt_data << 24) | (__u64)is_target_io << 63;
+	return tag | (op << 16) | (tgt_data << 24) |
+		(__u64)q_id << 56 | (__u64)is_target_io << 63;
 }
 
 static inline unsigned int user_data_to_tag(__u64 user_data)
@@ -247,6 +271,11 @@ static inline unsigned int user_data_to_tgt_data(__u64 user_data)
 	return (user_data >> 24) & 0xffff;
 }
 
+static inline unsigned int user_data_to_q_id(__u64 user_data)
+{
+	return (user_data >> 56) & 0x7f;
+}
+
 static inline unsigned short ublk_cmd_op_nr(unsigned int op)
 {
 	return _IOC_NR(op);
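The helpers above imply this user_data layout: tag in bits 0-15, op in bits 16-23, tgt_data in bits 24-39, q_id in bits 56-62 (7 bits, hence the UBLK_MAX_QUEUES_SHIFT <= 7 assertion) and the target-io marker in bit 63. A standalone round-trip check of that layout (illustrative; it mirrors the shifts shown in this hunk rather than including kublk.h):

#include <assert.h>
#include <stdint.h>

int main(void)
{
	unsigned tag = 7, op = 0x21, tgt_data = 3, q_id = 5, tgt_io = 1;
	uint64_t d = tag | (op << 16) | ((uint64_t)tgt_data << 24) |
		     ((uint64_t)q_id << 56) | ((uint64_t)tgt_io << 63);

	assert((d & 0xffff) == tag);              /* user_data_to_tag()      */
	assert(((d >> 16) & 0xff) == op);         /* user_data_to_op()       */
	assert(((d >> 24) & 0xffff) == tgt_data); /* user_data_to_tgt_data() */
	assert(((d >> 56) & 0x7f) == q_id);       /* user_data_to_q_id()     */
	assert((d >> 63) == tgt_io);              /* is_target_io()          */
	return 0;
}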
@@ -280,17 +309,23 @@ static inline void ublk_dbg(int level, const char *fmt, ...)
 	}
 }
 
-static inline int ublk_queue_alloc_sqes(struct ublk_queue *q,
+static inline struct ublk_queue *ublk_io_to_queue(const struct ublk_io *io)
+{
+	return container_of(io, struct ublk_queue, ios[io->tag]);
+}
+
+static inline int ublk_io_alloc_sqes(struct ublk_io *io,
 		struct io_uring_sqe *sqes[], int nr_sqes)
 {
-	unsigned left = io_uring_sq_space_left(&q->ring);
+	struct io_uring *ring = &io->t->ring;
+	unsigned left = io_uring_sq_space_left(ring);
 	int i;
 
 	if (left < nr_sqes)
-		io_uring_submit(&q->ring);
+		io_uring_submit(ring);
 
 	for (i = 0; i < nr_sqes; i++) {
-		sqes[i] = io_uring_get_sqe(&q->ring);
+		sqes[i] = io_uring_get_sqe(ring);
 		if (!sqes[i])
 			return i;
 	}
@@ -373,7 +408,7 @@ static inline int ublk_complete_io(struct ublk_queue *q, unsigned tag, int res)
 
 	ublk_mark_io_done(io, res);
 
-	return ublk_queue_io_cmd(q, io, tag);
+	return ublk_queue_io_cmd(io);
 }
 
 static inline void ublk_queued_tgt_io(struct ublk_queue *q, unsigned tag, int queued)
@@ -383,7 +418,7 @@ static inline void ublk_queued_tgt_io(struct ublk_queue *q, unsigned tag, int qu
 	else {
 		struct ublk_io *io = ublk_get_io(q, tag);
 
-		q->io_inflight += queued;
+		io->t->io_inflight += queued;
 		io->tgt_ios = queued;
 		io->result = 0;
 	}
@@ -393,7 +428,7 @@ static inline int ublk_completed_tgt_io(struct ublk_queue *q, unsigned tag)
 {
 	struct ublk_io *io = ublk_get_io(q, tag);
 
-	q->io_inflight--;
+	io->t->io_inflight--;
 
 	return --io->tgt_ios == 0;
 }
@ -43,7 +43,7 @@ static int ublk_null_tgt_init(const struct dev_ctx *ctx, struct ublk_dev *dev)
|
|||
}
|
||||
|
||||
static void __setup_nop_io(int tag, const struct ublksrv_io_desc *iod,
|
||||
struct io_uring_sqe *sqe)
|
||||
struct io_uring_sqe *sqe, int q_id)
|
||||
{
|
||||
unsigned ublk_op = ublksrv_get_op(iod);
|
||||
|
||||
|
@ -52,7 +52,7 @@ static void __setup_nop_io(int tag, const struct ublksrv_io_desc *iod,
|
|||
sqe->flags |= IOSQE_FIXED_FILE;
|
||||
sqe->rw_flags = IORING_NOP_FIXED_BUFFER | IORING_NOP_INJECT_RESULT;
|
||||
sqe->len = iod->nr_sectors << 9; /* injected result */
|
||||
sqe->user_data = build_user_data(tag, ublk_op, 0, 1);
|
||||
sqe->user_data = build_user_data(tag, ublk_op, 0, q_id, 1);
|
||||
}
|
||||
|
||||
static int null_queue_zc_io(struct ublk_queue *q, int tag)
|
||||
|
@ -60,18 +60,18 @@ static int null_queue_zc_io(struct ublk_queue *q, int tag)
|
|||
const struct ublksrv_io_desc *iod = ublk_get_iod(q, tag);
|
||||
struct io_uring_sqe *sqe[3];
|
||||
|
||||
ublk_queue_alloc_sqes(q, sqe, 3);
|
||||
ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, 3);
|
||||
|
||||
io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, tag);
|
||||
io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
|
||||
sqe[0]->user_data = build_user_data(tag,
|
||||
ublk_cmd_op_nr(sqe[0]->cmd_op), 0, 1);
|
||||
ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1);
|
||||
sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK;
|
||||
|
||||
__setup_nop_io(tag, iod, sqe[1]);
|
||||
__setup_nop_io(tag, iod, sqe[1], q->q_id);
|
||||
sqe[1]->flags |= IOSQE_IO_HARDLINK;
|
||||
|
||||
io_uring_prep_buf_unregister(sqe[2], 0, tag, q->q_id, tag);
|
||||
sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, 1);
|
||||
io_uring_prep_buf_unregister(sqe[2], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
|
||||
sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, q->q_id, 1);
|
||||
|
||||
// buf register is marked as IOSQE_CQE_SKIP_SUCCESS
|
||||
return 2;
|
||||
|
@ -82,8 +82,8 @@ static int null_queue_auto_zc_io(struct ublk_queue *q, int tag)
|
|||
const struct ublksrv_io_desc *iod = ublk_get_iod(q, tag);
|
||||
struct io_uring_sqe *sqe[1];
|
||||
|
||||
ublk_queue_alloc_sqes(q, sqe, 1);
|
||||
__setup_nop_io(tag, iod, sqe[0]);
|
||||
ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, 1);
|
||||
__setup_nop_io(tag, iod, sqe[0], q->q_id);
|
||||
return 1;
|
||||
}
|
||||
|
||||
|
@ -136,7 +136,7 @@ static unsigned short ublk_null_buf_index(const struct ublk_queue *q, int tag)
|
|||
{
|
||||
if (q->state & UBLKSRV_AUTO_BUF_REG_FALLBACK)
|
||||
return (unsigned short)-1;
|
||||
return tag;
|
||||
return q->ios[tag].buf_index;
|
||||
}
|
||||
|
||||
const struct ublk_tgt_ops null_tgt_ops = {
|
||||
|
|
|
@ -138,13 +138,13 @@ static int stripe_queue_tgt_rw_io(struct ublk_queue *q, const struct ublksrv_io_
|
|||
io->private_data = s;
|
||||
calculate_stripe_array(conf, iod, s, base);
|
||||
|
||||
ublk_queue_alloc_sqes(q, sqe, s->nr + extra);
|
||||
ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, s->nr + extra);
|
||||
|
||||
if (zc) {
|
||||
io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, tag);
|
||||
io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, io->buf_index);
|
||||
sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK;
|
||||
sqe[0]->user_data = build_user_data(tag,
|
||||
ublk_cmd_op_nr(sqe[0]->cmd_op), 0, 1);
|
||||
ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1);
|
||||
}
|
||||
|
||||
for (i = zc; i < s->nr + extra - zc; i++) {
|
||||
|
@ -162,13 +162,14 @@ static int stripe_queue_tgt_rw_io(struct ublk_queue *q, const struct ublksrv_io_
|
|||
sqe[i]->flags |= IOSQE_IO_HARDLINK;
|
||||
}
|
||||
/* bit63 marks us as tgt io */
|
||||
sqe[i]->user_data = build_user_data(tag, ublksrv_get_op(iod), i - zc, 1);
|
||||
sqe[i]->user_data = build_user_data(tag, ublksrv_get_op(iod), i - zc, q->q_id, 1);
|
||||
}
|
||||
if (zc) {
|
||||
struct io_uring_sqe *unreg = sqe[s->nr + 1];
|
||||
|
||||
io_uring_prep_buf_unregister(unreg, 0, tag, q->q_id, tag);
|
||||
unreg->user_data = build_user_data(tag, ublk_cmd_op_nr(unreg->cmd_op), 0, 1);
|
||||
io_uring_prep_buf_unregister(unreg, 0, tag, q->q_id, io->buf_index);
|
||||
unreg->user_data = build_user_data(
|
||||
tag, ublk_cmd_op_nr(unreg->cmd_op), 0, q->q_id, 1);
|
||||
}
|
||||
|
||||
/* register buffer is skip_success */
|
||||
|
@ -181,11 +182,11 @@ static int handle_flush(struct ublk_queue *q, const struct ublksrv_io_desc *iod,
|
|||
struct io_uring_sqe *sqe[NR_STRIPE];
|
||||
int i;
|
||||
|
||||
ublk_queue_alloc_sqes(q, sqe, conf->nr_files);
|
||||
ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, conf->nr_files);
|
||||
for (i = 0; i < conf->nr_files; i++) {
|
||||
io_uring_prep_fsync(sqe[i], i + 1, IORING_FSYNC_DATASYNC);
|
||||
io_uring_sqe_set_flags(sqe[i], IOSQE_FIXED_FILE);
|
||||
sqe[i]->user_data = build_user_data(tag, UBLK_IO_OP_FLUSH, 0, 1);
|
||||
sqe[i]->user_data = build_user_data(tag, UBLK_IO_OP_FLUSH, 0, q->q_id, 1);
|
||||
}
|
||||
return conf->nr_files;
|
||||
}
|
||||
|
|
|
@@ -278,6 +278,11 @@ __run_io_and_remove()
 	fio --name=job1 --filename=/dev/ublkb"${dev_id}" --ioengine=libaio \
 		--rw=randrw --norandommap --iodepth=256 --size="${size}" --numjobs="$(nproc)" \
 		--runtime=20 --time_based > /dev/null 2>&1 &
+	fio --name=batchjob --filename=/dev/ublkb"${dev_id}" --ioengine=io_uring \
+		--rw=randrw --norandommap --iodepth=256 --size="${size}" \
+		--numjobs="$(nproc)" --runtime=20 --time_based \
+		--iodepth_batch_submit=32 --iodepth_batch_complete_min=32 \
+		--force_async=7 > /dev/null 2>&1 &
 	sleep 2
 	if [ "${kill_server}" = "yes" ]; then
 		local state
tools/testing/selftests/ublk/test_generic_12.sh (new executable file, 55 lines)
@@ -0,0 +1,55 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+TID="generic_12"
+ERR_CODE=0
+
+if ! _have_program bpftrace; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+_prep_test "null" "do imbalanced load, it should be balanced over I/O threads"
+
+NTHREADS=6
+dev_id=$(_add_ublk_dev -t null -q 4 -d 16 --nthreads $NTHREADS --per_io_tasks)
+_check_add_dev $TID $?
+
+dev_t=$(_get_disk_dev_t "$dev_id")
+bpftrace trace/count_ios_per_tid.bt "$dev_t" > "$UBLK_TMP" 2>&1 &
+btrace_pid=$!
+sleep 2
+
+if ! kill -0 "$btrace_pid" > /dev/null 2>&1; then
+	_cleanup_test "null"
+	exit "$UBLK_SKIP_CODE"
+fi
+
+# do imbalanced I/O on the ublk device
+# pin to cpu 0 to prevent migration/only target one queue
+fio --name=write_seq \
+	--filename=/dev/ublkb"${dev_id}" \
+	--ioengine=libaio --iodepth=16 \
+	--rw=write \
+	--size=512M \
+	--direct=1 \
+	--bs=4k \
+	--cpus_allowed=0 > /dev/null 2>&1
+ERR_CODE=$?
+kill "$btrace_pid"
+wait
+
+# check that every task handles some I/O, even though all I/O was issued
+# from a single CPU. when ublk gets support for round-robin tag
+# allocation, this check can be strengthened to assert that every thread
+# handles the same number of I/Os
+NR_THREADS_THAT_HANDLED_IO=$(grep -c '@' ${UBLK_TMP})
+if [[ $NR_THREADS_THAT_HANDLED_IO -ne $NTHREADS ]]; then
+	echo "only $NR_THREADS_THAT_HANDLED_IO handled I/O! expected $NTHREADS"
+	cat "$UBLK_TMP"
+	ERR_CODE=255
+fi
+
+_cleanup_test "null"
+_show_result $TID $ERR_CODE
@@ -41,5 +41,13 @@ if _have_feature "AUTO_BUF_REG"; then
 fi
 wait
 
+if _have_feature "PER_IO_DAEMON"; then
+	ublk_io_and_remove 8G -t null -q 4 --auto_zc --nthreads 8 --per_io_tasks &
+	ublk_io_and_remove 256M -t loop -q 4 --auto_zc --nthreads 8 --per_io_tasks "${UBLK_BACKFILES[0]}" &
+	ublk_io_and_remove 256M -t stripe -q 4 --auto_zc --nthreads 8 --per_io_tasks "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
+	ublk_io_and_remove 8G -t null -q 4 -z --auto_zc --auto_zc_fallback --nthreads 8 --per_io_tasks &
+fi
+wait
+
 _cleanup_test "stress"
 _show_result $TID $ERR_CODE
@@ -38,6 +38,13 @@ if _have_feature "AUTO_BUF_REG"; then
 	ublk_io_and_kill_daemon 256M -t stripe -q 4 --auto_zc "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
 	ublk_io_and_kill_daemon 8G -t null -q 4 -z --auto_zc --auto_zc_fallback &
 fi
+
+if _have_feature "PER_IO_DAEMON"; then
+	ublk_io_and_kill_daemon 8G -t null -q 4 --nthreads 8 --per_io_tasks &
+	ublk_io_and_kill_daemon 256M -t loop -q 4 --nthreads 8 --per_io_tasks "${UBLK_BACKFILES[0]}" &
+	ublk_io_and_kill_daemon 256M -t stripe -q 4 --nthreads 8 --per_io_tasks "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
+	ublk_io_and_kill_daemon 8G -t null -q 4 --nthreads 8 --per_io_tasks &
+fi
 wait
 
 _cleanup_test "stress"
@@ -69,5 +69,12 @@ if _have_feature "AUTO_BUF_REG"; then
 	done
 fi
 
+if _have_feature "PER_IO_DAEMON"; then
+	ublk_io_and_remove 8G -t null -q 4 --nthreads 8 --per_io_tasks -r 1 -i "$reissue" &
+	ublk_io_and_remove 256M -t loop -q 4 --nthreads 8 --per_io_tasks -r 1 -i "$reissue" "${UBLK_BACKFILES[0]}" &
+	ublk_io_and_remove 8G -t null -q 4 --nthreads 8 --per_io_tasks -r 1 -i "$reissue" &
+fi
+wait
+
 _cleanup_test "stress"
 _show_result $TID $ERR_CODE
tools/testing/selftests/ublk/trace/count_ios_per_tid.bt (new file, 11 lines)
@@ -0,0 +1,11 @@
+/*
+ * Tabulates and prints I/O completions per thread for the given device
+ *
+ * $1: dev_t
+ */
+tracepoint:block:block_rq_complete
+{
+	if (args.dev == $1) {
+		@[tid] = count();
+	}
+}