linux/drivers
Damien Le Moal 0faa0fe6f9 nvmet: New NVMe PCI endpoint function target driver
Implement a PCI target driver using the PCI endpoint framework. This
requires hardware with a PCI controller capable of executing in endpoint
mode.

The PCI endpoint framework is used to set up a PCI endpoint function
and its BAR compatible with a NVMe PCI controller. The framework is also
used to map local memory to the PCI address space to execute MMIO
accesses for retrieving NVMe commands from submission queues and posting
completion entries to completion queues. If supported, DMA is used for
command retreival and command data transfers, based on the PCI address
segments indicated by the command using either PRPs or SGLs.

The NVMe target driver relies on the NVMe target core code to execute
all commands isssued by the host. The PCI target driver is mainly
responsible for the following:
 - Initialization and teardown of the endpoint device and its backend
   PCI target controller. The PCI target controller is created using a
   subsystem and a port defined through configfs. The port used must be
   initialized with the "pci" transport type. The target controller is
   allocated and initialized when the PCI endpoint is started by binding
   it to the endpoint PCI device (nvmet_pci_epf_epc_init() function).

 - Manage the endpoint controller state according to the PCI link state
   and the actions of the host (e.g. checking the CC.EN register) and
   propagate these actions to the PCI target controller. Polling of the
   controller enable/disable is done using a delayed work scheduled
   every 5ms (nvmet_pci_epf_poll_cc() function). This work is started
   whenever the PCI link comes up (nvmet_pci_epf_link_up() notifier
   function) and stopped when the PCI link comes down
   (nvmet_pci_epf_link_down() notifier function).
   nvmet_pci_epf_poll_cc() enables and disables the PCI controller using
   the functions nvmet_pci_epf_enable_ctrl() and
   nvmet_pci_epf_disable_ctrl(). The controller admin queue is created
   using nvmet_pci_epf_create_cq(), which calls nvmet_cq_create(), and
   nvmet_pci_epf_create_sq() which uses nvmet_sq_create().
   nvmet_pci_epf_disable_ctrl() always resets the PCI controller to its
   initial state so that nvmet_pci_epf_enable_ctrl() can be called
   again. This ensures correct operation if, for instance, the host
   reboots causing the PCI link to be temporarily down.

 - Manage the controller admin and I/O submission queues using local
   memory. Commands are obtained from submission queues using a work
   item that constantly polls the doorbells of all submissions queues
   (nvmet_pci_epf_poll_sqs() function). This work is started whenever
   the controller is enabled (nvmet_pci_epf_enable_ctrl() function) and
   stopped when the controller is disabled (nvmet_pci_epf_disable_ctrl()
   function). When new commands are submitted by the host, DMA transfers
   are used to retrieve the commands.

 - Initiate the execution of all admin and I/O commands using the target
   core code, by calling a requests execute() function. All commands are
   individually handled using a per-command work item
   (nvmet_pci_epf_iod_work() function). A command overall execution
   includes: initializing a struct nvmet_req request for the command,
   using nvmet_req_transfer_len() to get a command data transfer length,
   parse the command PRPs or SGLs to get the PCI address segments of
   the command data buffer, retrieve data from the host (if the command
   is a write command), call req->execute() to execute the command and
   transfer data to the host (for read commands).

 - Handle the completions of commands as notified by the
   ->queue_response() operation of the PCI target controller
   (nvmet_pci_epf_queue_response() function). Completed commands are
   added to a list of completed command for their CQ. Each CQ list of
   completed command is processed using a work item
   (nvmet_pci_epf_cq_work() function) which posts entries for the
   completed commands in the CQ memory and raise an IRQ to the host to
   signal the completion. IRQ coalescing is supported as mandated by the
   NVMe base specification for PCI controllers. Of note is that
   completion entries are transmitted to the host using MMIO, after
   mapping the completion queue memory to the host PCI address space.
   Unlike for retrieving commands from SQs, DMA is not used as it
   degrades performance due to the transfer serialization needed (which
   delays completion entries transmission).

The configuration of a NVMe PCI endpoint controller is done using
configfs. First the NVMe PCI target controller configuration must be
done to set up a subsystem and a port with the "pci" addr_trtype
attribute. The subsystem can be setup using a file or block device
backed namespace or using a passthrough NVMe device. After this, the
PCI endpoint can be configured and bound to the PCI endpoint controller
to start the NVMe endpoint controller.

In order to not overcomplicate this initial implementation of an
endpoint PCI target controller driver, protection information is not
for now supported. If the PCI controller port and namespace are
configured with protection information support, an error will be
returned when the controller is created and initialized when the
endpoint function is started. Protection information support will be
added in a follow-up patch series.

Using a Rock5B board (Rockchip RK3588 SoC, PCI Gen3x4 endpoint
controller) with a target PCI controller setup with 4 I/O queues and a
null_blk block device as a namespace, the maximum performance using fio
was measured at 131 KIOPS for random 4K reads and up to 2.8 GB/S
throughput. Some data points are:

Rnd read,   4KB,  QD=1, 1 job : IOPS=16.9k, BW=66.2MiB/s (69.4MB/s)
Rnd read,   4KB, QD=32, 1 job : IOPS=78.5k, BW=307MiB/s (322MB/s)
Rnd read,   4KB, QD=32, 4 jobs: IOPS=131k, BW=511MiB/s (536MB/s)
Seq read, 512KB, QD=32, 1 job : IOPS=5381, BW=2691MiB/s (2821MB/s)

The NVMe PCI endpoint target driver is not intended for production use.
It is a tool for learning NVMe, exploring existing features and testing
implementations of new NVMe features.

Co-developed-by: Rick Wertenbroek <rick.wertenbroek@gmail.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10 19:30:49 -08:00
..
accel
accessibility
acpi
amba
android
ata block: simplify tag allocation policy selection 2025-01-06 07:37:41 -07:00
atm
auxdisplay
base
bcma
block nbd: don't allow reconnect after disconnect 2025-01-06 07:38:20 -07:00
bluetooth
bus
cache
cdrom block: remove BLK_MQ_F_SHOULD_MERGE 2024-12-23 08:17:23 -07:00
cdx
char
clk
clocksource
comedi
connector
counter
cpufreq
cpuidle
crypto
cxl
dax
dca
devfreq
dio
dma module: Convert default symbol namespace to string literal 2024-12-03 08:22:25 -08:00
dma-buf
dpll
edac
eisa
extcon
firewire
firmware
fpga
fsi
gnss
gpio
gpu
greybus
hid hid-for-linus-2024120501 2024-12-05 10:06:47 -08:00
hsi
hte
hv
hwmon
hwspinlock
hwtracing
i2c
i3c
idle
iio
infiniband
input
interconnect
iommu
ipack
irqchip
isdn
leds
macintosh
mailbox
mcb
md block: remove BLK_MQ_F_SHOULD_MERGE 2024-12-23 08:17:23 -07:00
media
memory
memstick block: remove BLK_MQ_F_SHOULD_MERGE 2024-12-23 08:17:23 -07:00
message
mfd
misc
mmc block: remove BLK_MQ_F_SHOULD_MERGE 2024-12-23 08:17:23 -07:00
most
mtd block: remove BLK_MQ_F_SHOULD_MERGE 2024-12-23 08:17:23 -07:00
mux
net
nfc
ntb
nubus
nvdimm
nvme nvmet: New NVMe PCI endpoint function target driver 2025-01-10 19:30:49 -08:00
nvmem
of
opp
parisc
parport
pci PCI: hookup irq_get_affinity callback 2024-12-23 08:17:23 -07:00
pcmcia
peci
perf
phy
pinctrl
platform
pmdomain
pnp
power
powercap
pps
ps3
ptp
pwm
rapidio
ras
regulator
remoteproc
reset
rpmsg
rtc
s390 block: remove BLK_MQ_F_SHOULD_MERGE 2024-12-23 08:17:23 -07:00
sbus
scsi block: simplify tag allocation policy selection 2025-01-06 07:37:41 -07:00
sh
siox
slimbus
soc
soundwire
spi
spmi
ssb
staging
target block: remove bio_add_pc_page 2025-01-04 15:27:35 -07:00
tc
tee
thermal
thunderbolt
tty
ufs block: remove BLK_MQ_F_NO_SCHED 2025-01-06 07:37:41 -07:00
uio
usb
vdpa
vfio
vhost
video
virt
virtio virtio: hookup irq_get_affinity callback 2024-12-23 08:17:23 -07:00
w1
watchdog
xen
zorro
Kconfig
Makefile