- IOMMU HW now has features to directly assign HW command queues to a
guest VM. In this mode the command queue operates on a limited set of
invalidation commands that are suitable for improving guest invalidation
performance and easy for the HW to virtualize.
This PR brings the generic infrastructure to allow IOMMU drivers to
expose such command queues through the iommufd uAPI, mmap the doorbell
pages, and get the guest physical range for the command queue ring
itself.
- An implementation for the NVIDIA SMMUv3 extension "cmdqv" is built on
the new iommufd command queue features. It works with the existing SMMU
driver support for cmdqv in guest VMs.
- Many precursor cleanups and improvements to support the above cleanly,
changes to the general ioctl and object helpers, driver support for
VDEVICE, and mmap pgoff cookie infrastructure.
- Sequence VDEVICE destruction to always happen before VFIO device
destruction. When using the above type features, and also in future
confidential compute, the internal virtual device representation becomes
linked to HW or CC TSM configuration and objects. If a VFIO device is
removed from iommufd those HW objects should also be cleaned up to
prevent a sort of UAF. This became important now that we have HW backing
the VDEVICE.
- Fix one syzkaller found error related to math overflows during iova
allocation
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCaIpl9AAKCRCFwuHvBreF
YS5tAP9MDIRML5a/2IOhzcsc4LiDkWTMKm2m1wcRYd+iU2aFVQEAjdghINLHrUlx
HVuIDvNvWIUED/oTAp5kCxQ7PBFN4gU=
=NmCO
-----END PGP SIGNATURE-----
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"This broadly brings the assigned HW command queue support to iommufd.
This feature is used to improve SVA performance in VMs by avoiding
paravirtualization traps during SVA invalidations.
Along the way I think some of the core logic is in a much better state
to support future driver backed features.
Summary:
- IOMMU HW now has features to directly assign HW command queues to a
guest VM. In this mode the command queue operates on a limited set
of invalidation commands that are suitable for improving guest
invalidation performance and easy for the HW to virtualize.
This brings the generic infrastructure to allow IOMMU drivers to
expose such command queues through the iommufd uAPI, mmap the
doorbell pages, and get the guest physical range for the command
queue ring itself.
- An implementation for the NVIDIA SMMUv3 extension "cmdqv" is built
on the new iommufd command queue features. It works with the
existing SMMU driver support for cmdqv in guest VMs.
- Many precursor cleanups and improvements to support the above
cleanly, changes to the general ioctl and object helpers, driver
support for VDEVICE, and mmap pgoff cookie infrastructure.
- Sequence VDEVICE destruction to always happen before VFIO device
destruction. When using the above type features, and also in future
confidential compute, the internal virtual device representation
becomes linked to HW or CC TSM configuration and objects. If a VFIO
device is removed from iommufd those HW objects should also be
cleaned up to prevent a sort of UAF. This became important now that
we have HW backing the VDEVICE.
- Fix one syzkaller found error related to math overflows during iova
allocation"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (57 commits)
iommu/arm-smmu-v3: Replace vsmmu_size/type with get_viommu_size
iommu/arm-smmu-v3: Do not bother impl_ops if IOMMU_VIOMMU_TYPE_ARM_SMMUV3
iommufd: Rename some shortterm-related identifiers
iommufd/selftest: Add coverage for vdevice tombstone
iommufd/selftest: Explicitly skip tests for inapplicable variant
iommufd/vdevice: Remove struct device reference from struct vdevice
iommufd: Destroy vdevice on idevice destroy
iommufd: Add a pre_destroy() op for objects
iommufd: Add iommufd_object_tombstone_user() helper
iommufd/viommu: Roll back to use iommufd_object_alloc() for vdevice
iommufd/selftest: Test reserved regions near ULONG_MAX
iommufd: Prevent ALIGN() overflow
iommu/tegra241-cmdqv: import IOMMUFD module namespace
iommufd: Do not allow _iommufd_object_alloc_ucmd if abort op is set
iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support
iommu/tegra241-cmdqv: Add user-space use support
iommu/tegra241-cmdqv: Do not statically map LVCMDQs
iommu/tegra241-cmdqv: Simplify deinit flow in tegra241_cmdqv_remove_vintf()
iommu/tegra241-cmdqv: Use request_threaded_irq
iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops
...
The hw_info uAPI will support a bidirectional data_type field that can be
used as an input field for user space to request for a specific info data.
To prepare for the uAPI update, change the iommu layer first:
- Add a new IOMMU_HW_INFO_TYPE_DEFAULT as an input, for which driver can
output its only (or firstly) supported type
- Update the kdoc accordingly
- Roll out the type validation in the existing drivers
Link: https://patch.msgid.link/r/00f4a2d3d930721f61367014717b3ba2d1e82a81.1752126748.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The new type of vIOMMU for tegra241-cmdqv allows user space VM to use one
of its virtual command queue HW resources exclusively. This requires user
space to mmap the corresponding MMIO page from kernel space for direct HW
control.
To forward the mmap info (offset and length), iommufd should add a driver
specific data structure to the IOMMUFD_CMD_VIOMMU_ALLOC ioctl, for driver
to output the info during the vIOMMU initialization back to user space.
Similar to the existing ioctls and their IOMMU handlers, add a user_data
to viommu_init op to bridge between iommufd and drivers.
Link: https://patch.msgid.link/r/90bd5637dab7f5507c7a64d2c4826e70431e45a4.1752126748.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Replace u32 to make it clear. No functional changes.
Also simplify the kdoc since the type itself is clear enough.
Link: https://patch.msgid.link/r/651c50dee8ab900f691202ef0204cd5a43fdd6a2.1752126748.git.nicolinc@nvidia.com
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
These drivers all set the domain->pgsize_bitmap in their
domain_alloc_paging() functions, so the ops value is never used. Delete
it.
Reviewed-by: Sven Peter <sven@svenpeter.dev> # for Apple DART
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Tomasz Jeznach <tjeznach@rivosinc.com> # for RISC-V
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/3-v2-68a2e1ba507c+1fb-iommu_rm_ops_pgsize_jgg@nvidia.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
To ease the for-driver iommufd APIs, get_viommu_size and viommu_init ops
are introduced.
Sanitize the inputs and report the size of struct mock_viommu on success,
in mock_get_viommu_size().
The core will ensure the viommu_type is set to the core vIOMMU object, so
simply init the driver part in mock_viommu_init().
Remove the mock_viommu_alloc, completing the replacement.
Link: https://patch.msgid.link/r/993beabbb0bc9705d979a92801ea5ed5996a34eb.1749882255.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There is no use of this parent domain. Delete the dead code.
Note that the s2_parent in struct mock_viommu will be a deadcode too. Yet,
keep it because it will be soon used by HW queue objects, i.e. no point in
adding it back and forth in such a short window. Besides, keeping it could
cover the majority of vIOMMU use cases where a driver-level structure will
be larger in size than the core structure.
Link: https://patch.msgid.link/r/0f155a7cd71034a498448fe4828fb4aaacdabf95.1749882255.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Run clang-format but exclude those not so obvious ones, which leaves us:
- Align indentations
- Add missing spaces
- Remove unnecessary spaces
- Remove unnecessary line wrappings
Link: https://patch.msgid.link/r/9132e1ab45690ab1959c66bbb51ac5536a635388.1749882255.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Update iopf enablement in the iommufd mock device driver to use the new
method, similar to the arm-smmu-v3 driver. Enable iopf support when any
domain with an iopf_handler is attached, and disable it when the domain
is removed.
Add a refcount in the mock device state structure to keep track of the
number of domains set to the device and PASIDs that require iopf.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250418080130.1844424-5-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This adds 4 test ops for pasid attach/replace/detach testing. There are
ops to attach/detach pasid, and also op to check the attached hwpt of a
pasid.
Link: https://patch.msgid.link/r/20250321171940.7213-18-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There is need to get the selftest device (sobj->type == TYPE_IDEV) in
multiple places, so have a helper to for it.
Link: https://patch.msgid.link/r/20250321171940.7213-17-yi.l.liu@intel.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The callback is needed to make pasid_attach/detach path complete for mock
device. A nop is enough for set_dev_pasid.
A MOCK_FLAGS_DEVICE_PASID is added to indicate a pasid-capable mock device
for the pasid test cases. Other test cases will still create a non-pasid
mock device. While the mock iommu always pretends to be pasid-capable.
Link: https://patch.msgid.link/r/20250321171940.7213-16-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This extends the below APIs to support PASID. Device drivers to manage pasid
attach/replace/detach.
int iommufd_device_attach(struct iommufd_device *idev,
ioasid_t pasid, u32 *pt_id);
int iommufd_device_replace(struct iommufd_device *idev,
ioasid_t pasid, u32 *pt_id);
void iommufd_device_detach(struct iommufd_device *idev,
ioasid_t pasid);
The pasid operations share underlying attach/replace/detach infrastructure
with the device operations, but still have some different implications:
- no reserved region per pasid otherwise SVA architecture is already
broken (CPU address space doesn't count device reserved regions);
- accordingly no sw_msi trick;
Cache coherency enforcement is still applied to pasid operations since
it is about memory accesses post page table walking (no matter the walk
is per RID or per PASID).
Link: https://patch.msgid.link/r/20250321171940.7213-12-yi.l.liu@intel.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When attaching a device to a vIOMMU-based nested domain, vdev_id must be
present. Add a piece of code hard-requesting it, preparing for a vEVENTQ
support in the following patch. Then, update the TEST_F.
A HWPT-based nested domain will return a NULL new_viommu, thus no such a
vDEVICE requirement.
Link: https://patch.msgid.link/r/4051ca8a819e51cb30de6b4fe9e4d94d956afe3d.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
IOMMU_HWPT_FAULT_ID_VALID is used to mark if the fault_id field of
iommu_hwp_alloc is valid or not. As the fault_id field is handled in
the iommufd core, so it makes sense to sanitize the
IOMMU_HWPT_FAULT_ID_VALID flag in the iommufd core, and mask it out
before passing the user flags to the iommu drivers.
Link: https://patch.msgid.link/r/20241207120108.5640-1-yi.l.liu@intel.com
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Since this implements domain_alloc_paging_flags() it only needs one
op. Fold mock_domain_alloc_paging() into mock_domain_alloc_paging_flags().
Link: https://patch.msgid.link/r/0-v1-8a3e7e21ff6a+1745d-iommufd_paging_flags_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Now that the main domain allocating path is calling this function it
doesn't make sense to leave it named _user. Change the name to
alloc_paging_flags() to mirror the new iommu_paging_domain_alloc_flags()
function.
A driver should implement only one of ops->domain_alloc_paging() or
ops->domain_alloc_paging_flags(). The former is a simpler interface with
less boiler plate that the majority of drivers use. The latter is for
drivers with a greater feature set (PASID, multiple page table support,
advanced iommufd support, nesting, etc). Additional patches will be needed
to achieve this.
Link: https://patch.msgid.link/r/2-v1-c252ebdeb57b+329-iommu_paging_flags_jgg@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
It turns out all the drivers that are using this immediately call into
another function, so just make that function directly into the op. This
makes paging=NULL for domain_alloc_user and we can remove the argument in
the next patch.
The function mirrors the similar op in the viommu that allocates a nested
domain on top of the viommu's nesting parent. This version supports cases
where a viommu is not being used.
Link: https://patch.msgid.link/r/1-v1-c252ebdeb57b+329-iommu_paging_flags_jgg@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Similar to IOMMU_TEST_OP_MD_CHECK_IOTLB verifying a mock_domain's iotlb,
IOMMU_TEST_OP_DEV_CHECK_CACHE will be used to verify a mock_dev's cache.
Link: https://patch.msgid.link/r/cd4082079d75427bd67ed90c3c825e15b5720a5f.1730836308.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Implement the viommu alloc/free functions to increase/reduce refcount of
its dependent mock iommu device. User space can verify this loop via the
IOMMU_VIOMMU_TYPE_SELFTEST.
Link: https://patch.msgid.link/r/9d755a215a3007d4d8d1c2513846830332db62aa.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
For an iommu_dev that can unplug (so far only this selftest does so), the
viommu->iommu_dev pointer has no guarantee of its life cycle after it is
copied from the idev->dev->iommu->iommu_dev.
Track the user count of the iommu_dev. Postpone the exit routine using a
completion, if refcount is unbalanced. The refcount inc/dec will be added
in the following patch.
Link: https://patch.msgid.link/r/33f28d64841b497eebef11b49a571e03103c5d24.1730836219.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
A nested domain now can be allocated for a parent domain or for a vIOMMU
object. Rework the existing allocators to prepare for the latter case.
Link: https://patch.msgid.link/r/f62894ad8ccae28a8a616845947fe4b76135d79b.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Use these inline helpers to shorten those container_of lines.
Note that one of them goes back and forth between iommu_domain and
mock_iommu_domain, which isn't necessary. So drop its container_of.
Link: https://patch.msgid.link/r/518ec64dae2e814eb29fd9f170f58a3aad56c81c.1730836219.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Nicolin Chen says:
=========
IOMMU_RESV_SW_MSI is a unique region defined by an IOMMU driver. Though it
is eventually used by a device for address translation to an MSI location
(including nested cases), practically it is a universal region across all
domains allocated for the IOMMU that defines it.
Currently IOMMUFD core fetches and reserves the region during an attach to
an hwpt_paging. It works with a hwpt_paging-only case, but might not work
with a nested case where a device could directly attach to a hwpt_nested,
bypassing the hwpt_paging attachment.
Move the enforcement forward, to the hwpt_paging allocation function. Then
clean up all the SW_MSI related things in the attach/replace routine.
=========
Based on v6.11-rc5 for dependencies.
* nesting_reserved_regions: (562 commits)
iommufd/device: Enforce reserved IOVA also when attached to hwpt_nested
Linux 6.11-rc5
...
test_bit() is used to read the memory storing the bitmap, however
test_bit() always uses a unsigned long 8 byte access.
If the bitmap is not an aligned size of 64 bits this will now trigger a
KASAN warning reading past the end of the buffer.
Properly round the buffer allocation to an unsigned long size. Continue to
copy_from_user() using a byte granularity.
Fixes: 9560393b83 ("iommufd/selftest: Fix iommufd_test_dirty() to handle <u8 bitmaps")
Link: https://patch.msgid.link/r/0-v1-113e8d9e7861+5ae-iommufd_kasan_jgg@nvidia.com
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The sparse tool complains as follows:
drivers/iommu/iommufd/selftest.c:277:30: warning:
symbol 'dirty_ops' was not declared. Should it be static?
This symbol is not used outside of selftest.c, so marks it static.
Fixes: 266ce58989 ("iommufd/selftest: Test IOMMU_HWPT_ALLOC_DIRTY_TRACKING")
Link: https://patch.msgid.link/r/20240819120007.3884868-1-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Lu Baolu says:
====================
This series implements the functionality of delivering IO page faults to
user space through the IOMMUFD framework. One feasible use case is the
nested translation. Nested translation is a hardware feature that supports
two-stage translation tables for IOMMU. The second-stage translation table
is managed by the host VMM, while the first-stage translation table is
owned by user space. This allows user space to control the IOMMU mappings
for its devices.
When an IO page fault occurs on the first-stage translation table, the
IOMMU hardware can deliver the page fault to user space through the
IOMMUFD framework. User space can then handle the page fault and respond
to the device top-down through the IOMMUFD. This allows user space to
implement its own IO page fault handling policies.
User space application that is capable of handling IO page faults should
allocate a fault object, and bind the fault object to any domain that it
is willing to handle the fault generatd for them. On a successful return
of fault object allocation, the user can retrieve and respond to page
faults by reading or writing to the file descriptor (FD) returned.
The iommu selftest framework has been updated to test the IO page fault
delivery and response functionality.
====================
* iommufd_pri:
iommufd/selftest: Add coverage for IOPF test
iommufd/selftest: Add IOPF support for mock device
iommufd: Associate fault object with iommufd_hw_pgtable
iommufd: Fault-capable hwpt attach/detach/replace
iommufd: Add iommufd fault object
iommufd: Add fault and response message definitions
iommu: Extend domain attach group with handle support
iommu: Add attach handle to struct iopf_group
iommu: Remove sva handle list
iommu: Introduce domain attachment handle
Link: https://lore.kernel.org/all/20240702063444.105814-1-baolu.lu@linux.intel.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Extend the selftest mock device to support generating and responding to
an IOPF. Also add an ioctl interface to userspace applications to trigger
the IOPF on the mock device. This would allow userspace applications to
test the IOMMUFD's handling of IOPFs without having to rely on any real
hardware.
Link: https://lore.kernel.org/r/20240702063444.105814-10-baolu.lu@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Do not set a hugepage-aligned IOVA for incrementing an IOVA, to better
match current IOMMU driver implementations. Keep the logic of clearing all
IOPTE dirty bits for a whole hugepage, even if the range being dirtied
starts from part of the hugepage. This is also similar to AMD driver (iommu
v1 format) where IOMMU uses various subpage PTE data for dirty tracking
(for non-standard page sizes).
Link: https://lore.kernel.org/r/20240627110105.62325-6-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Matt Ochs <mochs@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The calculation returns 0 if it sets less than the number of bits per
byte. For calculating memory allocation from bits, lets round it up to
one byte.
Link: https://lore.kernel.org/r/20240627110105.62325-3-joao.m.martins@oracle.com
Reported-by: Matt Ochs <mochs@nvidia.com>
Fixes: a9af47e382 ("iommufd/selftest: Test IOMMU_HWPT_GET_DIRTY_BITMAP")
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Matt Ochs <mochs@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Since MOCK_HUGE_PAGE_SIZE was introduced it allows the core code to invoke
mock with large page sizes. This confuses the validation logic that checks
that map/unmap are paired.
This is because the page size computed for map is based on the physical
address and in many cases will always be the base page size, however the
entire range generated by iommufd will be passed to map.
Randomly iommufd can see small groups of physically contiguous pages,
(say 8k unaligned and grouped together), but that group crosses a huge
page boundary. The map side will observe this as a contiguous run and mark
it accordingly, but there is a chance the unmap side will end up
terminating interior huge pages in the middle of that group and trigger a
validation failure. Meaning the validation only works if the core code
passes the iova/length directly from iommufd to mock.
syzkaller randomly hits this with failures like:
WARNING: CPU: 0 PID: 11568 at drivers/iommu/iommufd/selftest.c:461 mock_domain_unmap_pages+0x1c0/0x250
Modules linked in:
CPU: 0 PID: 11568 Comm: syz-executor.0 Not tainted 6.8.0-rc3+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:mock_domain_unmap_pages+0x1c0/0x250
Code: 2b e8 94 37 0f ff 48 d1 eb 31 ff 48 b8 00 00 00 00 00 00 20 00 48 21 c3 48 89 de e8 aa 32 0f ff 48 85 db 75 07 e8 70 37 0f ff <0f> 0b e8 69 37 0f ff 31 f6 31 ff e8 90 32 0f ff e8 5b 37 0f ff 4c
RSP: 0018:ffff88800e707490 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff822dfae6
RDX: ffff88800cf86400 RSI: ffffffff822dfaf0 RDI: 0000000000000007
RBP: ffff88800e7074d8 R08: 0000000000000000 R09: ffffed1001167c90
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000001500000
R13: 0000000000083000 R14: 0000000000000001 R15: 0000000000000800
FS: 0000555556048480(0000) GS:ffff88806d400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2dc23000 CR3: 0000000008cbb000 CR4: 0000000000350eb0
Call Trace:
<TASK>
__iommu_unmap+0x281/0x520
iommu_unmap+0xc9/0x180
iopt_area_unmap_domain_range+0x1b1/0x290
iopt_area_unpin_domain+0x590/0x800
__iopt_area_unfill_domain+0x22e/0x650
iopt_area_unfill_domain+0x47/0x60
iopt_unfill_domain+0x187/0x590
iopt_table_remove_domain+0x267/0x2d0
iommufd_hwpt_paging_destroy+0x1f1/0x370
iommufd_object_remove+0x2a3/0x490
iommufd_device_detach+0x23a/0x2c0
iommufd_selftest_destroy+0x7a/0xf0
iommufd_fops_release+0x1d3/0x340
__fput+0x272/0xb50
__fput_sync+0x4b/0x60
__x64_sys_close+0x8b/0x110
do_syscall_64+0x71/0x140
entry_SYSCALL_64_after_hwframe+0x46/0x4e
Do the simple thing and just disable the validation when the huge page
tests are being run.
Fixes: 7db521e23f ("iommufd/selftest: Hugepage mock domain support")
Link: https://lore.kernel.org/r/0-v1-1e17e60a5c8a+103fb-iommufd_mock_hugepg_jgg@nvidia.com
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Syzkaller reported the following bug:
general protection fault, probably for non-canonical address 0xdffffc0000000038: 0000 [#1] SMP KASAN
KASAN: null-ptr-deref in range [0x00000000000001c0-0x00000000000001c7]
Call Trace:
lock_acquire
lock_acquire+0x1ce/0x4f0
down_read+0x93/0x4a0
iommufd_test_syz_conv_iova+0x56/0x1f0
iommufd_test_access_rw.isra.0+0x2ec/0x390
iommufd_test+0x1058/0x1e30
iommufd_fops_ioctl+0x381/0x510
vfs_ioctl
__do_sys_ioctl
__se_sys_ioctl
__x64_sys_ioctl+0x170/0x1e0
do_syscall_x64
do_syscall_64+0x71/0x140
This is because the new iommufd_access_change_ioas() sets access->ioas to
NULL during its process, so the lock might be gone in a concurrent racing
context.
Fix this by doing the same access->ioas sanity as iommufd_access_rw() and
iommufd_access_pin_pages() functions do.
Cc: stable@vger.kernel.org
Fixes: 9227da7816 ("iommufd: Add iommufd_access_change_ioas(_id) helpers")
Link: https://lore.kernel.org/r/3f1932acaf1dd494d404c04364d73ce8f57f3e5e.1708636627.git.nicolinc@nvidia.com
Reported-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Syzkaller reported the following bug:
sysfs: cannot create duplicate filename '/devices/iommufd_mock4'
Call Trace:
sysfs_warn_dup+0x71/0x90
sysfs_create_dir_ns+0x1ee/0x260
? sysfs_create_mount_point+0x80/0x80
? spin_bug+0x1d0/0x1d0
? do_raw_spin_unlock+0x54/0x220
kobject_add_internal+0x221/0x970
kobject_add+0x11c/0x1e0
? lockdep_hardirqs_on_prepare+0x273/0x3e0
? kset_create_and_add+0x160/0x160
? kobject_put+0x5d/0x390
? bus_get_dev_root+0x4a/0x60
? kobject_put+0x5d/0x390
device_add+0x1d5/0x1550
? __fw_devlink_link_to_consumers.isra.0+0x1f0/0x1f0
? __init_waitqueue_head+0xcb/0x150
iommufd_test+0x462/0x3b60
? lock_release+0x1fe/0x640
? __might_fault+0x117/0x170
? reacquire_held_locks+0x4b0/0x4b0
? iommufd_selftest_destroy+0xd0/0xd0
? __might_fault+0xbe/0x170
iommufd_fops_ioctl+0x256/0x350
? iommufd_option+0x180/0x180
? __lock_acquire+0x1755/0x45f0
__x64_sys_ioctl+0xa13/0x1640
The bug is triggered when Syzkaller created multiple mock devices but
didn't destroy them in the same sequence, messing up the mock_dev_num
counter. Replace the atomic with an mock_dev_ida.
Cc: stable@vger.kernel.org
Fixes: 23a1b46f15 ("iommufd/selftest: Make the mock iommu driver into a real driver")
Link: https://lore.kernel.org/r/5af41d5af6d5c013cc51de01427abb8141b3587e.1708636627.git.nicolinc@nvidia.com
Reported-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add support to mock iommu hugepages of 1M (for a 2K mock io page size). To
avoid breaking test suite defaults, the way this is done is by explicitly
creating a iommu mock device which has hugepage support (i.e. through
MOCK_FLAGS_DEVICE_HUGE_IOVA).
The same scheme is maintained of mock base page index tracking in the
XArray, except that an extra bit is added to mark it as a hugepage. One
subpage containing the dirty bit, means that the whole hugepage is dirty
(similar to AMD IOMMU non-standard page sizes). For clearing, same thing
applies, and it must clear all dirty subpages.
This is in preparation for dirty tracking to mark mock hugepages as
dirty to exercise all the iova-bitmap fixes.
Link: https://lore.kernel.org/r/20240202133415.23819-8-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Move the clearing of the dirty bit of the mock domain into
mock_domain_test_and_clear_dirty() helper, simplifying the caller
function.
Additionally, rework the mock_domain_read_and_clear_dirty() loop to
iterate over a potentially variable IO page size. No functional change
intended with the loop refactor.
This is in preparation for dirty tracking support for IOMMU hugepage mock
domains.
Link: https://lore.kernel.org/r/20240202133415.23819-7-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This relied on the probe function only being invoked by the bus type mock
was registered on. The removal of the bus ops broke this assumption and
the probe could be called on non-mock bus types like PCI.
Check the bus type directly in probe.
Fixes: 17de3f5fdd ("iommu: Retire bus ops")
Link: https://lore.kernel.org/r/0-v1-82d59f7eab8c+40c-iommufd_mock_bus_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Allow to test whether IOTLB has been invalidated or not.
Link: https://lore.kernel.org/r/20240111041015.47920-6-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add mock_domain_cache_invalidate_user() data structure to support user
space selftest program to cover user cache invalidation pathway.
Link: https://lore.kernel.org/r/20240111041015.47920-5-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Co-developed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Including:
- Core changes:
- Make default-domains mandatory for all IOMMU drivers
- Remove group refcounting
- Add generic_single_device_group() helper and consolidate
drivers
- Cleanup map/unmap ops
- Scaling improvements for the IOVA rcache depot
- Convert dart & iommufd to the new domain_alloc_paging()
- ARM-SMMU:
- Device-tree binding update:
- Add qcom,sm7150-smmu-v2 for Adreno on SM7150 SoC
- SMMUv2:
- Support for Qualcomm SDM670 (MDSS) and SM7150 SoCs
- SMMUv3:
- Large refactoring of the context descriptor code to
move the CD table into the master, paving the way
for '->set_dev_pasid()' support on non-SVA domains
- Minor cleanups to the SVA code
- Intel VT-d:
- Enable debugfs to dump domain attached to a pasid
- Remove an unnecessary inline function.
- AMD IOMMU:
- Initial patches for SVA support (not complete yet)
- S390 IOMMU:
- DMA-API conversion and optimized IOTLB flushing
- Some smaller fixes and improvements
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmVJFcEACgkQK/BELZcB
GuMgDxAAsnYVQjQ7wRkwR0rHARuEaJ+Lz2vkLNH+uYXjBzhFe2bT+ykMcZysAkdK
A5PMLOFT5Etf+PAqOM0CoIGQFOefAId6uGl7S61Fp9ZWDKhMrOBFWhxGOaufA1Du
tNvt3i66hwPSDZa82kY3wRCluYtj0aBBzmM6ZTwBwFZdQ7LABMtE8OxisqncVvq0
H6vhV213fqvhCFSQJ6PnTAEiv70WvWBWygA+Z/gwYf9hypZQae91PNXdK9313a9z
OvCzGBkL/R5/3KkJd88UhFwyYzyNGxq/DmH1etawYR5gYZ8UT/Z/sYpcx9hlO7qr
eENPqeQc+YHZXpKqkaq66HBA1FSnXUqRZLl4cVaZahRRMe/yArsBM6R0W1AfkMAR
rZxwHKoHUWeuHQLMVvmSDNL57h/GJJpTXjRc8HMxLZkVp+ScvnT5XCYHWWzRdCdx
TcC/pJ1tet0FQ8rw09ovlwpGVA6eojWvcpVbLVLfGN8ZWViSVfvNFoPNb7HsGK6M
iRi+L41Y7s63cyogC/Gsae2RAvYv29ZpvE91lmon2u+VBlTpMdOFX9EhWS6RqOBF
cV30bhsw0dyCB7v5jDPtABYEOaR6l1mPLhn1gX3u0Ue/tmPhLX69k4bVWBY6wP3p
gmmJD9ub8FuPQtFCGPE7/8ZINjGGrfiKO24DNI2Ty3XEeq21hU4=
=UyWC
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
"Core changes:
- Make default-domains mandatory for all IOMMU drivers
- Remove group refcounting
- Add generic_single_device_group() helper and consolidate drivers
- Cleanup map/unmap ops
- Scaling improvements for the IOVA rcache depot
- Convert dart & iommufd to the new domain_alloc_paging()
ARM-SMMU:
- Device-tree binding update:
- Add qcom,sm7150-smmu-v2 for Adreno on SM7150 SoC
- SMMUv2:
- Support for Qualcomm SDM670 (MDSS) and SM7150 SoCs
- SMMUv3:
- Large refactoring of the context descriptor code to move the CD
table into the master, paving the way for '->set_dev_pasid()'
support on non-SVA domains
- Minor cleanups to the SVA code
Intel VT-d:
- Enable debugfs to dump domain attached to a pasid
- Remove an unnecessary inline function
AMD IOMMU:
- Initial patches for SVA support (not complete yet)
S390 IOMMU:
- DMA-API conversion and optimized IOTLB flushing
And some smaller fixes and improvements"
* tag 'iommu-updates-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (102 commits)
iommu/dart: Remove the force_bypass variable
iommu/dart: Call apple_dart_finalize_domain() as part of alloc_paging()
iommu/dart: Convert to domain_alloc_paging()
iommu/dart: Move the blocked domain support to a global static
iommu/dart: Use static global identity domains
iommufd: Convert to alloc_domain_paging()
iommu/vt-d: Use ops->blocked_domain
iommu/vt-d: Update the definition of the blocking domain
iommu: Move IOMMU_DOMAIN_BLOCKED global statics to ops->blocked_domain
Revert "iommu/vt-d: Remove unused function"
iommu/amd: Remove DMA_FQ type from domain allocation path
iommu: change iommu_map_sgtable to return signed values
iommu/virtio: Add __counted_by for struct viommu_request and use struct_size()
iommu/vt-d: debugfs: Support dumping a specified page table
iommu/vt-d: debugfs: Create/remove debugfs file per {device, pasid}
iommu/vt-d: debugfs: Dump entry pointing to huge page
iommu/vt-d: Remove unused function
iommu/arm-smmu-v3-sva: Remove bond refcount
iommu/arm-smmu-v3-sva: Remove unused iommu_sva handle
iommu/arm-smmu-v3: Rename cdcfg to cd_table
...
Patches in Joerg's iommu tree to convert the mock driver to use
domain_alloc_paging() that clash badly with the way the selftest changes
for nesting were structured.
Massage the selftest so that it looks closer the code after the
domain_alloc_paging() conversion to ease the merge. Change
__mock_domain_alloc_paging() into mock_domain_alloc_paging() in the same
way as the iommu tree. The merge resolution then trivially takes both and
deletes mock_domain_alloc().
Link: https://lore.kernel.org/r/0-v1-90a855762c96+19de-mock_merge_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
iommufd_test_dirty()/IOMMU_TEST_OP_DIRTY sets the dirty bits in the mock
domain implementation that the userspace side validates against what it
obtains via the UAPI.
However in introducing iommufd_test_dirty() it forgot to validate page_size
being 0 leading to two possible divide-by-zero problems: one at the
beginning when calculating @max and while calculating the IOVA in the
XArray PFN tracking list.
While at it, validate the length to require non-zero value as well, as we
can't be allocating a 0-sized bitmap.
Link: https://lore.kernel.org/r/20231030113446.7056-1-joao.m.martins@oracle.com
Reported-by: syzbot+25dc7383c30ecdc83c38@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-iommu/00000000000005f6aa0608b9220f@google.com/
Fixes: a9af47e382 ("iommufd/selftest: Test IOMMU_HWPT_GET_DIRTY_BITMAP")
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Move the global static blocked domain to the ops and convert the unmanaged
domain to domain_alloc_paging.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Sven Peter <sven@svenpeter.dev>
Link: https://lore.kernel.org/r/4-v2-bff223cf6409+282-dart_paging_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Add nested domain support in the ->domain_alloc_user op with some proper
sanity checks. Then, add a domain_nested_ops for all nested domains and
split the get_md_pagetable helper into paging and nested helpers.
Also, add an iotlb as a testing property of a nested domain.
Link: https://lore.kernel.org/r/20231026043938.63898-10-yi.l.liu@intel.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>