linux/arch/arm64/include/asm/kvm_pgtable.h

847 lines
29 KiB
C
Raw Permalink Normal View History

// SPDX-License-Identifier: GPL-2.0-only
/*
* Copyright (C) 2020 Google LLC
* Author: Will Deacon <will@kernel.org>
*/
#ifndef __ARM64_KVM_PGTABLE_H__
#define __ARM64_KVM_PGTABLE_H__
#include <linux/bits.h>
#include <linux/kvm_host.h>
#include <linux/types.h>
KVM: arm64: Support up to 5 levels of translation in kvm_pgtable FEAT_LPA2 increases the maximum levels of translation from 4 to 5 for the 4KB page case, when IA is >48 bits. While we can still use 4 levels for stage2 translation in this case (due to stage2 allowing concatenated page tables for first level lookup), the same kvm_pgtable library is used for the hyp stage1 page tables and stage1 does not support concatenation. Therefore, modify the library to support up to 5 levels. Previous patches already laid the groundwork for this by refactoring code to work in terms of KVM_PGTABLE_FIRST_LEVEL and KVM_PGTABLE_LAST_LEVEL. So we just need to change these macros. The hardware sometimes encodes the new level differently from the others: One such place is when reading the level from the FSC field in the ESR_EL2 register. We never expect to see the lowest level (-1) here since the stage 2 page tables always use concatenated tables for first level lookup and therefore only use 4 levels of lookup. So we get away with just adding a comment to explain why we are not being careful about decoding level -1. For stage2 VTCR_EL2.SL2 is introduced to encode the new start level. However, since we always use concatenated page tables for first level look up at stage2 (and therefore we will never need the new extra level) we never touch this new field. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231127111737.1897081-10-ryan.roberts@arm.com
2023-11-27 11:17:34 +00:00
#define KVM_PGTABLE_FIRST_LEVEL -1
#define KVM_PGTABLE_LAST_LEVEL 3
KVM: arm64: Prepare the creation of s1 mappings at EL2 When memory protection is enabled, the EL2 code needs the ability to create and manage its own page-table. To do so, introduce a new set of hypercalls to bootstrap a memory management system at EL2. This leads to the following boot flow in nVHE Protected mode: 1. the host allocates memory for the hypervisor very early on, using the memblock API; 2. the host creates a set of stage 1 page-table for EL2, installs the EL2 vectors, and issues the __pkvm_init hypercall; 3. during __pkvm_init, the hypervisor re-creates its stage 1 page-table and stores it in the memory pool provided by the host; 4. the hypervisor then extends its stage 1 mappings to include a vmemmap in the EL2 VA space, hence allowing to use the buddy allocator introduced in a previous patch; 5. the hypervisor jumps back in the idmap page, switches from the host-provided page-table to the new one, and wraps up its initialization by enabling the new allocator, before returning to the host. 6. the host can free the now unused page-table created for EL2, and will now need to issue hypercalls to make changes to the EL2 stage 1 mappings instead of modifying them directly. Note that for the sake of simplifying the review, this patch focuses on the hypervisor side of things. In other words, this only implements the new hypercalls, but does not make use of them from the host yet. The host-side changes will follow in a subsequent patch. Credits to Will for __pkvm_init_switch_pgd. Acked-by: Will Deacon <will@kernel.org> Co-authored-by: Will Deacon <will@kernel.org> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210319100146.1149909-18-qperret@google.com
2021-03-19 10:01:25 +00:00
/*
* The largest supported block sizes for KVM (no 52-bit PA support):
* - 4K (level 1): 1GB
* - 16K (level 2): 32MB
* - 64K (level 2): 512MB
*/
#ifdef CONFIG_ARM64_4K_PAGES
#define KVM_PGTABLE_MIN_BLOCK_LEVEL 1
#else
#define KVM_PGTABLE_MIN_BLOCK_LEVEL 2
#endif
KVM: arm64: Use LPA2 page-tables for stage2 and hyp stage1 Implement a simple policy whereby if the HW supports FEAT_LPA2 for the page size we are using, always use LPA2-style page-tables for stage 2 and hyp stage 1 (assuming an nvhe hyp), regardless of the VMM-requested IPA size or HW-implemented PA size. When in use we can now support up to 52-bit IPA and PA sizes. We use the previously created cpu feature to track whether LPA2 is supported for deciding whether to use the LPA2 or classic pte format. Note that FEAT_LPA2 brings support for bigger block mappings (512GB with 4KB, 64GB with 16KB). We explicitly don't enable these in the library because stage2_apply_range() works on batch sizes of the largest used block mapping, and increasing the size of the batch would lead to soft lockups. See commit 5994bc9e05c2 ("KVM: arm64: Limit stage2_apply_range() batch size to largest block"). With the addition of LPA2 support in the hypervisor, the PA size supported by the HW must be capped with a runtime decision, rather than simply using a compile-time decision based on PA_BITS. For example, on a system that advertises 52 bit PA but does not support FEAT_LPA2, A 4KB or 16KB kernel compiled with LPA2 support must still limit the PA size to 48 bits. Therefore, move the insertion of the PS field into TCR_EL2 out of __kvm_hyp_init assembly code and instead do it in cpu_prepare_hyp_mode() where the rest of TCR_EL2 is prepared. This allows us to figure out PS with kvm_get_parange(), which has the appropriate logic to ensure the above requirement. (and the PS field of VTCR_EL2 is already populated this way). Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231127111737.1897081-8-ryan.roberts@arm.com
2023-11-27 11:17:32 +00:00
#define kvm_lpa2_is_enabled() system_supports_lpa2()
static inline u64 kvm_get_parange_max(void)
{
if (kvm_lpa2_is_enabled() ||
(IS_ENABLED(CONFIG_ARM64_PA_BITS_52) && PAGE_SHIFT == 16))
return ID_AA64MMFR0_EL1_PARANGE_52;
else
return ID_AA64MMFR0_EL1_PARANGE_48;
}
static inline u64 kvm_get_parange(u64 mmfr0)
{
KVM: arm64: Use LPA2 page-tables for stage2 and hyp stage1 Implement a simple policy whereby if the HW supports FEAT_LPA2 for the page size we are using, always use LPA2-style page-tables for stage 2 and hyp stage 1 (assuming an nvhe hyp), regardless of the VMM-requested IPA size or HW-implemented PA size. When in use we can now support up to 52-bit IPA and PA sizes. We use the previously created cpu feature to track whether LPA2 is supported for deciding whether to use the LPA2 or classic pte format. Note that FEAT_LPA2 brings support for bigger block mappings (512GB with 4KB, 64GB with 16KB). We explicitly don't enable these in the library because stage2_apply_range() works on batch sizes of the largest used block mapping, and increasing the size of the batch would lead to soft lockups. See commit 5994bc9e05c2 ("KVM: arm64: Limit stage2_apply_range() batch size to largest block"). With the addition of LPA2 support in the hypervisor, the PA size supported by the HW must be capped with a runtime decision, rather than simply using a compile-time decision based on PA_BITS. For example, on a system that advertises 52 bit PA but does not support FEAT_LPA2, A 4KB or 16KB kernel compiled with LPA2 support must still limit the PA size to 48 bits. Therefore, move the insertion of the PS field into TCR_EL2 out of __kvm_hyp_init assembly code and instead do it in cpu_prepare_hyp_mode() where the rest of TCR_EL2 is prepared. This allows us to figure out PS with kvm_get_parange(), which has the appropriate logic to ensure the above requirement. (and the PS field of VTCR_EL2 is already populated this way). Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231127111737.1897081-8-ryan.roberts@arm.com
2023-11-27 11:17:32 +00:00
u64 parange_max = kvm_get_parange_max();
u64 parange = cpuid_feature_extract_unsigned_field(mmfr0,
ID_AA64MMFR0_EL1_PARANGE_SHIFT);
KVM: arm64: Use LPA2 page-tables for stage2 and hyp stage1 Implement a simple policy whereby if the HW supports FEAT_LPA2 for the page size we are using, always use LPA2-style page-tables for stage 2 and hyp stage 1 (assuming an nvhe hyp), regardless of the VMM-requested IPA size or HW-implemented PA size. When in use we can now support up to 52-bit IPA and PA sizes. We use the previously created cpu feature to track whether LPA2 is supported for deciding whether to use the LPA2 or classic pte format. Note that FEAT_LPA2 brings support for bigger block mappings (512GB with 4KB, 64GB with 16KB). We explicitly don't enable these in the library because stage2_apply_range() works on batch sizes of the largest used block mapping, and increasing the size of the batch would lead to soft lockups. See commit 5994bc9e05c2 ("KVM: arm64: Limit stage2_apply_range() batch size to largest block"). With the addition of LPA2 support in the hypervisor, the PA size supported by the HW must be capped with a runtime decision, rather than simply using a compile-time decision based on PA_BITS. For example, on a system that advertises 52 bit PA but does not support FEAT_LPA2, A 4KB or 16KB kernel compiled with LPA2 support must still limit the PA size to 48 bits. Therefore, move the insertion of the PS field into TCR_EL2 out of __kvm_hyp_init assembly code and instead do it in cpu_prepare_hyp_mode() where the rest of TCR_EL2 is prepared. This allows us to figure out PS with kvm_get_parange(), which has the appropriate logic to ensure the above requirement. (and the PS field of VTCR_EL2 is already populated this way). Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231127111737.1897081-8-ryan.roberts@arm.com
2023-11-27 11:17:32 +00:00
if (parange > parange_max)
parange = parange_max;
return parange;
}
typedef u64 kvm_pte_t;
#define KVM_PTE_VALID BIT(0)
#define KVM_PTE_ADDR_MASK GENMASK(47, PAGE_SHIFT)
#define KVM_PTE_ADDR_51_48 GENMASK(15, 12)
KVM: arm64: Use LPA2 page-tables for stage2 and hyp stage1 Implement a simple policy whereby if the HW supports FEAT_LPA2 for the page size we are using, always use LPA2-style page-tables for stage 2 and hyp stage 1 (assuming an nvhe hyp), regardless of the VMM-requested IPA size or HW-implemented PA size. When in use we can now support up to 52-bit IPA and PA sizes. We use the previously created cpu feature to track whether LPA2 is supported for deciding whether to use the LPA2 or classic pte format. Note that FEAT_LPA2 brings support for bigger block mappings (512GB with 4KB, 64GB with 16KB). We explicitly don't enable these in the library because stage2_apply_range() works on batch sizes of the largest used block mapping, and increasing the size of the batch would lead to soft lockups. See commit 5994bc9e05c2 ("KVM: arm64: Limit stage2_apply_range() batch size to largest block"). With the addition of LPA2 support in the hypervisor, the PA size supported by the HW must be capped with a runtime decision, rather than simply using a compile-time decision based on PA_BITS. For example, on a system that advertises 52 bit PA but does not support FEAT_LPA2, A 4KB or 16KB kernel compiled with LPA2 support must still limit the PA size to 48 bits. Therefore, move the insertion of the PS field into TCR_EL2 out of __kvm_hyp_init assembly code and instead do it in cpu_prepare_hyp_mode() where the rest of TCR_EL2 is prepared. This allows us to figure out PS with kvm_get_parange(), which has the appropriate logic to ensure the above requirement. (and the PS field of VTCR_EL2 is already populated this way). Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231127111737.1897081-8-ryan.roberts@arm.com
2023-11-27 11:17:32 +00:00
#define KVM_PTE_ADDR_MASK_LPA2 GENMASK(49, PAGE_SHIFT)
#define KVM_PTE_ADDR_51_50_LPA2 GENMASK(9, 8)
#define KVM_PHYS_INVALID (-1ULL)
#define KVM_PTE_TYPE BIT(1)
#define KVM_PTE_TYPE_BLOCK 0
#define KVM_PTE_TYPE_PAGE 1
#define KVM_PTE_TYPE_TABLE 1
#define KVM_PTE_LEAF_ATTR_LO GENMASK(11, 2)
#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX GENMASK(4, 2)
#define KVM_PTE_LEAF_ATTR_LO_S1_AP GENMASK(7, 6)
#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO \
({ cpus_have_final_cap(ARM64_KVM_HVHE) ? 2 : 3; })
#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW \
({ cpus_have_final_cap(ARM64_KVM_HVHE) ? 0 : 1; })
#define KVM_PTE_LEAF_ATTR_LO_S1_SH GENMASK(9, 8)
#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS 3
#define KVM_PTE_LEAF_ATTR_LO_S1_AF BIT(10)
#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR GENMASK(5, 2)
#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R BIT(6)
#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W BIT(7)
#define KVM_PTE_LEAF_ATTR_LO_S2_SH GENMASK(9, 8)
#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS 3
#define KVM_PTE_LEAF_ATTR_LO_S2_AF BIT(10)
#define KVM_PTE_LEAF_ATTR_HI GENMASK(63, 50)
#define KVM_PTE_LEAF_ATTR_HI_SW GENMASK(58, 55)
#define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54)
#define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54)
#define KVM_PTE_LEAF_ATTR_HI_S1_GP BIT(50)
#define KVM_PTE_LEAF_ATTR_S2_PERMS (KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
KVM_PTE_LEAF_ATTR_HI_S2_XN)
#define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2)
#define KVM_MAX_OWNER_ID 1
/*
* Used to indicate a pte for which a 'break-before-make' sequence is in
* progress.
*/
#define KVM_INVALID_PTE_LOCKED BIT(10)
static inline bool kvm_pte_valid(kvm_pte_t pte)
{
return pte & KVM_PTE_VALID;
}
static inline u64 kvm_pte_to_phys(kvm_pte_t pte)
{
KVM: arm64: Use LPA2 page-tables for stage2 and hyp stage1 Implement a simple policy whereby if the HW supports FEAT_LPA2 for the page size we are using, always use LPA2-style page-tables for stage 2 and hyp stage 1 (assuming an nvhe hyp), regardless of the VMM-requested IPA size or HW-implemented PA size. When in use we can now support up to 52-bit IPA and PA sizes. We use the previously created cpu feature to track whether LPA2 is supported for deciding whether to use the LPA2 or classic pte format. Note that FEAT_LPA2 brings support for bigger block mappings (512GB with 4KB, 64GB with 16KB). We explicitly don't enable these in the library because stage2_apply_range() works on batch sizes of the largest used block mapping, and increasing the size of the batch would lead to soft lockups. See commit 5994bc9e05c2 ("KVM: arm64: Limit stage2_apply_range() batch size to largest block"). With the addition of LPA2 support in the hypervisor, the PA size supported by the HW must be capped with a runtime decision, rather than simply using a compile-time decision based on PA_BITS. For example, on a system that advertises 52 bit PA but does not support FEAT_LPA2, A 4KB or 16KB kernel compiled with LPA2 support must still limit the PA size to 48 bits. Therefore, move the insertion of the PS field into TCR_EL2 out of __kvm_hyp_init assembly code and instead do it in cpu_prepare_hyp_mode() where the rest of TCR_EL2 is prepared. This allows us to figure out PS with kvm_get_parange(), which has the appropriate logic to ensure the above requirement. (and the PS field of VTCR_EL2 is already populated this way). Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231127111737.1897081-8-ryan.roberts@arm.com
2023-11-27 11:17:32 +00:00
u64 pa;
if (kvm_lpa2_is_enabled()) {
pa = pte & KVM_PTE_ADDR_MASK_LPA2;
pa |= FIELD_GET(KVM_PTE_ADDR_51_50_LPA2, pte) << 50;
} else {
pa = pte & KVM_PTE_ADDR_MASK;
if (PAGE_SHIFT == 16)
pa |= FIELD_GET(KVM_PTE_ADDR_51_48, pte) << 48;
}
return pa;
}
static inline kvm_pte_t kvm_phys_to_pte(u64 pa)
{
KVM: arm64: Use LPA2 page-tables for stage2 and hyp stage1 Implement a simple policy whereby if the HW supports FEAT_LPA2 for the page size we are using, always use LPA2-style page-tables for stage 2 and hyp stage 1 (assuming an nvhe hyp), regardless of the VMM-requested IPA size or HW-implemented PA size. When in use we can now support up to 52-bit IPA and PA sizes. We use the previously created cpu feature to track whether LPA2 is supported for deciding whether to use the LPA2 or classic pte format. Note that FEAT_LPA2 brings support for bigger block mappings (512GB with 4KB, 64GB with 16KB). We explicitly don't enable these in the library because stage2_apply_range() works on batch sizes of the largest used block mapping, and increasing the size of the batch would lead to soft lockups. See commit 5994bc9e05c2 ("KVM: arm64: Limit stage2_apply_range() batch size to largest block"). With the addition of LPA2 support in the hypervisor, the PA size supported by the HW must be capped with a runtime decision, rather than simply using a compile-time decision based on PA_BITS. For example, on a system that advertises 52 bit PA but does not support FEAT_LPA2, A 4KB or 16KB kernel compiled with LPA2 support must still limit the PA size to 48 bits. Therefore, move the insertion of the PS field into TCR_EL2 out of __kvm_hyp_init assembly code and instead do it in cpu_prepare_hyp_mode() where the rest of TCR_EL2 is prepared. This allows us to figure out PS with kvm_get_parange(), which has the appropriate logic to ensure the above requirement. (and the PS field of VTCR_EL2 is already populated this way). Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231127111737.1897081-8-ryan.roberts@arm.com
2023-11-27 11:17:32 +00:00
kvm_pte_t pte;
if (kvm_lpa2_is_enabled()) {
pte = pa & KVM_PTE_ADDR_MASK_LPA2;
pa &= GENMASK(51, 50);
pte |= FIELD_PREP(KVM_PTE_ADDR_51_50_LPA2, pa >> 50);
} else {
pte = pa & KVM_PTE_ADDR_MASK;
if (PAGE_SHIFT == 16) {
pa &= GENMASK(51, 48);
pte |= FIELD_PREP(KVM_PTE_ADDR_51_48, pa >> 48);
}
}
return pte;
}
static inline kvm_pfn_t kvm_pte_to_pfn(kvm_pte_t pte)
{
return __phys_to_pfn(kvm_pte_to_phys(pte));
}
static inline u64 kvm_granule_shift(s8 level)
{
/* Assumes KVM_PGTABLE_LAST_LEVEL is 3 */
return ARM64_HW_PGTABLE_LEVEL_SHIFT(level);
}
static inline u64 kvm_granule_size(s8 level)
{
return BIT(kvm_granule_shift(level));
}
static inline bool kvm_level_supports_block_mapping(s8 level)
{
return level >= KVM_PGTABLE_MIN_BLOCK_LEVEL;
}
static inline u32 kvm_supported_block_sizes(void)
{
s8 level = KVM_PGTABLE_MIN_BLOCK_LEVEL;
u32 r = 0;
for (; level <= KVM_PGTABLE_LAST_LEVEL; level++)
r |= BIT(kvm_granule_shift(level));
return r;
}
static inline bool kvm_is_block_size_supported(u64 size)
{
bool is_power_of_two = IS_ALIGNED(size, size);
return is_power_of_two && (size & kvm_supported_block_sizes());
}
/**
* struct kvm_pgtable_mm_ops - Memory management callbacks.
* @zalloc_page: Allocate a single zeroed memory page.
* The @arg parameter can be used by the walker
* to pass a memcache. The initial refcount of
* the page is 1.
* @zalloc_pages_exact: Allocate an exact number of zeroed memory pages.
* The @size parameter is in bytes, and is rounded
* up to the next page boundary. The resulting
* allocation is physically contiguous.
* @free_pages_exact: Free an exact number of memory pages previously
* allocated by zalloc_pages_exact.
* @free_unlinked_table: Free an unlinked paging structure by unlinking and
* dropping references.
* @get_page: Increment the refcount on a page.
* @put_page: Decrement the refcount on a page. When the
* refcount reaches 0 the page is automatically
* freed.
* @page_count: Return the refcount of a page.
* @phys_to_virt: Convert a physical address into a virtual
* address mapped in the current context.
* @virt_to_phys: Convert a virtual address mapped in the current
* context into a physical address.
* @dcache_clean_inval_poc: Clean and invalidate the data cache to the PoC
* for the specified memory address range.
* @icache_inval_pou: Invalidate the instruction cache to the PoU
* for the specified memory address range.
*/
struct kvm_pgtable_mm_ops {
void* (*zalloc_page)(void *arg);
void* (*zalloc_pages_exact)(size_t size);
void (*free_pages_exact)(void *addr, size_t size);
void (*free_unlinked_table)(void *addr, s8 level);
void (*get_page)(void *addr);
void (*put_page)(void *addr);
int (*page_count)(void *addr);
void* (*phys_to_virt)(phys_addr_t phys);
phys_addr_t (*virt_to_phys)(void *addr);
void (*dcache_clean_inval_poc)(void *addr, size_t size);
void (*icache_inval_pou)(void *addr, size_t size);
};
/**
* enum kvm_pgtable_stage2_flags - Stage-2 page-table flags.
* @KVM_PGTABLE_S2_NOFWB: Don't enforce Normal-WB even if the CPUs have
* ARM64_HAS_STAGE2_FWB.
* @KVM_PGTABLE_S2_IDMAP: Only use identity mappings.
*/
enum kvm_pgtable_stage2_flags {
KVM_PGTABLE_S2_NOFWB = BIT(0),
KVM_PGTABLE_S2_IDMAP = BIT(1),
};
/**
* enum kvm_pgtable_prot - Page-table permissions and attributes.
* @KVM_PGTABLE_PROT_X: Execute permission.
* @KVM_PGTABLE_PROT_W: Write permission.
* @KVM_PGTABLE_PROT_R: Read permission.
* @KVM_PGTABLE_PROT_DEVICE: Device attributes.
KVM: arm64: Introduce new flag for non-cacheable IO memory Currently, KVM for ARM64 maps at stage 2 memory that is considered device (i.e. it is not RAM) with DEVICE_nGnRE memory attributes; this setting overrides (as per the ARM architecture [1]) any device MMIO mapping present at stage 1, resulting in a set-up whereby a guest operating system cannot determine device MMIO mapping memory attributes on its own but it is always overridden by the KVM stage 2 default. This set-up does not allow guest operating systems to select device memory attributes independently from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage 2 memory type attributes"), which turns out to be an issue in that guest operating systems (e.g. Linux) may request to map devices MMIO regions with memory attributes that guarantee better performance (e.g. gathering attribute - that for some devices can generate larger PCIe memory writes TLPs) and specific operations (e.g. unaligned transactions) such as the NormalNC memory type. The default device stage 2 mapping was chosen in KVM for ARM64 since it was considered safer (i.e. it would not allow guests to trigger uncontained failures ultimately crashing the machine) but this turned out to be asynchronous (SError) defeating the purpose. Failures containability is a property of the platform and is independent from the memory type used for MMIO device memory mappings. Actually, DEVICE_nGnRE memory type is even more problematic than Normal-NC memory type in terms of faults containability in that e.g. aborts triggered on DEVICE_nGnRE loads cannot be made, architecturally, synchronous (i.e. that would imply that the processor should issue at most 1 load transaction at a time - it cannot pipeline them - otherwise the synchronous abort semantics would break the no-speculation attribute attached to DEVICE_XXX memory). This means that regardless of the combined stage1+stage2 mappings a platform is safe if and only if device transactions cannot trigger uncontained failures and that in turn relies on platform capabilities and the device type being assigned (i.e. PCIe AER/DPC error containment and RAS architecture[3]); therefore the default KVM device stage 2 memory attributes play no role in making device assignment safer for a given platform (if the platform design adheres to design guidelines outlined in [3]) and therefore can be relaxed. For all these reasons, relax the KVM stage 2 device memory attributes from DEVICE_nGnRE to Normal-NC. The NormalNC was chosen over a different Normal memory type default at stage-2 (e.g. Normal Write-through) to avoid cache allocation/snooping. Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to trigger any issue on guest device reclaim use cases either (i.e. device MMIO unmap followed by a device reset) at least for PCIe devices, in that in PCIe a device reset is architected and carried out through PCI config space transactions that are naturally ordered with respect to MMIO transactions according to the PCI ordering rules. Having Normal-NC S2 default puts guests in control (thanks to stage1+stage2 combined memory attributes rules [1]) of device MMIO regions memory mappings, according to the rules described in [1] and summarized here ([(S1) - stage1], [(S2) - stage 2]): S1 | S2 | Result NORMAL-WB | NORMAL-NC | NORMAL-NC NORMAL-WT | NORMAL-NC | NORMAL-NC NORMAL-NC | NORMAL-NC | NORMAL-NC DEVICE<attr> | NORMAL-NC | DEVICE<attr> It is worth noting that currently, to map devices MMIO space to user space in a device pass-through use case the VFIO framework applies memory attributes derived from pgprot_noncached() settings applied to VMAs, which result in device-nGnRnE memory attributes for the stage-1 VMM mappings. This means that a userspace mapping for device MMIO space carried out with the current VFIO framework and a guest OS mapping for the same MMIO space may result in a mismatched alias as described in [2]. Defaulting KVM device stage-2 mappings to Normal-NC attributes does not change anything in this respect, in that the mismatched aliases would only affect (refer to [2] for a detailed explanation) ordering between the userspace and GuestOS mappings resulting stream of transactions (i.e. it does not cause loss of property for either stream of transactions on its own), which is harmless given that the userspace and GuestOS access to the device is carried out through independent transactions streams. A Normal-NC flag is not present today. So add a new kvm_pgtable_prot (KVM_PGTABLE_PROT_NORMAL_NC) flag for it, along with its corresponding PTE value 0x5 (0b101) determined from [1]. Lastly, adapt the stage2 PTE property setter function (stage2_set_prot_attr) to handle the NormalNC attribute. The entire discussion leading to this patch series may be followed through the following links. Link: https://lore.kernel.org/all/20230907181459.18145-3-ankita@nvidia.com Link: https://lore.kernel.org/r/20231205033015.10044-1-ankita@nvidia.com [1] section D8.5.5 - DDI0487J_a_a-profile_architecture_reference_manual.pdf [2] section B2.8 - DDI0487J_a_a-profile_architecture_reference_manual.pdf [3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20240224150546.368-2-ankita@nvidia.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-02-24 20:35:43 +05:30
* @KVM_PGTABLE_PROT_NORMAL_NC: Normal noncacheable attributes.
* @KVM_PGTABLE_PROT_SW0: Software bit 0.
* @KVM_PGTABLE_PROT_SW1: Software bit 1.
* @KVM_PGTABLE_PROT_SW2: Software bit 2.
* @KVM_PGTABLE_PROT_SW3: Software bit 3.
*/
enum kvm_pgtable_prot {
KVM_PGTABLE_PROT_X = BIT(0),
KVM_PGTABLE_PROT_W = BIT(1),
KVM_PGTABLE_PROT_R = BIT(2),
KVM_PGTABLE_PROT_DEVICE = BIT(3),
KVM: arm64: Introduce new flag for non-cacheable IO memory Currently, KVM for ARM64 maps at stage 2 memory that is considered device (i.e. it is not RAM) with DEVICE_nGnRE memory attributes; this setting overrides (as per the ARM architecture [1]) any device MMIO mapping present at stage 1, resulting in a set-up whereby a guest operating system cannot determine device MMIO mapping memory attributes on its own but it is always overridden by the KVM stage 2 default. This set-up does not allow guest operating systems to select device memory attributes independently from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage 2 memory type attributes"), which turns out to be an issue in that guest operating systems (e.g. Linux) may request to map devices MMIO regions with memory attributes that guarantee better performance (e.g. gathering attribute - that for some devices can generate larger PCIe memory writes TLPs) and specific operations (e.g. unaligned transactions) such as the NormalNC memory type. The default device stage 2 mapping was chosen in KVM for ARM64 since it was considered safer (i.e. it would not allow guests to trigger uncontained failures ultimately crashing the machine) but this turned out to be asynchronous (SError) defeating the purpose. Failures containability is a property of the platform and is independent from the memory type used for MMIO device memory mappings. Actually, DEVICE_nGnRE memory type is even more problematic than Normal-NC memory type in terms of faults containability in that e.g. aborts triggered on DEVICE_nGnRE loads cannot be made, architecturally, synchronous (i.e. that would imply that the processor should issue at most 1 load transaction at a time - it cannot pipeline them - otherwise the synchronous abort semantics would break the no-speculation attribute attached to DEVICE_XXX memory). This means that regardless of the combined stage1+stage2 mappings a platform is safe if and only if device transactions cannot trigger uncontained failures and that in turn relies on platform capabilities and the device type being assigned (i.e. PCIe AER/DPC error containment and RAS architecture[3]); therefore the default KVM device stage 2 memory attributes play no role in making device assignment safer for a given platform (if the platform design adheres to design guidelines outlined in [3]) and therefore can be relaxed. For all these reasons, relax the KVM stage 2 device memory attributes from DEVICE_nGnRE to Normal-NC. The NormalNC was chosen over a different Normal memory type default at stage-2 (e.g. Normal Write-through) to avoid cache allocation/snooping. Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to trigger any issue on guest device reclaim use cases either (i.e. device MMIO unmap followed by a device reset) at least for PCIe devices, in that in PCIe a device reset is architected and carried out through PCI config space transactions that are naturally ordered with respect to MMIO transactions according to the PCI ordering rules. Having Normal-NC S2 default puts guests in control (thanks to stage1+stage2 combined memory attributes rules [1]) of device MMIO regions memory mappings, according to the rules described in [1] and summarized here ([(S1) - stage1], [(S2) - stage 2]): S1 | S2 | Result NORMAL-WB | NORMAL-NC | NORMAL-NC NORMAL-WT | NORMAL-NC | NORMAL-NC NORMAL-NC | NORMAL-NC | NORMAL-NC DEVICE<attr> | NORMAL-NC | DEVICE<attr> It is worth noting that currently, to map devices MMIO space to user space in a device pass-through use case the VFIO framework applies memory attributes derived from pgprot_noncached() settings applied to VMAs, which result in device-nGnRnE memory attributes for the stage-1 VMM mappings. This means that a userspace mapping for device MMIO space carried out with the current VFIO framework and a guest OS mapping for the same MMIO space may result in a mismatched alias as described in [2]. Defaulting KVM device stage-2 mappings to Normal-NC attributes does not change anything in this respect, in that the mismatched aliases would only affect (refer to [2] for a detailed explanation) ordering between the userspace and GuestOS mappings resulting stream of transactions (i.e. it does not cause loss of property for either stream of transactions on its own), which is harmless given that the userspace and GuestOS access to the device is carried out through independent transactions streams. A Normal-NC flag is not present today. So add a new kvm_pgtable_prot (KVM_PGTABLE_PROT_NORMAL_NC) flag for it, along with its corresponding PTE value 0x5 (0b101) determined from [1]. Lastly, adapt the stage2 PTE property setter function (stage2_set_prot_attr) to handle the NormalNC attribute. The entire discussion leading to this patch series may be followed through the following links. Link: https://lore.kernel.org/all/20230907181459.18145-3-ankita@nvidia.com Link: https://lore.kernel.org/r/20231205033015.10044-1-ankita@nvidia.com [1] section D8.5.5 - DDI0487J_a_a-profile_architecture_reference_manual.pdf [2] section B2.8 - DDI0487J_a_a-profile_architecture_reference_manual.pdf [3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20240224150546.368-2-ankita@nvidia.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-02-24 20:35:43 +05:30
KVM_PGTABLE_PROT_NORMAL_NC = BIT(4),
KVM_PGTABLE_PROT_SW0 = BIT(55),
KVM_PGTABLE_PROT_SW1 = BIT(56),
KVM_PGTABLE_PROT_SW2 = BIT(57),
KVM_PGTABLE_PROT_SW3 = BIT(58),
};
#define KVM_PGTABLE_PROT_RW (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
#define KVM_PGTABLE_PROT_RWX (KVM_PGTABLE_PROT_RW | KVM_PGTABLE_PROT_X)
#define PKVM_HOST_MEM_PROT KVM_PGTABLE_PROT_RWX
#define PKVM_HOST_MMIO_PROT KVM_PGTABLE_PROT_RW
#define PAGE_HYP KVM_PGTABLE_PROT_RW
#define PAGE_HYP_EXEC (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_X)
#define PAGE_HYP_RO (KVM_PGTABLE_PROT_R)
#define PAGE_HYP_DEVICE (PAGE_HYP | KVM_PGTABLE_PROT_DEVICE)
typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
enum kvm_pgtable_prot prot);
/**
* enum kvm_pgtable_walk_flags - Flags to control a depth-first page-table walk.
* @KVM_PGTABLE_WALK_LEAF: Visit leaf entries, including invalid
* entries.
* @KVM_PGTABLE_WALK_TABLE_PRE: Visit table entries before their
* children.
* @KVM_PGTABLE_WALK_TABLE_POST: Visit table entries after their
* children.
* @KVM_PGTABLE_WALK_SHARED: Indicates the page-tables may be shared
* with other software walkers.
* @KVM_PGTABLE_WALK_HANDLE_FAULT: Indicates the page-table walk was
* invoked from a fault handler.
* @KVM_PGTABLE_WALK_SKIP_BBM_TLBI: Visit and update table entries
* without Break-before-make's
* TLB invalidation.
* @KVM_PGTABLE_WALK_SKIP_CMO: Visit and update table entries
* without Cache maintenance
* operations required.
*/
enum kvm_pgtable_walk_flags {
KVM_PGTABLE_WALK_LEAF = BIT(0),
KVM_PGTABLE_WALK_TABLE_PRE = BIT(1),
KVM_PGTABLE_WALK_TABLE_POST = BIT(2),
KVM_PGTABLE_WALK_SHARED = BIT(3),
KVM_PGTABLE_WALK_HANDLE_FAULT = BIT(4),
KVM_PGTABLE_WALK_SKIP_BBM_TLBI = BIT(5),
KVM_PGTABLE_WALK_SKIP_CMO = BIT(6),
};
struct kvm_pgtable_visit_ctx {
kvm_pte_t *ptep;
kvm_pte_t old;
void *arg;
struct kvm_pgtable_mm_ops *mm_ops;
KVM: arm64: Infer the PA offset from IPA in stage-2 map walker Until now, the page table walker counted increments to the PA and IPA of a walk in two separate places. While the PA is incremented as soon as a leaf PTE is installed in stage2_map_walker_try_leaf(), the IPA is actually bumped in the generic table walker context. Critically, __kvm_pgtable_visit() rereads the PTE after the LEAF callback returns to work out if a table or leaf was installed, and only bumps the IPA for a leaf PTE. This arrangement worked fine when we handled faults behind the write lock, as the walker had exclusive access to the stage-2 page tables. However, commit 1577cb5823ce ("KVM: arm64: Handle stage-2 faults in parallel") started handling all stage-2 faults behind the read lock, opening up a race where a walker could increment the PA but not the IPA of a walk. Nothing good ensues, as the walker starts mapping with the incorrect IPA -> PA relationship. For example, assume that two vCPUs took a data abort on the same IPA. One observes that dirty logging is disabled, and the other observed that it is enabled: vCPU attempting PMD mapping vCPU attempting PTE mapping ====================================== ===================================== /* install PMD */ stage2_make_pte(ctx, leaf); data->phys += granule; /* replace PMD with a table */ stage2_try_break_pte(ctx, data->mmu); stage2_make_pte(ctx, table); /* table is observed */ ctx.old = READ_ONCE(*ptep); table = kvm_pte_table(ctx.old, level); /* * map walk continues w/o incrementing * IPA. */ __kvm_pgtable_walk(..., level + 1); Bring an end to the whole mess by using the IPA as the single source of truth for how far along a walk has gotten. Work out the correct PA to map by calculating the IPA offset from the beginning of the walk and add that to the starting physical address. Cc: stable@vger.kernel.org Fixes: 1577cb5823ce ("KVM: arm64: Handle stage-2 faults in parallel") Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230421071606.1603916-2-oliver.upton@linux.dev
2023-04-21 07:16:05 +00:00
u64 start;
u64 addr;
u64 end;
s8 level;
enum kvm_pgtable_walk_flags flags;
};
typedef int (*kvm_pgtable_visitor_fn_t)(const struct kvm_pgtable_visit_ctx *ctx,
enum kvm_pgtable_walk_flags visit);
static inline bool kvm_pgtable_walk_shared(const struct kvm_pgtable_visit_ctx *ctx)
{
return ctx->flags & KVM_PGTABLE_WALK_SHARED;
}
/**
* struct kvm_pgtable_walker - Hook into a page-table walk.
* @cb: Callback function to invoke during the walk.
* @arg: Argument passed to the callback function.
* @flags: Bitwise-OR of flags to identify the entry types on which to
* invoke the callback function.
*/
struct kvm_pgtable_walker {
const kvm_pgtable_visitor_fn_t cb;
void * const arg;
const enum kvm_pgtable_walk_flags flags;
};
/*
* RCU cannot be used in a non-kernel context such as the hyp. As such, page
* table walkers used in hyp do not call into RCU and instead use other
* synchronization mechanisms (such as a spinlock).
*/
#if defined(__KVM_NVHE_HYPERVISOR__) || defined(__KVM_VHE_HYPERVISOR__)
typedef kvm_pte_t *kvm_pteref_t;
static inline kvm_pte_t *kvm_dereference_pteref(struct kvm_pgtable_walker *walker,
kvm_pteref_t pteref)
{
return pteref;
}
static inline int kvm_pgtable_walk_begin(struct kvm_pgtable_walker *walker)
{
/*
* Due to the lack of RCU (or a similar protection scheme), only
* non-shared table walkers are allowed in the hypervisor.
*/
if (walker->flags & KVM_PGTABLE_WALK_SHARED)
return -EPERM;
return 0;
}
KVM: arm64: Don't acquire RCU read lock for exclusive table walks Marek reported a BUG resulting from the recent parallel faults changes, as the hyp stage-1 map walker attempted to allocate table memory while holding the RCU read lock: BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0 preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 2 locks held by swapper/0/1: #0: ffff80000a8a44d0 (kvm_hyp_pgd_mutex){+.+.}-{3:3}, at: __create_hyp_mappings+0x80/0xc4 #1: ffff80000a927720 (rcu_read_lock){....}-{1:2}, at: kvm_pgtable_walk+0x0/0x1f4 CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.1.0-rc3+ #5918 Hardware name: Raspberry Pi 3 Model B (DT) Call trace: dump_backtrace.part.0+0xe4/0xf0 show_stack+0x18/0x40 dump_stack_lvl+0x8c/0xb8 dump_stack+0x18/0x34 __might_resched+0x178/0x220 __might_sleep+0x48/0xa0 prepare_alloc_pages+0x178/0x1a0 __alloc_pages+0x9c/0x109c alloc_page_interleave+0x1c/0xc4 alloc_pages+0xec/0x160 get_zeroed_page+0x1c/0x44 kvm_hyp_zalloc_page+0x14/0x20 hyp_map_walker+0xd4/0x134 kvm_pgtable_visitor_cb.isra.0+0x38/0x5c __kvm_pgtable_walk+0x1a4/0x220 kvm_pgtable_walk+0x104/0x1f4 kvm_pgtable_hyp_map+0x80/0xc4 __create_hyp_mappings+0x9c/0xc4 kvm_mmu_init+0x144/0x1cc kvm_arch_init+0xe4/0xef4 kvm_init+0x3c/0x3d0 arm_init+0x20/0x30 do_one_initcall+0x74/0x400 kernel_init_freeable+0x2e0/0x350 kernel_init+0x24/0x130 ret_from_fork+0x10/0x20 Since the hyp stage-1 table walkers are serialized by kvm_hyp_pgd_mutex, RCU protection really doesn't add anything. Don't acquire the RCU read lock for an exclusive walk. Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221118182222.3932898-3-oliver.upton@linux.dev
2022-11-18 18:22:21 +00:00
static inline void kvm_pgtable_walk_end(struct kvm_pgtable_walker *walker) {}
static inline bool kvm_pgtable_walk_lock_held(void)
{
return true;
}
#else
typedef kvm_pte_t __rcu *kvm_pteref_t;
static inline kvm_pte_t *kvm_dereference_pteref(struct kvm_pgtable_walker *walker,
kvm_pteref_t pteref)
{
return rcu_dereference_check(pteref, !(walker->flags & KVM_PGTABLE_WALK_SHARED));
}
static inline int kvm_pgtable_walk_begin(struct kvm_pgtable_walker *walker)
{
KVM: arm64: Don't acquire RCU read lock for exclusive table walks Marek reported a BUG resulting from the recent parallel faults changes, as the hyp stage-1 map walker attempted to allocate table memory while holding the RCU read lock: BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0 preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 2 locks held by swapper/0/1: #0: ffff80000a8a44d0 (kvm_hyp_pgd_mutex){+.+.}-{3:3}, at: __create_hyp_mappings+0x80/0xc4 #1: ffff80000a927720 (rcu_read_lock){....}-{1:2}, at: kvm_pgtable_walk+0x0/0x1f4 CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.1.0-rc3+ #5918 Hardware name: Raspberry Pi 3 Model B (DT) Call trace: dump_backtrace.part.0+0xe4/0xf0 show_stack+0x18/0x40 dump_stack_lvl+0x8c/0xb8 dump_stack+0x18/0x34 __might_resched+0x178/0x220 __might_sleep+0x48/0xa0 prepare_alloc_pages+0x178/0x1a0 __alloc_pages+0x9c/0x109c alloc_page_interleave+0x1c/0xc4 alloc_pages+0xec/0x160 get_zeroed_page+0x1c/0x44 kvm_hyp_zalloc_page+0x14/0x20 hyp_map_walker+0xd4/0x134 kvm_pgtable_visitor_cb.isra.0+0x38/0x5c __kvm_pgtable_walk+0x1a4/0x220 kvm_pgtable_walk+0x104/0x1f4 kvm_pgtable_hyp_map+0x80/0xc4 __create_hyp_mappings+0x9c/0xc4 kvm_mmu_init+0x144/0x1cc kvm_arch_init+0xe4/0xef4 kvm_init+0x3c/0x3d0 arm_init+0x20/0x30 do_one_initcall+0x74/0x400 kernel_init_freeable+0x2e0/0x350 kernel_init+0x24/0x130 ret_from_fork+0x10/0x20 Since the hyp stage-1 table walkers are serialized by kvm_hyp_pgd_mutex, RCU protection really doesn't add anything. Don't acquire the RCU read lock for an exclusive walk. Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221118182222.3932898-3-oliver.upton@linux.dev
2022-11-18 18:22:21 +00:00
if (walker->flags & KVM_PGTABLE_WALK_SHARED)
rcu_read_lock();
return 0;
}
KVM: arm64: Don't acquire RCU read lock for exclusive table walks Marek reported a BUG resulting from the recent parallel faults changes, as the hyp stage-1 map walker attempted to allocate table memory while holding the RCU read lock: BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0 preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 2 locks held by swapper/0/1: #0: ffff80000a8a44d0 (kvm_hyp_pgd_mutex){+.+.}-{3:3}, at: __create_hyp_mappings+0x80/0xc4 #1: ffff80000a927720 (rcu_read_lock){....}-{1:2}, at: kvm_pgtable_walk+0x0/0x1f4 CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.1.0-rc3+ #5918 Hardware name: Raspberry Pi 3 Model B (DT) Call trace: dump_backtrace.part.0+0xe4/0xf0 show_stack+0x18/0x40 dump_stack_lvl+0x8c/0xb8 dump_stack+0x18/0x34 __might_resched+0x178/0x220 __might_sleep+0x48/0xa0 prepare_alloc_pages+0x178/0x1a0 __alloc_pages+0x9c/0x109c alloc_page_interleave+0x1c/0xc4 alloc_pages+0xec/0x160 get_zeroed_page+0x1c/0x44 kvm_hyp_zalloc_page+0x14/0x20 hyp_map_walker+0xd4/0x134 kvm_pgtable_visitor_cb.isra.0+0x38/0x5c __kvm_pgtable_walk+0x1a4/0x220 kvm_pgtable_walk+0x104/0x1f4 kvm_pgtable_hyp_map+0x80/0xc4 __create_hyp_mappings+0x9c/0xc4 kvm_mmu_init+0x144/0x1cc kvm_arch_init+0xe4/0xef4 kvm_init+0x3c/0x3d0 arm_init+0x20/0x30 do_one_initcall+0x74/0x400 kernel_init_freeable+0x2e0/0x350 kernel_init+0x24/0x130 ret_from_fork+0x10/0x20 Since the hyp stage-1 table walkers are serialized by kvm_hyp_pgd_mutex, RCU protection really doesn't add anything. Don't acquire the RCU read lock for an exclusive walk. Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221118182222.3932898-3-oliver.upton@linux.dev
2022-11-18 18:22:21 +00:00
static inline void kvm_pgtable_walk_end(struct kvm_pgtable_walker *walker)
{
KVM: arm64: Don't acquire RCU read lock for exclusive table walks Marek reported a BUG resulting from the recent parallel faults changes, as the hyp stage-1 map walker attempted to allocate table memory while holding the RCU read lock: BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0 preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 2 locks held by swapper/0/1: #0: ffff80000a8a44d0 (kvm_hyp_pgd_mutex){+.+.}-{3:3}, at: __create_hyp_mappings+0x80/0xc4 #1: ffff80000a927720 (rcu_read_lock){....}-{1:2}, at: kvm_pgtable_walk+0x0/0x1f4 CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.1.0-rc3+ #5918 Hardware name: Raspberry Pi 3 Model B (DT) Call trace: dump_backtrace.part.0+0xe4/0xf0 show_stack+0x18/0x40 dump_stack_lvl+0x8c/0xb8 dump_stack+0x18/0x34 __might_resched+0x178/0x220 __might_sleep+0x48/0xa0 prepare_alloc_pages+0x178/0x1a0 __alloc_pages+0x9c/0x109c alloc_page_interleave+0x1c/0xc4 alloc_pages+0xec/0x160 get_zeroed_page+0x1c/0x44 kvm_hyp_zalloc_page+0x14/0x20 hyp_map_walker+0xd4/0x134 kvm_pgtable_visitor_cb.isra.0+0x38/0x5c __kvm_pgtable_walk+0x1a4/0x220 kvm_pgtable_walk+0x104/0x1f4 kvm_pgtable_hyp_map+0x80/0xc4 __create_hyp_mappings+0x9c/0xc4 kvm_mmu_init+0x144/0x1cc kvm_arch_init+0xe4/0xef4 kvm_init+0x3c/0x3d0 arm_init+0x20/0x30 do_one_initcall+0x74/0x400 kernel_init_freeable+0x2e0/0x350 kernel_init+0x24/0x130 ret_from_fork+0x10/0x20 Since the hyp stage-1 table walkers are serialized by kvm_hyp_pgd_mutex, RCU protection really doesn't add anything. Don't acquire the RCU read lock for an exclusive walk. Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221118182222.3932898-3-oliver.upton@linux.dev
2022-11-18 18:22:21 +00:00
if (walker->flags & KVM_PGTABLE_WALK_SHARED)
rcu_read_unlock();
}
static inline bool kvm_pgtable_walk_lock_held(void)
{
return rcu_read_lock_held();
}
#endif
/**
* struct kvm_pgtable - KVM page-table.
* @ia_bits: Maximum input address size, in bits.
* @start_level: Level at which the page-table walk starts.
* @pgd: Pointer to the first top-level entry of the page-table.
* @mm_ops: Memory management callbacks.
* @mmu: Stage-2 KVM MMU struct. Unused for stage-1 page-tables.
* @flags: Stage-2 page-table flags.
* @force_pte_cb: Function that returns true if page level mappings must
* be used instead of block mappings.
*/
struct kvm_pgtable {
union {
struct rb_root_cached pkvm_mappings;
struct {
u32 ia_bits;
s8 start_level;
kvm_pteref_t pgd;
struct kvm_pgtable_mm_ops *mm_ops;
/* Stage-2 only */
enum kvm_pgtable_stage2_flags flags;
kvm_pgtable_force_pte_cb_t force_pte_cb;
};
};
struct kvm_s2_mmu *mmu;
};
/**
* kvm_pgtable_hyp_init() - Initialise a hypervisor stage-1 page-table.
* @pgt: Uninitialised page-table structure to initialise.
* @va_bits: Maximum virtual address bits.
* @mm_ops: Memory management callbacks.
*
* Return: 0 on success, negative error code on failure.
*/
int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits,
struct kvm_pgtable_mm_ops *mm_ops);
/**
* kvm_pgtable_hyp_destroy() - Destroy an unused hypervisor stage-1 page-table.
* @pgt: Page-table structure initialised by kvm_pgtable_hyp_init().
*
* The page-table is assumed to be unreachable by any hardware walkers prior
* to freeing and therefore no TLB invalidation is performed.
*/
void kvm_pgtable_hyp_destroy(struct kvm_pgtable *pgt);
/**
* kvm_pgtable_hyp_map() - Install a mapping in a hypervisor stage-1 page-table.
* @pgt: Page-table structure initialised by kvm_pgtable_hyp_init().
* @addr: Virtual address at which to place the mapping.
* @size: Size of the mapping.
* @phys: Physical address of the memory to map.
* @prot: Permissions and attributes for the mapping.
*
* The offset of @addr within a page is ignored, @size is rounded-up to
* the next page boundary and @phys is rounded-down to the previous page
* boundary.
*
* If device attributes are not explicitly requested in @prot, then the
* mapping will be normal, cacheable. Attempts to install a new mapping
* for a virtual address that is already mapped will be rejected with an
* error and a WARN().
*
* Return: 0 on success, negative error code on failure.
*/
int kvm_pgtable_hyp_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
enum kvm_pgtable_prot prot);
/**
* kvm_pgtable_hyp_unmap() - Remove a mapping from a hypervisor stage-1 page-table.
* @pgt: Page-table structure initialised by kvm_pgtable_hyp_init().
* @addr: Virtual address from which to remove the mapping.
* @size: Size of the mapping.
*
* The offset of @addr within a page is ignored, @size is rounded-up to
* the next page boundary and @phys is rounded-down to the previous page
* boundary.
*
* TLB invalidation is performed for each page-table entry cleared during the
* unmapping operation and the reference count for the page-table page
* containing the cleared entry is decremented, with unreferenced pages being
* freed. The unmapping operation will stop early if it encounters either an
* invalid page-table entry or a valid block mapping which maps beyond the range
* being unmapped.
*
* Return: Number of bytes unmapped, which may be 0.
*/
u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
/**
* kvm_get_vtcr() - Helper to construct VTCR_EL2
* @mmfr0: Sanitized value of SYS_ID_AA64MMFR0_EL1 register.
* @mmfr1: Sanitized value of SYS_ID_AA64MMFR1_EL1 register.
* @phys_shfit: Value to set in VTCR_EL2.T0SZ.
*
* The VTCR value is common across all the physical CPUs on the system.
* We use system wide sanitised values to fill in different fields,
* except for Hardware Management of Access Flags. HA Flag is set
* unconditionally on all CPUs, as it is safe to run with or without
* the feature and the bit is RES0 on CPUs that don't support it.
*
* Return: VTCR_EL2 value
*/
u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
/**
* kvm_pgtable_stage2_pgd_size() - Helper to compute size of a stage-2 PGD
* @vtcr: Content of the VTCR register.
*
* Return: the size (in bytes) of the stage-2 PGD
*/
size_t kvm_pgtable_stage2_pgd_size(u64 vtcr);
/**
* __kvm_pgtable_stage2_init() - Initialise a guest stage-2 page-table.
* @pgt: Uninitialised page-table structure to initialise.
* @mmu: S2 MMU context for this S2 translation
* @mm_ops: Memory management callbacks.
* @flags: Stage-2 configuration flags.
* @force_pte_cb: Function that returns true if page level mappings must
* be used instead of block mappings.
*
* Return: 0 on success, negative error code on failure.
*/
int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
struct kvm_pgtable_mm_ops *mm_ops,
enum kvm_pgtable_stage2_flags flags,
kvm_pgtable_force_pte_cb_t force_pte_cb);
static inline int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
struct kvm_pgtable_mm_ops *mm_ops)
{
return __kvm_pgtable_stage2_init(pgt, mmu, mm_ops, 0, NULL);
}
/**
* kvm_pgtable_stage2_destroy() - Destroy an unused guest stage-2 page-table.
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
*
* The page-table is assumed to be unreachable by any hardware walkers prior
* to freeing and therefore no TLB invalidation is performed.
*/
void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
/**
* kvm_pgtable_stage2_free_unlinked() - Free an unlinked stage-2 paging structure.
* @mm_ops: Memory management callbacks.
* @pgtable: Unlinked stage-2 paging structure to be freed.
* @level: Level of the stage-2 paging structure to be freed.
*
* The page-table is assumed to be unreachable by any hardware walkers prior to
* freeing and therefore no TLB invalidation is performed.
*/
void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, s8 level);
/**
* kvm_pgtable_stage2_create_unlinked() - Create an unlinked stage-2 paging structure.
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
* @phys: Physical address of the memory to map.
* @level: Starting level of the stage-2 paging structure to be created.
* @prot: Permissions and attributes for the mapping.
* @mc: Cache of pre-allocated and zeroed memory from which to allocate
* page-table pages.
* @force_pte: Force mappings to PAGE_SIZE granularity.
*
* Returns an unlinked page-table tree. This new page-table tree is
* not reachable (i.e., it is unlinked) from the root pgd and it's
* therefore unreachableby the hardware page-table walker. No TLB
* invalidation or CMOs are performed.
*
* If device attributes are not explicitly requested in @prot, then the
* mapping will be normal, cacheable.
*
* Return: The fully populated (unlinked) stage-2 paging structure, or
* an ERR_PTR(error) on failure.
*/
kvm_pte_t *kvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt,
u64 phys, s8 level,
enum kvm_pgtable_prot prot,
void *mc, bool force_pte);
/**
* kvm_pgtable_stage2_map() - Install a mapping in a guest stage-2 page-table.
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
* @addr: Intermediate physical address at which to place the mapping.
* @size: Size of the mapping.
* @phys: Physical address of the memory to map.
* @prot: Permissions and attributes for the mapping.
* @mc: Cache of pre-allocated and zeroed memory from which to allocate
* page-table pages.
* @flags: Flags to control the page-table walk (ex. a shared walk)
*
* The offset of @addr within a page is ignored, @size is rounded-up to
* the next page boundary and @phys is rounded-down to the previous page
* boundary.
*
* If device attributes are not explicitly requested in @prot, then the
* mapping will be normal, cacheable.
*
KVM: arm64: Filter out the case of only changing permissions from stage-2 map path (1) During running time of a a VM with numbers of vCPUs, if some vCPUs access the same GPA almost at the same time and the stage-2 mapping of the GPA has not been built yet, as a result they will all cause translation faults. The first vCPU builds the mapping, and the followed ones end up updating the valid leaf PTE. Note that these vCPUs might want different access permissions (RO, RW, RX, RWX, etc.). (2) It's inevitable that we sometimes will update an existing valid leaf PTE in the map path, and we perform break-before-make in this case. Then more unnecessary translation faults could be caused if the *break stage* of BBM is just catched by other vCPUS. With (1) and (2), something unsatisfactory could happen: vCPU A causes a translation fault and builds the mapping with RW permissions, vCPU B then update the valid leaf PTE with break-before-make and permissions are updated back to RO. Besides, *break stage* of BBM may trigger more translation faults. Finally, some useless small loops could occur. We can make some optimization to solve above problems: When we need to update a valid leaf PTE in the map path, let's filter out the case where this update only change access permissions, and don't update the valid leaf PTE here in this case. Instead, let the vCPU enter back the guest and it will exit next time to go through the relax_perms path without break-before-make if it still wants more permissions. Signed-off-by: Yanan Wang <wangyanan55@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210114121350.123684-3-wangyanan55@huawei.com
2021-01-14 20:13:49 +08:00
* Note that the update of a valid leaf PTE in this function will be aborted,
* if it's trying to recreate the exact same mapping or only change the access
* permissions. Instead, the vCPU will exit one more time from guest if still
* needed and then go through the path of relaxing permissions.
*
* Note that this function will both coalesce existing table entries and split
* existing block mappings, relying on page-faults to fault back areas outside
* of the new mapping lazily.
*
* Return: 0 on success, negative error code on failure.
*/
int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
u64 phys, enum kvm_pgtable_prot prot,
void *mc, enum kvm_pgtable_walk_flags flags);
/**
* kvm_pgtable_stage2_set_owner() - Unmap and annotate pages in the IPA space to
* track ownership.
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
* @addr: Base intermediate physical address to annotate.
* @size: Size of the annotated range.
* @mc: Cache of pre-allocated and zeroed memory from which to allocate
* page-table pages.
* @owner_id: Unique identifier for the owner of the page.
*
* By default, all page-tables are owned by identifier 0. This function can be
* used to mark portions of the IPA space as owned by other entities. When a
* stage 2 is used with identity-mappings, these annotations allow to use the
* page-table data structure as a simple rmap.
*
* Return: 0 on success, negative error code on failure.
*/
int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
void *mc, u8 owner_id);
/**
* kvm_pgtable_stage2_unmap() - Remove a mapping from a guest stage-2 page-table.
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
* @addr: Intermediate physical address from which to remove the mapping.
* @size: Size of the mapping.
*
* The offset of @addr within a page is ignored and @size is rounded-up to
* the next page boundary.
*
* TLB invalidation is performed for each page-table entry cleared during the
* unmapping operation and the reference count for the page-table page
* containing the cleared entry is decremented, with unreferenced pages being
* freed. Unmapping a cacheable page will ensure that it is clean to the PoC if
* FWB is not supported by the CPU.
*
* Return: 0 on success, negative error code on failure.
*/
int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
/**
* kvm_pgtable_stage2_wrprotect() - Write-protect guest stage-2 address range
* without TLB invalidation.
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
* @addr: Intermediate physical address from which to write-protect,
* @size: Size of the range.
*
* The offset of @addr within a page is ignored and @size is rounded-up to
* the next page boundary.
*
* Note that it is the caller's responsibility to invalidate the TLB after
* calling this function to ensure that the updated permissions are visible
* to the CPUs.
*
* Return: 0 on success, negative error code on failure.
*/
int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size);
/**
* kvm_pgtable_stage2_mkyoung() - Set the access flag in a page-table entry.
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
* @addr: Intermediate physical address to identify the page-table entry.
* @flags: Flags to control the page-table walk (ex. a shared walk)
*
* The offset of @addr within a page is ignored.
*
* If there is a valid, leaf page-table entry used to translate @addr, then
* set the access flag in that entry.
*/
void kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr,
enum kvm_pgtable_walk_flags flags);
/**
KVM: arm64: Correctly handle page aging notifiers for unaligned memslot Userspace is allowed to select any PAGE_SIZE aligned hva to back guest memory. This is even the case with hugepages, although it is a rather suboptimal configuration as PTE level mappings are used at stage-2. The arm64 page aging handlers have an assumption that the specified range is exactly one page/block of memory, which in the aforementioned case is not necessarily true. All together this leads to the WARN() in kvm_age_gfn() firing. However, the WARN is only part of the issue as the table walkers visit at most a single leaf PTE. For hugepage-backed memory in a memslot that isn't hugepage-aligned, page aging entirely misses accesses to the hugepage beyond the first page in the memslot. Add a new walker dedicated to handling page aging MMU notifiers capable of walking a range of PTEs. Convert kvm(_test)_age_gfn() over to the new walker and drop the WARN that caught the issue in the first place. The implementation of this walker was inspired by the test_clear_young() implementation by Yu Zhao [*], but repurposed to address a bug in the existing aging implementation. Cc: stable@vger.kernel.org # v5.15 Fixes: 056aad67f836 ("kvm: arm/arm64: Rework gpa callback handlers") Link: https://lore.kernel.org/kvmarm/20230526234435.662652-6-yuzhao@google.com/ Co-developed-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Reiji Watanabe <reijiw@google.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Link: https://lore.kernel.org/r/20230627235405.4069823-1-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-27 23:54:05 +00:00
* kvm_pgtable_stage2_test_clear_young() - Test and optionally clear the access
* flag in a page-table entry.
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
* @addr: Intermediate physical address to identify the page-table entry.
KVM: arm64: Correctly handle page aging notifiers for unaligned memslot Userspace is allowed to select any PAGE_SIZE aligned hva to back guest memory. This is even the case with hugepages, although it is a rather suboptimal configuration as PTE level mappings are used at stage-2. The arm64 page aging handlers have an assumption that the specified range is exactly one page/block of memory, which in the aforementioned case is not necessarily true. All together this leads to the WARN() in kvm_age_gfn() firing. However, the WARN is only part of the issue as the table walkers visit at most a single leaf PTE. For hugepage-backed memory in a memslot that isn't hugepage-aligned, page aging entirely misses accesses to the hugepage beyond the first page in the memslot. Add a new walker dedicated to handling page aging MMU notifiers capable of walking a range of PTEs. Convert kvm(_test)_age_gfn() over to the new walker and drop the WARN that caught the issue in the first place. The implementation of this walker was inspired by the test_clear_young() implementation by Yu Zhao [*], but repurposed to address a bug in the existing aging implementation. Cc: stable@vger.kernel.org # v5.15 Fixes: 056aad67f836 ("kvm: arm/arm64: Rework gpa callback handlers") Link: https://lore.kernel.org/kvmarm/20230526234435.662652-6-yuzhao@google.com/ Co-developed-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Reiji Watanabe <reijiw@google.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Link: https://lore.kernel.org/r/20230627235405.4069823-1-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-27 23:54:05 +00:00
* @size: Size of the address range to visit.
* @mkold: True if the access flag should be cleared.
*
* The offset of @addr within a page is ignored.
*
KVM: arm64: Correctly handle page aging notifiers for unaligned memslot Userspace is allowed to select any PAGE_SIZE aligned hva to back guest memory. This is even the case with hugepages, although it is a rather suboptimal configuration as PTE level mappings are used at stage-2. The arm64 page aging handlers have an assumption that the specified range is exactly one page/block of memory, which in the aforementioned case is not necessarily true. All together this leads to the WARN() in kvm_age_gfn() firing. However, the WARN is only part of the issue as the table walkers visit at most a single leaf PTE. For hugepage-backed memory in a memslot that isn't hugepage-aligned, page aging entirely misses accesses to the hugepage beyond the first page in the memslot. Add a new walker dedicated to handling page aging MMU notifiers capable of walking a range of PTEs. Convert kvm(_test)_age_gfn() over to the new walker and drop the WARN that caught the issue in the first place. The implementation of this walker was inspired by the test_clear_young() implementation by Yu Zhao [*], but repurposed to address a bug in the existing aging implementation. Cc: stable@vger.kernel.org # v5.15 Fixes: 056aad67f836 ("kvm: arm/arm64: Rework gpa callback handlers") Link: https://lore.kernel.org/kvmarm/20230526234435.662652-6-yuzhao@google.com/ Co-developed-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Reiji Watanabe <reijiw@google.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Link: https://lore.kernel.org/r/20230627235405.4069823-1-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-27 23:54:05 +00:00
* Tests and conditionally clears the access flag for every valid, leaf
* page-table entry used to translate the range [@addr, @addr + @size).
*
* Note that it is the caller's responsibility to invalidate the TLB after
* calling this function to ensure that the updated permissions are visible
* to the CPUs.
*
KVM: arm64: Correctly handle page aging notifiers for unaligned memslot Userspace is allowed to select any PAGE_SIZE aligned hva to back guest memory. This is even the case with hugepages, although it is a rather suboptimal configuration as PTE level mappings are used at stage-2. The arm64 page aging handlers have an assumption that the specified range is exactly one page/block of memory, which in the aforementioned case is not necessarily true. All together this leads to the WARN() in kvm_age_gfn() firing. However, the WARN is only part of the issue as the table walkers visit at most a single leaf PTE. For hugepage-backed memory in a memslot that isn't hugepage-aligned, page aging entirely misses accesses to the hugepage beyond the first page in the memslot. Add a new walker dedicated to handling page aging MMU notifiers capable of walking a range of PTEs. Convert kvm(_test)_age_gfn() over to the new walker and drop the WARN that caught the issue in the first place. The implementation of this walker was inspired by the test_clear_young() implementation by Yu Zhao [*], but repurposed to address a bug in the existing aging implementation. Cc: stable@vger.kernel.org # v5.15 Fixes: 056aad67f836 ("kvm: arm/arm64: Rework gpa callback handlers") Link: https://lore.kernel.org/kvmarm/20230526234435.662652-6-yuzhao@google.com/ Co-developed-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Reiji Watanabe <reijiw@google.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Link: https://lore.kernel.org/r/20230627235405.4069823-1-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-27 23:54:05 +00:00
* Return: True if any of the visited PTEs had the access flag set.
*/
KVM: arm64: Correctly handle page aging notifiers for unaligned memslot Userspace is allowed to select any PAGE_SIZE aligned hva to back guest memory. This is even the case with hugepages, although it is a rather suboptimal configuration as PTE level mappings are used at stage-2. The arm64 page aging handlers have an assumption that the specified range is exactly one page/block of memory, which in the aforementioned case is not necessarily true. All together this leads to the WARN() in kvm_age_gfn() firing. However, the WARN is only part of the issue as the table walkers visit at most a single leaf PTE. For hugepage-backed memory in a memslot that isn't hugepage-aligned, page aging entirely misses accesses to the hugepage beyond the first page in the memslot. Add a new walker dedicated to handling page aging MMU notifiers capable of walking a range of PTEs. Convert kvm(_test)_age_gfn() over to the new walker and drop the WARN that caught the issue in the first place. The implementation of this walker was inspired by the test_clear_young() implementation by Yu Zhao [*], but repurposed to address a bug in the existing aging implementation. Cc: stable@vger.kernel.org # v5.15 Fixes: 056aad67f836 ("kvm: arm/arm64: Rework gpa callback handlers") Link: https://lore.kernel.org/kvmarm/20230526234435.662652-6-yuzhao@google.com/ Co-developed-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Reiji Watanabe <reijiw@google.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Link: https://lore.kernel.org/r/20230627235405.4069823-1-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-27 23:54:05 +00:00
bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
u64 size, bool mkold);
/**
* kvm_pgtable_stage2_relax_perms() - Relax the permissions enforced by a
* page-table entry.
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
* @addr: Intermediate physical address to identify the page-table entry.
* @prot: Additional permissions to grant for the mapping.
* @flags: Flags to control the page-table walk (ex. a shared walk)
*
* The offset of @addr within a page is ignored.
*
* If there is a valid, leaf page-table entry used to translate @addr, then
* relax the permissions in that entry according to the read, write and
* execute permissions specified by @prot. No permissions are removed, and
* TLB invalidation is performed after updating the entry. Software bits cannot
* be set or cleared using kvm_pgtable_stage2_relax_perms().
*
* Return: 0 on success, negative error code on failure.
*/
int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
enum kvm_pgtable_prot prot,
enum kvm_pgtable_walk_flags flags);
/**
* kvm_pgtable_stage2_flush_range() - Clean and invalidate data cache to Point
* of Coherency for guest stage-2 address
* range.
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
* @addr: Intermediate physical address from which to flush.
* @size: Size of the range.
*
* The offset of @addr within a page is ignored and @size is rounded-up to
* the next page boundary.
*
* Return: 0 on success, negative error code on failure.
*/
int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);
/**
* kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
* to PAGE_SIZE guest pages.
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init().
* @addr: Intermediate physical address from which to split.
* @size: Size of the range.
* @mc: Cache of pre-allocated and zeroed memory from which to allocate
* page-table pages.
*
* The function tries to split any level 1 or 2 entry that overlaps
* with the input range (given by @addr and @size).
*
* Return: 0 on success, negative error code on failure. Note that
* kvm_pgtable_stage2_split() is best effort: it tries to break as many
* blocks in the input range as allowed by @mc_capacity.
*/
int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
struct kvm_mmu_memory_cache *mc);
/**
* kvm_pgtable_walk() - Walk a page-table.
* @pgt: Page-table structure initialised by kvm_pgtable_*_init().
* @addr: Input address for the start of the walk.
* @size: Size of the range to walk.
* @walker: Walker callback description.
*
* The offset of @addr within a page is ignored and @size is rounded-up to
* the next page boundary.
*
* The walker will walk the page-table entries corresponding to the input
* address range specified, visiting entries according to the walker flags.
* Invalid entries are treated as leaf entries. The visited page table entry is
* reloaded after invoking the walker callback, allowing the walker to descend
* into a newly installed table.
*
* Returning a negative error code from the walker callback function will
* terminate the walk immediately with the same error code.
*
* Return: 0 on success, negative error code on failure.
*/
int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
struct kvm_pgtable_walker *walker);
/**
* kvm_pgtable_get_leaf() - Walk a page-table and retrieve the leaf entry
* with its level.
* @pgt: Page-table structure initialised by kvm_pgtable_*_init()
* or a similar initialiser.
* @addr: Input address for the start of the walk.
* @ptep: Pointer to storage for the retrieved PTE.
* @level: Pointer to storage for the level of the retrieved PTE.
*
* The offset of @addr within a page is ignored.
*
* The walker will walk the page-table entries corresponding to the input
* address specified, retrieving the leaf corresponding to this address.
* Invalid entries are treated as leaf entries.
*
* Return: 0 on success, negative error code on failure.
*/
int kvm_pgtable_get_leaf(struct kvm_pgtable *pgt, u64 addr,
kvm_pte_t *ptep, s8 *level);
/**
* kvm_pgtable_stage2_pte_prot() - Retrieve the protection attributes of a
* stage-2 Page-Table Entry.
* @pte: Page-table entry
*
* Return: protection attributes of the page-table entry in the enum
* kvm_pgtable_prot format.
*/
enum kvm_pgtable_prot kvm_pgtable_stage2_pte_prot(kvm_pte_t pte);
/**
* kvm_pgtable_hyp_pte_prot() - Retrieve the protection attributes of a stage-1
* Page-Table Entry.
* @pte: Page-table entry
*
* Return: protection attributes of the page-table entry in the enum
* kvm_pgtable_prot format.
*/
enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte);
/**
* kvm_tlb_flush_vmid_range() - Invalidate/flush a range of TLB entries
*
* @mmu: Stage-2 KVM MMU struct
* @addr: The base Intermediate physical address from which to invalidate
* @size: Size of the range from the base to invalidate
*/
void kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
phys_addr_t addr, size_t size);
#endif /* __ARM64_KVM_PGTABLE_H__ */