// SPDX-License-Identifier: GPL-2.0-only
/*
 * Copyright (C) 2020 Google LLC
 * Author: Will Deacon <will@kernel.org>
 */

#ifndef __ARM64_KVM_PGTABLE_H__
#define __ARM64_KVM_PGTABLE_H__

#include <linux/bits.h>
#include <linux/kvm_host.h>
#include <linux/types.h>

#define KVM_PGTABLE_FIRST_LEVEL		-1
#define KVM_PGTABLE_LAST_LEVEL		3

/*
 * The largest supported block sizes for KVM (no 52-bit PA support):
 *  - 4K (level 1):	1GB
 *  - 16K (level 2):	32MB
 *  - 64K (level 2):	512MB
 */
#ifdef CONFIG_ARM64_4K_PAGES
#define KVM_PGTABLE_MIN_BLOCK_LEVEL	1
#else
#define KVM_PGTABLE_MIN_BLOCK_LEVEL	2
#endif

#define kvm_lpa2_is_enabled()		system_supports_lpa2()

static inline u64 kvm_get_parange_max(void)
{
	if (kvm_lpa2_is_enabled() ||
	   (IS_ENABLED(CONFIG_ARM64_PA_BITS_52) && PAGE_SHIFT == 16))
		return ID_AA64MMFR0_EL1_PARANGE_52;
	else
		return ID_AA64MMFR0_EL1_PARANGE_48;
}

static inline u64 kvm_get_parange(u64 mmfr0)
{
	u64 parange_max = kvm_get_parange_max();
	u64 parange = cpuid_feature_extract_unsigned_field(mmfr0,
				ID_AA64MMFR0_EL1_PARANGE_SHIFT);

	if (parange > parange_max)
		parange = parange_max;

	return parange;
}

typedef u64 kvm_pte_t;

#define KVM_PTE_VALID			BIT(0)

#define KVM_PTE_ADDR_MASK		GENMASK(47, PAGE_SHIFT)
#define KVM_PTE_ADDR_51_48		GENMASK(15, 12)

#define KVM_PTE_ADDR_MASK_LPA2		GENMASK(49, PAGE_SHIFT)
#define KVM_PTE_ADDR_51_50_LPA2		GENMASK(9, 8)

#define KVM_PHYS_INVALID		(-1ULL)

#define KVM_PTE_TYPE			BIT(1)
#define KVM_PTE_TYPE_BLOCK		0
#define KVM_PTE_TYPE_PAGE		1
#define KVM_PTE_TYPE_TABLE		1

#define KVM_PTE_LEAF_ATTR_LO		GENMASK(11, 2)

#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX	GENMASK(4, 2)
#define KVM_PTE_LEAF_ATTR_LO_S1_AP	GENMASK(7, 6)
#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO	\
	({ cpus_have_final_cap(ARM64_KVM_HVHE) ? 2 : 3; })
#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW	\
	({ cpus_have_final_cap(ARM64_KVM_HVHE) ? 0 : 1; })
#define KVM_PTE_LEAF_ATTR_LO_S1_SH	GENMASK(9, 8)
#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS	3
#define KVM_PTE_LEAF_ATTR_LO_S1_AF	BIT(10)

#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR	GENMASK(5, 2)
#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R	BIT(6)
#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W	BIT(7)
#define KVM_PTE_LEAF_ATTR_LO_S2_SH	GENMASK(9, 8)
#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS	3
#define KVM_PTE_LEAF_ATTR_LO_S2_AF	BIT(10)

#define KVM_PTE_LEAF_ATTR_HI		GENMASK(63, 50)

#define KVM_PTE_LEAF_ATTR_HI_SW		GENMASK(58, 55)

#define KVM_PTE_LEAF_ATTR_HI_S1_XN	BIT(54)

#define KVM_PTE_LEAF_ATTR_HI_S2_XN	BIT(54)

#define KVM_PTE_LEAF_ATTR_HI_S1_GP	BIT(50)

#define KVM_PTE_LEAF_ATTR_S2_PERMS	(KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
					 KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
					 KVM_PTE_LEAF_ATTR_HI_S2_XN)

#define KVM_INVALID_PTE_OWNER_MASK	GENMASK(9, 2)
#define KVM_MAX_OWNER_ID		1

/*
 * Used to indicate a pte for which a 'break-before-make' sequence is in
 * progress.
 */
#define KVM_INVALID_PTE_LOCKED		BIT(10)

static inline bool kvm_pte_valid(kvm_pte_t pte)
{
	return pte & KVM_PTE_VALID;
}

static inline u64 kvm_pte_to_phys(kvm_pte_t pte)
{
	u64 pa;

	if (kvm_lpa2_is_enabled()) {
		pa = pte & KVM_PTE_ADDR_MASK_LPA2;
		pa |= FIELD_GET(KVM_PTE_ADDR_51_50_LPA2, pte) << 50;
	} else {
		pa = pte & KVM_PTE_ADDR_MASK;
		if (PAGE_SHIFT == 16)
			pa |= FIELD_GET(KVM_PTE_ADDR_51_48, pte) << 48;
	}

	return pa;
}

static inline kvm_pte_t kvm_phys_to_pte(u64 pa)
{
	kvm_pte_t pte;

	if (kvm_lpa2_is_enabled()) {
		pte = pa & KVM_PTE_ADDR_MASK_LPA2;
		pa &= GENMASK(51, 50);
		pte |= FIELD_PREP(KVM_PTE_ADDR_51_50_LPA2, pa >> 50);
	} else {
		pte = pa & KVM_PTE_ADDR_MASK;
		if (PAGE_SHIFT == 16) {
			pa &= GENMASK(51, 48);
			pte |= FIELD_PREP(KVM_PTE_ADDR_51_48, pa >> 48);
		}
	}

	return pte;
}

static inline kvm_pfn_t kvm_pte_to_pfn(kvm_pte_t pte)
{
	return __phys_to_pfn(kvm_pte_to_phys(pte));
}

static inline u64 kvm_granule_shift(s8 level)
{
	/* Assumes KVM_PGTABLE_LAST_LEVEL is 3 */
	return ARM64_HW_PGTABLE_LEVEL_SHIFT(level);
}

static inline u64 kvm_granule_size(s8 level)
{
	return BIT(kvm_granule_shift(level));
}

static inline bool kvm_level_supports_block_mapping(s8 level)
{
	return level >= KVM_PGTABLE_MIN_BLOCK_LEVEL;
}

static inline u32 kvm_supported_block_sizes(void)
{
	s8 level = KVM_PGTABLE_MIN_BLOCK_LEVEL;
	u32 r = 0;

	for (; level <= KVM_PGTABLE_LAST_LEVEL; level++)
		r |= BIT(kvm_granule_shift(level));

	return r;
}

static inline bool kvm_is_block_size_supported(u64 size)
{
	bool is_power_of_two = IS_ALIGNED(size, size);

	return is_power_of_two && (size & kvm_supported_block_sizes());
}

/**
 * struct kvm_pgtable_mm_ops - Memory management callbacks.
 * @zalloc_page:		Allocate a single zeroed memory page.
 *				The @arg parameter can be used by the walker
 *				to pass a memcache. The initial refcount of
 *				the page is 1.
 * @zalloc_pages_exact:		Allocate an exact number of zeroed memory pages.
 *				The @size parameter is in bytes, and is rounded
 *				up to the next page boundary. The resulting
 *				allocation is physically contiguous.
 * @free_pages_exact:		Free an exact number of memory pages previously
 *				allocated by zalloc_pages_exact.
 * @free_unlinked_table:	Free an unlinked paging structure by unlinking and
 *				dropping references.
 * @get_page:			Increment the refcount on a page.
 * @put_page:			Decrement the refcount on a page. When the
 *				refcount reaches 0 the page is automatically
 *				freed.
 * @page_count:			Return the refcount of a page.
 * @phys_to_virt:		Convert a physical address into a virtual
 *				address mapped in the current context.
 * @virt_to_phys:		Convert a virtual address mapped in the current
 *				context into a physical address.
 * @dcache_clean_inval_poc:	Clean and invalidate the data cache to the PoC
 *				for the specified memory address range.
 * @icache_inval_pou:		Invalidate the instruction cache to the PoU
 *				for the specified memory address range.
 */
struct kvm_pgtable_mm_ops {
	void*		(*zalloc_page)(void *arg);
	void*		(*zalloc_pages_exact)(size_t size);
	void		(*free_pages_exact)(void *addr, size_t size);
	void		(*free_unlinked_table)(void *addr, s8 level);
	void		(*get_page)(void *addr);
	void		(*put_page)(void *addr);
	int		(*page_count)(void *addr);
	void*		(*phys_to_virt)(phys_addr_t phys);
	phys_addr_t	(*virt_to_phys)(void *addr);
	void		(*dcache_clean_inval_poc)(void *addr, size_t size);
	void		(*icache_inval_pou)(void *addr, size_t size);
};

/**
 * enum kvm_pgtable_stage2_flags - Stage-2 page-table flags.
 * @KVM_PGTABLE_S2_NOFWB:	Don't enforce Normal-WB even if the CPUs have
 *				ARM64_HAS_STAGE2_FWB.
 * @KVM_PGTABLE_S2_IDMAP:	Only use identity mappings.
 */
enum kvm_pgtable_stage2_flags {
	KVM_PGTABLE_S2_NOFWB			= BIT(0),
	KVM_PGTABLE_S2_IDMAP			= BIT(1),
};

2020-09-11 14:25:10 +01:00
|
|
|
/**
|
|
|
|
* enum kvm_pgtable_prot - Page-table permissions and attributes.
|
|
|
|
* @KVM_PGTABLE_PROT_X: Execute permission.
|
|
|
|
* @KVM_PGTABLE_PROT_W: Write permission.
|
|
|
|
* @KVM_PGTABLE_PROT_R: Read permission.
|
|
|
|
* @KVM_PGTABLE_PROT_DEVICE: Device attributes.
|
KVM: arm64: Introduce new flag for non-cacheable IO memory
Currently, KVM for ARM64 maps at stage 2 memory that is considered device
(i.e. it is not RAM) with DEVICE_nGnRE memory attributes; this setting
overrides (as per the ARM architecture [1]) any device MMIO mapping
present at stage 1, resulting in a set-up whereby a guest operating
system cannot determine device MMIO mapping memory attributes on its
own but it is always overridden by the KVM stage 2 default.
This set-up does not allow guest operating systems to select device
memory attributes independently from KVM stage-2 mappings
(refer to [1], "Combining stage 1 and stage 2 memory type attributes"),
which turns out to be an issue in that guest operating systems
(e.g. Linux) may request to map devices MMIO regions with memory
attributes that guarantee better performance (e.g. gathering
attribute - that for some devices can generate larger PCIe memory
writes TLPs) and specific operations (e.g. unaligned transactions)
such as the NormalNC memory type.
The default device stage 2 mapping was chosen in KVM for ARM64 since
it was considered safer (i.e. it would not allow guests to trigger
uncontained failures ultimately crashing the machine) but this
turned out to be asynchronous (SError) defeating the purpose.
Failures containability is a property of the platform and is independent
from the memory type used for MMIO device memory mappings.
Actually, DEVICE_nGnRE memory type is even more problematic than
Normal-NC memory type in terms of faults containability in that e.g.
aborts triggered on DEVICE_nGnRE loads cannot be made, architecturally,
synchronous (i.e. that would imply that the processor should issue at
most 1 load transaction at a time - it cannot pipeline them - otherwise
the synchronous abort semantics would break the no-speculation attribute
attached to DEVICE_XXX memory).
This means that regardless of the combined stage1+stage2 mappings a
platform is safe if and only if device transactions cannot trigger
uncontained failures and that in turn relies on platform capabilities
and the device type being assigned (i.e. PCIe AER/DPC error containment
and RAS architecture[3]); therefore the default KVM device stage 2
memory attributes play no role in making device assignment safer
for a given platform (if the platform design adheres to design
guidelines outlined in [3]) and therefore can be relaxed.
For all these reasons, relax the KVM stage 2 device memory attributes
from DEVICE_nGnRE to Normal-NC.
The NormalNC was chosen over a different Normal memory type default
at stage-2 (e.g. Normal Write-through) to avoid cache allocation/snooping.
Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
trigger any issue on guest device reclaim use cases either (i.e. device
MMIO unmap followed by a device reset) at least for PCIe devices, in that
in PCIe a device reset is architected and carried out through PCI config
space transactions that are naturally ordered with respect to MMIO
transactions according to the PCI ordering rules.
Having Normal-NC S2 default puts guests in control (thanks to
stage1+stage2 combined memory attributes rules [1]) of device MMIO
regions memory mappings, according to the rules described in [1]
and summarized here ([(S1) - stage1], [(S2) - stage 2]):
S1 | S2 | Result
NORMAL-WB | NORMAL-NC | NORMAL-NC
NORMAL-WT | NORMAL-NC | NORMAL-NC
NORMAL-NC | NORMAL-NC | NORMAL-NC
DEVICE<attr> | NORMAL-NC | DEVICE<attr>
It is worth noting that currently, to map devices MMIO space to user
space in a device pass-through use case the VFIO framework applies memory
attributes derived from pgprot_noncached() settings applied to VMAs, which
result in device-nGnRnE memory attributes for the stage-1 VMM mappings.
This means that a userspace mapping for device MMIO space carried
out with the current VFIO framework and a guest OS mapping for the same
MMIO space may result in a mismatched alias as described in [2].
Defaulting KVM device stage-2 mappings to Normal-NC attributes does not
change anything in this respect, in that the mismatched aliases would
only affect (refer to [2] for a detailed explanation) ordering between
the userspace and GuestOS mappings resulting stream of transactions
(i.e. it does not cause loss of property for either stream of
transactions on its own), which is harmless given that the userspace
and GuestOS access to the device is carried out through independent
transactions streams.
A Normal-NC flag is not present today. So add a new kvm_pgtable_prot
(KVM_PGTABLE_PROT_NORMAL_NC) flag for it, along with its
corresponding PTE value 0x5 (0b101) determined from [1].
Lastly, adapt the stage2 PTE property setter function
(stage2_set_prot_attr) to handle the NormalNC attribute.
The entire discussion leading to this patch series may be followed through
the following links.
Link: https://lore.kernel.org/all/20230907181459.18145-3-ankita@nvidia.com
Link: https://lore.kernel.org/r/20231205033015.10044-1-ankita@nvidia.com
[1] section D8.5.5 - DDI0487J_a_a-profile_architecture_reference_manual.pdf
[2] section B2.8 - DDI0487J_a_a-profile_architecture_reference_manual.pdf
[3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20240224150546.368-2-ankita@nvidia.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-02-24 20:35:43 +05:30
|
|
|
* @KVM_PGTABLE_PROT_NORMAL_NC: Normal noncacheable attributes.
|
2021-08-09 16:24:38 +01:00
|
|
|
* @KVM_PGTABLE_PROT_SW0: Software bit 0.
|
|
|
|
* @KVM_PGTABLE_PROT_SW1: Software bit 1.
|
|
|
|
* @KVM_PGTABLE_PROT_SW2: Software bit 2.
|
|
|
|
* @KVM_PGTABLE_PROT_SW3: Software bit 3.
|
2020-09-11 14:25:10 +01:00
|
|
|
*/
|
|
|
|
enum kvm_pgtable_prot {
|
|
|
|
KVM_PGTABLE_PROT_X = BIT(0),
|
|
|
|
KVM_PGTABLE_PROT_W = BIT(1),
|
|
|
|
KVM_PGTABLE_PROT_R = BIT(2),
|
|
|
|
|
|
|
|
KVM_PGTABLE_PROT_DEVICE = BIT(3),
|
KVM: arm64: Introduce new flag for non-cacheable IO memory
Currently, KVM for ARM64 maps at stage 2 memory that is considered device
(i.e. it is not RAM) with DEVICE_nGnRE memory attributes; this setting
overrides (as per the ARM architecture [1]) any device MMIO mapping
present at stage 1, resulting in a set-up whereby a guest operating
system cannot determine device MMIO mapping memory attributes on its
own but it is always overridden by the KVM stage 2 default.
This set-up does not allow guest operating systems to select device
memory attributes independently from KVM stage-2 mappings
(refer to [1], "Combining stage 1 and stage 2 memory type attributes"),
which turns out to be an issue in that guest operating systems
(e.g. Linux) may request to map devices MMIO regions with memory
attributes that guarantee better performance (e.g. gathering
attribute - that for some devices can generate larger PCIe memory
writes TLPs) and specific operations (e.g. unaligned transactions)
such as the NormalNC memory type.
The default device stage 2 mapping was chosen in KVM for ARM64 since
it was considered safer (i.e. it would not allow guests to trigger
uncontained failures ultimately crashing the machine) but this
turned out to be asynchronous (SError) defeating the purpose.
Failures containability is a property of the platform and is independent
from the memory type used for MMIO device memory mappings.
Actually, DEVICE_nGnRE memory type is even more problematic than
Normal-NC memory type in terms of faults containability in that e.g.
aborts triggered on DEVICE_nGnRE loads cannot be made, architecturally,
synchronous (i.e. that would imply that the processor should issue at
most 1 load transaction at a time - it cannot pipeline them - otherwise
the synchronous abort semantics would break the no-speculation attribute
attached to DEVICE_XXX memory).
This means that regardless of the combined stage1+stage2 mappings a
platform is safe if and only if device transactions cannot trigger
uncontained failures and that in turn relies on platform capabilities
and the device type being assigned (i.e. PCIe AER/DPC error containment
and RAS architecture[3]); therefore the default KVM device stage 2
memory attributes play no role in making device assignment safer
for a given platform (if the platform design adheres to design
guidelines outlined in [3]) and therefore can be relaxed.
For all these reasons, relax the KVM stage 2 device memory attributes
from DEVICE_nGnRE to Normal-NC.
The NormalNC was chosen over a different Normal memory type default
at stage-2 (e.g. Normal Write-through) to avoid cache allocation/snooping.
Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
trigger any issue on guest device reclaim use cases either (i.e. device
MMIO unmap followed by a device reset), at least for PCIe devices: in
PCIe, a device reset is architected and carried out through PCI config
space transactions, which are naturally ordered with respect to MMIO
transactions according to the PCI ordering rules.
A Normal-NC S2 default puts guests in control (thanks to the
stage1+stage2 combined memory attribute rules [1]) of device MMIO
region memory mappings, according to the rules described in [1]
and summarized here ([(S1) - stage 1], [(S2) - stage 2]):
S1 | S2 | Result
NORMAL-WB | NORMAL-NC | NORMAL-NC
NORMAL-WT | NORMAL-NC | NORMAL-NC
NORMAL-NC | NORMAL-NC | NORMAL-NC
DEVICE<attr> | NORMAL-NC | DEVICE<attr>
It is worth noting that currently, to map device MMIO space to user
space in a device pass-through use case, the VFIO framework applies
memory attributes derived from pgprot_noncached() settings applied to
VMAs, which result in Device-nGnRnE memory attributes for the stage-1
VMM mappings.
This means that a userspace mapping for device MMIO space carried
out with the current VFIO framework and a guest OS mapping for the same
MMIO space may result in a mismatched alias as described in [2].
Defaulting KVM device stage-2 mappings to Normal-NC attributes does not
change anything in this respect, in that the mismatched aliases would
only affect (refer to [2] for a detailed explanation) the ordering
between the streams of transactions resulting from the userspace and
guest OS mappings (i.e. neither stream loses any property on its own),
which is harmless given that userspace and the guest OS access the
device through independent transaction streams.
A Normal-NC flag is not present today, so add a new kvm_pgtable_prot
flag (KVM_PGTABLE_PROT_NORMAL_NC) for it, along with its
corresponding PTE value 0x5 (0b101) determined from [1].
Lastly, adapt the stage-2 PTE property setter function
(stage2_set_prot_attr) to handle the Normal-NC attribute.
The entire discussion leading to this patch series may be followed through
the following links.
Link: https://lore.kernel.org/all/20230907181459.18145-3-ankita@nvidia.com
Link: https://lore.kernel.org/r/20231205033015.10044-1-ankita@nvidia.com
[1] section D8.5.5 - DDI0487J_a_a-profile_architecture_reference_manual.pdf
[2] section B2.8 - DDI0487J_a_a-profile_architecture_reference_manual.pdf
[3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20240224150546.368-2-ankita@nvidia.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-02-24 20:35:43 +05:30
|
|
|
KVM_PGTABLE_PROT_NORMAL_NC = BIT(4),
|
2021-08-09 16:24:38 +01:00
|
|
|
|
|
|
|
KVM_PGTABLE_PROT_SW0 = BIT(55),
|
|
|
|
KVM_PGTABLE_PROT_SW1 = BIT(56),
|
|
|
|
KVM_PGTABLE_PROT_SW2 = BIT(57),
|
|
|
|
KVM_PGTABLE_PROT_SW3 = BIT(58),
|
2020-09-11 14:25:10 +01:00
|
|
|
};
|
|
|
|
|
2021-08-09 16:24:37 +01:00
|
|
|
#define KVM_PGTABLE_PROT_RW (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
|
|
|
|
#define KVM_PGTABLE_PROT_RWX (KVM_PGTABLE_PROT_RW | KVM_PGTABLE_PROT_X)
|
|
|
|
|
|
|
|
#define PKVM_HOST_MEM_PROT KVM_PGTABLE_PROT_RWX
|
|
|
|
#define PKVM_HOST_MMIO_PROT KVM_PGTABLE_PROT_RW
|
|
|
|
|
|
|
|
#define PAGE_HYP KVM_PGTABLE_PROT_RW
|
2020-09-11 14:25:12 +01:00
|
|
|
#define PAGE_HYP_EXEC (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_X)
|
|
|
|
#define PAGE_HYP_RO (KVM_PGTABLE_PROT_R)
|
|
|
|
#define PAGE_HYP_DEVICE (PAGE_HYP | KVM_PGTABLE_PROT_DEVICE)
|
|
|
|
|
2021-08-09 16:24:37 +01:00
|
|
|
typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
|
|
|
|
enum kvm_pgtable_prot prot);
|
|
|
|
|
2020-09-11 14:25:10 +01:00
|
|
|
/**
|
|
|
|
* enum kvm_pgtable_walk_flags - Flags to control a depth-first page-table walk.
|
|
|
|
* @KVM_PGTABLE_WALK_LEAF: Visit leaf entries, including invalid
|
|
|
|
* entries.
|
|
|
|
* @KVM_PGTABLE_WALK_TABLE_PRE: Visit table entries before their
|
|
|
|
* children.
|
|
|
|
* @KVM_PGTABLE_WALK_TABLE_POST: Visit table entries after their
|
|
|
|
* children.
|
2022-11-07 21:56:38 +00:00
|
|
|
* @KVM_PGTABLE_WALK_SHARED: Indicates the page-tables may be shared
|
|
|
|
* with other software walkers.
|
2022-12-02 18:51:52 +00:00
|
|
|
* @KVM_PGTABLE_WALK_HANDLE_FAULT: Indicates the page-table walk was
|
|
|
|
* invoked from a fault handler.
|
2023-04-26 17:23:20 +00:00
|
|
|
* @KVM_PGTABLE_WALK_SKIP_BBM_TLBI: Visit and update table entries
|
|
|
|
* without Break-before-make's
|
|
|
|
* TLB invalidation.
|
|
|
|
* @KVM_PGTABLE_WALK_SKIP_CMO: Visit and update table entries
|
|
|
|
* without Cache maintenance
|
|
|
|
* operations required.
|
2020-09-11 14:25:10 +01:00
|
|
|
*/
|
|
|
|
enum kvm_pgtable_walk_flags {
|
|
|
|
KVM_PGTABLE_WALK_LEAF = BIT(0),
|
|
|
|
KVM_PGTABLE_WALK_TABLE_PRE = BIT(1),
|
|
|
|
KVM_PGTABLE_WALK_TABLE_POST = BIT(2),
|
2022-11-07 21:56:38 +00:00
|
|
|
KVM_PGTABLE_WALK_SHARED = BIT(3),
|
2022-12-02 18:51:52 +00:00
|
|
|
KVM_PGTABLE_WALK_HANDLE_FAULT = BIT(4),
|
2023-04-26 17:23:20 +00:00
|
|
|
KVM_PGTABLE_WALK_SKIP_BBM_TLBI = BIT(5),
|
|
|
|
KVM_PGTABLE_WALK_SKIP_CMO = BIT(6),
|
2020-09-11 14:25:10 +01:00
|
|
|
};
|
|
|
|
|
2022-11-07 21:56:31 +00:00
|
|
|
struct kvm_pgtable_visit_ctx {
|
|
|
|
kvm_pte_t *ptep;
|
2022-11-07 21:56:32 +00:00
|
|
|
kvm_pte_t old;
|
2022-11-07 21:56:31 +00:00
|
|
|
void *arg;
|
2022-11-07 21:56:33 +00:00
|
|
|
struct kvm_pgtable_mm_ops *mm_ops;
|
KVM: arm64: Infer the PA offset from IPA in stage-2 map walker
Until now, the page table walker counted increments to the PA and IPA
of a walk in two separate places. While the PA is incremented as soon as
a leaf PTE is installed in stage2_map_walker_try_leaf(), the IPA is
actually bumped in the generic table walker context. Critically,
__kvm_pgtable_visit() rereads the PTE after the LEAF callback returns
to work out if a table or leaf was installed, and only bumps the IPA for
a leaf PTE.
This arrangement worked fine when we handled faults behind the write lock,
as the walker had exclusive access to the stage-2 page tables. However,
commit 1577cb5823ce ("KVM: arm64: Handle stage-2 faults in parallel")
started handling all stage-2 faults behind the read lock, opening up a
race where a walker could increment the PA but not the IPA of a walk.
Nothing good ensues, as the walker starts mapping with the incorrect
IPA -> PA relationship.
For example, assume that two vCPUs took a data abort on the same IPA.
One observes that dirty logging is disabled, and the other observed that
it is enabled:
vCPU attempting PMD mapping             vCPU attempting PTE mapping
======================================  =====================================
/* install PMD */
stage2_make_pte(ctx, leaf);
data->phys += granule;
                                        /* replace PMD with a table */
                                        stage2_try_break_pte(ctx, data->mmu);
                                        stage2_make_pte(ctx, table);
/* table is observed */
ctx.old = READ_ONCE(*ptep);
table = kvm_pte_table(ctx.old, level);
/*
 * map walk continues w/o incrementing
 * IPA.
 */
__kvm_pgtable_walk(..., level + 1);
Bring an end to the whole mess by using the IPA as the single source of
truth for how far along a walk has gotten. Work out the correct PA to
map by calculating the IPA offset from the beginning of the walk and add
that to the starting physical address.
Cc: stable@vger.kernel.org
Fixes: 1577cb5823ce ("KVM: arm64: Handle stage-2 faults in parallel")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230421071606.1603916-2-oliver.upton@linux.dev
2023-04-21 07:16:05 +00:00
|
|
|
u64 start;
|
2022-11-07 21:56:31 +00:00
|
|
|
u64 addr;
|
|
|
|
u64 end;
|
2023-11-27 11:17:33 +00:00
|
|
|
s8 level;
|
2022-11-07 21:56:31 +00:00
|
|
|
enum kvm_pgtable_walk_flags flags;
|
|
|
|
};
|
|
|
|
|
|
|
|
typedef int (*kvm_pgtable_visitor_fn_t)(const struct kvm_pgtable_visit_ctx *ctx,
|
|
|
|
enum kvm_pgtable_walk_flags visit);
|
2020-09-11 14:25:10 +01:00
|
|
|
|
2022-11-07 21:56:38 +00:00
|
|
|
static inline bool kvm_pgtable_walk_shared(const struct kvm_pgtable_visit_ctx *ctx)
|
|
|
|
{
|
|
|
|
return ctx->flags & KVM_PGTABLE_WALK_SHARED;
|
|
|
|
}
|
|
|
|
|
2020-09-11 14:25:10 +01:00
|
|
|
/**
|
|
|
|
* struct kvm_pgtable_walker - Hook into a page-table walk.
|
|
|
|
* @cb: Callback function to invoke during the walk.
|
|
|
|
* @arg: Argument passed to the callback function.
|
|
|
|
* @flags: Bitwise-OR of flags to identify the entry types on which to
|
|
|
|
* invoke the callback function.
|
|
|
|
*/
|
|
|
|
struct kvm_pgtable_walker {
|
|
|
|
const kvm_pgtable_visitor_fn_t cb;
|
|
|
|
void * const arg;
|
|
|
|
const enum kvm_pgtable_walk_flags flags;
|
|
|
|
};
|
|
|
|
|
2022-11-18 18:22:20 +00:00
|
|
|
/*
|
|
|
|
* RCU cannot be used in a non-kernel context such as the hyp. As such, page
|
|
|
|
* table walkers used in hyp do not call into RCU and instead use other
|
|
|
|
* synchronization mechanisms (such as a spinlock).
|
|
|
|
*/
|
|
|
|
#if defined(__KVM_NVHE_HYPERVISOR__) || defined(__KVM_VHE_HYPERVISOR__)
|
|
|
|
|
|
|
|
typedef kvm_pte_t *kvm_pteref_t;
|
|
|
|
|
|
|
|
static inline kvm_pte_t *kvm_dereference_pteref(struct kvm_pgtable_walker *walker,
|
|
|
|
kvm_pteref_t pteref)
|
|
|
|
{
|
|
|
|
return pteref;
|
|
|
|
}
|
|
|
|
|
2022-11-18 18:22:22 +00:00
|
|
|
static inline int kvm_pgtable_walk_begin(struct kvm_pgtable_walker *walker)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Due to the lack of RCU (or a similar protection scheme), only
|
|
|
|
* non-shared table walkers are allowed in the hypervisor.
|
|
|
|
*/
|
|
|
|
if (walker->flags & KVM_PGTABLE_WALK_SHARED)
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
KVM: arm64: Don't acquire RCU read lock for exclusive table walks
Marek reported a BUG resulting from the recent parallel faults changes,
as the hyp stage-1 map walker attempted to allocate table memory while
holding the RCU read lock:
BUG: sleeping function called from invalid context at
include/linux/sched/mm.h:274
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
2 locks held by swapper/0/1:
#0: ffff80000a8a44d0 (kvm_hyp_pgd_mutex){+.+.}-{3:3}, at:
__create_hyp_mappings+0x80/0xc4
#1: ffff80000a927720 (rcu_read_lock){....}-{1:2}, at:
kvm_pgtable_walk+0x0/0x1f4
CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.1.0-rc3+ #5918
Hardware name: Raspberry Pi 3 Model B (DT)
Call trace:
dump_backtrace.part.0+0xe4/0xf0
show_stack+0x18/0x40
dump_stack_lvl+0x8c/0xb8
dump_stack+0x18/0x34
__might_resched+0x178/0x220
__might_sleep+0x48/0xa0
prepare_alloc_pages+0x178/0x1a0
__alloc_pages+0x9c/0x109c
alloc_page_interleave+0x1c/0xc4
alloc_pages+0xec/0x160
get_zeroed_page+0x1c/0x44
kvm_hyp_zalloc_page+0x14/0x20
hyp_map_walker+0xd4/0x134
kvm_pgtable_visitor_cb.isra.0+0x38/0x5c
__kvm_pgtable_walk+0x1a4/0x220
kvm_pgtable_walk+0x104/0x1f4
kvm_pgtable_hyp_map+0x80/0xc4
__create_hyp_mappings+0x9c/0xc4
kvm_mmu_init+0x144/0x1cc
kvm_arch_init+0xe4/0xef4
kvm_init+0x3c/0x3d0
arm_init+0x20/0x30
do_one_initcall+0x74/0x400
kernel_init_freeable+0x2e0/0x350
kernel_init+0x24/0x130
ret_from_fork+0x10/0x20
Since the hyp stage-1 table walkers are serialized by kvm_hyp_pgd_mutex,
RCU protection really doesn't add anything. Don't acquire the RCU read
lock for an exclusive walk.
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221118182222.3932898-3-oliver.upton@linux.dev
2022-11-18 18:22:21 +00:00
|
|
|
static inline void kvm_pgtable_walk_end(struct kvm_pgtable_walker *walker) {}
|
2022-11-18 18:22:20 +00:00
|
|
|
|
|
|
|
static inline bool kvm_pgtable_walk_lock_held(void)
|
|
|
|
{
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
#else
|
|
|
|
|
|
|
|
typedef kvm_pte_t __rcu *kvm_pteref_t;
|
|
|
|
|
|
|
|
static inline kvm_pte_t *kvm_dereference_pteref(struct kvm_pgtable_walker *walker,
|
|
|
|
kvm_pteref_t pteref)
|
|
|
|
{
|
|
|
|
return rcu_dereference_check(pteref, !(walker->flags & KVM_PGTABLE_WALK_SHARED));
|
|
|
|
}
|
|
|
|
|
2022-11-18 18:22:22 +00:00
|
|
|
static inline int kvm_pgtable_walk_begin(struct kvm_pgtable_walker *walker)
|
2022-11-18 18:22:20 +00:00
|
|
|
{
|
2022-11-18 18:22:21 +00:00
|
|
|
if (walker->flags & KVM_PGTABLE_WALK_SHARED)
|
|
|
|
rcu_read_lock();
|
2022-11-18 18:22:22 +00:00
|
|
|
|
|
|
|
return 0;
|
2022-11-18 18:22:20 +00:00
|
|
|
}
|
|
|
|
|
2022-11-18 18:22:21 +00:00
|
|
|
static inline void kvm_pgtable_walk_end(struct kvm_pgtable_walker *walker)
|
2022-11-18 18:22:20 +00:00
|
|
|
{
|
2022-11-18 18:22:21 +00:00
|
|
|
if (walker->flags & KVM_PGTABLE_WALK_SHARED)
|
|
|
|
rcu_read_unlock();
|
2022-11-18 18:22:20 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool kvm_pgtable_walk_lock_held(void)
|
|
|
|
{
|
|
|
|
return rcu_read_lock_held();
|
|
|
|
}
|
|
|
|
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/**
|
|
|
|
* struct kvm_pgtable - KVM page-table.
|
|
|
|
* @ia_bits: Maximum input address size, in bits.
|
|
|
|
* @start_level: Level at which the page-table walk starts.
|
|
|
|
* @pgd: Pointer to the first top-level entry of the page-table.
|
|
|
|
* @mm_ops: Memory management callbacks.
|
|
|
|
* @mmu: Stage-2 KVM MMU struct. Unused for stage-1 page-tables.
|
|
|
|
* @flags: Stage-2 page-table flags.
|
|
|
|
* @force_pte_cb: Function that returns true if page level mappings must
|
|
|
|
* be used instead of block mappings.
|
|
|
|
*/
|
|
|
|
struct kvm_pgtable {
|
2024-12-18 19:40:58 +00:00
|
|
|
union {
|
2025-05-21 13:48:31 +01:00
|
|
|
struct rb_root_cached pkvm_mappings;
|
2024-12-18 19:40:58 +00:00
|
|
|
struct {
|
|
|
|
u32 ia_bits;
|
|
|
|
s8 start_level;
|
|
|
|
kvm_pteref_t pgd;
|
|
|
|
struct kvm_pgtable_mm_ops *mm_ops;
|
|
|
|
|
|
|
|
/* Stage-2 only */
|
|
|
|
enum kvm_pgtable_stage2_flags flags;
|
|
|
|
kvm_pgtable_force_pte_cb_t force_pte_cb;
|
|
|
|
};
|
|
|
|
};
|
|
|
|
struct kvm_s2_mmu *mmu;
|
2022-11-18 18:22:20 +00:00
|
|
|
};
|
|
|
|
|
2020-09-11 14:25:11 +01:00
|
|
|
/**
|
|
|
|
* kvm_pgtable_hyp_init() - Initialise a hypervisor stage-1 page-table.
|
|
|
|
* @pgt: Uninitialised page-table structure to initialise.
|
|
|
|
* @va_bits: Maximum virtual address bits.
|
2021-03-19 10:01:14 +00:00
|
|
|
* @mm_ops: Memory management callbacks.
|
2020-09-11 14:25:11 +01:00
|
|
|
*
|
|
|
|
* Return: 0 on success, negative error code on failure.
|
|
|
|
*/
|
2021-03-19 10:01:14 +00:00
|
|
|
int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits,
|
|
|
|
struct kvm_pgtable_mm_ops *mm_ops);
|
2020-09-11 14:25:11 +01:00
|
|
|
|
|
|
|
/**
|
|
|
|
* kvm_pgtable_hyp_destroy() - Destroy an unused hypervisor stage-1 page-table.
|
|
|
|
* @pgt: Page-table structure initialised by kvm_pgtable_hyp_init().
|
|
|
|
*
|
|
|
|
* The page-table is assumed to be unreachable by any hardware walkers prior
|
|
|
|
* to freeing and therefore no TLB invalidation is performed.
|
|
|
|
*/
|
|
|
|
void kvm_pgtable_hyp_destroy(struct kvm_pgtable *pgt);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* kvm_pgtable_hyp_map() - Install a mapping in a hypervisor stage-1 page-table.
|
|
|
|
* @pgt: Page-table structure initialised by kvm_pgtable_hyp_init().
|
|
|
|
* @addr: Virtual address at which to place the mapping.
|
|
|
|
* @size: Size of the mapping.
|
|
|
|
* @phys: Physical address of the memory to map.
|
|
|
|
* @prot: Permissions and attributes for the mapping.
|
|
|
|
*
|
|
|
|
* The offset of @addr within a page is ignored, @size is rounded-up to
|
|
|
|
* the next page boundary and @phys is rounded-down to the previous page
|
|
|
|
* boundary.
|
|
|
|
*
|
|
|
|
* If device attributes are not explicitly requested in @prot, then the
|
|
|
|
* mapping will be normal, cacheable. Attempts to install a new mapping
|
|
|
|
* for a virtual address that is already mapped will be rejected with an
|
|
|
|
* error and a WARN().
|
|
|
|
*
|
|
|
|
* Return: 0 on success, negative error code on failure.
|
|
|
|
*/
|
|
|
|
int kvm_pgtable_hyp_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
|
|
|
|
enum kvm_pgtable_prot prot);
|
|
|
|
|
2021-12-15 16:12:22 +00:00
|
|
|
/**
|
|
|
|
* kvm_pgtable_hyp_unmap() - Remove a mapping from a hypervisor stage-1 page-table.
|
|
|
|
* @pgt: Page-table structure initialised by kvm_pgtable_hyp_init().
|
|
|
|
* @addr: Virtual address from which to remove the mapping.
|
|
|
|
* @size: Size of the mapping.
|
|
|
|
*
|
|
|
|
* The offset of @addr within a page is ignored, @size is rounded-up to
|
|
|
|
* the next page boundary.
|
|
|
|
|
|
|
|
*
|
|
|
|
* TLB invalidation is performed for each page-table entry cleared during the
|
|
|
|
* unmapping operation and the reference count for the page-table page
|
|
|
|
* containing the cleared entry is decremented, with unreferenced pages being
|
|
|
|
* freed. The unmapping operation will stop early if it encounters either an
|
|
|
|
* invalid page-table entry or a valid block mapping which maps beyond the range
|
|
|
|
* being unmapped.
|
|
|
|
*
|
|
|
|
* Return: Number of bytes unmapped, which may be 0.
|
|
|
|
*/
|
|
|
|
u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
|
|
|
|
|
2021-03-19 10:01:30 +00:00
|
|
|
/**
|
|
|
|
* kvm_get_vtcr() - Helper to construct VTCR_EL2
|
|
|
|
* @mmfr0: Sanitized value of SYS_ID_AA64MMFR0_EL1 register.
|
|
|
|
* @mmfr1: Sanitized value of SYS_ID_AA64MMFR1_EL1 register.
|
|
|
|
* @phys_shift:	Value to set in VTCR_EL2.T0SZ.
|
|
|
|
*
|
|
|
|
* The VTCR value is common across all the physical CPUs on the system.
|
|
|
|
* We use system wide sanitised values to fill in different fields,
|
|
|
|
* except for Hardware Management of Access Flags. HA Flag is set
|
|
|
|
* unconditionally on all CPUs, as it is safe to run with or without
|
|
|
|
* the feature and the bit is RES0 on CPUs that don't support it.
|
|
|
|
*
|
|
|
|
* Return: VTCR_EL2 value
|
|
|
|
*/
|
|
|
|
u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
|
|
|
|
|
2022-11-10 19:02:45 +00:00
|
|
|
/**
|
|
|
|
* kvm_pgtable_stage2_pgd_size() - Helper to compute size of a stage-2 PGD
|
|
|
|
* @vtcr: Content of the VTCR register.
|
|
|
|
*
|
|
|
|
* Return: the size (in bytes) of the stage-2 PGD
|
|
|
|
*/
|
|
|
|
size_t kvm_pgtable_stage2_pgd_size(u64 vtcr);
|
|
|
|
|
2020-09-11 14:25:13 +01:00
|
|
|
/**
|
2021-08-09 16:24:37 +01:00
|
|
|
* __kvm_pgtable_stage2_init() - Initialise a guest stage-2 page-table.
|
2020-09-11 14:25:13 +01:00
|
|
|
* @pgt: Uninitialised page-table structure to initialise.
|
2021-11-29 20:00:45 +00:00
|
|
|
* @mmu: S2 MMU context for this S2 translation
|
2021-03-19 10:01:14 +00:00
|
|
|
* @mm_ops: Memory management callbacks.
|
2021-03-19 10:01:40 +00:00
|
|
|
* @flags: Stage-2 configuration flags.
|
2021-08-09 16:24:37 +01:00
|
|
|
* @force_pte_cb: Function that returns true if page level mappings must
|
|
|
|
* be used instead of block mappings.
|
2020-09-11 14:25:13 +01:00
|
|
|
*
|
|
|
|
* Return: 0 on success, negative error code on failure.
|
|
|
|
*/
|
2021-11-29 20:00:45 +00:00
|
|
|
int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
|
2021-08-09 16:24:37 +01:00
|
|
|
struct kvm_pgtable_mm_ops *mm_ops,
|
|
|
|
enum kvm_pgtable_stage2_flags flags,
|
|
|
|
kvm_pgtable_force_pte_cb_t force_pte_cb);
|
2021-03-19 10:01:40 +00:00
|
|
|
|
2024-12-18 19:40:48 +00:00
|
|
|
static inline int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_s2_mmu *mmu,
|
|
|
|
struct kvm_pgtable_mm_ops *mm_ops)
|
|
|
|
{
|
|
|
|
return __kvm_pgtable_stage2_init(pgt, mmu, mm_ops, 0, NULL);
|
|
|
|
}
|
2020-09-11 14:25:13 +01:00
|
|
|
|
|
|
|
/**
|
|
|
|
* kvm_pgtable_stage2_destroy() - Destroy an unused guest stage-2 page-table.
|
2021-03-19 10:01:40 +00:00
|
|
|
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
|
2020-09-11 14:25:13 +01:00
|
|
|
*
|
|
|
|
* The page-table is assumed to be unreachable by any hardware walkers prior
|
|
|
|
* to freeing and therefore no TLB invalidation is performed.
|
|
|
|
*/
|
|
|
|
void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
|
|
|
|
|
2022-11-07 21:56:35 +00:00
|
|
|
/**
|
2023-04-26 17:23:19 +00:00
|
|
|
* kvm_pgtable_stage2_free_unlinked() - Free an unlinked stage-2 paging structure.
|
2022-11-07 21:56:35 +00:00
|
|
|
* @mm_ops: Memory management callbacks.
|
|
|
|
* @pgtable: Unlinked stage-2 paging structure to be freed.
|
|
|
|
* @level: Level of the stage-2 paging structure to be freed.
|
|
|
|
*
|
|
|
|
* The page-table is assumed to be unreachable by any hardware walkers prior to
|
|
|
|
* freeing and therefore no TLB invalidation is performed.
|
|
|
|
*/
|
2023-11-27 11:17:33 +00:00
|
|
|
void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *pgtable, s8 level);
|
2022-11-07 21:56:35 +00:00
|
|
|
|
2023-04-26 17:23:21 +00:00
|
|
|
/**
|
|
|
|
* kvm_pgtable_stage2_create_unlinked() - Create an unlinked stage-2 paging structure.
|
|
|
|
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
|
|
|
|
* @phys: Physical address of the memory to map.
|
|
|
|
* @level: Starting level of the stage-2 paging structure to be created.
|
|
|
|
* @prot: Permissions and attributes for the mapping.
|
|
|
|
* @mc: Cache of pre-allocated and zeroed memory from which to allocate
|
|
|
|
* page-table pages.
|
|
|
|
* @force_pte: Force mappings to PAGE_SIZE granularity.
|
|
|
|
*
|
|
|
|
* Returns an unlinked page-table tree. This new page-table tree is
|
|
|
|
* not reachable (i.e., it is unlinked) from the root pgd and it's
|
|
|
|
* therefore unreachable by the hardware page-table walker. No TLB
|
|
|
|
* invalidation or CMOs are performed.
|
|
|
|
*
|
|
|
|
* If device attributes are not explicitly requested in @prot, then the
|
|
|
|
* mapping will be normal, cacheable.
|
|
|
|
*
|
|
|
|
* Return: The fully populated (unlinked) stage-2 paging structure, or
|
|
|
|
* an ERR_PTR(error) on failure.
|
|
|
|
*/
|
|
|
|
kvm_pte_t *kvm_pgtable_stage2_create_unlinked(struct kvm_pgtable *pgt,
|
2023-11-27 11:17:33 +00:00
|
|
|
u64 phys, s8 level,
|
2023-04-26 17:23:21 +00:00
|
|
|
enum kvm_pgtable_prot prot,
|
|
|
|
void *mc, bool force_pte);
|
2022-11-07 21:56:35 +00:00
|
|
|
|
2020-09-11 14:25:14 +01:00
|
|
|
/**
|
|
|
|
* kvm_pgtable_stage2_map() - Install a mapping in a guest stage-2 page-table.
|
2021-03-19 10:01:40 +00:00
|
|
|
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
|
2020-09-11 14:25:14 +01:00
|
|
|
* @addr: Intermediate physical address at which to place the mapping.
|
|
|
|
* @size: Size of the mapping.
|
|
|
|
* @phys: Physical address of the memory to map.
|
|
|
|
* @prot: Permissions and attributes for the mapping.
|
2021-03-19 10:01:33 +00:00
|
|
|
* @mc: Cache of pre-allocated and zeroed memory from which to allocate
|
|
|
|
* page-table pages.
|
2022-11-07 22:00:33 +00:00
|
|
|
* @flags: Flags to control the page-table walk (e.g. a shared walk)
|
2020-09-11 14:25:14 +01:00
|
|
|
*
|
|
|
|
* The offset of @addr within a page is ignored, @size is rounded-up to
|
|
|
|
* the next page boundary and @phys is rounded-down to the previous page
|
|
|
|
* boundary.
|
|
|
|
*
|
|
|
|
* If device attributes are not explicitly requested in @prot, then the
|
|
|
|
* mapping will be normal, cacheable.
|
|
|
|
*
|
KVM: arm64: Filter out the case of only changing permissions from stage-2 map path
(1) During running time of a a VM with numbers of vCPUs, if some vCPUs
access the same GPA almost at the same time and the stage-2 mapping of
the GPA has not been built yet, as a result they will all cause
translation faults. The first vCPU builds the mapping, and the followed
ones end up updating the valid leaf PTE. Note that these vCPUs might
want different access permissions (RO, RW, RX, RWX, etc.).
(2) It's inevitable that we sometimes will update an existing valid leaf
PTE in the map path, and we perform break-before-make in this case.
Then more unnecessary translation faults could be caused if the
*break stage* of BBM is just catched by other vCPUS.
With (1) and (2), something unsatisfactory could happen: vCPU A causes
a translation fault and builds the mapping with RW permissions, vCPU B
then update the valid leaf PTE with break-before-make and permissions
are updated back to RO. Besides, *break stage* of BBM may trigger more
translation faults. Finally, some useless small loops could occur.
We can make some optimization to solve above problems: When we need to
update a valid leaf PTE in the map path, let's filter out the case where
this update only change access permissions, and don't update the valid
leaf PTE here in this case. Instead, let the vCPU enter back the guest
and it will exit next time to go through the relax_perms path without
break-before-make if it still wants more permissions.
Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210114121350.123684-3-wangyanan55@huawei.com
2021-01-14 20:13:49 +08:00
|
|
|
* Note that the update of a valid leaf PTE in this function will be aborted,
|
|
|
|
* if it's trying to recreate the exact same mapping or only change the access
|
|
|
|
* permissions. Instead, the vCPU will exit one more time from guest if still
|
|
|
|
* needed and then go through the path of relaxing permissions.
|
|
|
|
*
|
2020-09-11 14:25:14 +01:00
|
|
|
* Note that this function will both coalesce existing table entries and split
|
|
|
|
* existing block mappings, relying on page-faults to fault back areas outside
|
|
|
|
* of the new mapping lazily.
|
|
|
|
*
|
|
|
|
* Return: 0 on success, negative error code on failure.
|
|
|
|
*/
|
|
|
|
int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
|
|
|
|
u64 phys, enum kvm_pgtable_prot prot,
|
2022-11-07 22:00:33 +00:00
|
|
|
void *mc, enum kvm_pgtable_walk_flags flags);
|
2020-09-11 14:25:14 +01:00
|
|
|
|
2021-03-19 10:01:37 +00:00
|
|
|
/**
 * kvm_pgtable_stage2_set_owner() - Unmap and annotate pages in the IPA space
 *				    to track ownership.
 * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
 * @addr:	Base intermediate physical address to annotate.
 * @size:	Size of the annotated range.
 * @mc:		Cache of pre-allocated and zeroed memory from which to allocate
 *		page-table pages.
 * @owner_id:	Unique identifier for the owner of the page.
 *
 * By default, all page-tables are owned by identifier 0. This function can be
 * used to mark portions of the IPA space as owned by other entities. When a
 * stage 2 is used with identity-mappings, these annotations allow the
 * page-table data structure to be used as a simple rmap.
 *
 * Return: 0 on success, negative error code on failure.
 */
int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
				 void *mc, u8 owner_id);

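Conceptually, the ownership annotation is a per-page owner identifier keyed by IPA, with 0 meaning "default owner". A toy flat-array model of that idea (a real stage-2 table encodes the owner in invalid PTEs; the page size, window and error value here are invented):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SHIFT	12
#define TOY_PAGE_SIZE	(1u << TOY_PAGE_SHIFT)
#define TOY_NR_PAGES	16

/* One owner byte per page of a small toy IPA window; identifier 0 is the
 * default owner, mirroring the kernel-doc above. */
static uint8_t toy_owner[TOY_NR_PAGES];

/* Annotate every page overlapping [addr, addr + size) as owned by
 * @owner_id; returns 0 on success, -1 if the range is out of bounds. */
static int toy_set_owner(uint64_t addr, uint64_t size, uint8_t owner_id)
{
	uint64_t pfn = addr >> TOY_PAGE_SHIFT;
	uint64_t nr = (size + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT;

	if (pfn + nr > TOY_NR_PAGES)
		return -1;
	memset(&toy_owner[pfn], owner_id, nr);
	return 0;
}
```

Looking up `toy_owner[ipa >> TOY_PAGE_SHIFT]` then answers "who owns this page?", which is exactly the reverse-map use the comment describes.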
/**
 * kvm_pgtable_stage2_unmap() - Remove a mapping from a guest stage-2 page-table.
 * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
 * @addr:	Intermediate physical address from which to remove the mapping.
 * @size:	Size of the mapping.
 *
 * The offset of @addr within a page is ignored and @size is rounded-up to
 * the next page boundary.
 *
 * TLB invalidation is performed for each page-table entry cleared during the
 * unmapping operation and the reference count for the page-table page
 * containing the cleared entry is decremented, with unreferenced pages being
 * freed. Unmapping a cacheable page will ensure that it is clean to the PoC if
 * FWB is not supported by the CPU.
 *
 * Return: 0 on success, negative error code on failure.
 */
int kvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);

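The "reference count ... decremented, with unreferenced pages being freed" part is a simple refcounting discipline: a table page stays alive as long as any of its entries is in use. A minimal sketch of just that rule (the structure and the `freed` flag are invented for illustration; the kernel tracks this via the page refcount):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy table page: a refcount of how many of its entries are still valid.
 * Clearing an entry drops the count; once nothing references the page it
 * is "freed" (flagged here), as in the unmap path described above. */
struct toy_table_page {
	unsigned int refcount;
	bool freed;
};

static void toy_clear_entry(struct toy_table_page *tp)
{
	if (--tp->refcount == 0)
		tp->freed = true;
}
```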
/**
 * kvm_pgtable_stage2_wrprotect() - Write-protect guest stage-2 address range
 *				    without TLB invalidation.
 * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
 * @addr:	Intermediate physical address from which to write-protect.
 * @size:	Size of the range.
 *
 * The offset of @addr within a page is ignored and @size is rounded-up to
 * the next page boundary.
 *
 * Note that it is the caller's responsibility to invalidate the TLB after
 * calling this function to ensure that the updated permissions are visible
 * to the CPUs.
 *
 * Return: 0 on success, negative error code on failure.
 */
int kvm_pgtable_stage2_wrprotect(struct kvm_pgtable *pgt, u64 addr, u64 size);

/**
 * kvm_pgtable_stage2_mkyoung() - Set the access flag in a page-table entry.
 * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
 * @addr:	Intermediate physical address to identify the page-table entry.
 * @flags:	Flags to control the page-table walk (e.g. a shared walk)
 *
 * The offset of @addr within a page is ignored.
 *
 * If there is a valid, leaf page-table entry used to translate @addr, then
 * set the access flag in that entry.
 */
void kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr,
				enum kvm_pgtable_walk_flags flags);

/**
 * kvm_pgtable_stage2_test_clear_young() - Test and optionally clear the access
 *					   flag in a page-table entry.
 * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
 * @addr:	Intermediate physical address to identify the page-table entry.
 * @size:	Size of the address range to visit.
 * @mkold:	True if the access flag should be cleared.
 *
 * The offset of @addr within a page is ignored.
 *
 * Tests and conditionally clears the access flag for every valid, leaf
 * page-table entry used to translate the range [@addr, @addr + @size).
 *
 * Note that it is the caller's responsibility to invalidate the TLB after
 * calling this function to ensure that the updated permissions are visible
 * to the CPUs.
 *
 * Return: True if any of the visited PTEs had the access flag set.
 */
bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
					 u64 size, bool mkold);

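The test-and-conditionally-clear semantics above can be modelled over a flat array of toy PTEs; the access-flag bit position below is invented (arm64 uses a specific descriptor bit), but the accumulate-then-clear shape matches the description:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy access-flag bit; the real descriptor bit position differs. */
#define TOY_AF	(UINT64_C(1) << 10)

/* Test, and if @mkold also clear, the access flag across @nr toy PTEs
 * starting at @idx; returns true if any visited PTE had the flag set. */
static bool toy_test_clear_young(uint64_t *ptes, size_t idx, size_t nr,
				 bool mkold)
{
	bool young = false;

	for (size_t i = idx; i < idx + nr; i++) {
		if (ptes[i] & TOY_AF) {
			young = true;
			if (mkold)
				ptes[i] &= ~TOY_AF;
		}
	}
	return young;
}
```

Calling it once with `mkold = true` "ages" the range; a second call then reports it as not-young until the entries are accessed (and the flag set) again.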
/**
 * kvm_pgtable_stage2_relax_perms() - Relax the permissions enforced by a
 *				      page-table entry.
 * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
 * @addr:	Intermediate physical address to identify the page-table entry.
 * @prot:	Additional permissions to grant for the mapping.
 * @flags:	Flags to control the page-table walk (e.g. a shared walk)
 *
 * The offset of @addr within a page is ignored.
 *
 * If there is a valid, leaf page-table entry used to translate @addr, then
 * relax the permissions in that entry according to the read, write and
 * execute permissions specified by @prot. No permissions are removed, and
 * TLB invalidation is performed after updating the entry. Software bits cannot
 * be set or cleared using kvm_pgtable_stage2_relax_perms().
 *
 * Return: 0 on success, negative error code on failure.
 */
int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
				   enum kvm_pgtable_prot prot,
				   enum kvm_pgtable_walk_flags flags);

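Because relaxing permissions may only ever add R/W/X bits and never remove any, the core of the update reduces to a bitwise OR of the requested bits into the entry. A toy sketch with invented permission bit values:

```c
#include <assert.h>
#include <stdint.h>

/* Toy permission bits; the arm64 descriptor encodes permissions
 * differently (and partly with inverted polarity). */
#define TOY_R	UINT64_C(0x1)
#define TOY_W	UINT64_C(0x2)
#define TOY_X	UINT64_C(0x4)

/* Relax-only update: grant the requested permissions, keep everything the
 * entry already had. Nothing is ever removed. */
static uint64_t toy_relax_perms(uint64_t pte, uint64_t prot)
{
	return pte | (prot & (TOY_R | TOY_W | TOY_X));
}
```

This is why the map path can safely abort a permission-only update: the later relax step is monotonic and needs no break-before-make.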
/**
 * kvm_pgtable_stage2_flush() - Clean and invalidate data cache to Point of
 *				Coherency for guest stage-2 address range.
 * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
 * @addr:	Intermediate physical address from which to flush.
 * @size:	Size of the range.
 *
 * The offset of @addr within a page is ignored and @size is rounded-up to
 * the next page boundary.
 *
 * Return: 0 on success, negative error code on failure.
 */
int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);

/**
 * kvm_pgtable_stage2_split() - Split a range of huge pages into leaf PTEs pointing
 *				to PAGE_SIZE guest pages.
 * @pgt:	Page-table structure initialised by kvm_pgtable_stage2_init*().
 * @addr:	Intermediate physical address from which to split.
 * @size:	Size of the range.
 * @mc:		Cache of pre-allocated and zeroed memory from which to allocate
 *		page-table pages.
 *
 * The function tries to split any level 1 or 2 entry that overlaps
 * with the input range (given by @addr and @size).
 *
 * Return: 0 on success, negative error code on failure. Note that
 * kvm_pgtable_stage2_split() is best effort: it tries to break as many
 * blocks in the input range as the memory in @mc allows.
 */
int kvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
			     struct kvm_mmu_memory_cache *mc);

/**
 * kvm_pgtable_walk() - Walk a page-table.
 * @pgt:	Page-table structure initialised by kvm_pgtable_*_init().
 * @addr:	Input address for the start of the walk.
 * @size:	Size of the range to walk.
 * @walker:	Walker callback description.
 *
 * The offset of @addr within a page is ignored and @size is rounded-up to
 * the next page boundary.
 *
 * The walker will walk the page-table entries corresponding to the input
 * address range specified, visiting entries according to the walker flags.
 * Invalid entries are treated as leaf entries. The visited page table entry is
 * reloaded after invoking the walker callback, allowing the walker to descend
 * into a newly installed table.
 *
 * Returning a negative error code from the walker callback function will
 * terminate the walk immediately with the same error code.
 *
 * Return: 0 on success, negative error code on failure.
 */
int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
		     struct kvm_pgtable_walker *walker);

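The walker contract (visit every entry in range, propagate a negative callback return as an immediate abort) can be illustrated with a toy walk over a flat array of "entries"; the callback type and the sentinel value are invented, and a real page-table walk would recurse through table levels:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy visitor: called once per entry; a negative return aborts the walk. */
typedef int (*toy_visitor_t)(size_t idx, uint64_t entry, void *arg);

/* Visit every entry in [start, end); stop and propagate the first
 * negative return from the callback, as kvm_pgtable_walk() does. */
static int toy_walk(const uint64_t *entries, size_t start, size_t end,
		    toy_visitor_t fn, void *arg)
{
	for (size_t i = start; i < end; i++) {
		int ret = fn(i, entries[i], arg);
		if (ret < 0)
			return ret;
	}
	return 0;
}

/* Example visitor: count non-zero ("valid") entries, abort on a toy
 * poison value to demonstrate early termination. */
static int toy_count_valid(size_t idx, uint64_t entry, void *arg)
{
	(void)idx;
	if (entry == 0xdead)
		return -1;
	if (entry)
		(*(int *)arg)++;
	return 0;
}
```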
/**
 * kvm_pgtable_get_leaf() - Walk a page-table and retrieve the leaf entry
 *			    with its level.
 * @pgt:	Page-table structure initialised by kvm_pgtable_*_init()
 *		or a similar initialiser.
 * @addr:	Input address for the start of the walk.
 * @ptep:	Pointer to storage for the retrieved PTE.
 * @level:	Pointer to storage for the level of the retrieved PTE.
 *
 * The offset of @addr within a page is ignored.
 *
 * The walker will walk the page-table entries corresponding to the input
 * address specified, retrieving the leaf corresponding to this address.
 * Invalid entries are treated as leaf entries.
 *
 * Return: 0 on success, negative error code on failure.
 */
int kvm_pgtable_get_leaf(struct kvm_pgtable *pgt, u64 addr,
			 kvm_pte_t *ptep, s8 *level);

/**
 * kvm_pgtable_stage2_pte_prot() - Retrieve the protection attributes of a
 *				   stage-2 Page-Table Entry.
 * @pte:	Page-table entry
 *
 * Return: protection attributes of the page-table entry in the enum
 *	   kvm_pgtable_prot format.
 */
enum kvm_pgtable_prot kvm_pgtable_stage2_pte_prot(kvm_pte_t pte);

/**
 * kvm_pgtable_hyp_pte_prot() - Retrieve the protection attributes of a stage-1
 *				Page-Table Entry.
 * @pte:	Page-table entry
 *
 * Return: protection attributes of the page-table entry in the enum
 *	   kvm_pgtable_prot format.
 */
enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte);

/**
 * kvm_tlb_flush_vmid_range() - Invalidate/flush a range of TLB entries
 *
 * @mmu:	Stage-2 KVM MMU struct
 * @addr:	Base intermediate physical address from which to invalidate
 * @size:	Size of the range from the base to invalidate
 */
void kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
			      phys_addr_t addr, size_t size);

#endif	/* __ARM64_KVM_PGTABLE_H__ */