/* SPDX-License-Identifier: GPL-2.0 */
#ifndef __KVM_X86_MMU_H
#define __KVM_X86_MMU_H

#include <linux/kvm_host.h>

#include "kvm_cache_regs.h"
#include "x86.h"
#include "cpuid.h"

extern bool __read_mostly enable_mmio_caching;

#define PT_WRITABLE_SHIFT 1
#define PT_USER_SHIFT 2

#define PT_PRESENT_MASK (1ULL << 0)
#define PT_WRITABLE_MASK (1ULL << PT_WRITABLE_SHIFT)
#define PT_USER_MASK (1ULL << PT_USER_SHIFT)
#define PT_PWT_MASK (1ULL << 3)
#define PT_PCD_MASK (1ULL << 4)
#define PT_ACCESSED_SHIFT 5
#define PT_ACCESSED_MASK (1ULL << PT_ACCESSED_SHIFT)
#define PT_DIRTY_SHIFT 6
#define PT_DIRTY_MASK (1ULL << PT_DIRTY_SHIFT)
#define PT_PAGE_SIZE_SHIFT 7
#define PT_PAGE_SIZE_MASK (1ULL << PT_PAGE_SIZE_SHIFT)
#define PT_PAT_MASK (1ULL << 7)
#define PT_GLOBAL_MASK (1ULL << 8)
#define PT64_NX_SHIFT 63
#define PT64_NX_MASK (1ULL << PT64_NX_SHIFT)

#define PT_PAT_SHIFT 7
#define PT_DIR_PAT_SHIFT 12
#define PT_DIR_PAT_MASK (1ULL << PT_DIR_PAT_SHIFT)

#define PT64_ROOT_5LEVEL 5
#define PT64_ROOT_4LEVEL 4
#define PT32_ROOT_LEVEL 2
#define PT32E_ROOT_LEVEL 3

#define KVM_MMU_CR4_ROLE_BITS (X86_CR4_PSE | X86_CR4_PAE | X86_CR4_LA57 | \
                               X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE)

#define KVM_MMU_CR0_ROLE_BITS (X86_CR0_PG | X86_CR0_WP)
#define KVM_MMU_EFER_ROLE_BITS (EFER_LME | EFER_NX)
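
/*
 * Build a mask with the inclusive bit range [s, e] set, e.g. rsvd_bits(52, 63)
 * covers physical-address bits 52..63.
 */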
static __always_inline u64 rsvd_bits(int s, int e)
{
        BUILD_BUG_ON(__builtin_constant_p(e) && __builtin_constant_p(s) && e < s);

        if (__builtin_constant_p(e))
                BUILD_BUG_ON(e > 63);
        else
                e &= 63;

        if (e < s)
                return 0;

        return ((2ULL << (e - s)) - 1) << s;
}
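
/* Highest GFN (inclusive) for which KVM is allowed to create a SPTE. */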
static inline gfn_t kvm_mmu_max_gfn(void)
{
        /*
         * Note that this uses the host MAXPHYADDR, not the guest's.
         *
         * EPT/NPT cannot support GPAs that would exceed host.MAXPHYADDR;
         * assuming KVM is running on bare metal, guest accesses beyond
         * host.MAXPHYADDR will hit a #PF(RSVD) and never cause a vmexit
         * (either EPT Violation/Misconfig or #NPF), and so KVM will never
         * install a SPTE for such addresses.  If KVM is running as a VM
         * itself, on the other hand, it might see a MAXPHYADDR that is less
         * than hardware's real MAXPHYADDR.  Using the host MAXPHYADDR
         * disallows such SPTEs entirely and simplifies the TDP MMU.
         */
        int max_gpa_bits = likely(tdp_enabled) ? kvm_host.maxphyaddr : 52;

        return (1ULL << (max_gpa_bits - PAGE_SHIFT)) - 1;
}

u8 kvm_mmu_get_max_tdp_level(void);

void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value);
void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);

void kvm_init_mmu(struct kvm_vcpu *vcpu);
void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
                             unsigned long cr4, u64 efer, gpa_t nested_cr3);
void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
                             int huge_page_level, bool accessed_dirty,
                             gpa_t new_eptp);
bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
                          u64 fault_address, char *insn, int insn_len);
void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
                                        struct kvm_mmu *mmu);

int kvm_mmu_load(struct kvm_vcpu *vcpu);
void kvm_mmu_unload(struct kvm_vcpu *vcpu);
void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu);
void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu);
void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
                         int bytes);
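
/*
 * Ensure the vCPU has a valid MMU root before entering the guest: free any
 * obsolete roots if requested, and load a new root only when none is active.
 */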
static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
{
        if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
                kvm_mmu_free_obsolete_roots(vcpu);

        /*
         * Checking root.hpa is sufficient even when KVM has mirror root.
         * We can have either:
         * (1) mirror_root_hpa = INVALID_PAGE, root.hpa = INVALID_PAGE
         * (2) mirror_root_hpa = root,         root.hpa = INVALID_PAGE
         * (3) mirror_root_hpa = root1,        root.hpa = root2
         * We don't ever have:
         *     mirror_root_hpa = INVALID_PAGE, root.hpa = root
         */
        if (likely(vcpu->arch.mmu->root.hpa != INVALID_PAGE))
                return 0;

        return kvm_mmu_load(vcpu);
}
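
/*
 * Extract the PCID from a CR3 value; returns 0 when CR4.PCIDE=0, in which
 * case the low bits of CR3 are not a PCID.
 */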
static inline unsigned long kvm_get_pcid(struct kvm_vcpu *vcpu, gpa_t cr3)
{
        BUILD_BUG_ON((X86_CR3_PCID_MASK & PAGE_MASK) != 0);

        return kvm_is_cr4_bit_set(vcpu, X86_CR4_PCIDE)
               ? cr3 & X86_CR3_PCID_MASK
               : 0;
}

static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)
{
        return kvm_get_pcid(vcpu, kvm_read_cr3(vcpu));
}
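
/*
 * Return the LAM control bits (CR3.LAM_U48/LAM_U57) that are active in the
 * vCPU's CR3, or 0 if LAM is not exposed to the guest.
 */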
static inline unsigned long kvm_get_active_cr3_lam_bits(struct kvm_vcpu *vcpu)
{
        if (!guest_cpu_cap_has(vcpu, X86_FEATURE_LAM))
                return 0;

        return kvm_read_cr3(vcpu) & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57);
}
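
/*
 * Hand the current MMU root to hardware via the vendor load_mmu_pgd() hook;
 * does nothing if no valid root is loaded.
 */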
static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
{
        u64 root_hpa = vcpu->arch.mmu->root.hpa;

        if (!VALID_PAGE(root_hpa))
                return;

        kvm_x86_call(load_mmu_pgd)(vcpu, root_hpa,
                                   vcpu->arch.mmu->root_role.level);
}

static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
                                                    struct kvm_mmu *mmu)
{
        /*
         * When EPT is enabled, KVM may passthrough CR0.WP to the guest, i.e.
         * @mmu's snapshot of CR0.WP and thus all related paging metadata may
         * be stale.  Refresh CR0.WP and the metadata on-demand when checking
         * for permission faults.  Exempt nested MMUs, i.e. MMUs for shadowing
         * nEPT and nNPT, as CR0.WP is ignored in both cases.  Note, KVM does
         * need to refresh nested_mmu, a.k.a. the walker used to translate L2
         * GVAs to GPAs, as that "MMU" needs to honor L2's CR0.WP.
         */
        if (!tdp_enabled || mmu == &vcpu->arch.guest_mmu)
                return;

        __kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
}

/*
 * Check if a given access (described through the I/D, W/R and U/S bits of a
 * page fault error code pfec) causes a permission fault with the given PTE
 * access rights (in ACC_* format).
 *
 * Return zero if the access does not fault; return the page fault error code
 * if the access faults.
 */
static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
                                  unsigned pte_access, unsigned pte_pkey,
                                  u64 access)
{
        /* strip nested paging fault error codes */
        unsigned int pfec = access;
        unsigned long rflags = kvm_x86_call(get_rflags)(vcpu);

        /*
         * For explicit supervisor accesses, SMAP is disabled if EFLAGS.AC = 1.
         * For implicit supervisor accesses, SMAP cannot be overridden.
         *
         * SMAP works on supervisor accesses only, and not_smap can
         * be set or not set when user access with neither has any bearing
         * on the result.
         *
         * We put the SMAP checking bit in place of the PFERR_RSVD_MASK bit;
         * this bit will always be zero in pfec, but it will be one in index
         * if SMAP checks are being disabled.
         */
        u64 implicit_access = access & PFERR_IMPLICIT_ACCESS;
        bool not_smap = ((rflags & X86_EFLAGS_AC) | implicit_access) == X86_EFLAGS_AC;
        int index = (pfec | (not_smap ? PFERR_RSVD_MASK : 0)) >> 1;
        u32 errcode = PFERR_PRESENT_MASK;
        bool fault;

        kvm_mmu_refresh_passthrough_bits(vcpu, mmu);

        fault = (mmu->permissions[index] >> pte_access) & 1;

        WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
        if (unlikely(mmu->pkru_mask)) {
                u32 pkru_bits, offset;

                /*
                 * PKRU defines 32 bits, there are 16 domains and 2
                 * attribute bits per domain in pkru.  pte_pkey is the
                 * index of the protection domain, so pte_pkey * 2 is
                 * the index of the first bit for the domain.
                 */
                pkru_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;

                /* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
                offset = (pfec & ~1) | ((pte_access & PT_USER_MASK) ? PFERR_RSVD_MASK : 0);

                pkru_bits &= mmu->pkru_mask >> offset;
                errcode |= -pkru_bits & PFERR_PK_MASK;
                fault |= (pkru_bits != 0);
        }

        return -(u32)fault & errcode;
}

bool kvm_mmu_may_ignore_guest_pat(struct kvm *kvm);

int kvm_mmu_post_init_vm(struct kvm *kvm);
void kvm_mmu_pre_destroy_vm(struct kvm *kvm);

static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
{
        /*
         * Read shadow_root_allocated before related pointers.  Hence, threads
         * reading shadow_root_allocated in any lock context are guaranteed to
         * see the pointers.  Pairs with smp_store_release in
         * mmu_first_shadow_root_alloc.
         */
        return smp_load_acquire(&kvm->arch.shadow_root_allocated);
}

#ifdef CONFIG_X86_64
extern bool tdp_mmu_enabled;
#else
#define tdp_mmu_enabled false
#endif

bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
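
/*
 * rmaps are needed whenever the shadow MMU may be in use, i.e. when the TDP
 * MMU is disabled or a shadow root has been allocated.
 */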
static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
{
        return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
}

static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
{
        /* KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K) must be 0. */
        return (gfn >> KVM_HPAGE_GFN_SHIFT(level)) -
               (base_gfn >> KVM_HPAGE_GFN_SHIFT(level));
}
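
/*
 * Number of @level-sized ranges spanned by the first @npages pages of a
 * memslot, starting at the slot's base GFN.
 */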
static inline unsigned long
__kvm_mmu_slot_lpages(struct kvm_memory_slot *slot, unsigned long npages,
                      int level)
{
        return gfn_to_index(slot->base_gfn + npages - 1,
                            slot->base_gfn, level) + 1;
}

static inline unsigned long
kvm_mmu_slot_lpages(struct kvm_memory_slot *slot, int level)
{
        return __kvm_mmu_slot_lpages(slot, slot->npages, level);
}

static inline void kvm_update_page_stats(struct kvm *kvm, int level, int count)
{
        atomic64_add(count, &kvm->stat.pages[level - 1]);
}
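
/*
 * When @mmu is the nested MMU, the GPA is really an L2 GPA and must be
 * translated through L1's page tables; otherwise the GPA is returned as-is.
 */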
gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u64 access,
                           struct x86_exception *exception);

static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
                                      struct kvm_mmu *mmu,
                                      gpa_t gpa, u64 access,
                                      struct x86_exception *exception)
{
        if (mmu != &vcpu->arch.nested_mmu)
                return gpa;
        return translate_nested_gpa(vcpu, gpa, access, exception);
}
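
/*
 * True if the VM's private memory is mapped by page tables managed outside of
 * KVM (currently only TDX VMs), which KVM shadows with mirror page tables.
 */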
static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
{
        return kvm->arch.vm_type == KVM_X86_TDX_VM;
}
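
/*
 * GFN bits that select the direct (shared) alias of the GPA space; 0 for VMs
 * without a shared/private split.
 */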
static inline gfn_t kvm_gfn_direct_bits(const struct kvm *kvm)
{
        return kvm->arch.gfn_direct_bits;
}

static inline bool kvm_is_addr_direct(struct kvm *kvm, gpa_t gpa)
{
        gpa_t gpa_direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(kvm));

        return !gpa_direct_bits || (gpa & gpa_direct_bits);
}
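
/*
 * True if the GFN has an alias-selecting bit set; memslot GFNs must never
 * contain such bits.
 */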
static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
{
        return gfn & kvm_gfn_direct_bits(kvm);
}

#endif