License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                              # files
----------------------------------------------------|-------
GPL-2.0                                                11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier                              # files
----------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                          930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                              # files
----------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                          270
GPL-2.0+ WITH Linux-syscall-note                         169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)       21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)       17
LGPL-2.1+ WITH Linux-syscall-note                         15
GPL-1.0+ WITH Linux-syscall-note                          14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)       5
LGPL-2.0+ WITH Linux-syscall-note                          4
LGPL-2.1 WITH Linux-syscall-note                           3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)                1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 15:07:57 +01:00
/* SPDX-License-Identifier: GPL-2.0 */
2008-07-03 14:59:22 +03:00
#ifndef ARCH_X86_KVM_X86_H
#define ARCH_X86_KVM_X86_H

#include <linux/kvm_host.h>
2023-04-04 17:45:16 -07:00
#include <asm/fpu/xstate.h>
2020-10-29 14:56:00 +01:00
#include <asm/mce.h>
2016-06-20 22:28:02 -03:00
#include <asm/pvclock.h>
2010-01-21 15:31:48 +02:00
#include "kvm_cache_regs.h"
2020-02-18 15:29:49 -08:00
#include "kvm_emulate.h"
KVM: x86: model canonical checks more precisely
As a result of a recent investigation, it was determined that x86 CPUs
which support 5-level paging don't always respect CR4.LA57 when doing
canonical checks.
In particular:
1. MSRs which contain a linear address allow a full 57-bit canonical address
regardless of CR4.LA57 state. For example: MSR_KERNEL_GS_BASE.
2. All hidden segment bases and GDT/IDT bases also behave like MSRs.
This means that a full 57-bit canonical address can be loaded to them
regardless of CR4.LA57, both using MSRs (e.g. GS_BASE) and instructions
(e.g. LGDT).
3. TLB invalidation instructions also allow the user to use a full 57-bit
address regardless of CR4.LA57.
Finally, it must be noted that the CPU doesn't prevent the user from
disabling 5-level paging, even when a full 57-bit canonical address is
present in one of the registers mentioned above (e.g. GDT base).
In fact, this can happen without any userspace help, when the CPU enters
SMM mode: some MSRs, for example MSR_KERNEL_GS_BASE, are left to contain
a non-canonical address in regard to the new mode.
Since most of the affected MSRs and all segment bases can be read and
written freely by the guest without any KVM intervention, this patch makes
the emulator closely follow hardware behavior, which means that the
emulator doesn't take into account the guest CPUID support for 5-level
paging, and only takes into account the host CPU support.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240906221824.491834-4-mlevitsk@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-06 18:18:23 -04:00
#include "cpuid.h"
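The canonical checks discussed in the commit message above can be illustrated with a small userspace sketch. This is not KVM's implementation: the helper name `is_canonical_sketch` and its `vaddr_bits` parameter are assumptions made for illustration. An address is canonical for N virtual-address bits when bits 63 down to N-1 are all equal, so one value can pass a 57-bit check and fail a 48-bit one.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a KVM function: returns true if bits
 * 63..(vaddr_bits - 1) of 'la' are all zeros or all ones.
 * vaddr_bits is assumed to be 48 (4-level) or 57 (5-level paging).
 */
static bool is_canonical_sketch(uint64_t la, unsigned int vaddr_bits)
{
	uint64_t upper = la >> (vaddr_bits - 1);
	uint64_t all_ones = UINT64_MAX >> (vaddr_bits - 1);

	return upper == 0 || upper == all_ones;
}
```

Under these assumptions, 0x0000800000000000 is rejected for 48 bits but accepted for 57 bits, which is the kind of value that can remain in an MSR after 5-level paging is disabled.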
2008-07-03 14:59:22 +03:00

2025-02-27 09:20:08 +08:00
#define KVM_MAX_MCE_BANKS 32

2022-05-24 21:56:23 +08:00
struct kvm_caps {
	/* control of guest tsc rate supported? */
	bool has_tsc_control;
	/* maximum supported tsc_khz for guests */
	u32 max_guest_tsc_khz;
	/* number of bits of the fractional part of the TSC scaling ratio */
	u8 tsc_scaling_ratio_frac_bits;
	/* maximum allowed value of TSC scaling ratio */
	u64 max_tsc_scaling_ratio;
	/* 1ull << kvm_caps.tsc_scaling_ratio_frac_bits */
	u64 default_tsc_scaling_ratio;
	/* bus lock detection supported? */
	bool has_bus_lock_exit;
2022-05-24 21:56:24 +08:00
	/* notify VM exit supported? */
	bool has_notify_vmexit;
2024-04-04 08:13:18 -04:00
	/* bit mask of VM types */
	u32 supported_vm_types;
2022-05-24 21:56:23 +08:00

	u64 supported_mce_cap;
	u64 supported_xcr0;
	u64 supported_xss;
2022-10-06 00:03:11 +00:00
	u64 supported_perf_cap;
2025-02-24 15:08:32 +08:00

	u64 supported_quirks;
2025-03-03 11:18:38 -05:00
	u64 inapplicable_quirks;
2022-05-24 21:56:23 +08:00
};

2024-04-23 15:15:18 -07:00
struct kvm_host_values {
2024-04-23 15:15:21 -07:00
	/*
	 * The host's raw MAXPHYADDR, i.e. the number of non-reserved physical
	 * address bits irrespective of features that repurpose legal bits,
	 * e.g. MKTME.
	 */
	u8 maxphyaddr;

2024-04-23 15:15:18 -07:00
	u64 efer;
	u64 xcr0;
	u64 xss;
	u64 arch_capabilities;
};

2021-08-09 10:39:55 -07:00
void kvm_spurious_fault(void);

KVM: x86: Use kvzalloc() to allocate VM struct
Allocate VM structs via kvzalloc(), i.e. try to use a contiguous physical
allocation before falling back to __vmalloc(), to avoid the overhead of
establishing the virtual mappings. For non-debug builds, the SVM and VMX
(and TDX) structures are now just below 7000 bytes in the worst case
scenario (see below), i.e. are order-1 allocations, and will likely remain
that way for quite some time.
Add compile-time assertions in vendor code to ensure the size of the
structures, sans the memslot hash tables, are order-0 allocations, i.e.
are less than 4KiB. There's nothing fundamentally wrong with a larger
kvm_{svm,vmx,tdx} size, but given that the size of the structure (without
the memslots hash tables) is below 2KiB after 18+ years of existence,
more than doubling the size would be quite notable.
Add sanity checks on the memslot hash table sizes, partly to ensure they
aren't resized without accounting for the impact on VM structure size, and
partly to document that the majority of the size of VM structures comes
from the memslots.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250523001138.3182794-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-22 17:11:37 -07:00
#define SIZE_OF_MEMSLOTS_HASHTABLE \
	(sizeof(((struct kvm_memslots *)0)->id_hash) * 2 * KVM_MAX_NR_ADDRESS_SPACES)

/* Sanity check the size of the memslot hash tables. */
static_assert(SIZE_OF_MEMSLOTS_HASHTABLE ==
	      (1024 * (1 + IS_ENABLED(CONFIG_X86_64)) * (1 + IS_ENABLED(CONFIG_KVM_SMM))));

/*
 * Assert that "struct kvm_{svm,vmx,tdx}" is an order-0 or order-1 allocation.
 * Spilling over to an order-2 allocation isn't fundamentally problematic, but
 * isn't expected to happen in the foreseeable future (O(years)). Assert that
 * the size is an order-0 allocation when ignoring the memslot hash tables, to
 * help detect and debug unexpected size increases.
 */
#define KVM_SANITY_CHECK_VM_STRUCT_SIZE(x) \
do { \
	BUILD_BUG_ON(get_order(sizeof(struct x) - SIZE_OF_MEMSLOTS_HASHTABLE) && \
		     !IS_ENABLED(CONFIG_DEBUG_KERNEL) && !IS_ENABLED(CONFIG_KASAN)); \
	BUILD_BUG_ON(get_order(sizeof(struct x)) > 1 && \
		     !IS_ENABLED(CONFIG_DEBUG_KERNEL) && !IS_ENABLED(CONFIG_KASAN)); \
} while (0)

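The order arithmetic behind these assertions can be checked with a rough userspace sketch. It assumes 4 KiB pages and uses a hypothetical helper, `order_of_sketch`, that mimics what `get_order()` is expected to return for a non-zero size. It is not the kernel's implementation.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages, as on x86 */

/*
 * Hypothetical stand-in for get_order(): the smallest order such that
 * (SKETCH_PAGE_SIZE << order) >= size. 'size' must be non-zero.
 */
static unsigned int order_of_sketch(size_t size)
{
	unsigned int order = 0;
	size_t span = SKETCH_PAGE_SIZE;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

With the "just below 7000 bytes" figure from the commit message, `order_of_sketch(7000)` is 1. Subtracting a 4096-byte hash table (1024 * 2 * 2, i.e. both CONFIG_X86_64 and CONFIG_KVM_SMM enabled) leaves 2904 bytes, which is order 0. Those are the two conditions the macro enforces at build time.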
2021-02-03 16:01:16 -08:00
#define KVM_NESTED_VMENTER_CONSISTENCY_CHECK(consistency_check) \
({ \
	bool failed = (consistency_check); \
	if (failed) \
		trace_kvm_nested_vmenter_failed(#consistency_check, 0); \
	failed; \
})

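The macro above evaluates a condition once, reports the stringified condition if it is true, and yields the result. A minimal userspace sketch of the same pattern follows. The recorder function, the counter, and the example caller are hypothetical stand-ins for the kernel tracepoint and its users, and the block relies on GNU C statement expressions (GCC or Clang).

```c
#include <stdbool.h>
#include <stddef.h>

static unsigned int nr_failed_checks;	/* hypothetical counter, stands in for tracing */
static const char *last_failed_check;	/* text of the most recent failed check */

static void record_failed_check(const char *what, unsigned int err)
{
	(void)err;
	last_failed_check = what;
	nr_failed_checks++;
}

/* Same shape as the kernel macro: evaluate once, record on failure, yield the result. */
#define CONSISTENCY_CHECK_SKETCH(check)			\
({							\
	bool failed = (check);				\
	if (failed)					\
		record_failed_check(#check, 0);		\
	failed;						\
})

/* Example caller: returns true if the (made-up) control value is acceptable. */
static bool control_value_ok(unsigned int value)
{
	if (CONSISTENCY_CHECK_SKETCH(value > 255))
		return false;
	return true;
}
```

Because the condition is stringified with `#`, the recorded text is the source expression itself (here `value > 255`), which makes it clear which check rejected the input.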
2023-03-10 16:46:00 -08:00
/*
 * The first...last VMX feature MSRs that are emulated by KVM. This may or may
 * not cover all known VMX MSRs, as KVM doesn't emulate an MSR until there's an
 * associated feature that KVM supports for nested virtualization.
 */
#define KVM_FIRST_EMULATED_VMX_MSR MSR_IA32_VMX_BASIC
#define KVM_LAST_EMULATED_VMX_MSR MSR_IA32_VMX_VMFUNC

2018-03-16 16:37:24 -04:00
#define KVM_DEFAULT_PLE_GAP 128
#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
#define KVM_DEFAULT_PLE_WINDOW_GROW 2
#define KVM_DEFAULT_PLE_WINDOW_SHRINK 0
#define KVM_VMX_DEFAULT_PLE_WINDOW_MAX UINT_MAX
2018-03-16 16:37:26 -04:00
#define KVM_SVM_DEFAULT_PLE_WINDOW_MAX USHRT_MAX
#define KVM_SVM_DEFAULT_PLE_WINDOW 3000
2018-03-16 16:37:24 -04:00

static inline unsigned int __grow_ple_window(unsigned int val,
		unsigned int base, unsigned int modifier, unsigned int max)
{
	u64 ret = val;

	if (modifier < 1)
		return base;

	if (modifier < base)
		ret *= modifier;
	else
		ret += modifier;

	return min(ret, (u64)max);
}

static inline unsigned int __shrink_ple_window(unsigned int val,
		unsigned int base, unsigned int modifier, unsigned int min)
{
	if (modifier < 1)
		return base;

	if (modifier < base)
		val /= modifier;
	else
		val -= modifier;

	return max(val, min);
}

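The grow and shrink helpers treat the modifier as a multiplier or divisor when it is smaller than the base window, and as an additive step otherwise. A modifier of 0 resets the window to the base. The userspace copies below are a sketch for experimenting with those rules: the `_sketch` names are hypothetical, and the kernel's `u64`, `min()` and `max()` are replaced by `uint64_t` and plain comparisons.

```c
#include <limits.h>
#include <stdint.h>

/* Userspace copy of the grow rule; the 64-bit intermediate avoids overflow. */
static unsigned int grow_ple_window_sketch(unsigned int val, unsigned int base,
					   unsigned int modifier, unsigned int max)
{
	uint64_t ret = val;

	if (modifier < 1)
		return base;

	if (modifier < base)
		ret *= modifier;	/* small modifier: multiply */
	else
		ret += modifier;	/* large modifier: add */

	return ret < max ? (unsigned int)ret : max;
}

/* Userspace copy of the shrink rule, clamped from below by 'min'. */
static unsigned int shrink_ple_window_sketch(unsigned int val, unsigned int base,
					     unsigned int modifier, unsigned int min)
{
	if (modifier < 1)
		return base;

	if (modifier < base)
		val /= modifier;	/* small modifier: divide */
	else
		val -= modifier;	/* large modifier: subtract */

	return val > min ? val : min;
}
```

With the VMX defaults shown above (window 4096, grow factor 2), one grow step yields 8192. The default shrink modifier of 0 sends the window straight back to the base value.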
2024-06-05 16:19:10 -07:00
#define MSR_IA32_CR_PAT_DEFAULT \
	PAT_VALUE(WB, WT, UC_MINUS, UC, WB, WT, UC_MINUS, UC)
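As a worked example, the default PAT value can be computed by hand. The sketch below assumes that `PAT_VALUE()` packs its eight arguments into consecutive bytes with the first argument in the lowest byte (the macro is defined elsewhere and not shown here), and it uses the architectural memory-type encodings. The `SK_`-prefixed names and the helper functions are hypothetical.

```c
#include <stdint.h>

/* Architectural memory-type encodings used in PAT entries. */
enum pat_type_sketch {
	SK_PAT_UC = 0,
	SK_PAT_WC = 1,
	SK_PAT_WT = 4,
	SK_PAT_WP = 5,
	SK_PAT_WB = 6,
	SK_PAT_UC_MINUS = 7,
};

/* Pack eight PAT entries into a 64-bit value, entry 0 in the lowest byte. */
static uint64_t pat_value_sketch(const uint8_t entries[8])
{
	uint64_t value = 0;

	for (int i = 0; i < 8; i++)
		value |= (uint64_t)entries[i] << (8 * i);
	return value;
}

/* The entry order used by MSR_IA32_CR_PAT_DEFAULT above. */
static uint64_t pat_default_sketch(void)
{
	static const uint8_t entries[8] = {
		SK_PAT_WB, SK_PAT_WT, SK_PAT_UC_MINUS, SK_PAT_UC,
		SK_PAT_WB, SK_PAT_WT, SK_PAT_UC_MINUS, SK_PAT_UC,
	};

	return pat_value_sketch(entries);
}
```

Under that packing assumption the result is 0x0007040600070406.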
2015-04-27 15:11:25 +02:00

2021-11-25 01:49:43 +00:00
void kvm_service_local_tlb_flush_requests(struct kvm_vcpu *vcpu);
2021-03-02 09:45:14 -08:00
int kvm_check_nested_events(struct kvm_vcpu *vcpu);

KVM: x86: Forcibly leave nested if RSM to L2 hits shutdown
Leave nested mode before synthesizing shutdown (a.k.a. TRIPLE_FAULT) if
RSM fails when resuming L2 (a.k.a. guest mode). Architecturally, shutdown
on RSM occurs _before_ the transition back to guest mode on both Intel and
AMD.
On Intel, per the SDM pseudocode, SMRAM state is loaded before critical
VMX state:
restore state normally from SMRAM;
...
CR4.VMXE := value stored internally;
IF internal storage indicates that the logical processor had been in
VMX operation (root or non-root)
THEN
enter VMX operation (root or non-root);
restore VMX-critical state as defined in Section 32.14.1;
...
restore current VMCS pointer;
FI;
AMD's APM is both less clearcut and more explicit. Because AMD CPUs save
VMCB and guest state in SMRAM itself, given the lack of anything in the
APM to indicate a shutdown in guest mode is possible, a straightforward
reading of the clause on invalid state is that _what_ state is invalid is
irrelevant, i.e. all roads lead to shutdown.
An RSM causes a processor shutdown if an invalid-state condition is
found in the SMRAM state-save area.
This fixes a bug found by syzkaller where synthesizing shutdown for L2
led to a nested VM-Exit (if L1 is intercepting shutdown), which in turn
caused KVM to complain about trying to cancel a nested VM-Enter (see
commit 759cbd59674a ("KVM: x86: nSVM/nVMX: set nested_run_pending on VM
entry which is a result of RSM").
Note, Paolo pointed out that KVM shouldn't set nested_run_pending until
after loading SMRAM state. But as above, that's only half the story, KVM
shouldn't transition to guest mode either. Unfortunately, fixing that
mess requires rewriting the nVMX and nSVM RSM flows to not piggyback
their nested VM-Enter flows, as executing the nested VM-Enter flows after
loading state from SMRAM would clobber much of said state.
For now, add a FIXME to call out that transitioning to guest mode before
loading state from SMRAM is wrong.
Link: https://lore.kernel.org/all/CABgObfYaUHXyRmsmg8UjRomnpQ0Jnaog9-L2gMjsjkqChjDYUQ@mail.gmail.com
Reported-by: syzbot+988d9efcdf137bc05f66@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/0000000000007a9acb06151e1670@google.com
Reported-by: Zheyu Ma <zheyuma97@gmail.com>
Closes: https://lore.kernel.org/all/CAMhUBjmXMYsEoVYw_M8hSZjBMHh24i88QYm-RY6HDta5YZ7Wgw@mail.gmail.com
Analyzed-by: Michal Wilczynski <michal.wilczynski@intel.com>
Cc: Kishen Maloor <kishen.maloor@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240906161337.1118412-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-06 09:13:37 -07:00
/* Forcibly leave the nested mode in cases like a vCPU reset */
static inline void kvm_leave_nested(struct kvm_vcpu *vcpu)
{
	kvm_x86_ops.nested_ops->leave_nested(vcpu);
}

2025-02-21 16:33:52 +00:00
/*
 * If IBRS is advertised to the vCPU, KVM must flush the indirect branch
 * predictors when transitioning from L2 to L1, as L1 expects hardware (KVM in
 * this case) to provide separate predictor modes. Bare metal isolates the host
 * from the guest, but doesn't isolate different guests from one another (in
 * this case L1 and L2). The exception is if bare metal supports same mode IBRS,
 * which offers protection within the same mode, and hence protects L1 from L2.
 */
static inline void kvm_nested_vmexit_handle_ibrs(struct kvm_vcpu *vcpu)
{
	if (cpu_feature_enabled(X86_FEATURE_AMD_IBRS_SAME_MODE))
		return;

	if (guest_cpu_cap_has(vcpu, X86_FEATURE_SPEC_CTRL) ||
	    guest_cpu_cap_has(vcpu, X86_FEATURE_AMD_IBRS))
		indirect_branch_prediction_barrier();
}

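The decision made by `kvm_nested_vmexit_handle_ibrs()` can be restated as a pure function over three booleans, which makes the cases easy to enumerate. This is only a sketch: the function and parameter names are hypothetical, and the real code reads these facts from CPU feature bits rather than from arguments.

```c
#include <stdbool.h>

/*
 * Hypothetical restatement of the rule above: a barrier is needed on a
 * nested VM-Exit unless the host provides same-mode IBRS, and only if
 * some form of IBRS is advertised to the guest.
 */
static bool needs_barrier_on_nested_vmexit_sketch(bool host_same_mode_ibrs,
						  bool guest_has_spec_ctrl,
						  bool guest_has_amd_ibrs)
{
	if (host_same_mode_ibrs)
		return false;

	return guest_has_spec_ctrl || guest_has_amd_ibrs;
}
```

Host support for same-mode IBRS short-circuits the check, matching the early return in the original function.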
2023-03-10 16:45:59 -08:00
static inline bool kvm_vcpu_has_run(struct kvm_vcpu *vcpu)
{
	return vcpu->arch.last_vmentry_cpu != -1;
}

2025-01-13 12:01:43 -08:00
static inline void kvm_set_mp_state(struct kvm_vcpu *vcpu, int mp_state)
{
	vcpu->arch.mp_state = mp_state;
2025-01-13 12:01:44 -08:00
	if (mp_state == KVM_MP_STATE_RUNNABLE)
		vcpu->arch.pv.pv_unhalted = false;
2025-01-13 12:01:43 -08:00
}

KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
Morph pending exceptions to pending VM-Exits (due to interception) when
the exception is queued instead of waiting until nested events are
checked at VM-Entry. This fixes a longstanding bug where KVM fails to
handle an exception that occurs during delivery of a previous exception,
KVM (L0) and L1 both want to intercept the exception (e.g. #PF for shadow
paging), and KVM determines that the exception is in the guest's domain,
i.e. queues the new exception for L2. Deferring the interception check
causes KVM to escalate various combinations of injected+pending exceptions
to double fault (#DF) without consulting L1's interception desires, and
ends up injecting a spurious #DF into L2.
KVM has fudged around the issue for #PF by special casing emulated #PF
injection for shadow paging, but the underlying issue is not unique to
shadow paging in L0, e.g. if KVM is intercepting #PF because the guest
has a smaller maxphyaddr and L1 (but not L0) is using shadow paging.
Other exceptions are affected as well, e.g. if KVM is intercepting #GP
for one of SVM's workarounds or for the VMware backdoor emulation stuff.
The other cases have gone unnoticed because the #DF is spurious if and
only if L1 resolves the exception, e.g. KVM's goofs go unnoticed if L1
would have injected #DF anyways.
The hack-a-fix has also led to ugly code, e.g. bailing from the emulator
if #PF injection forced a nested VM-Exit and the emulator finds itself
back in L1. Allowing for direct-to-VM-Exit queueing also neatly solves
the async #PF in L2 mess; no need to set a magic flag and token, simply
queue a #PF nested VM-Exit.
Deal with event migration by flagging that a pending exception was queued
by userspace and check for interception at the next KVM_RUN, e.g. so that
KVM does the right thing regardless of the order in which userspace
restores nested state vs. event state.
When "getting" events from userspace, simply drop any pending exception
that is destined to be intercepted if there is also an injected exception
to be migrated. Ideally, KVM would migrate both events, but that would
require new ABI, and practically speaking losing the event is unlikely to
be noticed, let alone fatal. The injected exception is captured, RIP
still points at the original faulting instruction, etc... So either the
injection on the target will trigger the same intercepted exception, or
the source of the intercepted exception was transient and/or
non-deterministic, thus dropping it is ok-ish.
Fixes: a04aead144fd ("KVM: nSVM: fix running nested guests when npt=0")
Fixes: feaf0c7dc473 ("KVM: nVMX: Do not generate #DF if #PF happens during exception delivery into L2")
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220830231614.3580124-22-seanjc@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-30 23:16:08 +00:00
static inline bool kvm_is_exception_pending(struct kvm_vcpu *vcpu)
{
	return vcpu->arch.exception.pending ||
2022-08-30 23:16:09 +00:00
	       vcpu->arch.exception_vmexit.pending ||
	       kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu);
KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
2022-08-30 23:16:08 +00:00
}

2008-07-03 14:59:22 +03:00
static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)
{
2017-11-19 18:25:43 +02:00
	vcpu->arch.exception.pending = false;
2017-08-24 03:35:09 -07:00
	vcpu->arch.exception.injected = false;
KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
2022-08-30 23:16:08 +00:00
	vcpu->arch.exception_vmexit.pending = false;
2008-07-03 14:59:22 +03:00
}

2009-05-11 13:35:50 +03:00
static inline void kvm_queue_interrupt(struct kvm_vcpu *vcpu, u8 vector,
	bool soft)
2008-07-03 15:17:01 +03:00
{
KVM: x86: Rename interrupt.pending to interrupt.injected
For exception and NMI events, KVM code uses the following
coding convention:
*) "pending" represents an event that should be injected to guest at
some point but its side effects have not yet occurred.
*) "injected" represents an event whose side effects have already
occurred.
However, interrupts don't conform to this coding convention.
All current code flows mark interrupt.pending when its side effects
have already taken place (for example, bit moved from LAPIC IRR to
ISR). Therefore, it makes sense to just rename
interrupt.pending to interrupt.injected.
This change follows the logic of previous commit 664f8e26b00c ("KVM: X86:
Fix loss of exception which has not yet been injected") which changed
exception to follow this coding convention as well.
It is important to note that in case !lapic_in_kernel(vcpu),
interrupt.pending usage was and still is incorrect.
In this case, interrupt.pending can only be set using one of the
following ioctls: KVM_INTERRUPT, KVM_SET_VCPU_EVENTS and
KVM_SET_SREGS. Looking at how QEMU uses these ioctls, one can see that
QEMU uses them either to re-set an "interrupt.pending" state it has
received from KVM (via KVM_GET_VCPU_EVENTS interrupt.pending or
via KVM_GET_SREGS interrupt_bitmap) or by dispatching a new interrupt
from QEMU's emulated LAPIC which resets the bit in IRR and sets the bit in ISR
before sending the ioctl to KVM. So it seems that indeed "interrupt.pending"
in this case is also supposed to represent "interrupt.injected".
However, kvm_cpu_has_interrupt() & kvm_cpu_has_injectable_intr()
are misusing (now named) interrupt.injected in order to return whether
there is a pending interrupt.
This leads to nVMX/nSVM not being able to distinguish whether it should exit
from L2 to L1 on EXTERNAL_INTERRUPT on a pending interrupt or should
re-inject an injected interrupt.
Therefore, add a FIXME at these functions for handling this issue.
This patch introduces no semantic change.
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-23 03:01:31 +03:00
	vcpu->arch.interrupt.injected = true;
2009-05-11 13:35:50 +03:00
	vcpu->arch.interrupt.soft = soft;
2008-07-03 15:17:01 +03:00
	vcpu->arch.interrupt.nr = vector;
}

static inline void kvm_clear_interrupt_queue(struct kvm_vcpu *vcpu)
{
2018-03-23 03:01:31 +03:00
|
|
|
vcpu->arch.interrupt.injected = false;
|
2008-07-03 15:17:01 +03:00
|
|
|
}
|
|
|
|
|
2009-05-11 13:35:46 +03:00
|
|
|
static inline bool kvm_event_needs_reinjection(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
2018-03-23 03:01:31 +03:00
|
|
|
return vcpu->arch.exception.injected || vcpu->arch.interrupt.injected ||
|
2009-05-11 13:35:46 +03:00
|
|
|
vcpu->arch.nmi_injected;
|
|
|
|
}
|
2009-05-11 13:35:50 +03:00
|
|
|
|
|
|
|
static inline bool kvm_exception_is_soft(unsigned int nr)
|
|
|
|
{
|
|
|
|
return (nr == BP_VECTOR) || (nr == OF_VECTOR);
|
|
|
|
}
|
2009-07-05 17:39:35 +03:00
|
|
|
|
2010-01-21 15:31:48 +02:00
|
|
|
static inline bool is_protmode(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
2023-03-22 12:58:21 +08:00
|
|
|
return kvm_is_cr0_bit_set(vcpu, X86_CR0_PE);
|
2010-01-21 15:31:48 +02:00
|
|
|
}
|
|
|
|
|
2023-03-22 12:58:24 +08:00
|
|
|
static inline bool is_long_mode(struct kvm_vcpu *vcpu)
|
2010-01-21 15:31:49 +02:00
|
|
|
{
|
|
|
|
#ifdef CONFIG_X86_64
|
2023-03-22 12:58:24 +08:00
|
|
|
return !!(vcpu->arch.efer & EFER_LMA);
|
2010-01-21 15:31:49 +02:00
|
|
|
#else
|
2023-03-22 12:58:24 +08:00
|
|
|
return false;
|
2010-01-21 15:31:49 +02:00
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2014-06-18 17:19:23 +03:00
|
|
|
static inline bool is_64_bit_mode(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
int cs_db, cs_l;
|
|
|
|
|
2021-05-24 12:48:57 -05:00
|
|
|
WARN_ON_ONCE(vcpu->arch.guest_state_protected);
|
|
|
|
|
2014-06-18 17:19:23 +03:00
|
|
|
if (!is_long_mode(vcpu))
|
|
|
|
return false;
|
2024-05-07 21:31:02 +08:00
|
|
|
kvm_x86_call(get_cs_db_l_bits)(vcpu, &cs_db, &cs_l);
|
2014-06-18 17:19:23 +03:00
|
|
|
return cs_l;
|
|
|
|
}
|
|
|
|
|
2021-05-24 12:48:57 -05:00
|
|
|
static inline bool is_64_bit_hypercall(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* If running with protected guest state, the CS register is not
|
|
|
|
* accessible. The hypercall register values will have had to be
|
|
|
|
* provided in 64-bit mode, so assume the guest is in 64-bit.
|
|
|
|
*/
|
|
|
|
return vcpu->arch.guest_state_protected || is_64_bit_mode(vcpu);
|
|
|
|
}
|
|
|
|
|
2018-06-20 17:21:29 -07:00
|
|
|
static inline bool x86_exception_has_error_code(unsigned int vector)
|
|
|
|
{
|
|
|
|
static u32 exception_has_error_code = BIT(DF_VECTOR) | BIT(TS_VECTOR) |
|
|
|
|
BIT(NP_VECTOR) | BIT(SS_VECTOR) | BIT(GP_VECTOR) |
|
|
|
|
BIT(PF_VECTOR) | BIT(AC_VECTOR);
|
|
|
|
|
|
|
|
return (1U << vector) & exception_has_error_code;
|
|
|
|
}
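The bitmask lookup used by x86_exception_has_error_code() can be tried outside the kernel. What follows is a minimal user-space sketch, not the kernel code: the vector numbers are the architectural x86 values, while BIT32(), the function name exception_has_error_code(), and the explicit range check are assumptions added here for illustration (shifting a 32-bit value by 32 or more is undefined in C, so the sketch rejects such vectors up front).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 exception vector numbers. */
enum {
	DF_VECTOR = 8,   /* double fault */
	TS_VECTOR = 10,  /* invalid TSS */
	NP_VECTOR = 11,  /* segment not present */
	SS_VECTOR = 12,  /* stack-segment fault */
	GP_VECTOR = 13,  /* general protection */
	PF_VECTOR = 14,  /* page fault */
	AC_VECTOR = 17,  /* alignment check */
};

/* Hypothetical helper for this sketch, standing in for the kernel's BIT(). */
#define BIT32(n) (UINT32_C(1) << (n))

/* Every vector in the mask must fit into 32 bits. */
static_assert(AC_VECTOR < 32, "vectors must fit in a 32-bit mask");

static bool exception_has_error_code(unsigned int vector)
{
	static const uint32_t mask =
		BIT32(DF_VECTOR) | BIT32(TS_VECTOR) | BIT32(NP_VECTOR) |
		BIT32(SS_VECTOR) | BIT32(GP_VECTOR) | BIT32(PF_VECTOR) |
		BIT32(AC_VECTOR);

	/* Added guard: avoid an out-of-range shift for vectors >= 32. */
	if (vector >= 32)
		return false;

	return (BIT32(vector) & mask) != 0;
}
```

A single AND against a constant keeps the check branch-light; the cost is that the table only covers vectors 0 to 31, which is where all architectural exceptions live.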
|
|
|
|
|
2010-09-10 17:30:50 +02:00
|
|
|
static inline bool mmu_is_nested(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return vcpu->arch.walk_mmu == &vcpu->arch.nested_mmu;
|
|
|
|
}
|
|
|
|
|
KVM: x86: Use boolean return value for is_{pae,pse,paging}()
Convert is_{pae,pse,paging}() to use kvm_is_cr{0,4}_bit_set() and return
bools. Returning an "int" requires not one, but two implicit casts, first
from "unsigned long" to "int", and then again to a "bool". Both casts are
more than a bit dangerous; the ulong=>int casts would drop a bit on 64-bit
kernels _if_ the bits in question weren't in the lower 32 bits, and the
int=>bool cast can result in false negatives/positives, e.g. see commit
0c928ff26bd6 ("KVM: SVM: Fix benign "bool vs. int" comparison in
svm_set_cr0()").
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20230322045824.22970-3-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-22 12:58:22 +08:00
|
|
|
static inline bool is_pae(struct kvm_vcpu *vcpu)
|
2010-01-21 15:31:49 +02:00
|
|
|
{
|
2023-03-22 12:58:22 +08:00
|
|
|
return kvm_is_cr4_bit_set(vcpu, X86_CR4_PAE);
|
2010-01-21 15:31:49 +02:00
|
|
|
}
|
|
|
|
|
2023-03-22 12:58:22 +08:00
|
|
|
static inline bool is_pse(struct kvm_vcpu *vcpu)
|
2010-01-21 15:31:49 +02:00
|
|
|
{
|
2023-03-22 12:58:22 +08:00
|
|
|
return kvm_is_cr4_bit_set(vcpu, X86_CR4_PSE);
|
2010-01-21 15:31:49 +02:00
|
|
|
}
|
|
|
|
|
2023-03-22 12:58:22 +08:00
|
|
|
static inline bool is_paging(struct kvm_vcpu *vcpu)
|
2010-01-21 15:31:49 +02:00
|
|
|
{
|
2023-03-22 12:58:22 +08:00
|
|
|
return likely(kvm_is_cr0_bit_set(vcpu, X86_CR0_PG));
|
2010-01-21 15:31:49 +02:00
|
|
|
}
|
|
|
|
|
2019-06-06 18:52:44 +02:00
|
|
|
static inline bool is_pae_paging(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return !is_long_mode(vcpu) && is_pae(vcpu) && is_paging(vcpu);
|
|
|
|
}
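The mode predicates above reduce to tests on three registers. The sketch below models them in user space under stated assumptions: struct regs is a hypothetical stand-in for the vCPU state, the bit positions are the architectural ones (CR0.PG is bit 31, CR4.PAE is bit 5, EFER.LMA is bit 10), and each helper returns a bool built with an explicit != 0 comparison rather than relying on integer truncation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural control-register and EFER bits. */
#define X86_CR0_PG  (UINT64_C(1) << 31)
#define X86_CR4_PAE (UINT64_C(1) << 5)
#define EFER_LMA    (UINT64_C(1) << 10)

/* Hypothetical, simplified stand-in for the vCPU register state. */
struct regs {
	uint64_t cr0;
	uint64_t cr4;
	uint64_t efer;
};

static bool is_long_mode(const struct regs *r)
{
	assert(r != NULL);
	return (r->efer & EFER_LMA) != 0;
}

static bool is_pae(const struct regs *r)
{
	assert(r != NULL);
	return (r->cr4 & X86_CR4_PAE) != 0;
}

static bool is_paging(const struct regs *r)
{
	assert(r != NULL);
	return (r->cr0 & X86_CR0_PG) != 0;
}

/* PAE paging: paging on, PAE on, and the CPU not in long mode. */
static bool is_pae_paging(const struct regs *r)
{
	return !is_long_mode(r) && is_pae(r) && is_paging(r);
}
```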
|
|
|
|
|
2017-08-24 20:27:56 +08:00
|
|
|
static inline u8 vcpu_virt_addr_bits(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
2023-03-22 12:58:21 +08:00
|
|
|
return kvm_is_cr4_bit_set(vcpu, X86_CR4_LA57) ? 57 : 48;
|
2017-08-24 20:27:56 +08:00
|
|
|
}
|
|
|
|
|
KVM: x86: model canonical checks more precisely
As a result of a recent investigation, it was determined that x86 CPUs
which support 5-level paging don't always respect CR4.LA57 when doing
canonical checks.
In particular:
1. MSRs which contain a linear address allow a full 57-bit canonical address
regardless of CR4.LA57 state. For example: MSR_KERNEL_GS_BASE.
2. All hidden segment bases and GDT/IDT bases also behave like MSRs.
This means that a full 57-bit canonical address can be loaded into them
regardless of CR4.LA57, both using MSRs (e.g. GS_BASE) and instructions
(e.g. LGDT).
3. TLB invalidation instructions also allow the user to use a full 57-bit
address regardless of CR4.LA57.
Finally, it must be noted that the CPU doesn't prevent the user from
disabling 5-level paging, even when a full 57-bit canonical address is
present in one of the registers mentioned above (e.g. GDT base).
In fact, this can happen without any userspace help, when the CPU enters
SMM mode - some MSRs, for example MSR_KERNEL_GS_BASE, are left containing
a non-canonical address with regard to the new mode.
Since most of the affected MSRs and all segment bases can be read and
written freely by the guest without any KVM intervention, this patch makes
the emulator closely follow hardware behavior, which means that the
emulator doesn't take into account the guest CPUID support for 5-level
paging, and only takes into account the host CPU support.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240906221824.491834-4-mlevitsk@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-06 18:18:23 -04:00
|
|
|
static inline u8 max_host_virt_addr_bits(void)
|
2017-08-24 20:27:56 +08:00
|
|
|
{
|
2024-09-06 18:18:23 -04:00
|
|
|
return kvm_cpu_cap_has(X86_FEATURE_LA57) ? 57 : 48;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* x86 MSRs which contain linear addresses, x86 hidden segment bases, and
|
|
|
|
* IDT/GDT bases have static canonicality checks, the size of which depends
|
|
|
|
* only on the CPU's support for 5-level paging, rather than on the state of
|
|
|
|
* CR4.LA57. This applies to both WRMSR and to other instructions that set
|
|
|
|
* their values, e.g. LGDT.
|
|
|
|
*
|
|
|
|
* KVM passes through most of these MSRs and also doesn't intercept the
|
|
|
|
* instructions that set the hidden segment bases.
|
|
|
|
*
|
|
|
|
* Because of this, to be consistent with hardware, even if the guest doesn't
|
|
|
|
* have LA57 enabled in its CPUID, perform canonicality checks based on *host*
|
|
|
|
* support for 5 level paging.
|
|
|
|
*
|
|
|
|
* Finally, instructions which are related to MMU invalidation of a given
|
|
|
|
* linear address, also have a similar static canonical check on address.
|
|
|
|
* This allows for example to invalidate 5-level addresses of a guest from a
|
|
|
|
* host which uses 4-level paging.
|
|
|
|
*/
|
|
|
|
static inline bool is_noncanonical_address(u64 la, struct kvm_vcpu *vcpu,
|
|
|
|
unsigned int flags)
|
|
|
|
{
|
|
|
|
if (flags & (X86EMUL_F_INVLPG | X86EMUL_F_MSR | X86EMUL_F_DT_LOAD))
|
|
|
|
return !__is_canonical_address(la, max_host_virt_addr_bits());
|
|
|
|
else
|
|
|
|
return !__is_canonical_address(la, vcpu_virt_addr_bits(vcpu));
|
|
|
|
}
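The difference between the static (host-based) check and the CR4.LA57-based check is easiest to see on a concrete address. The following user-space sketch is an illustration under assumptions, not the kernel implementation: is_canonical() is a portable reimplementation of a canonicality test, and is_noncanonical() takes plain booleans (static_check, host_has_la57, guest_cr4_la57) as hypothetical replacements for the flags, kvm_cpu_cap_has() and the vCPU state used above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * An address is canonical for a given width if bits [63:vaddr_bits-1]
 * are all zeros or all ones, i.e. the top implemented bit is
 * sign-extended through bit 63.
 */
static bool is_canonical(uint64_t la, unsigned int vaddr_bits)
{
	assert(vaddr_bits >= 1 && vaddr_bits <= 64);

	uint64_t upper = la >> (vaddr_bits - 1);
	uint64_t all_ones = UINT64_MAX >> (vaddr_bits - 1);

	return upper == 0 || upper == all_ones;
}

/*
 * Hypothetical model of the selection logic: MSR writes, descriptor
 * table loads and TLB invalidations use the host's maximum width,
 * everything else uses the width implied by the guest's CR4.LA57.
 */
static bool is_noncanonical(uint64_t la, bool static_check,
			    bool host_has_la57, bool guest_cr4_la57)
{
	unsigned int bits;

	if (static_check)
		bits = host_has_la57 ? 57 : 48;
	else
		bits = guest_cr4_la57 ? 57 : 48;

	return !is_canonical(la, bits);
}
```

With these definitions, the address 0x0000800000000000 is rejected for an ordinary access by a guest using 4-level paging, yet accepted for an MSR-style check on a host that supports 5-level paging.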
|
|
|
|
|
|
|
|
static inline bool is_noncanonical_msr_address(u64 la, struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return is_noncanonical_address(la, vcpu, X86EMUL_F_MSR);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool is_noncanonical_base_address(u64 la, struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return is_noncanonical_address(la, vcpu, X86EMUL_F_DT_LOAD);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool is_noncanonical_invlpg_address(u64 la, struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return is_noncanonical_address(la, vcpu, X86EMUL_F_INVLPG);
|
2017-08-24 20:27:56 +08:00
|
|
|
}
|
|
|
|
|
2011-07-12 03:23:20 +08:00
|
|
|
static inline void vcpu_cache_mmio_info(struct kvm_vcpu *vcpu,
|
|
|
|
gva_t gva, gfn_t gfn, unsigned access)
|
|
|
|
{
|
2019-02-05 13:01:13 -08:00
|
|
|
u64 gen = kvm_memslots(vcpu->kvm)->generation;
|
|
|
|
|
KVM: Explicitly define the "memslot update in-progress" bit
KVM uses bit 0 of the memslots generation as an "update in-progress"
flag, which is used by x86 to prevent caching MMIO access while the
memslots are changing. Although the intended behavior is flag-like,
e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
caching data from in-flux memslots, the implementation oftentimes treats
the bit as part of the generation number itself, e.g. incrementing the
generation increments twice, once to set the flag and once to clear it.
Prior to commit 4bd518f1598d ("KVM: use separate generations for
each address space"), incorporating the "update in-progress" bit into
the generation number largely made sense, e.g. "real" generations are
even, "bogus" generations are odd, most code doesn't need to be aware of
the bit, etc...
Now that unique memslots generation numbers are assigned to each address
space, stealthing the in-progress status into the generation number
results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
over bit 0 when initializing the memslots generation without any hint as
to why.
Explicitly define the flag and convert as much code as possible (which
isn't much) to actually treat it like a flag. This paves the way for
eventually using a different bit for "update in-progress" so that it can
be a flag in truth instead of an awkward extension to the generation
number.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-05 13:01:14 -08:00
|
|
|
if (unlikely(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS))
|
2019-02-05 13:01:13 -08:00
|
|
|
return;
|
|
|
|
|
2017-08-17 18:36:58 +02:00
|
|
|
/*
|
|
|
|
* If this is a shadow nested page table, the "GVA" is
|
|
|
|
* actually a nGPA.
|
|
|
|
*/
|
|
|
|
vcpu->arch.mmio_gva = mmu_is_nested(vcpu) ? 0 : gva & PAGE_MASK;
|
2019-08-01 13:35:21 -07:00
|
|
|
vcpu->arch.mmio_access = access;
|
2011-07-12 03:23:20 +08:00
|
|
|
vcpu->arch.mmio_gfn = gfn;
|
2019-02-05 13:01:13 -08:00
|
|
|
vcpu->arch.mmio_gen = gen;
|
2014-08-18 15:46:07 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool vcpu_match_mmio_gen(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
return vcpu->arch.mmio_gen == kvm_memslots(vcpu->kvm)->generation;
|
2011-07-12 03:23:20 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2014-08-18 15:46:07 -07:00
|
|
|
* Clear the mmio cache info for the given gva. If gva is MMIO_GVA_ANY, we
|
|
|
|
* clear all mmio cache info.
|
2011-07-12 03:23:20 +08:00
|
|
|
*/
|
2014-08-18 15:46:07 -07:00
|
|
|
#define MMIO_GVA_ANY (~(gva_t)0)
|
|
|
|
|
2011-07-12 03:23:20 +08:00
|
|
|
static inline void vcpu_clear_mmio_info(struct kvm_vcpu *vcpu, gva_t gva)
|
|
|
|
{
|
2014-08-18 15:46:07 -07:00
|
|
|
if (gva != MMIO_GVA_ANY && vcpu->arch.mmio_gva != (gva & PAGE_MASK))
|
2011-07-12 03:23:20 +08:00
|
|
|
return;
|
|
|
|
|
|
|
|
vcpu->arch.mmio_gva = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool vcpu_match_mmio_gva(struct kvm_vcpu *vcpu, unsigned long gva)
|
|
|
|
{
|
2014-08-18 15:46:07 -07:00
|
|
|
if (vcpu_match_mmio_gen(vcpu) && vcpu->arch.mmio_gva &&
|
|
|
|
vcpu->arch.mmio_gva == (gva & PAGE_MASK))
|
2011-07-12 03:23:20 +08:00
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool vcpu_match_mmio_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
|
|
|
|
{
|
2014-08-18 15:46:07 -07:00
|
|
|
if (vcpu_match_mmio_gen(vcpu) && vcpu->arch.mmio_gfn &&
|
|
|
|
vcpu->arch.mmio_gfn == gpa >> PAGE_SHIFT)
|
2011-07-12 03:23:20 +08:00
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
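The MMIO cache helpers above combine a page-granular address match with a generation check. The sketch below models that behavior in user space as a rough illustration: struct mmio_cache, the cache_*() names, the 4 KiB page size, and the position of GEN_UPDATE_IN_PROGRESS are all assumptions made for this example, and the memslots generation is passed in as a plain integer instead of being read from the VM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  (~((UINT64_C(1) << PAGE_SHIFT) - 1))

/* Assumed flag position, chosen for this sketch only. */
#define GEN_UPDATE_IN_PROGRESS (UINT64_C(1) << 63)

/* Hypothetical stand-in for the per-vCPU MMIO cache fields. */
struct mmio_cache {
	uint64_t gva;   /* page-aligned guest virtual address, 0 = empty */
	uint64_t gfn;   /* guest frame number, 0 = empty */
	uint64_t gen;   /* memslots generation at caching time */
	unsigned int access;
};

static void cache_store(struct mmio_cache *c, uint64_t cur_gen,
			uint64_t gva, uint64_t gfn, unsigned int access)
{
	assert(c != NULL);

	/* Never cache while the memslots are being updated. */
	if (cur_gen & GEN_UPDATE_IN_PROGRESS)
		return;

	c->gva = gva & PAGE_MASK;
	c->access = access;
	c->gfn = gfn;
	c->gen = cur_gen;
}

static bool cache_match_gva(const struct mmio_cache *c, uint64_t cur_gen,
			    uint64_t gva)
{
	return c->gen == cur_gen && c->gva && c->gva == (gva & PAGE_MASK);
}

static bool cache_match_gpa(const struct mmio_cache *c, uint64_t cur_gen,
			    uint64_t gpa)
{
	return c->gen == cur_gen && c->gfn && c->gfn == (gpa >> PAGE_SHIFT);
}
```

Any change to the generation invalidates every cached entry at once, without touching the per-vCPU structures, because a stale entry simply stops matching.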
|
|
|
|
|
2021-04-21 19:21:28 -07:00
|
|
|
static inline unsigned long kvm_register_read(struct kvm_vcpu *vcpu, int reg)
|
2014-06-18 17:19:23 +03:00
|
|
|
{
|
2021-04-21 19:21:28 -07:00
|
|
|
unsigned long val = kvm_register_read_raw(vcpu, reg);
|
2014-06-18 17:19:23 +03:00
|
|
|
|
|
|
|
return is_64_bit_mode(vcpu) ? val : (u32)val;
|
|
|
|
}
|
|
|
|
|
2021-04-21 19:21:28 -07:00
|
|
|
static inline void kvm_register_write(struct kvm_vcpu *vcpu,
|
2019-09-27 14:45:20 -07:00
|
|
|
int reg, unsigned long val)
|
2014-06-18 17:19:26 +03:00
|
|
|
{
|
|
|
|
if (!is_64_bit_mode(vcpu))
|
|
|
|
val = (u32)val;
|
2021-04-21 19:21:28 -07:00
|
|
|
return kvm_register_write_raw(vcpu, reg, val);
|
2014-06-18 17:19:26 +03:00
|
|
|
}
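The effect of the two register accessors is that values are truncated to 32 bits whenever the vCPU is not in 64-bit mode. A minimal user-space sketch of that rule follows; register_read() and register_write() are hypothetical names, and the mode is passed as a boolean rather than derived from vCPU state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Outside 64-bit mode only the low 32 bits of a register are visible. */
static uint64_t register_read(uint64_t raw, bool in_64_bit_mode)
{
	return in_64_bit_mode ? raw : (uint32_t)raw;
}

/* Outside 64-bit mode the upper 32 bits of the written value are dropped. */
static void register_write(uint64_t *reg, uint64_t val, bool in_64_bit_mode)
{
	assert(reg != NULL);

	if (!in_64_bit_mode)
		val = (uint32_t)val;

	*reg = val;
}
```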
|
|
|
|
|
2015-07-23 08:22:45 +02:00
|
|
|
static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
|
|
|
|
{
|
|
|
|
return !(kvm->arch.disabled_quirks & quirk);
|
|
|
|
}
|
|
|
|
|
2019-08-27 14:40:36 -07:00
|
|
|
void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
|
2010-04-19 13:32:45 +08:00
|
|
|
|
2016-09-01 14:21:03 +02:00
|
|
|
u64 get_kvmclock_ns(struct kvm *kvm);
|
2023-10-05 10:16:10 +01:00
|
|
|
uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm);
|
KVM: x86/xen: improve accuracy of Xen timers
A test program such as http://david.woodhou.se/timerlat.c confirms user
reports that timers are increasingly inaccurate as the lifetime of a
guest increases. Reporting the actual delay observed when asking for
100µs of sleep, it starts off OK on a newly-launched guest but gets
worse over time, giving incorrect sleep times:
root@ip-10-0-193-21:~# ./timerlat -c -n 5
00000000 latency 103243/100000 (3.2430%)
00000001 latency 103243/100000 (3.2430%)
00000002 latency 103242/100000 (3.2420%)
00000003 latency 103245/100000 (3.2450%)
00000004 latency 103245/100000 (3.2450%)
The biggest problem is that get_kvmclock_ns() returns inaccurate values
when the guest TSC is scaled. The guest sees a TSC value scaled from the
host TSC by a mul/shift conversion (hopefully done in hardware). The
guest then converts that guest TSC value into nanoseconds using the
mul/shift conversion given to it by the KVM pvclock information.
But get_kvmclock_ns() performs only a single conversion directly from
host TSC to nanoseconds, giving a different result. A test program at
http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
over a day.
It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
that. The actual guest hv_clock is per-CPU, and *theoretically* each
vCPU could be running at a *different* frequency. But this patch is
needed anyway because...
The other issue with Xen timers was that the code would snapshot the
host CLOCK_MONOTONIC at some point in time, and then... after a few
interrupts may have occurred, some preemption perhaps... would also read
the guest's kvmclock. Then it would proceed under the false assumption
that those two happened at the *same* time. Any time which *actually*
elapsed between reading the two clocks was introduced as inaccuracies
in the time at which the timer fired.
Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
host TSC just *once*, then use the returned TSC value to calculate the
kvmclock (making sure to do that the way the guest would instead of
making the same mistake get_kvmclock_ns() does).
Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
timers still have to use CLOCK_MONOTONIC. In practice the difference
between the two won't matter over the timescales involved, as the
*absolute* values don't matter; just the delta.
This does mean a new variant of kvm_get_time_and_clockread() is needed;
called kvm_get_monotonic_and_clockread() because that's what it does.
Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20240227115648.3104-2-dwmw2@infradead.org
[sean: massage moved comment, tweak if statement formatting]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-27 11:49:15 +00:00
|
|
|
bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp);
|
2025-01-24 15:05:39 +00:00
|
|
|
int kvm_guest_time_update(struct kvm_vcpu *v);
|
2010-08-19 22:07:17 -10:00
|
|
|
|
2018-06-06 17:37:49 +02:00
|
|
|
int kvm_read_guest_virt(struct kvm_vcpu *vcpu,
|
2011-05-25 23:04:56 +03:00
|
|
|
gva_t addr, void *val, unsigned int bytes,
|
|
|
|
struct x86_exception *exception);
|
|
|
|
|
2018-06-06 17:37:49 +02:00
|
|
|
int kvm_write_guest_virt_system(struct kvm_vcpu *vcpu,
|
2011-05-25 23:08:00 +03:00
|
|
|
gva_t addr, void *val, unsigned int bytes,
|
|
|
|
struct x86_exception *exception);
|
|
|
|
|
2018-04-03 16:28:48 -07:00
|
|
|
int handle_ud(struct kvm_vcpu *vcpu);
|
|
|
|
|
2022-08-30 23:16:01 +00:00
|
|
|
void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
|
|
|
|
struct kvm_queued_exception *ex);
|
2018-10-16 14:29:22 -07:00
|
|
|
|
2015-06-15 16:55:22 +08:00
|
|
|
int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
|
|
|
|
int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
|
2016-01-25 16:53:33 +08:00
|
|
|
bool kvm_vector_hashing_enabled(void);
|
2020-07-10 17:48:03 +02:00
|
|
|
void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code);
|
2021-01-26 03:18:28 -05:00
|
|
|
int x86_decode_emulated_instruction(struct kvm_vcpu *vcpu, int emulation_type,
|
|
|
|
void *insn, int insn_len);
|
2019-12-06 15:57:14 -08:00
|
|
|
int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
|
2018-08-23 13:56:53 -07:00
|
|
|
int emulation_type, void *insn, int insn_len);
|
2020-04-28 14:23:25 +08:00
|
|
|
fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);
|
KVM: x86: Add fastpath handling of HLT VM-Exits
Add a fastpath for HLT VM-Exits by immediately re-entering the guest if
it has a pending wake event. When virtual interrupt delivery is enabled,
i.e. when KVM doesn't need to manually inject interrupts, this allows KVM
to stay in the fastpath run loop when a vIRQ arrives between the guest
doing CLI and STI;HLT. Without AMD's Idle HLT-intercept support, the CPU
generates a HLT VM-Exit even though KVM will immediately resume the guest.
Note, on bare metal, it's relatively uncommon for a modern guest kernel to
actually trigger this scenario, as the window between the guest checking
for a wake event and committing to HLT is quite small. But in a nested
environment, the timings change significantly, e.g. rudimentary testing
showed that ~50% of HLT exits where HLT-polling was successful would be
serviced by this fastpath, i.e. ~50% of the time that a nested vCPU gets
a wake event before KVM schedules out the vCPU, the wake event was pending
even before the VM-Exit.
Link: https://lore.kernel.org/all/20240528041926.3989-3-manali.shukla@amd.com
Link: https://lore.kernel.org/r/20240802195120.325560-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-02 12:51:20 -07:00
|
|
|
fastpath_t handle_fastpath_hlt(struct kvm_vcpu *vcpu);
|
2014-09-18 22:39:44 +03:00
|
|
|
|
2022-05-24 21:56:23 +08:00
|
|
|
extern struct kvm_caps kvm_caps;
|
2024-04-23 15:15:18 -07:00
|
|
|
extern struct kvm_host_values kvm_host;
|
2022-05-24 21:56:23 +08:00
|
|
|
|
2022-01-11 15:38:23 +08:00
|
|
|
extern bool enable_pmu;
|
2014-02-24 12:15:16 +01:00
|
|
|
|
2023-04-04 17:45:15 -07:00
|
|
|
/*
|
|
|
|
* Get a filtered version of KVM's supported XCR0 that strips out dynamic
|
|
|
|
* features for which the current process doesn't (yet) have permission to use.
|
|
|
|
* This is intended to be used only when enumerating support to userspace,
|
|
|
|
* e.g. in KVM_GET_SUPPORTED_CPUID and KVM_CAP_XSAVE2, it does NOT need to be
|
|
|
|
* used to check/restrict guest behavior as KVM rejects KVM_SET_CPUID{2} if
|
|
|
|
* userspace attempts to enable unpermitted features.
|
|
|
|
*/
|
|
|
|
static inline u64 kvm_get_filtered_xcr0(void)
|
|
|
|
{
|
2023-04-04 17:45:16 -07:00
|
|
|
u64 permitted_xcr0 = kvm_caps.supported_xcr0;
|
|
|
|
|
|
|
|
BUILD_BUG_ON(XFEATURE_MASK_USER_DYNAMIC != XFEATURE_MASK_XTILE_DATA);
|
|
|
|
|
|
|
|
if (permitted_xcr0 & XFEATURE_MASK_USER_DYNAMIC) {
|
|
|
|
permitted_xcr0 &= xstate_get_guest_group_perm();
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Treat XTILE_CFG as unsupported if the current process isn't
|
|
|
|
* allowed to use XTILE_DATA, as attempting to set XTILE_CFG in
|
|
|
|
* XCR0 without setting XTILE_DATA is architecturally illegal.
|
|
|
|
*/
|
|
|
|
if (!(permitted_xcr0 & XFEATURE_MASK_XTILE_DATA))
|
|
|
|
permitted_xcr0 &= ~XFEATURE_MASK_XTILE_CFG;
|
|
|
|
}
|
|
|
|
return permitted_xcr0;
|
2023-04-04 17:45:15 -07:00
|
|
|
}
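The filtering rule in kvm_get_filtered_xcr0() can be exercised with plain integers. This is a sketch under assumptions rather than the kernel code: filtered_xcr0() is a hypothetical name, the supported mask and the per-process permission mask are passed as arguments instead of coming from kvm_caps and xstate_get_guest_group_perm(), and the bit positions used are the architectural XCR0 bits 17 (XTILE_CFG) and 18 (XTILE_DATA).

```c
#include <assert.h>
#include <stdint.h>

/* Architectural XCR0 bits for AMX tile state. */
#define XFEATURE_MASK_XTILE_CFG    (UINT64_C(1) << 17)
#define XFEATURE_MASK_XTILE_DATA   (UINT64_C(1) << 18)
#define XFEATURE_MASK_USER_DYNAMIC XFEATURE_MASK_XTILE_DATA

/* The logic below assumes XTILE_DATA is the only dynamic feature. */
static_assert(XFEATURE_MASK_USER_DYNAMIC == XFEATURE_MASK_XTILE_DATA,
	      "XTILE_DATA is assumed to be the only dynamic feature");

static uint64_t filtered_xcr0(uint64_t supported, uint64_t process_perm)
{
	uint64_t permitted = supported;

	if (permitted & XFEATURE_MASK_USER_DYNAMIC) {
		permitted &= process_perm;

		/* XTILE_CFG without XTILE_DATA is illegal in XCR0. */
		if (!(permitted & XFEATURE_MASK_XTILE_DATA))
			permitted &= ~XFEATURE_MASK_XTILE_CFG;
	}
	return permitted;
}
```

Note that the permission mask is consulted only when a dynamic feature is supported at all; otherwise the supported mask is returned unchanged.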
|
|
|
|
|
2020-03-02 15:56:25 -08:00
|
|
|
static inline bool kvm_mpx_supported(void)
|
|
|
|
{
|
2022-05-24 21:56:23 +08:00
|
|
|
return (kvm_caps.supported_xcr0 & (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR))
|
2020-03-02 15:56:25 -08:00
|
|
|
== (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR);
|
|
|
|
}
|
|
|
|
|
2014-01-06 12:00:02 -02:00
|
|
|
extern unsigned int min_timer_period_us;
|
|
|
|
|
2018-03-12 13:12:47 +02:00
|
|
|
extern bool enable_vmware_backdoor;
|
|
|
|
|
2019-07-06 09:26:51 +08:00
|
|
|
extern int pi_inject_timer;
|
|
|
|
|
2021-01-08 09:36:55 +08:00
|
|
|
extern bool report_ignored_msrs;
|
|
|
|
|
2022-01-19 23:07:37 +00:00
|
|
|
extern bool eager_page_split;
|
|
|
|
|
2023-01-24 23:49:01 +00:00
|
|
|
static inline void kvm_pr_unimpl_wrmsr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
|
|
|
|
{
|
|
|
|
if (report_ignored_msrs)
|
|
|
|
vcpu_unimpl(vcpu, "Unhandled WRMSR(0x%x) = 0x%llx\n", msr, data);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void kvm_pr_unimpl_rdmsr(struct kvm_vcpu *vcpu, u32 msr)
|
|
|
|
{
|
|
|
|
if (report_ignored_msrs)
|
|
|
|
vcpu_unimpl(vcpu, "Unhandled RDMSR(0x%x)\n", msr);
|
|
|
|
}
|
|
|
|
|
2016-06-20 22:28:02 -03:00
|
|
|
static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
|
|
|
|
{
|
|
|
|
return pvclock_scale_delta(nsec, vcpu->arch.virtual_tsc_mult,
|
|
|
|
vcpu->arch.virtual_tsc_shift);
|
|
|
|
}
|
|
|
|
|
2016-01-22 11:39:22 +01:00
|
|
|
/* Same "calling convention" as do_div:
|
|
|
|
* - divide (n << 32) by base
|
|
|
|
* - put result in n
|
|
|
|
* - return remainder
|
|
|
|
*/
|
|
|
|
#define do_shl32_div32(n, base) \
|
|
|
|
({ \
|
|
|
|
u32 __quot, __rem; \
|
|
|
|
asm("divl %2" : "=a" (__quot), "=d" (__rem) \
|
|
|
|
: "rm" (base), "0" (0), "1" ((u32) n)); \
|
|
|
|
n = __quot; \
|
|
|
|
__rem; \
|
|
|
|
})
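The calling convention described in the comment above can be sketched in portable C. This is for illustration only and is not the kernel's implementation: the helper name `shl32_div32_portable` is hypothetical, it takes a pointer because a function cannot modify its argument the way the macro does, and it uses 64-bit arithmetic in place of the `divl` instruction. Like `divl`, it assumes the quotient fits in 32 bits, which requires n < base.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical portable counterpart of do_shl32_div32(): divide (*n << 32)
 * by base, store the quotient in *n, and return the remainder.
 */
static uint32_t shl32_div32_portable(uint32_t *n, uint32_t base)
{
	uint64_t dividend = (uint64_t)*n << 32;

	/* divl faults unless the quotient fits in 32 bits. */
	assert(base != 0 && *n < base);

	*n = (uint32_t)(dividend / base);
	return (uint32_t)(dividend % base);
}
```

For example, starting with n = 1 and base = 3, the dividend is 2^32, so n becomes 0x55555555 and the returned remainder is 1.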
|
|
|
|
|
2025-06-25 17:12:21 -07:00
|
|
|
static inline void kvm_disable_exits(struct kvm *kvm, u64 mask)
|
|
|
|
{
|
|
|
|
kvm->arch.disabled_exits |= mask;
|
|
|
|
}
|
|
|
|
|
2018-03-12 04:53:02 -07:00
|
|
|
static inline bool kvm_mwait_in_guest(struct kvm *kvm)
|
2017-04-21 12:27:17 +02:00
|
|
|
{
|
2025-06-25 17:12:21 -07:00
|
|
|
return kvm->arch.disabled_exits & KVM_X86_DISABLE_EXITS_MWAIT;
|
2017-04-21 12:27:17 +02:00
|
|
|
}
|
|
|
|
|
2018-03-12 04:53:03 -07:00
|
|
|
static inline bool kvm_hlt_in_guest(struct kvm *kvm)
|
|
|
|
{
|
2025-06-25 17:12:21 -07:00
|
|
|
return kvm->arch.disabled_exits & KVM_X86_DISABLE_EXITS_HLT;
|
2018-03-12 04:53:03 -07:00
|
|
|
}
|
|
|
|
|
2018-03-12 04:53:04 -07:00
|
|
|
static inline bool kvm_pause_in_guest(struct kvm *kvm)
|
|
|
|
{
|
2025-06-25 17:12:21 -07:00
|
|
|
return kvm->arch.disabled_exits & KVM_X86_DISABLE_EXITS_PAUSE;
|
2018-03-12 04:53:04 -07:00
|
|
|
}
|
|
|
|
|
2019-05-21 14:06:53 +08:00
|
|
|
static inline bool kvm_cstate_in_guest(struct kvm *kvm)
|
|
|
|
{
|
2025-06-25 17:12:21 -07:00
|
|
|
return kvm->arch.disabled_exits & KVM_X86_DISABLE_EXITS_CSTATE;
|
2019-05-21 14:06:53 +08:00
|
|
|
}
|
|
|
|
|
2025-06-25 17:12:22 -07:00
|
|
|
static inline bool kvm_aperfmperf_in_guest(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
return kvm->arch.disabled_exits & KVM_X86_DISABLE_EXITS_APERFMPERF;
|
|
|
|
}
|
|
|
|
|
2022-05-24 21:56:24 +08:00
|
|
|
static inline bool kvm_notify_vmexit_enabled(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
return kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_ENABLED;
|
|
|
|
}
|
|
|
|
|
2022-12-13 06:09:12 +00:00
|
|
|
static __always_inline void kvm_before_interrupt(struct kvm_vcpu *vcpu,
|
|
|
|
enum kvm_intr_type intr)
|
2017-07-25 17:20:32 -07:00
|
|
|
{
|
2021-11-11 02:07:32 +00:00
|
|
|
WRITE_ONCE(vcpu->arch.handling_intr_from_guest, (u8)intr);
|
2017-07-25 17:20:32 -07:00
|
|
|
}
|
|
|
|
|
2022-12-13 06:09:12 +00:00
|
|
|
static __always_inline void kvm_after_interrupt(struct kvm_vcpu *vcpu)
|
2017-07-25 17:20:32 -07:00
|
|
|
{
|
2021-11-11 02:07:31 +00:00
|
|
|
WRITE_ONCE(vcpu->arch.handling_intr_from_guest, 0);
|
2017-07-25 17:20:32 -07:00
|
|
|
}
|
|
|
|
|
2021-11-11 02:07:31 +00:00
|
|
|
static inline bool kvm_handling_nmi_from_guest(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
2021-11-11 02:07:32 +00:00
|
|
|
return vcpu->arch.handling_intr_from_guest == KVM_HANDLING_NMI;
|
2021-11-11 02:07:31 +00:00
|
|
|
}
|
2019-04-10 11:41:40 +02:00
|
|
|
|
|
|
|
static inline bool kvm_pat_valid(u64 data)
|
|
|
|
{
|
|
|
|
if (data & 0xF8F8F8F8F8F8F8F8ull)
|
|
|
|
return false;
|
|
|
|
/* 0, 1, 4, 5, 6, 7 are valid values. */
|
|
|
|
return (data | ((data & 0x0202020202020202ull) << 1)) == data;
|
|
|
|
}
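The bit trick in kvm_pat_valid() may be easier to follow next to a byte-by-byte version. The sketch below is a userspace illustration with hypothetical names (`pat_valid_bitwise`, `pat_valid_bytewise`), not kernel code. Each of the eight PAT entries occupies one byte; the first mask rejects any entry above 7, and the second expression rejects entries 2 and 3 by requiring bit 2 to be set whenever bit 1 is set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same bitwise check as kvm_pat_valid(), restated with userspace types. */
static bool pat_valid_bitwise(uint64_t data)
{
	/* Bits 3-7 of every entry must be zero. */
	if (data & 0xF8F8F8F8F8F8F8F8ull)
		return false;

	/* Shifting bit 1 onto bit 2 must not change the value. */
	return (data | ((data & 0x0202020202020202ull) << 1)) == data;
}

/* Hypothetical reference version: inspect each of the eight entries. */
static bool pat_valid_bytewise(uint64_t data)
{
	for (int i = 0; i < 8; i++) {
		uint8_t entry = (data >> (i * 8)) & 0xff;

		/* Only memory types 0, 1, 4, 5, 6 and 7 are valid. */
		if (entry > 7 || entry == 2 || entry == 3)
			return false;
	}
	return true;
}
```

The two functions should agree on every input; the bitwise form simply checks all eight entries at once without a loop.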
|
|
|
|
|
2020-01-24 15:07:22 -08:00
|
|
|
static inline bool kvm_dr7_valid(u64 data)
|
2020-01-15 19:54:32 -05:00
|
|
|
{
|
|
|
|
/* Bits [63:32] are reserved */
|
|
|
|
return !(data >> 32);
|
|
|
|
}
|
2020-05-22 18:19:51 -04:00
|
|
|
static inline bool kvm_dr6_valid(u64 data)
|
|
|
|
{
|
|
|
|
/* Bits [63:32] are reserved */
|
|
|
|
return !(data >> 32);
|
|
|
|
}
|
2020-01-15 19:54:32 -05:00
|
|
|
|
2020-10-29 14:56:00 +01:00
|
|
|
/*
|
|
|
|
* Trigger machine check on the host. We assume all the MSRs are already set up
|
|
|
|
* by the CPU and that we still run on the same CPU as the MCE occurred on.
|
|
|
|
* We pass a fake environment to the machine check handler because we want
|
|
|
|
* the guest to be always treated like user space, no matter what context
|
|
|
|
* it used internally.
|
|
|
|
*/
|
|
|
|
static inline void kvm_machine_check(void)
|
|
|
|
{
|
|
|
|
#if defined(CONFIG_X86_MCE)
|
|
|
|
struct pt_regs regs = {
|
|
|
|
.cs = 3, /* Fake ring 3 no matter what the guest ran on */
|
|
|
|
.flags = X86_EFLAGS_IF,
|
|
|
|
};
|
|
|
|
|
|
|
|
do_machine_check(&regs);
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2019-10-21 16:30:25 -07:00
|
|
|
void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
|
|
|
|
void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
|
2020-07-08 14:57:31 +03:00
|
|
|
int kvm_spec_ctrl_test_value(u64 value);
|
2020-09-11 14:29:05 -05:00
|
|
|
int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r,
|
|
|
|
struct x86_exception *e);
|
2020-09-11 14:29:12 -05:00
|
|
|
int kvm_handle_invpcid(struct kvm_vcpu *vcpu, unsigned long type, gva_t gva);
|
2020-09-25 16:34:17 +02:00
|
|
|
bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
|
2019-04-10 11:41:40 +02:00
|
|
|
|
2024-08-02 11:19:27 -07:00
|
|
|
enum kvm_msr_access {
|
|
|
|
MSR_TYPE_R = BIT(0),
|
|
|
|
MSR_TYPE_W = BIT(1),
|
|
|
|
MSR_TYPE_RW = MSR_TYPE_R | MSR_TYPE_W,
|
|
|
|
};
|
|
|
|
|
2020-11-01 13:55:23 +02:00
|
|
|
/*
|
|
|
|
* Internal error codes that are used to indicate that MSR emulation encountered
|
2024-08-02 11:19:28 -07:00
|
|
|
* an error that should result in #GP in the guest, unless userspace handles it.
|
|
|
|
* Note, '1', '0', and negative numbers are off limits, as they are used by KVM
|
|
|
|
* as part of KVM's lightly documented internal KVM_RUN return codes.
|
|
|
|
*
|
|
|
|
* UNSUPPORTED - The MSR isn't supported, either because it is completely
|
|
|
|
* unknown to KVM, or because the MSR should not exist according
|
|
|
|
* to the vCPU model.
|
|
|
|
*
|
|
|
|
* FILTERED - Access to the MSR is denied by a userspace MSR filter.
|
2020-11-01 13:55:23 +02:00
|
|
|
*/
|
2024-08-02 11:19:28 -07:00
|
|
|
#define KVM_MSR_RET_UNSUPPORTED 2
|
|
|
|
#define KVM_MSR_RET_FILTERED 3
|
2020-06-22 18:04:41 -04:00
|
|
|
|
2024-11-27 17:33:37 -08:00
|
|
|
static inline bool __kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
|
|
|
|
{
|
|
|
|
return !(cr4 & vcpu->arch.cr4_guest_rsvd_bits);
|
|
|
|
}
|
|
|
|
|
2020-07-08 00:39:55 +00:00
|
|
|
#define __cr4_reserved_bits(__cpu_has, __c) \
|
|
|
|
({ \
|
|
|
|
u64 __reserved_bits = CR4_RESERVED_BITS; \
|
|
|
|
\
|
|
|
|
if (!__cpu_has(__c, X86_FEATURE_XSAVE)) \
|
|
|
|
__reserved_bits |= X86_CR4_OSXSAVE; \
|
|
|
|
if (!__cpu_has(__c, X86_FEATURE_SMEP)) \
|
|
|
|
__reserved_bits |= X86_CR4_SMEP; \
|
|
|
|
if (!__cpu_has(__c, X86_FEATURE_SMAP)) \
|
|
|
|
__reserved_bits |= X86_CR4_SMAP; \
|
|
|
|
if (!__cpu_has(__c, X86_FEATURE_FSGSBASE)) \
|
|
|
|
__reserved_bits |= X86_CR4_FSGSBASE; \
|
|
|
|
if (!__cpu_has(__c, X86_FEATURE_PKU)) \
|
|
|
|
__reserved_bits |= X86_CR4_PKE; \
|
|
|
|
if (!__cpu_has(__c, X86_FEATURE_LA57)) \
|
|
|
|
__reserved_bits |= X86_CR4_LA57; \
|
|
|
|
if (!__cpu_has(__c, X86_FEATURE_UMIP)) \
|
|
|
|
__reserved_bits |= X86_CR4_UMIP; \
|
2020-07-08 07:02:50 -04:00
|
|
|
if (!__cpu_has(__c, X86_FEATURE_VMX)) \
|
|
|
|
__reserved_bits |= X86_CR4_VMXE; \
|
2021-02-01 15:28:43 +01:00
|
|
|
if (!__cpu_has(__c, X86_FEATURE_PCID)) \
|
|
|
|
__reserved_bits |= X86_CR4_PCIDE; \
|
KVM: x86: Virtualize LAM for supervisor pointer
Add support to allow guests to set the new CR4 control bit for LAM and add
implementation to get untagged address for supervisor pointers.
LAM modifies the canonicality check applied to 64-bit linear addresses for
data accesses, allowing software to use the untranslated address bits for
metadata, and masks the metadata bits before using them as linear addresses
to access memory. LAM uses CR4.LAM_SUP (bit 28) to configure and enable LAM
for supervisor pointers. It also changes VMENTER to allow the bit to be set
in VMCS's HOST_CR4 and GUEST_CR4 to support virtualization. Note CR4.LAM_SUP
is allowed to be set even when not in 64-bit mode, but it will not take effect
since LAM only applies to 64-bit linear addresses.
Move CR4.LAM_SUP out of CR4_RESERVED_BITS, as its reservation depends on whether
the vCPU supports LAM. Leave it intercepted to prevent the guest from setting
the bit if LAM is not exposed to the guest, as well as to avoid a VMREAD every
time KVM fetches its value, with the expectation that the guest won't toggle the
bit frequently.
Set CR4.LAM_SUP bit in the emulated IA32_VMX_CR4_FIXED1 MSR for guests to
allow guests to enable LAM for supervisor pointers in nested VMX operation.
Hardware is not required to flush the TLB when CR4.LAM_SUP is toggled, so KVM
doesn't need to emulate a TLB flush based on it. There are no other feature
or vmx_exec_controls connections, and no other code is needed in
{kvm,vmx}_set_cr4().
Skip address untagging for instruction fetches (which include branch targets),
operands of INVLPG instructions, and implicit system accesses, all of which
are not subject to untagging. Note, get_untagged_addr() isn't invoked for
implicit system accesses as there is no reason to do so, but check the
flag anyways for documentation purposes.
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-11-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-09-13 20:42:21 +08:00
|
|
|
if (!__cpu_has(__c, X86_FEATURE_LAM)) \
|
|
|
|
__reserved_bits |= X86_CR4_LAM_SUP; \
|
2020-07-08 00:39:55 +00:00
|
|
|
__reserved_bits; \
|
|
|
|
})
|
|
|
|
|
2020-12-10 11:09:53 -06:00
|
|
|
int kvm_sev_es_mmio_write(struct kvm_vcpu *vcpu, gpa_t src, unsigned int bytes,
|
|
|
|
void *dst);
|
|
|
|
int kvm_sev_es_mmio_read(struct kvm_vcpu *vcpu, gpa_t src, unsigned int bytes,
|
|
|
|
void *dst);
|
2020-12-10 11:09:54 -06:00
|
|
|
int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
|
|
|
|
unsigned int port, void *data, unsigned int count,
|
|
|
|
int in);
|
2020-12-10 11:09:53 -06:00
|
|
|
|
2024-11-27 16:43:40 -08:00
|
|
|
static inline bool user_exit_on_hypercall(struct kvm *kvm, unsigned long hc_nr)
|
|
|
|
{
|
|
|
|
return kvm->arch.hypercall_exit_enabled & BIT(hc_nr);
|
|
|
|
}
|
|
|
|
|
2025-02-22 09:42:17 +08:00
|
|
|
int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, int cpl,
|
2024-12-10 11:21:03 -05:00
|
|
|
int (*complete_hypercall)(struct kvm_vcpu *));
|
|
|
|
|
2025-02-22 09:42:17 +08:00
|
|
|
#define __kvm_emulate_hypercall(_vcpu, cpl, complete_hypercall) \
|
|
|
|
({ \
|
|
|
|
int __ret; \
|
|
|
|
__ret = ____kvm_emulate_hypercall(_vcpu, cpl, complete_hypercall); \
|
|
|
|
\
|
|
|
|
if (__ret > 0) \
|
|
|
|
__ret = complete_hypercall(_vcpu); \
|
|
|
|
__ret; \
|
2024-12-10 11:21:03 -05:00
|
|
|
})
|
2024-11-27 16:43:43 -08:00
|
|
|
|
2024-11-27 16:43:41 -08:00
|
|
|
int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
|
|
|
|
|
2008-07-03 14:59:22 +03:00
|
|
|
#endif
|