/* SPDX-License-Identifier: GPL-2.0 */
#if !defined(KVM_X86_OP) || !defined(KVM_X86_OP_OPTIONAL)
BUILD_BUG_ON(1)
#endif
/*
 * KVM_X86_OP() and KVM_X86_OP_OPTIONAL() are used to help generate
 * both DECLARE/DEFINE_STATIC_CALL() invocations and
 * "static_call_update()" calls.
 *
 * KVM_X86_OP_OPTIONAL() can be used for those functions that can have
 * a NULL definition. KVM_X86_OP_OPTIONAL_RET0() can be used likewise
 * to make a definition optional, but in this case the default will
 * be __static_call_return0.
 */
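/*
 * Illustrative sketch (an editorial addition, not part of the upstream
 * header): a consumer is expected to #define the KVM_X86_OP* macros
 * before including this file, so that every hook listed below expands
 * into the desired declaration, definition, or static_call_update().
 * Something along these lines could generate the static call
 * declarations; the exact wrapper shown is an assumption for
 * illustration only:
 *
 *   #define KVM_X86_OP(func) \
 *	DECLARE_STATIC_CALL(kvm_x86_##func, *(((struct kvm_x86_ops *)0)->func));
 *   #define KVM_X86_OP_OPTIONAL KVM_X86_OP
 *   #define KVM_X86_OP_OPTIONAL_RET0 KVM_X86_OP
 *   #include <asm/kvm-x86-ops.h>
 *
 * For KVM_X86_OP_OPTIONAL_RET0(), the updater would presumably fall
 * back to __static_call_return0 when the vendor module leaves the hook
 * NULL, e.g. roughly:
 *
 *   #define KVM_X86_OP_OPTIONAL_RET0(func) \
 *	static_call_update(kvm_x86_##func, (void *)kvm_x86_ops.func ? : \
 *			   (void *)__static_call_return0);
 */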
KVM_X86_OP(check_processor_compatibility)
KVM_X86_OP(enable_virtualization_cpu)
KVM_X86_OP(disable_virtualization_cpu)
KVM_X86_OP(hardware_unsetup)
KVM_X86_OP(has_emulated_msr)
KVM_X86_OP(vcpu_after_set_cpuid)
KVM_X86_OP(vm_init)
KVM_X86_OP_OPTIONAL(vm_destroy)
KVM_X86_OP_OPTIONAL(vm_pre_destroy)
KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
KVM_X86_OP(vcpu_create)
KVM_X86_OP(vcpu_free)
KVM_X86_OP(vcpu_reset)
KVM_X86_OP(prepare_switch_to_guest)
KVM_X86_OP(vcpu_load)
KVM_X86_OP(vcpu_put)
KVM_X86_OP(update_exception_bitmap)
KVM_X86_OP(get_msr)
KVM_X86_OP(set_msr)
KVM_X86_OP(get_segment_base)
KVM_X86_OP(get_segment)
KVM_X86_OP(get_cpl)
KVM_X86_OP(get_cpl_no_cache)
KVM_X86_OP(set_segment)
KVM_X86_OP(get_cs_db_l_bits)
KVM_X86_OP(is_valid_cr0)
KVM_X86_OP(set_cr0)
KVM_X86_OP_OPTIONAL(post_set_cr3)
KVM_X86_OP(is_valid_cr4)
KVM_X86_OP(set_cr4)
KVM_X86_OP(set_efer)
KVM_X86_OP(get_idt)
KVM_X86_OP(set_idt)
KVM_X86_OP(get_gdt)
KVM_X86_OP(set_gdt)
KVM_X86_OP(sync_dirty_debug_regs)
KVM_X86_OP(set_dr7)
KVM_X86_OP(cache_reg)
KVM_X86_OP(get_rflags)
KVM_X86_OP(set_rflags)
KVM_X86_OP(get_if_flag)
KVM_X86_OP(flush_tlb_all)
KVM_X86_OP(flush_tlb_current)
#if IS_ENABLED(CONFIG_HYPERV)
KVM_X86_OP_OPTIONAL(flush_remote_tlbs)
KVM_X86_OP_OPTIONAL(flush_remote_tlbs_range)
#endif
KVM_X86_OP(flush_tlb_gva)
KVM_X86_OP(flush_tlb_guest)
KVM_X86_OP(vcpu_pre_run)
KVM_X86_OP(vcpu_run)
KVM_X86_OP(handle_exit)
KVM_X86_OP(skip_emulated_instruction)
KVM_X86_OP_OPTIONAL(update_emulated_instruction)
KVM_X86_OP(set_interrupt_shadow)
KVM_X86_OP(get_interrupt_shadow)
KVM_X86_OP(patch_hypercall)
KVM_X86_OP(inject_irq)
KVM_X86_OP(inject_nmi)
KVM_X86_OP_OPTIONAL_RET0(is_vnmi_pending)
KVM_X86_OP_OPTIONAL_RET0(set_vnmi_pending)
KVM_X86_OP(inject_exception)
KVM_X86_OP(cancel_injection)
KVM_X86_OP(interrupt_allowed)
KVM_X86_OP(nmi_allowed)
KVM_X86_OP(get_nmi_mask)
KVM_X86_OP(set_nmi_mask)
KVM_X86_OP(enable_nmi_window)
KVM_X86_OP(enable_irq_window)
KVM_X86_OP_OPTIONAL(update_cr8_intercept)
KVM_X86_OP(refresh_apicv_exec_ctrl)
KVM_X86_OP_OPTIONAL(hwapic_isr_update)
KVM_X86_OP_OPTIONAL(load_eoi_exitmap)
KVM_X86_OP_OPTIONAL(set_virtual_apic_mode)
KVM_X86_OP_OPTIONAL(set_apic_access_page_addr)
KVM_X86_OP(deliver_interrupt)
KVM_X86_OP_OPTIONAL(sync_pir_to_irr)
KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
KVM_X86_OP(load_mmu_pgd)
KVM_X86_OP_OPTIONAL(link_external_spt)
KVM_X86_OP_OPTIONAL(set_external_spte)
KVM_X86_OP_OPTIONAL(free_external_spt)
KVM_X86_OP_OPTIONAL(remove_external_spte)
KVM_X86_OP(has_wbinvd_exit)
KVM_X86_OP(get_l2_tsc_offset)
KVM_X86_OP(get_l2_tsc_multiplier)
KVM_X86_OP(write_tsc_offset)
KVM_X86_OP(write_tsc_multiplier)
KVM_X86_OP(get_exit_info)
KVM_X86_OP(get_entry_info)
KVM_X86_OP(check_intercept)
KVM_X86_OP(handle_exit_irqoff)
KVM_X86_OP_OPTIONAL(update_cpu_dirty_logging)
KVM_X86_OP_OPTIONAL(vcpu_blocking)
KVM_X86_OP_OPTIONAL(vcpu_unblocking)
KVM_X86_OP_OPTIONAL(pi_update_irte)
KVM_X86_OP_OPTIONAL(pi_start_bypass)
KVM_X86_OP_OPTIONAL(apicv_pre_state_restore)
KVM_X86_OP_OPTIONAL(apicv_post_state_restore)
KVM_X86_OP_OPTIONAL_RET0(dy_apicv_has_pending_interrupt)
KVM_X86_OP_OPTIONAL(protected_apic_has_interrupt)
KVM_X86_OP_OPTIONAL(set_hv_timer)
KVM_X86_OP_OPTIONAL(cancel_hv_timer)
KVM_X86_OP(setup_mce)
#ifdef CONFIG_KVM_SMM
KVM_X86_OP(smi_allowed)
KVM_X86_OP(enter_smm)
KVM_X86_OP(leave_smm)
KVM_X86_OP(enable_smi_window)
#endif
KVM_X86_OP_OPTIONAL(dev_get_attr)
KVM_X86_OP_OPTIONAL(mem_enc_ioctl)
KVM_X86_OP_OPTIONAL(vcpu_mem_enc_ioctl)
KVM_X86_OP_OPTIONAL(mem_enc_register_region)
KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
KVM_X86_OP_OPTIONAL(vm_move_enc_context_from)
KVM_X86_OP_OPTIONAL(guest_memory_reclaimed)
KVM_X86_OP(get_feature_msr)
KVM_X86_OP(check_emulate_instruction)
KVM_X86_OP(apic_init_signal_blocked)
KVM_X86_OP_OPTIONAL(enable_l2_tlb_flush)
KVM_X86_OP_OPTIONAL(migrate_timers)
KVM_X86_OP(recalc_msr_intercepts)
KVM_X86_OP(complete_emulated_msr)
KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons)
KVM_X86_OP_OPTIONAL(get_untagged_addr)
KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
KVM_X86_OP_OPTIONAL_RET0(private_max_mapping_level)
KVM_X86_OP_OPTIONAL(gmem_invalidate)

#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
#undef KVM_X86_OP_OPTIONAL_RET0