Commit graph

1312 commits

Author SHA1 Message Date
Sean Christopherson
d26638bfcd KVM: SVM: Don't change target vCPU state on AP Creation VMGEXIT error
If KVM rejects an AP Creation event, leave the target vCPU state as-is.
Nothing in the GHCB suggests the hypervisor is *allowed* to muck with vCPU
state on failure, let alone required to do so.  Furthermore, kicking only
in the !ON_INIT case leads to divergent behavior, and even the "kick" case
is non-deterministic.

E.g. if an ON_INIT request fails, the guest can successfully retry if the
fixed AP Creation request is made prior to sending INIT.  And if a !ON_INIT
fails, the guest can successfully retry if the fixed AP Creation request is
handled before the target vCPU processes KVM's
KVM_REQ_UPDATE_PROTECTED_GUEST_STATE.

Fixes: e366f92ea9 ("KVM: SEV: Support SEV-SNP AP Creation NAE event")
Cc: stable@vger.kernel.org
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03 07:33:44 -08:00
Sean Christopherson
72d12715ed KVM: SVM: Refuse to attempt VRMUN if an SEV-ES+ guest has an invalid VMSA
Explicitly reject KVM_RUN with KVM_EXIT_FAIL_ENTRY if userspace "coerces"
KVM into running an SEV-ES+ guest with an invalid VMSA, e.g. by modifying
a vCPU's mp_state to be RUNNABLE after an SNP vCPU has undergone a Destroy
event.  On Destroy or failed Create, KVM marks the vCPU HALTED so that
*KVM* doesn't run the vCPU, but nothing prevents a misbehaving VMM from
manually making the vCPU RUNNABLE via KVM_SET_MP_STATE.

Attempting VMRUN with an invalid VMSA should be harmless, but knowingly
executing VMRUN with bad control state is at best dodgy.

Fixes: e366f92ea9 ("KVM: SEV: Support SEV-SNP AP Creation NAE event")
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03 07:33:43 -08:00
Sean Christopherson
807cb9ce2e KVM: SVM: Don't rely on DebugSwap to restore host DR0..DR3
Never rely on the CPU to restore/load host DR0..DR3 values, even if the
CPU supports DebugSwap, as there are no guarantees that SNP guests will
actually enable DebugSwap on APs.  E.g. if KVM were to rely on the CPU to
load DR0..DR3 and skipped them during hw_breakpoint_restore(), KVM would
run with clobbered-to-zero DRs if an SNP guest created APs without
DebugSwap enabled.

Update the comment to explain the dangers, and hopefully prevent breaking
KVM in the future.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03 07:26:39 -08:00
Sean Christopherson
b2653cd3b7 KVM: SVM: Save host DR masks on CPUs with DebugSwap
When running SEV-SNP guests on a CPU that supports DebugSwap, always save
the host's DR0..DR3 mask MSR values irrespective of whether or not
DebugSwap is enabled, to ensure the host values aren't clobbered by the
CPU.  And for now, also save DR0..DR3, even though doing so isn't
necessary (see below).

SVM_VMGEXIT_AP_CREATE is deeply flawed in that it allows the *guest* to
create a VMSA with guest-controlled SEV_FEATURES.  A well behaved guest
can inform the hypervisor, i.e. KVM, of its "requested" features, but on
CPUs without ALLOWED_SEV_FEATURES support, nothing prevents the guest from
lying about which SEV features are being enabled (or not!).

If a misbehaving guest enables DebugSwap in a secondary vCPU's VMSA, the
CPU will load the DR0..DR3 mask MSRs on #VMEXIT, i.e. will clobber the
MSRs with '0' if KVM doesn't save its desired value.

Note, DR0..DR3 themselves are "ok", as DR7 is reset on #VMEXIT, and KVM
restores all DRs in common x86 code as needed via hw_breakpoint_restore().
I.e. there is no risk of host DR0..DR3 being clobbered (when it matters).
However, there is a flaw in the opposite direction; because the guest can
lie about enabling DebugSwap, i.e. can *disable* DebugSwap without KVM's
knowledge, KVM must not rely on the CPU to restore DRs.  Defer fixing
that wart, as it's more of a documentation issue than a bug in the code.

Note, KVM added support for DebugSwap on commit d1f85fbe83 ("KVM: SEV:
Enable data breakpoints in SEV-ES"), but that is not an appropriate Fixes,
as the underlying flaw exists in hardware, not in KVM.  I.e. all kernels
that support SEV-SNP need to be patched, not just kernels with KVM's full
support for DebugSwap (ignoring that DebugSwap support landed first).

Opportunistically fix an incorrect statement in the comment; on CPUs
without DebugSwap, the CPU does NOT save or load debug registers, i.e.

Fixes: e366f92ea9 ("KVM: SEV: Support SEV-SNP AP Creation NAE event")
Cc: stable@vger.kernel.org
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Kim Phillips <kim.phillips@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-03-03 07:26:39 -08:00
Sean Christopherson
433265870a KVM: SVM: Manually context switch DEBUGCTL if LBR virtualization is disabled
Manually load the guest's DEBUGCTL prior to VMRUN (and restore the host's
value on #VMEXIT) if it diverges from the host's value and LBR
virtualization is disabled, as hardware only context switches DEBUGCTL if
LBR virtualization is fully enabled.  Running the guest with the host's
value has likely been mildly problematic for quite some time, e.g. it will
result in undesirable behavior if BTF diverges (with the caveat that KVM
now suppresses guest BTF due to lack of support).

But the bug became fatal with the introduction of Bus Lock Trap ("Detect"
in kernel paralance) support for AMD (commit 408eb7417a
("x86/bus_lock: Add support for AMD")), as a bus lock in the guest will
trigger an unexpected #DB.

Note, suppressing the bus lock #DB, i.e. simply resuming the guest without
injecting a #DB, is not an option.  It wouldn't address the general issue
with DEBUGCTL, e.g. for things like BTF, and there are other guest-visible
side effects if BusLockTrap is left enabled.

If BusLockTrap is disabled, then DR6.BLD is reserved-to-1; any attempts to
clear it by software are ignored.  But if BusLockTrap is enabled, software
can clear DR6.BLD:

  Software enables bus lock trap by setting DebugCtl MSR[BLCKDB] (bit 2)
  to 1.  When bus lock trap is enabled, ... The processor indicates that
  this #DB was caused by a bus lock by clearing DR6[BLD] (bit 11).  DR6[11]
  previously had been defined to be always 1.

and clearing DR6.BLD is "sticky" in that it's not set (i.e. lowered) by
other #DBs:

  All other #DB exceptions leave DR6[BLD] unmodified

E.g. leaving BusLockTrap enable can confuse a legacy guest that writes '0'
to reset DR6.

Reported-by: rangemachine@gmail.com
Reported-by: whanos@sergal.fun
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219787
Closes: https://lore.kernel.org/all/bug-219787-28872@https.bugzilla.kernel.org%2F
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: stable@vger.kernel.org
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28 09:17:45 -08:00
Sean Christopherson
d0eac42f5c KVM: SVM: Suppress DEBUGCTL.BTF on AMD
Mark BTF as reserved in DEBUGCTL on AMD, as KVM doesn't actually support
BTF, and fully enabling BTF virtualization is non-trivial due to
interactions with the emulator, guest_debug, #DB interception, nested SVM,
etc.

Don't inject #GP if the guest attempts to set BTF, as there's no way to
communicate lack of support to the guest, and instead suppress the flag
and treat the WRMSR as (partially) unsupported.

In short, make KVM behave the same on AMD and Intel (VMX already squashes
BTF).

Note, due to other bugs in KVM's handling of DEBUGCTL, the only way BTF
has "worked" in any capacity is if the guest simultaneously enables LBRs.

Reported-by: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: stable@vger.kernel.org
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28 09:17:45 -08:00
Sean Christopherson
ee89e80133 KVM: SVM: Drop DEBUGCTL[5:2] from guest's effective value
Drop bits 5:2 from the guest's effective DEBUGCTL value, as AMD changed
the architectural behavior of the bits and broke backwards compatibility.
On CPUs without BusLockTrap (or at least, in APMs from before ~2023),
bits 5:2 controlled the behavior of external pins:

  Performance-Monitoring/Breakpoint Pin-Control (PBi)—Bits 5:2, read/write.
  Software uses thesebits to control the type of information reported by
  the four external performance-monitoring/breakpoint pins on the
  processor. When a PBi bit is cleared to 0, the corresponding external pin
  (BPi) reports performance-monitor information. When a PBi bit is set to
  1, the corresponding external pin (BPi) reports breakpoint information.

With the introduction of BusLockTrap, presumably to be compatible with
Intel CPUs, AMD redefined bit 2 to be BLCKDB:

  Bus Lock #DB Trap (BLCKDB)—Bit 2, read/write. Software sets this bit to
  enable generation of a #DB trap following successful execution of a bus
  lock when CPL is > 0.

and redefined bits 5:3 (and bit 6) as "6:3 Reserved MBZ".

Ideally, KVM would treat bits 5:2 as reserved.  Defer that change to a
feature cleanup to avoid breaking existing guest in LTS kernels.  For now,
drop the bits to retain backwards compatibility (of a sort).

Note, dropping bits 5:2 is still a guest-visible change, e.g. if the guest
is enabling LBRs *and* the legacy PBi bits, then the state of the PBi bits
is visible to the guest, whereas now the guest will always see '0'.

Reported-by: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: stable@vger.kernel.org
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28 09:17:45 -08:00
Sean Christopherson
d4b69c3d14 KVM: SVM: Inject #GP if memory operand for INVPCID is non-canonical
Inject a #GP if the memory operand received by INVCPID is non-canonical.
The APM clearly states that the intercept takes priority over all #GP
checks except the CPL0 restriction.

Of course, that begs the question of how the CPU generates a linear
address in the first place.  Tracing confirms that EXITINFO1 does hold a
linear address, at least for 64-bit mode guests (hooray GS prefix).
Unfortunately, the APM says absolutely nothing about the EXITINFO fields
for INVPCID intercepts, so it's not at all clear what's supposed to
happen.

Add a FIXME to call out that KVM still does the wrong thing for 32-bit
guests, and if the stack segment is used for the memory operand.

Cc: Babu Moger <babu.moger@amd.com>
Cc: Jim Mattson <jmattson@google.com>
Fixes: 4407a797e9 ("KVM: SVM: Enable INVPCID feature on AMD")
Link: https://lore.kernel.org/r/20250224174522.2363400-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28 09:16:21 -08:00
Sean Christopherson
be45bc4eff KVM: SVM: Set RFLAGS.IF=1 in C code, to get VMRUN out of the STI shadow
Enable/disable local IRQs, i.e. set/clear RFLAGS.IF, in the common
svm_vcpu_enter_exit() just after/before guest_state_{enter,exit}_irqoff()
so that VMRUN is not executed in an STI shadow.  AMD CPUs have a quirk
(some would say "bug"), where the STI shadow bleeds into the guest's
intr_state field if a #VMEXIT occurs during injection of an event, i.e. if
the VMRUN doesn't complete before the subsequent #VMEXIT.

The spurious "interrupts masked" state is relatively benign, as it only
occurs during event injection and is transient.  Because KVM is already
injecting an event, the guest can't be in HLT, and if KVM is querying IRQ
blocking for injection, then KVM would need to force an immediate exit
anyways since injecting multiple events is impossible.

However, because KVM copies int_state verbatim from vmcb02 to vmcb12, the
spurious STI shadow is visible to L1 when running a nested VM, which can
trip sanity checks, e.g. in VMware's VMM.

Hoist the STI+CLI all the way to C code, as the aforementioned calls to
guest_state_{enter,exit}_irqoff() already inform lockdep that IRQs are
enabled/disabled, and taking a fault on VMRUN with RFLAGS.IF=1 is already
possible.  I.e. if there's kernel code that is confused by running with
RFLAGS.IF=1, then it's already a problem.  In practice, since GIF=0 also
blocks NMIs, the only change in exposure to non-KVM code (relative to
surrounding VMRUN with STI+CLI) is exception handling code, and except for
the kvm_rebooting=1 case, all exception in the core VM-Enter/VM-Exit path
are fatal.

Use the "raw" variants to enable/disable IRQs to avoid tracing in the
"no instrumentation" code; the guest state helpers also take care of
tracing IRQ state.

Oppurtunstically document why KVM needs to do STI in the first place.

Reported-by: Doug Covelli <doug.covelli@broadcom.com>
Closes: https://lore.kernel.org/all/CADH9ctBs1YPmE4aCfGPNBwA10cA8RuAk2gO7542DjMZgs4uzJQ@mail.gmail.com
Fixes: f14eec0a32 ("KVM: SVM: move more vmentry code to assembly")
Cc: stable@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250224165442.2338294-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-28 09:15:23 -08:00
Yosry Ahmed
80dacb0804 x86/bugs: Use a static branch to guard IBPB on vCPU switch
Instead of using X86_FEATURE_USE_IBPB to guard the IBPB execution in KVM
when a new vCPU is loaded, introduce a static branch, similar to
switch_mm_*_ibpb.

This makes it obvious in spectre_v2_user_select_mitigation() what
exactly is being toggled, instead of the unclear X86_FEATURE_USE_IBPB
(which will be shortly removed). It also provides more fine-grained
control, making it simpler to change/add paths that control the IBPB in
the vCPU switch path without affecting other IBPBs.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250227012712.3193063-5-yosry.ahmed@linux.dev
2025-02-27 10:57:20 +01:00
Yosry Ahmed
549435aab4 x86/bugs: Move the X86_FEATURE_USE_IBPB check into callers
indirect_branch_prediction_barrier() only performs the MSR write if
X86_FEATURE_USE_IBPB is set, using alternative_msr_write(). In
preparation for removing X86_FEATURE_USE_IBPB, move the feature check
into the callers so that they can be addressed one-by-one, and use
X86_FEATURE_IBPB instead to guard the MSR write.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250227012712.3193063-2-yosry.ahmed@linux.dev
2025-02-27 10:57:20 +01:00
Borislav Petkov
8442df2b49 x86/bugs: KVM: Add support for SRSO_MSR_FIX
Add support for

  CPUID Fn8000_0021_EAX[31] (SRSO_MSR_FIX). If this bit is 1, it
  indicates that software may use MSR BP_CFG[BpSpecReduce] to mitigate
  SRSO.

Enable BpSpecReduce to mitigate SRSO across guest/host boundaries.

Switch back to enabling the bit when virtualization is enabled and to
clear the bit when virtualization is disabled because using a MSR slot
would clear the bit when the guest is exited and any training the guest
has done, would potentially influence the host kernel when execution
enters the kernel and hasn't VMRUN the guest yet.

More detail on the public thread in Link below.

Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241202120416.6054-1-bp@kernel.org
2025-02-26 15:13:06 +01:00
Manali Shukla
fa662c9080 KVM: SVM: Add Idle HLT intercept support
Add support for "Idle HLT" interception on AMD CPUs, and enable Idle HLT
interception instead of "normal" HLT interception for all VMs for which
HLT-exiting is enabled.  Idle HLT provides a mild performance boost for
all VM types, by avoiding a VM-Exit in the scenario where KVM would
immediately "wake" and resume the vCPU.

Idle HLT makes HLT-exiting conditional on the vCPU not having a valid,
unmasked interrupt.  Specifically, a VM-Exit occurs on execution of HLT
if and only if there are no pending V_IRQ or V_NMI events.  Note, Idle
is a replacement for full HLT interception, i.e. enabling HLT interception
would result in all HLT instructions causing unconditional VM-Exits.  Per
the APM:

 When both HLT and Idle HLT intercepts are active at the same time, the
 HLT intercept takes priority. This intercept occurs only if a virtual
 interrupt is not pending (V_INTR or V_NMI).

For KVM's use of V_IRQ (also called V_INTR in the APM) to detect interrupt
windows, the net effect of enabling Idle HLT is that, if a virtual
interupt is pending and unmasked at the time of HLT, the vCPU will take
a V_IRQ intercept instead of a HLT intercept.

When AVIC is enabled, Idle HLT works as intended: the vCPU continues
unimpeded and services the pending virtual interrupt.

Note, the APM's description of V_IRQ interaction with AVIC is quite
confusing, and requires piecing together implied behavior.  Per the APM,
when AVIC is enabled, V_IRQ *from the VMCB* is ignored:

  When AVIC mode is enabled for a virtual processor, the V_IRQ, V_INTR_PRIO,
  V_INTR_VECTOR, and V_IGN_TPR fields in the VMCB are ignored.

Which seems to contradict the behavior of Idle HLT:

  This intercept occurs only if a virtual interrupt is not pending (V_INTR
  or V_NMI).

What's not explicitly stated is that hardware's internal copy of V_IRQ
(and related fields) *are* still active, i.e. are presumably used to cache
information from the virtual APIC.

Handle Idle HLT exits as if they were normal HLT exits, e.g. don't try to
optimize the handling under the assumption that there isn't a pending IRQ.
Irrespective of AVIC, Idle HLT is inherently racy with respect to the vIRR,
as KVM can set vIRR bits asychronously.

No changes are required to support KVM's use Idle HLT while running
L2.  In fact, supporting Idle HLT is actually a bug fix to some extent.
If L1 wants to intercept HLT, recalc_intercepts() will enable HLT
interception in vmcb02 and forward the intercept to L1 as normal.

But if L1 does not want to intercept HLT, then KVM will run L2 with Idle
HLT enabled and HLT interception disabled.  If a V_IRQ or V_NMI for L2
becomes pending and L2 executes HLT, then use of Idle HLT will do the
right thing, i.e. not #VMEXIT and instead deliver the virtual event.  KVM
currently doesn't handle this scenario correctly, e.g. doesn't check V_IRQ
or V_NMI in vmcs02 as part of kvm_vcpu_has_events().

Do not expose Idle HLT to L1 at this time, as supporting nested Idle HLT is
more complex than just enumerating the feature, e.g. requires KVM to handle
the aforementioned scenarios of V_IRQ and V_NMI at the time of exit.

Signed-off-by: Manali Shukla <Manali.Shukla@amd.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://bugzilla.kernel.org/attachment.cgi?id=306250
Link: https://lore.kernel.org/r/20250128124812.7324-3-manali.shukla@amd.com
[sean: rewrite changelog, drop nested "support"]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25 16:30:02 -08:00
Melody Wang
c3392d0ab7 KVM: SVM: Provide helpers to set the error code
Provide helpers to set the error code when converting VMGEXIT SW_EXITINFO1 and
SW_EXITINFO2 codes from plain numbers to proper defines. Add comments for
better code readability.

No functionality changed.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Melody Wang <huibo.wang@amd.com>
Link: https://lore.kernel.org/r/20250225213937.2471419-3-huibo.wang@amd.com
[sean: tweak comments, fix formatting goofs]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25 16:30:00 -08:00
Melody Wang
ea4c2f2f5e KVM: SVM: Convert plain error code numbers to defines
Convert VMGEXIT SW_EXITINFO1 codes from plain numbers to proper defines.

Opportunistically update the comment for the malformed input "sub-error"
codes to state that they are defined by the GHCB, and to capure the
relationship to the malformed input response.

No functional change intended.

Signed-off-by: Melody Wang <huibo.wang@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Pavan Kumar Paluri <papaluri@amd.com>
Link: https://lore.kernel.org/r/20250225213937.2471419-2-huibo.wang@amd.com
[sean: update comments]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25 16:29:59 -08:00
Sean Christopherson
b50cb2b155 KVM: x86: Use a dedicated flow for queueing re-injected exceptions
Open code the filling of vcpu->arch.exception in kvm_requeue_exception()
instead of bouncing through kvm_multiple_exception(), as re-injection
doesn't actually share that much code with "normal" injection, e.g. the
VM-Exit interception check, payload delivery, and nested exception code
is all bypassed as those flows only apply during initial injection.

When FRED comes along, the special casing will only get worse, as FRED
explicitly tracks nested exceptions and essentially delivers the payload
on the stack frame, i.e. re-injection will need more inputs, and normal
injection will have yet more code that needs to be bypassed when KVM is
re-injecting an exception.

No functional change intended.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20241001050110.3643764-2-xin@zytor.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-25 07:23:07 -08:00
Paolo Bonzini
d3d0b8dfe0 KVM fixes for 6.14 part 1
- Reject Hyper-V SEND_IPI hypercalls if the local APIC isn't being emulated
    by KVM to fix a NULL pointer dereference.
 
  - Enter guest mode (L2) from KVM's perspective before initializing the vCPU's
    nested NPT MMU so that the MMU is properly tagged for L2, not L1.
 
  - Load the guest's DR6 outside of the innermost .vcpu_run() loop, as the
    guest's value may be stale if a VM-Exit is handled in the fastpath.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmev2ekACgkQOlYIJqCj
 N/32Gg/7B2+oV9RaKB1VNv4G4vbQLiA+DxPM91U0sBqytkr9BfU5kciaVs068OVk
 2M3j007HHm51sWlsCB7VLeTmiNNi/RcJzh6mOCpJVGa70imNZl3/1cvbzx1hjOAn
 DbZSIqBfLpPnAmNUp4c++WsDPZR2vVVMXriVNWM+RLFRT8E2GavCKxGppoNf+FIS
 8aYYikiqIx+E6iYsZjEm4TXqOQ2CSLM+auq2/L24bFgkn/v6I5m70QfnnYgs7Y7R
 uZhv+x2O8DXuW2RxabiC4q302PDdNKtHYpEh/5+vmG34mouZEEPTVlSRU720frqU
 SnOwtiTKwDmAwMDSRXUAP4jc9FsD4JHSUUM7Sk0J/YaI55X3xV+YrJUBZ07bwunT
 TkKPr6TvlJW9s2bi+CEc0HHoMHqmejjKhq8fOeDgVkGYH1nhjrLQAFpxjI4iVmPQ
 vZLmCZXEMzJaqySMNVIPdSFJLLsKnD7mJT3XfbXG7dV5zmde2qYd7+TiRVb5dmst
 xTgSvhA1jLXpSYA4rmMjhweLEfQyljaPgb1GEZCQCBrV9clP0cb091rOWNbrcieG
 aMXFwHEyPjGDvlXlhjdfkNeHdP6Dq8y0aBoyeSnvdwvpAN256jswrzpYjBHWQqfv
 jsD3QHcbImUr+kH2CHFsZuXxsjh+woL+4crR1eQkL8oZWHEykzs=
 =aFcV
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-fixes-6.14-rcN' of https://github.com/kvm-x86/linux into HEAD

KVM fixes for 6.14 part 1

 - Reject Hyper-V SEND_IPI hypercalls if the local APIC isn't being emulated
   by KVM to fix a NULL pointer dereference.

 - Enter guest mode (L2) from KVM's perspective before initializing the vCPU's
   nested NPT MMU so that the MMU is properly tagged for L2, not L1.

 - Load the guest's DR6 outside of the innermost .vcpu_run() loop, as the
   guest's value may be stale if a VM-Exit is handled in the fastpath.
2025-02-14 19:08:35 -05:00
Sean Christopherson
44e70718df KVM: SVM: Ensure PSP module is initialized if KVM module is built-in
The kernel's initcall infrastructure lacks the ability to express
dependencies between initcalls, whereas the modules infrastructure
automatically handles dependencies via symbol loading.  Ensure the
PSP SEV driver is initialized before proceeding in sev_hardware_setup()
if KVM is built-in as the dependency isn't handled by the initcall
infrastructure.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <f78ddb64087df27e7bcb1ae0ab53f55aa0804fab.1739226950.git.ashish.kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-14 18:39:19 -05:00
Ge Yang
7e066cb9b7 KVM: SEV: Use long-term pin when registering encrypted memory regions
When registering an encrypted memory region for SEV-MEM/SEV-ES guests,
pin the pages with FOLL_TERM so that the pages are migrated out of
MIGRATE_CMA/ZONE_MOVABLE.  Failure to do so violates the CMA/MOVABLE
mechanisms and can result in fragmentation due to unmovable pages, e.g.
can make CMA allocations fail.

Signed-off-by: Ge Yang <yangge1116@126.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/1739241423-14326-1-git-send-email-yangge1116@126.com
[sean: massage changelog, make @flags an unsigned int]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:50:42 -08:00
Sean Christopherson
93da6af3ae KVM: x86: Defer runtime updates of dynamic CPUID bits until CPUID emulation
Defer runtime CPUID updates until the next non-faulting CPUID emulation
or KVM_GET_CPUID2, which are the only paths in KVM that consume the
dynamic entries.  Deferring the updates is especially beneficial to
nested VM-Enter/VM-Exit, as KVM will almost always detect multiple state
changes, not to mention the updates don't need to be realized while L2 is
active if CPUID is being intercepted by L1 (CPUID is a mandatory intercept
on Intel, but not AMD).

Deferring CPUID updates shaves several hundred cycles from nested VMX
roundtrips, as measured from L2 executing CPUID in a tight loop:

  SKX 6850 => 6450
  ICX 9000 => 8800
  EMR 7900 => 7700

Alternatively, KVM could update only the CPUID leaves that are affected
by the state change, e.g. update XSAVE info only if XCR0 or XSS changes,
but that adds non-trivial complexity and doesn't solve the underlying
problem of nested transitions potentially changing both XCR0 and XSS, on
both nested VM-Enter and VM-Exit.

Skipping updates entirely if L2 is active and CPUID is being intercepted
by L1 could work for the common case.  However, simply skipping updates if
L2 is active is *very* subtly dangerous and complex.  Most KVM updates are
triggered by changes to the current vCPU state, which may be L2 state,
whereas performing updates only for L1 would requiring detecting changes
to L1 state.  KVM would need to either track relevant L1 state, or defer
runtime CPUID updates until the next nested VM-Exit.  The former is ugly
and complex, while the latter comes with similar dangers to deferring all
CPUID updates, and would only address the nested VM-Enter path.

To guard against using stale data, disallow querying dynamic CPUID feature
bits, i.e. features that KVM updates at runtime, via a compile-time
assertion in guest_cpu_cap_has().  Exempt MWAIT from the rule, as the
MISC_ENABLE_NO_MWAIT means that MWAIT is _conditionally_ a dynamic CPUID
feature.

Note, the rule could be enforced for MWAIT as well, e.g. by querying guest
CPUID in kvm_emulate_monitor_mwait, but there's no obvious advtantage to
doing so, and allowing MWAIT for guest_cpuid_has() opens up a different can
of worms.  MONITOR/MWAIT can't be virtualized (for a reasonable definition),
and the nature of the MWAIT_NEVER_UD_FAULTS and MISC_ENABLE_NO_MWAIT quirks
means checking X86_FEATURE_MWAIT outside of kvm_emulate_monitor_mwait() is
wrong for other reasons.

Beyond the aforementioned feature bits, the only other dynamic CPUID
(sub)leaves are the XSAVE sizes, and similar to MWAIT, consuming those
CPUID entries in KVM is all but guaranteed to be a bug.  The layout for an
actual XSAVE buffer depends on the format (compacted or not) and
potentially the features that are actually enabled.  E.g. see the logic in
fpstate_clear_xstate_component() needed to poke into the guest's effective
XSAVE state to clear MPX state on INIT.  KVM does consume
CPUID.0xD.0.{EAX,EDX} in kvm_check_cpuid() and cpuid_get_supported_xcr0(),
but not EBX, which is the only dynamic output register in the leaf.

Link: https://lore.kernel.org/r/20241211013302.1347853-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:16:33 -08:00
Jim Mattson
e9cb61055f KVM: x86: Clear pv_unhalted on all transitions to KVM_MP_STATE_RUNNABLE
In kvm_set_mp_state(), ensure that vcpu->arch.pv.pv_unhalted is always
cleared on a transition to KVM_MP_STATE_RUNNABLE, so that the next HLT
instruction will be respected.

Fixes: 6aef266c6e ("kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks")
Fixes: b6b8a1451f ("KVM: nVMX: Rework interception of IRQs and NMIs")
Fixes: 38c0b192bd ("KVM: SVM: leave halted state on vmexit")
Fixes: 1a65105a5a ("KVM: x86/xen: handle PV spinlocks slowpath")
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250113200150.487409-3-jmattson@google.com
[sean: add Xen PV spinlocks to the list of Fixes, tweak changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:16:28 -08:00
Jim Mattson
c9e5f3fa90 KVM: x86: Introduce kvm_set_mp_state()
Replace all open-coded assignments to vcpu->arch.mp_state with calls
to a new helper, kvm_set_mp_state(), to centralize all changes to
mp_state.

No functional change intended.

Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250113200150.487409-2-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 10:16:27 -08:00
Sean Christopherson
c2fee09fc1 KVM: x86: Load DR6 with guest value only before entering .vcpu_run() loop
Move the conditional loading of hardware DR6 with the guest's DR6 value
out of the core .vcpu_run() loop to fix a bug where KVM can load hardware
with a stale vcpu->arch.dr6.

When the guest accesses a DR and host userspace isn't debugging the guest,
KVM disables DR interception and loads the guest's values into hardware on
VM-Enter and saves them on VM-Exit.  This allows the guest to access DRs
at will, e.g. so that a sequence of DR accesses to configure a breakpoint
only generates one VM-Exit.

For DR0-DR3, the logic/behavior is identical between VMX and SVM, and also
identical between KVM_DEBUGREG_BP_ENABLED (userspace debugging the guest)
and KVM_DEBUGREG_WONT_EXIT (guest using DRs), and so KVM handles loading
DR0-DR3 in common code, _outside_ of the core kvm_x86_ops.vcpu_run() loop.

But for DR6, the guest's value doesn't need to be loaded into hardware for
KVM_DEBUGREG_BP_ENABLED, and SVM provides a dedicated VMCB field whereas
VMX requires software to manually load the guest value, and so loading the
guest's value into DR6 is handled by {svm,vmx}_vcpu_run(), i.e. is done
_inside_ the core run loop.

Unfortunately, saving the guest values on VM-Exit is initiated by common
x86, again outside of the core run loop.  If the guest modifies DR6 (in
hardware, when DR interception is disabled), and then the next VM-Exit is
a fastpath VM-Exit, KVM will reload hardware DR6 with vcpu->arch.dr6 and
clobber the guest's actual value.

The bug shows up primarily with nested VMX because KVM handles the VMX
preemption timer in the fastpath, and the window between hardware DR6
being modified (in guest context) and DR6 being read by guest software is
orders of magnitude larger in a nested setup.  E.g. in non-nested, the
VMX preemption timer would need to fire precisely between #DB injection
and the #DB handler's read of DR6, whereas with a KVM-on-KVM setup, the
window where hardware DR6 is "dirty" extends all the way from L1 writing
DR6 to VMRESUME (in L1).

    L1's view:
    ==========
    <L1 disables DR interception>
           CPU 0/KVM-7289    [023] d....  2925.640961: kvm_entry: vcpu 0
 A:  L1 Writes DR6
           CPU 0/KVM-7289    [023] d....  2925.640963: <hack>: Set DRs, DR6 = 0xffff0ff1

 B:        CPU 0/KVM-7289    [023] d....  2925.640967: kvm_exit: vcpu 0 reason EXTERNAL_INTERRUPT intr_info 0x800000ec

 D: L1 reads DR6, arch.dr6 = 0
           CPU 0/KVM-7289    [023] d....  2925.640969: <hack>: Sync DRs, DR6 = 0xffff0ff0

           CPU 0/KVM-7289    [023] d....  2925.640976: kvm_entry: vcpu 0
    L2 reads DR6, L1 disables DR interception
           CPU 0/KVM-7289    [023] d....  2925.640980: kvm_exit: vcpu 0 reason DR_ACCESS info1 0x0000000000000216
           CPU 0/KVM-7289    [023] d....  2925.640983: kvm_entry: vcpu 0

           CPU 0/KVM-7289    [023] d....  2925.640983: <hack>: Set DRs, DR6 = 0xffff0ff0

    L2 detects failure
           CPU 0/KVM-7289    [023] d....  2925.640987: kvm_exit: vcpu 0 reason HLT
    L1 reads DR6 (confirms failure)
           CPU 0/KVM-7289    [023] d....  2925.640990: <hack>: Sync DRs, DR6 = 0xffff0ff0

    L0's view:
    ==========
    L2 reads DR6, arch.dr6 = 0
          CPU 23/KVM-5046    [001] d....  3410.005610: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216
          CPU 23/KVM-5046    [001] .....  3410.005610: kvm_nested_vmexit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216

    L2 => L1 nested VM-Exit
          CPU 23/KVM-5046    [001] .....  3410.005610: kvm_nested_vmexit_inject: reason: DR_ACCESS ext_inf1: 0x0000000000000216

          CPU 23/KVM-5046    [001] d....  3410.005610: kvm_entry: vcpu 23
          CPU 23/KVM-5046    [001] d....  3410.005611: kvm_exit: vcpu 23 reason VMREAD
          CPU 23/KVM-5046    [001] d....  3410.005611: kvm_entry: vcpu 23
          CPU 23/KVM-5046    [001] d....  3410.005612: kvm_exit: vcpu 23 reason VMREAD
          CPU 23/KVM-5046    [001] d....  3410.005612: kvm_entry: vcpu 23

    L1 writes DR7, L0 disables DR interception
          CPU 23/KVM-5046    [001] d....  3410.005612: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000007
          CPU 23/KVM-5046    [001] d....  3410.005613: kvm_entry: vcpu 23

    L0 writes DR6 = 0 (arch.dr6)
          CPU 23/KVM-5046    [001] d....  3410.005613: <hack>: Set DRs, DR6 = 0xffff0ff0

 A: <L1 writes DR6 = 1, no interception, arch.dr6 is still '0'>

 B:       CPU 23/KVM-5046    [001] d....  3410.005614: kvm_exit: vcpu 23 reason PREEMPTION_TIMER
          CPU 23/KVM-5046    [001] d....  3410.005614: kvm_entry: vcpu 23

 C: L0 writes DR6 = 0 (arch.dr6)
          CPU 23/KVM-5046    [001] d....  3410.005614: <hack>: Set DRs, DR6 = 0xffff0ff0

    L1 => L2 nested VM-Enter
          CPU 23/KVM-5046    [001] d....  3410.005616: kvm_exit: vcpu 23 reason VMRESUME

    L0 reads DR6, arch.dr6 = 0

Reported-by: John Stultz <jstultz@google.com>
Closes: https://lkml.kernel.org/r/CANDhNCq5_F3HfFYABqFGCA1bPd_%2BxgNj-iDQhH4tDk%2Bwi8iZZg%40mail.gmail.com
Fixes: 375e28ffc0 ("KVM: X86: Set host DR6 only on VMX and for KVM_DEBUGREG_WONT_EXIT")
Fixes: d67668e9dd ("KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20250125011833.3644371-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 08:59:38 -08:00
Sean Christopherson
46d6c6f3ef KVM: nSVM: Enter guest mode before initializing nested NPT MMU
When preparing vmcb02 for nested VMRUN (or state restore), "enter" guest
mode prior to initializing the MMU for nested NPT so that guest_mode is
set in the MMU's role.  KVM's model is that all L2 MMUs are tagged with
guest_mode, as the behavior of hypervisor MMUs tends to be significantly
different than kernel MMUs.

Practically speaking, the bug is relatively benign, as KVM only directly
queries role.guest_mode in kvm_mmu_free_guest_mode_roots() and
kvm_mmu_page_ad_need_write_protect(), which SVM doesn't use, and in paths
that are optimizations (mmu_page_zap_pte() and
shadow_mmu_try_split_huge_pages()).

And while the role is incorprated into shadow page usage, because nested
NPT requires KVM to be using NPT for L1, reusing shadow pages across L1
and L2 is impossible as L1 MMUs will always have direct=1, while L2 MMUs
will have direct=0.

Hoist the TLB processing and setting of HF_GUEST_MASK to the beginning
of the flow instead of forcing guest_mode in the MMU, as nothing in
nested_vmcb02_prepare_control() between the old and new locations touches
TLB flush requests or HF_GUEST_MASK, i.e. there's no reason to present
inconsistent vCPU state to the MMU.

Fixes: 69cb877487 ("KVM: nSVM: move MMU setup to nested_prepare_vmcb_control")
Cc: stable@vger.kernel.org
Reported-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://lore.kernel.org/r/20250130010825.220346-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-12 08:57:55 -08:00
Sean Christopherson
a6136669da KVM: SVM: Ensure PSP module is initialized if KVM module is built-in
The kernel's initcall infrastructure lacks the ability to express
dependencies between initcalls, whereas the modules infrastructure
automatically handles dependencies via symbol loading.  Ensure the
PSP SEV driver is initialized before proceeding in sev_hardware_setup()
if KVM is built-in as the dependency isn't handled by the initcall
infrastructure.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Link: https://lore.kernel.org/r/f78ddb64087df27e7bcb1ae0ab53f55aa0804fab.1739226950.git.ashish.kalra@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-11 13:29:08 -08:00
Nikunj A Dadhania
8a01902a01 KVM: SEV: Use to_kvm_sev_info() for fetching kvm_sev_info struct
Simplify code by replacing &to_kvm_svm(kvm)->sev_info with
to_kvm_sev_info() helper function. Wherever possible, drop the local
variable declaration and directly use the helper instead.

No functional changes.

Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Pavan Kumar Paluri <papaluri@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250123055140.144378-1-nikunj@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-02-11 10:32:46 -08:00
Linus Torvalds
0f8e26b38d Loongarch:
* Clear LLBCTL if secondary mmu mapping changes.
 
 * Add hypercall service support for usermode VMM.
 
 x86:
 
 * Add a comment to kvm_mmu_do_page_fault() to explain why KVM performs a
   direct call to kvm_tdp_page_fault() when RETPOLINE is enabled.
 
 * Ensure that all SEV code is compiled out when disabled in Kconfig, even
   if building with less brilliant compilers.
 
 * Remove a redundant TLB flush on AMD processors when guest CR4.PGE changes.
 
 * Use str_enabled_disabled() to replace open coded strings.
 
 * Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's APICv cache
   prior to every VM-Enter.
 
 * Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
   instead of just those where KVM needs to manage state and/or explicitly
   enable the feature in hardware.  Along the way, refactor the code to make
   it easier to add features, and to make it more self-documenting how KVM
   is handling each feature.
 
 * Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
   where KVM unintentionally puts the vCPU into infinite loops in some scenarios
   (e.g. if emulation is triggered by the exit), and brings parity between VMX
   and SVM.
 
 * Add pending request and interrupt injection information to the kvm_exit and
   kvm_entry tracepoints respectively.
 
 * Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
   loading guest/host PKRU, due to a refactoring of the kernel helpers that
   didn't account for KVM's pre-checking of the need to do WRPKRU.
 
 * Make the completion of hypercalls go through the complete_hypercall
   function pointer argument, no matter if the hypercall exits to
   userspace or not.  Previously, the code assumed that KVM_HC_MAP_GPA_RANGE
   specifically went to userspace, and all the others did not; the new code
   need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
   all whether there was an exit to userspace or not.
 
 * As part of enabling TDX virtual machines, support support separation of
   private/shared EPT into separate roots.  When TDX will be enabled, operations
   on private pages will need to go through the privileged TDX Module via SEAMCALLs;
   as a result, they are limited and relatively slow compared to reading a PTE.
   The patches included in 6.14 allow KVM to keep a mirror of the private EPT in
   host memory, and define entries in kvm_x86_ops to operate on external page
   tables such as the TDX private EPT.
 
 * The recently introduced conversion of the NX-page reclamation kthread to
   vhost_task moved the task under the main process.  The task is created as
   soon as KVM_CREATE_VM was invoked and this, of course, broke userspace that
   didn't expect to see any child task of the VM process until it started
   creating its own userspace threads.  In particular crosvm refuses to fork()
   if procfs shows any child task, so unbreak it by creating the task lazily.
   This is arguably a userspace bug, as there can be other kinds of legitimate
   worker tasks and they wouldn't impede fork(); but it's not like userspace
   has a way to distinguish kernel worker tasks right now.  Should they show
   as "Kthread: 1" in proc/.../status?
 
 x86 - Intel:
 
 * Fix a bug where KVM updates hardware's APICv cache of the highest ISR bit
   while L2 is active, while ultimately results in a hardware-accelerated L1
   EOI effectively being lost.
 
 * Honor event priority when emulating Posted Interrupt delivery during nested
   VM-Enter by queueing KVM_REQ_EVENT instead of immediately handling the
   interrupt.
 
 * Rework KVM's processing of the Page-Modification Logging buffer to reap
   entries in the same order they were created, i.e. to mark gfns dirty in the
   same order that hardware marked the page/PTE dirty.
 
 * Misc cleanups.
 
 Generic:
 
 * Cleanup and harden kvm_set_memory_region(); add proper lockdep assertions when
   setting memory regions and add a dedicated API for setting KVM-internal
   memory regions.  The API can then explicitly disallow all flags for
   KVM-internal memory regions.
 
 * Explicitly verify the target vCPU is online in kvm_get_vcpu() to fix a bug
   where KVM would return a pointer to a vCPU prior to it being fully online,
   and give kvm_for_each_vcpu() similar treatment to fix a similar flaw.
 
 * Wait for a vCPU to come online prior to executing a vCPU ioctl, to fix a
   bug where userspace could coerce KVM into handling the ioctl on a vCPU that
   isn't yet onlined.
 
 * Gracefully handle xarray insertion failures; even though such failures are
   impossible in practice after xa_reserve(), reserving an entry is always followed
   by xa_store() which does not know (or differentiate) whether there was an
   xa_reserve() before or not.
 
 RISC-V:
 
 * Zabha, Svvptc, and Ziccrse extension support for guests.  None of them
   require anything in KVM except for detecting them and marking them
   as supported; Zabha adds byte and halfword atomic operations, while the
   others are markers for specific operation of the TLB and of LL/SC
   instructions respectively.
 
 * Virtualize SBI system suspend extension for Guest/VM
 
 * Support firmware counters which can be used by the guests to collect
   statistics about traps that occur in the host.
 
 Selftests:
 
 * Rework vcpu_get_reg() to return a value instead of using an out-param, and
   update all affected arch code accordingly.
 
 * Convert the max_guest_memory_test into a more generic mmu_stress_test.
   The basic gist of the "conversion" is to have the test do mprotect() on
   guest memory while vCPUs are accessing said memory, e.g. to verify KVM
   and mmu_notifiers are working as intended.
 
 * Play nice with treewrite builds of unsupported architectures, e.g. arm
   (32-bit), as KVM selftests' Makefile doesn't do anything to ensure the
   target architecture is actually one KVM selftests supports.
 
 * Use the kernel's $(ARCH) definition instead of the target triple for arch
   specific directories, e.g. arm64 instead of aarch64, mainly so as not to
   be different from the rest of the kernel.
 
 * Ensure that format strings for logging statements are checked by the
   compiler even when the logging statement itself is disabled.
 
 * Attempt to whack the last LLC references/misses mole in the Intel PMU
   counters test by adding a data load and doing CLFLUSH{OPT} on the data
   instead of the code being executed.  It seems that modern Intel CPUs
   have learned new code prefetching tricks that bypass the PMU counters.
 
 * Fix a flaw in the Intel PMU counters test where it asserts that events
   are counting correctly without actually knowing what the events count
   given the underlying hardware; this can happen if Intel reuses a
   formerly microarchitecture-specific event encoding as an architectural
   event, as was the case for Top-Down Slots.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmeTuzoUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOkBwf8CRNExYaM3j9y2E7mmo6AiL2ug6+J
 Uy5Hai1poY48pPwKC6ke3EWT8WVsgj/Py5pCeHvLojQchWNjCCYNfSQluJdkRxwG
 DgP3QUljSxEJWBeSwyTRcKM+IySi5hZd1IFo3gePFRB829Jpnj05vjbvCyv8gIwU
 y3HXxSYDsViaaFoNg4OlZFsIGis7mtknsZzk++QjuCXmxNa6UCbv3qvE/UkVLhVg
 WH65RTRdjk+EsdwaOMHKuUvQoGa+iM4o39b6bqmw8+ZMK39+y33WeTX/y5RXsp1N
 tUUBRfS+MuuYgC/6LmTr66EkMzoChxk3Dp3kKUaCBcfqRC8PxQag5reZhw==
 =NEaO
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "Loongarch:

   - Clear LLBCTL if secondary mmu mapping changes

   - Add hypercall service support for usermode VMM

  x86:

   - Add a comment to kvm_mmu_do_page_fault() to explain why KVM
     performs a direct call to kvm_tdp_page_fault() when RETPOLINE is
     enabled

   - Ensure that all SEV code is compiled out when disabled in Kconfig,
     even if building with less brilliant compilers

   - Remove a redundant TLB flush on AMD processors when guest CR4.PGE
     changes

   - Use str_enabled_disabled() to replace open coded strings

   - Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's
     APICv cache prior to every VM-Enter

   - Overhaul KVM's CPUID feature infrastructure to track all vCPU
     capabilities instead of just those where KVM needs to manage state
     and/or explicitly enable the feature in hardware. Along the way,
     refactor the code to make it easier to add features, and to make it
     more self-documenting how KVM is handling each feature

   - Rework KVM's handling of VM-Exits during event vectoring; this
     plugs holes where KVM unintentionally puts the vCPU into infinite
     loops in some scenarios (e.g. if emulation is triggered by the
     exit), and brings parity between VMX and SVM

   - Add pending request and interrupt injection information to the
     kvm_exit and kvm_entry tracepoints respectively

   - Fix a relatively benign flaw where KVM would end up redoing RDPKRU
     when loading guest/host PKRU, due to a refactoring of the kernel
     helpers that didn't account for KVM's pre-checking of the need to
     do WRPKRU

   - Make the completion of hypercalls go through the complete_hypercall
     function pointer argument, no matter if the hypercall exits to
     userspace or not.

     Previously, the code assumed that KVM_HC_MAP_GPA_RANGE specifically
     went to userspace, and all the others did not; the new code need
     not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
     all whether there was an exit to userspace or not

   - As part of enabling TDX virtual machines, support support
     separation of private/shared EPT into separate roots.

     When TDX will be enabled, operations on private pages will need to
     go through the privileged TDX Module via SEAMCALLs; as a result,
     they are limited and relatively slow compared to reading a PTE.

     The patches included in 6.14 allow KVM to keep a mirror of the
     private EPT in host memory, and define entries in kvm_x86_ops to
     operate on external page tables such as the TDX private EPT

   - The recently introduced conversion of the NX-page reclamation
     kthread to vhost_task moved the task under the main process. The
     task is created as soon as KVM_CREATE_VM was invoked and this, of
     course, broke userspace that didn't expect to see any child task of
     the VM process until it started creating its own userspace threads.

     In particular crosvm refuses to fork() if procfs shows any child
     task, so unbreak it by creating the task lazily. This is arguably a
     userspace bug, as there can be other kinds of legitimate worker
     tasks and they wouldn't impede fork(); but it's not like userspace
     has a way to distinguish kernel worker tasks right now. Should they
     show as "Kthread: 1" in proc/.../status?

  x86 - Intel:

   - Fix a bug where KVM updates hardware's APICv cache of the highest
     ISR bit while L2 is active, while ultimately results in a
     hardware-accelerated L1 EOI effectively being lost

   - Honor event priority when emulating Posted Interrupt delivery
     during nested VM-Enter by queueing KVM_REQ_EVENT instead of
     immediately handling the interrupt

   - Rework KVM's processing of the Page-Modification Logging buffer to
     reap entries in the same order they were created, i.e. to mark gfns
     dirty in the same order that hardware marked the page/PTE dirty

   - Misc cleanups

  Generic:

   - Cleanup and harden kvm_set_memory_region(); add proper lockdep
     assertions when setting memory regions and add a dedicated API for
     setting KVM-internal memory regions. The API can then explicitly
     disallow all flags for KVM-internal memory regions

   - Explicitly verify the target vCPU is online in kvm_get_vcpu() to
     fix a bug where KVM would return a pointer to a vCPU prior to it
     being fully online, and give kvm_for_each_vcpu() similar treatment
     to fix a similar flaw

   - Wait for a vCPU to come online prior to executing a vCPU ioctl, to
     fix a bug where userspace could coerce KVM into handling the ioctl
     on a vCPU that isn't yet onlined

   - Gracefully handle xarray insertion failures; even though such
     failures are impossible in practice after xa_reserve(), reserving
     an entry is always followed by xa_store() which does not know (or
     differentiate) whether there was an xa_reserve() before or not

  RISC-V:

   - Zabha, Svvptc, and Ziccrse extension support for guests. None of
     them require anything in KVM except for detecting them and marking
     them as supported; Zabha adds byte and halfword atomic operations,
     while the others are markers for specific operation of the TLB and
     of LL/SC instructions respectively

   - Virtualize SBI system suspend extension for Guest/VM

   - Support firmware counters which can be used by the guests to
     collect statistics about traps that occur in the host

  Selftests:

   - Rework vcpu_get_reg() to return a value instead of using an
     out-param, and update all affected arch code accordingly

   - Convert the max_guest_memory_test into a more generic
     mmu_stress_test. The basic gist of the "conversion" is to have the
     test do mprotect() on guest memory while vCPUs are accessing said
     memory, e.g. to verify KVM and mmu_notifiers are working as
     intended

   - Play nice with treewrite builds of unsupported architectures, e.g.
     arm (32-bit), as KVM selftests' Makefile doesn't do anything to
     ensure the target architecture is actually one KVM selftests
     supports

   - Use the kernel's $(ARCH) definition instead of the target triple
     for arch specific directories, e.g. arm64 instead of aarch64,
     mainly so as not to be different from the rest of the kernel

   - Ensure that format strings for logging statements are checked by
     the compiler even when the logging statement itself is disabled

   - Attempt to whack the last LLC references/misses mole in the Intel
     PMU counters test by adding a data load and doing CLFLUSH{OPT} on
     the data instead of the code being executed. It seems that modern
     Intel CPUs have learned new code prefetching tricks that bypass the
     PMU counters

   - Fix a flaw in the Intel PMU counters test where it asserts that
     events are counting correctly without actually knowing what the
     events count given the underlying hardware; this can happen if
     Intel reuses a formerly microarchitecture-specific event encoding
     as an architectural event, as was the case for Top-Down Slots"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (151 commits)
  kvm: defer huge page recovery vhost task to later
  KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()
  KVM: Disallow all flags for KVM-internal memslots
  KVM: x86: Drop double-underscores from __kvm_set_memory_region()
  KVM: Add a dedicated API for setting KVM-internal memslots
  KVM: Assert slots_lock is held when setting memory regions
  KVM: Open code kvm_set_memory_region() into its sole caller (ioctl() API)
  LoongArch: KVM: Add hypercall service support for usermode VMM
  LoongArch: KVM: Clear LLBCTL if secondary mmu mapping is changed
  KVM: SVM: Use str_enabled_disabled() helper in svm_hardware_setup()
  KVM: VMX: read the PML log in the same order as it was written
  KVM: VMX: refactor PML terminology
  KVM: VMX: Fix comment of handle_vmx_instruction()
  KVM: VMX: Reinstate __exit attribute for vmx_exit()
  KVM: SVM: Use str_enabled_disabled() helper in sev_hardware_setup()
  KVM: x86: Avoid double RDPKRU when loading host/guest PKRU
  KVM: x86: Use LVT_TIMER instead of an open coded literal
  RISC-V: KVM: Add new exit statstics for redirected traps
  RISC-V: KVM: Update firmware counters for various events
  RISC-V: KVM: Redirect instruction access fault trap to guest
  ...
2025-01-25 09:55:09 -08:00
Linus Torvalds
a6640c8c2f Objtool changes for v6.14:
- Introduce the generic section-based annotation
    infrastructure a.k.a. ASM_ANNOTATE/ANNOTATE (Peter Zijlstra)
 
  - Convert various facilities to ASM_ANNOTATE/ANNOTATE: (Peter Zijlstra)
 
     - ANNOTATE_NOENDBR
     - ANNOTATE_RETPOLINE_SAFE
     - instrumentation_{begin,end}()
     - VALIDATE_UNRET_BEGIN
     - ANNOTATE_IGNORE_ALTERNATIVE
     - ANNOTATE_INTRA_FUNCTION_CALL
     - {.UN}REACHABLE
 
  - Optimize the annotation-sections parsing code (Peter Zijlstra)
 
  - Centralize annotation definitions in <linux/objtool.h>
 
  - Unify & simplify the barrier_before_unreachable()/unreachable()
    definitions (Peter Zijlstra)
 
  - Convert unreachable() calls to BUG() in x86 code, as
    unreachable() has unreliable code generation (Peter Zijlstra)
 
  - Remove annotate_reachable() and annotate_unreachable(), as it's
    unreliable against compiler optimizations (Peter Zijlstra)
 
  - Fix non-standard ANNOTATE_REACHABLE annotation order (Peter Zijlstra)
 
  - Robustify the annotation code by warning about unknown annotation
    types (Peter Zijlstra)
 
  - Allow arch code to discover jump table size, in preparation of
    annotated jump table support (Ard Biesheuvel)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmeOHiARHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gATw/7Bn4A+Isqk9bKo6QgYEnKRoyf760ALQl6
 av/toEy1qCHT/CXCiEn1Hut1JEy4YyD6lIarC1scRl5xy7amRDEcCL0i2CKz3orn
 pf6Fk8/Pi68G2K50o4LTiq8t3uPBJXPlGyDlngh2hFTYRfPRT4m+cig784hmJEXG
 Xq2YzzUNG++U/4Uwe3JH7bX/vcZTYkZfM62FWfp3I4V0OqKU4c+Pkiv4u3Rs7L7b
 c3xk5/PktKZWV5TDsz0wU4SAGxYFGV47hhYM6cxdSYD3la7RVO+qZcqxsJByjpcL
 bvOmGKQ1SAXr08rV7TB+Fh8icaNE8Rbbmxf6slB0hdXBQb8STAZ810mZJFey6pnm
 kXgfhhfBOK5Sq+UbTfzF2JgquCGAbKK75bmNGgf2HaLnVLkFIw3AyMsuFqnxhI4X
 vXRHGnHCYpYUHTxzRYTFYR8XL8twA2kgjWkSe7hYrX/RQZV3XfyKOc2jyoJFMXeX
 LecfGJCE/pziZyj60SXT9WaUTvKc8gjWOEuAnW1pJQRM0zJqB9kjLh1cDYUseuwv
 gGkH59KEu0kcfOb5t/jWoqW3PTENJjEAhOmjun6Jv8wgbOxU88TMmSCWppj54O2X
 c2ibO407535u1SKBWZuaKFBLYftS2GM4WaGsdyTyh+ta48C8An90HMfYNKTHM9Nz
 F61Q7Zbn65E=
 =9nGt
 -----END PGP SIGNATURE-----

Merge tag 'objtool-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool updates from Ingo Molnar:

 - Introduce the generic section-based annotation infrastructure a.k.a.
   ASM_ANNOTATE/ANNOTATE (Peter Zijlstra)

 - Convert various facilities to ASM_ANNOTATE/ANNOTATE: (Peter Zijlstra)
    - ANNOTATE_NOENDBR
    - ANNOTATE_RETPOLINE_SAFE
    - instrumentation_{begin,end}()
    - VALIDATE_UNRET_BEGIN
    - ANNOTATE_IGNORE_ALTERNATIVE
    - ANNOTATE_INTRA_FUNCTION_CALL
    - {.UN}REACHABLE

 - Optimize the annotation-sections parsing code (Peter Zijlstra)

 - Centralize annotation definitions in <linux/objtool.h>

 - Unify & simplify the barrier_before_unreachable()/unreachable()
   definitions (Peter Zijlstra)

 - Convert unreachable() calls to BUG() in x86 code, as unreachable()
   has unreliable code generation (Peter Zijlstra)

 - Remove annotate_reachable() and annotate_unreachable(), as it's
   unreliable against compiler optimizations (Peter Zijlstra)

 - Fix non-standard ANNOTATE_REACHABLE annotation order (Peter Zijlstra)

 - Robustify the annotation code by warning about unknown annotation
   types (Peter Zijlstra)

 - Allow arch code to discover jump table size, in preparation of
   annotated jump table support (Ard Biesheuvel)

* tag 'objtool-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Convert unreachable() to BUG()
  objtool: Allow arch code to discover jump table size
  objtool: Warn about unknown annotation types
  objtool: Fix ANNOTATE_REACHABLE to be a normal annotation
  objtool: Convert {.UN}REACHABLE to ANNOTATE
  objtool: Remove annotate_{,un}reachable()
  loongarch: Use ASM_REACHABLE
  x86: Convert unreachable() to BUG()
  unreachable: Unify
  objtool: Collect more annotations in objtool.h
  objtool: Collapse annotate sequences
  objtool: Convert ANNOTATE_INTRA_FUNCTION_CALL to ANNOTATE
  objtool: Convert ANNOTATE_IGNORE_ALTERNATIVE to ANNOTATE
  objtool: Convert VALIDATE_UNRET_BEGIN to ANNOTATE
  objtool: Convert instrumentation_{begin,end}() to ANNOTATE
  objtool: Convert ANNOTATE_RETPOLINE_SAFE to ANNOTATE
  objtool: Convert ANNOTATE_NOENDBR to ANNOTATE
  objtool: Generic annotation infrastructure
2025-01-21 10:13:11 -08:00
Paolo Bonzini
3eba032bb7 Merge branch 'kvm-userspace-hypercall' into HEAD
Make the completion of hypercalls go through the complete_hypercall
function pointer argument, no matter if the hypercall exits to
userspace or not.  Previously, the code assumed that KVM_HC_MAP_GPA_RANGE
specifically went to userspace, and all the others did not; the new code
need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
all whether there was an exit to userspace or not.
2025-01-20 07:03:06 -05:00
Paolo Bonzini
4f7ff70c05 KVM x86 misc changes for 6.14:
- Overhaul KVM's CPUID feature infrastructure to replace "governed" features
    with per-vCPU tracking of the vCPU's capabailities for all features.  Along
    the way, refactor the code to make it easier to add/modify features, and
    add a variety of self-documenting macro types to again simplify adding new
    features and to help readers understand KVM's handling of existing features.
 
  - Rework KVM's handling of VM-Exits during event vectoring to plug holes where
    KVM unintentionally puts the vCPU into infinite loops in some scenarios,
    e.g. if emulation is triggered by the exit, and to bring parity between VMX
    and SVM.
 
  - Add pending request and interrupt injection information to the kvm_exit and
    kvm_entry tracepoints respectively.
 
  - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
    loading guest/host PKRU due to a refactoring of the kernel helpers that
    didn't account for KVM's pre-checking of the need to do WRPKRU.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmeJngsACgkQOlYIJqCj
 N/1dfA/+NIZmnd8OV9Zvc6HGxrzgt4QsM9pmsUmrfkDWefxMYIAMeaW8Vn4CJfRf
 zY/UcqyNI7JYxSiuVTckz+Tf54HhqYaLrUwILGCQ49koirZx+aQT1OUfjLroVMlh
 ffX1i6GOoLNtxjb9MXM/heLVdUbvmzQMSFkd/AkOH+nrOtDNOiPlZfjHsewj9zrf
 BNJGhzvT4M6vc/AsScC7tc0yFD5KKFRv8tVwJ6Zf1nWKyUDOSpMTWkVnq6geKJPZ
 iGBZPPNg55Oy1g6uj6VYWmqYTD8Qioz5jtEJ/8pPHdAyIFo21s81bfJc548d+QLh
 KfrL1K7TrCOhSAGC3Cb3lTLeq2immmGHaiTBLwGABG4MhpiX4NVpMMdOyFbVLMOS
 HIYuwXwDckm1pfU7/w+PgPaakCyPrXQntm+3Y2pvDOoY6e2JbwodK4j8BvvQda35
 8TrYKEGFvq5aij7Iw1O9TUoLAocDM/sHIHE6BCazHyzKBIv9xLRFeabiCQ+A1pwv
 gZk5u0+j+DPpLdeLhbMYhIXUtr3bvyMYvc+tRkG716f8ubAE3+Kn5BEDo4Ot2DcT
 vc+NTRYYWN6zavHiJH3Ddt153yj256JCZhLwCdfbryCQdz3Mpy16m36tgkDRd3lR
 QT4IkPQo1Vl/aU0yiE/dhnJgh1rTO26YQjZoHs5Oj16d0HRrKyc=
 =32mM
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.14' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.14:

 - Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
   instead of just those where KVM needs to manage state and/or explicitly
   enable the feature in hardware.  Along the way, refactor the code to make
   it easier to add features, and to make it more self-documenting how KVM
   is handling each feature.

 - Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
   where KVM unintentionally puts the vCPU into infinite loops in some scenarios
   (e.g. if emulation is triggered by the exit), and brings parity between VMX
   and SVM.

 - Add pending request and interrupt injection information to the kvm_exit and
   kvm_entry tracepoints respectively.

 - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
   loading guest/host PKRU, due to a refactoring of the kernel helpers that
   didn't account for KVM's pre-checking of the need to do WRPKRU.
2025-01-20 06:49:39 -05:00
Paolo Bonzini
672162a0d3 KVM SVM changes for 6.14:
- Macrofy the SEV=n version of the sev_xxx_guest() helpers so that the code is
    optimized away when building with less than brilliant compilers.
 
  - Remove a now-redundant TLB flush when guest CR4.PGE changes.
 
  - Use str_enabled_disabled() to replace open coded strings.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmeJoesACgkQOlYIJqCj
 N/1FuQ//bKhTrPVvMhCZB7cxS7wcpb82LA15Gc73hXBTjsik896O6fAN0MyvYWvp
 eLmvhRWloVspIAkLj4pYF7PB6Tl0VAtpGOIdZp9TXOkbGcWg7jFFBrF3Udr2kDig
 3YyinqGhwPN6oSkrJMGuk79EKhuRVGcb7utE/T8MCV5ct9kayQAogGYMLwfnpYvX
 02SRYj++PUz5rk/gVUA2RA/EFSe57JE2VrzQlT34ttqqMZ2JKtysl5ytONGpEW/u
 ZJQ4naztY+vRCLHm+n8ZWE5TCQ3ZZVR5WVcnMqx1MQQEVSZJT5SxveLGVu9tVV4c
 Fv5fTvxa6iKfsUYl4o6tX1E0s388mwEast6uY40q4GysgkpsRRQR1KOLMGTQTMiz
 7hagYbc9gGY3Pr6pn1+qf3hgVogcDX0wnQlFe6PYkHiMc7woIhUCGJ3LND7Y11O+
 cPnbdK+0ZZhsOzYQ6SGeXIv326wvZacqKzXkkWt9pf8abbwaUXVqjYe+Y0yEHsuJ
 e1gPnTdtzzim3Jq5qDRhPEHthHLm+na4O+6vjYlFVoUN4XnbNGhVzUJvQX70WX8a
 6vrQ6gM768dwkyzdpRAGuWv+OWXHBe8UYKp4HQKpWGjRcKiwcXWDwQeIbfE7JkS+
 L/Ck0oQ72sWoBoH4noM/5ehVB4ZY/bEp+GbVbd/ny6nDP1fmApM=
 =MhtD
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-svm-6.14' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 6.14:

 - Macrofy the SEV=n version of the sev_xxx_guest() helpers so that the code is
   optimized away when building with less than brilliant compilers.

 - Remove a now-redundant TLB flush when guest CR4.PGE changes.

 - Use str_enabled_disabled() to replace open coded strings.
2025-01-20 06:48:35 -05:00
Thorsten Blum
4c334c6880 KVM: SVM: Use str_enabled_disabled() helper in svm_hardware_setup()
Remove hard-coded strings by using the str_enabled_disabled() helper
function.

Suggested-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://lore.kernel.org/r/20250110101100.272312-2-thorsten.blum@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-10 06:56:20 -08:00
Thorsten Blum
800173cf75 KVM: SVM: Use str_enabled_disabled() helper in sev_hardware_setup()
Remove hard-coded strings by using the str_enabled_disabled() helper
function.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Pavan Kumar Paluri <papaluri@amd.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20241227094450.674104-2-thorsten.blum@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-01-08 14:08:41 -08:00
Binbin Wu
c4c083d951 KVM: x86: Add a helper to check for user interception of KVM hypercalls
Add and use user_exit_on_hypercall() to check if userspace wants to handle
a KVM hypercall instead of open-coding the logic everywhere.

No functional change intended.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
[sean: squash into one patch, keep explicit KVM_HC_MAP_GPA_RANGE check]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-ID: <20241128004344.4072099-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-22 13:00:25 -05:00
Paolo Bonzini
9a1dfeff44 KVM: x86: clear vcpu->run->hypercall.ret before exiting for KVM_EXIT_HYPERCALL
QEMU up to 9.2.0 is assuming that vcpu->run->hypercall.ret is 0 on exit and
it never modifies it when processing KVM_EXIT_HYPERCALL.  Make this explicit
in the code, to avoid breakage when KVM starts modifying that field.

This in principle is not a good idea... It would have been much better if
KVM had set the field to -KVM_ENOSYS from the beginning, so that a dumb
userspace that does nothing on KVM_EXIT_HYPERCALL would tell the guest it
does not support KVM_HC_MAP_GPA_RANGE.  However, breaking userspace is
a Very Bad Thing, as everybody should know.

Reported-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-12-22 13:00:25 -05:00
Sean Christopherson
4d5163cba4 KVM: SVM: Allow guest writes to set MSR_AMD64_DE_CFG bits
Drop KVM's arbitrary behavior of making DE_CFG.LFENCE_SERIALIZE read-only
for the guest, as rejecting writes can lead to guest crashes, e.g. Windows
in particular doesn't gracefully handle unexpected #GPs on the WRMSR, and
nothing in the AMD manuals suggests that LFENCE_SERIALIZE is read-only _if
it exists_.

KVM only allows LFENCE_SERIALIZE to be set, by the guest or host, if the
underlying CPU has X86_FEATURE_LFENCE_RDTSC, i.e. if LFENCE is guaranteed
to be serializing.  So if the guest sets LFENCE_SERIALIZE, KVM will provide
the desired/correct behavior without any additional action (the guest's
value is never stuffed into hardware).  And having LFENCE be serializing
even when it's not _required_ to be is a-ok from a functional perspective.

Fixes: 74a0e79df6 ("KVM: SVM: Disallow guest from changing userspace's MSR_AMD64_DE_CFG value")
Fixes: d1d93fa90f ("KVM: SVM: Add MSR-based feature support for serializing LFENCE")
Reported-by: Simon Pilkington <simonp.git@mailbox.org>
Closes: https://lore.kernel.org/all/52914da7-a97b-45ad-86a0-affdf8266c61@mailbox.org
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20241211172952.1477605-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-19 17:47:52 -08:00
Suravee Suthikulpanit
d81cadbe16 KVM: SVM: Disable AVIC on SNP-enabled system without HvInUseWrAllowed feature
On SNP-enabled system, VMRUN marks AVIC Backing Page as in-use while
the guest is running for both secure and non-secure guest. Any hypervisor
write to the in-use vCPU's AVIC backing page (e.g. to inject an interrupt)
will generate unexpected #PF in the host.

Currently, attempt to run AVIC guest would result in the following error:

    BUG: unable to handle page fault for address: ff3a442e549cc270
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x80000003) - RMP violation
    PGD b6ee01067 P4D b6ee02067 PUD 10096d063 PMD 11c540063 PTE 80000001149cc163
    SEV-SNP: PFN 0x1149cc unassigned, dumping non-zero entries in 2M PFN region: [0x114800 - 0x114a00]
    ...

Newer AMD system is enhanced to allow hypervisor to modify the backing page
for non-secure guest on SNP-enabled system. This enhancement is available
when the CPUID Fn8000_001F_EAX bit 30 is set (HvInUseWrAllowed).

This table describes AVIC support matrix w.r.t. SNP enablement:

               | Non-SNP system |     SNP system
-----------------------------------------------------
 Non-SNP guest |  AVIC Activate | AVIC Activate iff
               |                | HvInuseWrAllowed=1
-----------------------------------------------------
     SNP guest |      N/A       |    Secure AVIC

Therefore, check and disable AVIC in kvm_amd driver when the feature is not
available on SNP-enabled system.

See the AMD64 Architecture Programmer’s Manual (APM) Volume 2 for detail.
(https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/
programmer-references/40332.pdf)

Fixes: 216d106c7f ("x86/sev: Add SEV-SNP host initialization support")
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20241104075845.7583-1-suravee.suthikulpanit@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-19 17:47:51 -08:00
Maxim Levitsky
3e633e7e7d KVM: x86: Add interrupt injection information to the kvm_entry tracepoint
Add VMX/SVM specific interrupt injection info the kvm_entry tracepoint.
As is done with kvm_exit, gather the information via a kvm_x86_ops hook
to avoid the moderately costly VMREADs on VMX when the tracepoint isn't
enabled.

Opportunistically rename the parameters in the get_exit_info()
declaration to match the names used by both SVM and VMX.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240910200350.264245-2-mlevitsk@redhat.com
[sean: drop is_guest_mode() change, use intr_info/error_code for names]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 15:14:48 -08:00
Ivan Orlov
7bd7ff9911 KVM: SVM: Handle event vectoring error in check_emulate_instruction()
Detect unhandleable vectoring in check_emulate_instruction() to prevent
infinite retry loops on SVM, and to eliminate the main differences in how
VM-Exits during event vectoring are handled on SVM versus VMX.  E.g. if
the vCPU puts its IDT in emulated MMIO memory and generates an event,
without the check_emulate_instruction() change, SVM will re-inject the
event and resume the guest, and effectively put the vCPU into an infinite
loop.

Signed-off-by: Ivan Orlov <iorlov@amazon.com>
Link: https://lore.kernel.org/r/20241217181458.68690-6-iorlov@amazon.com
[sean: grab "svm" locally, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 15:14:45 -08:00
Sean Christopherson
cbdeea032b KVM: x86: Drop superfluous host XSAVE check when adjusting guest XSAVES caps
Drop the manual boot_cpu_has() checks on XSAVE when adjusting the guest's
XSAVES capabilities now that guest cpu_caps incorporates KVM's support.
The guest's cpu_caps are initialized from kvm_cpu_caps, which are in turn
initialized from boot_cpu_data, i.e. checking guest_cpu_cap_has() also
checks host/KVM capabilities (which is the entire point of cpu_caps).

Cc: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-52-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:17 -08:00
Sean Christopherson
8f2a27752e KVM: x86: Replace (almost) all guest CPUID feature queries with cpu_caps
Switch all queries (except XSAVES) of guest features from guest CPUID to
guest capabilities, i.e. replace all calls to guest_cpuid_has() with calls
to guest_cpu_cap_has().

Keep guest_cpuid_has() around for XSAVES, but subsume its helper
guest_cpuid_get_register() and add a compile-time assertion to prevent
using guest_cpuid_has() for any other feature.  Add yet another comment
for XSAVE to explain why KVM is allowed to query its raw guest CPUID.

Opportunistically drop the unused guest_cpuid_clear(), as there should be
no circumstance in which KVM needs to _clear_ a guest CPUID feature now
that everything is tracked via cpu_caps.  E.g. KVM may need to _change_
a feature to emulate dynamic CPUID flags, but KVM should never need to
clear a feature in guest CPUID to prevent it from being used by the guest.

Delete the last remnants of the governed features framework, as the lone
holdout was vmx_adjust_secondary_exec_control()'s divergent behavior for
governed vs. ungoverned features.

Note, replacing guest_cpuid_has() checks with guest_cpu_cap_has() when
computing reserved CR4 bits is a nop when viewed as a whole, as KVM's
capabilities are already incorporated into the calculation, i.e. if a
feature is present in guest CPUID but unsupported by KVM, its CR4 bit
was already being marked as reserved, checking guest_cpu_cap_has() simply
double-stamps that it's a reserved bit.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-51-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:15 -08:00
Sean Christopherson
e592ec657d KVM: x86: Initialize guest cpu_caps based on KVM support
Constrain all guest cpu_caps based on KVM support instead of constraining
only the few features that KVM _currently_ needs to verify are actually
supported by KVM.  The intent of cpu_caps is to track what the guest is
actually capable of using, not the raw, unfiltered CPUID values that the
guest sees.

I.e. KVM should always consult it's only support when making decisions
based on guest CPUID, and the only reason KVM has historically made the
checks opt-in was due to lack of centralized tracking.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-45-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:09 -08:00
Sean Christopherson
a7a308f863 KVM: x86: Initialize guest cpu_caps based on guest CPUID
Initialize a vCPU's capabilities based on the guest CPUID provided by
userspace instead of simply zeroing the entire array.  This is the first
step toward using cpu_caps to query *all* CPUID-based guest capabilities,
i.e. will allow converting all usage of guest_cpuid_has() to
guest_cpu_cap_has().

Zeroing the array was the logical choice when using cpu_caps was opt-in,
e.g. "unsupported" was generally a safer default, and the whole point of
governed features is that KVM would need to check host and guest support,
i.e. making everything unsupported by default didn't require more code.

But requiring KVM to manually "enable" every CPUID-based feature in
cpu_caps would require an absurd amount of boilerplate code.

Follow existing CPUID/kvm_cpu_caps nomenclature where possible, e.g. for
the change() and clear() APIs.  Replace check_and_set() with constrain()
to try and capture that KVM is constraining userspace's desired guest
feature set based on KVM's capabilities.

This is intended to be gigantic nop, i.e. should not have any impact on
guest or KVM functionality.

This is also an intermediate step; a future commit will also incorporate
KVM support into the vCPU's cpu_caps before converting guest_cpuid_has()
to guest_cpu_cap_has().

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-42-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:06 -08:00
Sean Christopherson
2c5e168e5c KVM: x86: Rename "governed features" helpers to use "guest_cpu_cap"
As the first step toward replacing KVM's so-called "governed features"
framework with a more comprehensive, less poorly named implementation,
replace the "kvm_governed_feature" function prefix with "guest_cpu_cap"
and rename guest_can_use() to guest_cpu_cap_has().

The "guest_cpu_cap" naming scheme mirrors that of "kvm_cpu_cap", and
provides a more clear distinction between guest capabilities, which are
KVM controlled (heh, or one might say "governed"), and guest CPUID, which
with few exceptions is fully userspace controlled.

Opportunistically rewrite the comment about XSS passthrough for SEV-ES
guests to avoid referencing so many functions, as such comments are prone
to becoming stale (case in point...).

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-40-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:03 -08:00
Sean Christopherson
036e78a942 KVM: SVM: Remove redundant TLB flush on guest CR4.PGE change
Drop SVM's direct TLB flush when CR4.PGE is toggled and NPT is enabled, as
KVM already guarantees TLBs are flushed appropriately.

For the call from cr_trap(), kvm_post_set_cr4() requests TLB_FLUSH_GUEST
(which is a superset of TLB_FLUSH_CURRENT) when CR4.PGE is toggled,
regardless of whether or not KVM is using TDP.

The calls from nested_vmcb02_prepare_save() and nested_svm_vmexit() are
checking guest (L2) vs. host (L1) CR4, and so a flush is unnecessary as L2
is defined to use a different ASID (from L1's perspective).

Lastly, the call from svm_set_cr0() passes in the current CR4 value, i.e.
can't toggle PGE.

Link: https://lore.kernel.org/r/20241127235312.4048445-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-16 16:15:28 -08:00
Sean Christopherson
45d522d3ee KVM: SVM: Macrofy SEV=n versions of sev_xxx_guest()
Define sev_{,es_,snp_}guest() as "false" when SEV is disabled via Kconfig,
i.e. when CONFIG_KVM_AMD_SEV=n.  Despite the helpers being __always_inline,
gcc-12 is somehow incapable of realizing that the return value is a
compile-time constant and generates sub-optimal code.

Opportunistically clump the paths together to reduce the amount of
ifdeffery.

No functional change intended.

Link: https://lore.kernel.org/r/20241127234659.4046347-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-16 16:15:27 -08:00
Peter Zijlstra
2190966fbc x86: Convert unreachable() to BUG()
Avoid unreachable() as it can (and will in the absence of UBSAN)
generate fallthrough code. Use BUG() so we get a UD2 trap (with
unreachable annotation).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20241128094312.028316261@infradead.org
2024-12-02 12:01:43 +01:00
Linus Torvalds
9f16d5e6f2 The biggest change here is eliminating the awful idea that KVM had, of
essentially guessing which pfns are refcounted pages.  The reason to
 do so was that KVM needs to map both non-refcounted pages (for example
 BARs of VFIO devices) and VM_PFNMAP/VM_MIXMEDMAP VMAs that contain
 refcounted pages.  However, the result was security issues in the past,
 and more recently the inability to map VM_IO and VM_PFNMAP memory
 that _is_ backed by struct page but is not refcounted.  In particular
 this broke virtio-gpu blob resources (which directly map host graphics
 buffers into the guest as "vram" for the virtio-gpu device) with the
 amdgpu driver, because amdgpu allocates non-compound higher order pages
 and the tail pages could not be mapped into KVM.
 
 This requires adjusting all uses of struct page in the per-architecture
 code, to always work on the pfn whenever possible.  The large series that
 did this, from David Stevens and Sean Christopherson, also cleaned up
 substantially the set of functions that provided arch code with the
 pfn for a host virtual addresses.  The previous maze of twisty little
 passages, all different, is replaced by five functions (__gfn_to_page,
 __kvm_faultin_pfn, the non-__ versions of these two, and kvm_prefetch_pages)
 saving almost 200 lines of code.
 
 ARM:
 
 * Support for stage-1 permission indirection (FEAT_S1PIE) and
   permission overlays (FEAT_S1POE), including nested virt + the
   emulated page table walker
 
 * Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call
   was introduced in PSCIv1.3 as a mechanism to request hibernation,
   similar to the S4 state in ACPI
 
 * Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
   part of it, introduce trivial initialization of the host's MPAM
   context so KVM can use the corresponding traps
 
 * PMU support under nested virtualization, honoring the guest
   hypervisor's trap configuration and event filtering when running a
   nested guest
 
 * Fixes to vgic ITS serialization where stale device/interrupt table
   entries are not zeroed when the mapping is invalidated by the VM
 
 * Avoid emulated MMIO completion if userspace has requested synchronous
   external abort injection
 
 * Various fixes and cleanups affecting pKVM, vCPU initialization, and
   selftests
 
 LoongArch:
 
 * Add iocsr and mmio bus simulation in kernel.
 
 * Add in-kernel interrupt controller emulation.
 
 * Add support for virtualization extensions to the eiointc irqchip.
 
 PPC:
 
 * Drop lingering and utterly obsolete references to PPC970 KVM, which was
   removed 10 years ago.
 
 * Fix incorrect documentation references to non-existing ioctls
 
 RISC-V:
 
 * Accelerate KVM RISC-V when running as a guest
 
 * Perf support to collect KVM guest statistics from host side
 
 s390:
 
 * New selftests: more ucontrol selftests and CPU model sanity checks
 
 * Support for the gen17 CPU model
 
 * List registers supported by KVM_GET/SET_ONE_REG in the documentation
 
 x86:
 
 * Cleanup KVM's handling of Accessed and Dirty bits to dedup code, improve
   documentation, harden against unexpected changes.  Even if the hardware
   A/D tracking is disabled, it is possible to use the hardware-defined A/D
   bits to track if a PFN is Accessed and/or Dirty, and that removes a lot
   of special cases.
 
 * Elide TLB flushes when aging secondary PTEs, as has been done in x86's
   primary MMU for over 10 years.
 
 * Recover huge pages in-place in the TDP MMU when dirty page logging is
   toggled off, instead of zapping them and waiting until the page is
   re-accessed to create a huge mapping.  This reduces vCPU jitter.
 
 * Batch TLB flushes when dirty page logging is toggled off.  This reduces
   the time it takes to disable dirty logging by ~3x.
 
 * Remove the shrinker that was (poorly) attempting to reclaim shadow page
   tables in low-memory situations.
 
 * Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE.
 
 * Advertise CPUIDs for new instructions in Clearwater Forest
 
 * Quirk KVM's misguided behavior of initialized certain feature MSRs to
   their maximum supported feature set, which can result in KVM creating
   invalid vCPU state.  E.g. initializing PERF_CAPABILITIES to a non-zero
   value results in the vCPU having invalid state if userspace hides PDCM
   from the guest, which in turn can lead to save/restore failures.
 
 * Fix KVM's handling of non-canonical checks for vCPUs that support LA57
   to better follow the "architecture", in quotes because the actual
   behavior is poorly documented.  E.g. most MSR writes and descriptor
   table loads ignore CR4.LA57 and operate purely on whether the CPU
   supports LA57.
 
 * Bypass the register cache when querying CPL from kvm_sched_out(), as
   filling the cache from IRQ context is generally unsafe; harden the
   cache accessors to try to prevent similar issues from occuring in the
   future.  The issue that triggered this change was already fixed in 6.12,
   but was still kinda latent.
 
 * Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM
   over-advertises SPEC_CTRL when trying to support cross-vendor VMs.
 
 * Minor cleanups
 
 * Switch hugepage recovery thread to use vhost_task.  These kthreads can
   consume significant amounts of CPU time on behalf of a VM or in response
   to how the VM behaves (for example how it accesses its memory); therefore
   KVM tried to place the thread in the VM's cgroups and charge the CPU
   time consumed by that work to the VM's container.  However the kthreads
   did not process SIGSTOP/SIGCONT, and therefore cgroups which had KVM
   instances inside could not complete freezing.  Fix this by replacing the
   kthread with a PF_USER_WORKER thread, via the vhost_task abstraction.
   Another 100+ lines removed, with generally better behavior too like
   having these threads properly parented in the process tree.
 
 * Revert a workaround for an old CPU erratum (Nehalem/Westmere) that didn't
   really work; there was really nothing to work around anyway: the broken
   patch was meant to fix nested virtualization, but the PERF_GLOBAL_CTRL
   MSR is virtualized and therefore unaffected by the erratum.
 
 * Fix 6.12 regression where CONFIG_KVM will be built as a module even
   if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is 'y'.
 
 x86 selftests:
 
 * x86 selftests can now use AVX.
 
 Documentation:
 
 * Use rST internal links
 
 * Reorganize the introduction to the API document
 
 Generic:
 
 * Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock instead
   of RCU, so that running a vCPU on a different task doesn't encounter long
   due to having to wait for all CPUs become quiescent.  In general both reads
   and writes are rare, but userspace that supports confidential computing is
   introducing the use of "helper" vCPUs that may jump from one host processor
   to another.  Those will be very happy to trigger a synchronize_rcu(), and
   the effect on performance is quite the disaster.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmc9MRYUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroP00QgArxqxBIGLCW5t7bw7vtNq63QYRyh4
 dTiDguLiYQJ+AXmnRu11R6aPC7HgMAvlFCCmH+GEce4WEgt26hxCmncJr/aJOSwS
 letCS7TrME16PeZvh25A1nhPBUw6mTF1qqzgcdHMrqXG8LuHoGcKYGSRVbkf3kfI
 1ZoMq1r8ChXbVVmCx9DQ3gw1TVr5Dpjs2voLh8rDSE9Xpw0tVVabHu3/NhQEz/F+
 t8/nRaqH777icCHIf9PCk5HnarHxLAOvhM2M0Yj09PuBcE5fFQxpxltw/qiKQqqW
 ep4oquojGl87kZnhlDaac2UNtK90Ws+WxxvCwUmbvGN0ZJVaQwf4FvTwig==
 =lWpE
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "The biggest change here is eliminating the awful idea that KVM had of
  essentially guessing which pfns are refcounted pages.

  The reason to do so was that KVM needs to map both non-refcounted
  pages (for example BARs of VFIO devices) and VM_PFNMAP/VM_MIXMEDMAP
  VMAs that contain refcounted pages.

  However, the result was security issues in the past, and more recently
  the inability to map VM_IO and VM_PFNMAP memory that _is_ backed by
  struct page but is not refcounted. In particular this broke virtio-gpu
  blob resources (which directly map host graphics buffers into the
  guest as "vram" for the virtio-gpu device) with the amdgpu driver,
  because amdgpu allocates non-compound higher order pages and the tail
  pages could not be mapped into KVM.

  This requires adjusting all uses of struct page in the
  per-architecture code, to always work on the pfn whenever possible.
  The large series that did this, from David Stevens and Sean
  Christopherson, also cleaned up substantially the set of functions
  that provided arch code with the pfn for a host virtual addresses.

  The previous maze of twisty little passages, all different, is
  replaced by five functions (__gfn_to_page, __kvm_faultin_pfn, the
  non-__ versions of these two, and kvm_prefetch_pages) saving almost
  200 lines of code.

  ARM:

   - Support for stage-1 permission indirection (FEAT_S1PIE) and
     permission overlays (FEAT_S1POE), including nested virt + the
     emulated page table walker

   - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This
     call was introduced in PSCIv1.3 as a mechanism to request
     hibernation, similar to the S4 state in ACPI

   - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
     part of it, introduce trivial initialization of the host's MPAM
     context so KVM can use the corresponding traps

   - PMU support under nested virtualization, honoring the guest
     hypervisor's trap configuration and event filtering when running a
     nested guest

   - Fixes to vgic ITS serialization where stale device/interrupt table
     entries are not zeroed when the mapping is invalidated by the VM

   - Avoid emulated MMIO completion if userspace has requested
     synchronous external abort injection

   - Various fixes and cleanups affecting pKVM, vCPU initialization, and
     selftests

  LoongArch:

   - Add iocsr and mmio bus simulation in kernel.

   - Add in-kernel interrupt controller emulation.

   - Add support for virtualization extensions to the eiointc irqchip.

  PPC:

   - Drop lingering and utterly obsolete references to PPC970 KVM, which
     was removed 10 years ago.

   - Fix incorrect documentation references to non-existing ioctls

  RISC-V:

   - Accelerate KVM RISC-V when running as a guest

   - Perf support to collect KVM guest statistics from host side

  s390:

   - New selftests: more ucontrol selftests and CPU model sanity checks

   - Support for the gen17 CPU model

   - List registers supported by KVM_GET/SET_ONE_REG in the
     documentation

  x86:

   - Cleanup KVM's handling of Accessed and Dirty bits to dedup code,
     improve documentation, harden against unexpected changes.

     Even if the hardware A/D tracking is disabled, it is possible to
     use the hardware-defined A/D bits to track if a PFN is Accessed
     and/or Dirty, and that removes a lot of special cases.

   - Elide TLB flushes when aging secondary PTEs, as has been done in
     x86's primary MMU for over 10 years.

   - Recover huge pages in-place in the TDP MMU when dirty page logging
     is toggled off, instead of zapping them and waiting until the page
     is re-accessed to create a huge mapping. This reduces vCPU jitter.

   - Batch TLB flushes when dirty page logging is toggled off. This
     reduces the time it takes to disable dirty logging by ~3x.

   - Remove the shrinker that was (poorly) attempting to reclaim shadow
     page tables in low-memory situations.

   - Clean up and optimize KVM's handling of writes to
     MSR_IA32_APICBASE.

   - Advertise CPUIDs for new instructions in Clearwater Forest

   - Quirk KVM's misguided behavior of initialized certain feature MSRs
     to their maximum supported feature set, which can result in KVM
     creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to
     a non-zero value results in the vCPU having invalid state if
     userspace hides PDCM from the guest, which in turn can lead to
     save/restore failures.

   - Fix KVM's handling of non-canonical checks for vCPUs that support
     LA57 to better follow the "architecture", in quotes because the
     actual behavior is poorly documented. E.g. most MSR writes and
     descriptor table loads ignore CR4.LA57 and operate purely on
     whether the CPU supports LA57.

   - Bypass the register cache when querying CPL from kvm_sched_out(),
     as filling the cache from IRQ context is generally unsafe; harden
     the cache accessors to try to prevent similar issues from occuring
     in the future. The issue that triggered this change was already
     fixed in 6.12, but was still kinda latent.

   - Advertise AMD_IBPB_RET to userspace, and fix a related bug where
     KVM over-advertises SPEC_CTRL when trying to support cross-vendor
     VMs.

   - Minor cleanups

   - Switch hugepage recovery thread to use vhost_task.

     These kthreads can consume significant amounts of CPU time on
     behalf of a VM or in response to how the VM behaves (for example
     how it accesses its memory); therefore KVM tried to place the
     thread in the VM's cgroups and charge the CPU time consumed by that
     work to the VM's container.

     However the kthreads did not process SIGSTOP/SIGCONT, and therefore
     cgroups which had KVM instances inside could not complete freezing.

     Fix this by replacing the kthread with a PF_USER_WORKER thread, via
     the vhost_task abstraction. Another 100+ lines removed, with
     generally better behavior too like having these threads properly
     parented in the process tree.

   - Revert a workaround for an old CPU erratum (Nehalem/Westmere) that
     didn't really work; there was really nothing to work around anyway:
     the broken patch was meant to fix nested virtualization, but the
     PERF_GLOBAL_CTRL MSR is virtualized and therefore unaffected by the
     erratum.

   - Fix 6.12 regression where CONFIG_KVM will be built as a module even
     if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is
     'y'.

  x86 selftests:

   - x86 selftests can now use AVX.

  Documentation:

   - Use rST internal links

   - Reorganize the introduction to the API document

  Generic:

   - Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock
     instead of RCU, so that running a vCPU on a different task doesn't
     encounter long due to having to wait for all CPUs become quiescent.

     In general both reads and writes are rare, but userspace that
     supports confidential computing is introducing the use of "helper"
     vCPUs that may jump from one host processor to another. Those will
     be very happy to trigger a synchronize_rcu(), and the effect on
     performance is quite the disaster"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (298 commits)
  KVM: x86: Break CONFIG_KVM_X86's direct dependency on KVM_INTEL || KVM_AMD
  KVM: x86: add back X86_LOCAL_APIC dependency
  Revert "KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of setup_vmcs_config()"
  KVM: x86: switch hugepage recovery thread to vhost_task
  KVM: x86: expose MSR_PLATFORM_INFO as a feature MSR
  x86: KVM: Advertise CPUIDs for new instructions in Clearwater Forest
  Documentation: KVM: fix malformed table
  irqchip/loongson-eiointc: Add virt extension support
  LoongArch: KVM: Add irqfd support
  LoongArch: KVM: Add PCHPIC user mode read and write functions
  LoongArch: KVM: Add PCHPIC read and write functions
  LoongArch: KVM: Add PCHPIC device support
  LoongArch: KVM: Add EIOINTC user mode read and write functions
  LoongArch: KVM: Add EIOINTC read and write functions
  LoongArch: KVM: Add EIOINTC device support
  LoongArch: KVM: Add IPI user mode read and write function
  LoongArch: KVM: Add IPI read and write function
  LoongArch: KVM: Add IPI device support
  LoongArch: KVM: Add iocsr and mmio bus simulation in kernel
  KVM: arm64: Pass on SVE mapping failures
  ...
2024-11-23 16:00:50 -08:00
Linus Torvalds
0f25f0e4ef the bulk of struct fd memory safety stuff
Making sure that struct fd instances are destroyed in the same
 scope where they'd been created, getting rid of reassignments
 and passing them by reference, converting to CLASS(fd{,_pos,_raw}).
 
 We are getting very close to having the memory safety of that stuff
 trivial to verify.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZzdikAAKCRBZ7Krx/gZQ
 69nJAQCmbQHK3TGUbQhOw6MJXOK9ezpyEDN3FZb4jsu38vTIdgEA6OxAYDO2m2g9
 CN18glYmD3wRyU6Bwl4vGODouSJvDgA=
 =gVH3
 -----END PGP SIGNATURE-----

Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull 'struct fd' class updates from Al Viro:
 "The bulk of struct fd memory safety stuff

  Making sure that struct fd instances are destroyed in the same scope
  where they'd been created, getting rid of reassignments and passing
  them by reference, converting to CLASS(fd{,_pos,_raw}).

  We are getting very close to having the memory safety of that stuff
  trivial to verify"

* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
  deal with the last remaing boolean uses of fd_file()
  css_set_fork(): switch to CLASS(fd_raw, ...)
  memcg_write_event_control(): switch to CLASS(fd)
  assorted variants of irqfd setup: convert to CLASS(fd)
  do_pollfd(): convert to CLASS(fd)
  convert do_select()
  convert vfs_dedupe_file_range().
  convert cifs_ioctl_copychunk()
  convert media_request_get_by_fd()
  convert spu_run(2)
  switch spufs_calls_{get,put}() to CLASS() use
  convert cachestat(2)
  convert do_preadv()/do_pwritev()
  fdget(), more trivial conversions
  fdget(), trivial conversions
  privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget()
  o2hb_region_dev_store(): avoid goto around fdget()/fdput()
  introduce "fd_pos" class, convert fdget_pos() users to it.
  fdget_raw() users: switch to CLASS(fd_raw)
  convert vmsplice() to CLASS(fd)
  ...
2024-11-18 12:24:06 -08:00
Paolo Bonzini
2e9a2c624e Merge branch 'kvm-docs-6.13' into HEAD
- Drop obsolete references to PPC970 KVM, which was removed 10 years ago.

- Fix incorrect references to non-existing ioctls

- List registers supported by KVM_GET/SET_ONE_REG on s390

- Use rST internal links

- Reorganize the introduction to the API document
2024-11-13 07:18:12 -05:00