linux

mirror of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2025-08-05 16:54:27 +00:00

Author	SHA1	Message	Date
Paolo Bonzini	d7f4aac280	Merge tag 'kvm-x86-mmu-6.17' of https://github.com/kvm-x86/linux into HEAD KVM x86 MMU changes for 6.17 - Exempt nested EPT from the the !USER + CR0.WP logic, as EPT doesn't interact with CR0.WP. - Move the TDX hardware setup code to tdx.c to better co-locate TDX code and eliminate a few global symbols. - Dynamically allocation the shadow MMU's hashed page list, and defer allocating the hashed list until it's actually needed (the TDP MMU doesn't use the list).	2025-07-29 08:36:43 -04:00
Paolo Bonzini	1a14928e2e	Merge tag 'kvm-x86-misc-6.17' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 6.17 - Prevert the host's DEBUGCTL.FREEZE_IN_SMM (Intel only) when running the guest. Failure to honor FREEZE_IN_SMM can bleed host state into the guest. - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter (Intel only) to prevent L1 from running L2 with features that KVM doesn't support, e.g. BTF. - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the vCPU's CPUID model. - Rework the MSR interception code so that the SVM and VMX APIs are more or less identical. - Recalculate all MSR intercepts from the "source" on MSR filter changes, and drop the dedicated "shadow" bitmaps (and their awful "max" size defines). - WARN and reject loading kvm-amd.ko instead of panicking the kernel if the nested SVM MSRPM offsets tracker can't handle an MSR. - Advertise support for LKGS (Load Kernel GS base), a new instruction that's loosely related to FRED, but is supported and enumerated independently. - Fix a user-triggerable WARN that syzkaller found by stuffing INIT_RECEIVED, a.k.a. WFS, and then putting the vCPU into VMX Root Mode (post-VMXON). Use the same approach KVM uses for dealing with "impossible" emulation when running a !URG guest, and simply wait until KVM_RUN to detect that the vCPU has architecturally impossible state. - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of APERF/MPERF reads, so that a "properly" configured VM can "virtualize" APERF/MPERF (with many caveats). - Reject KVM_SET_TSC_KHZ if vCPUs have been created, as changing the "default" frequency is unsupported for VMs with a "secure" TSC, and there's no known use case for changing the default frequency for other VM types.	2025-07-29 08:36:43 -04:00
Jim Mattson	a7cec20845	KVM: x86: Provide a capability to disable APERF/MPERF read intercepts Allow a guest to read the physical IA32_APERF and IA32_MPERF MSRs without interception. The IA32_APERF and IA32_MPERF MSRs are not virtualized. Writes are not handled at all. The MSR values are not zeroed on vCPU creation, saved on suspend, or restored on resume. No accommodation is made for processor migration or for sharing a logical processor with other tasks. No adjustments are made for non-unit TSC multipliers. The MSRs do not account for time the same way as the comparable PMU events, whether the PMU is virtualized by the traditional emulation method or the new mediated pass-through approach. Nonetheless, in a properly constrained environment, this capability can be combined with a guest CPUID table that advertises support for CPUID.6:ECX.APERFMPERF[bit 0] to induce a Linux guest to report the effective physical CPU frequency in /proc/cpuinfo. Moreover, there is no performance cost for this capability. Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250530185239.2335185-3-jmattson@google.com Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250626001225.744268-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-09 09:33:37 -07:00
Jim Mattson	6fbef8615d	KVM: x86: Replace growing set of *_in_guest bools with a u64 Store each "disabled exit" boolean in a single bit rather than a byte. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250530185239.2335185-2-jmattson@google.com Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250626001225.744268-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-09 09:32:32 -07:00
Chao Gao	3f06b8927a	KVM: x86: Deduplicate MSR interception enabling and disabling Extract a common function from MSR interception disabling logic and create disabling and enabling functions based on it. This removes most of the duplicated code for MSR interception disabling/enabling. No functional change intended. Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250612081947.94081-2-chao.gao@intel.com [sean: s/enable/set, inline the wrappers] Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-24 15:42:12 -07:00
Sean Christopherson	ac777fbf06	KVM: x86: Use kvzalloc() to allocate VM struct Allocate VM structs via kvzalloc(), i.e. try to use a contiguous physical allocation before falling back to __vmalloc(), to avoid the overhead of establishing the virtual mappings. For non-debug builds, The SVM and VMX (and TDX) structures are now just below 7000 bytes in the worst case scenario (see below), i.e. are order-1 allocations, and will likely remain that way for quite some time. Add compile-time assertions in vendor code to ensure the size of the structures, sans the memslot hash tables, are order-0 allocations, i.e. are less than 4KiB. There's nothing fundamentally wrong with a larger kvm_{svm,vmx,tdx} size, but given that the size of the structure (without the memslots hash tables) is below 2KiB after 18+ years of existence, more than doubling the size would be quite notable. Add sanity checks on the memslot hash table sizes, partly to ensure they aren't resized without accounting for the impact on VM structure size, and partly to document that the majority of the size of VM structures comes from the memslots. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250523001138.3182794-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-24 12:50:48 -07:00
Sean Christopherson	99836eb9c5	KVM: SVM: WARN if ir_list is non-empty at vCPU free Now that AVIC IRTE tracking is in a mostly sane state, WARN if a vCPU is freed with ir_list entries, i.e. if KVM leaves a dangling IRTE. Initialize the per-vCPU interrupt remapping list and its lock even if AVIC is disabled so that the WARN doesn't hit false positives (and so that KVM doesn't need to call into AVIC code for a simple sanity check). Link: https://lore.kernel.org/r/20250611224604.313496-54-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:46 -07:00
Maxim Levitsky	d921665e01	KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled Let userspace "disable" IPI virtualization for AVIC via the enable_ipiv module param, by never setting IsRunning. SVM doesn't provide a way to disable IPI virtualization in hardware, but by ensuring CPUs never see IsRunning=1, every IPI in the guest (except for self-IPIs) will generate a VM-Exit. To avoid setting the real IsRunning bit, while still allowing KVM to use each vCPU's entry to update GA log entries, simply maintain a shadow of the entry, without propagating IsRunning updates to the real table when IPI virtualization is disabled. Providing a way to effectively disable IPI virtualization will allow KVM to safely enable AVIC on hardware that is susceptible to erratum #1235, which causes hardware to sometimes fail to detect that the IsRunning bit has been cleared by software. Note, the table _must_ be fully populated, as broadcast IPIs skip invalid entries, i.e. won't generate VM-Exit if every entry is invalid, and so simply pointing the VMCB at a common dummy table won't work. Alternatively, KVM could allocate a shadow of the entire table, but that'd be a waste of 4KiB since the per-vCPU entry doesn't actually consume an additional 8 bytes of memory (vCPU structures are large enough that they are backed by order-N pages). Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> [sean: keep "entry" variables, reuse enable_ipiv, split from erratum] Link: https://lore.kernel.org/r/20250611224604.313496-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:53:02 -07:00
Sean Christopherson	bea44d1992	KVM: x86: Simplify userspace filter logic when disabling MSR interception Refactor {svm,vmx}_disable_intercept_for_msr() to simplify the handling of userspace filters that disallow access to an MSR. The more complicated logic is no longer needed or justified now that KVM recalculates all MSR intercepts on a userspace MSR filter change, i.e. now that KVM doesn't need to also update shadow bitmaps. No functional change intended. Suggested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-32-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:37 -07:00
Sean Christopherson	73be81b3bb	KVM: SVM: Add a helper to allocate and initialize permissions bitmaps Add a helper to allocate and initialize an MSR or I/O permissions map, as the logic is identical between the two map types, the only difference is the size of the bitmap. Opportunistically add a comment to explain why the bitmaps are initialized with 0xff, e.g. instead of the more common zero-initialized behavior, which is the main motivation for deduplicating the code. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-31-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:37 -07:00
Sean Christopherson	7fe0578041	KVM: SVM: Store MSRPM pointer as "void " instead of "u32 " Store KVM's MSRPM pointers as "void " instead of "u32 " to guard against directly accessing the bitmaps outside of code that is explicitly written to access the bitmaps with a specific type. Opportunistically use svm_vcpu_free_msrpm() in svm_vcpu_free() instead of open coding an equivalent. Link: https://lore.kernel.org/r/20250610225737.156318-27-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:34 -07:00
Sean Christopherson	5c9c084763	KVM: SVM: Move svm_msrpm_offset() to nested.c Move svm_msrpm_offset() from svm.c to nested.c now that all usage of the u32-index offsets is nested virtualization specific. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-26-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:33 -07:00
Sean Christopherson	2f89888434	KVM: SVM: Drop explicit check on MSRPM offset when emulating SEV-ES accesses Now that msr_write_intercepted() defaults to true, i.e. accurately reflects hardware behavior for out-of-range MSRs, and doesn't WARN (or BUG) on an out-of-range MSR, drop sev_es_prevent_msr_access()'s svm_msrpm_offset() check that guarded against calling msr_write_intercepted() with a "bad" index. Opportunistically clean up the helper's formatting. Link: https://lore.kernel.org/r/20250610225737.156318-25-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:32 -07:00
Sean Christopherson	4880919aaf	KVM: SVM: Merge "after set CPUID" intercept recalc helpers Merge svm_recalc_intercepts_after_set_cpuid() and svm_recalc_instruction_intercepts() such that the "after set CPUID" helper simply invokes the type-specific helpers (MSRs vs. instructions), i.e. make svm_recalc_intercepts_after_set_cpuid() a single entry point for all intercept updates that need to be performed after a CPUID change. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-24-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:32 -07:00
Sean Christopherson	40ba80e4b0	KVM: SVM: Fold svm_vcpu_init_msrpm() into its sole caller Fold svm_vcpu_init_msrpm() into svm_recalc_msr_intercepts() now that there is only the one caller (and because the "init" misnomer is even more misleading than it was in the past). No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-23-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:31 -07:00
Sean Christopherson	049dff172b	KVM: SVM: Rename init_vmcb_after_set_cpuid() to make it intercepts specific Rename init_vmcb_after_set_cpuid() to svm_recalc_intercepts_after_set_cpuid() to more precisely describe its role. Strictly speaking, the name isn't perfect as toggling virtual VM{LOAD,SAVE} is arguably not recalculating an intercept, but practically speaking it's close enough. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-22-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:30 -07:00
Sean Christopherson	4ceca57e3f	KVM: x86: Rename msr_filter_changed() => recalc_msr_intercepts() Rename msr_filter_changed() to recalc_msr_intercepts() and drop the trampoline wrapper now that both SVM and VMX use a filter-agnostic recalc helper to react to the new userspace filter. No functional change intended. Reviewed-by: Xin Li (Intel) <xin@zytor.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:30 -07:00
Sean Christopherson	160f143cc1	KVM: SVM: Manually recalc all MSR intercepts on userspace MSR filter change On a userspace MSR filter change, recalculate all MSR intercepts using the filter-agnostic logic instead of maintaining a "shadow copy" of KVM's desired intercepts. The shadow bitmaps add yet another point of failure, are confusing (e.g. what does "handled specially" mean!?!?), an eyesore, and a maintenance burden. Given that KVM must be able to recalculate the correct intercepts at any given time, and that MSR filter updates are not hot paths, there is zero benefit to maintaining the shadow bitmaps. Opportunistically switch from boot_cpu_has() to cpu_feature_enabled() as appropriate. Link: https://lore.kernel.org/all/aCdPbZiYmtni4Bjs@google.com Link: https://lore.kernel.org/all/20241126180253.GAZ0YNTdXH1UGeqsu6@fat_crate.local Cc: Francesco Lavra <francescolavra.fl@gmail.com> Link: https://lore.kernel.org/r/20250610225737.156318-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:29 -07:00
Sean Christopherson	405a63d4d3	KVM: x86: Move definition of X2APIC_MSR() to lapic.h Dedup the definition of X2APIC_MSR and put it in the local APIC code where it belongs. No functional change intended. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:28 -07:00
Sean Christopherson	cb53d07948	KVM: SVM: Drop "always" flag from list of possible passthrough MSRs Drop the "always" flag from the array of possible passthrough MSRs, and instead manually initialize the permissions for the handful of MSRs that KVM passes through by default. In addition to cutting down on boilerplate copy+paste code and eliminating a misleading flag (the MSRs aren't always passed through, e.g. thanks to MSR filters), this will allow for removing the direct_access_msrs array entirely. Link: https://lore.kernel.org/r/20250610225737.156318-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:27 -07:00
Sean Christopherson	3a0f09b361	KVM: SVM: Pass through GHCB MSR if and only if VM is an SEV-ES guest Disable interception of the GHCB MSR if and only if the VM is an SEV-ES guest. While the exact behavior is completely undocumented in the APM, common sense and testing on SEV-ES capable CPUs says that accesses to the GHCB from non-SEV-ES guests will #GP. I.e. from the guest's perspective, no functional change intended. Fixes: `376c6d2850` ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading") Link: https://lore.kernel.org/r/20250610225737.156318-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:26 -07:00
Sean Christopherson	6b7315fe54	KVM: SVM: Implement and adopt VMX style MSR intercepts APIs Add and use SVM MSR interception APIs (in most paths) to match VMX's APIs and nomenclature. Specifically, add SVM variants of: vmx_disable_intercept_for_msr(vcpu, msr, type) vmx_enable_intercept_for_msr(vcpu, msr, type) vmx_set_intercept_for_msr(vcpu, msr, type, intercept) to eventually replace SVM's single helper: set_msr_interception(vcpu, msrpm, msr, allow_read, allow_write) which is awkward to use (in all cases, KVM either applies the same logic for both reads and writes, or intercepts one of read or write), and is unintuitive due to using '0' to indicate interception should be set. Keep the guts of the old API for the moment to avoid churning the MSR filter code, as that mess will be overhauled in the near future. Leave behind a temporary comment to call out that the shadow bitmaps have inverted polarity relative to the bitmaps consumed by hardware. No functional change intended. Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:26 -07:00
Sean Christopherson	c38595ad69	KVM: SVM: Add helpers for accessing MSR bitmap that don't rely on offsets Add macro-built helpers for testing, setting, and clearing MSRPM entries without relying on precomputed offsets. This sets the stage for eventually removing general KVM use of precomputed offsets, which are quite confusing and rather inefficient for the vast majority of KVM's usage. Outside of merging L0 and L1 bitmaps for nested SVM, using u32-indexed offsets and accesses is at best unnecessary, and at worst introduces extra operations to retrieve the individual bit from within the offset u32 value. And simply calling them "offsets" is very confusing, as the "unit" of the offset isn't immediately obvious. Use the new helpers in set_msr_interception_bitmap() and msr_write_intercepted() to verify the math and operations, but keep the existing offset-based logic in set_msr_interception_bitmap() to sanity check the "clear" and "set" operations. Manipulating MSR interceptions isn't a hot path and no kernel release is ever expected to contain this specific version of set_msr_interception_bitmap() (it will be removed entirely in the near future). Link: https://lore.kernel.org/r/20250610225737.156318-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:25 -07:00
Sean Christopherson	4879dc9469	KVM: nSVM: Don't initialize vmcb02 MSRPM with vmcb01's "always passthrough" Don't initialize vmcb02's MSRPM with KVM's set of "always passthrough" MSRs, as KVM always needs to consult L1's intercepts, i.e. needs to merge vmcb01 with vmcb12 and write the result to vmcb02. This will eventually allow for the removal of svm_vcpu_init_msrpm(). Note, the bitmaps are truly initialized by svm_vcpu_alloc_msrpm() (default to intercepting all MSRs), e.g. if there is a bug lurking elsewhere, the worst case scenario from dropping the call to svm_vcpu_init_msrpm() should be that KVM would fail to passthrough MSRs to L2. Link: https://lore.kernel.org/r/20250610225737.156318-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:24 -07:00
Sean Christopherson	9b72c3d59f	KVM: nSVM: Use dedicated array of MSRPM offsets to merge L0 and L1 bitmaps Use a dedicated array of MSRPM offsets to merge L0 and L1 bitmaps, i.e. to merge KVM's vmcb01 bitmap with L1's vmcb12 bitmap. This will eventually allow for the removal of direct_access_msrs, as the only path where tracking the offsets is truly justified is the merge for nested SVM, where merging in chunks is an easy way to batch uaccess reads/writes. Opportunistically omit the x2APIC MSRs from the merge-specific array instead of filtering them out at runtime. Note, disabling interception of DEBUGCTL, XSS, EFER, PAT, GHCB, and TSC_AUX is mutually exclusive with nested virtualization, as KVM passes through those MSRs only for SEV-ES guests, and KVM doesn't support nested virtualization for SEV+ guests. Defer removing those MSRs to a future cleanup in order to make this refactoring as benign as possible. Link: https://lore.kernel.org/r/20250610225737.156318-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:23 -07:00
Sean Christopherson	16e9584cc0	KVM: SVM: Clean up macros related to architectural MSRPM definitions Move SVM's MSR Permissions Map macros to svm.h in anticipation of adding helpers that are available to SVM code, and opportunistically replace a variety of open-coded literals with (hopefully) informative macros. Opportunistically open code ARRAY_SIZE(msrpm_ranges) instead of wrapping it as NUM_MSR_MAPS, which is an ambiguous name even if it were qualified with "SVM_MSRPM". Deliberately leave the ranges as open coded literals, as using macros to define the ranges actually introduces more potential failure points, since both the definitions and the usage have to be careful to use the correct index. The lack of clear intent behind the ranges will be addressed in future patches. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:07:11 -07:00
Sean Christopherson	925149b6d0	KVM: SVM: Massage name and param of helper that merges vmcb01 and vmcb12 MSRPMs Rename nested_svm_vmrun_msrpm() to nested_svm_merge_msrpm() to better capture its role, and opportunistically feed it @vcpu instead of @svm, as grabbing "svm" only to turn around and grab svm->vcpu is rather silly. No functional change intended. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:40 -07:00
Sean Christopherson	b1bccf7883	KVM: x86: Use non-atomic bit ops to manipulate "shadow" MSR intercepts Manipulate the MSR bitmaps using non-atomic bit ops APIs (two underscores), as the bitmaps are per-vCPU and are only ever accessed while vcpu->mutex is held. Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:40 -07:00
Sean Christopherson	6353cd685c	KVM: SVM: Kill the VM instead of the host if MSR interception is buggy WARN and kill the VM instead of panicking the host if KVM attempts to set or query MSR interception for an unsupported MSR. Accessing the MSR interception bitmaps only meaningfully affects post-VMRUN behavior, and KVM_BUG_ON() is guaranteed to prevent the current vCPU from doing VMRUN, i.e. there is no need to panic the entire host. Opportunistically move the sanity checks about their use to index into the MSRPM, e.g. so that bugs only WARN and terminate the VM, as opposed to doing that _and_ generating an out-of-bounds load. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:40 -07:00
Sean Christopherson	b241c50c4e	KVM: SVM: Use ARRAY_SIZE() to iterate over direct_access_msrs Drop the unnecessary and dangerous value-terminated behavior of direct_access_msrs, and simply iterate over the actual size of the array. The use in svm_set_x2apic_msr_interception() is especially sketchy, as it relies on unused capacity being zero-initialized, and '0' being outside the range of x2APIC MSRs. To ensure the array and shadow_msr_intercept stay synchronized, simply assert that their sizes are identical (note the six 64-bit-only MSRs). Note, direct_access_msrs will soon be removed entirely; keeping the assert synchronized with the array isn't expected to be along-term maintenance burden. Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:40 -07:00
Sean Christopherson	f886515f9b	KVM: SVM: Tag MSR bitmap initialization helpers with __init Tag init_msrpm_offsets() and add_msr_offset() with __init, as they're used only during hardware setup to map potential passthrough MSRs to offsets in the bitmap. Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:40 -07:00
Sean Christopherson	5ebd737308	KVM: SVM: Don't BUG if setting up the MSR intercept bitmaps fails WARN and reject module loading if there is a problem with KVM's MSR interception bitmaps. Panicking the host in this situation is inexcusable since it is trivially easy to propagate the error up the stack. Link: https://lore.kernel.org/r/20250610225737.156318-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:40 -07:00
Sean Christopherson	fb96d5cf0f	KVM: SVM: Allocate IOPM pages after initial setup in svm_hardware_setup() Allocate pages for the IOPM after initial setup has been completed in svm_hardware_setup(), so that sanity checks can be added in the setup flow without needing to free the IOPM pages. The IOPM is only referenced (via iopm_base) in init_vmcb() and svm_hardware_unsetup(), so there's no need to allocate it early on. No functional change intended (beyond the obvious ordering differences, e.g. if the allocation fails). Link: https://lore.kernel.org/r/20250610225737.156318-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:39 -07:00
Sean Christopherson	674ffc6503	KVM: SVM: Disable interception of SPEC_CTRL iff the MSR exists for the guest Disable interception of SPEC_CTRL when the CPU virtualizes (i.e. context switches) SPEC_CTRL if and only if the MSR exists according to the vCPU's CPUID model. Letting the guest access SPEC_CTRL is generally benign, but the guest would see inconsistent behavior if KVM happened to emulate an access to the MSR. Fixes: `d00b99c514` ("KVM: SVM: Add support for Virtual SPEC_CTRL") Reported-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:05:39 -07:00
Sean Christopherson	80c64c7afe	KVM: x86: Drop kvm_x86_ops.set_dr6() in favor of a new KVM_RUN flag Instruct vendor code to load the guest's DR6 into hardware via a new KVM_RUN flag, and remove kvm_x86_ops.set_dr6(), whose sole purpose was to load vcpu->arch.dr6 into hardware when DR6 can be read/written directly by the guest. Note, TDX already WARNs on any run_flag being set, i.e. will yell if KVM thinks DR6 needs to be reloaded. TDX vCPUs force KVM_DEBUGREG_AUTO_SWITCH and never clear the flag, i.e. should never observe KVM_RUN_LOAD_GUEST_DR6. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250610232010.162191-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:04:24 -07:00
Sean Christopherson	2478b1b220	KVM: x86: Convert vcpu_run()'s immediate exit param into a generic bitmap Convert kvm_x86_ops.vcpu_run()'s "force_immediate_exit" boolean parameter into an a generic bitmap so that similar "take action" information can be passed to vendor code without creating a pile of boolean parameters. This will allow dropping kvm_x86_ops.set_dr6() in favor of a new flag, and will also allow for adding similar functionality for re-loading debugctl in the active VMCS. Opportunistically massage the TDX WARN and comment to prepare for adding more run_flags, all of which are expected to be mutually exclusive with TDX, i.e. should be WARNed on. No functional change intended. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250610232010.162191-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:04:24 -07:00
Linus Torvalds	7f9039c524	Generic: * Clean up locking of all vCPUs for a VM by using the _nest_lock() family of functions, and move duplicated code to virt/kvm/. kernel/ patches acked by Peter Zijlstra. Add MGLRU support to the access tracking perf test. ARM fixes: * Make the irqbypass hooks resilient to changes in the GSI<->MSI routing, avoiding behind stale vLPI mappings being left behind. The fix is to resolve the VGIC IRQ using the host IRQ (which is stable) and nuking the vLPI mapping upon a routing change. * Close another VGIC race where vCPU creation races with VGIC creation, leading to in-flight vCPUs entering the kernel w/o private IRQs allocated. * Fix a build issue triggered by the recently added workaround for Ampere's AC04_CPU_23 erratum. * Correctly sign-extend the VA when emulating a TLBI instruction potentially targeting a VNCR mapping. * Avoid dereferencing a NULL pointer in the VGIC debug code, which can happen if the device doesn't have any mapping yet. s390: * Fix interaction between some filesystems and Secure Execution * Some cleanups and refactorings, preparing for an upcoming big series x86: * Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to fix a race between AP destroy and VMRUN. * Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM. * Refine and harden handling of spurious faults. * Add support for ALLOWED_SEV_FEATURES. * Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y. * Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features that utilize those bits. * Don't account temporary allocations in sev_send_update_data(). * Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold. * Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between SVM and VMX. * Advertise support to userspace for WRMSRNS and PREFETCHI. * Rescan I/O APIC routes after handling EOI that needed to be intercepted due to the old/previous routing, but not the new/current routing. * Add a module param to control and enumerate support for device posted interrupts. * Fix a potential overflow with nested virt on Intel systems running 32-bit kernels. * Flush shadow VMCSes on emergency reboot. * Add support for SNP to the various SEV selftests. * Add a selftest to verify fastops instructions via forced emulation. * Refine and optimize KVM's software processing of the posted interrupt bitmap, and share the harvesting code between KVM and the kernel's Posted MSI handler -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmg9TjwUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroOUxQf7B7nnWqIKd7jSkGzSD6YsSX9TXktr 2tJIOfWM3zNYg5GRCidg+m4Y5+DqQWd3Hi5hH2P9wUw7RNuOjOFsDe+y0VBr8ysE ve39t/yp+mYalNmHVFl8s3dBDgrIeGKiz+Wgw3zCQIBZ18rJE1dREhv37RlYZ3a2 wSvuObe8sVpCTyKIowDs1xUi7qJUBoopMSuqfleSHawRrcgCpV99U8/KNFF5plLH 7fXOBAHHniVCVc+mqQN2wxtVJDhST+U3TaU4GwlKy9Yevr+iibdOXffveeIgNEU4 D6q1F2zKp6UdV3+p8hxyaTTbiCVDqsp9WOgY/0I/f+CddYn0WVZgOlR+ow== =mYFL -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull more kvm updates from Paolo Bonzini: Generic: - Clean up locking of all vCPUs for a VM by using the _nest_lock() family of functions, and move duplicated code to virt/kvm/. kernel/ patches acked by Peter Zijlstra - Add MGLRU support to the access tracking perf test ARM fixes: - Make the irqbypass hooks resilient to changes in the GSI<->MSI routing, avoiding behind stale vLPI mappings being left behind. The fix is to resolve the VGIC IRQ using the host IRQ (which is stable) and nuking the vLPI mapping upon a routing change - Close another VGIC race where vCPU creation races with VGIC creation, leading to in-flight vCPUs entering the kernel w/o private IRQs allocated - Fix a build issue triggered by the recently added workaround for Ampere's AC04_CPU_23 erratum - Correctly sign-extend the VA when emulating a TLBI instruction potentially targeting a VNCR mapping - Avoid dereferencing a NULL pointer in the VGIC debug code, which can happen if the device doesn't have any mapping yet s390: - Fix interaction between some filesystems and Secure Execution - Some cleanups and refactorings, preparing for an upcoming big series x86: - Wait for target vCPU to ack KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to fix a race between AP destroy and VMRUN - Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM - Refine and harden handling of spurious faults - Add support for ALLOWED_SEV_FEATURES - Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y - Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features that utilize those bits - Don't account temporary allocations in sev_send_update_data() - Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold - Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between SVM and VMX - Advertise support to userspace for WRMSRNS and PREFETCHI - Rescan I/O APIC routes after handling EOI that needed to be intercepted due to the old/previous routing, but not the new/current routing - Add a module param to control and enumerate support for device posted interrupts - Fix a potential overflow with nested virt on Intel systems running 32-bit kernels - Flush shadow VMCSes on emergency reboot - Add support for SNP to the various SEV selftests - Add a selftest to verify fastops instructions via forced emulation - Refine and optimize KVM's software processing of the posted interrupt bitmap, and share the harvesting code between KVM and the kernel's Posted MSI handler" tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits) rtmutex_api: provide correct extern functions KVM: arm64: vgic-debug: Avoid dereferencing NULL ITE pointer KVM: arm64: vgic-init: Plug vCPU vs. VGIC creation race KVM: arm64: Unmap vLPIs affected by changes to GSI routing information KVM: arm64: Resolve vLPI by host IRQ in vgic_v4_unset_forwarding() KVM: arm64: Protect vLPI translation with vgic_irq::irq_lock KVM: arm64: Use lock guard in vgic_v4_set_forwarding() KVM: arm64: Mask out non-VA bits from TLBI VA* on VNCR invalidation arm64: sysreg: Drag linux/kconfig.h to work around vdso build issue KVM: s390: Simplify and move pv code KVM: s390: Refactor and split some gmap helpers KVM: s390: Remove unneeded srcu lock s390: Remove unneeded includes s390/uv: Improve splitting of large folios that cannot be split while dirty s390/uv: Always return 0 from s390_wiggle_split_folio() if successful s390/uv: Don't return 0 from make_hva_secure() if the operation was not successful rust: add helper for mutex_trylock RISC-V: KVM: use kvm_trylock_all_vcpus when locking all vCPUs KVM: arm64: use kvm_trylock_all_vcpus when locking all vCPUs x86: KVM: SVM: use kvm_lock_all_vcpus instead of a custom implementation ...	2025-06-02 12:24:58 -07:00
Linus Torvalds	43db111107	ARM: * Add large stage-2 mapping (THP) support for non-protected guests when pKVM is enabled, clawing back some performance. * Enable nested virtualisation support on systems that support it, though it is disabled by default. * Add UBSAN support to the standalone EL2 object used in nVHE/hVHE and protected modes. * Large rework of the way KVM tracks architecture features and links them with the effects of control bits. While this has no functional impact, it ensures correctness of emulation (the data is automatically extracted from the published JSON files), and helps dealing with the evolution of the architecture. * Significant changes to the way pKVM tracks ownership of pages, avoiding page table walks by storing the state in the hypervisor's vmemmap. This in turn enables the THP support described above. * New selftest checking the pKVM ownership transition rules * Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests even if the host didn't have it. * Fixes for the address translation emulation, which happened to be rather buggy in some specific contexts. * Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N from the number of counters exposed to a guest and addressing a number of issues in the process. * Add a new selftest for the SVE host state being corrupted by a guest. * Keep HCR_EL2.xMO set at all times for systems running with the kernel at EL2, ensuring that the window for interrupts is slightly bigger, and avoiding a pretty bad erratum on the AmpereOne HW. * Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers from a pretty bad case of TLB corruption unless accesses to HCR_EL2 are heavily synchronised. * Add a per-VM, per-ITS debugfs entry to dump the state of the ITS tables in a human-friendly fashion. * and the usual random cleanups. LoongArch: * Don't flush tlb if the host supports hardware page table walks. * Add KVM selftests support. RISC-V: * Add vector registers to get-reg-list selftest * VCPU reset related improvements * Remove scounteren initialization from VCPU reset * Support VCPU reset from userspace using set_mpstate() ioctl x86: * Initial support for TDX in KVM. This finally makes it possible to use the TDX module to run confidential guests on Intel processors. This is quite a large series, including support for private page tables (managed by the TDX module and mirrored in KVM for efficiency), forwarding some TDVMCALLs to userspace, and handling several special VM exits from the TDX module. This has been in the works for literally years and it's not really possible to describe everything here, so I'll defer to the various merge commits up to and including commit `7bcf7246c4` ("Merge branch 'kvm-tdx-finish-initial' into HEAD"). -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmg02hwUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroNnkwf/db4xeWKSMseCIvBVR+ObDn3LXhwT hAgmTkDkP1zq9RfbfJSbUA1DXRwfP+f1sWySLMWECkFEQW9fGIJF9fOQRDSXKmhX 158U3+FEt+3jxLRCGFd4zyXAqyY3C8JSkPUyJZxCpUbXtB5tdDNac4rZAXKDULwe sUi0OW/kFDM2yt369pBGQAGdN+75/oOrYISGOSvMXHxjccNqvveX8MUhpBjYIuuj 73iBWmsfv3vCtam56Racz3C3v44ie498PmWFtnB0R+CVfWfrnUAaRiGWx+egLiBW dBPDiZywMn++prmphEUFgaStDTQy23JBLJ8+RvHkp+o5GaTISKJB3nedZQ== =adZU -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "As far as x86 goes this pull request "only" includes TDX host support. Quotes are appropriate because (at 6k lines and 100+ commits) it is much bigger than the rest, which will come later this week and consists mostly of bugfixes and selftests. s390 changes will also come in the second batch. ARM: - Add large stage-2 mapping (THP) support for non-protected guests when pKVM is enabled, clawing back some performance. - Enable nested virtualisation support on systems that support it, though it is disabled by default. - Add UBSAN support to the standalone EL2 object used in nVHE/hVHE and protected modes. - Large rework of the way KVM tracks architecture features and links them with the effects of control bits. While this has no functional impact, it ensures correctness of emulation (the data is automatically extracted from the published JSON files), and helps dealing with the evolution of the architecture. - Significant changes to the way pKVM tracks ownership of pages, avoiding page table walks by storing the state in the hypervisor's vmemmap. This in turn enables the THP support described above. - New selftest checking the pKVM ownership transition rules - Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests even if the host didn't have it. - Fixes for the address translation emulation, which happened to be rather buggy in some specific contexts. - Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N from the number of counters exposed to a guest and addressing a number of issues in the process. - Add a new selftest for the SVE host state being corrupted by a guest. - Keep HCR_EL2.xMO set at all times for systems running with the kernel at EL2, ensuring that the window for interrupts is slightly bigger, and avoiding a pretty bad erratum on the AmpereOne HW. - Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers from a pretty bad case of TLB corruption unless accesses to HCR_EL2 are heavily synchronised. - Add a per-VM, per-ITS debugfs entry to dump the state of the ITS tables in a human-friendly fashion. - and the usual random cleanups. LoongArch: - Don't flush tlb if the host supports hardware page table walks. - Add KVM selftests support. RISC-V: - Add vector registers to get-reg-list selftest - VCPU reset related improvements - Remove scounteren initialization from VCPU reset - Support VCPU reset from userspace using set_mpstate() ioctl x86: - Initial support for TDX in KVM. This finally makes it possible to use the TDX module to run confidential guests on Intel processors. This is quite a large series, including support for private page tables (managed by the TDX module and mirrored in KVM for efficiency), forwarding some TDVMCALLs to userspace, and handling several special VM exits from the TDX module. This has been in the works for literally years and it's not really possible to describe everything here, so I'll defer to the various merge commits up to and including commit `7bcf7246c4` ('Merge branch 'kvm-tdx-finish-initial' into HEAD')" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (248 commits) x86/tdx: mark tdh_vp_enter() as __flatten Documentation: virt/kvm: remove unreferenced footnote RISC-V: KVM: lock the correct mp_state during reset KVM: arm64: Fix documentation for vgic_its_iter_next() KVM: arm64: np-guest CMOs with PMD_SIZE fixmap KVM: arm64: Stage-2 huge mappings for np-guests KVM: arm64: Add a range to pkvm_mappings KVM: arm64: Convert pkvm_mappings to interval tree KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest() KVM: arm64: Add a range to __pkvm_host_wrprotect_guest() KVM: arm64: Add a range to __pkvm_host_unshare_guest() KVM: arm64: Add a range to __pkvm_host_share_guest() KVM: arm64: Introduce for_each_hyp_page KVM: arm64: Handle huge mappings for np-guest CMOs KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET RISC-V: KVM: Remove scounteren initialization KVM: RISC-V: remove unnecessary SBI reset state ...	2025-05-29 08:10:01 -07:00
Paolo Bonzini	4e02d4f973	KVM SVM changes for 6.16: - Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to fix a race between AP destroy and VMRUN. - Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM. - Add support for ALLOWED_SEV_FEATURES. - Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y. - Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features that utilize those bits. - Don't account temporary allocations in sev_send_update_data(). - Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmgwmwAACgkQOlYIJqCj N/1pHw//edW/x838POMeeCN8j39NBKErW9yZoQLhMbzogttRvfoba+xYY9zXyRFx 8AXB8+2iLtb7pXUohc0eYN0mNqgD0SnoMLqGfn7nrkJafJSUAJHAoZn1Mdom1M1y jHvBPbHCMMsgdLV8wpDRqCNWTH+d5W0kcN5WjKwOswVLj1rybVfK7bSLMhvkk1e5 RrOR4Ewf95/Ag2b36L4SvS1yG9fTClmKeGArMXhEXjy2INVSpBYyZMjVtjHiNzU9 TjtB2RSM45O+Zl0T2fZdVW8LFhA6kVeL1v+Oo433CjOQE0LQff3Vl14GCANIlPJU PiWN/RIKdWkuxStIP3vw02eHzONCcg2GnNHzEyKQ1xW8lmrwzVRdXZzVsc2Dmowb 7qGykBQ+wzoE0sMeZPA0k/QOSqg2vGxUQHjR7720loLV9m9Tu/mJnS9e179GJKgI e1ArSLwKmHpjwKZqU44IQVTZaxSC4Sg2kI670i21ChPgx8+oVkA6I0LFQXymx7uS 2lbH+ovTlJSlP9fbaJhMwAU2wpSHAyXif/HPjdw2LTH3NdgXzfEnZfTlAWiP65LQ hnz5HvmUalW3x9kmzRmeDIAkDnAXhyt3ZQMvbNzqlO5AfS+Tqh4Ed5EFP3IrQAzK HQ+Gi0ip+B84t9Tbi6rfQwzTZEbSSOfYksC7TXqRGhNo/DvHumE= =k6rK -----END PGP SIGNATURE----- Merge tag 'kvm-x86-svm-6.16' of https://github.com/kvm-x86/linux into HEAD KVM SVM changes for 6.16: - Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to fix a race between AP destroy and VMRUN. - Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM. - Add support for ALLOWED_SEV_FEATURES. - Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y. - Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features that utilize those bits. - Don't account temporary allocations in sev_send_update_data(). - Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.	2025-05-27 12:15:49 -04:00
Paolo Bonzini	ebd38b26ec	KVM x86 misc changes for 6.16: - Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between SVM and VMX. - Advertise support to userspace for WRMSRNS and PREFETCHI. - Rescan I/O APIC routes after handling EOI that needed to be intercepted due to the old/previous routing, but not the new/current routing. - Add a module param to control and enumerate support for device posted interrupts. - Misc cleanups. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmgwmHsACgkQOlYIJqCj N/2fYhAAiwKkqQpOWLcGjjezDTnpMqDDbHCffroq0Ttmqfg/cuul1oyZax+9fxBO 203HUi5VKcG7uAGSpLcMFkPUs9hKnaln2lsDaQD+AnGucdj+JKF5p3INCsSYCo9N LVRjRZWtZocxJwHSHX9gU8om0pJ5fBCBG2+7+7XhWRaIqCpJe5k944JotiiOkgZ4 5sXeITkN2kouFVMI8eD4wQGNXxRxs837SYUlwCnoD3VuuBesOZuEhz/CEL9l8vNY keXBLPg7bSW53clKfquNKwXDQRephnZaYoexDebUd+OlZphGhTIPh+C75xPQLWSi aYg6W9XDu3TChf4LPxHnJLwLg/rjeKNQARcxrnb3XLpPAtx3i2cKU8pDPhnd4qn0 +YV5H0dato8bbe+oClGv+oIolM01qfI9SJVoaEhTPu3Rdw9cCQSVFn5t32vG3Vab FVxX+seV3+XTmVveD4cjiiMbqtNADwZ/PmHNAi9QCl46DgHR++MLfRtjYuGo1koL QOmCg2fWOFYtQT6XPJqZxp1SHYxuawrB4qcO9FNyxTuMChoslYoLAr3mBUj0DvwL fdXNof74Ccj8OK9o3uCPOXS92pZz92rz/edy/XmYiCmO+VEwBJFR6IdginZMPX11 UAc4mAC+KkDTvOcKPPIWEXArOsfQMFKefi+bPOeWUx9/nqpcVws= =Vs9P -----END PGP SIGNATURE----- Merge tag 'kvm-x86-misc-6.16' of https://github.com/kvm-x86/linux into HEAD KVM x86 misc changes for 6.16: - Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between SVM and VMX. - Advertise support to userspace for WRMSRNS and PREFETCHI. - Rescan I/O APIC routes after handling EOI that needed to be intercepted due to the old/previous routing, but not the new/current routing. - Add a module param to control and enumerate support for device posted interrupts. - Misc cleanups.	2025-05-27 12:14:36 -04:00
Paolo Bonzini	85502b2214	LoongArch KVM changes for v6.16 1. Don't flush tlb if HW PTW supported. 2. Add LoongArch KVM selftests support. -----BEGIN PGP SIGNATURE----- iQJKBAABCAA0FiEEzOlt8mkP+tbeiYy5AoYrw/LiJnoFAmgsdO8WHGNoZW5odWFj YWlAa2VybmVsLm9yZwAKCRAChivD8uImejUWD/9zNaBSqTpqeAptRQ6qTKdrxYtN ZKJR9a8AQF5vMPD9dWoRr6iLaNt061GqBOKbhF5RGUVq6uIDfCkZbSbX1h6Ptgcz OwkJHbrZAu+Z31NSoYqbgPZnurSJ9oOUGsghqp3ecKEf0LptTVaKw4WDeKkKNOlq eJaUC1WJ1P4sSXhTlHKAl79Ds/1pza9iAgtKXothrh09PL48LYaWFpmiS0uQmKOD qwYVU/OJIUWn4ZOxyGdk6ZR0B+mJOwwoO1ILptRIeSk4oTKiE1HiIEqnhdnMrJFr DkRdwQ/jq7iH/CFkVybNzxgLqAqpGLJgPj5VakYPmp2scuWSESej9/o8wYrmn0y0 QuDQah0LiRIcbUYRHLDkfKRKCndfJ5KCrpOiD5mZ8bMd2LF9hPnH5toCXb/ZEpsK plu9qUrQgXF8rkX/zCIvQrOp0kYdU8DMbhZsXymSVpEs1fwrCNp2/Z2PrMYLKGt+ JT+65jVRJ67d1OdKw3DrWQA8Au0Ma1rgzX3oLDu8wnqAG7ULAJRVIkDMcxG7a5SQ P+DlIbEHC1U8Dw8hW+PFhfOl13M9p5s3EP5fy8q85UJ/5fJ41fCPBUOm/rfMQcci 6/e+xwBKKkJ7oQ/fFKvJ+n6GbV6suV5FUxicYQjC43iiCKklRhG77uek/DK3c5um ZiTu+YapX9qbG54Q9w== =YM/u -----END PGP SIGNATURE----- Merge tag 'loongarch-kvm-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.16 1. Don't flush tlb if HW PTW supported. 2. Add LoongArch KVM selftests support.	2025-05-26 16:12:13 -04:00
Manali Shukla	89f9edf4c6	KVM: SVM: Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs with Bus Lock Threshold, which is close enough to VMX's Bus Lock Detection VM-Exit to allow reusing KVM_CAP_X86_BUS_LOCK_EXIT. The biggest difference between the two features is that Threshold is fault-like, whereas Detection is trap-like. To allow the guest to make forward progress, Threshold provides a per-VMCB counter which is decremented every time a bus lock occurs, and a VM-Exit is triggered if and only if the counter is '0'. To provide Detection-like semantics, initialize the counter to '0', i.e. exit on every bus lock, and when re-executing the guilty instruction, set the counter to '1' to effectively step past the instruction. Note, in the unlikely scenario that re-executing the instruction doesn't trigger a bus lock, e.g. because the guest has changed memory types or patched the guilty instruction, the bus lock counter will be left at '1', i.e. the guest will be able to do a bus lock on a different instruction. In a perfect world, KVM would ensure the counter is '0' if the guest has made forward progress, e.g. if RIP has changed. But trying to close that hole would incur non-trivial complexity, for marginal benefit; the intent of KVM_CAP_X86_BUS_LOCK_EXIT is to allow userspace rate-limit bus locks, not to allow for precise detection of problematic guest code. And, it's simply not feasible to fully close the hole, e.g. if an interrupt arrives before the original instruction can re-execute, the guest could step past a different bus lock. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Manali Shukla <manali.shukla@amd.com> Link: https://lore.kernel.org/r/20250502050346.14274-5-manali.shukla@amd.com [sean: fix typo in comment] Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-05-19 11:05:10 -07:00
Ingo Molnar	1f82e8e1ca	Merge branch 'x86/msr' into x86/core, to resolve conflicts Conflicts: arch/x86/boot/startup/sme.c arch/x86/coco/sev/core.c arch/x86/kernel/fpu/core.c arch/x86/kernel/fpu/xstate.c Semantic conflict: arch/x86/include/asm/sev-internal.h Signed-off-by: Ingo Molnar <mingo@kernel.org>	2025-05-13 10:42:06 +02:00
Sean Christopherson	e3417ab75a	KVM: SVM: Set/clear SRSO's BP_SPEC_REDUCE on 0 <=> 1 VM count transitions Set the magic BP_SPEC_REDUCE bit to mitigate SRSO when running VMs if and only if KVM has at least one active VM. Leaving the bit set at all times unfortunately degrades performance by a wee bit more than expected. Use a dedicated spinlock and counter instead of hooking virtualization enablement, as changing the behavior of kvm.enable_virt_at_load based on SRSO_BP_SPEC_REDUCE is painful, and has its own drawbacks, e.g. could result in performance issues for flows that are sensitive to VM creation latency. Defer setting BP_SPEC_REDUCE until VMRUN is imminent to avoid impacting performance on CPUs that aren't running VMs, e.g. if a setup is using housekeeping CPUs. Setting BP_SPEC_REDUCE in task context, i.e. without blasting IPIs to all CPUs, also helps avoid serializing 1<=>N transitions without incurring a gross amount of complexity (see the Link for details on how ugly coordinating via IPIs gets). Link: https://lore.kernel.org/all/aBOnzNCngyS_pQIW@google.com Fixes: `8442df2b49` ("x86/bugs: KVM: Add support for SRSO_MSR_FIX") Reported-by: Michael Larabel <Michael@michaellarabel.com> Closes: https://www.phoronix.com/review/linux-615-amd-regression Cc: Borislav Petkov <bp@alien8.de> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250505180300.973137-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-05-08 07:17:10 -07:00
Xin Li (Intel)	502ad6e5a6	x86/msr: Change the function type of native_read_msr_safe() Modify the function type of native_read_msr_safe() to: int native_read_msr_safe(u32 msr, u64 *val) This change makes the function return an error code instead of the MSR value, aligning it with the type of native_write_msr_safe(). Consequently, their callers can check the results in the same way. While at it, convert leftover MSR data type "unsigned int" to u32. Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Uros Bizjak <ubizjak@gmail.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20250427092027.1598740-16-xin@zytor.com	2025-05-02 10:36:36 +02:00
Xin Li (Intel)	0c2678efed	x86/pvops/msr: Refactor pv_cpu_ops.write_msr{,_safe}() An MSR value is represented as a 64-bit unsigned integer, with existing MSR instructions storing it in EDX:EAX as two 32-bit segments. The new immediate form MSR instructions, however, utilize a 64-bit general-purpose register to store the MSR value. To unify the usage of all MSR instructions, let the default MSR access APIs accept an MSR value as a single 64-bit argument instead of two 32-bit segments. The dual 32-bit APIs are still available as convenient wrappers over the APIs that handle an MSR value as a single 64-bit argument. The following illustrates the updated derivation of the MSR write APIs: __wrmsrq(u32 msr, u64 val) / \ / \ native_wrmsrq(msr, val) native_wrmsr(msr, low, high) \| \| native_write_msr(msr, val) / \ / \ wrmsrq(msr, val) wrmsr(msr, low, high) When CONFIG_PARAVIRT is enabled, wrmsrq() and wrmsr() are defined on top of paravirt_write_msr(): paravirt_write_msr(u32 msr, u64 val) / \ / \ wrmsrq(msr, val) wrmsr(msr, low, high) paravirt_write_msr() invokes cpu.write_msr(msr, val), an indirect layer of pv_ops MSR write call: If on native: cpu.write_msr = native_write_msr If on Xen: cpu.write_msr = xen_write_msr Therefore, refactor pv_cpu_ops.write_msr{_safe}() to accept an MSR value in a single u64 argument, replacing the current dual u32 arguments. No functional change intended. Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Uros Bizjak <ubizjak@gmail.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20250427092027.1598740-14-xin@zytor.com	2025-05-02 10:36:36 +02:00
Xin Li (Intel)	efef7f184f	x86/msr: Add explicit includes of <asm/msr.h> For historic reasons there are some TSC-related functions in the <asm/msr.h> header, even though there's an <asm/tsc.h> header. To facilitate the relocation of rdtsc{,_ordered}() from <asm/msr.h> to <asm/tsc.h> and to eventually eliminate the inclusion of <asm/msr.h> in <asm/tsc.h>, add an explicit <asm/msr.h> dependency to the source files that reference definitions from <asm/msr.h>. [ mingo: Clarified the changelog. ] Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Kees Cook <keescook@chromium.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20250501054241.1245648-1-xin@zytor.com	2025-05-02 10:23:47 +02:00
Sean Christopherson	54a1a24fea	KVM: x86: Unify cross-vCPU IBPB Both SVM and VMX have similar implementation for executing an IBPB between running different vCPUs on the same CPU to create separate prediction domains for different vCPUs. For VMX, when the currently loaded VMCS is changed in vmx_vcpu_load_vmcs(), an IBPB is executed if there is no 'buddy', which is the case on vCPU load. The intention is to execute an IBPB when switching vCPUs, but not when switching the VMCS within the same vCPU. Executing an IBPB on nested transitions within the same vCPU is handled separately and conditionally in nested_vmx_vmexit(). For SVM, the current VMCB is tracked on vCPU load and an IBPB is executed when it is changed. The intention is also to execute an IBPB when switching vCPUs, although it is possible that in some cases an IBBP is executed when switching VMCBs for the same vCPU. Executing an IBPB on nested transitions should be handled separately, and is proposed at [1]. Unify the logic by tracking the last loaded vCPU and execuintg the IBPB on vCPU change in kvm_arch_vcpu_load() instead. When a vCPU is destroyed, make sure all references to it are removed from any CPU. This is similar to how SVM clears the current_vmcb tracking on vCPU destruction. Remove the current VMCB tracking in SVM as it is no longer required, as well as the 'buddy' parameter to vmx_vcpu_load_vmcs(). [1] https://lore.kernel.org/lkml/20250221163352.3818347-4-yosry.ahmed@linux.dev Link: https://lore.kernel.org/all/20250320013759.3965869-1-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> [sean: tweak comment to stay at/under 80 columns] Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-04-29 08:39:44 -07:00
Yosry Ahmed	1bee4838eb	KVM: SVM: Clear current_vmcb during vCPU free for all possible CPUs When freeing a vCPU and thus its VMCB, clear current_vmcb for all possible CPUs, not just online CPUs, as it's theoretically possible a CPU could go offline and come back online in conjunction with KVM reusing the page for a new VMCB. Link: https://lore.kernel.org/all/20250320013759.3965869-1-yosry.ahmed@linux.dev Fixes: `fd65d3142f` ("kvm: svm: Ensure an IBPB on all affected CPUs when freeing a vmcb") Cc: stable@vger.kernel.org Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> [sean: split to separate patch, write changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-04-29 08:39:35 -07:00
Sean Christopherson	5ecdb48dd9	KVM: SVM: Treat DEBUGCTL[5:2] as reserved Stop ignoring DEBUGCTL[5:2] on AMD CPUs and instead treat them as reserved. KVM has never properly virtualized AMD's legacy PBi bits, but did allow the guest (and host userspace) to set the bits. To avoid breaking guests when running on CPUs with BusLockTrap, which redefined bit 2 to BLCKDB and made bits 5:3 reserved, a previous KVM change ignored bits 5:3, e.g. so that legacy guest software wouldn't inadvertently enable BusLockTrap or hit a VMRUN failure due to setting reserved. To allow for virtualizing BusLockTrap and whatever future features may use bits 5:3, treat bits 5:2 as reserved (and hope that doing so doesn't break any existing guests). Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-04-28 10:56:35 -07:00

1 2 3 4 5 ...

723 commits