linux/Documentation/virt/kvm/x86/errata.rst

.. SPDX-License-Identifier: GPL-2.0

=======================================
Known limitations of CPU virtualization
=======================================

Whenever perfect emulation of a CPU feature is impossible or too hard, KVM
has to choose between not implementing the feature at all or introducing
behavioral differences between virtual machines and bare metal systems.

This file documents some of the known limitations that KVM has in
virtualizing CPU features.

x86
===

``KVM_GET_SUPPORTED_CPUID`` issues
----------------------------------

x87 features
~~~~~~~~~~~~

Unlike most other CPUID feature bits, CPUID[EAX=7,ECX=0]:EBX[6]
(FDP_EXCPTN_ONLY) and CPUID[EAX=7,ECX=0]:EBX]13] (ZERO_FCS_FDS) are
clear if the features are present and set if the features are not present.

Clearing these bits in CPUID has no effect on the operation of the guest;
if these bits are set on hardware, the features will not be present on
any virtual machine that runs on that hardware.

**Workaround:** It is recommended to always set these bits in guest CPUID.
Note however that any software (e.g ``WIN87EM.DLL``) expecting these features
to be present likely predates these CPUID feature bits, and therefore
doesn't know to check for them anyway.

``KVM_SET_VCPU_EVENTS`` issue
-----------------------------

Invalid KVM_SET_VCPU_EVENTS input with respect to error codes *may* result in
failed VM-Entry on Intel CPUs.  Pre-CET Intel CPUs require that exception
injection through the VMCS correctly set the "error code valid" flag, e.g.
require the flag be set when injecting a #GP, clear when injecting a #UD,
clear when injecting a soft exception, etc.  Intel CPUs that enumerate
IA32_VMX_BASIC[56] as '1' relax VMX's consistency checks, and AMD CPUs have no
restrictions whatsoever.  KVM_SET_VCPU_EVENTS doesn't sanity check the vector
versus "has_error_code", i.e. KVM's ABI follows AMD behavior.

Nested virtualization features
------------------------------

TBD

x2APIC
------
When KVM_X2APIC_API_USE_32BIT_IDS is enabled, KVM activates a hack/quirk that
allows sending events to a single vCPU using its x2APIC ID even if the target
vCPU has legacy xAPIC enabled, e.g. to bring up hotplugged vCPUs via INIT-SIPI
on VMs with > 255 vCPUs.  A side effect of the quirk is that, if multiple vCPUs
have the same physical APIC ID, KVM will deliver events targeting that APIC ID
only to the vCPU with the lowest vCPU ID.  If KVM_X2APIC_API_USE_32BIT_IDS is
not enabled, KVM follows x86 architecture when processing interrupts (all vCPUs
matching the target APIC ID receive the interrupt).

MTRRs
-----
KVM does not virtualize guest MTRR memory types.  KVM emulates accesses to MTRR
MSRs, i.e. {RD,WR}MSR in the guest will behave as expected, but KVM does not
honor guest MTRRs when determining the effective memory type, and instead
treats all of guest memory as having Writeback (WB) MTRRs.

CR0.CD
------
KVM does not virtualize CR0.CD on Intel CPUs.  Similar to MTRR MSRs, KVM
emulates CR0.CD accesses so that loads and stores from/to CR0 behave as
expected, but setting CR0.CD=1 has no impact on the cachaeability of guest
memory.

Note, this erratum does not affect AMD CPUs, which fully virtualize CR0.CD in
hardware, i.e. put the CPU caches into "no fill" mode when CR0.CD=1, even when
running in the guest.
Documentation: KVM: Add SPDX-License-Identifier tag +new file mode 100644 +WARNING: Missing or malformed SPDX-License-Identifier tag in line 1 +#27: FILE: Documentation/virt/kvm/x86/errata.rst:1: Opportunistically update all other non-added KVM documents and remove a new extra blank line at EOF for x86/errata.rst. Signed-off-by: Like Xu <likexu@tencent.com> Message-Id: <20220406063715.55625-5-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> 2022-04-06 14:37:15 +08:00			`.. SPDX-License-Identifier: GPL-2.0`
Documentation: KVM: add virtual CPU errata documentation Add a file to document all the different ways in which the virtual CPU emulation is imperfect. Include an example to show how to document such errata. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20220322110712.222449-3-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> 2022-03-22 12:07:11 +01:00
			`=======================================`
			`Known limitations of CPU virtualization`
			`=======================================`

			`Whenever perfect emulation of a CPU feature is impossible or too hard, KVM`
			`has to choose between not implementing the feature at all or introducing`
			`behavioral differences between virtual machines and bare metal systems.`

			`This file documents some of the known limitations that KVM has in`
			`virtualizing CPU features.`

			`x86`
			`===`

			``KVM_GET_SUPPORTED_CPUID`` issues
			`----------------------------------`

			`x87 features`
			`~~~~~~~~~~~~`

			`Unlike most other CPUID feature bits, CPUID[EAX=7,ECX=0]:EBX[6]`
			`(FDP_EXCPTN_ONLY) and CPUID[EAX=7,ECX=0]:EBX]13] (ZERO_FCS_FDS) are`
			`clear if the features are present and set if the features are not present.`

			`Clearing these bits in CPUID has no effect on the operation of the guest;`
			`if these bits are set on hardware, the features will not be present on`
			`any virtual machine that runs on that hardware.`

			`Workaround: It is recommended to always set these bits in guest CPUID.`
			Note however that any software (e.g ``WIN87EM.DLL``) expecting these features
			`to be present likely predates these CPUID feature bits, and therefore`
			`doesn't know to check for them anyway.`

KVM: x86: Document an erratum in KVM_SET_VCPU_EVENTS on Intel CPUs Document a flaw in KVM's ABI which lets userspace attempt to inject a "bad" hardware exception event, and thus induce VM-Fail on Intel CPUs. Fixing the flaw is a fool's errand, as AMD doesn't sanity check the validity of the error code, Intel CPUs that support CET relax the check for Protected Mode, userspace can change the mode after queueing an exception, KVM ignores the error code when emulating Real Mode exceptions, and so on and so forth. The VM-Fail itself doesn't harm KVM or the kernel beyond triggering a ratelimited pr_warn(), so just document the oddity. Link: https://lore.kernel.org/r/20240802200420.330769-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> 2024-08-02 13:04:20 -07:00			``KVM_SET_VCPU_EVENTS`` issue
			`-----------------------------`

			`Invalid KVM_SET_VCPU_EVENTS input with respect to error codes may result in`
			`failed VM-Entry on Intel CPUs. Pre-CET Intel CPUs require that exception`
			`injection through the VMCS correctly set the "error code valid" flag, e.g.`
			`require the flag be set when injecting a #GP, clear when injecting a #UD,`
			`clear when injecting a soft exception, etc. Intel CPUs that enumerate`
			`IA32_VMX_BASIC[56] as '1' relax VMX's consistency checks, and AMD CPUs have no`
			`restrictions whatsoever. KVM_SET_VCPU_EVENTS doesn't sanity check the vector`
			`versus "has_error_code", i.e. KVM's ABI follows AMD behavior.`

Documentation: KVM: add virtual CPU errata documentation Add a file to document all the different ways in which the virtual CPU emulation is imperfect. Include an example to show how to document such errata. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20220322110712.222449-3-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> 2022-03-22 12:07:11 +01:00			`Nested virtualization features`
			`------------------------------`

			`TBD`
KVM: x86: Honor architectural behavior for aliased 8-bit APIC IDs Apply KVM's hotplug hack if and only if userspace has enabled 32-bit IDs for x2APIC. If 32-bit IDs are not enabled, disable the optimized map to honor x86 architectural behavior if multiple vCPUs shared a physical APIC ID. As called out in the changelog that added the hack, all CPUs whose (possibly truncated) APIC ID matches the target are supposed to receive the IPI. KVM intentionally differs from real hardware, because real hardware (Knights Landing) does just "x2apic_id & 0xff" to decide whether to accept the interrupt in xAPIC mode and it can deliver one interrupt to more than one physical destination, e.g. 0x123 to 0x123 and 0x23. Applying the hack even when x2APIC is not fully enabled means KVM doesn't correctly handle scenarios where the guest has aliased xAPIC IDs across multiple vCPUs, as only the vCPU with the lowest vCPU ID will receive any interrupts. It's extremely unlikely any real world guest aliases APIC IDs, or even modifies APIC IDs, but KVM's behavior is arbitrary, e.g. the lowest vCPU ID "wins" regardless of which vCPU is "aliasing" and which vCPU is "normal". Furthermore, the hack is _not_ guaranteed to work! The hack works if and only if the optimized APIC map is successfully allocated. If the map allocation fails (unlikely), KVM will fall back to its unoptimized behavior, which _does_ honor the architectural behavior. Pivot on 32-bit x2APIC IDs being enabled as that is required to take advantage of the hotplug hack (see kvm_apic_state_fixup()), i.e. won't break existing setups unless they are way, way off in the weeds. And an entry in KVM's errata to document the hack. Alternatively, KVM could provide an actual x2APIC quirk and document the hack that way, but there's unlikely to ever be a use case for disabling the quirk. Go the errata route to avoid having to validate a quirk no one cares about. Fixes: 5bd5db385b3e ("KVM: x86: allow hotplug of VCPU with APIC ID over 0xff") Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230106011306.85230-23-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> 2023-01-06 01:12:55 +00:00
			`x2APIC`
			`------`
			`When KVM_X2APIC_API_USE_32BIT_IDS is enabled, KVM activates a hack/quirk that`
			`allows sending events to a single vCPU using its x2APIC ID even if the target`
			`vCPU has legacy xAPIC enabled, e.g. to bring up hotplugged vCPUs via INIT-SIPI`
			`on VMs with > 255 vCPUs. A side effect of the quirk is that, if multiple vCPUs`
			`have the same physical APIC ID, KVM will deliver events targeting that APIC ID`
			`only to the vCPU with the lowest vCPU ID. If KVM_X2APIC_API_USE_32BIT_IDS is`
			`not enabled, KVM follows x86 architecture when processing interrupts (all vCPUs`
			`matching the target APIC ID receive the interrupt).`
KVM: x86: Remove VMX support for virtualizing guest MTRR memtypes Remove KVM's support for virtualizing guest MTRR memtypes, as full MTRR adds no value, negatively impacts guest performance, and is a maintenance burden due to it's complexity and oddities. KVM's approach to virtualizating MTRRs make no sense, at all. KVM only honors guest MTRR memtypes if EPT is enabled and the guest has a device that may perform non-coherent DMA access. From a hardware virtualization perspective of guest MTRRs, there is _nothing_ special about EPT. Legacy shadowing paging doesn't magically account for guest MTRRs, nor does NPT. Unwinding and deciphering KVM's murky history, the MTRR virtualization code appears to be the result of misdiagnosed issues when EPT + VT-d with passthrough devices was enabled years and years ago. And importantly, the underlying bugs that were fudged around by honoring guest MTRR memtypes have since been fixed (though rather poorly in some cases). The zapping GFNs logic in the MTRR virtualization code came from: commit efdfe536d8c643391e19d5726b072f82964bfbdb Author: Xiao Guangrong <guangrong.xiao@linux.intel.com> Date: Wed May 13 14:42:27 2015 +0800 KVM: MMU: fix MTRR update Currently, whenever guest MTRR registers are changed kvm_mmu_reset_context is called to switch to the new root shadow page table, however, it's useless since: 1) the cache type is not cached into shadow page's attribute so that the original root shadow page will be reused 2) the cache type is set on the last spte, that means we should sync the last sptes when MTRR is changed This patch fixs this issue by drop all the spte in the gfn range which is being updated by MTRR which was a fix for: commit 0bed3b568b68e5835ef5da888a372b9beabf7544 Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Thu Oct 9 16:01:54 2008 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Wed Dec 31 16:51:44 2008 +0200 KVM: Improve MTRR structure As well as reset mmu context when set MTRR. which was part of a "MTRR/PAT support for EPT" series that also added: + if (mt_mask) { + mt_mask = get_memory_type(vcpu, gfn) << + kvm_x86_ops->get_mt_mask_shift(); + spte \|= mt_mask; + } where get_memory_type() was a truly gnarly helper to retrieve the guest MTRR memtype for a given memtype. And very subtly, at the time of that change, KVM always set VMX_EPT_IGMT_BIT, kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK \| VMX_EPT_WRITABLE_MASK \| VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT \| VMX_EPT_IGMT_BIT); which came in via: commit 928d4bf747e9c290b690ff515d8f81e8ee226d97 Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Thu Nov 6 14:55:45 2008 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Tue Nov 11 21:00:37 2008 +0200 KVM: VMX: Set IGMT bit in EPT entry There is a potential issue that, when guest using pagetable without vmexit when EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory, which would be inconsistent with host side and would cause host MCE due to inconsistent cache attribute. The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default memory type to protect host (notice that all memory mapped by KVM should be WB). Note the CommitDates! The AuthorDates strongly suggests Sheng Yang added the whole "ignoreIGMT things as a bug fix for issues that were detected during EPT + VT-d + passthrough enabling, but it was applied earlier because it was a generic fix. Jumping back to 0bed3b568b68 ("KVM: Improve MTRR structure"), the other relevant code, or rather lack thereof, is the handling of host MMIO. That fix came in a bit later, but given the author and timing, it's safe to say it was all part of the same EPT+VT-d enabling mess. commit 2aaf69dcee864f4fb6402638dd2f263324ac839f Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Wed Jan 21 16:52:16 2009 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Sun Feb 15 02:47:37 2009 +0200 KVM: MMU: Map device MMIO as UC in EPT Software are not allow to access device MMIO using cacheable memory type, the patch limit MMIO region with UC and WC(guest can select WC using PAT and PCD/PWT). In addition to the host MMIO and IGMT issues, KVM's MTRR virtualization was obviously never tested on NPT until much later, which lends further credence to the theory/argument that this was all the result of misdiagnosed issues. Discussion from the EPT+MTRR enabling thread[] more or less confirms that Sheng Yang was trying to resolve issues with passthrough MMIO. Sheng Yang : Do you mean host(qemu) would access this memory and if we set it to guest : MTRR, host access would be broken? We would cover this in our shadow MTRR : patch, for we encountered this in video ram when doing some experiment with : VGA assignment. And in the same thread, there's also what appears to be confirmation of Intel running into issues with Windows XP related to a guest device driver mapping DMA with WC in the PAT. * Avi Kavity : Sheng Yang wrote: : > Yes... But it's easy to do with assigned devices' mmio, but what if guest : > specific some non-mmio memory's memory type? E.g. we have met one issue in : > Xen, that a assigned-device's XP driver specific one memory region as buffer, : > and modify the memory type then do DMA. : > : > Only map MMIO space can be first step, but I guess we can modify assigned : > memory region memory type follow guest's? : > : : With ept/npt, we can't, since the memory type is in the guest's : pagetable entries, and these are not accessible. [] https://lore.kernel.org/all/1223539317-32379-1-git-send-email-sheng@linux.intel.com So, for the most part, what likely happened is that 15 years ago, a few engineers (a) fixed a #MC problem by ignoring guest PAT and (b) initially "fixed" passthrough device MMIO by emulating guest* MTRRs. Except for the below case, everything since then has been a result of those two intertwined changes. The one exception, which is actually yet more confirmation of all of the above, is the revert of Paolo's attempt at "full" virtualization of guest MTRRs: commit 606decd67049217684e3cb5a54104d51ddd4ef35 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Thu Oct 1 13:12:47 2015 +0200 Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages" This reverts commit fd717f11015f673487ffc826e59b2bad69d20fe5. It was reported to cause Machine Check Exceptions (bug 104091). ... commit fd717f11015f673487ffc826e59b2bad69d20fe5 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Tue Jul 7 14:38:13 2015 +0200 KVM: x86: apply guest MTRR virtualization on host reserved pages Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true. However, the guest could prefer a different page type than UC for such pages. A good example is that pass-throughed VGA frame buffer is not always UC as host expected. This patch enables full use of virtual guest MTRRs. I.e. Paolo tried to add back KVM's behavior before "Map device MMIO as UC in EPT" and got the same result: machine checks, likely due to the guest MTRRs not being trustworthy/sane at all times. Note, Paolo also tried to enable MTRR virtualization on SVM+NPT, but that too got reverted. Unfortunately, it doesn't appear that anyone ever found a smoking gun, i.e. exactly why emulating guest MTRRs via NPT PAT caused extremely slow boot times doesn't appear to have a definitive root cause. commit fc07e76ac7ffa3afd621a1c3858a503386a14281 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Thu Oct 1 13:20:22 2015 +0200 Revert "KVM: SVM: use NPT page attributes" This reverts commit 3c2e7f7de3240216042b61073803b61b9b3cfb22. Initializing the mapping from MTRR to PAT values was reported to fail nondeterministically, and it also caused extremely slow boot (due to caching getting disabled---bug 103321) with assigned devices. ... commit 3c2e7f7de3240216042b61073803b61b9b3cfb22 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Tue Jul 7 14:32:17 2015 +0200 KVM: SVM: use NPT page attributes Right now, NPT page attributes are not used, and the final page attribute depends solely on gPAT (which however is not synced correctly), the guest MTRRs and the guest page attributes. However, we can do better by mimicking what is done for VMX. In the absence of PCI passthrough, the guest PAT can be ignored and the page attributes can be just WB. If passthrough is being used, instead, keep respecting the guest PAT, and emulate the guest MTRRs through the PAT field of the nested page tables. The only snag is that WP memory cannot be emulated correctly, because Linux's default PAT setting only includes the other types. In short, honoring guest MTRRs for VMX was initially a workaround of sorts for KVM ignoring guest PAT and for KVM not forcing UC for host MMIO. And while there are known cases where honoring guest MTRRs is desirable, e.g. passthrough VGA frame buffers, the desired behavior in that case is to get WC instead of UC, i.e. at this point it's for performance, not correctness. Furthermore, the complete absence of MTRR virtualization on NPT and shadow paging proves that, while KVM theoretically can do better, it's by no means necessary for correctnesss. Lastly, since kernels mostly rely on firmware to do MTRR setup, and the host typically provides guest firmware, honoring guest MTRRs is effectively honoring host userspace memtypes, which is also backwards. I.e. it would be far better for host userspace to communicate its desired memtype directly to KVM (or perhaps indirectly via VMAs in the host kernel), not through guest MTRRs. Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20240309010929.1403984-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> 2024-03-08 17:09:25 -08:00
			`MTRRs`
			`-----`
KVM: VMX: Drop support for forcing UC memory when guest CR0.CD=1 Drop KVM's emulation of CR0.CD=1 on Intel CPUs now that KVM no longer honors guest MTRR memtypes, as forcing UC memory for VMs with non-coherent DMA only makes sense if the guest is using something other than PAT to configure the memtype for the DMA region. Furthermore, KVM has forced WB memory for CR0.CD=1 since commit fb279950ba02 ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED"), and no known VMM in existence disables KVM_X86_QUIRK_CD_NW_CLEARED, let alone does so with non-coherent DMA. Lastly, commit fb279950ba02 ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED") was from the same author as commit b18d5431acc7 ("KVM: x86: fix CR0.CD virtualization"), and followed by a mere month. I.e. forcing UC memory was likely the result of code inspection or perhaps misdiagnosed failures, and not the necessitate by a concrete use case. Update KVM's documentation to note that KVM_X86_QUIRK_CD_NW_CLEARED is now AMD-only, and to take an erratum for lack of CR0.CD virtualization on Intel. Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20240309010929.1403984-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> 2024-03-08 17:09:26 -08:00			`KVM does not virtualize guest MTRR memory types. KVM emulates accesses to MTRR`
			`MSRs, i.e. {RD,WR}MSR in the guest will behave as expected, but KVM does not`
			`honor guest MTRRs when determining the effective memory type, and instead`
			`treats all of guest memory as having Writeback (WB) MTRRs.`

			`CR0.CD`
			`------`
			`KVM does not virtualize CR0.CD on Intel CPUs. Similar to MTRR MSRs, KVM`
			`emulates CR0.CD accesses so that loads and stores from/to CR0 behave as`
			`expected, but setting CR0.CD=1 has no impact on the cachaeability of guest`
			`memory.`

			`Note, this erratum does not affect AMD CPUs, which fully virtualize CR0.CD in`
			`hardware, i.e. put the CPU caches into "no fill" mode when CR0.CD=1, even when`
			`running in the guest.`