2022-04-06 02:29:10 +03:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
|
|
|
/* Copyright (C) 2021-2022 Intel Corporation */
|
|
|
|
|
|
|
|
#undef pr_fmt
|
|
|
|
#define pr_fmt(fmt) "tdx: " fmt
|
|
|
|
|
|
|
|
#include <linux/cpufeature.h>
|
2022-11-16 14:38:18 -08:00
|
|
|
#include <linux/export.h>
|
|
|
|
#include <linux/io.h>
|
2022-04-06 02:29:13 +03:00
|
|
|
#include <asm/coco.h>
|
2022-04-06 02:29:10 +03:00
|
|
|
#include <asm/tdx.h>
|
x86/tdx: Add HLT support for TDX guests
The HLT instruction is a privileged instruction, executing it stops
instruction execution and places the processor in a HALT state. It
is used in kernel for cases like reboot, idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).
To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].
In TDX guests, executing HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires STI instruction to
be executed just before the TDCALL. Since idle loop is the only user of
safe_halt() variant, handle it as a special case.
To avoid *safe_halt() call in the idle function, define the
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.
Alternative choices like PV ops have been considered for adding
safe_halt() support. But it was rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
safe_halt() use case is not worth the cost.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 02:29:17 +03:00
|
|
|
#include <asm/vmx.h>
|
2023-12-04 11:31:38 +03:00
|
|
|
#include <asm/ia32.h>
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
#include <asm/insn.h>
|
|
|
|
#include <asm/insn-eval.h>
|
2022-04-06 02:29:35 +03:00
|
|
|
#include <asm/pgtable.h>
|
2022-04-06 02:29:10 +03:00
|
|
|
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
/* MMIO direction */
|
|
|
|
#define EPT_READ 0
|
|
|
|
#define EPT_WRITE 1
|
|
|
|
|
2022-04-06 02:29:26 +03:00
|
|
|
/* Port I/O direction */
|
|
|
|
#define PORT_READ 0
|
|
|
|
#define PORT_WRITE 1
|
|
|
|
|
|
|
|
/* See Exit Qualification for I/O Instructions in VMX documentation */
|
|
|
|
#define VE_IS_IO_IN(e) ((e) & BIT(3))
|
|
|
|
#define VE_GET_IO_SIZE(e) (((e) & GENMASK(2, 0)) + 1)
|
|
|
|
#define VE_GET_PORT_NUM(e) ((e) >> 16)
|
|
|
|
#define VE_IS_IO_STRING(e) ((e) & BIT(4))
|
|
|
|
|
2023-01-27 01:11:58 +03:00
|
|
|
#define ATTR_DEBUG BIT(0)
|
2022-10-28 17:12:20 +03:00
|
|
|
#define ATTR_SEPT_VE_DISABLE BIT(28)
|
|
|
|
|
2022-11-16 14:38:18 -08:00
|
|
|
/* TDX Module call error codes */
|
|
|
|
#define TDCALL_RETURN_CODE(a) ((a) >> 32)
|
|
|
|
#define TDCALL_INVALID_OPERAND 0xc0000100
|
|
|
|
|
|
|
|
#define TDREPORT_SUBTYPE_0 0
|
|
|
|
|
x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
Guests communicate with VMMs with hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs
like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
expose the guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with VMM, TDX
specification defines a new instruction called TDCALL.
In a TDX based VM, since the VMM is an untrusted entity, an intermediary
layer -- TDX module -- facilitates secure communication between the host
and the guest. TDX module is loaded like a firmware into a special CPU
mode called SEAM. TDX guests communicate with the TDX module using the
TDCALL instruction.
A guest uses TDCALL to communicate with both the TDX module and VMM.
The value of the RAX register when executing the TDCALL instruction is
used to determine the TDCALL type. A leaf of TDCALL used to communicate
with the VMM is called TDVMCALL.
Add generic interfaces to communicate with the TDX module and VMM
(using the TDCALL instruction).
__tdx_module_call() - Used to communicate with the TDX module (via
TDCALL instruction).
__tdx_hypercall() - Used by the guest to request services from
the VMM (via TDVMCALL leaf of TDCALL).
Also define an additional wrapper _tdx_hypercall(), which adds error
handling support for the TDCALL failure.
The __tdx_module_call() and __tdx_hypercall() helper functions are
implemented in assembly in a .S file. The TDCALL ABI requires
shuffling arguments in and out of registers, which proved to be
awkward with inline assembly.
Just like syscalls, not all TDVMCALL use cases need to use the same
number of argument registers. The implementation here picks the current
worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer
than 4 arguments, there will end up being a few superfluous (cheap)
instructions. But, this approach maximizes code reuse.
For registers used by the TDCALL instruction, please check TDX GHCI
specification, the section titled "TDCALL instruction" and "TDG.VP.VMCALL
Interface".
Based on previous patch by Sean Christopherson.
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-4-kirill.shutemov@linux.intel.com
2022-04-06 02:29:12 +03:00
|
|
|
/* Called from __tdx_hypercall() for unrecoverable failure */
|
x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALL
Now the 'struct tdx_hypercall_args' and 'struct tdx_module_args' are
almost the same, and the TDX_HYPERCALL and TDX_MODULE_CALL asm macro
share similar code pattern too. The __tdx_hypercall() and __tdcall()
should be unified to use the same assembly code.
As a preparation to unify them, simplify the TDX_HYPERCALL to make it
more like the TDX_MODULE_CALL.
The TDX_HYPERCALL takes the pointer of 'struct tdx_hypercall_args' as
function call argument, and does below extra things comparing to the
TDX_MODULE_CALL:
1) It sets RAX to 0 (TDG.VP.VMCALL leaf) internally;
2) It sets RCX to the (fixed) bitmap of shared registers internally;
3) It calls __tdx_hypercall_failed() internally (and panics) when the
TDCALL instruction itself fails;
4) After TDCALL, it moves R10 to RAX to return the return code of the
VMCALL leaf, regardless the '\ret' asm macro argument;
Firstly, change the TDX_HYPERCALL to take the same function call
arguments as the TDX_MODULE_CALL does: TDCALL leaf ID, and the pointer
to 'struct tdx_module_args'. Then 1) and 2) can be moved to the
caller:
- TDG.VP.VMCALL leaf ID can be passed via the function call argument;
- 'struct tdx_module_args' is 'struct tdx_hypercall_args' + RCX, thus
the bitmap of shared registers can be passed via RCX in the
structure.
Secondly, to move 3) and 4) out of assembly, make the TDX_HYPERCALL
always save output registers to the structure. The caller then can:
- Call __tdx_hypercall_failed() when TDX_HYPERCALL returns error;
- Return R10 in the structure as the return code of the VMCALL leaf;
With above changes, change the asm function from __tdx_hypercall() to
__tdcall_hypercall(), and reimplement __tdx_hypercall() as the C wrapper
of it. This avoids having to add another wrapper of __tdx_hypercall()
(_tdx_hypercall() is already taken).
The __tdcall_hypercall() will be replaced with a __tdcall() variant
using TDX_MODULE_CALL in a later commit as the final goal is to have one
assembly to handle both TDCALL and TDVMCALL.
Currently, the __tdx_hypercall() asm is in '.noinstr.text'. To keep
this unchanged, annotate __tdx_hypercall(), which is a C function now,
as 'noinstr'.
Remove the __tdx_hypercall_ret() as __tdx_hypercall() already does so.
Implement __tdx_hypercall() in tdx-shared.c so it can be shared with the
compressed code.
Opportunistically fix a checkpatch error complaining using space around
parenthesis '(' and ')' while moving the bitmap of shared registers to
<asm/shared/tdx.h>.
[ dhansen: quash new calls of __tdx_hypercall_ret() that showed up ]
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/0cbf25e7aee3256288045023a31f65f0cef90af4.1692096753.git.kai.huang%40intel.com
2023-08-15 23:02:01 +12:00
|
|
|
noinstr void __noreturn __tdx_hypercall_failed(void)
|
x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
Guests communicate with VMMs with hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs
like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
expose the guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with VMM, TDX
specification defines a new instruction called TDCALL.
In a TDX based VM, since the VMM is an untrusted entity, an intermediary
layer -- TDX module -- facilitates secure communication between the host
and the guest. TDX module is loaded like a firmware into a special CPU
mode called SEAM. TDX guests communicate with the TDX module using the
TDCALL instruction.
A guest uses TDCALL to communicate with both the TDX module and VMM.
The value of the RAX register when executing the TDCALL instruction is
used to determine the TDCALL type. A leaf of TDCALL used to communicate
with the VMM is called TDVMCALL.
Add generic interfaces to communicate with the TDX module and VMM
(using the TDCALL instruction).
__tdx_module_call() - Used to communicate with the TDX module (via
TDCALL instruction).
__tdx_hypercall() - Used by the guest to request services from
the VMM (via TDVMCALL leaf of TDCALL).
Also define an additional wrapper _tdx_hypercall(), which adds error
handling support for the TDCALL failure.
The __tdx_module_call() and __tdx_hypercall() helper functions are
implemented in assembly in a .S file. The TDCALL ABI requires
shuffling arguments in and out of registers, which proved to be
awkward with inline assembly.
Just like syscalls, not all TDVMCALL use cases need to use the same
number of argument registers. The implementation here picks the current
worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer
than 4 arguments, there will end up being a few superfluous (cheap)
instructions. But, this approach maximizes code reuse.
For registers used by the TDCALL instruction, please check TDX GHCI
specification, the section titled "TDCALL instruction" and "TDG.VP.VMCALL
Interface".
Based on previous patch by Sean Christopherson.
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-4-kirill.shutemov@linux.intel.com
2022-04-06 02:29:12 +03:00
|
|
|
{
|
2023-01-12 20:43:43 +01:00
|
|
|
instrumentation_begin();
|
x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions
Guests communicate with VMMs with hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs
like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
expose the guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with VMM, TDX
specification defines a new instruction called TDCALL.
In a TDX based VM, since the VMM is an untrusted entity, an intermediary
layer -- TDX module -- facilitates secure communication between the host
and the guest. TDX module is loaded like a firmware into a special CPU
mode called SEAM. TDX guests communicate with the TDX module using the
TDCALL instruction.
A guest uses TDCALL to communicate with both the TDX module and VMM.
The value of the RAX register when executing the TDCALL instruction is
used to determine the TDCALL type. A leaf of TDCALL used to communicate
with the VMM is called TDVMCALL.
Add generic interfaces to communicate with the TDX module and VMM
(using the TDCALL instruction).
__tdx_module_call() - Used to communicate with the TDX module (via
TDCALL instruction).
__tdx_hypercall() - Used by the guest to request services from
the VMM (via TDVMCALL leaf of TDCALL).
Also define an additional wrapper _tdx_hypercall(), which adds error
handling support for the TDCALL failure.
The __tdx_module_call() and __tdx_hypercall() helper functions are
implemented in assembly in a .S file. The TDCALL ABI requires
shuffling arguments in and out of registers, which proved to be
awkward with inline assembly.
Just like syscalls, not all TDVMCALL use cases need to use the same
number of argument registers. The implementation here picks the current
worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer
than 4 arguments, there will end up being a few superfluous (cheap)
instructions. But, this approach maximizes code reuse.
For registers used by the TDCALL instruction, please check TDX GHCI
specification, the section titled "TDCALL instruction" and "TDG.VP.VMCALL
Interface".
Based on previous patch by Sean Christopherson.
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-4-kirill.shutemov@linux.intel.com
2022-04-06 02:29:12 +03:00
|
|
|
panic("TDVMCALL failed. TDX module bug?");
|
|
|
|
}
|
|
|
|
|
2022-04-06 02:29:28 +03:00
|
|
|
#ifdef CONFIG_KVM_GUEST
|
|
|
|
long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
|
|
|
|
unsigned long p3, unsigned long p4)
|
|
|
|
{
|
2023-08-15 23:02:03 +12:00
|
|
|
struct tdx_module_args args = {
|
2022-04-06 02:29:28 +03:00
|
|
|
.r10 = nr,
|
|
|
|
.r11 = p1,
|
|
|
|
.r12 = p2,
|
|
|
|
.r13 = p3,
|
|
|
|
.r14 = p4,
|
|
|
|
};
|
|
|
|
|
2023-03-21 03:35:11 +03:00
|
|
|
return __tdx_hypercall(&args);
|
2022-04-06 02:29:28 +03:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(tdx_kvm_hypercall);
|
|
|
|
#endif
|
|
|
|
|
2022-04-06 02:29:13 +03:00
|
|
|
/*
|
|
|
|
* Used for TDX guests to make calls directly to the TD module. This
|
|
|
|
* should only be used for calls that have no legitimate reason to fail
|
|
|
|
* or where the kernel can not survive the call failing.
|
|
|
|
*/
|
x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure
Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and
SEAMCALL, takes one parameter for each input register and an optional
'struct tdx_module_output' (a collection of output registers) as output.
This is different from the TDX_HYPERCALL macro which uses a single
'struct tdx_hypercall_args' to carry all input/output registers.
The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more
input/output registers. Also, the TDH.VP.ENTER (which isn't covered
by the current TDX_MODULE_CALL macro) basically can use all registers
that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't
extendible to cover those cases.
Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro
to use a single structure 'struct tdx_module_args' to carry all the
input/output registers. Currently, R10/R11 are only used as output
register but not as input by any TDCALL/SEAMCALL. Change to also use
R10/R11 as input register to make input/output registers symmetric.
Currently, the TDX_MODULE_CALL macro depends on the caller to pass a
non-NULL 'struct tdx_module_output' to get additional output registers.
Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to
take a new 'ret' macro argument to indicate whether to save the output
registers to the 'struct tdx_module_args'. Also introduce a new
__tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret().
Note the tdcall(), which is a wrapper of __tdcall(), is called by three
callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init().
The former two need the additional output but the last one doesn't. For
simplicity, make tdcall() always call __tdcall_ret() to avoid another
"_ret()" wrapper. The last caller tdx_early_init() isn't performance
critical anyway.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com
2023-08-15 23:01:59 +12:00
|
|
|
static inline void tdcall(u64 fn, struct tdx_module_args *args)
|
2022-04-06 02:29:13 +03:00
|
|
|
{
|
x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure
Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and
SEAMCALL, takes one parameter for each input register and an optional
'struct tdx_module_output' (a collection of output registers) as output.
This is different from the TDX_HYPERCALL macro which uses a single
'struct tdx_hypercall_args' to carry all input/output registers.
The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more
input/output registers. Also, the TDH.VP.ENTER (which isn't covered
by the current TDX_MODULE_CALL macro) basically can use all registers
that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't
extendible to cover those cases.
Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro
to use a single structure 'struct tdx_module_args' to carry all the
input/output registers. Currently, R10/R11 are only used as output
register but not as input by any TDCALL/SEAMCALL. Change to also use
R10/R11 as input register to make input/output registers symmetric.
Currently, the TDX_MODULE_CALL macro depends on the caller to pass a
non-NULL 'struct tdx_module_output' to get additional output registers.
Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to
take a new 'ret' macro argument to indicate whether to save the output
registers to the 'struct tdx_module_args'. Also introduce a new
__tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret().
Note the tdcall(), which is a wrapper of __tdcall(), is called by three
callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init().
The former two need the additional output but the last one doesn't. For
simplicity, make tdcall() always call __tdcall_ret() to avoid another
"_ret()" wrapper. The last caller tdx_early_init() isn't performance
critical anyway.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com
2023-08-15 23:01:59 +12:00
|
|
|
if (__tdcall_ret(fn, args))
|
2022-04-06 02:29:13 +03:00
|
|
|
panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
|
|
|
|
}
|
|
|
|
|
2022-11-16 14:38:18 -08:00
|
|
|
/**
|
|
|
|
* tdx_mcall_get_report0() - Wrapper to get TDREPORT0 (a.k.a. TDREPORT
|
|
|
|
* subtype 0) using TDG.MR.REPORT TDCALL.
|
|
|
|
* @reportdata: Address of the input buffer which contains user-defined
|
|
|
|
* REPORTDATA to be included into TDREPORT.
|
|
|
|
* @tdreport: Address of the output buffer to store TDREPORT.
|
|
|
|
*
|
|
|
|
* Refer to section titled "TDG.MR.REPORT leaf" in the TDX Module
|
|
|
|
* v1.0 specification for more information on TDG.MR.REPORT TDCALL.
|
|
|
|
* It is used in the TDX guest driver module to get the TDREPORT0.
|
|
|
|
*
|
|
|
|
* Return 0 on success, -EINVAL for invalid operands, or -EIO on
|
|
|
|
* other TDCALL failures.
|
|
|
|
*/
|
|
|
|
int tdx_mcall_get_report0(u8 *reportdata, u8 *tdreport)
|
|
|
|
{
|
x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure
Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and
SEAMCALL, takes one parameter for each input register and an optional
'struct tdx_module_output' (a collection of output registers) as output.
This is different from the TDX_HYPERCALL macro which uses a single
'struct tdx_hypercall_args' to carry all input/output registers.
The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more
input/output registers. Also, the TDH.VP.ENTER (which isn't covered
by the current TDX_MODULE_CALL macro) basically can use all registers
that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't
extendible to cover those cases.
Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro
to use a single structure 'struct tdx_module_args' to carry all the
input/output registers. Currently, R10/R11 are only used as output
register but not as input by any TDCALL/SEAMCALL. Change to also use
R10/R11 as input register to make input/output registers symmetric.
Currently, the TDX_MODULE_CALL macro depends on the caller to pass a
non-NULL 'struct tdx_module_output' to get additional output registers.
Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to
take a new 'ret' macro argument to indicate whether to save the output
registers to the 'struct tdx_module_args'. Also introduce a new
__tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret().
Note the tdcall(), which is a wrapper of __tdcall(), is called by three
callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init().
The former two need the additional output but the last one doesn't. For
simplicity, make tdcall() always call __tdcall_ret() to avoid another
"_ret()" wrapper. The last caller tdx_early_init() isn't performance
critical anyway.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com
2023-08-15 23:01:59 +12:00
|
|
|
struct tdx_module_args args = {
|
|
|
|
.rcx = virt_to_phys(tdreport),
|
|
|
|
.rdx = virt_to_phys(reportdata),
|
|
|
|
.r8 = TDREPORT_SUBTYPE_0,
|
|
|
|
};
|
2022-11-16 14:38:18 -08:00
|
|
|
u64 ret;
|
|
|
|
|
x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure
Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and
SEAMCALL, takes one parameter for each input register and an optional
'struct tdx_module_output' (a collection of output registers) as output.
This is different from the TDX_HYPERCALL macro which uses a single
'struct tdx_hypercall_args' to carry all input/output registers.
The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more
input/output registers. Also, the TDH.VP.ENTER (which isn't covered
by the current TDX_MODULE_CALL macro) basically can use all registers
that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't
extendible to cover those cases.
Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro
to use a single structure 'struct tdx_module_args' to carry all the
input/output registers. Currently, R10/R11 are only used as output
register but not as input by any TDCALL/SEAMCALL. Change to also use
R10/R11 as input register to make input/output registers symmetric.
Currently, the TDX_MODULE_CALL macro depends on the caller to pass a
non-NULL 'struct tdx_module_output' to get additional output registers.
Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to
take a new 'ret' macro argument to indicate whether to save the output
registers to the 'struct tdx_module_args'. Also introduce a new
__tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret().
Note the tdcall(), which is a wrapper of __tdcall(), is called by three
callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init().
The former two need the additional output but the last one doesn't. For
simplicity, make tdcall() always call __tdcall_ret() to avoid another
"_ret()" wrapper. The last caller tdx_early_init() isn't performance
critical anyway.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com
2023-08-15 23:01:59 +12:00
|
|
|
ret = __tdcall(TDG_MR_REPORT, &args);
|
2022-11-16 14:38:18 -08:00
|
|
|
if (ret) {
|
|
|
|
if (TDCALL_RETURN_CODE(ret) == TDCALL_INVALID_OPERAND)
|
|
|
|
return -EINVAL;
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(tdx_mcall_get_report0);
|
|
|
|
|
virt: tdx-guest: Add Quote generation support using TSM_REPORTS
In TDX guest, the attestation process is used to verify the TDX guest
trustworthiness to other entities before provisioning secrets to the
guest. The first step in the attestation process is TDREPORT
generation, which involves getting the guest measurement data in the
format of TDREPORT, which is further used to validate the authenticity
of the TDX guest. TDREPORT by design is integrity-protected and can
only be verified on the local machine.
To support remote verification of the TDREPORT in a SGX-based
attestation, the TDREPORT needs to be sent to the SGX Quoting Enclave
(QE) to convert it to a remotely verifiable Quote. SGX QE by design can
only run outside of the TDX guest (i.e. in a host process or in a
normal VM) and guest can use communication channels like vsock or
TCP/IP to send the TDREPORT to the QE. But for security concerns, the
TDX guest may not support these communication channels. To handle such
cases, TDX defines a GetQuote hypercall which can be used by the guest
to request the host VMM to communicate with the SGX QE. More details
about GetQuote hypercall can be found in TDX Guest-Host Communication
Interface (GHCI) for Intel TDX 1.0, section titled
"TDG.VP.VMCALL<GetQuote>".
Trusted Security Module (TSM) [1] exposes a common ABI for Confidential
Computing Guest platforms to get the measurement data via ConfigFS.
Extend the TSM framework and add support to allow an attestation agent
to get the TDX Quote data (included usage example below).
report=/sys/kernel/config/tsm/report/report0
mkdir $report
dd if=/dev/urandom bs=64 count=1 > $report/inblob
hexdump -C $report/outblob
rmdir $report
GetQuote TDVMCALL requires TD guest pass a 4K aligned shared buffer
with TDREPORT data as input, which is further used by the VMM to copy
the TD Quote result after successful Quote generation. To create the
shared buffer, allocate a large enough memory and mark it shared using
set_memory_decrypted() in tdx_guest_init(). This buffer will be re-used
for GetQuote requests in the TDX TSM handler.
Although this method reserves a fixed chunk of memory for GetQuote
requests, such one time allocation can help avoid memory fragmentation
related allocation failures later in the uptime of the guest.
Since the Quote generation process is not time-critical or frequently
used, the current version uses a polling model for Quote requests and
it also does not support parallel GetQuote requests.
Link: https://lore.kernel.org/lkml/169342399185.3934343.3035845348326944519.stgit@dwillia2-xfh.jf.intel.com/ [1]
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Erdem Aktas <erdemaktas@google.com>
Tested-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Tested-by: Peter Gonda <pgonda@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2023-09-25 20:10:10 -07:00
|
|
|
/**
|
|
|
|
* tdx_hcall_get_quote() - Wrapper to request TD Quote using GetQuote
|
|
|
|
* hypercall.
|
|
|
|
* @buf: Address of the directly mapped shared kernel buffer which
|
|
|
|
* contains TDREPORT. The same buffer will be used by VMM to
|
|
|
|
* store the generated TD Quote output.
|
|
|
|
* @size: size of the tdquote buffer (4KB-aligned).
|
|
|
|
*
|
|
|
|
* Refer to section titled "TDG.VP.VMCALL<GetQuote>" in the TDX GHCI
|
|
|
|
* v1.0 specification for more information on GetQuote hypercall.
|
|
|
|
* It is used in the TDX guest driver module to get the TD Quote.
|
|
|
|
*
|
|
|
|
* Return 0 on success or error code on failure.
|
|
|
|
*/
|
|
|
|
u64 tdx_hcall_get_quote(u8 *buf, size_t size)
|
|
|
|
{
|
|
|
|
/* Since buf is a shared memory, set the shared (decrypted) bits */
|
|
|
|
return _tdx_hypercall(TDVMCALL_GET_QUOTE, cc_mkdec(virt_to_phys(buf)), size, 0, 0);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(tdx_hcall_get_quote);
|
|
|
|
|
2023-01-27 01:11:57 +03:00
|
|
|
static void __noreturn tdx_panic(const char *msg)
|
|
|
|
{
|
2023-08-15 23:02:03 +12:00
|
|
|
struct tdx_module_args args = {
|
2023-01-27 01:11:57 +03:00
|
|
|
.r10 = TDX_HYPERCALL_STANDARD,
|
|
|
|
.r11 = TDVMCALL_REPORT_FATAL_ERROR,
|
|
|
|
.r12 = 0, /* Error code: 0 is Panic */
|
|
|
|
};
|
|
|
|
union {
|
|
|
|
/* Define register order according to the GHCI */
|
|
|
|
struct { u64 r14, r15, rbx, rdi, rsi, r8, r9, rdx; };
|
|
|
|
|
|
|
|
char str[64];
|
|
|
|
} message;
|
|
|
|
|
|
|
|
/* VMM assumes '\0' in byte 65, if the message took all 64 bytes */
|
2023-10-03 21:54:59 +00:00
|
|
|
strtomem_pad(message.str, msg, '\0');
|
2023-01-27 01:11:57 +03:00
|
|
|
|
|
|
|
args.r8 = message.r8;
|
|
|
|
args.r9 = message.r9;
|
|
|
|
args.r14 = message.r14;
|
|
|
|
args.r15 = message.r15;
|
|
|
|
args.rdi = message.rdi;
|
|
|
|
args.rsi = message.rsi;
|
|
|
|
args.rbx = message.rbx;
|
|
|
|
args.rdx = message.rdx;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This hypercall should never return and it is not safe
|
|
|
|
* to keep the guest running. Call it forever if it
|
|
|
|
* happens to return.
|
|
|
|
*/
|
|
|
|
while (1)
|
2023-03-21 03:35:11 +03:00
|
|
|
__tdx_hypercall(&args);
|
2023-01-27 01:11:57 +03:00
|
|
|
}
|
|
|
|
|
2022-10-28 17:12:19 +03:00
|
|
|
static void tdx_parse_tdinfo(u64 *cc_mask)
|
2022-04-06 02:29:13 +03:00
|
|
|
{
|
x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure
Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and
SEAMCALL, takes one parameter for each input register and an optional
'struct tdx_module_output' (a collection of output registers) as output.
This is different from the TDX_HYPERCALL macro which uses a single
'struct tdx_hypercall_args' to carry all input/output registers.
The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more
input/output registers. Also, the TDH.VP.ENTER (which isn't covered
by the current TDX_MODULE_CALL macro) basically can use all registers
that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't
extendible to cover those cases.
Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro
to use a single structure 'struct tdx_module_args' to carry all the
input/output registers. Currently, R10/R11 are only used as output
register but not as input by any TDCALL/SEAMCALL. Change to also use
R10/R11 as input register to make input/output registers symmetric.
Currently, the TDX_MODULE_CALL macro depends on the caller to pass a
non-NULL 'struct tdx_module_output' to get additional output registers.
Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to
take a new 'ret' macro argument to indicate whether to save the output
registers to the 'struct tdx_module_args'. Also introduce a new
__tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret().
Note the tdcall(), which is a wrapper of __tdcall(), is called by three
callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init().
The former two need the additional output but the last one doesn't. For
simplicity, make tdcall() always call __tdcall_ret() to avoid another
"_ret()" wrapper. The last caller tdx_early_init() isn't performance
critical anyway.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com
2023-08-15 23:01:59 +12:00
|
|
|
struct tdx_module_args args = {};
|
2022-04-06 02:29:13 +03:00
|
|
|
unsigned int gpa_width;
|
2022-10-28 17:12:20 +03:00
|
|
|
u64 td_attr;
|
2022-04-06 02:29:13 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* TDINFO TDX module call is used to get the TD execution environment
|
|
|
|
* information like GPA width, number of available vcpus, debug mode
|
|
|
|
* information, etc. More details about the ABI can be found in TDX
|
|
|
|
* Guest-Host-Communication Interface (GHCI), section 2.4.2 TDCALL
|
|
|
|
* [TDG.VP.INFO].
|
|
|
|
*/
|
x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure
Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and
SEAMCALL, takes one parameter for each input register and an optional
'struct tdx_module_output' (a collection of output registers) as output.
This is different from the TDX_HYPERCALL macro which uses a single
'struct tdx_hypercall_args' to carry all input/output registers.
The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more
input/output registers. Also, the TDH.VP.ENTER (which isn't covered
by the current TDX_MODULE_CALL macro) basically can use all registers
that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't
extendible to cover those cases.
Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro
to use a single structure 'struct tdx_module_args' to carry all the
input/output registers. Currently, R10/R11 are only used as output
register but not as input by any TDCALL/SEAMCALL. Change to also use
R10/R11 as input register to make input/output registers symmetric.
Currently, the TDX_MODULE_CALL macro depends on the caller to pass a
non-NULL 'struct tdx_module_output' to get additional output registers.
Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to
take a new 'ret' macro argument to indicate whether to save the output
registers to the 'struct tdx_module_args'. Also introduce a new
__tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret().
Note the tdcall(), which is a wrapper of __tdcall(), is called by three
callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init().
The former two need the additional output but the last one doesn't. For
simplicity, make tdcall() always call __tdcall_ret() to avoid another
"_ret()" wrapper. The last caller tdx_early_init() isn't performance
critical anyway.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com
2023-08-15 23:01:59 +12:00
|
|
|
tdcall(TDG_VP_INFO, &args);
|
2022-04-06 02:29:13 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The highest bit of a guest physical address is the "sharing" bit.
|
|
|
|
* Set it for shared pages and clear it for private pages.
|
2022-10-28 17:12:20 +03:00
|
|
|
*
|
|
|
|
* The GPA width that comes out of this call is critical. TDX guests
|
|
|
|
* can not meaningfully run without it.
|
2022-04-06 02:29:13 +03:00
|
|
|
*/
|
x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure
Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and
SEAMCALL, takes one parameter for each input register and an optional
'struct tdx_module_output' (a collection of output registers) as output.
This is different from the TDX_HYPERCALL macro which uses a single
'struct tdx_hypercall_args' to carry all input/output registers.
The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more
input/output registers. Also, the TDH.VP.ENTER (which isn't covered
by the current TDX_MODULE_CALL macro) basically can use all registers
that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't
extendible to cover those cases.
Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro
to use a single structure 'struct tdx_module_args' to carry all the
input/output registers. Currently, R10/R11 are only used as output
register but not as input by any TDCALL/SEAMCALL. Change to also use
R10/R11 as input register to make input/output registers symmetric.
Currently, the TDX_MODULE_CALL macro depends on the caller to pass a
non-NULL 'struct tdx_module_output' to get additional output registers.
Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to
take a new 'ret' macro argument to indicate whether to save the output
registers to the 'struct tdx_module_args'. Also introduce a new
__tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret().
Note the tdcall(), which is a wrapper of __tdcall(), is called by three
callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init().
The former two need the additional output but the last one doesn't. For
simplicity, make tdcall() always call __tdcall_ret() to avoid another
"_ret()" wrapper. The last caller tdx_early_init() isn't performance
critical anyway.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com
2023-08-15 23:01:59 +12:00
|
|
|
gpa_width = args.rcx & GENMASK(5, 0);
|
2022-10-28 17:12:19 +03:00
|
|
|
*cc_mask = BIT_ULL(gpa_width - 1);
|
2022-10-28 17:12:20 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The kernel can not handle #VE's when accessing normal kernel
|
|
|
|
* memory. Ensure that no #VE will be delivered for accesses to
|
|
|
|
* TD-private memory. Only VMM-shared memory (MMIO) will #VE.
|
|
|
|
*/
|
x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure
Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and
SEAMCALL, takes one parameter for each input register and an optional
'struct tdx_module_output' (a collection of output registers) as output.
This is different from the TDX_HYPERCALL macro which uses a single
'struct tdx_hypercall_args' to carry all input/output registers.
The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more
input/output registers. Also, the TDH.VP.ENTER (which isn't covered
by the current TDX_MODULE_CALL macro) basically can use all registers
that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't
extendible to cover those cases.
Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro
to use a single structure 'struct tdx_module_args' to carry all the
input/output registers. Currently, R10/R11 are only used as output
register but not as input by any TDCALL/SEAMCALL. Change to also use
R10/R11 as input register to make input/output registers symmetric.
Currently, the TDX_MODULE_CALL macro depends on the caller to pass a
non-NULL 'struct tdx_module_output' to get additional output registers.
Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to
take a new 'ret' macro argument to indicate whether to save the output
registers to the 'struct tdx_module_args'. Also introduce a new
__tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret().
Note the tdcall(), which is a wrapper of __tdcall(), is called by three
callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init().
The former two need the additional output but the last one doesn't. For
simplicity, make tdcall() always call __tdcall_ret() to avoid another
"_ret()" wrapper. The last caller tdx_early_init() isn't performance
critical anyway.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com
2023-08-15 23:01:59 +12:00
|
|
|
td_attr = args.rdx;
|
2023-01-27 01:11:58 +03:00
|
|
|
if (!(td_attr & ATTR_SEPT_VE_DISABLE)) {
|
|
|
|
const char *msg = "TD misconfiguration: SEPT_VE_DISABLE attribute must be set.";
|
|
|
|
|
|
|
|
/* Relax SEPT_VE_DISABLE check for debug TD. */
|
|
|
|
if (td_attr & ATTR_DEBUG)
|
|
|
|
pr_warn("%s\n", msg);
|
|
|
|
else
|
|
|
|
tdx_panic(msg);
|
|
|
|
}
|
2022-04-06 02:29:13 +03:00
|
|
|
}
|
|
|
|
|
2022-06-14 15:01:34 +03:00
|
|
|
/*
|
|
|
|
* The TDX module spec states that #VE may be injected for a limited set of
|
|
|
|
* reasons:
|
|
|
|
*
|
|
|
|
* - Emulation of the architectural #VE injection on EPT violation;
|
|
|
|
*
|
|
|
|
* - As a result of guest TD execution of a disallowed instruction,
|
|
|
|
* a disallowed MSR access, or CPUID virtualization;
|
|
|
|
*
|
|
|
|
* - A notification to the guest TD about anomalous behavior;
|
|
|
|
*
|
|
|
|
* The last one is opt-in and is not used by the kernel.
|
|
|
|
*
|
|
|
|
* The Intel Software Developer's Manual describes cases when instruction
|
|
|
|
* length field can be used in section "Information for VM Exits Due to
|
|
|
|
* Instruction Execution".
|
|
|
|
*
|
|
|
|
* For TDX, it ultimately means GET_VEINFO provides reliable instruction length
|
|
|
|
* information if #VE occurred due to instruction execution, but not for EPT
|
|
|
|
* violations.
|
|
|
|
*/
|
|
|
|
static int ve_instr_len(struct ve_info *ve)
|
|
|
|
{
|
|
|
|
switch (ve->exit_reason) {
|
|
|
|
case EXIT_REASON_HLT:
|
|
|
|
case EXIT_REASON_MSR_READ:
|
|
|
|
case EXIT_REASON_MSR_WRITE:
|
|
|
|
case EXIT_REASON_CPUID:
|
|
|
|
case EXIT_REASON_IO_INSTRUCTION:
|
|
|
|
/* It is safe to use ve->instr_len for #VE due instructions */
|
|
|
|
return ve->instr_len;
|
|
|
|
case EXIT_REASON_EPT_VIOLATION:
|
|
|
|
/*
|
|
|
|
* For EPT violations, ve->insn_len is not defined. For those,
|
|
|
|
* the kernel must decode instructions manually and should not
|
|
|
|
* be using this function.
|
|
|
|
*/
|
|
|
|
WARN_ONCE(1, "ve->instr_len is not defined for EPT violations");
|
|
|
|
return 0;
|
|
|
|
default:
|
|
|
|
WARN_ONCE(1, "Unexpected #VE-type: %lld\n", ve->exit_reason);
|
|
|
|
return ve->instr_len;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-01-12 20:43:36 +01:00
|
|
|
static u64 __cpuidle __halt(const bool irq_disabled)
|
x86/tdx: Add HLT support for TDX guests
The HLT instruction is a privileged instruction, executing it stops
instruction execution and places the processor in a HALT state. It
is used in kernel for cases like reboot, idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).
To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].
In TDX guests, executing HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires STI instruction to
be executed just before the TDCALL. Since idle loop is the only user of
safe_halt() variant, handle it as a special case.
To avoid *safe_halt() call in the idle function, define the
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.
Alternative choices like PV ops have been considered for adding
safe_halt() support. But it was rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
safe_halt() use case is not worth the cost.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 02:29:17 +03:00
|
|
|
{
|
2023-08-15 23:02:03 +12:00
|
|
|
struct tdx_module_args args = {
|
x86/tdx: Add HLT support for TDX guests
The HLT instruction is a privileged instruction, executing it stops
instruction execution and places the processor in a HALT state. It
is used in kernel for cases like reboot, idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).
To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].
In TDX guests, executing HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires STI instruction to
be executed just before the TDCALL. Since idle loop is the only user of
safe_halt() variant, handle it as a special case.
To avoid *safe_halt() call in the idle function, define the
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.
Alternative choices like PV ops have been considered for adding
safe_halt() support. But it was rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
safe_halt() use case is not worth the cost.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 02:29:17 +03:00
|
|
|
.r10 = TDX_HYPERCALL_STANDARD,
|
|
|
|
.r11 = hcall_func(EXIT_REASON_HLT),
|
|
|
|
.r12 = irq_disabled,
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Emulate HLT operation via hypercall. More info about ABI
|
|
|
|
* can be found in TDX Guest-Host-Communication Interface
|
|
|
|
* (GHCI), section 3.8 TDG.VP.VMCALL<Instruction.HLT>.
|
|
|
|
*
|
|
|
|
* The VMM uses the "IRQ disabled" param to understand IRQ
|
|
|
|
* enabled status (RFLAGS.IF) of the TD guest and to determine
|
|
|
|
* whether or not it should schedule the halted vCPU if an
|
|
|
|
* IRQ becomes pending. E.g. if IRQs are disabled, the VMM
|
|
|
|
* can keep the vCPU in virtual HLT, even if an IRQ is
|
|
|
|
* pending, without hanging/breaking the guest.
|
|
|
|
*/
|
2023-03-21 03:35:11 +03:00
|
|
|
return __tdx_hypercall(&args);
|
x86/tdx: Add HLT support for TDX guests
The HLT instruction is a privileged instruction, executing it stops
instruction execution and places the processor in a HALT state. It
is used in kernel for cases like reboot, idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).
To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].
In TDX guests, executing HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires STI instruction to
be executed just before the TDCALL. Since idle loop is the only user of
safe_halt() variant, handle it as a special case.
To avoid *safe_halt() call in the idle function, define the
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.
Alternative choices like PV ops have been considered for adding
safe_halt() support. But it was rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
safe_halt() use case is not worth the cost.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 02:29:17 +03:00
|
|
|
}
|
|
|
|
|
2022-06-14 15:01:34 +03:00
|
|
|
static int handle_halt(struct ve_info *ve)
|
x86/tdx: Add HLT support for TDX guests
The HLT instruction is a privileged instruction, executing it stops
instruction execution and places the processor in a HALT state. It
is used in kernel for cases like reboot, idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).
To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].
In TDX guests, executing HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires STI instruction to
be executed just before the TDCALL. Since idle loop is the only user of
safe_halt() variant, handle it as a special case.
To avoid *safe_halt() call in the idle function, define the
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.
Alternative choices like PV ops have been considered for adding
safe_halt() support. But it was rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
safe_halt() use case is not worth the cost.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 02:29:17 +03:00
|
|
|
{
|
|
|
|
const bool irq_disabled = irqs_disabled();
|
|
|
|
|
2023-01-12 20:43:36 +01:00
|
|
|
if (__halt(irq_disabled))
|
2022-06-14 15:01:34 +03:00
|
|
|
return -EIO;
|
x86/tdx: Add HLT support for TDX guests
The HLT instruction is a privileged instruction, executing it stops
instruction execution and places the processor in a HALT state. It
is used in kernel for cases like reboot, idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).
To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].
In TDX guests, executing HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires STI instruction to
be executed just before the TDCALL. Since idle loop is the only user of
safe_halt() variant, handle it as a special case.
To avoid *safe_halt() call in the idle function, define the
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.
Alternative choices like PV ops have been considered for adding
safe_halt() support. But it was rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
safe_halt() use case is not worth the cost.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 02:29:17 +03:00
|
|
|
|
2022-06-14 15:01:34 +03:00
|
|
|
return ve_instr_len(ve);
|
x86/tdx: Add HLT support for TDX guests
The HLT instruction is a privileged instruction, executing it stops
instruction execution and places the processor in a HALT state. It
is used in kernel for cases like reboot, idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).
To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].
In TDX guests, executing HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires STI instruction to
be executed just before the TDCALL. Since idle loop is the only user of
safe_halt() variant, handle it as a special case.
To avoid *safe_halt() call in the idle function, define the
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.
Alternative choices like PV ops have been considered for adding
safe_halt() support. But it was rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
safe_halt() use case is not worth the cost.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 02:29:17 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void __cpuidle tdx_safe_halt(void)
|
|
|
|
{
|
|
|
|
const bool irq_disabled = false;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Use WARN_ONCE() to report the failure.
|
|
|
|
*/
|
2023-01-12 20:43:36 +01:00
|
|
|
if (__halt(irq_disabled))
|
x86/tdx: Add HLT support for TDX guests
The HLT instruction is a privileged instruction, executing it stops
instruction execution and places the processor in a HALT state. It
is used in kernel for cases like reboot, idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).
To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].
In TDX guests, executing HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires STI instruction to
be executed just before the TDCALL. Since idle loop is the only user of
safe_halt() variant, handle it as a special case.
To avoid *safe_halt() call in the idle function, define the
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.
Alternative choices like PV ops have been considered for adding
safe_halt() support. But it was rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
safe_halt() use case is not worth the cost.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 02:29:17 +03:00
|
|
|
WARN_ONCE(1, "HLT instruction emulation failed\n");
|
|
|
|
}
|
|
|
|
|
2022-06-14 15:01:34 +03:00
|
|
|
static int read_msr(struct pt_regs *regs, struct ve_info *ve)
|
2022-04-06 02:29:18 +03:00
|
|
|
{
|
2023-08-15 23:02:03 +12:00
|
|
|
struct tdx_module_args args = {
|
2022-04-06 02:29:18 +03:00
|
|
|
.r10 = TDX_HYPERCALL_STANDARD,
|
|
|
|
.r11 = hcall_func(EXIT_REASON_MSR_READ),
|
|
|
|
.r12 = regs->cx,
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Emulate the MSR read via hypercall. More info about ABI
|
|
|
|
* can be found in TDX Guest-Host-Communication Interface
|
|
|
|
* (GHCI), section titled "TDG.VP.VMCALL<Instruction.RDMSR>".
|
|
|
|
*/
|
x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALL
Now the 'struct tdx_hypercall_args' and 'struct tdx_module_args' are
almost the same, and the TDX_HYPERCALL and TDX_MODULE_CALL asm macro
share similar code pattern too. The __tdx_hypercall() and __tdcall()
should be unified to use the same assembly code.
As a preparation to unify them, simplify the TDX_HYPERCALL to make it
more like the TDX_MODULE_CALL.
The TDX_HYPERCALL takes the pointer of 'struct tdx_hypercall_args' as
function call argument, and does below extra things comparing to the
TDX_MODULE_CALL:
1) It sets RAX to 0 (TDG.VP.VMCALL leaf) internally;
2) It sets RCX to the (fixed) bitmap of shared registers internally;
3) It calls __tdx_hypercall_failed() internally (and panics) when the
TDCALL instruction itself fails;
4) After TDCALL, it moves R10 to RAX to return the return code of the
VMCALL leaf, regardless the '\ret' asm macro argument;
Firstly, change the TDX_HYPERCALL to take the same function call
arguments as the TDX_MODULE_CALL does: TDCALL leaf ID, and the pointer
to 'struct tdx_module_args'. Then 1) and 2) can be moved to the
caller:
- TDG.VP.VMCALL leaf ID can be passed via the function call argument;
- 'struct tdx_module_args' is 'struct tdx_hypercall_args' + RCX, thus
the bitmap of shared registers can be passed via RCX in the
structure.
Secondly, to move 3) and 4) out of assembly, make the TDX_HYPERCALL
always save output registers to the structure. The caller then can:
- Call __tdx_hypercall_failed() when TDX_HYPERCALL returns error;
- Return R10 in the structure as the return code of the VMCALL leaf;
With above changes, change the asm function from __tdx_hypercall() to
__tdcall_hypercall(), and reimplement __tdx_hypercall() as the C wrapper
of it. This avoids having to add another wrapper of __tdx_hypercall()
(_tdx_hypercall() is already taken).
The __tdcall_hypercall() will be replaced with a __tdcall() variant
using TDX_MODULE_CALL in a later commit as the final goal is to have one
assembly to handle both TDCALL and TDVMCALL.
Currently, the __tdx_hypercall() asm is in '.noinstr.text'. To keep
this unchanged, annotate __tdx_hypercall(), which is a C function now,
as 'noinstr'.
Remove the __tdx_hypercall_ret() as __tdx_hypercall() already does so.
Implement __tdx_hypercall() in tdx-shared.c so it can be shared with the
compressed code.
Opportunistically fix a checkpatch error complaining using space around
parenthesis '(' and ')' while moving the bitmap of shared registers to
<asm/shared/tdx.h>.
[ dhansen: quash new calls of __tdx_hypercall_ret() that showed up ]
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/0cbf25e7aee3256288045023a31f65f0cef90af4.1692096753.git.kai.huang%40intel.com
2023-08-15 23:02:01 +12:00
|
|
|
if (__tdx_hypercall(&args))
|
2022-06-14 15:01:34 +03:00
|
|
|
return -EIO;
|
2022-04-06 02:29:18 +03:00
|
|
|
|
|
|
|
regs->ax = lower_32_bits(args.r11);
|
|
|
|
regs->dx = upper_32_bits(args.r11);
|
2022-06-14 15:01:34 +03:00
|
|
|
return ve_instr_len(ve);
|
2022-04-06 02:29:18 +03:00
|
|
|
}
|
|
|
|
|
2022-06-14 15:01:34 +03:00
|
|
|
static int write_msr(struct pt_regs *regs, struct ve_info *ve)
|
2022-04-06 02:29:18 +03:00
|
|
|
{
|
2023-08-15 23:02:03 +12:00
|
|
|
struct tdx_module_args args = {
|
2022-04-06 02:29:18 +03:00
|
|
|
.r10 = TDX_HYPERCALL_STANDARD,
|
|
|
|
.r11 = hcall_func(EXIT_REASON_MSR_WRITE),
|
|
|
|
.r12 = regs->cx,
|
|
|
|
.r13 = (u64)regs->dx << 32 | regs->ax,
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Emulate the MSR write via hypercall. More info about ABI
|
|
|
|
* can be found in TDX Guest-Host-Communication Interface
|
|
|
|
* (GHCI) section titled "TDG.VP.VMCALL<Instruction.WRMSR>".
|
|
|
|
*/
|
2023-03-21 03:35:11 +03:00
|
|
|
if (__tdx_hypercall(&args))
|
2022-06-14 15:01:34 +03:00
|
|
|
return -EIO;
|
|
|
|
|
|
|
|
return ve_instr_len(ve);
|
2022-04-06 02:29:18 +03:00
|
|
|
}
|
|
|
|
|
2022-06-14 15:01:34 +03:00
|
|
|
static int handle_cpuid(struct pt_regs *regs, struct ve_info *ve)
|
2022-04-06 02:29:19 +03:00
|
|
|
{
|
2023-08-15 23:02:03 +12:00
|
|
|
struct tdx_module_args args = {
|
2022-04-06 02:29:19 +03:00
|
|
|
.r10 = TDX_HYPERCALL_STANDARD,
|
|
|
|
.r11 = hcall_func(EXIT_REASON_CPUID),
|
|
|
|
.r12 = regs->ax,
|
|
|
|
.r13 = regs->cx,
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Only allow VMM to control range reserved for hypervisor
|
|
|
|
* communication.
|
|
|
|
*
|
|
|
|
* Return all-zeros for any CPUID outside the range. It matches CPU
|
|
|
|
* behaviour for non-supported leaf.
|
|
|
|
*/
|
|
|
|
if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) {
|
|
|
|
regs->ax = regs->bx = regs->cx = regs->dx = 0;
|
2022-06-14 15:01:34 +03:00
|
|
|
return ve_instr_len(ve);
|
2022-04-06 02:29:19 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Emulate the CPUID instruction via a hypercall. More info about
|
|
|
|
* ABI can be found in TDX Guest-Host-Communication Interface
|
|
|
|
* (GHCI), section titled "VP.VMCALL<Instruction.CPUID>".
|
|
|
|
*/
|
x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALL
Now the 'struct tdx_hypercall_args' and 'struct tdx_module_args' are
almost the same, and the TDX_HYPERCALL and TDX_MODULE_CALL asm macro
share similar code pattern too. The __tdx_hypercall() and __tdcall()
should be unified to use the same assembly code.
As a preparation to unify them, simplify the TDX_HYPERCALL to make it
more like the TDX_MODULE_CALL.
The TDX_HYPERCALL takes the pointer of 'struct tdx_hypercall_args' as
function call argument, and does below extra things comparing to the
TDX_MODULE_CALL:
1) It sets RAX to 0 (TDG.VP.VMCALL leaf) internally;
2) It sets RCX to the (fixed) bitmap of shared registers internally;
3) It calls __tdx_hypercall_failed() internally (and panics) when the
TDCALL instruction itself fails;
4) After TDCALL, it moves R10 to RAX to return the return code of the
VMCALL leaf, regardless the '\ret' asm macro argument;
Firstly, change the TDX_HYPERCALL to take the same function call
arguments as the TDX_MODULE_CALL does: TDCALL leaf ID, and the pointer
to 'struct tdx_module_args'. Then 1) and 2) can be moved to the
caller:
- TDG.VP.VMCALL leaf ID can be passed via the function call argument;
- 'struct tdx_module_args' is 'struct tdx_hypercall_args' + RCX, thus
the bitmap of shared registers can be passed via RCX in the
structure.
Secondly, to move 3) and 4) out of assembly, make the TDX_HYPERCALL
always save output registers to the structure. The caller then can:
- Call __tdx_hypercall_failed() when TDX_HYPERCALL returns error;
- Return R10 in the structure as the return code of the VMCALL leaf;
With above changes, change the asm function from __tdx_hypercall() to
__tdcall_hypercall(), and reimplement __tdx_hypercall() as the C wrapper
of it. This avoids having to add another wrapper of __tdx_hypercall()
(_tdx_hypercall() is already taken).
The __tdcall_hypercall() will be replaced with a __tdcall() variant
using TDX_MODULE_CALL in a later commit as the final goal is to have one
assembly to handle both TDCALL and TDVMCALL.
Currently, the __tdx_hypercall() asm is in '.noinstr.text'. To keep
this unchanged, annotate __tdx_hypercall(), which is a C function now,
as 'noinstr'.
Remove the __tdx_hypercall_ret() as __tdx_hypercall() already does so.
Implement __tdx_hypercall() in tdx-shared.c so it can be shared with the
compressed code.
Opportunistically fix a checkpatch error complaining using space around
parenthesis '(' and ')' while moving the bitmap of shared registers to
<asm/shared/tdx.h>.
[ dhansen: quash new calls of __tdx_hypercall_ret() that showed up ]
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/0cbf25e7aee3256288045023a31f65f0cef90af4.1692096753.git.kai.huang%40intel.com
2023-08-15 23:02:01 +12:00
|
|
|
if (__tdx_hypercall(&args))
|
2022-06-14 15:01:34 +03:00
|
|
|
return -EIO;
|
2022-04-06 02:29:19 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of
|
|
|
|
* EAX, EBX, ECX, EDX registers after the CPUID instruction execution.
|
|
|
|
* So copy the register contents back to pt_regs.
|
|
|
|
*/
|
|
|
|
regs->ax = args.r12;
|
|
|
|
regs->bx = args.r13;
|
|
|
|
regs->cx = args.r14;
|
|
|
|
regs->dx = args.r15;
|
|
|
|
|
2022-06-14 15:01:34 +03:00
|
|
|
return ve_instr_len(ve);
|
2022-04-06 02:29:19 +03:00
|
|
|
}
|
|
|
|
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
static bool mmio_read(int size, unsigned long addr, unsigned long *val)
|
|
|
|
{
|
2023-08-15 23:02:03 +12:00
|
|
|
struct tdx_module_args args = {
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
.r10 = TDX_HYPERCALL_STANDARD,
|
|
|
|
.r11 = hcall_func(EXIT_REASON_EPT_VIOLATION),
|
|
|
|
.r12 = size,
|
|
|
|
.r13 = EPT_READ,
|
|
|
|
.r14 = addr,
|
|
|
|
.r15 = *val,
|
|
|
|
};
|
|
|
|
|
x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALL
Now the 'struct tdx_hypercall_args' and 'struct tdx_module_args' are
almost the same, and the TDX_HYPERCALL and TDX_MODULE_CALL asm macro
share similar code pattern too. The __tdx_hypercall() and __tdcall()
should be unified to use the same assembly code.
As a preparation to unify them, simplify the TDX_HYPERCALL to make it
more like the TDX_MODULE_CALL.
The TDX_HYPERCALL takes the pointer of 'struct tdx_hypercall_args' as
function call argument, and does below extra things comparing to the
TDX_MODULE_CALL:
1) It sets RAX to 0 (TDG.VP.VMCALL leaf) internally;
2) It sets RCX to the (fixed) bitmap of shared registers internally;
3) It calls __tdx_hypercall_failed() internally (and panics) when the
TDCALL instruction itself fails;
4) After TDCALL, it moves R10 to RAX to return the return code of the
VMCALL leaf, regardless the '\ret' asm macro argument;
Firstly, change the TDX_HYPERCALL to take the same function call
arguments as the TDX_MODULE_CALL does: TDCALL leaf ID, and the pointer
to 'struct tdx_module_args'. Then 1) and 2) can be moved to the
caller:
- TDG.VP.VMCALL leaf ID can be passed via the function call argument;
- 'struct tdx_module_args' is 'struct tdx_hypercall_args' + RCX, thus
the bitmap of shared registers can be passed via RCX in the
structure.
Secondly, to move 3) and 4) out of assembly, make the TDX_HYPERCALL
always save output registers to the structure. The caller then can:
- Call __tdx_hypercall_failed() when TDX_HYPERCALL returns error;
- Return R10 in the structure as the return code of the VMCALL leaf;
With above changes, change the asm function from __tdx_hypercall() to
__tdcall_hypercall(), and reimplement __tdx_hypercall() as the C wrapper
of it. This avoids having to add another wrapper of __tdx_hypercall()
(_tdx_hypercall() is already taken).
The __tdcall_hypercall() will be replaced with a __tdcall() variant
using TDX_MODULE_CALL in a later commit as the final goal is to have one
assembly to handle both TDCALL and TDVMCALL.
Currently, the __tdx_hypercall() asm is in '.noinstr.text'. To keep
this unchanged, annotate __tdx_hypercall(), which is a C function now,
as 'noinstr'.
Remove the __tdx_hypercall_ret() as __tdx_hypercall() already does so.
Implement __tdx_hypercall() in tdx-shared.c so it can be shared with the
compressed code.
Opportunistically fix a checkpatch error complaining using space around
parenthesis '(' and ')' while moving the bitmap of shared registers to
<asm/shared/tdx.h>.
[ dhansen: quash new calls of __tdx_hypercall_ret() that showed up ]
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/0cbf25e7aee3256288045023a31f65f0cef90af4.1692096753.git.kai.huang%40intel.com
2023-08-15 23:02:01 +12:00
|
|
|
if (__tdx_hypercall(&args))
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
return false;
|
x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALL
Now the 'struct tdx_hypercall_args' and 'struct tdx_module_args' are
almost the same, and the TDX_HYPERCALL and TDX_MODULE_CALL asm macro
share similar code pattern too. The __tdx_hypercall() and __tdcall()
should be unified to use the same assembly code.
As a preparation to unify them, simplify the TDX_HYPERCALL to make it
more like the TDX_MODULE_CALL.
The TDX_HYPERCALL takes the pointer of 'struct tdx_hypercall_args' as
function call argument, and does below extra things comparing to the
TDX_MODULE_CALL:
1) It sets RAX to 0 (TDG.VP.VMCALL leaf) internally;
2) It sets RCX to the (fixed) bitmap of shared registers internally;
3) It calls __tdx_hypercall_failed() internally (and panics) when the
TDCALL instruction itself fails;
4) After TDCALL, it moves R10 to RAX to return the return code of the
VMCALL leaf, regardless the '\ret' asm macro argument;
Firstly, change the TDX_HYPERCALL to take the same function call
arguments as the TDX_MODULE_CALL does: TDCALL leaf ID, and the pointer
to 'struct tdx_module_args'. Then 1) and 2) can be moved to the
caller:
- TDG.VP.VMCALL leaf ID can be passed via the function call argument;
- 'struct tdx_module_args' is 'struct tdx_hypercall_args' + RCX, thus
the bitmap of shared registers can be passed via RCX in the
structure.
Secondly, to move 3) and 4) out of assembly, make the TDX_HYPERCALL
always save output registers to the structure. The caller then can:
- Call __tdx_hypercall_failed() when TDX_HYPERCALL returns error;
- Return R10 in the structure as the return code of the VMCALL leaf;
With above changes, change the asm function from __tdx_hypercall() to
__tdcall_hypercall(), and reimplement __tdx_hypercall() as the C wrapper
of it. This avoids having to add another wrapper of __tdx_hypercall()
(_tdx_hypercall() is already taken).
The __tdcall_hypercall() will be replaced with a __tdcall() variant
using TDX_MODULE_CALL in a later commit as the final goal is to have one
assembly to handle both TDCALL and TDVMCALL.
Currently, the __tdx_hypercall() asm is in '.noinstr.text'. To keep
this unchanged, annotate __tdx_hypercall(), which is a C function now,
as 'noinstr'.
Remove the __tdx_hypercall_ret() as __tdx_hypercall() already does so.
Implement __tdx_hypercall() in tdx-shared.c so it can be shared with the
compressed code.
Opportunistically fix a checkpatch error complaining using space around
parenthesis '(' and ')' while moving the bitmap of shared registers to
<asm/shared/tdx.h>.
[ dhansen: quash new calls of __tdx_hypercall_ret() that showed up ]
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/0cbf25e7aee3256288045023a31f65f0cef90af4.1692096753.git.kai.huang%40intel.com
2023-08-15 23:02:01 +12:00
|
|
|
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
*val = args.r11;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool mmio_write(int size, unsigned long addr, unsigned long val)
|
|
|
|
{
|
|
|
|
return !_tdx_hypercall(hcall_func(EXIT_REASON_EPT_VIOLATION), size,
|
|
|
|
EPT_WRITE, addr, val);
|
|
|
|
}
|
|
|
|
|
2022-06-14 15:01:34 +03:00
|
|
|
static int handle_mmio(struct pt_regs *regs, struct ve_info *ve)
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
{
|
2022-06-14 15:01:35 +03:00
|
|
|
unsigned long *reg, val, vaddr;
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
char buffer[MAX_INSN_SIZE];
|
2023-01-01 17:29:04 +01:00
|
|
|
enum insn_mmio_type mmio;
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
struct insn insn = {};
|
|
|
|
int size, extend_size;
|
|
|
|
u8 extend_val = 0;
|
|
|
|
|
|
|
|
/* Only in-kernel MMIO is supported */
|
|
|
|
if (WARN_ON_ONCE(user_mode(regs)))
|
2022-06-14 15:01:34 +03:00
|
|
|
return -EFAULT;
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
|
|
|
|
if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE))
|
2022-06-14 15:01:34 +03:00
|
|
|
return -EFAULT;
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
|
|
|
|
if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64))
|
2022-06-14 15:01:34 +03:00
|
|
|
return -EINVAL;
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
|
|
|
|
mmio = insn_decode_mmio(&insn, &size);
|
2023-01-01 17:29:04 +01:00
|
|
|
if (WARN_ON_ONCE(mmio == INSN_MMIO_DECODE_FAILED))
|
2022-06-14 15:01:34 +03:00
|
|
|
return -EINVAL;
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
|
2023-01-01 17:29:04 +01:00
|
|
|
if (mmio != INSN_MMIO_WRITE_IMM && mmio != INSN_MMIO_MOVS) {
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
reg = insn_get_modrm_reg_ptr(&insn, regs);
|
|
|
|
if (!reg)
|
2022-06-14 15:01:34 +03:00
|
|
|
return -EINVAL;
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
}
|
|
|
|
|
2022-06-14 15:01:35 +03:00
|
|
|
/*
|
|
|
|
* Reject EPT violation #VEs that split pages.
|
|
|
|
*
|
|
|
|
* MMIO accesses are supposed to be naturally aligned and therefore
|
|
|
|
* never cross page boundaries. Seeing split page accesses indicates
|
|
|
|
* a bug or a load_unaligned_zeropad() that stepped into an MMIO page.
|
|
|
|
*
|
|
|
|
* load_unaligned_zeropad() will recover using exception fixups.
|
|
|
|
*/
|
|
|
|
vaddr = (unsigned long)insn_get_addr_ref(&insn, regs);
|
|
|
|
if (vaddr / PAGE_SIZE != (vaddr + size - 1) / PAGE_SIZE)
|
|
|
|
return -EFAULT;
|
|
|
|
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
/* Handle writes first */
|
|
|
|
switch (mmio) {
|
2023-01-01 17:29:04 +01:00
|
|
|
case INSN_MMIO_WRITE:
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
memcpy(&val, reg, size);
|
2022-06-14 15:01:34 +03:00
|
|
|
if (!mmio_write(size, ve->gpa, val))
|
|
|
|
return -EIO;
|
|
|
|
return insn.length;
|
2023-01-01 17:29:04 +01:00
|
|
|
case INSN_MMIO_WRITE_IMM:
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
val = insn.immediate.value;
|
2022-06-14 15:01:34 +03:00
|
|
|
if (!mmio_write(size, ve->gpa, val))
|
|
|
|
return -EIO;
|
|
|
|
return insn.length;
|
2023-01-01 17:29:04 +01:00
|
|
|
case INSN_MMIO_READ:
|
|
|
|
case INSN_MMIO_READ_ZERO_EXTEND:
|
|
|
|
case INSN_MMIO_READ_SIGN_EXTEND:
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
/* Reads are handled below */
|
|
|
|
break;
|
2023-01-01 17:29:04 +01:00
|
|
|
case INSN_MMIO_MOVS:
|
|
|
|
case INSN_MMIO_DECODE_FAILED:
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
/*
|
|
|
|
* MMIO was accessed with an instruction that could not be
|
|
|
|
* decoded or handled properly. It was likely not using io.h
|
|
|
|
* helpers or accessed MMIO accidentally.
|
|
|
|
*/
|
2022-06-14 15:01:34 +03:00
|
|
|
return -EINVAL;
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
default:
|
|
|
|
WARN_ONCE(1, "Unknown insn_decode_mmio() decode value?");
|
2022-06-14 15:01:34 +03:00
|
|
|
return -EINVAL;
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Handle reads */
|
|
|
|
if (!mmio_read(size, ve->gpa, &val))
|
2022-06-14 15:01:34 +03:00
|
|
|
return -EIO;
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
|
|
|
|
switch (mmio) {
|
2023-01-01 17:29:04 +01:00
|
|
|
case INSN_MMIO_READ:
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
/* Zero-extend for 32-bit operation */
|
|
|
|
extend_size = size == 4 ? sizeof(*reg) : 0;
|
|
|
|
break;
|
2023-01-01 17:29:04 +01:00
|
|
|
case INSN_MMIO_READ_ZERO_EXTEND:
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
/* Zero extend based on operand size */
|
|
|
|
extend_size = insn.opnd_bytes;
|
|
|
|
break;
|
2023-01-01 17:29:04 +01:00
|
|
|
case INSN_MMIO_READ_SIGN_EXTEND:
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
/* Sign extend based on operand size */
|
|
|
|
extend_size = insn.opnd_bytes;
|
|
|
|
if (size == 1 && val & BIT(7))
|
|
|
|
extend_val = 0xFF;
|
|
|
|
else if (size > 1 && val & BIT(15))
|
|
|
|
extend_val = 0xFF;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
/* All other cases has to be covered with the first switch() */
|
|
|
|
WARN_ON_ONCE(1);
|
2022-06-14 15:01:34 +03:00
|
|
|
return -EINVAL;
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (extend_size)
|
|
|
|
memset(reg, extend_val, extend_size);
|
|
|
|
memcpy(reg, &val, size);
|
2022-06-14 15:01:34 +03:00
|
|
|
return insn.length;
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
}
|
|
|
|
|
2022-04-06 02:29:26 +03:00
|
|
|
static bool handle_in(struct pt_regs *regs, int size, int port)
|
|
|
|
{
|
2023-08-15 23:02:03 +12:00
|
|
|
struct tdx_module_args args = {
|
2022-04-06 02:29:26 +03:00
|
|
|
.r10 = TDX_HYPERCALL_STANDARD,
|
|
|
|
.r11 = hcall_func(EXIT_REASON_IO_INSTRUCTION),
|
|
|
|
.r12 = size,
|
|
|
|
.r13 = PORT_READ,
|
|
|
|
.r14 = port,
|
|
|
|
};
|
|
|
|
u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
|
|
|
|
bool success;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Emulate the I/O read via hypercall. More info about ABI can be found
|
|
|
|
* in TDX Guest-Host-Communication Interface (GHCI) section titled
|
|
|
|
* "TDG.VP.VMCALL<Instruction.IO>".
|
|
|
|
*/
|
x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALL
Now the 'struct tdx_hypercall_args' and 'struct tdx_module_args' are
almost the same, and the TDX_HYPERCALL and TDX_MODULE_CALL asm macro
share similar code pattern too. The __tdx_hypercall() and __tdcall()
should be unified to use the same assembly code.
As a preparation to unify them, simplify the TDX_HYPERCALL to make it
more like the TDX_MODULE_CALL.
The TDX_HYPERCALL takes the pointer of 'struct tdx_hypercall_args' as
function call argument, and does below extra things comparing to the
TDX_MODULE_CALL:
1) It sets RAX to 0 (TDG.VP.VMCALL leaf) internally;
2) It sets RCX to the (fixed) bitmap of shared registers internally;
3) It calls __tdx_hypercall_failed() internally (and panics) when the
TDCALL instruction itself fails;
4) After TDCALL, it moves R10 to RAX to return the return code of the
VMCALL leaf, regardless the '\ret' asm macro argument;
Firstly, change the TDX_HYPERCALL to take the same function call
arguments as the TDX_MODULE_CALL does: TDCALL leaf ID, and the pointer
to 'struct tdx_module_args'. Then 1) and 2) can be moved to the
caller:
- TDG.VP.VMCALL leaf ID can be passed via the function call argument;
- 'struct tdx_module_args' is 'struct tdx_hypercall_args' + RCX, thus
the bitmap of shared registers can be passed via RCX in the
structure.
Secondly, to move 3) and 4) out of assembly, make the TDX_HYPERCALL
always save output registers to the structure. The caller then can:
- Call __tdx_hypercall_failed() when TDX_HYPERCALL returns error;
- Return R10 in the structure as the return code of the VMCALL leaf;
With above changes, change the asm function from __tdx_hypercall() to
__tdcall_hypercall(), and reimplement __tdx_hypercall() as the C wrapper
of it. This avoids having to add another wrapper of __tdx_hypercall()
(_tdx_hypercall() is already taken).
The __tdcall_hypercall() will be replaced with a __tdcall() variant
using TDX_MODULE_CALL in a later commit as the final goal is to have one
assembly to handle both TDCALL and TDVMCALL.
Currently, the __tdx_hypercall() asm is in '.noinstr.text'. To keep
this unchanged, annotate __tdx_hypercall(), which is a C function now,
as 'noinstr'.
Remove the __tdx_hypercall_ret() as __tdx_hypercall() already does so.
Implement __tdx_hypercall() in tdx-shared.c so it can be shared with the
compressed code.
Opportunistically fix a checkpatch error complaining using space around
parenthesis '(' and ')' while moving the bitmap of shared registers to
<asm/shared/tdx.h>.
[ dhansen: quash new calls of __tdx_hypercall_ret() that showed up ]
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/0cbf25e7aee3256288045023a31f65f0cef90af4.1692096753.git.kai.huang%40intel.com
2023-08-15 23:02:01 +12:00
|
|
|
success = !__tdx_hypercall(&args);
|
2022-04-06 02:29:26 +03:00
|
|
|
|
|
|
|
/* Update part of the register affected by the emulated instruction */
|
|
|
|
regs->ax &= ~mask;
|
|
|
|
if (success)
|
|
|
|
regs->ax |= args.r11 & mask;
|
|
|
|
|
|
|
|
return success;
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool handle_out(struct pt_regs *regs, int size, int port)
|
|
|
|
{
|
|
|
|
u64 mask = GENMASK(BITS_PER_BYTE * size, 0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Emulate the I/O write via hypercall. More info about ABI can be found
|
|
|
|
* in TDX Guest-Host-Communication Interface (GHCI) section titled
|
|
|
|
* "TDG.VP.VMCALL<Instruction.IO>".
|
|
|
|
*/
|
|
|
|
return !_tdx_hypercall(hcall_func(EXIT_REASON_IO_INSTRUCTION), size,
|
|
|
|
PORT_WRITE, port, regs->ax & mask);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Emulate I/O using hypercall.
|
|
|
|
*
|
|
|
|
* Assumes the IO instruction was using ax, which is enforced
|
|
|
|
* by the standard io.h macros.
|
|
|
|
*
|
|
|
|
* Return True on success or False on failure.
|
|
|
|
*/
|
2022-06-14 15:01:34 +03:00
|
|
|
static int handle_io(struct pt_regs *regs, struct ve_info *ve)
|
2022-04-06 02:29:26 +03:00
|
|
|
{
|
2022-06-14 15:01:34 +03:00
|
|
|
u32 exit_qual = ve->exit_qual;
|
2022-04-06 02:29:26 +03:00
|
|
|
int size, port;
|
2022-06-14 15:01:34 +03:00
|
|
|
bool in, ret;
|
2022-04-06 02:29:26 +03:00
|
|
|
|
|
|
|
if (VE_IS_IO_STRING(exit_qual))
|
2022-06-14 15:01:34 +03:00
|
|
|
return -EIO;
|
2022-04-06 02:29:26 +03:00
|
|
|
|
|
|
|
in = VE_IS_IO_IN(exit_qual);
|
|
|
|
size = VE_GET_IO_SIZE(exit_qual);
|
|
|
|
port = VE_GET_PORT_NUM(exit_qual);
|
|
|
|
|
|
|
|
|
|
|
|
if (in)
|
2022-06-14 15:01:34 +03:00
|
|
|
ret = handle_in(regs, size, port);
|
2022-04-06 02:29:26 +03:00
|
|
|
else
|
2022-06-14 15:01:34 +03:00
|
|
|
ret = handle_out(regs, size, port);
|
|
|
|
if (!ret)
|
|
|
|
return -EIO;
|
|
|
|
|
|
|
|
return ve_instr_len(ve);
|
2022-04-06 02:29:26 +03:00
|
|
|
}
|
|
|
|
|
2022-04-06 02:29:27 +03:00
|
|
|
/*
|
|
|
|
* Early #VE exception handler. Only handles a subset of port I/O.
|
|
|
|
* Intended only for earlyprintk. If failed, return false.
|
|
|
|
*/
|
|
|
|
__init bool tdx_early_handle_ve(struct pt_regs *regs)
|
|
|
|
{
|
|
|
|
struct ve_info ve;
|
2022-06-14 15:01:34 +03:00
|
|
|
int insn_len;
|
2022-04-06 02:29:27 +03:00
|
|
|
|
|
|
|
tdx_get_ve_info(&ve);
|
|
|
|
|
|
|
|
if (ve.exit_reason != EXIT_REASON_IO_INSTRUCTION)
|
|
|
|
return false;
|
|
|
|
|
2022-06-14 15:01:34 +03:00
|
|
|
insn_len = handle_io(regs, &ve);
|
|
|
|
if (insn_len < 0)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
regs->ip += insn_len;
|
|
|
|
return true;
|
2022-04-06 02:29:27 +03:00
|
|
|
}
|
|
|
|
|
x86/traps: Add #VE support for TDX guest
Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:
* Specific instructions (WBINVD, for example)
* Specific MSR accesses
* Specific CPUID leaf accesses
* Access to specific guest physical addresses
Syscall entry code has a critical window where the kernel stack is not
yet set up. Any exception in this window leads to hard to debug issues
and can be exploited for privilege escalation. Exceptions in the NMI
entry code also cause issues. Returning from the exception handler with
IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
For these reasons, the kernel avoids #VEs during the syscall gap and
the NMI entry code. Entry code paths do not access TD-shared memory,
MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
that might generate #VE. VMM can remove memory from TD at any point,
but access to unaccepted (or missing) private memory leads to VM
termination, not to #VE.
Similarly to page faults and breakpoints, #VEs are allowed in NMI
handlers once the kernel is ready to deal with nested NMIs.
During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
the VE info.
TDGETVEINFO retrieves the #VE info from the TDX module, which also
clears the "#VE valid" flag. This must be done before anything else as
any #VE that occurs while the valid flag is set escalates to #DF by TDX
module. It will result in an oops.
Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be
delivered until TDGETVEINFO is called.
For now, convert unhandled #VE's (everything, until later in this
series) so that they appear just like a #GP by calling the
ve_raise_fault() directly. The ve_raise_fault() function is similar
to #GP handler and is responsible for sending SIGSEGV to userspace
and CPU die and notifying debuggers and other die chain users.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-8-kirill.shutemov@linux.intel.com
2022-04-06 02:29:16 +03:00
|
|
|
void tdx_get_ve_info(struct ve_info *ve)
|
|
|
|
{
|
x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure
Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and
SEAMCALL, takes one parameter for each input register and an optional
'struct tdx_module_output' (a collection of output registers) as output.
This is different from the TDX_HYPERCALL macro which uses a single
'struct tdx_hypercall_args' to carry all input/output registers.
The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more
input/output registers. Also, the TDH.VP.ENTER (which isn't covered
by the current TDX_MODULE_CALL macro) basically can use all registers
that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't
extendible to cover those cases.
Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro
to use a single structure 'struct tdx_module_args' to carry all the
input/output registers. Currently, R10/R11 are only used as output
register but not as input by any TDCALL/SEAMCALL. Change to also use
R10/R11 as input register to make input/output registers symmetric.
Currently, the TDX_MODULE_CALL macro depends on the caller to pass a
non-NULL 'struct tdx_module_output' to get additional output registers.
Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to
take a new 'ret' macro argument to indicate whether to save the output
registers to the 'struct tdx_module_args'. Also introduce a new
__tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret().
Note the tdcall(), which is a wrapper of __tdcall(), is called by three
callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init().
The former two need the additional output but the last one doesn't. For
simplicity, make tdcall() always call __tdcall_ret() to avoid another
"_ret()" wrapper. The last caller tdx_early_init() isn't performance
critical anyway.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com
2023-08-15 23:01:59 +12:00
|
|
|
struct tdx_module_args args = {};
|
x86/traps: Add #VE support for TDX guest
Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:
* Specific instructions (WBINVD, for example)
* Specific MSR accesses
* Specific CPUID leaf accesses
* Access to specific guest physical addresses
Syscall entry code has a critical window where the kernel stack is not
yet set up. Any exception in this window leads to hard to debug issues
and can be exploited for privilege escalation. Exceptions in the NMI
entry code also cause issues. Returning from the exception handler with
IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
For these reasons, the kernel avoids #VEs during the syscall gap and
the NMI entry code. Entry code paths do not access TD-shared memory,
MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
that might generate #VE. VMM can remove memory from TD at any point,
but access to unaccepted (or missing) private memory leads to VM
termination, not to #VE.
Similarly to page faults and breakpoints, #VEs are allowed in NMI
handlers once the kernel is ready to deal with nested NMIs.
During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
the VE info.
TDGETVEINFO retrieves the #VE info from the TDX module, which also
clears the "#VE valid" flag. This must be done before anything else as
any #VE that occurs while the valid flag is set escalates to #DF by TDX
module. It will result in an oops.
Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be
delivered until TDGETVEINFO is called.
For now, convert unhandled #VE's (everything, until later in this
series) so that they appear just like a #GP by calling the
ve_raise_fault() directly. The ve_raise_fault() function is similar
to #GP handler and is responsible for sending SIGSEGV to userspace
and CPU die and notifying debuggers and other die chain users.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-8-kirill.shutemov@linux.intel.com
2022-04-06 02:29:16 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Called during #VE handling to retrieve the #VE info from the
|
|
|
|
* TDX module.
|
|
|
|
*
|
|
|
|
* This has to be called early in #VE handling. A "nested" #VE which
|
|
|
|
* occurs before this will raise a #DF and is not recoverable.
|
|
|
|
*
|
|
|
|
* The call retrieves the #VE info from the TDX module, which also
|
|
|
|
* clears the "#VE valid" flag. This must be done before anything else
|
|
|
|
* because any #VE that occurs while the valid flag is set will lead to
|
|
|
|
* #DF.
|
|
|
|
*
|
|
|
|
* Note, the TDX module treats virtual NMIs as inhibited if the #VE
|
|
|
|
* valid flag is set. It means that NMI=>#VE will not result in a #DF.
|
|
|
|
*/
|
x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure
Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and
SEAMCALL, takes one parameter for each input register and an optional
'struct tdx_module_output' (a collection of output registers) as output.
This is different from the TDX_HYPERCALL macro which uses a single
'struct tdx_hypercall_args' to carry all input/output registers.
The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more
input/output registers. Also, the TDH.VP.ENTER (which isn't covered
by the current TDX_MODULE_CALL macro) basically can use all registers
that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't
extendible to cover those cases.
Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro
to use a single structure 'struct tdx_module_args' to carry all the
input/output registers. Currently, R10/R11 are only used as output
register but not as input by any TDCALL/SEAMCALL. Change to also use
R10/R11 as input register to make input/output registers symmetric.
Currently, the TDX_MODULE_CALL macro depends on the caller to pass a
non-NULL 'struct tdx_module_output' to get additional output registers.
Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to
take a new 'ret' macro argument to indicate whether to save the output
registers to the 'struct tdx_module_args'. Also introduce a new
__tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret().
Note the tdcall(), which is a wrapper of __tdcall(), is called by three
callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init().
The former two need the additional output but the last one doesn't. For
simplicity, make tdcall() always call __tdcall_ret() to avoid another
"_ret()" wrapper. The last caller tdx_early_init() isn't performance
critical anyway.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com
2023-08-15 23:01:59 +12:00
|
|
|
tdcall(TDG_VP_VEINFO_GET, &args);
|
x86/traps: Add #VE support for TDX guest
Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:
* Specific instructions (WBINVD, for example)
* Specific MSR accesses
* Specific CPUID leaf accesses
* Access to specific guest physical addresses
Syscall entry code has a critical window where the kernel stack is not
yet set up. Any exception in this window leads to hard to debug issues
and can be exploited for privilege escalation. Exceptions in the NMI
entry code also cause issues. Returning from the exception handler with
IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
For these reasons, the kernel avoids #VEs during the syscall gap and
the NMI entry code. Entry code paths do not access TD-shared memory,
MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
that might generate #VE. VMM can remove memory from TD at any point,
but access to unaccepted (or missing) private memory leads to VM
termination, not to #VE.
Similarly to page faults and breakpoints, #VEs are allowed in NMI
handlers once the kernel is ready to deal with nested NMIs.
During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
the VE info.
TDGETVEINFO retrieves the #VE info from the TDX module, which also
clears the "#VE valid" flag. This must be done before anything else as
any #VE that occurs while the valid flag is set escalates to #DF by TDX
module. It will result in an oops.
Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be
delivered until TDGETVEINFO is called.
For now, convert unhandled #VE's (everything, until later in this
series) so that they appear just like a #GP by calling the
ve_raise_fault() directly. The ve_raise_fault() function is similar
to #GP handler and is responsible for sending SIGSEGV to userspace
and CPU die and notifying debuggers and other die chain users.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-8-kirill.shutemov@linux.intel.com
2022-04-06 02:29:16 +03:00
|
|
|
|
|
|
|
/* Transfer the output parameters */
|
x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure
Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and
SEAMCALL, takes one parameter for each input register and an optional
'struct tdx_module_output' (a collection of output registers) as output.
This is different from the TDX_HYPERCALL macro which uses a single
'struct tdx_hypercall_args' to carry all input/output registers.
The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more
input/output registers. Also, the TDH.VP.ENTER (which isn't covered
by the current TDX_MODULE_CALL macro) basically can use all registers
that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't
extendible to cover those cases.
Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro
to use a single structure 'struct tdx_module_args' to carry all the
input/output registers. Currently, R10/R11 are only used as output
register but not as input by any TDCALL/SEAMCALL. Change to also use
R10/R11 as input register to make input/output registers symmetric.
Currently, the TDX_MODULE_CALL macro depends on the caller to pass a
non-NULL 'struct tdx_module_output' to get additional output registers.
Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to
take a new 'ret' macro argument to indicate whether to save the output
registers to the 'struct tdx_module_args'. Also introduce a new
__tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret().
Note the tdcall(), which is a wrapper of __tdcall(), is called by three
callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init().
The former two need the additional output but the last one doesn't. For
simplicity, make tdcall() always call __tdcall_ret() to avoid another
"_ret()" wrapper. The last caller tdx_early_init() isn't performance
critical anyway.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com
2023-08-15 23:01:59 +12:00
|
|
|
ve->exit_reason = args.rcx;
|
|
|
|
ve->exit_qual = args.rdx;
|
|
|
|
ve->gla = args.r8;
|
|
|
|
ve->gpa = args.r9;
|
|
|
|
ve->instr_len = lower_32_bits(args.r10);
|
|
|
|
ve->instr_info = upper_32_bits(args.r10);
|
x86/traps: Add #VE support for TDX guest
Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:
* Specific instructions (WBINVD, for example)
* Specific MSR accesses
* Specific CPUID leaf accesses
* Access to specific guest physical addresses
Syscall entry code has a critical window where the kernel stack is not
yet set up. Any exception in this window leads to hard to debug issues
and can be exploited for privilege escalation. Exceptions in the NMI
entry code also cause issues. Returning from the exception handler with
IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
For these reasons, the kernel avoids #VEs during the syscall gap and
the NMI entry code. Entry code paths do not access TD-shared memory,
MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
that might generate #VE. VMM can remove memory from TD at any point,
but access to unaccepted (or missing) private memory leads to VM
termination, not to #VE.
Similarly to page faults and breakpoints, #VEs are allowed in NMI
handlers once the kernel is ready to deal with nested NMIs.
During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
the VE info.
TDGETVEINFO retrieves the #VE info from the TDX module, which also
clears the "#VE valid" flag. This must be done before anything else as
any #VE that occurs while the valid flag is set escalates to #DF by TDX
module. It will result in an oops.
Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be
delivered until TDGETVEINFO is called.
For now, convert unhandled #VE's (everything, until later in this
series) so that they appear just like a #GP by calling the
ve_raise_fault() directly. The ve_raise_fault() function is similar
to #GP handler and is responsible for sending SIGSEGV to userspace
and CPU die and notifying debuggers and other die chain users.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-8-kirill.shutemov@linux.intel.com
2022-04-06 02:29:16 +03:00
|
|
|
}
|
|
|
|
|
2022-06-14 15:01:34 +03:00
|
|
|
/*
|
|
|
|
* Handle the user initiated #VE.
|
|
|
|
*
|
|
|
|
* On success, returns the number of bytes RIP should be incremented (>=0)
|
|
|
|
* or -errno on error.
|
|
|
|
*/
|
|
|
|
static int virt_exception_user(struct pt_regs *regs, struct ve_info *ve)
|
2022-04-06 02:29:19 +03:00
|
|
|
{
|
|
|
|
switch (ve->exit_reason) {
|
|
|
|
case EXIT_REASON_CPUID:
|
2022-06-14 15:01:34 +03:00
|
|
|
return handle_cpuid(regs, ve);
|
2022-04-06 02:29:19 +03:00
|
|
|
default:
|
|
|
|
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
|
2022-06-14 15:01:34 +03:00
|
|
|
return -EIO;
|
2022-04-06 02:29:19 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-01-27 01:11:58 +03:00
|
|
|
static inline bool is_private_gpa(u64 gpa)
|
|
|
|
{
|
|
|
|
return gpa == cc_mkenc(gpa);
|
|
|
|
}
|
|
|
|
|
2022-06-14 15:01:34 +03:00
|
|
|
/*
|
|
|
|
* Handle the kernel #VE.
|
|
|
|
*
|
|
|
|
* On success, returns the number of bytes RIP should be incremented (>=0)
|
|
|
|
* or -errno on error.
|
|
|
|
*/
|
|
|
|
static int virt_exception_kernel(struct pt_regs *regs, struct ve_info *ve)
|
x86/tdx: Add HLT support for TDX guests
The HLT instruction is a privileged instruction, executing it stops
instruction execution and places the processor in a HALT state. It
is used in kernel for cases like reboot, idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).
To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].
In TDX guests, executing HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires STI instruction to
be executed just before the TDCALL. Since idle loop is the only user of
safe_halt() variant, handle it as a special case.
To avoid *safe_halt() call in the idle function, define the
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.
Alternative choices like PV ops have been considered for adding
safe_halt() support. But it was rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
safe_halt() use case is not worth the cost.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 02:29:17 +03:00
|
|
|
{
|
|
|
|
switch (ve->exit_reason) {
|
|
|
|
case EXIT_REASON_HLT:
|
2022-06-14 15:01:34 +03:00
|
|
|
return handle_halt(ve);
|
2022-04-06 02:29:18 +03:00
|
|
|
case EXIT_REASON_MSR_READ:
|
2022-06-14 15:01:34 +03:00
|
|
|
return read_msr(regs, ve);
|
2022-04-06 02:29:18 +03:00
|
|
|
case EXIT_REASON_MSR_WRITE:
|
2022-06-14 15:01:34 +03:00
|
|
|
return write_msr(regs, ve);
|
2022-04-06 02:29:19 +03:00
|
|
|
case EXIT_REASON_CPUID:
|
2022-06-14 15:01:34 +03:00
|
|
|
return handle_cpuid(regs, ve);
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
case EXIT_REASON_EPT_VIOLATION:
|
2023-01-27 01:11:58 +03:00
|
|
|
if (is_private_gpa(ve->gpa))
|
|
|
|
panic("Unexpected EPT-violation on private memory.");
|
x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.
To emulate an instruction an emulator needs two things:
- R/W access to the register file to read/modify instruction arguments
and see RIP of the faulted instruction.
- Read access to memory where instruction is placed to see what to
emulate. In this case it is guest kernel text.
Both of them are not available to VMM in TDX environment:
- Register file is never exposed to VMM. When a TD exits to the module,
it saves registers into the state-save area allocated for that TD.
The module then scrubs these registers before returning execution
control to the VMM, to help prevent leakage of TD state.
- TDX does not allow guests to execute from shared memory. All executed
instructions are in TD-private memory. Being private to the TD, VMMs
have no way to access TD-private memory and no way to read the
instruction to decode and emulate it.
In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.
Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.
This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.
In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.
This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.
MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.
Any CPU instruction that accesses memory can also be used to access
MMIO. However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.
The io.h helpers intentionally use a limited set of instructions when
accessing MMIO. This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.
MMIO accesses performed without the io.h helpers are at the mercy of the
compiler. Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated. TDX
guests will oops if they encounter one of these decoding failures.
This means that TDX guests *must* use the io.h helpers to access MMIO.
This requirement is not new. Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.
=== Potential alternative approaches ===
== Paravirtualizing all MMIO ==
An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.
Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.
However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.
Many drivers will never be used in the TDX environment and the bloat
cannot be justified.
== Patching TDX drivers ==
Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.
All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.
This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-06 02:29:20 +03:00
|
|
|
return handle_mmio(regs, ve);
|
2022-04-06 02:29:26 +03:00
|
|
|
case EXIT_REASON_IO_INSTRUCTION:
|
2022-06-14 15:01:34 +03:00
|
|
|
return handle_io(regs, ve);
|
x86/tdx: Add HLT support for TDX guests
The HLT instruction is a privileged instruction, executing it stops
instruction execution and places the processor in a HALT state. It
is used in kernel for cases like reboot, idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).
To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].
In TDX guests, executing HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires STI instruction to
be executed just before the TDCALL. Since idle loop is the only user of
safe_halt() variant, handle it as a special case.
To avoid *safe_halt() call in the idle function, define the
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.
Alternative choices like PV ops have been considered for adding
safe_halt() support. But it was rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
safe_halt() use case is not worth the cost.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 02:29:17 +03:00
|
|
|
default:
|
|
|
|
pr_warn("Unexpected #VE: %lld\n", ve->exit_reason);
|
2022-06-14 15:01:34 +03:00
|
|
|
return -EIO;
|
x86/tdx: Add HLT support for TDX guests
The HLT instruction is a privileged instruction, executing it stops
instruction execution and places the processor in a HALT state. It
is used in kernel for cases like reboot, idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).
To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].
In TDX guests, executing HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires STI instruction to
be executed just before the TDCALL. Since idle loop is the only user of
safe_halt() variant, handle it as a special case.
To avoid *safe_halt() call in the idle function, define the
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.
Alternative choices like PV ops have been considered for adding
safe_halt() support. But it was rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
safe_halt() use case is not worth the cost.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 02:29:17 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
x86/traps: Add #VE support for TDX guest
Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:
* Specific instructions (WBINVD, for example)
* Specific MSR accesses
* Specific CPUID leaf accesses
* Access to specific guest physical addresses
Syscall entry code has a critical window where the kernel stack is not
yet set up. Any exception in this window leads to hard to debug issues
and can be exploited for privilege escalation. Exceptions in the NMI
entry code also cause issues. Returning from the exception handler with
IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
For these reasons, the kernel avoids #VEs during the syscall gap and
the NMI entry code. Entry code paths do not access TD-shared memory,
MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
that might generate #VE. VMM can remove memory from TD at any point,
but access to unaccepted (or missing) private memory leads to VM
termination, not to #VE.
Similarly to page faults and breakpoints, #VEs are allowed in NMI
handlers once the kernel is ready to deal with nested NMIs.
During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
the VE info.
TDGETVEINFO retrieves the #VE info from the TDX module, which also
clears the "#VE valid" flag. This must be done before anything else as
any #VE that occurs while the valid flag is set escalates to #DF by TDX
module. It will result in an oops.
Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be
delivered until TDGETVEINFO is called.
For now, convert unhandled #VE's (everything, until later in this
series) so that they appear just like a #GP by calling the
ve_raise_fault() directly. The ve_raise_fault() function is similar
to #GP handler and is responsible for sending SIGSEGV to userspace
and CPU die and notifying debuggers and other die chain users.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-8-kirill.shutemov@linux.intel.com
2022-04-06 02:29:16 +03:00
|
|
|
bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
|
|
|
|
{
|
2022-06-14 15:01:34 +03:00
|
|
|
int insn_len;
|
x86/tdx: Add HLT support for TDX guests
The HLT instruction is a privileged instruction, executing it stops
instruction execution and places the processor in a HALT state. It
is used in kernel for cases like reboot, idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).
To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].
In TDX guests, executing HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires STI instruction to
be executed just before the TDCALL. Since idle loop is the only user of
safe_halt() variant, handle it as a special case.
To avoid *safe_halt() call in the idle function, define the
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.
Alternative choices like PV ops have been considered for adding
safe_halt() support. But it was rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
safe_halt() use case is not worth the cost.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 02:29:17 +03:00
|
|
|
|
|
|
|
if (user_mode(regs))
|
2022-06-14 15:01:34 +03:00
|
|
|
insn_len = virt_exception_user(regs, ve);
|
x86/tdx: Add HLT support for TDX guests
The HLT instruction is a privileged instruction, executing it stops
instruction execution and places the processor in a HALT state. It
is used in kernel for cases like reboot, idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).
To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].
In TDX guests, executing HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires STI instruction to
be executed just before the TDCALL. Since idle loop is the only user of
safe_halt() variant, handle it as a special case.
To avoid *safe_halt() call in the idle function, define the
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.
Alternative choices like PV ops have been considered for adding
safe_halt() support. But it was rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
safe_halt() use case is not worth the cost.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 02:29:17 +03:00
|
|
|
else
|
2022-06-14 15:01:34 +03:00
|
|
|
insn_len = virt_exception_kernel(regs, ve);
|
|
|
|
if (insn_len < 0)
|
|
|
|
return false;
|
x86/tdx: Add HLT support for TDX guests
The HLT instruction is a privileged instruction, executing it stops
instruction execution and places the processor in a HALT state. It
is used in kernel for cases like reboot, idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).
To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].
In TDX guests, executing HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires STI instruction to
be executed just before the TDCALL. Since idle loop is the only user of
safe_halt() variant, handle it as a special case.
To avoid *safe_halt() call in the idle function, define the
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.
Alternative choices like PV ops have been considered for adding
safe_halt() support. But it was rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
safe_halt() use case is not worth the cost.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-06 02:29:17 +03:00
|
|
|
|
|
|
|
/* After successful #VE handling, move the IP */
|
2022-06-14 15:01:34 +03:00
|
|
|
regs->ip += insn_len;
|
x86/traps: Add #VE support for TDX guest
Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:
* Specific instructions (WBINVD, for example)
* Specific MSR accesses
* Specific CPUID leaf accesses
* Access to specific guest physical addresses
Syscall entry code has a critical window where the kernel stack is not
yet set up. Any exception in this window leads to hard to debug issues
and can be exploited for privilege escalation. Exceptions in the NMI
entry code also cause issues. Returning from the exception handler with
IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
For these reasons, the kernel avoids #VEs during the syscall gap and
the NMI entry code. Entry code paths do not access TD-shared memory,
MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
that might generate #VE. VMM can remove memory from TD at any point,
but access to unaccepted (or missing) private memory leads to VM
termination, not to #VE.
Similarly to page faults and breakpoints, #VEs are allowed in NMI
handlers once the kernel is ready to deal with nested NMIs.
During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
the VE info.
TDGETVEINFO retrieves the #VE info from the TDX module, which also
clears the "#VE valid" flag. This must be done before anything else as
any #VE that occurs while the valid flag is set escalates to #DF by TDX
module. It will result in an oops.
Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be
delivered until TDGETVEINFO is called.
For now, convert unhandled #VE's (everything, until later in this
series) so that they appear just like a #GP by calling the
ve_raise_fault() directly. The ve_raise_fault() function is similar
to #GP handler and is responsible for sending SIGSEGV to userspace
and CPU die and notifying debuggers and other die chain users.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-8-kirill.shutemov@linux.intel.com
2022-04-06 02:29:16 +03:00
|
|
|
|
2022-06-14 15:01:34 +03:00
|
|
|
return true;
|
x86/traps: Add #VE support for TDX guest
Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:
* Specific instructions (WBINVD, for example)
* Specific MSR accesses
* Specific CPUID leaf accesses
* Access to specific guest physical addresses
Syscall entry code has a critical window where the kernel stack is not
yet set up. Any exception in this window leads to hard to debug issues
and can be exploited for privilege escalation. Exceptions in the NMI
entry code also cause issues. Returning from the exception handler with
IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
For these reasons, the kernel avoids #VEs during the syscall gap and
the NMI entry code. Entry code paths do not access TD-shared memory,
MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
that might generate #VE. VMM can remove memory from TD at any point,
but access to unaccepted (or missing) private memory leads to VM
termination, not to #VE.
Similarly to page faults and breakpoints, #VEs are allowed in NMI
handlers once the kernel is ready to deal with nested NMIs.
During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
the VE info.
TDGETVEINFO retrieves the #VE info from the TDX module, which also
clears the "#VE valid" flag. This must be done before anything else as
any #VE that occurs while the valid flag is set escalates to #DF by TDX
module. It will result in an oops.
Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be
delivered until TDGETVEINFO is called.
For now, convert unhandled #VE's (everything, until later in this
series) so that they appear just like a #GP by calling the
ve_raise_fault() directly. The ve_raise_fault() function is similar
to #GP handler and is responsible for sending SIGSEGV to userspace
and CPU die and notifying debuggers and other die chain users.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-8-kirill.shutemov@linux.intel.com
2022-04-06 02:29:16 +03:00
|
|
|
}
|
|
|
|
|
2022-04-06 02:29:35 +03:00
|
|
|
static bool tdx_tlb_flush_required(bool private)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* TDX guest is responsible for flushing TLB on private->shared
|
|
|
|
* transition. VMM is responsible for flushing on shared->private.
|
|
|
|
*
|
|
|
|
* The VMM _can't_ flush private addresses as it can't generate PAs
|
|
|
|
* with the guest's HKID. Shared memory isn't subject to integrity
|
|
|
|
* checking, i.e. the VMM doesn't need to flush for its own protection.
|
|
|
|
*
|
|
|
|
* There's no need to flush when converting from shared to private,
|
|
|
|
* as flushing is the VMM's responsibility in this case, e.g. it must
|
|
|
|
* flush to avoid integrity failures in the face of a buggy or
|
|
|
|
* malicious guest.
|
|
|
|
*/
|
|
|
|
return !private;
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool tdx_cache_flush_required(void)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* AMD SME/SEV can avoid cache flushing if HW enforces cache coherence.
|
|
|
|
* TDX doesn't have such capability.
|
|
|
|
*
|
|
|
|
* Flush cache unconditionally.
|
|
|
|
*/
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2023-08-10 19:12:45 -07:00
|
|
|
* Notify the VMM about page mapping conversion. More info about ABI
|
|
|
|
* can be found in TDX Guest-Host-Communication Interface (GHCI),
|
|
|
|
* section "TDG.VP.VMCALL<MapGPA>".
|
2022-04-06 02:29:35 +03:00
|
|
|
*/
|
2023-08-10 19:12:45 -07:00
|
|
|
static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
|
2022-04-06 02:29:35 +03:00
|
|
|
{
|
2023-08-10 19:12:45 -07:00
|
|
|
/* Retrying the hypercall a second time should succeed; use 3 just in case */
|
|
|
|
const int max_retries_per_page = 3;
|
|
|
|
int retry_count = 0;
|
2022-04-06 02:29:35 +03:00
|
|
|
|
|
|
|
if (!enc) {
|
|
|
|
/* Set the shared (decrypted) bits: */
|
|
|
|
start |= cc_mkdec(0);
|
|
|
|
end |= cc_mkdec(0);
|
|
|
|
}
|
|
|
|
|
2023-08-10 19:12:45 -07:00
|
|
|
while (retry_count < max_retries_per_page) {
|
2023-08-15 23:02:03 +12:00
|
|
|
struct tdx_module_args args = {
|
2023-08-10 19:12:45 -07:00
|
|
|
.r10 = TDX_HYPERCALL_STANDARD,
|
|
|
|
.r11 = TDVMCALL_MAP_GPA,
|
|
|
|
.r12 = start,
|
|
|
|
.r13 = end - start };
|
|
|
|
|
|
|
|
u64 map_fail_paddr;
|
x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALL
Now the 'struct tdx_hypercall_args' and 'struct tdx_module_args' are
almost the same, and the TDX_HYPERCALL and TDX_MODULE_CALL asm macro
share similar code pattern too. The __tdx_hypercall() and __tdcall()
should be unified to use the same assembly code.
As a preparation to unify them, simplify the TDX_HYPERCALL to make it
more like the TDX_MODULE_CALL.
The TDX_HYPERCALL takes the pointer of 'struct tdx_hypercall_args' as
function call argument, and does below extra things comparing to the
TDX_MODULE_CALL:
1) It sets RAX to 0 (TDG.VP.VMCALL leaf) internally;
2) It sets RCX to the (fixed) bitmap of shared registers internally;
3) It calls __tdx_hypercall_failed() internally (and panics) when the
TDCALL instruction itself fails;
4) After TDCALL, it moves R10 to RAX to return the return code of the
VMCALL leaf, regardless the '\ret' asm macro argument;
Firstly, change the TDX_HYPERCALL to take the same function call
arguments as the TDX_MODULE_CALL does: TDCALL leaf ID, and the pointer
to 'struct tdx_module_args'. Then 1) and 2) can be moved to the
caller:
- TDG.VP.VMCALL leaf ID can be passed via the function call argument;
- 'struct tdx_module_args' is 'struct tdx_hypercall_args' + RCX, thus
the bitmap of shared registers can be passed via RCX in the
structure.
Secondly, to move 3) and 4) out of assembly, make the TDX_HYPERCALL
always save output registers to the structure. The caller then can:
- Call __tdx_hypercall_failed() when TDX_HYPERCALL returns error;
- Return R10 in the structure as the return code of the VMCALL leaf;
With above changes, change the asm function from __tdx_hypercall() to
__tdcall_hypercall(), and reimplement __tdx_hypercall() as the C wrapper
of it. This avoids having to add another wrapper of __tdx_hypercall()
(_tdx_hypercall() is already taken).
The __tdcall_hypercall() will be replaced with a __tdcall() variant
using TDX_MODULE_CALL in a later commit as the final goal is to have one
assembly to handle both TDCALL and TDVMCALL.
Currently, the __tdx_hypercall() asm is in '.noinstr.text'. To keep
this unchanged, annotate __tdx_hypercall(), which is a C function now,
as 'noinstr'.
Remove the __tdx_hypercall_ret() as __tdx_hypercall() already does so.
Implement __tdx_hypercall() in tdx-shared.c so it can be shared with the
compressed code.
Opportunistically fix a checkpatch error complaining using space around
parenthesis '(' and ')' while moving the bitmap of shared registers to
<asm/shared/tdx.h>.
[ dhansen: quash new calls of __tdx_hypercall_ret() that showed up ]
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/0cbf25e7aee3256288045023a31f65f0cef90af4.1692096753.git.kai.huang%40intel.com
2023-08-15 23:02:01 +12:00
|
|
|
u64 ret = __tdx_hypercall(&args);
|
2023-08-10 19:12:45 -07:00
|
|
|
|
|
|
|
if (ret != TDVMCALL_STATUS_RETRY)
|
|
|
|
return !ret;
|
|
|
|
/*
|
|
|
|
* The guest must retry the operation for the pages in the
|
|
|
|
* region starting at the GPA specified in R11. R11 comes
|
|
|
|
* from the untrusted VMM. Sanity check it.
|
|
|
|
*/
|
|
|
|
map_fail_paddr = args.r11;
|
|
|
|
if (map_fail_paddr < start || map_fail_paddr >= end)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/* "Consume" a retry without forward progress */
|
|
|
|
if (map_fail_paddr == start) {
|
|
|
|
retry_count++;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
start = map_fail_paddr;
|
|
|
|
retry_count = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Inform the VMM of the guest's intent for this physical page: shared with
|
|
|
|
* the VMM or private to the guest. The VMM is expected to change its mapping
|
|
|
|
* of the page in response.
|
|
|
|
*/
|
|
|
|
static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
|
|
|
|
{
|
|
|
|
phys_addr_t start = __pa(vaddr);
|
|
|
|
phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
|
|
|
|
|
|
|
|
if (!tdx_map_gpa(start, end, enc))
|
2022-04-06 02:29:35 +03:00
|
|
|
return false;
|
|
|
|
|
2023-06-06 17:26:37 +03:00
|
|
|
/* shared->private conversion requires memory to be accepted before use */
|
|
|
|
if (enc)
|
|
|
|
return tdx_accept_memory(start, end);
|
2022-04-06 02:29:35 +03:00
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2023-06-06 12:56:21 +03:00
|
|
|
static bool tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
|
|
|
|
bool enc)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Only handle shared->private conversion here.
|
|
|
|
* See the comment in tdx_early_init().
|
|
|
|
*/
|
|
|
|
if (enc)
|
|
|
|
return tdx_enc_status_changed(vaddr, numpages, enc);
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
|
|
|
|
bool enc)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Only handle private->shared conversion here.
|
|
|
|
* See the comment in tdx_early_init().
|
|
|
|
*/
|
|
|
|
if (!enc)
|
|
|
|
return tdx_enc_status_changed(vaddr, numpages, enc);
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2022-04-06 02:29:10 +03:00
|
|
|
void __init tdx_early_init(void)
|
|
|
|
{
|
x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure
Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and
SEAMCALL, takes one parameter for each input register and an optional
'struct tdx_module_output' (a collection of output registers) as output.
This is different from the TDX_HYPERCALL macro which uses a single
'struct tdx_hypercall_args' to carry all input/output registers.
The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more
input/output registers. Also, the TDH.VP.ENTER (which isn't covered
by the current TDX_MODULE_CALL macro) basically can use all registers
that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't
extendible to cover those cases.
Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro
to use a single structure 'struct tdx_module_args' to carry all the
input/output registers. Currently, R10/R11 are only used as output
register but not as input by any TDCALL/SEAMCALL. Change to also use
R10/R11 as input register to make input/output registers symmetric.
Currently, the TDX_MODULE_CALL macro depends on the caller to pass a
non-NULL 'struct tdx_module_output' to get additional output registers.
Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to
take a new 'ret' macro argument to indicate whether to save the output
registers to the 'struct tdx_module_args'. Also introduce a new
__tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret().
Note the tdcall(), which is a wrapper of __tdcall(), is called by three
callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init().
The former two need the additional output but the last one doesn't. For
simplicity, make tdcall() always call __tdcall_ret() to avoid another
"_ret()" wrapper. The last caller tdx_early_init() isn't performance
critical anyway.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com
2023-08-15 23:01:59 +12:00
|
|
|
struct tdx_module_args args = {
|
|
|
|
.rdx = TDCS_NOTIFY_ENABLES,
|
|
|
|
.r9 = -1ULL,
|
|
|
|
};
|
2022-04-06 02:29:13 +03:00
|
|
|
u64 cc_mask;
|
2022-04-06 02:29:10 +03:00
|
|
|
u32 eax, sig[3];
|
|
|
|
|
|
|
|
cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax, &sig[0], &sig[2], &sig[1]);
|
|
|
|
|
|
|
|
if (memcmp(TDX_IDENT, sig, sizeof(sig)))
|
|
|
|
return;
|
|
|
|
|
|
|
|
setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
|
|
|
|
|
x86/tdx: Mark TSC reliable
In x86 virtualization environments, including TDX, RDTSC instruction is
handled without causing a VM exit, resulting in minimal overhead and
jitters. On the other hand, other clock sources (such as HPET, ACPI
timer, APIC, etc.) necessitate VM exits to implement, resulting in more
fluctuating measurements compared to TSC. Thus, those clock sources are
not effective for calibrating TSC.
As a foundation, the host TSC is guaranteed to be invariant on any
system which enumerates TDX support.
TDX guests and the TDX module build on that foundation by enforcing:
- Virtual TSC is monotonously incrementing for any single VCPU;
- Virtual TSC values are consistent among all the TD’s VCPUs at the
level supported by the CPU:
+ VMM is required to set the same TSC_ADJUST;
+ VMM must not modify from initial value of TSC_ADJUST before
SEAMCALL;
- The frequency is determined by TD configuration:
+ Virtual TSC frequency is specified by VMM on TDH.MNG.INIT;
+ Virtual TSC starts counting from 0 at TDH.MNG.INIT;
The result is that a reliable TSC is a TDX architectural guarantee.
Use the TSC as the only reliable clock source in TD guests, bypassing
unstable calibration.
This is similar to what the kernel already does in some VMWare and
HyperV environments.
[ dhansen: changelog tweaks ]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Erdem Aktas <erdemaktas@google.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/all/20231006144549.2633-1-kirill.shutemov%40linux.intel.com
2023-10-06 17:45:49 +03:00
|
|
|
/* TSC is the only reliable clock in TDX guest */
|
|
|
|
setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
|
|
|
|
|
2023-05-08 12:44:28 +02:00
|
|
|
cc_vendor = CC_VENDOR_INTEL;
|
2022-10-28 17:12:19 +03:00
|
|
|
tdx_parse_tdinfo(&cc_mask);
|
2022-04-06 02:29:13 +03:00
|
|
|
cc_set_mask(cc_mask);
|
|
|
|
|
2023-01-27 01:11:59 +03:00
|
|
|
/* Kernel does not use NOTIFY_ENABLES and does not need random #VEs */
|
x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure
Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and
SEAMCALL, takes one parameter for each input register and an optional
'struct tdx_module_output' (a collection of output registers) as output.
This is different from the TDX_HYPERCALL macro which uses a single
'struct tdx_hypercall_args' to carry all input/output registers.
The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more
input/output registers. Also, the TDH.VP.ENTER (which isn't covered
by the current TDX_MODULE_CALL macro) basically can use all registers
that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't
extendible to cover those cases.
Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro
to use a single structure 'struct tdx_module_args' to carry all the
input/output registers. Currently, R10/R11 are only used as output
register but not as input by any TDCALL/SEAMCALL. Change to also use
R10/R11 as input register to make input/output registers symmetric.
Currently, the TDX_MODULE_CALL macro depends on the caller to pass a
non-NULL 'struct tdx_module_output' to get additional output registers.
Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to
take a new 'ret' macro argument to indicate whether to save the output
registers to the 'struct tdx_module_args'. Also introduce a new
__tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret().
Note the tdcall(), which is a wrapper of __tdcall(), is called by three
callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init().
The former two need the additional output but the last one doesn't. For
simplicity, make tdcall() always call __tdcall_ret() to avoid another
"_ret()" wrapper. The last caller tdx_early_init() isn't performance
critical anyway.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com
2023-08-15 23:01:59 +12:00
|
|
|
tdcall(TDG_VM_WR, &args);
|
2023-01-27 01:11:59 +03:00
|
|
|
|
2022-04-06 02:29:14 +03:00
|
|
|
/*
|
|
|
|
* All bits above GPA width are reserved and kernel treats shared bit
|
|
|
|
* as flag, not as part of physical address.
|
|
|
|
*
|
|
|
|
* Adjust physical mask to only cover valid GPA bits.
|
|
|
|
*/
|
|
|
|
physical_mask &= cc_mask - 1;
|
|
|
|
|
2023-06-06 12:56:21 +03:00
|
|
|
/*
|
|
|
|
* The kernel mapping should match the TDX metadata for the page.
|
|
|
|
* load_unaligned_zeropad() can touch memory *adjacent* to that which is
|
|
|
|
* owned by the caller and can catch even _momentary_ mismatches. Bad
|
|
|
|
* things happen on mismatch:
|
|
|
|
*
|
|
|
|
* - Private mapping => Shared Page == Guest shutdown
|
|
|
|
* - Shared mapping => Private Page == Recoverable #VE
|
|
|
|
*
|
|
|
|
* guest.enc_status_change_prepare() converts the page from
|
|
|
|
* shared=>private before the mapping becomes private.
|
|
|
|
*
|
|
|
|
* guest.enc_status_change_finish() converts the page from
|
|
|
|
* private=>shared after the mapping becomes private.
|
|
|
|
*
|
|
|
|
* In both cases there is a temporary shared mapping to a private page,
|
|
|
|
* which can result in a #VE. But, there is never a private mapping to
|
|
|
|
* a shared page.
|
|
|
|
*/
|
|
|
|
x86_platform.guest.enc_status_change_prepare = tdx_enc_status_change_prepare;
|
|
|
|
x86_platform.guest.enc_status_change_finish = tdx_enc_status_change_finish;
|
|
|
|
|
|
|
|
x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
|
|
|
|
x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
|
2022-04-06 02:29:35 +03:00
|
|
|
|
2023-05-31 09:44:26 +02:00
|
|
|
/*
|
|
|
|
* TDX intercepts the RDMSR to read the X2APIC ID in the parallel
|
|
|
|
* bringup low level code. That raises #VE which cannot be handled
|
|
|
|
* there.
|
|
|
|
*
|
|
|
|
* Intel-TDX has a secure RDMSR hypercall, but that needs to be
|
2024-01-02 18:40:11 -06:00
|
|
|
* implemented separately in the low level startup ASM code.
|
2023-05-31 09:44:26 +02:00
|
|
|
* Until that is in place, disable parallel bringup for TDX.
|
|
|
|
*/
|
|
|
|
x86_cpuinit.parallel_bringup = false;
|
|
|
|
|
2022-04-06 02:29:10 +03:00
|
|
|
pr_info("Guest detected\n");
|
|
|
|
}
|