Commit graph

59 commits

Author SHA1 Message Date
Uros Bizjak
d109ff4f0b x86/asm: Replace "REP; NOP" with PAUSE mnemonic
Current minimum required version of binutils is 2.25,
which supports PAUSE instruction mnemonic.

Replace "REP; NOP" with this proper mnemonic.

No functional change intended.

Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250418080805.83679-2-ubizjak@gmail.com
2025-04-18 10:19:25 +02:00
Uros Bizjak
42c782fae3 x86/asm: Remove semicolon from "rep" prefixes
Minimum version of binutils required to compile the kernel is 2.25.
This version correctly handles the "rep" prefixes, so it is possible
to remove the semicolon, which was used to support ancient versions
of GNU as.

Due to the semicolon, the compiler considers "rep; insn" (or its
alternate "rep\n\tinsn" form) as two separate instructions. Removing
the semicolon makes asm length calculations more accurate, consequently
making scheduling and inlining decisions of the compiler more accurate.

Removing the semicolon also enables assembler checks involving "rep"
prefixes. Trying to assemble e.g. "rep addl %eax, %ebx" results in:

  Error: invalid instruction `add' after `rep'

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavel Machek <pavel@kernel.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20250418071437.4144391-2-ubizjak@gmail.com
2025-04-18 09:33:33 +02:00
David Woodhouse
de085ddd49 x86/kexec: Invalidate GDT/IDT from relocate_kernel() instead of earlier
Reduce the window during which exceptions are unhandled, by leaving the
GDT/IDT in place all the way into the relocate_kernel() function, until
the moment that %cr3 gets replaced.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250326142404.256980-4-dwmw2@infradead.org
2025-04-10 12:17:14 +02:00
David Woodhouse
7516e7216b x86/kexec: Add 8250 MMIO serial port output
This supports the same 32-bit MMIO-mapped 8250 as the early_printk code.

It's not clear why the early_printk code supports this form and only this
form; the actual runtime 8250_pci doesn't seem to support it. But having
hacked up QEMU to expose such a device, early_printk does work with it,
and now so does the kexec debug code.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250326142404.256980-3-dwmw2@infradead.org
2025-04-10 12:17:14 +02:00
David Woodhouse
d358b45120 x86/kexec: Add 8250 serial port output
If a serial port was configured for early_printk, use it for debug output
from the relocate_kernel exception handler too.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250326142404.256980-2-dwmw2@infradead.org
2025-04-10 12:17:13 +02:00
David Woodhouse
3d66af75b0 x86/kexec: Debugging support: Dump registers on exception
The actual serial output function is a no-op for now.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20250314173226.3062535-3-dwmw2@infradead.org
2025-03-25 12:49:05 +01:00
David Woodhouse
8df505af7f x86/kexec: Debugging support: Load an IDT and basic exception entry points
[ mingo: Minor readability edits ]

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20250314173226.3062535-2-dwmw2@infradead.org
2025-03-25 12:49:05 +01:00
David Woodhouse
b25eb5f5e4 x86/kexec: Add relocate_kernel() debugging support: Load a GDT
There are some failure modes which lead to triple-faults in the
relocate_kernel() function, which is fairly much undebuggable
for normal mortals.

Adding a GDT in the relocate_kernel() environment is step 1 towards
being able to catch faults and do something more useful.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250312144257.2348250-2-dwmw2@infradead.org
2025-03-14 11:01:53 +01:00
Linus Torvalds
5b7f7234ff x86/boot changes for v6.14:
- A large and involved preparatory series to pave the way to add exception
    handling for relocate_kernel - which will be a debugging facility that
    has aided in the field to debug an exceptionally hard to debug early boot bug.
    Plus assorted cleanups and fixes that were discovered along the way,
    by David Woodhouse:
 
       - Clean up and document register use in relocate_kernel_64.S
       - Use named labels in swap_pages in relocate_kernel_64.S
       - Only swap pages for ::preserve_context mode
       - Allocate PGD for x86_64 transition page tables separately
       - Copy control page into place in machine_kexec_prepare()
       - Invoke copy of relocate_kernel() instead of the original
       - Move relocate_kernel to kernel .data section
       - Add data section to relocate_kernel
       - Drop page_list argument from relocate_kernel()
       - Eliminate writes through kernel mapping of relocate_kernel page
       - Clean up register usage in relocate_kernel()
       - Mark relocate_kernel page as ROX instead of RWX
       - Disable global pages before writing to control page
       - Ensure preserve_context flag is set on return to kernel
       - Use correct swap page in swap_pages function
       - Fix stack and handling of re-entry point for ::preserve_context
       - Mark machine_kexec() with __nocfi
       - Cope with relocate_kernel() not being at the start of the page
       - Use typedef for relocate_kernel_fn function prototype
       - Fix location of relocate_kernel with -ffunction-sections (fix by Nathan Chancellor)
 
  - A series to remove the last remaining absolute symbol references from
    .head.text, and enforce this at build time, by Ard Biesheuvel:
 
       - Avoid WARN()s and panic()s in early boot code
       - Don't hang but terminate on failure to remap SVSM CA
       - Determine VA/PA offset before entering C code
       - Avoid intentional absolute symbol references in .head.text
       - Disable UBSAN in early boot code
       - Move ENTRY_TEXT to the start of the image
       - Move .head.text into its own output section
       - Reject absolute references in .head.text
 
  - Which build-time enforcement uncovered a handful of bugs of essentially
    non-working code, and a wrokaround for a toolchain bug, fixed by
    Ard Biesheuvel as well:
 
       - Fix spurious undefined reference when CONFIG_X86_5LEVEL=n, on GCC-12
       - Disable UBSAN on SEV code that may execute very early
       - Disable ftrace branch profiling in SEV startup code
 
  - And miscellaneous cleanups:
 
        - kexec_core: Add and update comments regarding the KEXEC_JUMP flow (Rafael J. Wysocki)
        - x86/sysfs: Constify 'struct bin_attribute' (Thomas Weißschuh)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmeQDmURHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1inwRAAjD5QR/Yu7Yiv2nM/ncUwAItsFkv9Jk4Y
 HPGz9qNJoZxKxuZVj9bfQhWDe3g6VLnlDYgatht9BsyP5b12qZrUe+yp/TOH54Z3
 wPD+U/jun4jiSr7oJkJC+bFn+a/tL39pB8Y6m+jblacgVglleO3SH5fBWNE1UbIV
 e2iiNxi0bfuHy3wquegnKaMyF1e7YLw1p5laGSwwk21g5FjT7cLQOqC0/9u8u9xX
 Ha+iaod7JOcjiQOqIt/MV57ldWEFCrUhQozRV3tK5Ptf5aoGFpisgQoRoduWUtFz
 UbHiHhv6zE4DOIUzaAbJjYfR1Z/LCviwON97XJgeOOkJaULF7yFCfhGxKSyQoMIh
 qZtlBs4VsGl2/dOl+iW6xKwgRiNundTzSQtt5D/xuFz5LnDxe/SrlZnYp8lOPP8R
 w9V2b/fC0YxmUzEW6EDhBqvfuScKiNWoic47qvYfZPaWyg1ESpvWTIh6AKB5ThUR
 upgJQdA4HW+y5C57uHW40TSe3xEeqM3+Slk0jxLElP7/yTul5r7jrjq2EkwaAv/j
 6/0LsMSr33r9fVFeMP1qLXPUaipcqTWWTpeeTr8NBGUcvOKzw5SltEG4NihzCyhF
 3/UMQhcQ6KE3iFMPlRu4hV7ZV4gErZmLoRwh9Uk28f2Xx8T95uoV8KTg1/sRZRTo
 uQLeRxYnyrw=
 =vGWS
 -----END PGP SIGNATURE-----

Merge tag 'x86-boot-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 boot updates from Ingo Molnar:

 - A large and involved preparatory series to pave the way to add
   exception handling for relocate_kernel - which will be a debugging
   facility that has aided in the field to debug an exceptionally hard
   to debug early boot bug. Plus assorted cleanups and fixes that were
   discovered along the way, by David Woodhouse:

      - Clean up and document register use in relocate_kernel_64.S
      - Use named labels in swap_pages in relocate_kernel_64.S
      - Only swap pages for ::preserve_context mode
      - Allocate PGD for x86_64 transition page tables separately
      - Copy control page into place in machine_kexec_prepare()
      - Invoke copy of relocate_kernel() instead of the original
      - Move relocate_kernel to kernel .data section
      - Add data section to relocate_kernel
      - Drop page_list argument from relocate_kernel()
      - Eliminate writes through kernel mapping of relocate_kernel page
      - Clean up register usage in relocate_kernel()
      - Mark relocate_kernel page as ROX instead of RWX
      - Disable global pages before writing to control page
      - Ensure preserve_context flag is set on return to kernel
      - Use correct swap page in swap_pages function
      - Fix stack and handling of re-entry point for ::preserve_context
      - Mark machine_kexec() with __nocfi
      - Cope with relocate_kernel() not being at the start of the page
      - Use typedef for relocate_kernel_fn function prototype
      - Fix location of relocate_kernel with -ffunction-sections (fix by Nathan Chancellor)

 - A series to remove the last remaining absolute symbol references from
   .head.text, and enforce this at build time, by Ard Biesheuvel:

      - Avoid WARN()s and panic()s in early boot code
      - Don't hang but terminate on failure to remap SVSM CA
      - Determine VA/PA offset before entering C code
      - Avoid intentional absolute symbol references in .head.text
      - Disable UBSAN in early boot code
      - Move ENTRY_TEXT to the start of the image
      - Move .head.text into its own output section
      - Reject absolute references in .head.text

 - The above build-time enforcement uncovered a handful of bugs of
   essentially non-working code, and a wrokaround for a toolchain bug,
   fixed by Ard Biesheuvel as well:

      - Fix spurious undefined reference when CONFIG_X86_5LEVEL=n, on GCC-12
      - Disable UBSAN on SEV code that may execute very early
      - Disable ftrace branch profiling in SEV startup code

 - And miscellaneous cleanups:

      - kexec_core: Add and update comments regarding the KEXEC_JUMP flow (Rafael J. Wysocki)
      - x86/sysfs: Constify 'struct bin_attribute' (Thomas Weißschuh)"

* tag 'x86-boot-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  x86/sev: Disable ftrace branch profiling in SEV startup code
  x86/kexec: Use typedef for relocate_kernel_fn function prototype
  x86/kexec: Cope with relocate_kernel() not being at the start of the page
  kexec_core: Add and update comments regarding the KEXEC_JUMP flow
  x86/kexec: Mark machine_kexec() with __nocfi
  x86/kexec: Fix location of relocate_kernel with -ffunction-sections
  x86/kexec: Fix stack and handling of re-entry point for ::preserve_context
  x86/kexec: Use correct swap page in swap_pages function
  x86/kexec: Ensure preserve_context flag is set on return to kernel
  x86/kexec: Disable global pages before writing to control page
  x86/sev: Don't hang but terminate on failure to remap SVSM CA
  x86/sev: Disable UBSAN on SEV code that may execute very early
  x86/boot/64: Fix spurious undefined reference when CONFIG_X86_5LEVEL=n, on GCC-12
  x86/sysfs: Constify 'struct bin_attribute'
  x86/kexec: Mark relocate_kernel page as ROX instead of RWX
  x86/kexec: Clean up register usage in relocate_kernel()
  x86/kexec: Eliminate writes through kernel mapping of relocate_kernel page
  x86/kexec: Drop page_list argument from relocate_kernel()
  x86/kexec: Add data section to relocate_kernel
  x86/kexec: Move relocate_kernel to kernel .data section
  ...
2025-01-24 05:54:26 -08:00
David Woodhouse
e536057543 x86/kexec: Cope with relocate_kernel() not being at the start of the page
A few places in the kexec control code page make the assumption that the first
instruction of relocate_kernel is at the very start of the page.

To allow for Clang CFI information to be added to relocate_kernel(), as well
as the general principle of removing unwarranted assumptions, fix them to use
the external __relocate_kernel_start symbol that the linker adds. This means
using a separate addq and subq for calculating offsets, as the assembler can
no longer calculate the delta directly for itself and relocations aren't that
versatile. But those values can at least be used relative to a local label to
avoid absolute relocations.

Turn the jump from relocate_kernel() to identity_mapped() into a real indirect
'jmp *%rsi' too, while touching it. There was no real reason for it to be
a push+ret in the first place, and adding Clang CFI info will also give
objtool enough visibility to start complaining 'return with modified stack
frame' about it.

  [ bp: Massage commit message. ]

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-9-dwmw2@infradead.org
2025-01-14 13:05:14 +01:00
Nathan Chancellor
eeed915041 x86/kexec: Fix location of relocate_kernel with -ffunction-sections
After commit

  cb33ff9e06 ("x86/kexec: Move relocate_kernel to kernel .data section"),

kernels configured with an option that uses -ffunction-sections, such as
CONFIG_LTO_CLANG, crash when kexecing because the value of relocate_kernel
does not match the value of __relocate_kernel_start so incorrect code gets
copied via machine_kexec_prepare().

  $ llvm-nm good-vmlinux &| rg relocate_kernel
  ffffffff83280d41 T __relocate_kernel_end
  ffffffff83280b00 T __relocate_kernel_start
  ffffffff83280b00 T relocate_kernel

  $ llvm-nm bad-vmlinux &| rg relocate_kernel
  ffffffff83266100 D __relocate_kernel_end
  ffffffff83266100 D __relocate_kernel_start
  ffffffff8120b0d8 T relocate_kernel

When -ffunction-sections is enabled, TEXT_MAIN matches on
'.text.[0-9a-zA-Z_]*' to coalesce the function specific functions back
into .text during link time after they have been optimized. Due to the
placement of TEXT_TEXT before KEXEC_RELOCATE_KERNEL in the x86 linker
script, the .text.relocate_kernel section ends up in .text instead of
.data.

Use a second dot in the relocate_kernel section name to avoid matching
on TEXT_MAIN, which matches a similar situation that happened in
commit

  79cd2a1122 ("x86/retpoline,kprobes: Fix position of thunk sections with CONFIG_LTO_CLANG"),

which allows kexec to function properly.

While .data.relocate_kernel still ends up in the .data section via
DATA_MAIN -> DATA_DATA, ensure it is located with the
.text.relocate_kernel section as intended by performing the same
transformation.

Fixes: cb33ff9e06 ("x86/kexec: Move relocate_kernel to kernel .data section")
Fixes: 8dbec5c77b ("x86/kexec: Add data section to relocate_kernel")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-6-dwmw2@infradead.org
2025-01-14 13:00:18 +01:00
David Woodhouse
2cacf7f23a x86/kexec: Fix stack and handling of re-entry point for ::preserve_context
A ::preserve_context kimage can be invoked more than once, and the entry point
can be different every time. When the callee returns to the kernel, it leaves
the address of its entry point for next time on the stack.

That being the case, one might reasonably assume that the caller would
allocate space for it on the stack frame before actually performing the 'call'
into the callee.

Apparently not, though. Ever since the kjump code was first added in 2009, it
has set up a *new* stack at the top of the swap_page scratch page, then just
performed the 'call' without allocating any space for the re-entry address to
be returned. It then reads the re-entry point for next time from 0(%rsp) which
is actually the first qword of the page *after* the swap page, which might not
exist at all! And if the callee has written to that, then it will have
corrupted memory it doesn't own.

Correct this by pushing the entry point of the callee onto the stack before
calling it. The callee may then adjust it, or not, as it sees fit, and
subsequent invocations should work correctly either way.

Remove a stray push of zero to the *relocate_kernel* stack, which may have
been intended for this purpose, but which was actually just noise.

Also, loading the stack for the callee relied on the address of the swap page
being in %r10 without ever documenting that fact. Recent code changes made
that no longer true, so load it directly from the local kexec_pa_swap_page
variable instead.

Fixes: b3adabae8a ("x86/kexec: Drop page_list argument from relocate_kernel()")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-5-dwmw2@infradead.org
2025-01-14 12:57:43 +01:00
David Woodhouse
85d724df8c x86/kexec: Use correct swap page in swap_pages function
The swap_pages function expects the swap page to be in %r10, but there
was no documentation to that effect. Once upon a time the setup code
used to load its value from a kernel virtual address and save it to an
address which is accessible in the identity-mapped page tables, and
*happened* to use %r10 to do so, with no comment that it was left there
on *purpose* instead of just being a scratch register. Once that was no
longer necessary, %r10 just holds whatever the kernel happened to leave
in it.

Now that the original value passed by the kernel is accessible via
%rip-relative addressing, load directly from there instead of using %r10
for it. But document the other parameters that the swap_pages function
*does* expect in registers.

Fixes: b3adabae8a ("x86/kexec: Drop page_list argument from relocate_kernel()")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-4-dwmw2@infradead.org
2025-01-14 12:54:36 +01:00
David Woodhouse
4d5f1da98f x86/kexec: Ensure preserve_context flag is set on return to kernel
The swap_pages() function will only actually *swap*, as its name implies, if
the preserve_context flag in the %r11 register is non-zero. On the way back
from a ::preserve_context kexec, ensure that the %r11 register is non-zero so
that the pages get swapped back.

Fixes: 9e5683e2d0 ("x86/kexec: Only swap pages for ::preserve_context mode")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-3-dwmw2@infradead.org
2025-01-14 12:52:47 +01:00
David Woodhouse
d144d8a652 x86/kexec: Disable global pages before writing to control page
The kernel switches to a new set of page tables during kexec. The global
mappings (_PAGE_GLOBAL==1) can remain in the TLB after this switch. This
is generally not a problem because the new page tables use a different
portion of the virtual address space than the normal kernel mappings.

The critical exception to that generalisation (and the only mapping
which isn't an identity mapping) is the kexec control page itself —
which was ROX in the original kernel mapping, but should be RWX in the
new page tables. If there is a global TLB entry for that in its prior
read-only state, it definitely needs to be flushed before attempting to
write through that virtual mapping.

It would be possible to just avoid writing to the virtual address of the
page and defer all writes until they can be done through the identity
mapping. But there's no good reason to keep the old TLB entries around,
as they can cause nothing but trouble.

Clear the PGE bit in %cr4 early, before storing data in the control page.

Fixes: 5a82223e07 ("x86/kexec: Mark relocate_kernel page as ROX instead of RWX")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219592
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: "Ning, Hongyu" <hongyu.ning@linux.intel.com>
Co-developed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: "Ning, Hongyu" <hongyu.ning@linux.intel.com>
Link: https://lore.kernel.org/r/20250109140757.2841269-2-dwmw2@infradead.org
2025-01-14 12:46:17 +01:00
Damien Le Moal
aeb6893761 x86: Fix build regression with CONFIG_KEXEC_JUMP enabled
Build 6.13-rc12 for x86_64 with gcc 14.2.1 fails with the error:

  ld: vmlinux.o: in function `virtual_mapped':
  linux/arch/x86/kernel/relocate_kernel_64.S:249:(.text+0x5915b): undefined reference to `saved_context_gdt_desc'

when CONFIG_KEXEC_JUMP is enabled.

This was introduced by commit 07fa619f2a ("x86/kexec: Restore GDT on
return from ::preserve_context kexec") which introduced a use of
saved_context_gdt_desc without a declaration for it.

Fix that by including asm/asm-offsets.h where saved_context_gdt_desc
is defined (indirectly in include/generated/asm-offsets.h which
asm/asm-offsets.h includes).

Fixes: 07fa619f2a ("x86/kexec: Restore GDT on return from ::preserve_context kexec")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Closes: https://lore.kernel.org/oe-kbuild-all/202411270006.ZyyzpYf8-lkp@intel.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-09 10:13:28 -08:00
David Woodhouse
93e489ad7a x86/kexec: Clean up register usage in relocate_kernel()
The memory encryption flag is passed in %r8 because that's where the
calling convention puts it. Instead of moving it to %r12 and then using
%r8 for other things, just leave it in %r8 and use other registers
instead.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-13-dwmw2@infradead.org
2024-12-06 10:42:00 +01:00
David Woodhouse
b7155dfd49 x86/kexec: Eliminate writes through kernel mapping of relocate_kernel page
All writes to the relocate_kernel control page are now done *after* the
%cr3 switch via simple %rip-relative addressing, which means the DATA()
macro with its pointer arithmetic can also now be removed.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-12-dwmw2@infradead.org
2024-12-06 10:42:00 +01:00
David Woodhouse
b3adabae8a x86/kexec: Drop page_list argument from relocate_kernel()
The kernel's virtual mapping of the relocate_kernel page currently needs
to be RWX because it is written to before the %cr3 switch.

Now that the relocate_kernel page has its own .data section and local
variables, it can also have *global* variables. So eliminate the separate
page_list argument, and write the same information directly to variables
in the relocate_kernel page instead. This way, the relocate_kernel code
itself doesn't need to copy it.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-11-dwmw2@infradead.org
2024-12-06 10:42:00 +01:00
David Woodhouse
8dbec5c77b x86/kexec: Add data section to relocate_kernel
Now that the relocate_kernel page is handled sanely by a linker script
we can have actual data, and just use %rip-relative addressing to access
it.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-10-dwmw2@infradead.org
2024-12-06 10:42:00 +01:00
David Woodhouse
cb33ff9e06 x86/kexec: Move relocate_kernel to kernel .data section
Now that the copy is executed instead of the original, the relocate_kernel
page can live in the kernel's .text section. This will allow subsequent
commits to actually add real data to it and clean up the code somewhat as
well as making the control page ROX.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-9-dwmw2@infradead.org
2024-12-06 10:41:59 +01:00
David Woodhouse
eeebbde571 x86/kexec: Invoke copy of relocate_kernel() instead of the original
This currently calls set_memory_x() from machine_kexec_prepare() just
like the 32-bit version does. That's actually a bit earlier than I'd
like, as it leaves the page RWX all the time the image is even *loaded*.

Subsequent commits will eliminate all the writes to the page between the
point it's marked executable in machine_kexec_prepare() the time that
relocate_kernel() is running and has switched to the identmap %cr3, so
that it can be ROX. But that can't happen until it's moved to the .data
section of the kernel, and *that* can't happen until we start executing
the copy instead of executing it in place in the kernel .text. So break
the circular dependency in those commits by letting it be RWX for now.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-8-dwmw2@infradead.org
2024-12-06 10:41:59 +01:00
David Woodhouse
9e5683e2d0 x86/kexec: Only swap pages for ::preserve_context mode
There's no need to swap pages (which involves three memcopies for each
page) in the plain kexec case. Just do a single copy from source to
destination page.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-5-dwmw2@infradead.org
2024-12-06 10:41:59 +01:00
David Woodhouse
46d4e205e2 x86/kexec: Use named labels in swap_pages in relocate_kernel_64.S
Make the code a little more readable.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Kai Huang <kai.huang@intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-4-dwmw2@infradead.org
2024-12-06 10:41:58 +01:00
David Woodhouse
207bdf7f72 x86/kexec: Clean up and document register use in relocate_kernel_64.S
Add more comments explaining what each register contains, and save the
preserve_context flag to a non-clobbered register sooner, to keep things
simpler.

No change in behavior intended.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Kai Huang <kai.huang@intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20241205153343.3275139-3-dwmw2@infradead.org
2024-12-06 10:41:58 +01:00
David Woodhouse
07fa619f2a x86/kexec: Restore GDT on return from ::preserve_context kexec
The restore_processor_state() function explicitly states that "the asm code
that gets us here will have restored a usable GDT". That wasn't true in the
case of returning from a ::preserve_context kexec. Make it so.

Without this, the kernel was depending on the called function to reload a
GDT which is appropriate for the kernel before returning.

Test program:

 #include <unistd.h>
 #include <errno.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <linux/kexec.h>
 #include <linux/reboot.h>
 #include <sys/reboot.h>
 #include <sys/syscall.h>

 int main (void)
 {
        struct kexec_segment segment = {};
	unsigned char purgatory[] = {
		0x66, 0xba, 0xf8, 0x03,	// mov $0x3f8, %dx
		0xb0, 0x42,		// mov $0x42, %al
		0xee,			// outb %al, (%dx)
		0xc3,			// ret
	};
	int ret;

	segment.buf = &purgatory;
	segment.bufsz = sizeof(purgatory);
	segment.mem = (void *)0x400000;
	segment.memsz = 0x1000;
	ret = syscall(__NR_kexec_load, 0x400000, 1, &segment, KEXEC_PRESERVE_CONTEXT);
	if (ret) {
		perror("kexec_load");
		exit(1);
	}

	ret = syscall(__NR_reboot, LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2, LINUX_REBOOT_CMD_KEXEC);
	if (ret) {
		perror("kexec reboot");
		exit(1);
	}
	printf("Success\n");
	return 0;
 }

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20241205153343.3275139-2-dwmw2@infradead.org
2024-12-06 10:35:49 +01:00
Kai Huang
ea49cdb26e x86/kexec: Add comments around swap_pages() assembly to improve readability
The current assembly around swap_pages() in the relocate_kernel() takes
some time to follow because the use of registers can be easily lost when
the line of assembly goes long.  Add a couple of comments to clarify the
code around swap_pages() to improve readability.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/8b52b0b8513a34b2a02fb4abb05c6700c2821475.1724573384.git.kai.huang@intel.com
2024-08-25 14:29:39 +02:00
Kai Huang
3c41ad39f1 x86/kexec: Fix a comment of swap_pages() assembly
When relocate_kernel() gets called, %rdi holds 'indirection_page' and
%rsi holds 'page_list'.  And %rdi always holds 'indirection_page' when
swap_pages() is called.

Therefore the comment of the first line code of swap_pages()

	movq    %rdi, %rcx      /* Put the page_list in %rcx */

.. isn't correct because it actually moves the 'indirection_page' to
the %rcx.  Fix it.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/adafdfb1421c88efce04420fc9a996c0e2ca1b34.1724573384.git.kai.huang@intel.com
2024-08-25 14:29:39 +02:00
Kirill A. Shutemov
de60613173 x86/kexec: Keep CR4.MCE set during kexec for TDX guest
TDX guests run with MCA enabled (CR4.MCE=1b) from the very start. If
that bit is cleared during CR4 register reprogramming during boot or kexec
flows, a #VE exception will be raised which the guest kernel cannot handle.

Therefore, make sure the CR4.MCE setting is preserved over kexec too and avoid
raising any #VEs.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240614095904.1345461-7-kirill.shutemov@linux.intel.com
2024-06-17 17:45:50 +02:00
Borislav Petkov
7b46a8997d x86/relocate_kernel: Use named labels for less confusion
That identity_mapped() function was loving that "1" label to the point of
completely confusing its readers.

Use named labels in each place for clarity.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/20240614095904.1345461-6-kirill.shutemov@linux.intel.com
2024-06-17 17:45:46 +02:00
Nikolay Borisov
54183d103d x86/kexec: Remove spurious unconditional JMP from from identity_mapped()
This seemingly straightforward JMP was introduced in the initial version
of the the 64bit kexec code without any explanation.

It turns out (check accompanying Link) it's likely a copy/paste artefact
from 32-bit code, where such a JMP could be used as a serializing
instruction for the 486's prefetch queue. On x86_64 that's not needed
because there's already a preceding write to cr4 which itself is
a serializing operation.

  [ bp: Typos. Let's try this and see what cries out. If it does,
    reverting it is trivial. ]

Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/55bc0649-c017-49ab-905d-212f140a403f@citrix.com/
2024-06-17 17:45:19 +02:00
Josh Poimboeuf
fb799447ae x86,objtool: Split UNWIND_HINT_EMPTY in two
Mark reported that the ORC unwinder incorrectly marks an unwind as
reliable when the unwind terminates prematurely in the dark corners of
return_to_handler() due to lack of information about the next frame.

The problem is UNWIND_HINT_EMPTY is used in two different situations:

  1) The end of the kernel stack unwind before hitting user entry, boot
     code, or fork entry

  2) A blind spot in ORC coverage where the unwinder has to bail due to
     lack of information about the next frame

The ORC unwinder has no way to tell the difference between the two.
When it encounters an undefined stack state with 'end=1', it blindly
marks the stack reliable, which can break the livepatch consistency
model.

Fix it by splitting UNWIND_HINT_EMPTY into UNWIND_HINT_UNDEFINED and
UNWIND_HINT_END_OF_STACK.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/fd6212c8b450d3564b855e1cb48404d6277b4d9f.1677683419.git.jpoimboe@kernel.org
2023-03-23 23:18:58 +01:00
Thomas Gleixner
e81dc127ef x86/callthunks: Add call patching for call depth tracking
Mitigating the Intel SKL RSB underflow issue in software requires to
track the call depth. That is every CALL and every RET need to be
intercepted and additional code injected.

The existing retbleed mitigations already include means of redirecting
RET to __x86_return_thunk; this can be re-purposed and RET can be
redirected to another function doing RET accounting.

CALL accounting will use the function padding introduced in prior
patches. For each CALL instruction, the destination symbol's padding
is rewritten to do the accounting and the CALL instruction is adjusted
to call into the padding.

This ensures only affected CPUs pay the overhead of this accounting.
Unaffected CPUs will leave the padding unused and have their 'JMP
__x86_return_thunk' replaced with an actual 'RET' instruction.

Objtool has been modified to supply a .call_sites section that lists
all the 'CALL' instructions. Additionally the paravirt instruction
sites are iterated since they will have been patched from an indirect
call to direct calls (or direct instructions in which case it'll be
ignored).

Module handling and the actual thunk code for SKL will be added in
subsequent steps.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111147.470877038@infradead.org
2022-10-17 16:41:13 +02:00
Konrad Rzeszutek Wilk
697977d841 x86/kexec: Disable RET on kexec
All the invocations unroll to __x86_return_thunk and this file
must be PIC independent.

This fixes kexec on 64-bit AMD boxes.

  [ bp: Fix 32-bit build. ]

Reported-by: Edward Tran <edward.tran@oracle.com>
Reported-by: Awais Tanveer <awais.tanveer@oracle.com>
Suggested-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-07-09 13:12:32 +02:00
Peter Zijlstra
3e3f069504 x86/ibt: Annotate text references
Annotate away some of the generic code references. This is things
where we take the address of a symbol for exception handling or return
addresses (eg. context switch).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220308154318.877758523@infradead.org
2022-03-15 10:32:40 +01:00
Peter Zijlstra
af22700390 x86/ibt,kexec: Disable CET on kexec
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220308154318.641454603@infradead.org
2022-03-15 10:32:39 +01:00
Peter Zijlstra
f94909ceb1 x86: Prepare asm files for straight-line-speculation
Replace all ret/retq instructions with RET in preparation of making
RET a macro. Since AS is case insensitive it's a big no-op without
RET defined.

  find arch/x86/ -name \*.S | while read file
  do
	sed -i 's/\<ret[q]*\>/RET/' $file
  done

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211204134907.905503893@infradead.org
2021-12-08 12:25:37 +01:00
Tom Lendacky
32cb4d02fb x86/sme: Replace occurrences of sme_active() with cc_platform_has()
Replace uses of sme_active() with the more generic cc_platform_has()
using CC_ATTR_HOST_MEM_ENCRYPT. If future support is added for other
memory encryption technologies, the use of CC_ATTR_HOST_MEM_ENCRYPT
can be updated, as required.

This also replaces two usages of sev_active() that are really geared
towards detecting if SME is active.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-6-bp@alien8.de
2021-10-04 11:46:46 +02:00
Ingo Molnar
163b099146 x86: Fix various typos in comments, take #2
Fix another ~42 single-word typos in arch/x86/ code comments,
missed a few in the first pass, in particular in .S files.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-kernel@vger.kernel.org
2021-03-21 23:50:28 +01:00
Peter Zijlstra
36cc552055 x86/kexec: Make relocate_kernel_64.S objtool clean
Having fixed the biggest objtool issue in this file; fix up the rest
and remove the exception.

Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20200324160924.202621656@infradead.org
2020-03-25 18:28:28 +01:00
Peter Zijlstra
fc8bd77d64 x86/kexec: Use RIP relative addressing
Normally identity_mapped is not visible to objtool, due to:

  arch/x86/kernel/Makefile:OBJECT_FILES_NON_STANDARD_relocate_kernel_$(BITS).o := y

However, when we want to run objtool on vmlinux.o there is no hiding
it:

  vmlinux.o: warning: objtool: .text+0x4c0f1: unsupported intra-function call

Replace the (i386 inspired) pattern:

	call 1f
  1:	popq %r8
	subq $(1b - relocate_kernel), %r8

With a x86_64 RIP-relative LEA:

	leaq relocate_kernel(%rip), %r8

Suggested-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20200324160924.143334345@infradead.org
2020-03-25 18:28:27 +01:00
Jiri Slaby
6ec2a96824 x86/asm: Annotate relocate_kernel_{32,64}.c
There are functions in relocate_kernel_{32,64}.c which are not
annotated. This makes automatic annotations on them rather hard. So
annotate all the functions now.

Note that these are not C-like functions, so FUNC is not used. Instead
CODE markers are used. Also the functions are not aligned, so the
NOALIGN versions are used:

- SYM_CODE_START_NOALIGN
- SYM_CODE_START_LOCAL_NOALIGN
- SYM_CODE_END

The result is:
  0000   108 NOTYPE  GLOBAL DEFAULT    1 relocate_kernel
  006c   165 NOTYPE  LOCAL  DEFAULT    1 identity_mapped
  0146   127 NOTYPE  LOCAL  DEFAULT    1 swap_pages
  0111    53 NOTYPE  LOCAL  DEFAULT    1 virtual_mapped

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Alexios Zavras <alexios.zavras@intel.com>
Cc: Allison Randal <allison@lohutok.net>
Cc: Enrico Weigelt <info@metux.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-arch@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191011115108.12392-4-jslaby@suse.cz
2019-10-18 09:53:19 +02:00
Thomas Gleixner
40b0b3f8fb treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 230
Based on 2 normalized pattern(s):

  this source code is licensed under the gnu general public license
  version 2 see the file copying for more details

  this source code is licensed under general public license version 2
  see

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 52 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190602204653.449021192@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:06 +02:00
Kirill A. Shutemov
5bf3031699 x86/kexec: Make kexec (mostly) work in 5-level paging mode
Currently kexec() will crash when switching into a 5-level paging
enabled kernel.

I missed that we need to change relocate_kernel() to set CR4.LA57
flag if the kernel has 5-level paging enabled.

I avoided using #ifdef CONFIG_X86_5LEVEL here and inferred if we need to
enable 5-level paging from previous CR4 value. This way the code is
ready for boot-time switching between paging modes.

With this patch applied, in addition to kexec 4-to-4 which always worked,
we can kexec 4-to-5 and 5-to-5 - while 5-to-4 will need more work.

Reported-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Baoquan He <bhe@redhat.com>
Cc: <stable@vger.kernel.org> # v4.14+
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Fixes: 77ef56e4f0 ("x86: Enable 5-level paging support via CONFIG_X86_5LEVEL=y")
Link: http://lkml.kernel.org/r/20180129110845.26633-1-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-31 08:39:40 +01:00
Tom Lendacky
4e237903f9 x86/mm, kexec: Fix memory corruption with SME on successive kexecs
After issuing successive kexecs it was found that the SHA hash failed
verification when booting the kexec'd kernel.  When SME is enabled, the
change from using pages that were marked encrypted to now being marked as
not encrypted (through new identify mapped page tables) results in memory
corruption if there are any cache entries for the previously encrypted
pages. This is because separate cache entries can exist for the same
physical location but tagged both with and without the encryption bit.

To prevent this, issue a wbinvd if SME is active before copying the pages
from the source location to the destination location to clear any possible
cache entry conflicts.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: <kexec@lists.infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e7fb8610af3a93e8f8ae6f214cd9249adc0df2b4.1501186516.git.thomas.lendacky@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-30 12:09:12 +02:00
Denys Vlasenko
a734b4a23e x86/asm: Replace "MOVQ $imm, %reg" with MOVL
There is no reason to use MOVQ to load a non-negative immediate
constant value into a 64-bit register. MOVL does the same, since
the upper 32 bits are zero-extended by the CPU.

This makes the code a bit smaller, while leaving functionality
unchanged.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1427821211-25099-8-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-01 13:17:39 +02:00
Denys Vlasenko
3e1aa7cb59 x86/asm: Optimize unnecessarily wide TEST instructions
By the nature of the TEST operation, it is often possible to test
a narrower part of the operand:

    "testl $3,  mem"  ->  "testb $3, mem",
    "testq $3, %rcx"  ->  "testb $3, %cl"

This results in shorter instructions, because the TEST instruction
has no sign-entending byte-immediate forms unlike other ALU ops.

Note that this change does not create any LCP (Length-Changing Prefix)
stalls, which happen when adding a 0x66 prefix, which happens when
16-bit immediates are used, which changes such TEST instructions:

  [test_opcode] [modrm] [imm32]

to:

  [0x66] [test_opcode] [modrm] [imm16]

where [imm16] has a *different length* now: 2 bytes instead of 4.
This confuses the decoder and slows down execution.

REX prefixes were carefully designed to almost never hit this case:
adding REX prefix does not change instruction length except MOVABS
and MOV [addr],RAX instruction.

This patch does not add instructions which would use a 0x66 prefix,
code changes in assembly are:

    -48 f7 07 01 00 00 00 	testq  $0x1,(%rdi)
    +f6 07 01             	testb  $0x1,(%rdi)
    -48 f7 c1 01 00 00 00 	test   $0x1,%rcx
    +f6 c1 01             	test   $0x1,%cl
    -48 f7 c1 02 00 00 00 	test   $0x2,%rcx
    +f6 c1 02             	test   $0x2,%cl
    -41 f7 c2 01 00 00 00 	test   $0x1,%r10d
    +41 f6 c2 01          	test   $0x1,%r10b
    -48 f7 c1 04 00 00 00 	test   $0x4,%rcx
    +f6 c1 04             	test   $0x4,%cl
    -48 f7 c1 08 00 00 00 	test   $0x8,%rcx
    +f6 c1 08             	test   $0x8,%cl

Linus further notes:

   "There are no stalls from using 8-bit instruction forms.

    Now, changing from 64-bit or 32-bit 'test' instructions to 8-bit ones
    *could* cause problems if it ends up having forwarding issues, so that
    instead of just forwarding the result, you end up having to wait for
    it to be stable in the L1 cache (or possibly the register file). The
    forwarding from the store buffer is simplest and most reliable if the
    read is done at the exact same address and the exact same size as the
    write that gets forwarded.

    But that's true only if:

     (a) the write was very recent and is still in the write queue. I'm
         not sure that's the case here anyway.

     (b) on at least most Intel microarchitectures, you have to test a
         different byte than the lowest one (so forwarding a 64-bit write
         to a 8-bit read ends up working fine, as long as the 8-bit read
         is of the low 8 bits of the written data).

    A very similar issue *might* show up for registers too, not just
    memory writes, if you use 'testb' with a high-byte register (where
    instead of forwarding the value from the original producer it needs to
    go through the register file and then shifted). But it's mainly a
    problem for store buffers.

    But afaik, the way Denys changed the test instructions, neither of the
    above issues should be true.

    The real problem for store buffer forwarding tends to be "write 8
    bits, read 32 bits". That can be really surprisingly expensive,
    because the read ends up having to wait until the write has hit the
    cacheline, and we might talk tens of cycles of latency here. But
    "write 32 bits, read the low 8 bits" *should* be fast on pretty much
    all x86 chips, afaik."

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1425675332-31576-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-07 11:12:43 +01:00
H. Peter Anvin
f037e416af x86, reloc: Use xorl instead of xorq in relocate_kernel_64.S
There is no point in using "xorq" to clear a register... use "xorl" to
clear the bottom 32 bits, and the upper 32 bits get cleared by virtue
of zero extension.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
Link: http://lkml.kernel.org/n/tip-b76zi1gep39c0zs8fbvkhie9@git.kernel.org
2013-06-20 21:30:04 -07:00
Kees Cook
c8a22d19dd x86: Fix typo in kexec register clearing
Fixes a typo in register clearing code. Thanks to PaX Team for fixing
this originally, and James Troup for pointing it out.

Signed-off-by: Kees Cook <keescook@chromium.org>
Link: http://lkml.kernel.org/r/20130605184718.GA8396@www.outflux.net
Cc: <stable@vger.kernel.org> v2.6.30+
Cc: PaX Team <pageexec@freemail.hu>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-06-12 15:16:18 -07:00
Huang Ying
050438ed5a kexec, x86: Fix incorrect jump back address if not preserving context
In kexec jump support, jump back address passed to the kexeced
kernel via function calling ABI, that is, the function call
return address is the jump back entry.

Furthermore, jump back entry == 0 should be used to signal that
the jump back or preserve context is not enabled in the original
kernel.

But in the current implementation the stack position used for
function call return address is not cleared context
preservation is disabled. The patch fixes this bug.

Reported-and-tested-by: Yin Kangkai <kangkai.yin@intel.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/1310607277-25029-1-git-send-email-ying.huang@intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-21 11:19:28 +02:00