2019-06-03 07:44:50 +02:00
/* SPDX-License-Identifier: GPL-2.0-only */
2012-03-05 11:49:27 +00:00
/*
 * Low-level CPU initialisation
 * Based on arch/arm/kernel/head.S
 *
 * Copyright (C) 1994-2002 Russell King
 * Copyright (C) 2003-2012 ARM Ltd.
 * Authors: Catalin Marinas <catalin.marinas@arm.com>
 *          Will Deacon <will.deacon@arm.com>
 */

#include <linux/linkage.h>
#include <linux/init.h>
2020-06-08 21:32:42 -07:00
#include <linux/pgtable.h>
2012-03-05 11:49:27 +00:00

arm64: simplify ptrauth initialization
Currently __cpu_setup conditionally initializes the address
authentication keys and enables them in SCTLR_EL1, doing so differently
for the primary CPU and secondary CPUs, and skipping this work for CPUs
returning from an idle state. For the latter case, cpu_do_resume
restores the keys and SCTLR_EL1 value after the MMU has been enabled.
This flow is rather difficult to follow, so instead let's move the
primary and secondary CPU initialization into their respective boot
paths. By following the example of cpu_do_resume and doing so once the
MMU is enabled, we can always initialize the keys from the values in
thread_struct, and avoid the machinery necessary to pass the keys in
secondary_data or open-coding initialization for the boot CPU.
This means we perform an additional RMW of SCTLR_EL1, but we already do
this in the cpu_do_resume path, and for other features in cpufeature.c,
so this isn't a major concern in a bringup path. Note that even while
the enable bits are clear, the key registers are accessible.
As this now renders the argument to __cpu_setup redundant, let's also
remove that entirely. Future extensions can follow a similar approach to
initialize values that differ for primary/secondary CPUs.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Reviewed-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Cc: Amit Daniel Kachhap <amit.kachhap@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20200423101606.37601-3-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2020-04-23 11:16:06 +01:00
#include <asm/asm_pointer_auth.h>
2012-03-05 11:49:27 +00:00
#include <asm/assembler.h>
2016-04-18 17:09:47 +02:00
#include <asm/boot.h>
arm64: Implement stack trace termination record
Reliable stacktracing requires that we identify when a stacktrace is
terminated early. We can do this by ensuring all tasks have a final
frame record at a known location on their task stack, and checking
that this is the final frame record in the chain.
We'd like to use task_pt_regs(task)->stackframe as the final frame
record, as this is already set up upon exception entry from EL0. For
kernel tasks we need to consistently reserve the pt_regs and point x29
at this, which we can do with small changes to __primary_switched,
__secondary_switched, and copy_process().
Since the final frame record must be at a specific location, we must
create the final frame record in __primary_switched and
__secondary_switched rather than leaving this to start_kernel and
secondary_start_kernel. Thus, __primary_switched and
__secondary_switched will now show up in stacktraces for the idle tasks.
Since the final frame record is now identified by its location rather
than by its contents, we identify it at the start of unwind_frame(),
before we read any values from it.
External debuggers may terminate the stack trace when FP == 0. In the
pt_regs->stackframe, the PC is 0 as well. So, stack traces taken in the
debugger may print an extra record 0x0 at the end. While this is not
pretty, this does not do any harm. This is a small price to pay for
having reliable stack trace termination in the kernel. That said, gdb
does not show the extra record probably because it uses DWARF and not
frame pointers for stack traces.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
[Mark: rebase, use ASM_BUG(), update comments, update commit message]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20210510110026.18061-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2021-05-10 12:00:26 +01:00
#include <asm/bug.h>
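The commit message above terminates the unwind by recognising the final frame record by its location (task_pt_regs(task)->stackframe) before reading it, rather than by FP == 0. A minimal user-space C sketch of that rule, assuming a simplified {fp, lr} frame-record layout; `struct frame_record_sketch` and `unwind_sketch()` are hypothetical names, not the kernel's unwind_frame():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified AArch64 frame record: saved x29 (fp) then saved x30 (lr). */
struct frame_record_sketch {
	uintptr_t fp;	/* address of the caller's frame record */
	uintptr_t lr;	/* return address */
};

/*
 * Walk the chain starting at @fp, storing return addresses in @pcs.
 * Returns the number of entries stored if the walk reached @final (the
 * known location of the final record), or -1 if it ended anywhere else.
 */
static int unwind_sketch(uintptr_t fp, uintptr_t final,
			 uintptr_t *pcs, int max)
{
	int n = 0;

	while (n < max) {
		/* Identify the final record by location, before reading it. */
		if (fp == final)
			return n;
		/* Any other end of the chain means the trace is unreliable. */
		if (fp == 0)
			return -1;

		const struct frame_record_sketch *rec =
			(const struct frame_record_sketch *)fp;
		pcs[n++] = rec->lr;
		fp = rec->fp;
	}
	return -1;	/* deeper than @max: also treated as unreliable */
}
```

Because the location check comes first, a chain that reaches FP == 0 anywhere other than the reserved pt_regs slot is reported as terminated early instead of silently ending.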
2012-03-05 11:49:27 +00:00
#include <asm/ptrace.h>
#include <asm/asm-offsets.h>
2014-03-26 18:25:55 +00:00
#include <asm/cache.h>
2012-08-29 18:32:18 +01:00
#include <asm/cputype.h>
2020-12-02 18:41:04 +00:00
#include <asm/el2_setup.h>
2016-01-26 09:13:44 +01:00
#include <asm/elf.h>
2018-11-15 14:52:46 +09:00
#include <asm/image.h>
2015-10-19 14:19:27 +01:00
#include <asm/kernel-pgtable.h>
2014-02-19 09:33:14 +00:00
#include <asm/kvm_arm.h>
2012-03-05 11:49:27 +00:00
#include <asm/memory.h>
#include <asm/pgtable-hwdef.h>
#include <asm/page.h>
2020-04-27 09:00:16 -07:00
#include <asm/scs.h>
2016-02-23 10:31:42 +00:00
#include <asm/smp.h>
2015-10-19 14:19:35 +01:00
#include <asm/sysreg.h>
arm64: stacktrace: unwind exception boundaries
When arm64's stack unwinder encounters an exception boundary, it uses
the pt_regs::stackframe created by the entry code, which has a copy of
the PC and FP at the time the exception was taken. The unwinder doesn't
know anything about pt_regs, and reports the PC from the stackframe, but
does not report the LR.
The LR is only guaranteed to contain the return address at function call
boundaries, and can be used as a scratch register at other times, so the
LR at an exception boundary may or may not be a legitimate return
address. It would be useful to report the LR value regardless, as it can
be helpful when debugging, and in future it will be helpful for reliable
stacktrace support.
This patch changes the way we unwind across exception boundaries,
allowing both the PC and LR to be reported. The entry code creates a
frame_record_meta structure embedded within pt_regs, which the unwinder
uses to find the pt_regs. The unwinder can then extract pt_regs::pc and
pt_regs::lr as two separate unwind steps before continuing with a
regular walk of frame records.
When a PC is unwound from pt_regs::lr, dump_backtrace() will log this
with an "L" marker so that it can be identified easily. For example,
an unwind across an exception boundary will appear as follows:
| el1h_64_irq+0x6c/0x70
| _raw_spin_unlock_irqrestore+0x10/0x60 (P)
| __aarch64_insn_write+0x6c/0x90 (L)
| aarch64_insn_patch_text_nosync+0x28/0x80
... with a (P) entry for pt_regs::pc, and an (L) entry for pt_regs::lr.
Note that the LR may be stale at the point of the exception, for example,
shortly after a return:
| el1h_64_irq+0x6c/0x70
| default_idle_call+0x34/0x180 (P)
| default_idle_call+0x28/0x180 (L)
| do_idle+0x204/0x268
... where the LR points a few instructions before the current PC.
This plays nicely with all the other unwind metadata tracking. With the
ftrace_graph profiler enabled globally, and kretprobes installed on
generic_handle_domain_irq() and do_interrupt_handler(), a backtrace triggered
by magic-sysrq + L reports:
| Call trace:
| show_stack+0x20/0x40 (CF)
| dump_stack_lvl+0x60/0x80 (F)
| dump_stack+0x18/0x28
| nmi_cpu_backtrace+0xfc/0x140
| nmi_trigger_cpumask_backtrace+0x1c8/0x200
| arch_trigger_cpumask_backtrace+0x20/0x40
| sysrq_handle_showallcpus+0x24/0x38 (F)
| __handle_sysrq+0xa8/0x1b0 (F)
| handle_sysrq+0x38/0x50 (F)
| pl011_int+0x460/0x5a8 (F)
| __handle_irq_event_percpu+0x60/0x220 (F)
| handle_irq_event+0x54/0xc0 (F)
| handle_fasteoi_irq+0xa8/0x1d0 (F)
| generic_handle_domain_irq+0x34/0x58 (F)
| gic_handle_irq+0x54/0x140 (FK)
| call_on_irq_stack+0x24/0x58 (F)
| do_interrupt_handler+0x88/0xa0
| el1_interrupt+0x34/0x68 (FK)
| el1h_64_irq_handler+0x18/0x28
| el1h_64_irq+0x6c/0x70
| default_idle_call+0x34/0x180 (P)
| default_idle_call+0x28/0x180 (L)
| do_idle+0x204/0x268
| cpu_startup_entry+0x3c/0x50 (F)
| rest_init+0xe4/0xf0
| start_kernel+0x744/0x750
| __primary_switched+0x88/0x98
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Puranjay Mohan <puranjay12@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241017092538.1859841-11-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17 10:25:38 +01:00
#include <asm/stacktrace/frame.h>
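The exception-boundary commit above describes a frame_record_meta embedded in pt_regs that lets the unwinder find the saved registers and report pt_regs::pc (P) and pt_regs::lr (L) as two separate steps before resuming the normal frame walk. The sketch below models that walk in user-space C. The struct layouts, the type values, and the use of a zeroed {fp, lr} pair to mark a meta record are assumptions of this sketch, loosely patterned on the description; they are not the definitions from asm/stacktrace/frame.h:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ordinary frame record: saved x29 then saved x30. */
struct frame_record_x {
	uintptr_t fp;
	uintptr_t lr;
};

/* Hypothetical meta-record types used only by this sketch. */
enum { META_NONE = 0, META_FINAL = 1, META_PT_REGS = 2 };

/* Hypothetical meta record: a zeroed frame record plus a type tag. */
struct frame_record_meta_x {
	struct frame_record_x record;
	uint64_t type;
};

/* Hypothetical cut-down pt_regs holding only what the unwinder reads. */
struct regs_x {
	uintptr_t pc;
	uintptr_t lr;
	uintptr_t fp;				/* x29 when the exception was taken */
	struct frame_record_meta_x stackframe;	/* embedded meta record */
};

struct entry_x {
	uintptr_t pc;
	char src;	/* ' ' (frame record), 'P' (pt_regs::pc), 'L' (pt_regs::lr) */
};

/* Returns entries written, or -1 if the walk did not end at a FINAL record. */
static int unwind_meta_sketch(uintptr_t fp, struct entry_x *out, int max)
{
	int n = 0;

	while (fp != 0 && n + 2 <= max) {
		const struct frame_record_x *rec = (const struct frame_record_x *)fp;

		if (rec->fp == 0 && rec->lr == 0) {
			const struct frame_record_meta_x *meta =
				(const struct frame_record_meta_x *)fp;
			const struct regs_x *regs;

			if (meta->type == META_FINAL)
				return n;
			if (meta->type != META_PT_REGS)
				return -1;

			/* container_of(): step back from the embedded record. */
			regs = (const struct regs_x *)((const char *)meta -
					offsetof(struct regs_x, stackframe));
			out[n].pc = regs->pc;
			out[n++].src = 'P';
			out[n].pc = regs->lr;	/* may be stale, see above */
			out[n++].src = 'L';
			fp = regs->fp;		/* resume the interrupted chain */
			continue;
		}

		out[n].pc = rec->lr;
		out[n++].src = ' ';
		fp = rec->fp;
	}
	return -1;
}
```

Resuming from the saved fp after the (L) entry is what lets a caller such as do_idle follow the two default_idle_call entries in the example traces above.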
2015-10-19 14:19:35 +01:00
#include <asm/thread_info.h>
2012-10-26 15:40:05 +01:00
#include <asm/virt.h>
2012-03-05 11:49:27 +00:00

2017-03-23 19:00:46 +00:00
#include "efi-header.S"

2020-08-25 15:54:40 +02:00
#if (PAGE_OFFSET & 0x1fffff) != 0
2014-06-24 16:51:37 +01:00
#error PAGE_OFFSET must be at least 2MB aligned
2012-03-05 11:49:27 +00:00
#endif
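The guard above rejects any PAGE_OFFSET with one of its low 21 bits set: 0x1fffff is 2 MiB - 1, so the masked value is zero exactly for multiples of 2 MiB. The same arithmetic as a small C check; the macro and function names are hypothetical and the addresses used to exercise it are arbitrary examples, not real configurations:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 2 MiB = 1 << 21; the mask selects the offset within a 2 MiB block. */
#define SZ_2M_EXAMPLE	(UINT64_C(1) << 21)
#define ALIGN_MASK_2M	(SZ_2M_EXAMPLE - 1)

_Static_assert(ALIGN_MASK_2M == 0x1fffff, "mask covers the low 21 bits");

/* Inverse of the preprocessor test "(PAGE_OFFSET & 0x1fffff) != 0". */
static bool is_2m_aligned(uint64_t va)
{
	return (va & ALIGN_MASK_2M) == 0;
}
```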

/*
 * Kernel startup entry point.
 * ---------------------------
 *
 * The requirements are:
 *   MMU = off, D-cache = off, I-cache = on or off,
 *   x0 = physical address of the FDT blob.
 *
 * Note that the callee-saved registers are used for storing variables
 * that are useful before the MMU is enabled. The allocations are described
 * in the entry routines.
 */
__HEAD
/*
 * DO NOT MODIFY. Image header expected by Linux boot-loaders.
 */
2020-11-17 13:47:29 +01:00
efi_signature_nop // special NOP to identify as PE/COFF executable
2020-03-26 18:14:23 +01:00
b primary_entry // branch to kernel start, magic
2020-08-25 15:54:40 +02:00
.quad 0 // Image load offset from start of RAM, little-endian
2015-12-26 13:48:02 +01:00
le64sym _kernel_size_le // Effective size of kernel image, little-endian
le64sym _kernel_flags_le // Informative flags, little-endian
2013-08-15 00:10:00 +01:00
.quad 0 // reserved
.quad 0 // reserved
.quad 0 // reserved
2018-11-15 14:52:46 +09:00
.ascii ARM64_IMAGE_MAGIC // Magic number
2020-11-17 13:47:29 +01:00
.long .Lpe_header_offset // Offset to the PE header.
2014-04-15 22:47:52 -04:00

2017-03-23 19:00:46 +00:00
__EFI_PE_HEADER
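The directives above lay out the 64-byte Image header that boot loaders parse. A C sketch of the same layout and of a loader-side magic check follows. The field names are modelled on struct arm64_image_header from asm/image.h (included above), which remains the authoritative definition; `parse_image_header()` and the byte-wise helpers are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout of the header emitted above; all fields are little-endian. */
struct arm64_image_header_sketch {
	uint32_t code0;		/* efi_signature_nop */
	uint32_t code1;		/* b primary_entry */
	uint64_t text_offset;	/* image load offset (.quad 0) */
	uint64_t image_size;	/* _kernel_size_le */
	uint64_t flags;		/* _kernel_flags_le */
	uint64_t res2, res3, res4;
	uint32_t magic;		/* ARM64_IMAGE_MAGIC, "ARM\x64" */
	uint32_t res5;		/* .Lpe_header_offset */
};

_Static_assert(sizeof(struct arm64_image_header_sketch) == 64, "64-byte header");
_Static_assert(offsetof(struct arm64_image_header_sketch, magic) == 56, "magic at 0x38");

/* Byte-wise little-endian loads, independent of host byte order. */
static uint32_t get_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static uint64_t get_le64(const unsigned char *p)
{
	return (uint64_t)get_le32(p) | (uint64_t)get_le32(p + 4) << 32;
}

/* Returns 1 and fills the outputs if @buf starts with an arm64 Image header. */
static int parse_image_header(const unsigned char *buf, size_t len,
			      uint64_t *image_size, uint32_t *pe_offset)
{
	if (len < 64 || memcmp(buf + 56, "ARM\x64", 4) != 0)
		return 0;
	*image_size = get_le64(buf + 16);
	*pe_offset = get_le32(buf + 60);
	return 1;
}
```

Reading fields byte by byte rather than casting the buffer to the struct keeps the sketch correct on big-endian hosts, which matters because the header is little-endian even for a big-endian kernel.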
2012-03-05 11:49:27 +00:00
arm64: fix .idmap.text assertion for large kernels
When building a kernel with many debug options enabled (which happens in
test configurations used by myself and syzbot), the kernel can become
large enough that portions of .text can be more than 128M away from
.idmap.text (which is placed inside the .rodata section). Where idmap
code branches into .text, the linker will place veneers in the
.idmap.text section to make those branches possible.
Unfortunately, as Ard reports, GNU LD has been observed to add 4K of
padding when adding such veneers, e.g.
| .idmap.text 0xffffffc01e48e5c0 0x32c arch/arm64/mm/proc.o
| 0xffffffc01e48e5c0 idmap_cpu_replace_ttbr1
| 0xffffffc01e48e600 idmap_kpti_install_ng_mappings
| 0xffffffc01e48e800 __cpu_setup
| *fill* 0xffffffc01e48e8ec 0x4
| .idmap.text.stub
| 0xffffffc01e48e8f0 0x18 linker stubs
| 0xffffffc01e48f8f0 __idmap_text_end = .
| 0xffffffc01e48f000 . = ALIGN (0x1000)
| *fill* 0xffffffc01e48f8f0 0x710
| 0xffffffc01e490000 idmap_pg_dir = .
This makes the __idmap_text_start .. __idmap_text_end region bigger than
the 4K we require it to fit within, and triggers an assertion in arm64's
vmlinux.lds.S, which breaks the build:
| LD .tmp_vmlinux.kallsyms1
| aarch64-linux-gnu-ld: ID map text too big or misaligned
| make[1]: *** [scripts/Makefile.vmlinux:35: vmlinux] Error 1
| make: *** [Makefile:1264: vmlinux] Error 2
Avoid this by using an `ADRP+ADD+BLR` sequence for branches out of
.idmap.text, which avoids the need for veneers. These branches are only
executed once per boot, and only when the MMU is on, so there should be
no noticeable performance penalty in replacing `BL` with `ADRP+ADD+BLR`.
At the same time, remove the "x" and "w" attributes when placing code in
.idmap.text, as these are not necessary, and this will prevent the
linker from assuming that it is safe to place PLTs into .idmap.text,
causing it to warn if and when there are out-of-range branches within
.idmap.text, e.g.
| LD .tmp_vmlinux.kallsyms1
| arch/arm64/kernel/head.o: in function `primary_entry':
| (.idmap.text+0x1c): relocation truncated to fit: R_AARCH64_CALL26 against symbol `dcache_clean_poc' defined in .text section in arch/arm64/mm/cache.o
| arch/arm64/kernel/head.o: in function `init_el2':
| (.idmap.text+0x88): relocation truncated to fit: R_AARCH64_CALL26 against symbol `dcache_clean_poc' defined in .text section in arch/arm64/mm/cache.o
| make[1]: *** [scripts/Makefile.vmlinux:34: vmlinux] Error 1
| make: *** [Makefile:1252: vmlinux] Error 2
Thus, if future changes add out-of-range branches in .idmap.text, it
should be easy enough to identify those from the resulting linker
errors.
Reported-by: syzbot+f8ac312e31226e23302b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-arm-kernel/00000000000028ea4105f4e2ef54@google.com/
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Will Deacon <will@kernel.org>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20230220162317.1581208-1-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2023-02-20 16:23:17 +00:00
.section ".idmap.text","a"
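The 128M figure in the commit message above comes from the BL encoding: a signed 26-bit immediate counted in 4-byte words (the R_AARCH64_CALL26 relocation in the quoted errors). The C sketch below, with hypothetical helper names, illustrates the encoding limits rather than any linker logic, and compares that reach with the page-granular ±4 GiB of the ADRP used in the `ADRP+ADD+BLR` replacement:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * BL: signed 26-bit word offset, so the byte offset must be a multiple
 * of 4 in [-2^27, 2^27 - 4], i.e. +/-128 MiB.
 */
static bool bl_reachable(uint64_t from, uint64_t to)
{
	int64_t off = (int64_t)(to - from);

	if (off & 3)
		return false;
	return off >= -(INT64_C(1) << 27) && off <= (INT64_C(1) << 27) - 4;
}

/*
 * ADRP: signed 21-bit offset in 4 KiB pages, i.e. +/-4 GiB; the following
 * ADD supplies the low 12 bits, and BLR needs no immediate at all.
 */
static bool adrp_reachable(uint64_t from, uint64_t to)
{
	int64_t pages = (int64_t)((to >> 12) - (from >> 12));

	return pages >= -(INT64_C(1) << 20) && pages <= (INT64_C(1) << 20) - 1;
}
```

For example, __cpu_setup at 0xffffffc01e48e800 in the quoted link map could not reach a target 128 MiB further on with a direct BL, whereas ADRP still can.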
2016-03-30 17:43:07 +02:00

2016-08-31 12:05:17 +01:00
/*
 * The following callee saved general purpose registers are used on the
 * primary lowlevel boot path:
 *
 *  Register   Scope                                    Purpose
2023-01-11 11:22:33 +01:00
 *  x19        primary_entry() .. start_kernel()        whether we entered with the MMU on
2022-06-24 17:06:48 +02:00
 *  x20        primary_entry() .. __primary_switch()    CPU boot mode
2020-03-26 18:14:23 +01:00
 *  x21        primary_entry() .. start_kernel()        FDT pointer passed at boot in x0
2016-08-31 12:05:17 +01:00
 */
2020-03-26 18:14:23 +01:00
SYM_CODE_START(primary_entry)
2023-01-11 11:22:33 +01:00
bl record_mmu_state
2015-03-17 10:55:12 +01:00
bl preserve_boot_args
arm64: kernel: Create initial ID map from C code
The asm code that creates the initial ID map is rather intricate and
hard to follow. This is problematic because it makes adding support for
things like LPA2 or WXN more difficult than necessary. Also, it is
parameterized like the rest of the MM code to run with a configurable
number of levels, which is rather pointless, given that all AArch64 CPUs
implement support for 48-bit virtual addressing, and that many systems
exist with DRAM located outside of the 39-bit addressable range, which
is the only smaller VA size that is widely used, and we need additional
tricks to make things work in that combination.
So let's bite the bullet, and rip out all the asm macros, and fiddly
code, and replace it with a C implementation based on the newly added
routines for creating the early kernel VA mappings. And while at it,
create the initial ID map based on 48-bit virtual addressing as well,
regardless of the number of configured levels for the kernel proper.
Note that this code may execute with the MMU and caches disabled, and is
therefore not permitted to make unaligned accesses. This shouldn't
generally happen in any case for the algorithm as implemented, but to be
sure, let's pass -mstrict-align to the compiler just in case.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240214122845.2033971-66-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-14 13:29:07 +01:00

adrp x1, early_init_stack
mov sp, x1
mov x29, xzr
2025-05-08 13:43:30 +02:00
adrp x0, __pi_init_idmap_pg_dir
arm64: Enable LPA2 at boot if supported by the system
Update the early kernel mapping code to take 52-bit virtual addressing
into account based on the LPA2 feature. This is a bit more involved than
LVA (which is supported with 64k pages only), given that some page table
descriptor bits change meaning in this case.
To keep the handling in asm to a minimum, the initial ID map is still
created with 48-bit virtual addressing, which implies that the kernel
image must be loaded into 48-bit addressable physical memory. This is
currently required by the boot protocol, even though we happen to
support placement outside of that for LVA/64k based configurations.
Enabling LPA2 involves more than setting TCR.T1SZ to a lower value;
there is also a DS bit in TCR that needs to be set, and which changes
the meaning of bits [9:8] in all page table descriptors. Since we cannot
enable DS and every live page table descriptor at the same time, let's
pivot through another temporary mapping. This avoids the need to
reintroduce manipulations of the page tables with the MMU and caches
disabled.
To permit the LPA2 feature to be overridden on the kernel command line,
which may be necessary to work around silicon errata, or to deal with
mismatched features on heterogeneous SoC designs, test for CPU feature
overrides first, and only then enable LPA2.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240214122845.2033971-78-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-14 13:29:19 +01:00
mov x1, xzr
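The LPA2 commit message above mentions lowering TCR.T1SZ. The field encodes 64 minus the number of VA bits translated through TTBR1, so 48-bit VAs need 16 and 52-bit VAs need 12. A C sketch of that arithmetic, assuming the T1SZ field sits at TCR_EL1 bits [21:16]; the DS bit that the commit also requires is deliberately not modelled, and the macro and function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field position: TCR_EL1.T1SZ occupies bits [21:16]. */
#define TCR_T1SZ_SHIFT_SKETCH	16
#define TCR_T1SZ_MASK_SKETCH	(UINT64_C(0x3f) << TCR_T1SZ_SHIFT_SKETCH)

/* T1SZ = 64 - (number of VA bits covered by TTBR1). */
static unsigned int t1sz_for_va_bits(unsigned int va_bits)
{
	return 64 - va_bits;
}

/* Replace only the T1SZ field, leaving every other TCR bit untouched. */
static uint64_t tcr_set_t1sz(uint64_t tcr, unsigned int va_bits)
{
	tcr &= ~TCR_T1SZ_MASK_SKETCH;
	return tcr | ((uint64_t)t1sz_for_va_bits(va_bits) << TCR_T1SZ_SHIFT_SKETCH);
}
```

As the commit explains, the smaller T1SZ alone is not enough: the DS bit changes the meaning of descriptor bits [9:8], which is why the kernel pivots through a temporary mapping instead of flipping both at once.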
2024-02-14 13:29:07 +01:00
bl __pi_create_init_idmap
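The commit that moved ID map creation to C builds the map with 48-bit virtual addressing regardless of the kernel's configured number of levels. Assuming a 4 KiB granule (four levels of 512-entry tables), the table index at each level for an identity-mapped address is a 9-bit slice of the physical address, as the hypothetical helpers below show. This is index arithmetic only, not __pi_create_init_idmap itself, which may use block mappings and other granules:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT_4K	12
#define BITS_PER_LEVEL	9	/* 512 eight-byte descriptors per 4 KiB table */
#define IDMAP_VA_BITS	48

/* Level 0 is the root table; level 3 maps individual 4 KiB pages. */
static unsigned int idmap_index(uint64_t addr, int level)
{
	unsigned int shift = PAGE_SHIFT_4K + BITS_PER_LEVEL * (3 - level);

	return (unsigned int)((addr >> shift) & ((1u << BITS_PER_LEVEL) - 1));
}

/* VA == PA in an ID map, so 48-bit VAs can only cover PAs below 2^48. */
static bool idmap_can_cover(uint64_t pa)
{
	return (pa >> IDMAP_VA_BITS) == 0;
}
```

The coverage check mirrors the LPA2 commit's note that the kernel image must be loaded into 48-bit addressable physical memory.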

/*
 * If the page tables have been populated with non-cacheable
 * accesses (MMU disabled), invalidate those tables again to
 * remove any speculatively loaded cache lines.
 */
cbnz x19, 0f
dmb sy
mov x1, x0 // end of used region
2025-05-08 13:43:30 +02:00
adrp x0, __pi_init_idmap_pg_dir
2024-02-14 13:29:07 +01:00
adr_l x2, dcache_inval_poc
blr x2
b 1f
2023-01-11 11:22:35 +01:00

/*
 * If we entered with the MMU and caches on, clean the ID mapped part
 * of the primary boot code to the PoC so we can safely execute it with
 * the MMU off.
 */
2024-02-14 13:29:07 +01:00
0: adrp x0, __idmap_text_start
2023-01-11 11:22:35 +01:00
adr_l x1, __idmap_text_end
2023-02-20 16:23:17 +00:00
adr_l x2, dcache_clean_poc
blr x2
2024-02-14 13:29:07 +01:00

1: mov x0, x19
2020-11-13 12:49:23 +00:00
bl init_kernel_el // w0=cpu_boot_mode
2022-06-24 17:06:48 +02:00
mov x20, x0
2022-06-24 17:06:37 +02:00

2012-03-05 11:49:27 +00:00
/*
2015-03-18 14:55:20 +00:00
 * The following calls CPU setup code, see arch/arm64/mm/proc.S for
 * details.
2012-03-05 11:49:27 +00:00
 * On return, the CPU will be ready for the MMU to be turned on and
 * the TCR will have been set.
 */
2016-04-18 17:09:43 +02:00
bl __cpu_setup // initialise processor
2016-08-31 12:05:13 +01:00
b __primary_switch
2020-03-26 18:14:23 +01:00
SYM_CODE_END(primary_entry)
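primary_entry picks one of two cache maintenance operations depending on x19, as the two comments inside it explain. A C sketch of that decision, with hypothetical names and arbitrary addresses; the real code passes the resulting [x0, x1) range to dcache_inval_poc or dcache_clean_poc and issues a dmb sy before the invalidate:

```c
#include <assert.h>
#include <stdint.h>

enum cache_op_sketch { OP_INVAL_POC, OP_CLEAN_POC };

struct maintenance_sketch {
	enum cache_op_sketch op;
	uintptr_t start;	/* passed in x0 */
	uintptr_t end;		/* passed in x1 */
};

#define PAGE_MASK_4K	(~(uintptr_t)0xfff)	/* ADRP yields a page address */

static struct maintenance_sketch
primary_entry_maintenance(uint64_t x19, uintptr_t idmap_pg_dir,
			  uintptr_t pg_used_end, uintptr_t idmap_text_start,
			  uintptr_t idmap_text_end)
{
	struct maintenance_sketch m;

	if (x19 == 0) {
		/*
		 * MMU was off: the tables were written with non-cacheable
		 * accesses, so discard lines that may have been fetched
		 * speculatively in the meantime.
		 */
		m.op = OP_INVAL_POC;
		m.start = idmap_pg_dir & PAGE_MASK_4K;
		m.end = pg_used_end;	/* end of used region from the C helper */
	} else {
		/*
		 * MMU and caches were on: clean the ID-mapped code to the PoC
		 * so it can be fetched correctly once the MMU is turned off.
		 */
		m.op = OP_CLEAN_POC;
		m.start = idmap_text_start & PAGE_MASK_4K;
		m.end = idmap_text_end;
	}
	return m;
}
```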
2012-03-05 11:49:27 +00:00

2023-01-11 11:22:35 +01:00
__INIT
2023-01-11 11:22:33 +01:00
SYM_CODE_START_LOCAL(record_mmu_state)
mrs x19, CurrentEL
cmp x19, #CurrentEL_EL2
mrs x19, sctlr_el1
b.ne 0f
mrs x19, sctlr_el2
2023-01-25 19:59:10 +01:00
0:
CPU_LE( tbnz x19, #SCTLR_ELx_EE_SHIFT, 1f )
CPU_BE( tbz x19, #SCTLR_ELx_EE_SHIFT, 1f )
tst x19, #SCTLR_ELx_C // Z := (C == 0)
2023-01-11 11:22:33 +01:00
and x19, x19, #SCTLR_ELx_M // isolate M bit
csel x19, xzr, x19, eq // clear x19 if Z
ret
2023-01-25 19:59:10 +01:00

/*
 * Set the correct endianness early so all memory accesses issued
 * before init_kernel_el() occur in the correct byte order. Note that
 * this means the MMU must be disabled, or the active ID map will end
 * up getting interpreted with the wrong byte order.
 */
1: eor x19, x19, #SCTLR_ELx_EE
bic x19, x19, #SCTLR_ELx_M
b.ne 2f
pre_disable_mmu_workaround
msr sctlr_el2, x19
b 3f
2023-04-25 15:27:00 +05:30
2: pre_disable_mmu_workaround
msr sctlr_el1, x19
2023-01-25 19:59:10 +01:00
3: isb
mov x19, xzr
ret
2023-01-11 11:22:33 +01:00
SYM_CODE_END(record_mmu_state)
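The value record_mmu_state leaves in x19 can be modelled in C as below. The bit positions (M = bit 0, C = bit 2, EE = bit 25 of SCTLR_ELx) are the architectural ones; the function name is hypothetical, and the model ignores which exception level's SCTLR is read as well as pre_disable_mmu_workaround and the isb:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SCTLR_ELx_M	(UINT64_C(1) << 0)	/* MMU enable */
#define SCTLR_ELx_C	(UINT64_C(1) << 2)	/* data cache enable */
#define SCTLR_ELx_EE	(UINT64_C(1) << 25)	/* big-endian data accesses */

/*
 * Returns the value left in x19 and stores in @sctlr_out the SCTLR value
 * in effect afterwards (unchanged when no write happens).
 */
static uint64_t record_mmu_state_model(uint64_t sctlr, bool kernel_is_be,
				       uint64_t *sctlr_out)
{
	bool ee = (sctlr & SCTLR_ELx_EE) != 0;

	*sctlr_out = sctlr;
	if (ee != kernel_is_be) {
		/* Label 1: flip EE, turn the MMU off, report "MMU off". */
		*sctlr_out = (sctlr ^ SCTLR_ELx_EE) & ~SCTLR_ELx_M;
		return 0;
	}
	/* tst/and/csel: keep the M bit only if the D-cache is also on. */
	return (sctlr & SCTLR_ELx_C) ? (sctlr & SCTLR_ELx_M) : 0;
}
```

The csel is why "MMU on, caches off" is reported as x19 == 0: with the data cache disabled, the boot path must still perform the MMU-off style cache maintenance.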

2015-03-17 10:55:12 +01:00
/*
 * Preserve the arguments passed by the bootloader in x0 .. x3
 */
2020-02-18 19:58:34 +00:00
SYM_CODE_START_LOCAL(preserve_boot_args)
2015-03-17 10:55:12 +01:00
mov x21, x0 // x21=FDT

adr_l x0, boot_args // record the contents of
stp x21, x1, [x0] // x0 .. x3 at kernel entry
stp x2, x3, [x0, #16]

2023-01-11 11:22:33 +01:00
cbnz x19, 0f // skip cache invalidation if MMU is on
2015-03-17 10:55:12 +01:00
dmb sy // needed before dc ivac with
// MMU off

2021-05-24 09:29:53 +01:00
add x1, x0, #0x20 // 4 x 8 bytes
arm64: Rename arm64-internal cache maintenance functions
Although naming across the codebase isn't that consistent, it
tends to follow certain patterns. Moreover, the term "flush"
isn't defined in the Arm Architecture reference manual, and might
be interpreted to mean clean, invalidate, or both for a cache.
Rename arm64-internal functions to make the naming internally
consistent, as well as making it consistent with the Arm ARM, by
specifying whether it applies to the instruction, data, or both
caches, whether the operation is a clean, invalidate, or both.
Also specify which point the operation applies to, i.e., to the
point of unification (PoU), coherency (PoC), or persistence
(PoP).
This commit applies the following sed transformation to all files
under arch/arm64:
"s/\b__flush_cache_range\b/caches_clean_inval_pou_macro/g;"\
"s/\b__flush_icache_range\b/caches_clean_inval_pou/g;"\
"s/\binvalidate_icache_range\b/icache_inval_pou/g;"\
"s/\b__flush_dcache_area\b/dcache_clean_inval_poc/g;"\
"s/\b__inval_dcache_area\b/dcache_inval_poc/g;"\
"s/__clean_dcache_area_poc\b/dcache_clean_poc/g;"\
"s/\b__clean_dcache_area_pop\b/dcache_clean_pop/g;"\
"s/\b__clean_dcache_area_pou\b/dcache_clean_pou/g;"\
"s/\b__flush_cache_user_range\b/caches_clean_inval_user_pou/g;"\
"s/\b__flush_icache_all\b/icache_inval_all_pou/g;"
Note that __clean_dcache_area_poc is deliberately missing a word
boundary check at the beginning in order to match the efistub
symbols in image-vars.h.
Also note that, despite its name, __flush_icache_range operates
on both instruction and data caches. The name change here
reflects that.
No functional change intended.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20210524083001.2586635-19-tabba@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2021-05-24 09:30:01 +01:00
b dcache_inval_poc // tail call
2023-01-11 11:22:33 +01:00
0: str_l x19, mmu_enabled_at_boot, x0
ret
2020-02-18 19:58:34 +00:00
SYM_CODE_END(preserve_boot_args)
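preserve_boot_args writes four 8-byte values (x0 .. x3) at boot_args, which is the 0x20-byte range it hands to dcache_inval_poc when the MMU was off at entry. The C below is a rough model, not kernel code: it counts how many cache lines a by-VA maintenance loop visits for a byte range. The line size is a plain parameter here, whereas the kernel reads the minimum D-cache line size from CTR_EL0; dcache_inval_poc also uses clean+invalidate (DC CIVAC) on partially covered lines at either end so that neighbouring data is not discarded, which this sketch does not model. The names BOOT_ARGS_BYTES and lines_touched are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* preserve_boot_args stores x0 .. x3: four 8-byte values, i.e. 0x20 bytes. */
#define BOOT_ARGS_BYTES (4 * sizeof(uint64_t))

/*
 * Number of cache lines a by-VA maintenance loop visits for [start, end).
 * line_size must be a power of two (assumed; the kernel derives it from CTR_EL0).
 */
static size_t lines_touched(uintptr_t start, uintptr_t end, size_t line_size)
{
	assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
	if (end <= start)
		return 0;

	uintptr_t mask = (uintptr_t)line_size - 1;
	uintptr_t first = start & ~mask;	/* round start down to a line */
	uintptr_t last = (end + mask) & ~mask;	/* round end up to a line */
	return (size_t)((last - first) / line_size);
}
```

With 64-byte lines, a 0x20-byte record that starts on a line boundary touches one line, while one starting at offset 0x30 within a line spills into a second.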
2015-03-17 10:55:12 +01:00
arm64: Implement stack trace termination record
Reliable stacktracing requires that we identify when a stacktrace is
terminated early. We can do this by ensuring all tasks have a final
frame record at a known location on their task stack, and checking
that this is the final frame record in the chain.
We'd like to use task_pt_regs(task)->stackframe as the final frame
record, as this is already set up upon exception entry from EL0. For
kernel tasks we need to consistently reserve the pt_regs and point x29
at this, which we can do with small changes to __primary_switched,
__secondary_switched, and copy_process().
Since the final frame record must be at a specific location, we must
create the final frame record in __primary_switched and
__secondary_switched rather than leaving this to start_kernel and
secondary_start_kernel. Thus, __primary_switched and
__secondary_switched will now show up in stacktraces for the idle tasks.
Since the final frame record is now identified by its location rather
than by its contents, we identify it at the start of unwind_frame(),
before we read any values from it.
External debuggers may terminate the stack trace when FP == 0. In the
pt_regs->stackframe, the PC is 0 as well. So, stack traces taken in the
debugger may print an extra record 0x0 at the end. While this is not
pretty, this does not do any harm. This is a small price to pay for
having reliable stack trace termination in the kernel. That said, gdb
does not show the extra record probably because it uses DWARF and not
frame pointers for stack traces.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
[Mark: rebase, use ASM_BUG(), update comments, update commit message]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20210510110026.18061-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2021-05-10 12:00:26 +01:00
/*
2021-05-20 12:50:30 +01:00
* Initialize CPU registers with task-specific and cpu-specific context.
*
arm64: Implement stack trace termination record
2021-05-10 12:00:26 +01:00
* Create a final frame record at task_pt_regs(current)->stackframe, so
* that the unwinder can identify the final frame record of any task by
* its location in the task stack. We reserve the entire pt_regs space
* for consistency with user tasks and kthreads.
*/
2021-05-20 12:50:31 +01:00
.macro init_cpu_task tsk, tmp1, tmp2
2021-05-20 12:50:30 +01:00
msr sp_el0, \tsk
2021-05-20 12:50:31 +01:00
ldr \tmp1, [\tsk, #TSK_STACK]
add sp, \tmp1, #THREAD_SIZE
arm64: Implement stack trace termination record
2021-05-10 12:00:26 +01:00
sub sp, sp, #PT_REGS_SIZE
2021-05-20 12:50:30 +01:00
arm64: Implement stack trace termination record
2021-05-10 12:00:26 +01:00
stp xzr, xzr, [sp, #S_STACKFRAME]
arm64: stacktrace: unwind exception boundaries
When arm64's stack unwinder encounters an exception boundary, it uses
the pt_regs::stackframe created by the entry code, which has a copy of
the PC and FP at the time the exception was taken. The unwinder doesn't
know anything about pt_regs, and reports the PC from the stackframe, but
does not report the LR.
The LR is only guaranteed to contain the return address at function call
boundaries, and can be used as a scratch register at other times, so the
LR at an exception boundary may or may not be a legitimate return
address. It would be useful to report the LR value regardless, as it can
be helpful when debugging, and in future it will be helpful for reliable
stacktrace support.
This patch changes the way we unwind across exception boundaries,
allowing both the PC and LR to be reported. The entry code creates a
frame_record_meta structure embedded within pt_regs, which the unwinder
uses to find the pt_regs. The unwinder can then extract pt_regs::pc and
pt_regs::lr as two separate unwind steps before continuing with a
regular walk of frame records.
When a PC is unwound from pt_regs::lr, dump_backtrace() will log this
with an "L" marker so that it can be identified easily. For example,
an unwind across an exception boundary will appear as follows:
| el1h_64_irq+0x6c/0x70
| _raw_spin_unlock_irqrestore+0x10/0x60 (P)
| __aarch64_insn_write+0x6c/0x90 (L)
| aarch64_insn_patch_text_nosync+0x28/0x80
... with a (P) entry for pt_regs::pc, and an (L) entry for pt_regs::lr.
Note that the LR may be stale at the point of the exception, for example,
shortly after a return:
| el1h_64_irq+0x6c/0x70
| default_idle_call+0x34/0x180 (P)
| default_idle_call+0x28/0x180 (L)
| do_idle+0x204/0x268
... where the LR points a few instructions before the current PC.
This plays nicely with all the other unwind metadata tracking. With the
ftrace_graph profiler enabled globally, and kretprobes installed on
generic_handle_domain_irq() and do_interrupt_handler(), a backtrace triggered
by magic-sysrq + L reports:
| Call trace:
| show_stack+0x20/0x40 (CF)
| dump_stack_lvl+0x60/0x80 (F)
| dump_stack+0x18/0x28
| nmi_cpu_backtrace+0xfc/0x140
| nmi_trigger_cpumask_backtrace+0x1c8/0x200
| arch_trigger_cpumask_backtrace+0x20/0x40
| sysrq_handle_showallcpus+0x24/0x38 (F)
| __handle_sysrq+0xa8/0x1b0 (F)
| handle_sysrq+0x38/0x50 (F)
| pl011_int+0x460/0x5a8 (F)
| __handle_irq_event_percpu+0x60/0x220 (F)
| handle_irq_event+0x54/0xc0 (F)
| handle_fasteoi_irq+0xa8/0x1d0 (F)
| generic_handle_domain_irq+0x34/0x58 (F)
| gic_handle_irq+0x54/0x140 (FK)
| call_on_irq_stack+0x24/0x58 (F)
| do_interrupt_handler+0x88/0xa0
| el1_interrupt+0x34/0x68 (FK)
| el1h_64_irq_handler+0x18/0x28
| el1h_64_irq+0x6c/0x70
| default_idle_call+0x34/0x180 (P)
| default_idle_call+0x28/0x180 (L)
| do_idle+0x204/0x268
| cpu_startup_entry+0x3c/0x50 (F)
| rest_init+0xe4/0xf0
| start_kernel+0x744/0x750
| __primary_switched+0x88/0x98
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Puranjay Mohan <puranjay12@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241017092538.1859841-11-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
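The two-step (P)/(L) reporting described in this commit can be illustrated with a toy unwinder. This is a hypothetical C model, not arch/arm64/kernel/stacktrace.c: toy_frame folds a frame record and, at exception boundaries, the relevant pt_regs fields into one struct, and the meta type values are invented. An ordinary record contributes its saved return address; a record tagged as embedded in pt_regs contributes pt_regs::pc marked P and then pt_regs::lr marked L, after which the walk continues through the saved frame pointer. For brevity the toy recognises the final record by its type, whereas the kernel identifies it by its location on the task stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Invented meta types for this sketch only. */
enum toy_meta_type { TOY_META_NONE, TOY_META_FINAL, TOY_META_PT_REGS };

struct toy_frame {
	const struct toy_frame *fp;	/* next (older) frame; NULL ends the chain */
	uint64_t lr;			/* return address saved in this record */
	enum toy_meta_type type;	/* set only on records embedded in pt_regs */
	uint64_t regs_pc, regs_lr;	/* pt_regs::pc / pt_regs::lr for TOY_META_PT_REGS */
};

struct toy_entry {
	uint64_t pc;
	char mark;			/* ' ', 'P' (pt_regs::pc) or 'L' (pt_regs::lr) */
};

/* Fill out[] with up to max entries; return how many were written. */
static size_t toy_unwind(const struct toy_frame *fp, struct toy_entry *out, size_t max)
{
	size_t n = 0;

	while (fp != NULL && n < max) {
		if (fp->type == TOY_META_FINAL)
			break;			/* reached the task's final record */
		if (fp->type == TOY_META_PT_REGS) {
			out[n++] = (struct toy_entry){ fp->regs_pc, 'P' };
			if (n < max)
				out[n++] = (struct toy_entry){ fp->regs_lr, 'L' };
		} else {
			out[n++] = (struct toy_entry){ fp->lr, ' ' };
		}
		fp = fp->fp;
	}
	return n;
}
```

A chain of leaf, exception boundary, caller and final record yields four entries, with the P and L pair in the middle, loosely mirroring the default_idle_call example in the commit message.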
2024-10-17 10:25:38 +01:00
mov \tmp1, #FRAME_META_TYPE_FINAL
str \tmp1, [sp, #S_STACKFRAME_TYPE]
arm64: Implement stack trace termination record
2021-05-10 12:00:26 +01:00
add x29, sp, #S_STACKFRAME
2021-05-20 12:50:30 +01:00
2023-01-09 18:47:59 +01:00
scs_load_current
2021-05-20 12:50:31 +01:00
adr_l \tmp1, __per_cpu_offset
2021-09-14 14:10:33 +02:00
ldr w\tmp2, [\tsk, #TSK_TI_CPU]
2021-05-20 12:50:31 +01:00
ldr \tmp1, [\tmp1, \tmp2, lsl #3]
set_this_cpu_offset \tmp1
arm64: Implement stack trace termination record
2021-05-10 12:00:26 +01:00
.endm
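The stack arithmetic in init_cpu_task, together with the location-based termination check described in the stack trace termination commit, can be sketched in C. Everything here is a simplified stand-in: toy_pt_regs is not the kernel's struct pt_regs, and TOY_THREAD_SIZE and TOY_META_TYPE_FINAL are illustrative values rather than THREAD_SIZE or FRAME_META_TYPE_FINAL. The walker succeeds only when the frame pointer equals the address of the reserved pt_regs stackframe, and it makes that comparison before dereferencing anything. The per-CPU offset lookup and shadow call stack setup are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; the kernel's real layouts differ. */
struct frame_record { uint64_t fp, lr; };
struct frame_record_meta { struct frame_record record; uint64_t type; };
struct toy_pt_regs {
	uint64_t regs[31], sp, pc, pstate;	/* illustrative fields only */
	struct frame_record_meta stackframe;
};

#define TOY_THREAD_SIZE		16384u	/* assumed stack size */
#define TOY_META_TYPE_FINAL	1u	/* assumed marker value */

/* Mirrors init_cpu_task: carve pt_regs off the stack top, point x29 at it. */
static struct toy_pt_regs *toy_init_cpu_task(uint8_t *stack, uint64_t *x29)
{
	uint8_t *sp = stack + TOY_THREAD_SIZE;		/* add sp, tmp1, #THREAD_SIZE */
	sp -= sizeof(struct toy_pt_regs);		/* sub sp, sp, #PT_REGS_SIZE */
	struct toy_pt_regs *regs = (struct toy_pt_regs *)sp;

	regs->stackframe.record.fp = 0;			/* stp xzr, xzr, [sp, #S_STACKFRAME] */
	regs->stackframe.record.lr = 0;
	regs->stackframe.type = TOY_META_TYPE_FINAL;	/* str tmp1, [sp, #S_STACKFRAME_TYPE] */
	*x29 = (uint64_t)(uintptr_t)&regs->stackframe;	/* add x29, sp, #S_STACKFRAME */
	return regs;
}

/* Returns 1 if the chain starting at fp ends at the task's final record. */
static int toy_walk_is_reliable(uint64_t fp, const struct toy_pt_regs *task_regs,
				int max_depth)
{
	const uint64_t final_fp = (uint64_t)(uintptr_t)&task_regs->stackframe;

	for (int depth = 0; depth < max_depth; depth++) {
		if (fp == final_fp)
			return 1;	/* matched by location, before any read */
		if (fp == 0)
			return 0;	/* chain ended early: not reliable */
		fp = ((const struct frame_record *)(uintptr_t)fp)->fp;
	}
	return 0;			/* too deep: give up */
}
```

A record whose saved FP points at the reserved stackframe walks cleanly; a record whose FP is 0 is reported as an early, unreliable termination.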
2014-11-21 13:50:41 -08:00
/*
2015-03-04 11:51:48 +01:00
* The following fragment of code is executed with the MMU enabled.
2016-08-31 12:05:15 +01:00
*
2022-06-29 09:42:07 +05:30
* x0 = __pa(KERNEL_START)
2014-11-21 13:50:41 -08:00
*/
2020-02-18 19:58:33 +00:00
SYM_FUNC_START_LOCAL(__primary_switched)
2021-05-20 12:50:30 +01:00
adr_l x4, init_task
2021-05-20 12:50:31 +01:00
init_cpu_task x4, x5, x6
2016-08-31 12:05:16 +01:00
2015-12-26 12:46:40 +01:00
adr_l x8, vectors // load VBAR_EL1 with virtual
msr vbar_el1, x8 // vector table address
isb
2021-05-20 12:50:30 +01:00
stp x29, x30, [sp, #-16]!
2016-08-31 12:05:16 +01:00
mov x29, sp
2016-08-31 12:05:15 +01:00
str_l x21, __fdt_pointer, x5 // Save FDT pointer
2023-11-29 12:15:59 +01:00
adrp x4, _text // Save the offset between
2016-08-31 12:05:15 +01:00
sub x4, x4, x0 // the kernel virtual and
str_l x4, kimage_voffset, x5 // physical mappings
2022-06-24 17:06:48 +02:00
mov x0, x20
bl set_cpu_boot_mode_flag
2020-12-22 12:02:06 -08:00
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
2015-10-12 18:52:58 +03:00
bl kasan_early_init
2022-10-27 17:59:08 +02:00
#endif
2022-06-24 17:06:48 +02:00
mov x0, x20
2022-06-30 17:04:52 +01:00
bl finalise_el2 // Prefer VHE if possible
2021-05-20 12:50:30 +01:00
ldp x29, x30, [sp], #16
arm64: Implement stack trace termination record
2021-05-10 12:00:26 +01:00
bl start_kernel
ASM_BUG()
2020-02-18 19:58:33 +00:00
SYM_FUNC_END(__primary_switched)
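__primary_switched records kimage_voffset as the virtual address of _text (adrp executes here with the MMU on) minus the physical address of KERNEL_START passed in x0. As a sketch, the C below shows that subtraction and how such an offset converts kernel-image addresses in both directions. The function names are hypothetical, loosely echoing the kernel's __kimg_to_phys()/__phys_to_kimg() helpers, and the example addresses in the checks are made up.

```c
#include <assert.h>
#include <stdint.h>

/* adrp x4, _text ; sub x4, x4, x0 ; str_l x4, kimage_voffset, x5 */
static uint64_t toy_kimage_voffset(uint64_t text_va, uint64_t kernel_start_pa)
{
	return text_va - kernel_start_pa;	/* unsigned, wraps like SUB */
}

/* Kernel-image virtual address -> physical address. */
static uint64_t toy_kimg_to_phys(uint64_t va, uint64_t voffset)
{
	return va - voffset;
}

/* Physical address -> kernel-image virtual address. */
static uint64_t toy_phys_to_kimg(uint64_t pa, uint64_t voffset)
{
	return pa + voffset;
}
```

Because the image is mapped linearly, one offset suffices for every address inside it; addresses outside the kernel image are translated by other means.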
2014-11-21 13:50:41 -08:00
/*
* end early head section, begin head code that is also used for
* hotplug and needs to have the same protections as the text region
*/
arm64: fix .idmap.text assertion for large kernels
When building a kernel with many debug options enabled (which happens in
test configurations used by myself and syzbot), the kernel can become
large enough that portions of .text can be more than 128M away from
.idmap.text (which is placed inside the .rodata section). Where idmap
code branches into .text, the linker will place veneers in the
.idmap.text section to make those branches possible.
Unfortunately, as Ard reports, GNU LD has been observed to add 4K of
padding when adding such veneers, e.g.
| .idmap.text 0xffffffc01e48e5c0 0x32c arch/arm64/mm/proc.o
| 0xffffffc01e48e5c0 idmap_cpu_replace_ttbr1
| 0xffffffc01e48e600 idmap_kpti_install_ng_mappings
| 0xffffffc01e48e800 __cpu_setup
| *fill* 0xffffffc01e48e8ec 0x4
| .idmap.text.stub
| 0xffffffc01e48e8f0 0x18 linker stubs
| 0xffffffc01e48f8f0 __idmap_text_end = .
| 0xffffffc01e48f000 . = ALIGN (0x1000)
| *fill* 0xffffffc01e48f8f0 0x710
| 0xffffffc01e490000 idmap_pg_dir = .
This makes the __idmap_text_start .. __idmap_text_end region bigger than
the 4K we require it to fit within, and triggers an assertion in arm64's
vmlinux.lds.S, which breaks the build:
| LD .tmp_vmlinux.kallsyms1
| aarch64-linux-gnu-ld: ID map text too big or misaligned
| make[1]: *** [scripts/Makefile.vmlinux:35: vmlinux] Error 1
| make: *** [Makefile:1264: vmlinux] Error 2
Avoid this by using an `ADRP+ADD+BLR` sequence for branches out of
.idmap.text, which avoids the need for veneers. These branches are only
executed once per boot, and only when the MMU is on, so there should be
no noticeable performance penalty in replacing `BL` with `ADRP+ADD+BLR`.
At the same time, remove the "x" and "w" attributes when placing code in
.idmap.text, as these are not necessary, and this will prevent the
linker from assuming that it is safe to place PLTs into .idmap.text,
causing it to warn if and when there are out-of-range branches within
.idmap.text, e.g.
| LD .tmp_vmlinux.kallsyms1
| arch/arm64/kernel/head.o: in function `primary_entry':
| (.idmap.text+0x1c): relocation truncated to fit: R_AARCH64_CALL26 against symbol `dcache_clean_poc' defined in .text section in arch/arm64/mm/cache.o
| arch/arm64/kernel/head.o: in function `init_el2':
| (.idmap.text+0x88): relocation truncated to fit: R_AARCH64_CALL26 against symbol `dcache_clean_poc' defined in .text section in arch/arm64/mm/cache.o
| make[1]: *** [scripts/Makefile.vmlinux:34: vmlinux] Error 1
| make: *** [Makefile:1252: vmlinux] Error 2
Thus, if future changes add out-of-range branches in .idmap.text, it
should be easy enough to identify those from the resulting linker
errors.
Reported-by: syzbot+f8ac312e31226e23302b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-arm-kernel/00000000000028ea4105f4e2ef54@google.com/
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Will Deacon <will@kernel.org>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20230220162317.1581208-1-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
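The range problem described in this commit comes from BL encoding a signed 26-bit word offset, which reaches from -128 MiB to +128 MiB minus 4 bytes, whereas ADRP+ADD (the adr_l macro) forms an address anywhere within about ±4 GiB and BLR then branches to a register with no range limit. The helper below is a small C sketch of the BL range check only; the name bl_in_range is hypothetical and it does not model how the linker decides to insert veneers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * BL encodes imm26, a signed word offset, so the byte offset must be a
 * multiple of 4 in [-2^27, 2^27 - 4], i.e. -128 MiB .. +128 MiB - 4.
 */
static bool bl_in_range(uint64_t pc, uint64_t target)
{
	/* Modular subtraction, reinterpreted as signed (two's complement). */
	int64_t off = (int64_t)(target - pc);

	if (off & 3)
		return false;	/* branch targets are 4-byte aligned */
	return off >= -(INT64_C(1) << 27) && off <= (INT64_C(1) << 27) - 4;
}
```

A call site in .idmap.text whose target in .text fails this check is exactly the case where a plain BL would need a veneer.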
2023-02-20 16:23:17 +00:00
.section ".idmap.text","a"
arm64: add support for kernel ASLR
This adds support for KASLR, based on entropy provided by
the bootloader in the /chosen/kaslr-seed DT property. Depending on the size
of the address space (VA_BITS) and the page size, the entropy in the
virtual displacement is up to 13 bits (16k/2 levels) and up to 25 bits (all
4 levels), with the sidenote that displacements that result in the kernel
image straddling a 1GB/32MB/512MB alignment boundary (for 4KB/16KB/64KB
granule kernels, respectively) are not allowed, and will be rounded up to
an acceptable value.
If CONFIG_RANDOMIZE_MODULE_REGION_FULL is enabled, the module region is
randomized independently from the core kernel. This makes it less likely
that the location of core kernel data structures can be determined by an
adversary, but causes all function calls from modules into the core kernel
to be resolved via entries in the module PLTs.
If CONFIG_RANDOMIZE_MODULE_REGION_FULL is not enabled, the module region is
randomized by choosing a page aligned 128 MB region inside the interval
[_etext - 128 MB, _stext + 128 MB). This gives between 10 and 14 bits of
entropy (depending on page size), independently of the kernel randomization,
but still guarantees that modules are within the range of relative branch
and jump instructions (with the caveat that, since the module region is
shared with other uses of the vmalloc area, modules may need to be loaded
further away if the module region is exhausted).
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
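The straddling rule in this commit message can be expressed as a small C function: if an image of a given size placed at base would cross a multiple of the alignment (1GB, 32MB or 512MB for 4KB, 16KB and 64KB granules respectively), the base is rounded up to that boundary. This is a hypothetical illustration of the rule as stated, written for this note; the name avoid_straddle is invented and it should not be read as the kernel's KASLR implementation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the rule: if [base, base + size) would cross a
 * multiple of align, move base up to that multiple so the image no longer
 * straddles the boundary. Assumes 0 < size <= align, align a power of two.
 */
static uint64_t avoid_straddle(uint64_t base, uint64_t size, uint64_t align)
{
	assert(size != 0 && size <= align && (align & (align - 1)) == 0);

	uint64_t first_block = base / align;
	uint64_t last_block = (base + size - 1) / align;

	if (first_block != last_block)
		base = last_block * align;	/* round up to the crossed boundary */
	return base;
}
```

Requiring size <= align guarantees that the rounded-up placement fits entirely within the next block.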
2016-01-26 14:12:01 +01:00
2012-03-05 11:49:27 +00:00
/*
2020-11-13 12:49:23 +00:00
* Starting from EL2 or EL1, configure the CPU to execute at the highest
* reachable EL supported by the kernel in a chosen default state. If dropping
* from EL2 to EL1, configure EL2 before configuring EL1.
2013-10-11 14:52:16 +01:00
*
2020-11-13 12:49:25 +00:00
* Since we cannot always rely on ERET synchronizing writes to sysregs (e.g. if
* SCTLR_ELx.EOS is clear), we place an ISB prior to ERET.
2013-10-11 14:52:16 +01:00
*
2022-06-30 17:04:53 +01:00
* Returns either BOOT_CPU_MODE_EL1 or BOOT_CPU_MODE_EL2 in x0 if
* booted in EL1 or EL2 respectively, with the top 32 bits containing
* potential context flags. These flags are *not* stored in __boot_cpu_mode.
2023-01-11 11:22:35 +01:00
*
* x0: whether we are being called from the primary boot path with the MMU on
2012-03-05 11:49:27 +00:00
*/
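The return convention in the comment above packs the boot mode into the low 32 bits of x0 and optional context flags into the upper 32 bits. A minimal C sketch of decoding it follows. The constant values are assumptions recalled from arch/arm64/include/asm/virt.h and should be checked against the tree in use; the helper names are hypothetical. The EL1 path returns via `mov w0, #BOOT_CPU_MODE_EL1`, and writing a W register zeroes bits [63:32], so no flags are returned from EL1.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; verify against arch/arm64/include/asm/virt.h. */
#define BOOT_CPU_MODE_EL1	0xe11u
#define BOOT_CPU_MODE_EL2	0xe12u
#define BOOT_CPU_FLAG_E2H	(UINT64_C(1) << 32)

/* Low 32 bits: the boot mode, the only part kept in __boot_cpu_mode. */
static uint32_t boot_mode(uint64_t x0)
{
	return (uint32_t)x0;
}

/* High 32 bits: context flags such as BOOT_CPU_FLAG_E2H. */
static uint64_t boot_flags(uint64_t x0)
{
	return x0 & ~(uint64_t)UINT32_MAX;
}
```

Keeping the flags out of the stored mode lets code that compares __boot_cpu_mode against the two mode constants stay unaware of them.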
2020-11-13 12:49:23 +00:00
SYM_FUNC_START(init_kernel_el)
2023-01-11 11:22:35 +01:00
mrs x1, CurrentEL
cmp x1, #CurrentEL_EL2
2020-11-13 12:49:25 +00:00
b.eq init_el2
SYM_INNER_LABEL(init_el1, SYM_L_LOCAL)
2021-04-08 14:10:09 +01:00
mov_q x0, INIT_SCTLR_EL1_MMU_OFF
2023-01-11 11:22:33 +01:00
pre_disable_mmu_workaround
2021-04-08 14:10:09 +01:00
msr sctlr_el1, x0
2013-10-11 14:52:17 +01:00
isb
2020-11-13 12:49:25 +00:00
mov_q x0, INIT_PSTATE_EL1
msr spsr_el1, x0
msr elr_el1, lr
mov w0, #BOOT_CPU_MODE_EL1
eret
2012-03-05 11:49:27 +00:00
2020-11-13 12:49:25 +00:00
SYM_INNER_LABEL(init_el2, SYM_L_LOCAL)
2023-01-11 11:22:35 +01:00
msr elr_el2, lr
// clean all HYP code to the PoC if we booted at EL2 with the MMU on
cbz x0, 0f
adrp x0, __hyp_idmap_text_start
adr_l x1, __hyp_text_end
arm64: fix .idmap.text assertion for large kernels
2023-02-20 16:23:17 +00:00
adr_l x2, dcache_clean_poc
blr x2
2024-04-15 09:54:15 +02:00
mov_q x0, INIT_SCTLR_EL2_MMU_OFF
pre_disable_mmu_workaround
msr sctlr_el2, x0
isb
2023-01-11 11:22:35 +01:00
0:
2020-12-02 18:41:04 +00:00
KVM: arm64: Initialize HCR_EL2.E2H early
On CPUs without FEAT_E2H0, HCR_EL2.E2H is RES1, but may reset to an
UNKNOWN value out of reset and consequently may not read as 1 unless it
has been explicitly initialized.
We handled this for the head.S boot code in commits:
3944382fa6f22b54 ("arm64: Treat HCR_EL2.E2H as RES1 when ID_AA64MMFR4_EL1.E2H0 is negative")
b3320142f3db9b3f ("arm64: Fix early handling of FEAT_E2H0 not being implemented")
Unfortunately, we forgot to apply a similar fix to the KVM PSCI entry
points used when relaying CPU_ON, CPU_SUSPEND, and SYSTEM SUSPEND. When
KVM is entered via these entry points, the value of HCR_EL2.E2H may be
consumed before it has been initialized (e.g. by the 'init_el2_state'
macro).
Initialize HCR_EL2.E2H early in these paths such that it can be consumed
reliably. The existing code in head.S is factored out into a new
'init_el2_hcr' macro, and this is used in the __kvm_hyp_init_cpu()
function common to all the relevant PSCI entry points.
For clarity, I've tweaked the assembly used to check whether
ID_AA64MMFR4_EL1.E2H0 is negative. The bitfield is extracted as a signed
value, and this is checked with a signed-greater-or-equal (GE) comparison.
As the hyp code will reconfigure HCR_EL2 later in ___kvm_hyp_init(), all
bits other than E2H are initialized to zero in __kvm_hyp_init_cpu().
Fixes: 3944382fa6f22b54 ("arm64: Treat HCR_EL2.E2H as RES1 when ID_AA64MMFR4_EL1.E2H0 is negative")
Fixes: b3320142f3db9b3f ("arm64: Fix early handling of FEAT_E2H0 not being implemented")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Ahmed Genidi <ahmed.genidi@arm.com>
Cc: Ben Horgan <ben.horgan@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Leo Yan <leo.yan@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250227180526.1204723-2-mark.rutland@arm.com
[maz: fixed LT->GE thinko]
Signed-off-by: Marc Zyngier <maz@kernel.org>
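The E2H0 test described above can be modeled in plain C. The model shows why the field is extracted as a signed value and compared with GE. This is a simplified sketch, not the init_el2_hcr macro itself. The field position (bits [27:24] of ID_AA64MMFR4_EL1) and the helper names sbfx64() and early_hcr_el2() are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed position of ID_AA64MMFR4_EL1.E2H0 (bits [27:24]); verify in the Arm ARM. */
#define MMFR4_E2H0_SHIFT	24
/* HCR_EL2.E2H is bit 34. */
#define HCR_E2H			(UINT64_C(1) << 34)

/* Portable C equivalent of SBFX: extract `width` bits at `lsb`, then sign-extend. */
static int64_t sbfx64(uint64_t val, unsigned int lsb, unsigned int width)
{
	assert(width > 0 && width < 64 && lsb + width <= 64);
	uint64_t field = (val >> lsb) & ((UINT64_C(1) << width) - 1);
	uint64_t sign = UINT64_C(1) << (width - 1);

	return (field & sign) ? (int64_t)field - (int64_t)(UINT64_C(1) << width)
			      : (int64_t)field;
}

/* Hypothetical helper: the HCR_EL2 value to program early. E2H is forced on
 * only when E2H0 is negative, i.e. FEAT_E2H0 is absent and E2H is RES1. */
static uint64_t early_hcr_el2(uint64_t mmfr4, uint64_t flags)
{
	if (sbfx64(mmfr4, MMFR4_E2H0_SHIFT, 4) >= 0)	/* GE: E2H0 implemented */
		return flags;
	return flags | HCR_E2H;
}
```

In this model, an E2H0 field of 0b1111 sign-extends to -1, so E2H is forced on. A field of 0b0000 leaves E2H clear. Read as unsigned, 0b1111 would be 15 and would never compare as negative.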
2025-02-27 18:05:25 +00:00
init_el2_hcr HCR_HOST_NVHE_FLAGS
2021-02-08 09:57:17 +00:00
init_el2_state
2017-10-31 15:51:04 +00:00

2012-10-19 17:46:27 +01:00
/* Hypervisor stub */
2020-12-02 18:41:04 +00:00
adr_l x0, __hyp_stub_vectors
2012-10-19 17:46:27 +01:00
msr vbar_el2, x0
2020-11-13 12:49:25 +00:00
isb
2020-12-02 18:41:04 +00:00

2022-06-30 17:04:54 +01:00
mov_q x1, INIT_SCTLR_EL1_MMU_OFF

2021-04-08 14:10:09 +01:00
mrs x0, hcr_el2
and x0, x0, #HCR_E2H
2024-01-22 18:13:41 +00:00
cbz x0, 2f
2024-03-21 11:54:14 +00:00

2022-06-30 17:04:54 +01:00
/* Set a sane SCTLR_EL1, the VHE way */
msr_s SYS_SCTLR_EL12, x1
mov x2, #BOOT_CPU_FLAG_E2H
2024-01-22 18:13:41 +00:00
b 3f
2021-04-08 14:10:09 +01:00

2024-01-22 18:13:41 +00:00
2:
2022-06-30 17:04:54 +01:00
msr sctlr_el1, x1
mov x2, xzr
2024-01-22 18:13:41 +00:00
3:
KVM: arm64: Initialize SCTLR_EL1 in __kvm_hyp_init_cpu()
When KVM is in protected mode, host calls to PSCI are proxied via EL2,
and cold entries from CPU_ON, CPU_SUSPEND, and SYSTEM_SUSPEND bounce
through __kvm_hyp_init_cpu() at EL2 before entering the host kernel's
entry point at EL1. While __kvm_hyp_init_cpu() initializes SPSR_EL2 for
the exception return to EL1, it does not initialize SCTLR_EL1.
Due to this, it's possible to enter EL1 with SCTLR_EL1 in an UNKNOWN
state. In practice this has been seen to result in kernel crashes after
CPU_ON as a result of SCTLR_EL1.M being 1 in violation of the initial
core configuration specified by PSCI.
Fix this by initializing SCTLR_EL1 for cold entry to the host kernel.
As it's necessary to write to SCTLR_EL12 in VHE mode, this
initialization is moved into __kvm_host_psci_cpu_entry() where we can
use write_sysreg_el1().
The remnants of the '__init_el2_nvhe_prepare_eret' macro are folded into
its only caller, as this is clearer than having the macro.
Fixes: cdf367192766ad11 ("KVM: arm64: Intercept host's CPU_ON SMCs")
Reported-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Ahmed Genidi <ahmed.genidi@arm.com>
[ Mark: clarify commit message, handle E2H, move to C, remove macro ]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Ahmed Genidi <ahmed.genidi@arm.com>
Cc: Ben Horgan <ben.horgan@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Leo Yan <leo.yan@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Leo Yan <leo.yan@arm.com>
Link: https://lore.kernel.org/r/20250227180526.1204723-3-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-27 18:05:26 +00:00
mov x0, #INIT_PSTATE_EL1
msr spsr_el2, x0
2023-06-14 16:51:29 +01:00

2020-11-13 12:49:25 +00:00
mov w0, #BOOT_CPU_MODE_EL2
2022-06-30 17:04:54 +01:00
orr x0, x0, x2
2012-03-05 11:49:27 +00:00
eret
2020-11-13 12:49:23 +00:00
SYM_FUNC_END(init_kernel_el)
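The tail of init_kernel_el uses HCR_EL2.E2H for two things. It picks which SCTLR alias receives INIT_SCTLR_EL1_MMU_OFF, and it decides whether BOOT_CPU_FLAG_E2H is ORed into the value returned in x0. The C model below is a hedged sketch of that logic only. The constant values are assumed to match arch/arm64/include/asm/virt.h, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; check arch/arm64/include/asm/virt.h in your tree. */
#define BOOT_CPU_MODE_EL2	UINT64_C(0xe12)
#define BOOT_CPU_FLAG_E2H	(UINT64_C(1) << 32)

enum sctlr_target { WRITE_SCTLR_EL12, WRITE_SCTLR_EL1 };

struct el2_exit {
	enum sctlr_target sctlr;	/* register that gets INIT_SCTLR_EL1_MMU_OFF */
	uint64_t boot_mode;		/* value left in x0 at eret */
};

/* Hypothetical model of the EL2 exit path of init_kernel_el. */
static struct el2_exit init_kernel_el_el2_tail(bool hcr_e2h)
{
	struct el2_exit r;

	if (hcr_e2h) {			/* VHE: SCTLR_EL1 is reached via SCTLR_EL12 */
		r.sctlr = WRITE_SCTLR_EL12;
		r.boot_mode = BOOT_CPU_MODE_EL2 | BOOT_CPU_FLAG_E2H;
	} else {			/* nVHE: msr sctlr_el1 directly, x2 = 0 */
		r.sctlr = WRITE_SCTLR_EL1;
		r.boot_mode = BOOT_CPU_MODE_EL2;
	}
	/* mov w0 zeroes the upper half, so the low word is always the mode. */
	assert((uint32_t)r.boot_mode == BOOT_CPU_MODE_EL2);
	return r;
}
```

Because `mov w0` clears the upper half of x0 before the `orr`, the E2H flag lands in bit 32. The low word still compares equal to BOOT_CPU_MODE_EL2.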
2012-03-05 11:49:27 +00:00

/*
* This provides a "holding pen" in which platforms can hold all secondary
* cores until we're ready for them to initialise.
*/
2020-02-18 19:58:33 +00:00
SYM_FUNC_START(secondary_holding_pen)
2023-01-11 11:22:35 +01:00
mov x0, xzr
2020-11-13 12:49:23 +00:00
bl init_kernel_el // w0=cpu_boot_mode
2022-06-24 17:06:48 +02:00
mrs x2, mpidr_el1
2016-04-18 17:09:45 +02:00
mov_q x1, MPIDR_HWID_BITMASK
2022-06-24 17:06:48 +02:00
and x2, x2, x1
2015-03-10 15:00:03 +01:00
adr_l x3, secondary_holding_pen_release
2012-03-05 11:49:27 +00:00
pen: ldr x4, [x3]
2022-06-24 17:06:48 +02:00
cmp x4, x2
2012-03-05 11:49:27 +00:00
b.eq secondary_startup
wfe
b pen
2020-02-18 19:58:33 +00:00
SYM_FUNC_END(secondary_holding_pen)
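The holding pen loop above can be sketched in C11. Each CPU masks its MPIDR and leaves only when the release word matches exactly. The sketch makes three simplifications. It assumes MPIDR_HWID_BITMASK is 0xff00ffffff (check cputype.h). It bounds the number of polls so the model terminates. It replaces WFE with a comment. holding_pen() is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value; check MPIDR_HWID_BITMASK in arch/arm64/include/asm/cputype.h. */
#define MPIDR_HWID_BITMASK	UINT64_C(0xff00ffffff)

/* Returns true once the release word equals this CPU's masked MPIDR.
 * The real loop never gives up; max_polls only keeps the model finite. */
static bool holding_pen(_Atomic uint64_t *release, uint64_t mpidr, int max_polls)
{
	uint64_t hwid = mpidr & MPIDR_HWID_BITMASK;	/* and x2, x2, x1 */

	assert(release != NULL);
	for (int i = 0; i < max_polls; i++) {
		if (atomic_load(release) == hwid)	/* ldr x4, [x3]; cmp x4, x2 */
			return true;			/* b.eq secondary_startup */
		/* wfe: the real CPU sleeps here until an event arrives */
	}
	return false;
}
```

In the test, UINT64_MAX plays the role of INVALID_HWID, so no CPU matches until the release word is written.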
arm64: factor out spin-table boot method
The arm64 kernel has an internal holding pen, which is necessary for
some systems where we can't bring CPUs online individually and must hold
multiple CPUs in a safe area until the kernel is able to handle them.
The current SMP infrastructure for arm64 is closely coupled to this
holding pen, and alternative boot methods must launch CPUs into the pen,
where they sit before they are launched into the kernel proper.
With PSCI (and possibly other future boot methods), we can bring CPUs
online individually, and need not perform the secondary_holding_pen
dance. Instead, this patch factors the holding pen management code out
to the spin-table boot method code, as it is the only boot method
requiring the pen.
A new entry point for secondaries, secondary_entry, is added for other
boot methods to use, which bypasses the holding pen and its associated
overhead when bringing CPUs online. The smp.pen.text section is also
removed, as the pen can live in head.text without problem.
The cpu_operations structure is extended with two new functions,
cpu_boot and cpu_postboot, for bringing a cpu into the kernel and
performing any post-boot cleanup required by a boot method (e.g.
resetting the secondary_holding_pen_release to INVALID_HWID).
Documentation is added for cpu_operations.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
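The cpu_operations split described in this commit can be illustrated with a function-pointer table. This is a hypothetical, heavily trimmed sketch. The real structure has more members, and the real spin-table code also handles each CPU's release address and wakes CPUs with SEV. The MPIDR values and the INVALID_HWID stand-in are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-in for the kernel's INVALID_HWID. */
#define INVALID_HWID	UINT64_MAX

/* Hypothetical, trimmed-down ops table with the two hooks from the commit. */
struct cpu_operations {
	const char *name;
	int (*cpu_boot)(unsigned int cpu);
	void (*cpu_postboot)(void);
};

static uint64_t pen_release = INVALID_HWID;
/* Example MPIDR affinity values for four CPUs (made up for illustration). */
static const uint64_t cpu_hwid[] = { 0x0, 0x1, 0x100, 0x101 };

static int spin_table_cpu_boot(unsigned int cpu)
{
	if (cpu >= sizeof(cpu_hwid) / sizeof(cpu_hwid[0]))
		return -1;
	pen_release = cpu_hwid[cpu];	/* exactly one CPU may now leave the pen */
	return 0;
}

static void spin_table_cpu_postboot(void)
{
	pen_release = INVALID_HWID;	/* close the pen behind it */
}

static const struct cpu_operations spin_table_ops = {
	.name		= "spin-table",
	.cpu_boot	= spin_table_cpu_boot,
	.cpu_postboot	= spin_table_cpu_postboot,
};
```

A boot method such as PSCI would supply its own cpu_boot and would have no pen to reset.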
2013-10-24 20:30:16 +01:00

/*
* Secondary entry point that jumps straight into the kernel. Only to
* be used where CPUs are brought online dynamically by the kernel.
*/
2020-02-18 19:58:33 +00:00
SYM_FUNC_START(secondary_entry)
2023-01-11 11:22:35 +01:00
mov x0, xzr
2020-11-13 12:49:23 +00:00
bl init_kernel_el // w0=cpu_boot_mode
2013-10-24 20:30:16 +01:00
b secondary_startup
2020-02-18 19:58:33 +00:00
SYM_FUNC_END(secondary_entry)
2012-03-05 11:49:27 +00:00

2020-02-18 19:58:33 +00:00
SYM_FUNC_START_LOCAL(secondary_startup)
2012-03-05 11:49:27 +00:00
/*
* Common entry point for secondary CPUs.
*/
2022-06-24 17:06:48 +02:00
mov x20, x0 // preserve boot mode
arm64: mm: Handle LVA support as a CPU feature
Currently, we detect CPU support for 52-bit virtual addressing (LVA)
extremely early, before creating the kernel page tables or enabling the
MMU. We cannot override the feature this early, and so large virtual
addressing is always enabled on CPUs that implement support for it if
the software support for it was enabled at build time. It also means we
rely on non-trivial code in asm to deal with this feature.
Given that both the ID map and the TTBR1 mapping of the kernel image are
guaranteed to be 48-bit addressable, it is not actually necessary to
enable support this early, and instead, we can model it as a CPU
feature. That way, we can rely on code patching to get the correct
TCR.T1SZ values programmed on secondary boot and resume from suspend.
On the primary boot path, we simply enable the MMU with 48-bit virtual
addressing initially, and update TCR.T1SZ if LVA is supported from C
code, right before creating the kernel mapping. Given that TTBR1 still
points to reserved_pg_dir at this point, updating TCR.T1SZ should be
safe without the need for explicit TLB maintenance.
Since this gets rid of all accesses to the vabits_actual variable from
asm code that occurred before TCR.T1SZ had been programmed, we no longer
have a need for this variable, and we can replace it with a C expression
that produces the correct value directly, based on the value of TCR.T1SZ.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240214122845.2033971-70-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
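The relationship this commit relies on is that TCR_EL1.T1SZ (bits [21:16]) sizes the TTBR1 region as 2^(64 - T1SZ) bytes. The VA width can therefore be recomputed from the register instead of being kept in a variable. The sketch below is minimal. The function names are hypothetical, and this is not the kernel's actual vabits_actual definition.

```c
#include <assert.h>
#include <stdint.h>

/* TCR_EL1.T1SZ occupies bits [21:16]. */
#define TCR_T1SZ_SHIFT	16
#define TCR_T1SZ_MASK	(UINT64_C(0x3f) << TCR_T1SZ_SHIFT)

/* Number of kernel (TTBR1) VA bits implied by the programmed T1SZ. */
static unsigned int vabits_from_tcr(uint64_t tcr)
{
	return 64 - (unsigned int)((tcr & TCR_T1SZ_MASK) >> TCR_T1SZ_SHIFT);
}

/* Rewrite only the T1SZ field, leaving every other TCR bit intact. */
static uint64_t tcr_set_t1sz(uint64_t tcr, unsigned int va_bits)
{
	assert(va_bits >= 1 && va_bits <= 64);
	tcr &= ~TCR_T1SZ_MASK;
	return tcr | ((uint64_t)(64 - va_bits) << TCR_T1SZ_SHIFT);
}
```

Going from 48 to 52 bits changes T1SZ from 16 to 12 and touches no other field.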
2024-02-14 13:29:11 +01:00

#ifdef CONFIG_ARM64_VA_BITS_52
alternative_if ARM64_HAS_VA52
2018-12-06 22:50:40 +00:00
bl __cpu_secondary_check52bitva
2024-02-14 13:29:11 +01:00
alternative_else_nop_endif
2022-07-01 13:10:45 +02:00
#endif
2024-02-14 13:29:11 +01:00

2015-03-18 14:55:20 +00:00
bl __cpu_setup // initialise processor
2018-09-24 14:51:13 +01:00
adrp x1, swapper_pg_dir
2022-06-24 17:06:39 +02:00
adrp x2, idmap_pg_dir
2016-08-31 12:05:14 +01:00
bl __enable_mmu
ldr x8, =__secondary_switched
br x8
2020-02-18 19:58:33 +00:00
SYM_FUNC_END(secondary_startup)
2012-03-05 11:49:27 +00:00

2023-01-11 11:22:32 +01:00
.text
2020-02-18 19:58:33 +00:00
SYM_FUNC_START_LOCAL(__secondary_switched)
2022-06-24 17:06:48 +02:00
mov x0, x20
bl set_cpu_boot_mode_flag
2023-01-11 11:22:31 +01:00

mov x0, x20
bl finalise_el2

2022-06-24 17:06:48 +02:00
str_l xzr, __early_cpu_boot_status, x3
2015-12-26 12:46:40 +01:00
adr_l x5, vectors
msr vbar_el1, x5
isb

2016-02-23 10:31:42 +00:00
adr_l x0, secondary_data
arm64: split thread_info from task stack
This patch moves arm64's struct thread_info from the task stack into
task_struct. This protects thread_info from corruption in the case of
stack overflows, and makes its address harder to determine if stack
addresses are leaked, making a number of attacks more difficult. Precise
detection and handling of overflow is left for subsequent patches.
Largely, this involves changing code to store the task_struct in sp_el0,
and acquire the thread_info from the task struct. Core code now
implements current_thread_info(), and as noted in <linux/sched.h> this
relies on offsetof(task_struct, thread_info) == 0, enforced by core
code.
This change means that the 'tsk' register used in entry.S now points to
a task_struct, rather than a thread_info as it used to. To make this
clear, the TI_* field offsets are renamed to TSK_TI_*, with asm-offsets
appropriately updated to account for the structural change.
Userspace clobbers sp_el0, and we can no longer restore this from the
stack. Instead, the current task is cached in a per-cpu variable that we
can safely access from early assembly as interrupts are disabled (and we
are thus not preemptible).
Both secondary entry and idle are updated to stash the sp and task
pointer separately.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: James Morse <james.morse@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
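The requirement offsetof(task_struct, thread_info) == 0, mentioned above, is what lets a task pointer double as a thread_info pointer. The structures below are hypothetical stand-ins with only a couple of fields. Only the layout rule is taken from the commit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily trimmed structures. */
struct thread_info {
	unsigned long flags;
	int preempt_count;
};

struct task_struct {
	struct thread_info thread_info;	/* must remain the first member */
	void *stack;
	int pid;
};

/* The layout rule the commit depends on, enforced at compile time. */
_Static_assert(offsetof(struct task_struct, thread_info) == 0,
	       "thread_info must be the first member of task_struct");

/* With the task pointer at hand (kept in sp_el0 on arm64), the
 * thread_info lives at the very same address. */
static struct thread_info *thread_info_of(struct task_struct *tsk)
{
	return (struct thread_info *)tsk;
}
```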
2016-11-03 20:23:13 +00:00
ldr x2, [x0, #CPU_BOOT_TASK]
2019-08-27 14:36:38 +01:00
cbz x2, __secondary_too_slow
2021-05-20 12:50:29 +01:00

2021-05-20 12:50:31 +01:00
init_cpu_task x2, x1, x3
arm64: simplify ptrauth initialization
Currently __cpu_setup conditionally initializes the address
authentication keys and enables them in SCTLR_EL1, doing so differently
for the primary CPU and secondary CPUs, and skipping this work for CPUs
returning from an idle state. For the latter case, cpu_do_resume
restores the keys and SCTLR_EL1 value after the MMU has been enabled.
This flow is rather difficult to follow, so instead let's move the
primary and secondary CPU initialization into their respective boot
paths. By following the example of cpu_do_resume and doing so once the
MMU is enabled, we can always initialize the keys from the values in
thread_struct, and avoid the machinery necessary to pass the keys in
secondary_data or open-coding initialization for the boot CPU.
This means we perform an additional RMW of SCTLR_EL1, but we already do
this in the cpu_do_resume path, and for other features in cpufeature.c,
so this isn't a major concern in a bringup path. Note that even while
the enable bits are clear, the key registers are accessible.
As this now renders the argument to __cpu_setup redundant, let's also
remove that entirely. Future extensions can follow a similar approach to
initialize values that differ for primary/secondary CPUs.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Reviewed-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Cc: Amit Daniel Kachhap <amit.kachhap@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20200423101606.37601-3-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
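The read-modify-write of SCTLR_EL1 mentioned above amounts to ORing in the pointer authentication enable bits on CPUs that implement address authentication. This sketch makes three assumptions. The bit positions are EnIA=31, EnIB=30, EnDA=27 and EnDB=13. All four keys are enabled, which is assumed to match the macro of that period. Key loading is not modeled. ptrauth_init_sctlr() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed SCTLR_EL1 pointer authentication enable bits. */
#define SCTLR_ENIA	(UINT64_C(1) << 31)
#define SCTLR_ENIB	(UINT64_C(1) << 30)
#define SCTLR_ENDA	(UINT64_C(1) << 27)
#define SCTLR_ENDB	(UINT64_C(1) << 13)
#define SCTLR_PTRAUTH_EN (SCTLR_ENIA | SCTLR_ENIB | SCTLR_ENDA | SCTLR_ENDB)

/* Read-modify-write: every other SCTLR_EL1 bit is preserved, and nothing
 * changes on CPUs without address authentication. */
static uint64_t ptrauth_init_sctlr(uint64_t sctlr, bool have_addr_auth)
{
	if (!have_addr_auth)
		return sctlr;
	return sctlr | SCTLR_PTRAUTH_EN;
}
```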
2020-04-23 11:16:06 +01:00

#ifdef CONFIG_ARM64_PTR_AUTH
ptrauth_keys_init_cpu x2, x3, x4, x5
#endif

arm64: Implement stack trace termination record
Reliable stacktracing requires that we identify when a stacktrace is
terminated early. We can do this by ensuring all tasks have a final
frame record at a known location on their task stack, and checking
that this is the final frame record in the chain.
We'd like to use task_pt_regs(task)->stackframe as the final frame
record, as this is already set up upon exception entry from EL0. For
kernel tasks we need to consistently reserve the pt_regs and point x29
at this, which we can do with small changes to __primary_switched,
__secondary_switched, and copy_process().
Since the final frame record must be at a specific location, we must
create the final frame record in __primary_switched and
__secondary_switched rather than leaving this to start_kernel and
secondary_start_kernel. Thus, __primary_switched and
__secondary_switched will now show up in stacktraces for the idle tasks.
Since the final frame record is now identified by its location rather
than by its contents, we identify it at the start of unwind_frame(),
before we read any values from it.
External debuggers may terminate the stack trace when FP == 0. In the
pt_regs->stackframe, the PC is 0 as well. So, stack traces taken in the
debugger may print an extra record 0x0 at the end. While this is not
pretty, this does not do any harm. This is a small price to pay for
having reliable stack trace termination in the kernel. That said, gdb
does not show the extra record probably because it uses DWARF and not
frame pointers for stack traces.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
[Mark: rebase, use ASM_BUG(), update comments, update commit message]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20210510110026.18061-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
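Identifying the final frame record by its location, as described above, can be sketched as a walk over {fp, lr} pairs. The walk succeeds only when it reaches one specific record. The types and unwind_reliable() are hypothetical. The kernel's unwinder also performs other checks, such as validating stack bounds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* AArch64 frame record: x29 points at {previous fp, saved lr}. */
struct frame_record {
	const struct frame_record *fp;	/* next (older) record */
	uintptr_t lr;			/* saved return address */
};

/* Reliable only if the walk stops exactly at final_rec (the record in the
 * task's pt_regs); hitting NULL or max_depth first is unreliable. */
static bool unwind_reliable(const struct frame_record *fp,
			    const struct frame_record *final_rec,
			    size_t max_depth, size_t *depth)
{
	size_t n = 0;

	assert(depth != NULL);
	while (fp != NULL && n < max_depth) {
		if (fp == final_rec) {	/* matched by address, contents never read */
			*depth = n;
			return true;
		}
		n++;
		fp = fp->fp;
	}
	*depth = n;
	return false;
}
```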
2021-05-10 12:00:26 +01:00
bl secondary_start_kernel
ASM_BUG()
2020-02-18 19:58:33 +00:00
SYM_FUNC_END(__secondary_switched)
2012-03-05 11:49:27 +00:00

2020-02-18 19:58:33 +00:00
SYM_FUNC_START_LOCAL(__secondary_too_slow)
2019-08-27 14:36:38 +01:00
wfe
wfi
b __secondary_too_slow
2020-02-18 19:58:33 +00:00
SYM_FUNC_END(__secondary_too_slow)
2019-08-27 14:36:38 +01:00

2023-01-11 11:22:32 +01:00
/*
* Sets the __boot_cpu_mode flag depending on the CPU boot mode passed
* in w0. See arch/arm64/include/asm/virt.h for more info.
*/
SYM_FUNC_START_LOCAL(set_cpu_boot_mode_flag)
adr_l x1, __boot_cpu_mode
cmp w0, #BOOT_CPU_MODE_EL2
b.ne 1f
add x1, x1, #4
1: str w0, [x1] // Save CPU boot mode
ret
SYM_FUNC_END(set_cpu_boot_mode_flag)
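set_cpu_boot_mode_flag stores w0 into one of two 32-bit words: the first for EL1 boots, and the second, at byte offset 4, for EL2 boots. The C model below is a sketch under stated assumptions. The mode values are assumed to match asm/virt.h. The initial contents {EL2, EL1} are assumed from head.S. The helper boot_modes_mismatched() is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values from arch/arm64/include/asm/virt.h. */
#define BOOT_CPU_MODE_EL1	UINT32_C(0xe11)
#define BOOT_CPU_MODE_EL2	UINT32_C(0xe12)

/* Stand-in for __boot_cpu_mode. Starting from {EL2, EL1}, the two words
 * agree only if every CPU booted in the same mode. */
static uint32_t boot_cpu_mode[2] = { BOOT_CPU_MODE_EL2, BOOT_CPU_MODE_EL1 };

static void set_cpu_boot_mode_flag(uint64_t x0)
{
	uint32_t w0 = (uint32_t)x0;		/* cmp and str only look at w0 */
	uint32_t *slot = &boot_cpu_mode[0];	/* adr_l x1, __boot_cpu_mode */

	if (w0 == BOOT_CPU_MODE_EL2)
		slot++;				/* add x1, x1, #4 */
	*slot = w0;				/* str w0, [x1] */
}

static bool boot_modes_mismatched(void)
{
	return boot_cpu_mode[0] != boot_cpu_mode[1];
}
```

Because only w0 is stored, a BOOT_CPU_FLAG_E2H bit in the upper half of x0 never reaches __boot_cpu_mode.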

2016-02-23 10:31:42 +00:00
/*
* The booting CPU updates the failed status @__early_cpu_boot_status,
* with MMU turned off.
*
* update_early_cpu_boot_status status, tmp1, tmp2
* - Corrupts tmp1, tmp2
* - Writes 'status' to __early_cpu_boot_status and makes sure
* it is committed to memory.
*/

.macro update_early_cpu_boot_status status, tmp1, tmp2
mov \tmp2, #\status
arm64: fix invalidation of wrong __early_cpu_boot_status cacheline
In head.S, the str_l macro, which takes a source register, a symbol name
and a temp register, is used to store a status value to the variable
__early_cpu_boot_status. Subsequently, the value of the temp register is
reused to invalidate any cachelines covering this variable.
However, since str_l resolves to
adrp \tmp, \sym
str \src, [\tmp, :lo12:\sym]
the temp register never actually holds the address of the variable but
only of the 4 KB window that covers it, and reusing it leads to the
wrong cacheline being invalidated. So instead, take the address
explicitly before doing the store, and reuse that value to perform
the cache invalidation.
Fixes: bb9052744f4b ("arm64: Handle early CPU boot failures")
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Suzuki K Poulose <Suzuki.Poulose@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
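The address arithmetic behind this fix is easy to check in C. ADRP yields only the 4 KB-aligned page containing the symbol. Invalidating by that value therefore targets the first cache line of the page, not the line holding the variable. The example address and the 64-byte line size are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* What ADRP materialises: the 4 KB-aligned page that contains the symbol. */
static uint64_t adrp_page(uint64_t addr)
{
	return addr & ~UINT64_C(0xfff);
}

/* The cache line containing addr, for a power-of-two line size. */
static uint64_t cache_line(uint64_t addr, uint64_t line_size)
{
	assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
	return addr & ~(line_size - 1);
}
```

For a variable at 0x40012a48, DC IVAC on the page base would hit line 0x40012000, while the variable lives in line 0x40012a40.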
2016-04-15 12:11:21 +02:00
adr_l \tmp1, __early_cpu_boot_status
str \tmp2, [\tmp1]
2016-02-23 10:31:42 +00:00
dmb sy
dc ivac, \tmp1 // Invalidate potentially stale cache line
.endm

2012-03-05 11:49:27 +00:00
/*
2015-03-17 08:59:53 +01:00
* Enable the MMU.
2012-03-05 11:49:27 +00:00
*
2015-03-17 08:59:53 +01:00
* x0 = SCTLR_EL1 value for turning on the MMU.
2018-09-24 14:51:13 +01:00
* x1 = TTBR1_EL1 value
2022-06-24 17:06:39 +02:00
* x2 = ID map root table address
2015-03-17 08:59:53 +01:00
*
2016-08-31 12:05:14 +01:00
* Returns to the caller via x30/lr. This requires the caller to be covered
* by the .idmap.text section.
2015-10-19 14:19:35 +01:00
*
* Checks if the selected granule size is supported by the CPU.
* If it isn't, park the CPU.
2012-03-05 11:49:27 +00:00
*/
2023-02-20 16:23:17 +00:00
.section ".idmap.text","a"
2020-02-18 19:58:33 +00:00
SYM_FUNC_START(__enable_mmu)
2022-06-24 17:06:39 +02:00
mrs x3, ID_AA64MMFR0_EL1
2022-09-05 23:54:01 +01:00
ubfx x3, x3, #ID_AA64MMFR0_EL1_TGRAN_SHIFT, 4
cmp x3, #ID_AA64MMFR0_EL1_TGRAN_SUPPORTED_MIN
2021-03-10 11:23:10 +05:30
b.lt __no_granule_support
2022-09-05 23:54:01 +01:00
cmp x3, #ID_AA64MMFR0_EL1_TGRAN_SUPPORTED_MAX
2021-03-10 11:23:10 +05:30
b.gt __no_granule_support
2018-09-24 14:51:13 +01:00
phys_to_ttbr x2, x2
msr ttbr0_el1, x2 // load TTBR0
2022-06-24 17:06:46 +02:00
load_ttbr1 x1, x1, x3
2021-02-08 09:57:12 +00:00

set_sctlr_el1 x0

2016-08-31 12:05:14 +01:00
ret
2020-02-18 19:58:33 +00:00
SYM_FUNC_END(__enable_mmu)
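The granule check at the top of __enable_mmu extracts the 4-bit TGRANx field unsigned. It parks the CPU unless the value falls within [SUPPORTED_MIN, SUPPORTED_MAX]. In this sketch the shift and bounds are parameters. The example values should be checked against sysreg.h: for 4K pages, shift 28 with range 0 to 7; for 16K pages, shift 20 with range 1 to 15. granule_supported() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors ubfx + cmp/b.lt + cmp/b.gt: the 4-bit TGRANx field at `shift`
 * is read unsigned and must lie within [min, max]. */
static bool granule_supported(uint64_t mmfr0, unsigned int shift,
			      unsigned int min, unsigned int max)
{
	assert(shift <= 60);
	unsigned int tgran = (unsigned int)((mmfr0 >> shift) & 0xf);

	return tgran >= min && tgran <= max;	/* false: __no_granule_support */
}
```

Reading the field unsigned is what makes a 4K "not implemented" encoding of 0xf fail the upper-bound test.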
2015-10-19 14:19:35 +01:00

2024-02-14 13:29:11 +01:00
#ifdef CONFIG_ARM64_VA_BITS_52
2020-02-18 19:58:33 +00:00
SYM_FUNC_START(__cpu_secondary_check52bitva)
arm64: Enable LPA2 at boot if supported by the system
Update the early kernel mapping code to take 52-bit virtual addressing
into account based on the LPA2 feature. This is a bit more involved than
LVA (which is supported with 64k pages only), given that some page table
descriptor bits change meaning in this case.
To keep the handling in asm to a minimum, the initial ID map is still
created with 48-bit virtual addressing, which implies that the kernel
image must be loaded into 48-bit addressable physical memory. This is
currently required by the boot protocol, even though we happen to
support placement outside of that for LVA/64k based configurations.
Enabling LPA2 involves more than setting TCR.T1SZ to a lower value:
there is also a DS bit in TCR that needs to be set, and which changes
the meaning of bits [9:8] in all page table descriptors. Since we cannot
enable DS and every live page table descriptor at the same time, let's
pivot through another temporary mapping. This avoids the need to
reintroduce manipulations of the page tables with the MMU and caches
disabled.
To permit the LPA2 feature to be overridden on the kernel command line,
which may be necessary to work around silicon errata, or to deal with
mismatched features on heterogeneous SoC designs, test for CPU feature
overrides first, and only then enable LPA2.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240214122845.2033971-78-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-02-14 13:29:19 +01:00
#ifndef CONFIG_ARM64_LPA2
2018-12-06 22:50:40 +00:00
mrs_s x0, SYS_ID_AA64MMFR2_EL1
2023-07-11 14:50:55 +05:30
and x0, x0, ID_AA64MMFR2_EL1_VARange_MASK
2018-12-06 22:50:40 +00:00
cbnz x0, 2f
2024-02-14 13:29:19 +01:00
|
|
|
#else
	mrs	x0, id_aa64mmfr0_el1
	sbfx	x0, x0, #ID_AA64MMFR0_EL1_TGRAN_SHIFT, 4
	cmp	x0, #ID_AA64MMFR0_EL1_TGRAN_LPA2
	b.ge	2f
#endif
2018-12-06 22:50:40 +00:00

2018-12-10 14:21:13 +00:00
	update_early_cpu_boot_status \
		CPU_STUCK_IN_KERNEL | CPU_STUCK_REASON_52_BIT_VA, x0, x1
2018-12-06 22:50:40 +00:00
1:	wfe
	wfi
	b	1b

2:	ret
2020-02-18 19:58:33 +00:00
SYM_FUNC_END(__cpu_secondary_check52bitva)
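`__cpu_secondary_check52bitva` parks a secondary CPU that lacks 52-bit VA support. The C code below is a portable model of its two ID-register tests.

The field positions and LPA2 thresholds are hard-coded here as assumptions taken from the Arm ID-register layout:

- VARange at ID_AA64MMFR2_EL1[19:16]
- TGran4 at ID_AA64MMFR0_EL1[31:28]
- TGran16 at ID_AA64MMFR0_EL1[23:20]

The assembly instead takes these values from the `ID_AA64MMFR0_EL1_TGRAN_*` macros for the configured page size. All function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of AArch64 SBFX: extract 'width' bits at 'lsb' and sign-extend them. */
int64_t sbfx(uint64_t reg, unsigned lsb, unsigned width)
{
	assert(width > 0 && width < 64 && lsb + width <= 64);
	uint64_t field = (reg >> lsb) & ((1ULL << width) - 1);
	uint64_t sign = 1ULL << (width - 1);
	return (int64_t)((field ^ sign) - sign); /* two's-complement sign extension */
}

/* !CONFIG_ARM64_LPA2: VARange is bits [19:16]; any non-zero value is taken
 * as 52-bit VA support, like the and/cbnz pair. */
bool cpu_has_lva(uint64_t mmfr2)
{
	return ((mmfr2 >> 16) & 0xF) != 0;
}

/* CONFIG_ARM64_LPA2, 4K pages: TGran4 (bits [31:28]) is a signed field:
 * 0xF (-1) = 4K unsupported, 0 = supported, 1 = supported with LPA2. */
bool cpu_has_lpa2_4k(uint64_t mmfr0)
{
	return sbfx(mmfr0, 28, 4) >= 1; /* mirrors sbfx / cmp / b.ge */
}

/* CONFIG_ARM64_LPA2, 16K pages: TGran16 (bits [23:20]):
 * 0 = 16K unsupported, 1 = supported, 2 = supported with LPA2. */
bool cpu_has_lpa2_16k(uint64_t mmfr0)
{
	return sbfx(mmfr0, 20, 4) >= 2;
}
```

An unsigned extract would read TGran4 = 0xF as 15 and pass the `>= 1` test on a CPU with no 4K support at all. Sign extension through `sbfx`, combined with the signed `b.ge` condition, avoids that.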
arm64: mm: Handle LVA support as a CPU feature
Currently, we detect CPU support for 52-bit virtual addressing (LVA)
extremely early, before creating the kernel page tables or enabling the
MMU. We cannot override the feature this early, and so large virtual
addressing is always enabled on CPUs that implement support for it if
the software support for it was enabled at build time. It also means we
rely on non-trivial code in asm to deal with this feature.
Given that both the ID map and the TTBR1 mapping of the kernel image are
guaranteed to be 48-bit addressable, it is not actually necessary to
enable support this early, and instead, we can model it as a CPU
feature. That way, we can rely on code patching to get the correct
TCR.T1SZ values programmed on secondary boot and resume from suspend.
On the primary boot path, we simply enable the MMU with 48-bit virtual
addressing initially, and update TCR.T1SZ if LVA is supported from C
code, right before creating the kernel mapping. Given that TTBR1 still
points to reserved_pg_dir at this point, updating TCR.T1SZ should be
safe without the need for explicit TLB maintenance.
Since this gets rid of all accesses to the vabits_actual variable from
asm code that occurred before TCR.T1SZ had been programmed, we no longer
have a need for this variable, and we can replace it with a C expression
that produces the correct value directly, based on the value of TCR.T1SZ.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240214122845.2033971-70-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
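The message above describes three steps:

1. Program TCR.T1SZ for 48-bit addressing first.
2. Widen it from C code when LVA is present.
3. Derive the VA size from T1SZ instead of keeping a separate variable.

The small C model below assumes T1SZ occupies TCR_EL1 bits [21:16] and that the TTBR1 region spans 2^(64 - T1SZ) bytes. The helper names are illustrative only; they are not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TCR_T1SZ_SHIFT 16 /* assumed: TCR_EL1.T1SZ is bits [21:16] */
#define TCR_T1SZ_MASK  (0x3FULL << TCR_T1SZ_SHIFT)

/* Program T1SZ so that the TTBR1 region covers 2^va_bits bytes. */
uint64_t tcr_set_t1sz(uint64_t tcr, unsigned va_bits)
{
	assert(va_bits >= 1 && va_bits <= 64); /* T1SZ must fit in 6 bits */
	uint64_t t1sz = 64 - va_bits;
	return (tcr & ~TCR_T1SZ_MASK) | (t1sz << TCR_T1SZ_SHIFT);
}

/* Recover the active VA size from TCR alone. */
unsigned vabits_from_tcr(uint64_t tcr)
{
	return 64 - (unsigned)((tcr & TCR_T1SZ_MASK) >> TCR_T1SZ_SHIFT);
}

/* Primary boot: start with 48-bit VAs, widen to 52 only if LVA is present. */
uint64_t early_update_t1sz(uint64_t tcr, bool cpu_has_lva)
{
	return cpu_has_lva ? tcr_set_t1sz(tcr, 52) : tcr;
}
```

Because the VA size is always recomputed from the register, there is no second copy that could disagree with what the hardware was programmed with.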
2024-02-14 13:29:11 +01:00
#endif
2018-12-06 22:50:40 +00:00

2020-02-18 19:58:33 +00:00
SYM_FUNC_START_LOCAL(__no_granule_support)
2016-02-23 10:31:42 +00:00
	/* Indicate that this CPU can't boot and is stuck in the kernel */
2018-12-10 14:21:13 +00:00
	update_early_cpu_boot_status \
		CPU_STUCK_IN_KERNEL | CPU_STUCK_REASON_NO_GRAN, x1, x2
2016-02-23 10:31:42 +00:00
1:
2015-10-19 14:19:35 +01:00
	wfe
2016-02-23 10:31:42 +00:00
	wfi
2016-08-31 12:05:13 +01:00
	b	1b
2020-02-18 19:58:33 +00:00
SYM_FUNC_END(__no_granule_support)
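Both parking paths record why the CPU is stuck before spinning in `wfe`/`wfi`. They do this by OR-ing a reason into `CPU_STUCK_IN_KERNEL`. The C sketch below shows how such a combined status word could be built and decoded.

The constant values are assumptions modelled on `arch/arm64/include/asm/smp.h`: the state sits in the low 8 bits and the reason above it. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encodings, modelled on arch/arm64/include/asm/smp.h. */
#define CPU_STUCK_REASON_SHIFT     8
#define CPU_BOOT_STATUS_MASK       ((1ULL << CPU_STUCK_REASON_SHIFT) - 1)
#define CPU_STUCK_IN_KERNEL        2ULL
#define CPU_STUCK_REASON_52_BIT_VA (1ULL << CPU_STUCK_REASON_SHIFT)
#define CPU_STUCK_REASON_NO_GRAN   (2ULL << CPU_STUCK_REASON_SHIFT)

enum stuck_reason { NOT_STUCK, STUCK_52_BIT_VA, STUCK_NO_GRAN, STUCK_UNKNOWN };

/* What a parked secondary reports: the state in the low bits, the reason above. */
uint64_t stuck_status(uint64_t reason)
{
	assert((reason & CPU_BOOT_STATUS_MASK) == 0);
	return CPU_STUCK_IN_KERNEL | reason;
}

/* How the boot CPU could classify a value read back from the secondary. */
enum stuck_reason decode_stuck(uint64_t status)
{
	if ((status & CPU_BOOT_STATUS_MASK) != CPU_STUCK_IN_KERNEL)
		return NOT_STUCK;
	if (status & CPU_STUCK_REASON_52_BIT_VA)
		return STUCK_52_BIT_VA;
	if (status & CPU_STUCK_REASON_NO_GRAN)
		return STUCK_NO_GRAN;
	return STUCK_UNKNOWN;
}
```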
2016-04-18 17:09:42 +02:00

2020-02-18 19:58:33 +00:00
SYM_FUNC_START_LOCAL(__primary_switch)
2022-06-24 17:06:47 +02:00
	adrp	x1, reserved_pg_dir
2025-05-08 13:43:30 +02:00
	adrp	x2, __pi_init_idmap_pg_dir
2016-08-31 12:05:14 +01:00
	bl	__enable_mmu
2024-02-14 13:28:52 +01:00

2024-02-14 13:28:49 +01:00
	adrp	x1, early_init_stack
arm64: head: avoid relocating the kernel twice for KASLR
Currently, when KASLR is in effect, we set up the kernel virtual address
space twice: the first time, the KASLR seed is looked up in the device
tree, and the kernel virtual mapping is torn down and recreated again,
after which the relocations are applied a second time. The latter step
means that statically initialized global pointer variables will be reset
to their initial values, and to ensure that BSS variables are not set to
values based on the initial translation, they are cleared again as well.
All of this is needed because we need the command line (taken from the
DT) to tell us whether or not to randomize the virtual address space
before entering the kernel proper. However, this code has expanded
little by little and now creates global state unrelated to the virtual
randomization of the kernel before the mapping is torn down and set up
again, and the BSS cleared for a second time. This has created some
issues in the past, and it would be better to avoid this little dance if
possible.
So instead, let's use the temporary mapping of the device tree, and
execute the bare minimum of code to decide whether or not KASLR should
be enabled, and what the seed is. Only then, create the virtual kernel
mapping, clear BSS, etc and proceed as normal. This avoids the issues
around inconsistent global state due to BSS being cleared twice, and is
generally more maintainable, as it permits us to defer all the remaining
DT parsing and KASLR initialization to a later time.
This means the relocation fixup code runs only a single time as well,
allowing us to simplify the RELR handling code too, which is not
idempotent and was therefore required to keep track of the offset that
was applied the first time around.
Note that this means we have to clone a pair of FDT library objects, so
that we can control how they are built - we need the stack protector
and other instrumentation disabled so that the code can tolerate being
called this early. Note that only the kernel page tables and the
temporary stack are mapped read-write at this point, which ensures that
the early code does not modify any global state inadvertently.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20220624150651.1358849-21-ardb@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
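The message above notes that the RELR handling is not idempotent, which is why relocating only once lets it be simplified. The sketch below is a minimal C decoder for the RELR packed relative-relocation format:

- An even entry names one word to patch.
- An odd entry is a bitmap over the following 63 words.

To keep the example self-contained, address entries are byte offsets into a local array rather than virtual addresses. It is an illustration of the format, not the kernel's relocation code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add 'delta' to every word of 'image' covered by the RELR entries.
 * Sketch only: address entries are byte offsets into 'image'. */
void apply_relr(uint64_t *image, const uint64_t *relr, size_t count, uint64_t delta)
{
	uint64_t *where = image;

	for (size_t i = 0; i < count; i++) {
		uint64_t entry = relr[i];

		if ((entry & 1) == 0) {
			/* Address entry: patch one word, then continue after it. */
			assert((entry & 7) == 0);
			where = (uint64_t *)((char *)image + entry);
			*where++ += delta;
		} else {
			/* Bitmap entry: entry bit k (k >= 1) covers where[k - 1]. */
			uint64_t bits = entry >> 1;
			for (unsigned k = 0; bits != 0; k++, bits >>= 1)
				if (bits & 1)
					where[k] += delta;
			where += 63; /* a bitmap always spans 63 words */
		}
	}
}
```

Applying the same entries a second time adds the displacement again. A second pass therefore has to know what the first pass applied, which is the bookkeeping that relocating only once removes.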
2022-06-24 17:06:50 +02:00
	mov	sp, x1
	mov	x29, xzr
2024-02-14 13:28:54 +01:00
	mov	x0, x20				// pass the full boot status
arm64: kernel: Create initial ID map from C code
The asm code that creates the initial ID map is rather intricate and
hard to follow. This is problematic because it makes adding support for
things like LPA2 or WXN more difficult than necessary. Also, it is
parameterized like the rest of the MM code to run with a configurable
number of levels, which is rather pointless, given that all AArch64 CPUs
implement support for 48-bit virtual addressing, and that many systems
exist with DRAM located outside of the 39-bit addressable range, which
is the only smaller VA size that is widely used, and we need additional
tricks to make things work in that combination.
So let's bite the bullet, and rip out all the asm macros, and fiddly
code, and replace it with a C implementation based on the newly added
routines for creating the early kernel VA mappings. And while at it,
create the initial ID map based on 48-bit virtual addressing as well,
regardless of the number of configured levels for the kernel proper.
Note that this code may execute with the MMU and caches disabled, and is
therefore not permitted to make unaligned accesses. This shouldn't
generally happen in any case for the algorithm as implemented, but to be
sure, let's pass -mstrict-align to the compiler just in case.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240214122845.2033971-66-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
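Because the initial ID map is always built for 48-bit virtual addressing, the identity-mapped image must lie below 2^48. Each address is then resolved through a fixed set of per-level indices.

The short C sketch below assumes a 4K granule: a 12-bit page offset with four levels of 9 index bits above it. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Table index at 'level' (0 = top, 3 = last) for a 48-bit VA with a 4K
 * granule: 12 bits of page offset, then 9 index bits per level. */
unsigned idmap_index(uint64_t va, int level)
{
	assert(level >= 0 && level <= 3);
	unsigned shift = 12 + 9 * (3 - level);
	return (unsigned)((va >> shift) & 0x1FF);
}

/* In an identity map VA == PA, so the image must sit below 2^48. */
bool idmap_can_cover(uint64_t pa)
{
	return pa < (1ULL << 48);
}
```

For a physical address such as 0x40203000, the four indices come out as 0, 1, 1 and 3. These are the same regardless of how many levels the kernel proper is configured to use.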
2024-02-14 13:29:07 +01:00
	mov	x1, x21				// pass the FDT
2024-02-14 13:29:04 +01:00
	bl	__pi_early_map_kernel		// Map and relocate the kernel
2022-06-24 17:06:47 +02:00

2016-04-18 17:09:43 +02:00
	ldr	x8, =__primary_switched
2022-06-29 09:42:07 +05:30
	adrp	x0, KERNEL_START		// __pa(KERNEL_START)
2016-04-18 17:09:43 +02:00
	br	x8
2020-02-18 19:58:33 +00:00
SYM_FUNC_END(__primary_switch)
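At the end of `__primary_switch` the CPU is still executing from the ID map. The PC-relative `adrp x0, KERNEL_START` therefore yields a physical address. By contrast, `ldr x8, =__primary_switched` loads an absolute virtual address from a literal pool, and `br x8` moves execution into the kernel mapping.

The C model below shows the ADRP arithmetic. It ignores any KASLR displacement, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_MASK_4K (~0xFFFULL)

/* Link time: ADRP encodes the signed distance between the 4K page of the
 * instruction and the 4K page of the symbol. */
int64_t adrp_page_delta(uint64_t link_pc, uint64_t link_sym)
{
	int64_t delta = (int64_t)((link_sym & PAGE_MASK_4K) - (link_pc & PAGE_MASK_4K));
	assert(delta >= -(1LL << 32) && delta < (1LL << 32)); /* ADRP reach: +/-4 GiB */
	return delta;
}

/* Run time: the CPU adds that distance to the page of the PC it is actually
 * executing at, so the result follows whichever mapping is in use. */
uint64_t adrp_exec(uint64_t run_pc, int64_t delta)
{
	return (run_pc & PAGE_MASK_4K) + (uint64_t)delta;
}
```

Take a hypothetical image linked at 0xffff800080000000 and loaded at physical 0x40200000. An instruction with link address 0xffff800081234568 then runs at PA 0x41434568. `adrp_exec(0x41434568, adrp_page_delta(0xffff800081234568, 0xffff800080000000))` gives 0x40200000, the physical start of the image.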