2019-06-03 07:44:50 +02:00
|
|
|
// SPDX-License-Identifier: GPL-2.0-only
|
2012-03-05 11:49:28 +00:00
|
|
|
/*
|
|
|
|
* Based on arch/arm/kernel/process.c
|
|
|
|
*
|
|
|
|
* Original Copyright (C) 1995 Linus Torvalds
|
|
|
|
* Copyright (C) 1996-2000 Russell King - Converted to ARM.
|
|
|
|
* Copyright (C) 2012 ARM Ltd.
|
|
|
|
*/
|
2014-04-30 10:51:32 +01:00
|
|
|
#include <linux/compat.h>
|
2015-03-06 15:49:24 +01:00
|
|
|
#include <linux/efi.h>
|
2020-03-16 16:50:47 +00:00
|
|
|
#include <linux/elf.h>
|
2012-03-05 11:49:28 +00:00
|
|
|
#include <linux/export.h>
|
|
|
|
#include <linux/sched.h>
|
2017-02-08 18:51:35 +01:00
|
|
|
#include <linux/sched/debug.h>
|
2017-02-08 18:51:36 +01:00
|
|
|
#include <linux/sched/task.h>
|
2017-02-08 18:51:37 +01:00
|
|
|
#include <linux/sched/task_stack.h>
|
2012-03-05 11:49:28 +00:00
|
|
|
#include <linux/kernel.h>
|
2020-03-16 16:50:47 +00:00
|
|
|
#include <linux/mman.h>
|
2012-03-05 11:49:28 +00:00
|
|
|
#include <linux/mm.h>
|
2020-09-28 14:03:00 +01:00
|
|
|
#include <linux/nospec.h>
|
2012-03-05 11:49:28 +00:00
|
|
|
#include <linux/stddef.h>
|
2019-07-23 19:58:39 +02:00
|
|
|
#include <linux/sysctl.h>
|
2012-03-05 11:49:28 +00:00
|
|
|
#include <linux/unistd.h>
|
|
|
|
#include <linux/user.h>
|
|
|
|
#include <linux/delay.h>
|
|
|
|
#include <linux/reboot.h>
|
|
|
|
#include <linux/interrupt.h>
|
|
|
|
#include <linux/init.h>
|
|
|
|
#include <linux/cpu.h>
|
|
|
|
#include <linux/elfcore.h>
|
|
|
|
#include <linux/pm.h>
|
|
|
|
#include <linux/tick.h>
|
|
|
|
#include <linux/utsname.h>
|
|
|
|
#include <linux/uaccess.h>
|
|
|
|
#include <linux/random.h>
|
|
|
|
#include <linux/hw_breakpoint.h>
|
|
|
|
#include <linux/personality.h>
|
|
|
|
#include <linux/notifier.h>
|
2015-09-16 22:23:21 +08:00
|
|
|
#include <trace/events/power.h>
|
arm64: split thread_info from task stack
This patch moves arm64's struct thread_info from the task stack into
task_struct. This protects thread_info from corruption in the case of
stack overflows, and makes its address harder to determine if stack
addresses are leaked, making a number of attacks more difficult. Precise
detection and handling of overflow is left for subsequent patches.
Largely, this involves changing code to store the task_struct in sp_el0,
and acquire the thread_info from the task struct. Core code now
implements current_thread_info(), and as noted in <linux/sched.h> this
relies on offsetof(task_struct, thread_info) == 0, enforced by core
code.
This change means that the 'tsk' register used in entry.S now points to
a task_struct, rather than a thread_info as it used to. To make this
clear, the TI_* field offsets are renamed to TSK_TI_*, with asm-offsets
appropriately updated to account for the structural change.
Userspace clobbers sp_el0, and we can no longer restore this from the
stack. Instead, the current task is cached in a per-cpu variable that we
can safely access from early assembly as interrupts are disabled (and we
are thus not preemptible).
Both secondary entry and idle are updated to stash the sp and task
pointer separately.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: James Morse <james.morse@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2016-11-03 20:23:13 +00:00
|
|
|
#include <linux/percpu.h>
|
arm64/sve: Core task context handling
This patch adds the core support for switching and managing the SVE
architectural state of user tasks.
Calls to the existing FPSIMD low-level save/restore functions are
factored out as new functions task_fpsimd_{save,load}(), since SVE
now dynamically may or may not need to be handled at these points
depending on the kernel configuration, hardware features discovered
at boot, and the runtime state of the task. To make these
decisions as fast as possible, const cpucaps are used where
feasible, via the system_supports_sve() helper.
The SVE registers are only tracked for threads that have explicitly
used SVE, indicated by the new thread flag TIF_SVE. Otherwise, the
FPSIMD view of the architectural state is stored in
thread.fpsimd_state as usual.
When in use, the SVE registers are not stored directly in
thread_struct due to their potentially large and variable size.
Because the task_struct slab allocator must be configured very
early during kernel boot, it is also tricky to configure it
correctly to match the maximum vector length provided by the
hardware, since this depends on examining secondary CPUs as well as
the primary. Instead, a pointer sve_state in thread_struct points
to a dynamically allocated buffer containing the SVE register data,
and code is added to allocate and free this buffer at appropriate
times.
TIF_SVE is set when taking an SVE access trap from userspace, if
suitable hardware support has been detected. This enables SVE for
the thread: a subsequent return to userspace will disable the trap
accordingly. If such a trap is taken without sufficient system-
wide hardware support, SIGILL is sent to the thread instead as if
an undefined instruction had been executed: this may happen if
userspace tries to use SVE in a system where not all CPUs support
it for example.
The kernel will clear TIF_SVE and disable SVE for the thread
whenever an explicit syscall is made by userspace. For backwards
compatibility reasons and conformance with the spirit of the base
AArch64 procedure call standard, the subset of the SVE register
state that aliases the FPSIMD registers is still preserved across a
syscall even if this happens. The remainder of the SVE register
state logically becomes zero at syscall entry, though the actual
zeroing work is currently deferred until the thread next tries to
use SVE, causing another trap to the kernel. This implementation
is suboptimal: in the future, the fastpath case may be optimised
to zero the registers in-place and leave SVE enabled for the task,
where beneficial.
TIF_SVE is also cleared in the following slowpath cases, which are
taken as reasonable hints that the task may no longer use SVE:
* exec
* fork and clone
Code is added to sync data between thread.fpsimd_state and
thread.sve_state whenever enabling/disabling SVE, in a manner
consistent with the SVE architectural programmer's model.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Alex Bennée <alex.bennee@linaro.org>
[will: added #include to fix allnoconfig build]
[will: use enable_daif in do_sve_acc]
Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-10-31 15:51:05 +00:00
|
|
|
#include <linux/thread_info.h>
|
2019-07-23 19:58:39 +02:00
|
|
|
#include <linux/prctl.h>
|
arm64: Make __get_wchan() use arch_stack_walk()
To enable RELIABLE_STACKTRACE and LIVEPATCH on arm64, we need to
substantially rework arm64's unwinding code. As part of this, we want to
minimize the set of unwind interfaces we expose, and avoid open-coding
of unwind logic outside of stacktrace.c.
Currently, __get_wchan() walks the stack of a blocked task by calling
start_backtrace() with the task's saved PC and FP values, and iterating
unwind steps using unwind_frame(). The initialization is functionally
equivalent to calling arch_stack_walk() with the blocked task, which
will start with the task's saved PC and FP values.
Currently __get_wchan() always performs an initial unwind step, which
will stkip __switch_to(), but as this is now marked as a __sched
function, this no longer needs special handling and will be skipped in
the same way as other sched functions.
Make __get_wchan() use arch_stack_walk(). This simplifies __get_wchan(),
and in future will alow us to make unwind_frame() private to
stacktrace.c. At the same time, we can simplify the try_get_task_stack()
check and avoid the unnecessary `stack_page` variable.
The change to the skipping logic means we may terminate one frame
earlier than previously where there are an excessive number of sched
functions in the trace, but this isn't seen in practice, and wchan is
best-effort anyway, so this should not be a problem.
Other than the above, there should be no functional change as a result
of this patch.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
[Mark: rebase atop wchan changes, elaborate commit message, fix includes]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20211129142849.3056714-6-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2021-11-29 14:28:45 +00:00
|
|
|
#include <linux/stacktrace.h>
|
2012-03-05 11:49:28 +00:00
|
|
|
|
2016-02-05 14:58:48 +00:00
|
|
|
#include <asm/alternative.h>
|
arm64: Implement prctl(PR_{G,S}ET_TSC)
On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and
CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to
implement various erratum workarounds, check whether the CPU needs
a workaround whenever we potentially need to change it.
This is needed for a correct implementation of non-instrumenting
record-replay debugging on arm64 (i.e. rr; https://rr-project.org/).
rr must trap and record any sources of non-determinism from the
userspace program's perspective so it can be replayed later. This
includes the results of syscalls as well as the results of access
to architected timers exposed directly to the program. This prctl
was originally added for x86 by commit 8fb402bccf20 ("generic, x86:
add prctl commands PR_GET_TSC and PR_SET_TSC"), and rr uses it to
trap RDTSC on x86 for the same reason.
We also considered exposing this as a PTRACE_EVENT. However, prctl
seems like a better choice for these reasons:
1) In general an in-process control seems more useful than an
out-of-process control, since anything that you would be able to
do with ptrace could also be done with prctl (tracer can inject a
call to the prctl and handle signal-delivery-stops), and it avoids
needing an additional process (which will complicate debugging
of the ptraced process since it cannot have more than one tracer,
and will be incompatible with ptrace_scope=3) in cases where that
is not otherwise necessary.
2) Consistency with x86_64. Note that on x86_64, RDTSC has been there
since the start, so it's the same situation as on arm64.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I233a1867d1ccebe2933a347552e7eae862344421
Link: https://lore.kernel.org/r/20240824015415.488474-1-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-08-23 18:54:13 -07:00
|
|
|
#include <asm/arch_timer.h>
|
2012-03-05 11:49:28 +00:00
|
|
|
#include <asm/compat.h>
|
arm64: entry.S: Do not preempt from IRQ before all cpufeatures are enabled
Preempting from IRQ-return means that the task has its PSTATE saved
on the stack, which will get restored when the task is resumed and does
the actual IRQ return.
However, enabling some CPU features requires modifying the PSTATE. This
means that, if a task was scheduled out during an IRQ-return before all
CPU features are enabled, the task might restore a PSTATE that does not
include the feature enablement changes once scheduled back in.
* Task 1:
PAN == 0 ---| |---------------
| |<- return from IRQ, PSTATE.PAN = 0
| <- IRQ |
+--------+ <- preempt() +--
^
|
reschedule Task 1, PSTATE.PAN == 1
* Init:
--------------------+------------------------
^
|
enable_cpu_features
set PSTATE.PAN on all CPUs
Worse than this, since PSTATE is untouched when task switching is done,
a task missing the new bits in PSTATE might affect another task, if both
do direct calls to schedule() (outside of IRQ/exception contexts).
Fix this by preventing preemption on IRQ-return until features are
enabled on all CPUs.
This way the only PSTATE values that are saved on the stack are from
synchronous exceptions. These are expected to be fatal this early, the
exception is BRK for WARN_ON(), but as this uses do_debug_exception()
which keeps IRQs masked, it shouldn't call schedule().
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
[james: Replaced a really cool hack, with an even simpler static key in C.
expanded commit message with Julien's cover-letter ascii art]
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2019-10-15 18:25:44 +01:00
|
|
|
#include <asm/cpufeature.h>
|
2012-03-05 11:49:28 +00:00
|
|
|
#include <asm/cacheflush.h>
|
2016-10-18 11:27:48 +01:00
|
|
|
#include <asm/exec.h>
|
2013-01-17 12:31:45 +00:00
|
|
|
#include <asm/fpsimd.h>
|
2024-10-01 23:59:00 +01:00
|
|
|
#include <asm/gcs.h>
|
2013-01-17 12:31:45 +00:00
|
|
|
#include <asm/mmu_context.h>
|
2019-09-16 11:51:17 +01:00
|
|
|
#include <asm/mte.h>
|
2012-03-05 11:49:28 +00:00
|
|
|
#include <asm/processor.h>
|
arm64: add basic pointer authentication support
This patch adds basic support for pointer authentication, allowing
userspace to make use of APIAKey, APIBKey, APDAKey, APDBKey, and
APGAKey. The kernel maintains key values for each process (shared by all
threads within), which are initialised to random values at exec() time.
The ID_AA64ISAR1_EL1.{APA,API,GPA,GPI} fields are exposed to userspace,
to describe that pointer authentication instructions are available and
that the kernel is managing the keys. Two new hwcaps are added for the
same reason: PACA (for address authentication) and PACG (for generic
authentication).
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com>
Tested-by: Adam Wallis <awallis@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ramana Radhakrishnan <ramana.radhakrishnan@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
[will: Fix sizeof() usage and unroll address key initialisation]
Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-12-07 18:39:25 +00:00
|
|
|
#include <asm/pointer_auth.h>
|
2012-03-05 11:49:28 +00:00
|
|
|
#include <asm/stacktrace.h>
|
2021-03-24 12:24:58 +05:30
|
|
|
#include <asm/switch_to.h>
|
|
|
|
#include <asm/system_misc.h>
|
2012-03-05 11:49:28 +00:00
|
|
|
|
2018-12-12 13:08:44 +01:00
|
|
|
#if defined(CONFIG_STACKPROTECTOR) && !defined(CONFIG_STACKPROTECTOR_PER_TASK)
|
2014-06-25 23:55:03 +01:00
|
|
|
#include <linux/stackprotector.h>
|
2021-09-14 17:44:02 +08:00
|
|
|
unsigned long __stack_chk_guard __ro_after_init;
|
2014-06-25 23:55:03 +01:00
|
|
|
EXPORT_SYMBOL(__stack_chk_guard);
|
|
|
|
#endif
|
|
|
|
|
2012-03-05 11:49:28 +00:00
|
|
|
/*
|
|
|
|
* Function pointers to optional machine specific functions
|
|
|
|
*/
|
|
|
|
void (*pm_power_off)(void);
|
|
|
|
EXPORT_SYMBOL_GPL(pm_power_off);
|
|
|
|
|
2013-10-24 20:30:18 +01:00
|
|
|
#ifdef CONFIG_HOTPLUG_CPU
|
2023-02-13 23:05:58 -08:00
|
|
|
void __noreturn arch_cpu_idle_dead(void)
|
2013-10-24 20:30:18 +01:00
|
|
|
{
|
|
|
|
cpu_die();
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
arm64: Fix machine_shutdown() definition
This patch ports most of commit 19ab428f4b79 "ARM: 7759/1: decouple CPU
offlining from reboot/shutdown" by Stephen Warren from arch/arm to
arch/arm64.
machine_shutdown() is a hook for kexec. Add a comment saying so, since
it isn't obvious from the function name.
Halt, power-off, and restart have different requirements re: stopping
secondary CPUs than kexec has. The former simply require the secondary
CPUs to be quiesced somehow, whereas kexec requires them to be
completely non-operational, so that no matter where the kexec target
images are written in RAM, they won't influence operation of the
secondary CPUS,which could happen if the CPUs were still executing some
kind of pin loop. To this end, modify machine_halt, power_off, and
restart to call smp_send_stop() directly, rather than calling
machine_shutdown().
In machine_shutdown(), replace the call to smp_send_stop() with a call
to disable_nonboot_cpus(). This completely disables all but one CPU,
thus satisfying the kexec requirements a couple paragraphs above.
Signed-off-by: Arun KS <getarunks@gmail.com>
Acked-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2014-05-07 02:41:22 +01:00
|
|
|
/*
|
|
|
|
* Called by kexec, immediately prior to machine_kexec().
|
|
|
|
*
|
|
|
|
* This must completely disable all secondary CPUs; simply causing those CPUs
|
|
|
|
* to execute e.g. a RAM-based pin loop is not sufficient. This allows the
|
|
|
|
* kexec'd kernel to use any and all RAM as it sees fit, without having to
|
|
|
|
* avoid any code or data used by any SW CPU pin loop. The CPU hotplug
|
2020-03-23 13:50:59 +00:00
|
|
|
* functionality embodied in smpt_shutdown_nonboot_cpus() to achieve this.
|
arm64: Fix machine_shutdown() definition
This patch ports most of commit 19ab428f4b79 "ARM: 7759/1: decouple CPU
offlining from reboot/shutdown" by Stephen Warren from arch/arm to
arch/arm64.
machine_shutdown() is a hook for kexec. Add a comment saying so, since
it isn't obvious from the function name.
Halt, power-off, and restart have different requirements re: stopping
secondary CPUs than kexec has. The former simply require the secondary
CPUs to be quiesced somehow, whereas kexec requires them to be
completely non-operational, so that no matter where the kexec target
images are written in RAM, they won't influence operation of the
secondary CPUS,which could happen if the CPUs were still executing some
kind of pin loop. To this end, modify machine_halt, power_off, and
restart to call smp_send_stop() directly, rather than calling
machine_shutdown().
In machine_shutdown(), replace the call to smp_send_stop() with a call
to disable_nonboot_cpus(). This completely disables all but one CPU,
thus satisfying the kexec requirements a couple paragraphs above.
Signed-off-by: Arun KS <getarunks@gmail.com>
Acked-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2014-05-07 02:41:22 +01:00
|
|
|
*/
|
2012-03-05 11:49:28 +00:00
|
|
|
void machine_shutdown(void)
|
|
|
|
{
|
2020-03-23 13:51:00 +00:00
|
|
|
smp_shutdown_nonboot_cpus(reboot_cpu);
|
2012-03-05 11:49:28 +00:00
|
|
|
}
|
|
|
|
|
arm64: Fix machine_shutdown() definition
This patch ports most of commit 19ab428f4b79 "ARM: 7759/1: decouple CPU
offlining from reboot/shutdown" by Stephen Warren from arch/arm to
arch/arm64.
machine_shutdown() is a hook for kexec. Add a comment saying so, since
it isn't obvious from the function name.
Halt, power-off, and restart have different requirements re: stopping
secondary CPUs than kexec has. The former simply require the secondary
CPUs to be quiesced somehow, whereas kexec requires them to be
completely non-operational, so that no matter where the kexec target
images are written in RAM, they won't influence operation of the
secondary CPUS,which could happen if the CPUs were still executing some
kind of pin loop. To this end, modify machine_halt, power_off, and
restart to call smp_send_stop() directly, rather than calling
machine_shutdown().
In machine_shutdown(), replace the call to smp_send_stop() with a call
to disable_nonboot_cpus(). This completely disables all but one CPU,
thus satisfying the kexec requirements a couple paragraphs above.
Signed-off-by: Arun KS <getarunks@gmail.com>
Acked-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2014-05-07 02:41:22 +01:00
|
|
|
/*
|
|
|
|
* Halting simply requires that the secondary CPUs stop performing any
|
|
|
|
* activity (executing tasks, handling interrupts). smp_send_stop()
|
|
|
|
* achieves this.
|
|
|
|
*/
|
2012-03-05 11:49:28 +00:00
|
|
|
void machine_halt(void)
|
|
|
|
{
|
2014-05-07 02:41:23 +01:00
|
|
|
local_irq_disable();
|
arm64: Fix machine_shutdown() definition
This patch ports most of commit 19ab428f4b79 "ARM: 7759/1: decouple CPU
offlining from reboot/shutdown" by Stephen Warren from arch/arm to
arch/arm64.
machine_shutdown() is a hook for kexec. Add a comment saying so, since
it isn't obvious from the function name.
Halt, power-off, and restart have different requirements re: stopping
secondary CPUs than kexec has. The former simply require the secondary
CPUs to be quiesced somehow, whereas kexec requires them to be
completely non-operational, so that no matter where the kexec target
images are written in RAM, they won't influence operation of the
secondary CPUS,which could happen if the CPUs were still executing some
kind of pin loop. To this end, modify machine_halt, power_off, and
restart to call smp_send_stop() directly, rather than calling
machine_shutdown().
In machine_shutdown(), replace the call to smp_send_stop() with a call
to disable_nonboot_cpus(). This completely disables all but one CPU,
thus satisfying the kexec requirements a couple paragraphs above.
Signed-off-by: Arun KS <getarunks@gmail.com>
Acked-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2014-05-07 02:41:22 +01:00
|
|
|
smp_send_stop();
|
2012-03-05 11:49:28 +00:00
|
|
|
while (1);
|
|
|
|
}
|
|
|
|
|
arm64: Fix machine_shutdown() definition
This patch ports most of commit 19ab428f4b79 "ARM: 7759/1: decouple CPU
offlining from reboot/shutdown" by Stephen Warren from arch/arm to
arch/arm64.
machine_shutdown() is a hook for kexec. Add a comment saying so, since
it isn't obvious from the function name.
Halt, power-off, and restart have different requirements re: stopping
secondary CPUs than kexec has. The former simply require the secondary
CPUs to be quiesced somehow, whereas kexec requires them to be
completely non-operational, so that no matter where the kexec target
images are written in RAM, they won't influence operation of the
secondary CPUS,which could happen if the CPUs were still executing some
kind of pin loop. To this end, modify machine_halt, power_off, and
restart to call smp_send_stop() directly, rather than calling
machine_shutdown().
In machine_shutdown(), replace the call to smp_send_stop() with a call
to disable_nonboot_cpus(). This completely disables all but one CPU,
thus satisfying the kexec requirements a couple paragraphs above.
Signed-off-by: Arun KS <getarunks@gmail.com>
Acked-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2014-05-07 02:41:22 +01:00
|
|
|
/*
|
|
|
|
* Power-off simply requires that the secondary CPUs stop performing any
|
|
|
|
* activity (executing tasks, handling interrupts). smp_send_stop()
|
|
|
|
* achieves this. When the system power is turned off, it will take all CPUs
|
|
|
|
* with it.
|
|
|
|
*/
|
2012-03-05 11:49:28 +00:00
|
|
|
void machine_power_off(void)
|
|
|
|
{
|
2014-05-07 02:41:23 +01:00
|
|
|
local_irq_disable();
|
arm64: Fix machine_shutdown() definition
This patch ports most of commit 19ab428f4b79 "ARM: 7759/1: decouple CPU
offlining from reboot/shutdown" by Stephen Warren from arch/arm to
arch/arm64.
machine_shutdown() is a hook for kexec. Add a comment saying so, since
it isn't obvious from the function name.
Halt, power-off, and restart have different requirements re: stopping
secondary CPUs than kexec has. The former simply require the secondary
CPUs to be quiesced somehow, whereas kexec requires them to be
completely non-operational, so that no matter where the kexec target
images are written in RAM, they won't influence operation of the
secondary CPUS,which could happen if the CPUs were still executing some
kind of pin loop. To this end, modify machine_halt, power_off, and
restart to call smp_send_stop() directly, rather than calling
machine_shutdown().
In machine_shutdown(), replace the call to smp_send_stop() with a call
to disable_nonboot_cpus(). This completely disables all but one CPU,
thus satisfying the kexec requirements a couple paragraphs above.
Signed-off-by: Arun KS <getarunks@gmail.com>
Acked-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2014-05-07 02:41:22 +01:00
|
|
|
smp_send_stop();
|
2022-05-10 02:32:20 +03:00
|
|
|
do_kernel_power_off();
|
2012-03-05 11:49:28 +00:00
|
|
|
}
|
|
|
|
|
arm64: Fix machine_shutdown() definition
This patch ports most of commit 19ab428f4b79 "ARM: 7759/1: decouple CPU
offlining from reboot/shutdown" by Stephen Warren from arch/arm to
arch/arm64.
machine_shutdown() is a hook for kexec. Add a comment saying so, since
it isn't obvious from the function name.
Halt, power-off, and restart have different requirements re: stopping
secondary CPUs than kexec has. The former simply require the secondary
CPUs to be quiesced somehow, whereas kexec requires them to be
completely non-operational, so that no matter where the kexec target
images are written in RAM, they won't influence operation of the
secondary CPUS,which could happen if the CPUs were still executing some
kind of pin loop. To this end, modify machine_halt, power_off, and
restart to call smp_send_stop() directly, rather than calling
machine_shutdown().
In machine_shutdown(), replace the call to smp_send_stop() with a call
to disable_nonboot_cpus(). This completely disables all but one CPU,
thus satisfying the kexec requirements a couple paragraphs above.
Signed-off-by: Arun KS <getarunks@gmail.com>
Acked-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2014-05-07 02:41:22 +01:00
|
|
|
/*
|
|
|
|
* Restart requires that the secondary CPUs stop performing any activity
|
2015-04-20 10:24:35 +01:00
|
|
|
* while the primary CPU resets the system. Systems with multiple CPUs must
|
arm64: Fix machine_shutdown() definition
This patch ports most of commit 19ab428f4b79 "ARM: 7759/1: decouple CPU
offlining from reboot/shutdown" by Stephen Warren from arch/arm to
arch/arm64.
machine_shutdown() is a hook for kexec. Add a comment saying so, since
it isn't obvious from the function name.
Halt, power-off, and restart have different requirements re: stopping
secondary CPUs than kexec has. The former simply require the secondary
CPUs to be quiesced somehow, whereas kexec requires them to be
completely non-operational, so that no matter where the kexec target
images are written in RAM, they won't influence operation of the
secondary CPUS,which could happen if the CPUs were still executing some
kind of pin loop. To this end, modify machine_halt, power_off, and
restart to call smp_send_stop() directly, rather than calling
machine_shutdown().
In machine_shutdown(), replace the call to smp_send_stop() with a call
to disable_nonboot_cpus(). This completely disables all but one CPU,
thus satisfying the kexec requirements a couple paragraphs above.
Signed-off-by: Arun KS <getarunks@gmail.com>
Acked-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2014-05-07 02:41:22 +01:00
|
|
|
* provide a HW restart implementation, to ensure that all CPUs reset at once.
|
|
|
|
* This is required so that any code running after reset on the primary CPU
|
|
|
|
* doesn't have to co-ordinate with other CPUs to ensure they aren't still
|
|
|
|
* executing pre-reset code, and using RAM that the primary CPU's code wishes
|
|
|
|
* to use. Implementing such co-ordination would be essentially impossible.
|
|
|
|
*/
|
2012-03-05 11:49:28 +00:00
|
|
|
void machine_restart(char *cmd)
|
|
|
|
{
|
|
|
|
/* Disable interrupts first */
|
|
|
|
local_irq_disable();
|
2014-05-07 02:41:23 +01:00
|
|
|
smp_send_stop();
|
2012-03-05 11:49:28 +00:00
|
|
|
|
2015-03-06 15:49:24 +01:00
|
|
|
/*
|
|
|
|
* UpdateCapsule() depends on the system being reset via
|
|
|
|
* ResetSystem().
|
|
|
|
*/
|
|
|
|
if (efi_enabled(EFI_RUNTIME_SERVICES))
|
|
|
|
efi_reboot(reboot_mode, NULL);
|
|
|
|
|
2012-03-05 11:49:28 +00:00
|
|
|
/* Now call the architecture specific reboot code. */
|
2021-06-04 15:07:36 +01:00
|
|
|
do_kernel_restart(cmd);
|
2012-03-05 11:49:28 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Whoops - the architecture was unable to reboot.
|
|
|
|
*/
|
|
|
|
printk("Reboot failed -- System halted\n");
|
|
|
|
while (1);
|
|
|
|
}
|
|
|
|
|
arm64: BTI: Decode BYTPE bits when printing PSTATE
The current code to print PSTATE symbolically when generating
backtraces etc., does not include the BYTPE field used by Branch
Target Identification.
So, decode BYTPE and print it too.
In the interests of human-readability, print the classes of BTI
matched. The symbolic notation, BYTPE (PSTATE[11:10]) and
permitted classes of subsequent instruction are:
-- (BTYPE=0b00): any insn
jc (BTYPE=0b01): BTI jc, BTI j, BTI c, PACIxSP
-c (BYTPE=0b10): BTI jc, BTI c, PACIxSP
j- (BTYPE=0b11): BTI jc, BTI j
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2020-03-16 16:50:48 +00:00
|
|
|
#define bstr(suffix, str) [PSR_BTYPE_ ## suffix >> PSR_BTYPE_SHIFT] = str
|
|
|
|
static const char *const btypes[] = {
|
|
|
|
bstr(NONE, "--"),
|
|
|
|
bstr( JC, "jc"),
|
|
|
|
bstr( C, "-c"),
|
|
|
|
bstr( J , "j-")
|
|
|
|
};
|
|
|
|
#undef bstr
|
|
|
|
|
2017-10-19 13:26:26 +01:00
|
|
|
static void print_pstate(struct pt_regs *regs)
|
|
|
|
{
|
|
|
|
u64 pstate = regs->pstate;
|
|
|
|
|
|
|
|
if (compat_user_mode(regs)) {
|
2021-07-22 10:20:36 +08:00
|
|
|
printk("pstate: %08llx (%c%c%c%c %c %s %s %c%c%c %cDIT %cSSBS)\n",
|
2017-10-19 13:26:26 +01:00
|
|
|
pstate,
|
2018-07-05 15:16:52 +01:00
|
|
|
pstate & PSR_AA32_N_BIT ? 'N' : 'n',
|
|
|
|
pstate & PSR_AA32_Z_BIT ? 'Z' : 'z',
|
|
|
|
pstate & PSR_AA32_C_BIT ? 'C' : 'c',
|
|
|
|
pstate & PSR_AA32_V_BIT ? 'V' : 'v',
|
|
|
|
pstate & PSR_AA32_Q_BIT ? 'Q' : 'q',
|
|
|
|
pstate & PSR_AA32_T_BIT ? "T32" : "A32",
|
|
|
|
pstate & PSR_AA32_E_BIT ? "BE" : "LE",
|
|
|
|
pstate & PSR_AA32_A_BIT ? 'A' : 'a',
|
|
|
|
pstate & PSR_AA32_I_BIT ? 'I' : 'i',
|
2021-07-22 10:20:36 +08:00
|
|
|
pstate & PSR_AA32_F_BIT ? 'F' : 'f',
|
|
|
|
pstate & PSR_AA32_DIT_BIT ? '+' : '-',
|
|
|
|
pstate & PSR_AA32_SSBS_BIT ? '+' : '-');
|
2017-10-19 13:26:26 +01:00
|
|
|
} else {
|
arm64: BTI: Decode BYTPE bits when printing PSTATE
The current code to print PSTATE symbolically when generating
backtraces etc., does not include the BYTPE field used by Branch
Target Identification.
So, decode BYTPE and print it too.
In the interests of human-readability, print the classes of BTI
matched. The symbolic notation, BYTPE (PSTATE[11:10]) and
permitted classes of subsequent instruction are:
-- (BTYPE=0b00): any insn
jc (BTYPE=0b01): BTI jc, BTI j, BTI c, PACIxSP
-c (BYTPE=0b10): BTI jc, BTI c, PACIxSP
j- (BTYPE=0b11): BTI jc, BTI j
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2020-03-16 16:50:48 +00:00
|
|
|
const char *btype_str = btypes[(pstate & PSR_BTYPE_MASK) >>
|
|
|
|
PSR_BTYPE_SHIFT];
|
|
|
|
|
2021-07-22 10:20:36 +08:00
|
|
|
printk("pstate: %08llx (%c%c%c%c %c%c%c%c %cPAN %cUAO %cTCO %cDIT %cSSBS BTYPE=%s)\n",
|
2017-10-19 13:26:26 +01:00
|
|
|
pstate,
|
|
|
|
pstate & PSR_N_BIT ? 'N' : 'n',
|
|
|
|
pstate & PSR_Z_BIT ? 'Z' : 'z',
|
|
|
|
pstate & PSR_C_BIT ? 'C' : 'c',
|
|
|
|
pstate & PSR_V_BIT ? 'V' : 'v',
|
|
|
|
pstate & PSR_D_BIT ? 'D' : 'd',
|
|
|
|
pstate & PSR_A_BIT ? 'A' : 'a',
|
|
|
|
pstate & PSR_I_BIT ? 'I' : 'i',
|
|
|
|
pstate & PSR_F_BIT ? 'F' : 'f',
|
|
|
|
pstate & PSR_PAN_BIT ? '+' : '-',
|
arm64: BTI: Decode BYTPE bits when printing PSTATE
The current code to print PSTATE symbolically when generating
backtraces etc., does not include the BYTPE field used by Branch
Target Identification.
So, decode BYTPE and print it too.
In the interests of human-readability, print the classes of BTI
matched. The symbolic notation, BYTPE (PSTATE[11:10]) and
permitted classes of subsequent instruction are:
-- (BTYPE=0b00): any insn
jc (BTYPE=0b01): BTI jc, BTI j, BTI c, PACIxSP
-c (BYTPE=0b10): BTI jc, BTI c, PACIxSP
j- (BTYPE=0b11): BTI jc, BTI j
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2020-03-16 16:50:48 +00:00
|
|
|
pstate & PSR_UAO_BIT ? '+' : '-',
|
2019-09-16 11:51:17 +01:00
|
|
|
pstate & PSR_TCO_BIT ? '+' : '-',
|
2021-07-22 10:20:36 +08:00
|
|
|
pstate & PSR_DIT_BIT ? '+' : '-',
|
|
|
|
pstate & PSR_SSBS_BIT ? '+' : '-',
|
arm64: BTI: Decode BYTPE bits when printing PSTATE
The current code to print PSTATE symbolically when generating
backtraces etc., does not include the BYTPE field used by Branch
Target Identification.
So, decode BYTPE and print it too.
In the interests of human-readability, print the classes of BTI
matched. The symbolic notation, BYTPE (PSTATE[11:10]) and
permitted classes of subsequent instruction are:
-- (BTYPE=0b00): any insn
jc (BTYPE=0b01): BTI jc, BTI j, BTI c, PACIxSP
-c (BYTPE=0b10): BTI jc, BTI c, PACIxSP
j- (BTYPE=0b11): BTI jc, BTI j
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2020-03-16 16:50:48 +00:00
|
|
|
btype_str);
|
2017-10-19 13:26:26 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-03-05 11:49:28 +00:00
|
|
|
void __show_regs(struct pt_regs *regs)
|
|
|
|
{
|
2013-09-17 18:49:46 +01:00
|
|
|
int i, top_reg;
|
|
|
|
u64 lr, sp;
|
|
|
|
|
|
|
|
if (compat_user_mode(regs)) {
|
|
|
|
lr = regs->compat_lr;
|
|
|
|
sp = regs->compat_sp;
|
|
|
|
top_reg = 12;
|
|
|
|
} else {
|
|
|
|
lr = regs->regs[30];
|
|
|
|
sp = regs->sp;
|
|
|
|
top_reg = 29;
|
|
|
|
}
|
2012-03-05 11:49:28 +00:00
|
|
|
|
dump_stack: unify debug information printed by show_regs()
show_regs() is inherently arch-dependent but it does make sense to print
generic debug information and some archs already do albeit in slightly
different forms. This patch introduces a generic function to print debug
information from show_regs() so that different archs print out the same
information and it's much easier to modify what's printed.
show_regs_print_info() prints out the same debug info as dump_stack()
does plus task and thread_info pointers.
* Archs which didn't print debug info now do.
alpha, arc, blackfin, c6x, cris, frv, h8300, hexagon, ia64, m32r,
metag, microblaze, mn10300, openrisc, parisc, score, sh64, sparc,
um, xtensa
* Already prints debug info. Replaced with show_regs_print_info().
The printed information is superset of what used to be there.
arm, arm64, avr32, mips, powerpc, sh32, tile, unicore32, x86
* s390 is special in that it used to print arch-specific information
along with generic debug info. Heiko and Martin think that the
arch-specific extra isn't worth keeping s390 specfic implementation.
Converted to use the generic version.
Note that now all archs print the debug info before actual register
dumps.
An example BUG() dump follows.
kernel BUG at /work/os/work/kernel/workqueue.c:4841!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.9.0-rc1-work+ #7
Hardware name: empty empty/S3992, BIOS 080011 10/26/2007
task: ffff88007c85e040 ti: ffff88007c860000 task.ti: ffff88007c860000
RIP: 0010:[<ffffffff8234a07e>] [<ffffffff8234a07e>] init_workqueues+0x4/0x6
RSP: 0000:ffff88007c861ec8 EFLAGS: 00010246
RAX: ffff88007c861fd8 RBX: ffffffff824466a8 RCX: 0000000000000001
RDX: 0000000000000046 RSI: 0000000000000001 RDI: ffffffff8234a07a
RBP: ffff88007c861ec8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff8234a07a
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff88015f7ff000 CR3: 00000000021f1000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
ffff88007c861ef8 ffffffff81000312 ffffffff824466a8 ffff88007c85e650
0000000000000003 0000000000000000 ffff88007c861f38 ffffffff82335e5d
ffff88007c862080 ffffffff8223d8c0 ffff88007c862080 ffffffff81c47760
Call Trace:
[<ffffffff81000312>] do_one_initcall+0x122/0x170
[<ffffffff82335e5d>] kernel_init_freeable+0x9b/0x1c8
[<ffffffff81c47760>] ? rest_init+0x140/0x140
[<ffffffff81c4776e>] kernel_init+0xe/0xf0
[<ffffffff81c6be9c>] ret_from_fork+0x7c/0xb0
[<ffffffff81c47760>] ? rest_init+0x140/0x140
...
v2: Typo fix in x86-32.
v3: CPU number dropped from show_regs_print_info() as
dump_stack_print_info() has been updated to print it. s390
specific implementation dropped as requested by s390 maintainers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Chris Metcalf <cmetcalf@tilera.com> [tile bits]
Acked-by: Richard Kuo <rkuo@codeaurora.org> [hexagon bits]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 15:27:17 -07:00
|
|
|
show_regs_print_info(KERN_DEFAULT);
|
2017-10-19 13:26:26 +01:00
|
|
|
print_pstate(regs);
|
2018-02-19 16:46:57 +00:00
|
|
|
|
|
|
|
if (!user_mode(regs)) {
|
|
|
|
printk("pc : %pS\n", (void *)regs->pc);
|
arm64: use XPACLRI to strip PAC
Currently we strip the PAC from pointers using C code, which requires
generating bitmasks, and conditionally clearing/setting bits depending
on bit 55. We can do better by using XPACLRI directly.
When the logic was originally written to strip PACs from user pointers,
contemporary toolchains used for the kernel had assemblers which were
unaware of the PAC instructions. As stripping the PAC from userspace
pointers required unconditional clearing of a fixed set of bits (which
could be performed with a single instruction), it was simpler to
implement the masking in C than it was to make use of XPACI or XPACLRI.
When support for in-kernel pointer authentication was added, the
stripping logic was extended to cover TTBR1 pointers, requiring several
instructions to handle whether to clear/set bits dependent on bit 55 of
the pointer.
This patch simplifies the stripping of PACs by using XPACLRI directly,
as contemporary toolchains do within __builtin_return_address(). This
saves a number of instructions, especially where
__builtin_return_address() does not implicitly strip the PAC but is
heavily used (e.g. with tracepoints). As the kernel might be compiled
with an assembler without knowledge of XPACLRI, it is assembled using
the 'HINT #7' alias, which results in an identical opcode.
At the same time, I've split ptrauth_strip_insn_pac() into
ptrauth_strip_user_insn_pac() and ptrauth_strip_kernel_insn_pac()
helpers so that we can avoid unnecessary PAC stripping when pointer
authentication is not in use in userspace or kernel respectively.
The underlying xpaclri() macro uses inline assembly which clobbers x30.
The clobber causes the compiler to save/restore the original x30 value
in a frame record (protected with PACIASP and AUTIASP when in-kernel
authentication is enabled), so this does not provide a gadget to alter
the return address. Similarly this does not adversely affect unwinding
due to the presence of the frame record.
The ptrauth_user_pac_mask() and ptrauth_kernel_pac_mask() are exported
from the kernel in ptrace and core dumps, so these are retained. A
subsequent patch will move them out of <asm/compiler.h>.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Amit Daniel Kachhap <amit.kachhap@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Kristina Martsenko <kristina.martsenko@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230412160134.306148-3-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2023-04-12 17:01:33 +01:00
|
|
|
printk("lr : %pS\n", (void *)ptrauth_strip_kernel_insn_pac(lr));
|
2018-02-19 16:46:57 +00:00
|
|
|
} else {
|
|
|
|
printk("pc : %016llx\n", regs->pc);
|
|
|
|
printk("lr : %016llx\n", lr);
|
|
|
|
}
|
|
|
|
|
2017-10-19 13:26:26 +01:00
|
|
|
printk("sp : %016llx\n", sp);
|
arm64: fix show_regs fallout from KERN_CONT changes
Recently in commit 4bcc595ccd80decb ("printk: reinstate KERN_CONT for
printing continuation lines"), the behaviour of printk changed w.r.t.
KERN_CONT. Now, KERN_CONT is mandatory to continue existing lines.
Without this, prefixes are inserted, making output illegible, e.g.
[ 1007.069010] pc : [<ffff00000871898c>] lr : [<ffff000008718948>] pstate: 40000145
[ 1007.076329] sp : ffff000008d53ec0
[ 1007.079606] x29: ffff000008d53ec0 [ 1007.082797] x28: 0000000080c50018
[ 1007.086160]
[ 1007.087630] x27: ffff000008e0c7f8 [ 1007.090820] x26: ffff80097631ca00
[ 1007.094183]
[ 1007.095653] x25: 0000000000000001 [ 1007.098843] x24: 000000ea68b61cac
[ 1007.102206]
... or when dumped with the userpace dmesg tool, which has slightly
different implicit newline behaviour. e.g.
[ 1007.069010] pc : [<ffff00000871898c>] lr : [<ffff000008718948>] pstate: 40000145
[ 1007.076329] sp : ffff000008d53ec0
[ 1007.079606] x29: ffff000008d53ec0
[ 1007.082797] x28: 0000000080c50018
[ 1007.086160]
[ 1007.087630] x27: ffff000008e0c7f8
[ 1007.090820] x26: ffff80097631ca00
[ 1007.094183]
[ 1007.095653] x25: 0000000000000001
[ 1007.098843] x24: 000000ea68b61cac
[ 1007.102206]
We can't simply always use KERN_CONT for lines which may or may not be
continuations. That causes line prefixes (e.g. timestamps) to be
supressed, and the alignment of all but the first line will be broken.
For even more fun, we can't simply insert some dummy empty-string printk
calls, as GCC warns for an empty printk string, and even if we pass
KERN_DEFAULT explcitly to silence the warning, the prefix gets swallowed
unless there is an additional part to the string.
Instead, we must manually iterate over pairs of registers, which gives
us the legible output we want in either case, e.g.
[ 169.771790] pc : [<ffff00000871898c>] lr : [<ffff000008718948>] pstate: 40000145
[ 169.779109] sp : ffff000008d53ec0
[ 169.782386] x29: ffff000008d53ec0 x28: 0000000080c50018
[ 169.787650] x27: ffff000008e0c7f8 x26: ffff80097631de00
[ 169.792913] x25: 0000000000000001 x24: 00000027827b2cf4
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-10-20 12:23:16 +01:00
|
|
|
|
2019-01-31 14:58:46 +00:00
|
|
|
if (system_uses_irq_prio_masking())
|
2024-10-17 10:25:32 +01:00
|
|
|
printk("pmr: %08x\n", regs->pmr);
|
2019-01-31 14:58:46 +00:00
|
|
|
|
arm64: fix show_regs fallout from KERN_CONT changes
Recently in commit 4bcc595ccd80decb ("printk: reinstate KERN_CONT for
printing continuation lines"), the behaviour of printk changed w.r.t.
KERN_CONT. Now, KERN_CONT is mandatory to continue existing lines.
Without this, prefixes are inserted, making output illegible, e.g.
[ 1007.069010] pc : [<ffff00000871898c>] lr : [<ffff000008718948>] pstate: 40000145
[ 1007.076329] sp : ffff000008d53ec0
[ 1007.079606] x29: ffff000008d53ec0 [ 1007.082797] x28: 0000000080c50018
[ 1007.086160]
[ 1007.087630] x27: ffff000008e0c7f8 [ 1007.090820] x26: ffff80097631ca00
[ 1007.094183]
[ 1007.095653] x25: 0000000000000001 [ 1007.098843] x24: 000000ea68b61cac
[ 1007.102206]
... or when dumped with the userpace dmesg tool, which has slightly
different implicit newline behaviour. e.g.
[ 1007.069010] pc : [<ffff00000871898c>] lr : [<ffff000008718948>] pstate: 40000145
[ 1007.076329] sp : ffff000008d53ec0
[ 1007.079606] x29: ffff000008d53ec0
[ 1007.082797] x28: 0000000080c50018
[ 1007.086160]
[ 1007.087630] x27: ffff000008e0c7f8
[ 1007.090820] x26: ffff80097631ca00
[ 1007.094183]
[ 1007.095653] x25: 0000000000000001
[ 1007.098843] x24: 000000ea68b61cac
[ 1007.102206]
We can't simply always use KERN_CONT for lines which may or may not be
continuations. That causes line prefixes (e.g. timestamps) to be
supressed, and the alignment of all but the first line will be broken.
For even more fun, we can't simply insert some dummy empty-string printk
calls, as GCC warns for an empty printk string, and even if we pass
KERN_DEFAULT explcitly to silence the warning, the prefix gets swallowed
unless there is an additional part to the string.
Instead, we must manually iterate over pairs of registers, which gives
us the legible output we want in either case, e.g.
[ 169.771790] pc : [<ffff00000871898c>] lr : [<ffff000008718948>] pstate: 40000145
[ 169.779109] sp : ffff000008d53ec0
[ 169.782386] x29: ffff000008d53ec0 x28: 0000000080c50018
[ 169.787650] x27: ffff000008e0c7f8 x26: ffff80097631de00
[ 169.792913] x25: 0000000000000001 x24: 00000027827b2cf4
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-10-20 12:23:16 +01:00
|
|
|
i = top_reg;
|
|
|
|
|
|
|
|
while (i >= 0) {
|
2021-04-20 18:22:45 +01:00
|
|
|
printk("x%-2d: %016llx", i, regs->regs[i]);
|
arm64: fix show_regs fallout from KERN_CONT changes
Recently in commit 4bcc595ccd80decb ("printk: reinstate KERN_CONT for
printing continuation lines"), the behaviour of printk changed w.r.t.
KERN_CONT. Now, KERN_CONT is mandatory to continue existing lines.
Without this, prefixes are inserted, making output illegible, e.g.
[ 1007.069010] pc : [<ffff00000871898c>] lr : [<ffff000008718948>] pstate: 40000145
[ 1007.076329] sp : ffff000008d53ec0
[ 1007.079606] x29: ffff000008d53ec0 [ 1007.082797] x28: 0000000080c50018
[ 1007.086160]
[ 1007.087630] x27: ffff000008e0c7f8 [ 1007.090820] x26: ffff80097631ca00
[ 1007.094183]
[ 1007.095653] x25: 0000000000000001 [ 1007.098843] x24: 000000ea68b61cac
[ 1007.102206]
... or when dumped with the userpace dmesg tool, which has slightly
different implicit newline behaviour. e.g.
[ 1007.069010] pc : [<ffff00000871898c>] lr : [<ffff000008718948>] pstate: 40000145
[ 1007.076329] sp : ffff000008d53ec0
[ 1007.079606] x29: ffff000008d53ec0
[ 1007.082797] x28: 0000000080c50018
[ 1007.086160]
[ 1007.087630] x27: ffff000008e0c7f8
[ 1007.090820] x26: ffff80097631ca00
[ 1007.094183]
[ 1007.095653] x25: 0000000000000001
[ 1007.098843] x24: 000000ea68b61cac
[ 1007.102206]
We can't simply always use KERN_CONT for lines which may or may not be
continuations. That causes line prefixes (e.g. timestamps) to be
supressed, and the alignment of all but the first line will be broken.
For even more fun, we can't simply insert some dummy empty-string printk
calls, as GCC warns for an empty printk string, and even if we pass
KERN_DEFAULT explcitly to silence the warning, the prefix gets swallowed
unless there is an additional part to the string.
Instead, we must manually iterate over pairs of registers, which gives
us the legible output we want in either case, e.g.
[ 169.771790] pc : [<ffff00000871898c>] lr : [<ffff000008718948>] pstate: 40000145
[ 169.779109] sp : ffff000008d53ec0
[ 169.782386] x29: ffff000008d53ec0 x28: 0000000080c50018
[ 169.787650] x27: ffff000008e0c7f8 x26: ffff80097631de00
[ 169.792913] x25: 0000000000000001 x24: 00000027827b2cf4
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-10-20 12:23:16 +01:00
|
|
|
|
2021-04-20 18:22:45 +01:00
|
|
|
while (i-- % 3)
|
|
|
|
pr_cont(" x%-2d: %016llx", i, regs->regs[i]);
|
arm64: fix show_regs fallout from KERN_CONT changes
Recently in commit 4bcc595ccd80decb ("printk: reinstate KERN_CONT for
printing continuation lines"), the behaviour of printk changed w.r.t.
KERN_CONT. Now, KERN_CONT is mandatory to continue existing lines.
Without this, prefixes are inserted, making output illegible, e.g.
[ 1007.069010] pc : [<ffff00000871898c>] lr : [<ffff000008718948>] pstate: 40000145
[ 1007.076329] sp : ffff000008d53ec0
[ 1007.079606] x29: ffff000008d53ec0 [ 1007.082797] x28: 0000000080c50018
[ 1007.086160]
[ 1007.087630] x27: ffff000008e0c7f8 [ 1007.090820] x26: ffff80097631ca00
[ 1007.094183]
[ 1007.095653] x25: 0000000000000001 [ 1007.098843] x24: 000000ea68b61cac
[ 1007.102206]
... or when dumped with the userpace dmesg tool, which has slightly
different implicit newline behaviour. e.g.
[ 1007.069010] pc : [<ffff00000871898c>] lr : [<ffff000008718948>] pstate: 40000145
[ 1007.076329] sp : ffff000008d53ec0
[ 1007.079606] x29: ffff000008d53ec0
[ 1007.082797] x28: 0000000080c50018
[ 1007.086160]
[ 1007.087630] x27: ffff000008e0c7f8
[ 1007.090820] x26: ffff80097631ca00
[ 1007.094183]
[ 1007.095653] x25: 0000000000000001
[ 1007.098843] x24: 000000ea68b61cac
[ 1007.102206]
We can't simply always use KERN_CONT for lines which may or may not be
continuations. That causes line prefixes (e.g. timestamps) to be
supressed, and the alignment of all but the first line will be broken.
For even more fun, we can't simply insert some dummy empty-string printk
calls, as GCC warns for an empty printk string, and even if we pass
KERN_DEFAULT explcitly to silence the warning, the prefix gets swallowed
unless there is an additional part to the string.
Instead, we must manually iterate over pairs of registers, which gives
us the legible output we want in either case, e.g.
[ 169.771790] pc : [<ffff00000871898c>] lr : [<ffff000008718948>] pstate: 40000145
[ 169.779109] sp : ffff000008d53ec0
[ 169.782386] x29: ffff000008d53ec0 x28: 0000000080c50018
[ 169.787650] x27: ffff000008e0c7f8 x26: ffff80097631de00
[ 169.792913] x25: 0000000000000001 x24: 00000027827b2cf4
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-10-20 12:23:16 +01:00
|
|
|
|
|
|
|
pr_cont("\n");
|
2012-03-05 11:49:28 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-02-04 09:43:49 +08:00
|
|
|
void show_regs(struct pt_regs *regs)
|
2012-03-05 11:49:28 +00:00
|
|
|
{
|
|
|
|
__show_regs(regs);
|
2020-06-08 21:30:23 -07:00
|
|
|
dump_backtrace(regs, NULL, KERN_DEFAULT);
|
2012-03-05 11:49:28 +00:00
|
|
|
}
|
|
|
|
|
2014-09-11 14:38:16 +01:00
|
|
|
static void tls_thread_flush(void)
|
|
|
|
{
|
2016-09-08 13:55:38 +01:00
|
|
|
write_sysreg(0, tpidr_el0);
|
2022-04-19 12:22:20 +01:00
|
|
|
if (system_supports_tpidr2())
|
|
|
|
write_sysreg_s(0, SYS_TPIDR2_EL0);
|
2014-09-11 14:38:16 +01:00
|
|
|
|
|
|
|
if (is_compat_task()) {
|
2018-03-28 10:50:49 +01:00
|
|
|
current->thread.uw.tp_value = 0;
|
2014-09-11 14:38:16 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We need to ensure ordering between the shadow state and the
|
|
|
|
* hardware state, so that we don't corrupt the hardware state
|
|
|
|
* with a stale shadow state during context switch.
|
|
|
|
*/
|
|
|
|
barrier();
|
2016-09-08 13:55:38 +01:00
|
|
|
write_sysreg(0, tpidrro_el0);
|
2014-09-11 14:38:16 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-07-23 19:58:39 +02:00
|
|
|
static void flush_tagged_addr_state(void)
|
|
|
|
{
|
|
|
|
if (IS_ENABLED(CONFIG_ARM64_TAGGED_ADDR_ABI))
|
|
|
|
clear_thread_flag(TIF_TAGGED_ADDR);
|
|
|
|
}
|
|
|
|
|
2024-08-22 16:10:49 +01:00
|
|
|
static void flush_poe(void)
|
|
|
|
{
|
|
|
|
if (!system_supports_poe())
|
|
|
|
return;
|
|
|
|
|
|
|
|
write_sysreg_s(POR_EL0_INIT, SYS_POR_EL0);
|
|
|
|
}
|
|
|
|
|
2024-10-01 23:59:00 +01:00
|
|
|
#ifdef CONFIG_ARM64_GCS
|
|
|
|
|
|
|
|
static void flush_gcs(void)
|
|
|
|
{
|
|
|
|
if (!system_supports_gcs())
|
|
|
|
return;
|
|
|
|
|
2025-06-11 17:28:13 +01:00
|
|
|
current->thread.gcspr_el0 = 0;
|
|
|
|
current->thread.gcs_base = 0;
|
|
|
|
current->thread.gcs_size = 0;
|
2024-10-01 23:59:00 +01:00
|
|
|
current->thread.gcs_el0_mode = 0;
|
|
|
|
write_sysreg_s(GCSCRE0_EL1_nTR, SYS_GCSCRE0_EL1);
|
|
|
|
write_sysreg_s(0, SYS_GCSPR_EL0);
|
|
|
|
}
|
|
|
|
|
2024-10-01 23:59:01 +01:00
|
|
|
static int copy_thread_gcs(struct task_struct *p,
|
|
|
|
const struct kernel_clone_args *args)
|
|
|
|
{
|
|
|
|
unsigned long gcs;
|
|
|
|
|
|
|
|
if (!system_supports_gcs())
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
p->thread.gcs_base = 0;
|
|
|
|
p->thread.gcs_size = 0;
|
|
|
|
|
2025-07-18 23:37:33 -05:00
|
|
|
p->thread.gcs_el0_mode = current->thread.gcs_el0_mode;
|
|
|
|
p->thread.gcs_el0_locked = current->thread.gcs_el0_locked;
|
|
|
|
|
2024-10-01 23:59:01 +01:00
|
|
|
gcs = gcs_alloc_thread_stack(p, args);
|
|
|
|
if (IS_ERR_VALUE(gcs))
|
|
|
|
return PTR_ERR((void *)gcs);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2024-10-01 23:59:00 +01:00
|
|
|
#else
|
|
|
|
|
|
|
|
static void flush_gcs(void) { }
|
2024-10-01 23:59:01 +01:00
|
|
|
static int copy_thread_gcs(struct task_struct *p,
|
|
|
|
const struct kernel_clone_args *args)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2024-10-01 23:59:00 +01:00
|
|
|
|
|
|
|
#endif
|
|
|
|
|
2012-03-05 11:49:28 +00:00
|
|
|
void flush_thread(void)
|
|
|
|
{
|
|
|
|
fpsimd_flush_thread();
|
2014-09-11 14:38:16 +01:00
|
|
|
tls_thread_flush();
|
2012-03-05 11:49:28 +00:00
|
|
|
flush_ptrace_hw_breakpoint(current);
|
2019-07-23 19:58:39 +02:00
|
|
|
flush_tagged_addr_state();
|
2024-08-22 16:10:49 +01:00
|
|
|
flush_poe();
|
2024-10-01 23:59:00 +01:00
|
|
|
flush_gcs();
|
2012-03-05 11:49:28 +00:00
|
|
|
}
|
|
|
|
|
arm64/sve: Core task context handling
This patch adds the core support for switching and managing the SVE
architectural state of user tasks.
Calls to the existing FPSIMD low-level save/restore functions are
factored out as new functions task_fpsimd_{save,load}(), since SVE
now dynamically may or may not need to be handled at these points
depending on the kernel configuration, hardware features discovered
at boot, and the runtime state of the task. To make these
decisions as fast as possible, const cpucaps are used where
feasible, via the system_supports_sve() helper.
The SVE registers are only tracked for threads that have explicitly
used SVE, indicated by the new thread flag TIF_SVE. Otherwise, the
FPSIMD view of the architectural state is stored in
thread.fpsimd_state as usual.
When in use, the SVE registers are not stored directly in
thread_struct due to their potentially large and variable size.
Because the task_struct slab allocator must be configured very
early during kernel boot, it is also tricky to configure it
correctly to match the maximum vector length provided by the
hardware, since this depends on examining secondary CPUs as well as
the primary. Instead, a pointer sve_state in thread_struct points
to a dynamically allocated buffer containing the SVE register data,
and code is added to allocate and free this buffer at appropriate
times.
TIF_SVE is set when taking an SVE access trap from userspace, if
suitable hardware support has been detected. This enables SVE for
the thread: a subsequent return to userspace will disable the trap
accordingly. If such a trap is taken without sufficient system-
wide hardware support, SIGILL is sent to the thread instead as if
an undefined instruction had been executed: this may happen if
userspace tries to use SVE in a system where not all CPUs support
it for example.
The kernel will clear TIF_SVE and disable SVE for the thread
whenever an explicit syscall is made by userspace. For backwards
compatibility reasons and conformance with the spirit of the base
AArch64 procedure call standard, the subset of the SVE register
state that aliases the FPSIMD registers is still preserved across a
syscall even if this happens. The remainder of the SVE register
state logically becomes zero at syscall entry, though the actual
zeroing work is currently deferred until the thread next tries to
use SVE, causing another trap to the kernel. This implementation
is suboptimal: in the future, the fastpath case may be optimised
to zero the registers in-place and leave SVE enabled for the task,
where beneficial.
TIF_SVE is also cleared in the following slowpath cases, which are
taken as reasonable hints that the task may no longer use SVE:
* exec
* fork and clone
Code is added to sync data between thread.fpsimd_state and
thread.sve_state whenever enabling/disabling SVE, in a manner
consistent with the SVE architectural programmer's model.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Alex Bennée <alex.bennee@linaro.org>
[will: added #include to fix allnoconfig build]
[will: use enable_daif in do_sve_acc]
Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-10-31 15:51:05 +00:00
|
|
|
void arch_release_task_struct(struct task_struct *tsk)
|
|
|
|
{
|
|
|
|
fpsimd_release_task(tsk);
|
|
|
|
}
|
|
|
|
|
2012-03-05 11:49:28 +00:00
|
|
|
int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
|
|
|
|
{
|
2025-05-08 14:26:31 +01:00
|
|
|
/*
|
|
|
|
* The current/src task's FPSIMD state may or may not be live, and may
|
|
|
|
* have been altered by ptrace after entry to the kernel. Save the
|
|
|
|
* effective FPSIMD state so that this will be copied into dst.
|
|
|
|
*/
|
|
|
|
fpsimd_save_and_flush_current_state();
|
|
|
|
fpsimd_sync_from_effective_state(src);
|
|
|
|
|
2012-03-05 11:49:28 +00:00
|
|
|
*dst = *src;
|
arm64/sve: Core task context handling
This patch adds the core support for switching and managing the SVE
architectural state of user tasks.
Calls to the existing FPSIMD low-level save/restore functions are
factored out as new functions task_fpsimd_{save,load}(), since SVE
now dynamically may or may not need to be handled at these points
depending on the kernel configuration, hardware features discovered
at boot, and the runtime state of the task. To make these
decisions as fast as possible, const cpucaps are used where
feasible, via the system_supports_sve() helper.
The SVE registers are only tracked for threads that have explicitly
used SVE, indicated by the new thread flag TIF_SVE. Otherwise, the
FPSIMD view of the architectural state is stored in
thread.fpsimd_state as usual.
When in use, the SVE registers are not stored directly in
thread_struct due to their potentially large and variable size.
Because the task_struct slab allocator must be configured very
early during kernel boot, it is also tricky to configure it
correctly to match the maximum vector length provided by the
hardware, since this depends on examining secondary CPUs as well as
the primary. Instead, a pointer sve_state in thread_struct points
to a dynamically allocated buffer containing the SVE register data,
and code is added to allocate and free this buffer at appropriate
times.
TIF_SVE is set when taking an SVE access trap from userspace, if
suitable hardware support has been detected. This enables SVE for
the thread: a subsequent return to userspace will disable the trap
accordingly. If such a trap is taken without sufficient system-
wide hardware support, SIGILL is sent to the thread instead as if
an undefined instruction had been executed: this may happen if
userspace tries to use SVE in a system where not all CPUs support
it for example.
The kernel will clear TIF_SVE and disable SVE for the thread
whenever an explicit syscall is made by userspace. For backwards
compatibility reasons and conformance with the spirit of the base
AArch64 procedure call standard, the subset of the SVE register
state that aliases the FPSIMD registers is still preserved across a
syscall even if this happens. The remainder of the SVE register
state logically becomes zero at syscall entry, though the actual
zeroing work is currently deferred until the thread next tries to
use SVE, causing another trap to the kernel. This implementation
is suboptimal: in the future, the fastpath case may be optimised
to zero the registers in-place and leave SVE enabled for the task,
where beneficial.
TIF_SVE is also cleared in the following slowpath cases, which are
taken as reasonable hints that the task may no longer use SVE:
* exec
* fork and clone
Code is added to sync data between thread.fpsimd_state and
thread.sve_state whenever enabling/disabling SVE, in a manner
consistent with the SVE architectural programmer's model.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Alex Bennée <alex.bennee@linaro.org>
[will: added #include to fix allnoconfig build]
[will: use enable_daif in do_sve_acc]
Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-10-31 15:51:05 +00:00
|
|
|
|
2019-09-30 16:56:00 -04:00
|
|
|
/*
|
arm64/fpsimd: Clear PSTATE.SM during clone()
Currently arch_dup_task_struct() doesn't handle cases where the parent
task has PSTATE.SM==1. Since syscall entry exits streaming mode, the
parent will usually have PSTATE.SM==0, but this can be change by ptrace
after syscall entry. When this happens, arch_dup_task_struct() will
initialise the new task into an invalid state. The new task inherits the
parent's configuration of PSTATE.SM, but fp_type is set to
FP_STATE_FPSIMD, TIF_SVE and SME may be cleared, and both sve_state and
sme_state may be set to NULL.
This can result in a variety of problems whenever the new task's state
is manipulated, including kernel NULL pointer dereferences and leaking
of streaming mode state between tasks.
When ptrace is not involved, the parent will have PSTATE.SM==0 as a
result of syscall entry, and the documentation in
Documentation/arch/arm64/sme.rst says:
| On process creation (eg, clone()) the newly created process will have
| PSTATE.SM cleared.
... so make this true by using task_smstop_sm() to exit streaming mode
in the child task, avoiding the problems above.
Fixes: 8bd7f91c03d8 ("arm64/sme: Implement traps and syscall handling for SME")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-13-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-05-08 14:26:32 +01:00
|
|
|
* Drop stale reference to src's sve_state and convert dst to
|
|
|
|
* non-streaming FPSIMD mode.
|
2019-09-30 16:56:00 -04:00
|
|
|
*/
|
arm64/fpsimd: Clear PSTATE.SM during clone()
Currently arch_dup_task_struct() doesn't handle cases where the parent
task has PSTATE.SM==1. Since syscall entry exits streaming mode, the
parent will usually have PSTATE.SM==0, but this can be change by ptrace
after syscall entry. When this happens, arch_dup_task_struct() will
initialise the new task into an invalid state. The new task inherits the
parent's configuration of PSTATE.SM, but fp_type is set to
FP_STATE_FPSIMD, TIF_SVE and SME may be cleared, and both sve_state and
sme_state may be set to NULL.
This can result in a variety of problems whenever the new task's state
is manipulated, including kernel NULL pointer dereferences and leaking
of streaming mode state between tasks.
When ptrace is not involved, the parent will have PSTATE.SM==0 as a
result of syscall entry, and the documentation in
Documentation/arch/arm64/sme.rst says:
| On process creation (eg, clone()) the newly created process will have
| PSTATE.SM cleared.
... so make this true by using task_smstop_sm() to exit streaming mode
in the child task, avoiding the problems above.
Fixes: 8bd7f91c03d8 ("arm64/sme: Implement traps and syscall handling for SME")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-13-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-05-08 14:26:32 +01:00
|
|
|
dst->thread.fp_type = FP_STATE_FPSIMD;
|
2019-09-30 16:56:00 -04:00
|
|
|
dst->thread.sve_state = NULL;
|
|
|
|
clear_tsk_thread_flag(dst, TIF_SVE);
|
arm64/fpsimd: Clear PSTATE.SM during clone()
Currently arch_dup_task_struct() doesn't handle cases where the parent
task has PSTATE.SM==1. Since syscall entry exits streaming mode, the
parent will usually have PSTATE.SM==0, but this can be change by ptrace
after syscall entry. When this happens, arch_dup_task_struct() will
initialise the new task into an invalid state. The new task inherits the
parent's configuration of PSTATE.SM, but fp_type is set to
FP_STATE_FPSIMD, TIF_SVE and SME may be cleared, and both sve_state and
sme_state may be set to NULL.
This can result in a variety of problems whenever the new task's state
is manipulated, including kernel NULL pointer dereferences and leaking
of streaming mode state between tasks.
When ptrace is not involved, the parent will have PSTATE.SM==0 as a
result of syscall entry, and the documentation in
Documentation/arch/arm64/sme.rst says:
| On process creation (eg, clone()) the newly created process will have
| PSTATE.SM cleared.
... so make this true by using task_smstop_sm() to exit streaming mode
in the child task, avoiding the problems above.
Fixes: 8bd7f91c03d8 ("arm64/sme: Implement traps and syscall handling for SME")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-13-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-05-08 14:26:32 +01:00
|
|
|
task_smstop_sm(dst);
|
2019-09-30 16:56:00 -04:00
|
|
|
|
2022-04-19 12:22:24 +01:00
|
|
|
/*
|
arm64/fpsimd: Make clone() compatible with ZA lazy saving
Linux is intended to be compatible with userspace written to Arm's
AAPCS64 procedure call standard [1,2]. For the Scalable Matrix Extension
(SME), AAPCS64 was extended with a "ZA lazy saving scheme", where SME's
ZA tile is lazily callee-saved and caller-restored. In this scheme,
TPIDR2_EL0 indicates whether the ZA tile is live or has been saved by
pointing to a "TPIDR2 block" in memory, which has a "za_save_buffer"
pointer. This scheme has been implemented in GCC and LLVM, with
necessary runtime support implemented in glibc and bionic.
AAPCS64 does not specify how the ZA lazy saving scheme is expected to
interact with thread creation mechanisms such as fork() and
pthread_create(), which would be implemented in terms of the Linux clone
syscall. The behaviour implemented by Linux and glibc/bionic doesn't
always compose safely, as explained below.
Currently the clone syscall is implemented such that PSTATE.ZA and the
ZA tile are always inherited by the new task, and TPIDR2_EL0 is
inherited unless the 'flags' argument includes CLONE_SETTLS,
in which case TPIDR2_EL0 is set to 0/NULL. This doesn't make much sense:
(a) TPIDR2_EL0 is part of the calling convention, and changes as control
is passed between functions. It is *NOT* used for thread local
storage, despite superficial similarity to TPIDR_EL0, which is is
used as the TLS register.
(b) TPIDR2_EL0 and PSTATE.ZA are tightly coupled in the procedure call
standard, and some combinations of states are illegal. In general,
manipulating the two independently is not guaranteed to be safe.
In practice, code which is compliant with the procedure call standard
may issue a clone syscall while in the "ZA dormant" state, where
PSTATE.ZA==1 and TPIDR2_EL0 is non-null and indicates that ZA needs to
be saved. This can cause a variety of problems, including:
* If the implementation of pthread_create() passes CLONE_SETTLS, the
new thread will start with PSTATE.ZA==1 and TPIDR2==NULL. Per the
procedure call standard this is not a legitimate state for most
functions. This can cause data corruption (e.g. as code may rely on
PSTATE.ZA being 0 to guarantee that an SMSTART ZA instruction will
zero the ZA tile contents), and may result in other undefined
behaviour.
* If the implementation of pthread_create() does not pass CLONE_SETTLS, the
new thread will start with PSTATE.ZA==1 and TPIDR2 pointing to a
TPIDR2 block on the parent thread's stack. This can result in a
variety of problems, e.g.
- The child may write back to the parent's za_save_buffer, corrupting
its contents.
- The child may read from the TPIDR2 block after the parent has reused
this memory for something else, and consequently the child may abort
or clobber arbitrary memory.
Ideally we'd require that userspace ensures that a task is in the "ZA
off" state (with PSTATE.ZA==0 and TPIDR2_EL0==NULL) prior to issuing a
clone syscall, and have the kernel force this state for new threads.
Unfortunately, contemporary C libraries do not do this, and simply
forcing this state within the implementation of clone would break
fork().
Instead, we can bodge around this by considering the CLONE_VM flag, and
manipulate PSTATE.ZA and TPIDR2_EL0 as a pair. CLONE_VM indicates that
the new task will run in the same address space as its parent, and in
that case it doesn't make sense to inherit a stale pointer to the
parent's TPIDR2 block:
* For fork(), CLONE_VM will not be set, and it is safe to inherit both
PSTATE.ZA and TPIDR2_EL0 as the new task will have its own copy of the
address space, and cannot clobber its parent's stack.
* For pthread_create() and vfork(), CLONE_VM will be set, and discarding
PSTATE.ZA and TPIDR2_EL0 for the new task doesn't break any existing
assumptions in userspace.
Implement this behaviour for clone(). We currently inherit PSTATE.ZA in
arch_dup_task_struct(), but this does not have access to the clone
flags, so move this logic under copy_thread(). Documentation is updated
to describe the new behaviour.
[1] https://github.com/ARM-software/abi-aa/releases/download/2025Q1/aapcs64.pdf
[2] https://github.com/ARM-software/abi-aa/blob/c51addc3dc03e73a016a1e4edf25440bcac76431/aapcs64/aapcs64.rst
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Kiss <daniel.kiss@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Richard Sandiford <richard.sandiford@arm.com>
Cc: Sander De Smalen <sander.desmalen@arm.com>
Cc: Tamas Petz <tamas.petz@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yury Khrustalev <yury.khrustalev@arm.com>
Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
Link: https://lore.kernel.org/r/20250508132644.1395904-14-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-05-08 14:26:33 +01:00
|
|
|
* Drop stale reference to src's sme_state and ensure dst has ZA
|
|
|
|
* disabled.
|
|
|
|
*
|
|
|
|
* When necessary, ZA will be inherited later in copy_thread_za().
|
2022-04-19 12:22:24 +01:00
|
|
|
*/
|
arm64/fpsimd: Make clone() compatible with ZA lazy saving
Linux is intended to be compatible with userspace written to Arm's
AAPCS64 procedure call standard [1,2]. For the Scalable Matrix Extension
(SME), AAPCS64 was extended with a "ZA lazy saving scheme", where SME's
ZA tile is lazily callee-saved and caller-restored. In this scheme,
TPIDR2_EL0 indicates whether the ZA tile is live or has been saved by
pointing to a "TPIDR2 block" in memory, which has a "za_save_buffer"
pointer. This scheme has been implemented in GCC and LLVM, with
necessary runtime support implemented in glibc and bionic.
AAPCS64 does not specify how the ZA lazy saving scheme is expected to
interact with thread creation mechanisms such as fork() and
pthread_create(), which would be implemented in terms of the Linux clone
syscall. The behaviour implemented by Linux and glibc/bionic doesn't
always compose safely, as explained below.
Currently the clone syscall is implemented such that PSTATE.ZA and the
ZA tile are always inherited by the new task, and TPIDR2_EL0 is
inherited unless the 'flags' argument includes CLONE_SETTLS,
in which case TPIDR2_EL0 is set to 0/NULL. This doesn't make much sense:
(a) TPIDR2_EL0 is part of the calling convention, and changes as control
is passed between functions. It is *NOT* used for thread local
storage, despite superficial similarity to TPIDR_EL0, which is is
used as the TLS register.
(b) TPIDR2_EL0 and PSTATE.ZA are tightly coupled in the procedure call
standard, and some combinations of states are illegal. In general,
manipulating the two independently is not guaranteed to be safe.
In practice, code which is compliant with the procedure call standard
may issue a clone syscall while in the "ZA dormant" state, where
PSTATE.ZA==1 and TPIDR2_EL0 is non-null and indicates that ZA needs to
be saved. This can cause a variety of problems, including:
* If the implementation of pthread_create() passes CLONE_SETTLS, the
new thread will start with PSTATE.ZA==1 and TPIDR2==NULL. Per the
procedure call standard this is not a legitimate state for most
functions. This can cause data corruption (e.g. as code may rely on
PSTATE.ZA being 0 to guarantee that an SMSTART ZA instruction will
zero the ZA tile contents), and may result in other undefined
behaviour.
* If the implementation of pthread_create() does not pass CLONE_SETTLS, the
new thread will start with PSTATE.ZA==1 and TPIDR2 pointing to a
TPIDR2 block on the parent thread's stack. This can result in a
variety of problems, e.g.
- The child may write back to the parent's za_save_buffer, corrupting
its contents.
- The child may read from the TPIDR2 block after the parent has reused
this memory for something else, and consequently the child may abort
or clobber arbitrary memory.
Ideally we'd require that userspace ensures that a task is in the "ZA
off" state (with PSTATE.ZA==0 and TPIDR2_EL0==NULL) prior to issuing a
clone syscall, and have the kernel force this state for new threads.
Unfortunately, contemporary C libraries do not do this, and simply
forcing this state within the implementation of clone would break
fork().
Instead, we can bodge around this by considering the CLONE_VM flag, and
manipulate PSTATE.ZA and TPIDR2_EL0 as a pair. CLONE_VM indicates that
the new task will run in the same address space as its parent, and in
that case it doesn't make sense to inherit a stale pointer to the
parent's TPIDR2 block:
* For fork(), CLONE_VM will not be set, and it is safe to inherit both
PSTATE.ZA and TPIDR2_EL0 as the new task will have its own copy of the
address space, and cannot clobber its parent's stack.
* For pthread_create() and vfork(), CLONE_VM will be set, and discarding
PSTATE.ZA and TPIDR2_EL0 for the new task doesn't break any existing
assumptions in userspace.
Implement this behaviour for clone(). We currently inherit PSTATE.ZA in
arch_dup_task_struct(), but this does not have access to the clone
flags, so move this logic under copy_thread(). Documentation is updated
to describe the new behaviour.
[1] https://github.com/ARM-software/abi-aa/releases/download/2025Q1/aapcs64.pdf
[2] https://github.com/ARM-software/abi-aa/blob/c51addc3dc03e73a016a1e4edf25440bcac76431/aapcs64/aapcs64.rst
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Kiss <daniel.kiss@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Richard Sandiford <richard.sandiford@arm.com>
Cc: Sander De Smalen <sander.desmalen@arm.com>
Cc: Tamas Petz <tamas.petz@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yury Khrustalev <yury.khrustalev@arm.com>
Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
Link: https://lore.kernel.org/r/20250508132644.1395904-14-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-05-08 14:26:33 +01:00
|
|
|
dst->thread.sme_state = NULL;
|
|
|
|
clear_tsk_thread_flag(dst, TIF_SME);
|
|
|
|
dst->thread.svcr &= ~SVCR_ZA_MASK;
|
2022-11-15 09:46:34 +00:00
|
|
|
|
2019-09-16 11:51:17 +01:00
|
|
|
/* clear any pending asynchronous tag fault raised by the parent */
|
|
|
|
clear_tsk_thread_flag(dst, TIF_MTE_ASYNC_FAULT);
|
|
|
|
|
2012-03-05 11:49:28 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
arm64/fpsimd: Make clone() compatible with ZA lazy saving
Linux is intended to be compatible with userspace written to Arm's
AAPCS64 procedure call standard [1,2]. For the Scalable Matrix Extension
(SME), AAPCS64 was extended with a "ZA lazy saving scheme", where SME's
ZA tile is lazily callee-saved and caller-restored. In this scheme,
TPIDR2_EL0 indicates whether the ZA tile is live or has been saved by
pointing to a "TPIDR2 block" in memory, which has a "za_save_buffer"
pointer. This scheme has been implemented in GCC and LLVM, with
necessary runtime support implemented in glibc and bionic.
AAPCS64 does not specify how the ZA lazy saving scheme is expected to
interact with thread creation mechanisms such as fork() and
pthread_create(), which would be implemented in terms of the Linux clone
syscall. The behaviour implemented by Linux and glibc/bionic doesn't
always compose safely, as explained below.
Currently the clone syscall is implemented such that PSTATE.ZA and the
ZA tile are always inherited by the new task, and TPIDR2_EL0 is
inherited unless the 'flags' argument includes CLONE_SETTLS,
in which case TPIDR2_EL0 is set to 0/NULL. This doesn't make much sense:
(a) TPIDR2_EL0 is part of the calling convention, and changes as control
is passed between functions. It is *NOT* used for thread local
storage, despite superficial similarity to TPIDR_EL0, which is is
used as the TLS register.
(b) TPIDR2_EL0 and PSTATE.ZA are tightly coupled in the procedure call
standard, and some combinations of states are illegal. In general,
manipulating the two independently is not guaranteed to be safe.
In practice, code which is compliant with the procedure call standard
may issue a clone syscall while in the "ZA dormant" state, where
PSTATE.ZA==1 and TPIDR2_EL0 is non-null and indicates that ZA needs to
be saved. This can cause a variety of problems, including:
* If the implementation of pthread_create() passes CLONE_SETTLS, the
new thread will start with PSTATE.ZA==1 and TPIDR2==NULL. Per the
procedure call standard this is not a legitimate state for most
functions. This can cause data corruption (e.g. as code may rely on
PSTATE.ZA being 0 to guarantee that an SMSTART ZA instruction will
zero the ZA tile contents), and may result in other undefined
behaviour.
* If the implementation of pthread_create() does not pass CLONE_SETTLS, the
new thread will start with PSTATE.ZA==1 and TPIDR2 pointing to a
TPIDR2 block on the parent thread's stack. This can result in a
variety of problems, e.g.
- The child may write back to the parent's za_save_buffer, corrupting
its contents.
- The child may read from the TPIDR2 block after the parent has reused
this memory for something else, and consequently the child may abort
or clobber arbitrary memory.
Ideally we'd require that userspace ensures that a task is in the "ZA
off" state (with PSTATE.ZA==0 and TPIDR2_EL0==NULL) prior to issuing a
clone syscall, and have the kernel force this state for new threads.
Unfortunately, contemporary C libraries do not do this, and simply
forcing this state within the implementation of clone would break
fork().
Instead, we can bodge around this by considering the CLONE_VM flag, and
manipulate PSTATE.ZA and TPIDR2_EL0 as a pair. CLONE_VM indicates that
the new task will run in the same address space as its parent, and in
that case it doesn't make sense to inherit a stale pointer to the
parent's TPIDR2 block:
* For fork(), CLONE_VM will not be set, and it is safe to inherit both
PSTATE.ZA and TPIDR2_EL0 as the new task will have its own copy of the
address space, and cannot clobber its parent's stack.
* For pthread_create() and vfork(), CLONE_VM will be set, and discarding
PSTATE.ZA and TPIDR2_EL0 for the new task doesn't break any existing
assumptions in userspace.
Implement this behaviour for clone(). We currently inherit PSTATE.ZA in
arch_dup_task_struct(), but this does not have access to the clone
flags, so move this logic under copy_thread(). Documentation is updated
to describe the new behaviour.
[1] https://github.com/ARM-software/abi-aa/releases/download/2025Q1/aapcs64.pdf
[2] https://github.com/ARM-software/abi-aa/blob/c51addc3dc03e73a016a1e4edf25440bcac76431/aapcs64/aapcs64.rst
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Kiss <daniel.kiss@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Richard Sandiford <richard.sandiford@arm.com>
Cc: Sander De Smalen <sander.desmalen@arm.com>
Cc: Tamas Petz <tamas.petz@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yury Khrustalev <yury.khrustalev@arm.com>
Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
Link: https://lore.kernel.org/r/20250508132644.1395904-14-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-05-08 14:26:33 +01:00
|
|
|
static int copy_thread_za(struct task_struct *dst, struct task_struct *src)
|
|
|
|
{
|
|
|
|
if (!thread_za_enabled(&src->thread))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
dst->thread.sve_state = kzalloc(sve_state_size(src),
|
|
|
|
GFP_KERNEL);
|
|
|
|
if (!dst->thread.sve_state)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
dst->thread.sme_state = kmemdup(src->thread.sme_state,
|
|
|
|
sme_state_size(src),
|
|
|
|
GFP_KERNEL);
|
|
|
|
if (!dst->thread.sme_state) {
|
|
|
|
kfree(dst->thread.sve_state);
|
|
|
|
dst->thread.sve_state = NULL;
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
|
|
|
set_tsk_thread_flag(dst, TIF_SME);
|
|
|
|
dst->thread.svcr |= SVCR_ZA_MASK;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-03-05 11:49:28 +00:00
|
|
|
asmlinkage void ret_from_fork(void) asm("ret_from_fork");
|
|
|
|
|
2022-04-08 18:07:50 -05:00
|
|
|
int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
|
2012-03-05 11:49:28 +00:00
|
|
|
{
|
2022-04-08 18:07:50 -05:00
|
|
|
unsigned long clone_flags = args->flags;
|
|
|
|
unsigned long stack_start = args->stack;
|
|
|
|
unsigned long tls = args->tls;
|
2012-03-05 11:49:28 +00:00
|
|
|
struct pt_regs *childregs = task_pt_regs(p);
|
2024-10-01 23:59:01 +01:00
|
|
|
int ret;
|
2012-03-05 11:49:28 +00:00
|
|
|
|
2012-10-05 12:31:20 +01:00
|
|
|
memset(&p->thread.cpu_context, 0, sizeof(struct cpu_context));
|
2012-03-05 11:49:28 +00:00
|
|
|
|
arm64: fpsimd: Prevent registers leaking from dead tasks
Currently, loading of a task's fpsimd state into the CPU registers
is skipped if that task's state is already present in the registers
of that CPU.
However, the code relies on the struct fpsimd_state * (and by
extension struct task_struct *) to unambiguously identify a task.
There is a particular case in which this doesn't work reliably:
when a task exits, its task_struct may be recycled to describe a
new task.
Consider the following scenario:
1) Task P loads its fpsimd state onto cpu C.
per_cpu(fpsimd_last_state, C) := P;
P->thread.fpsimd_state.cpu := C;
2) Task X is scheduled onto C and loads its fpsimd state on C.
per_cpu(fpsimd_last_state, C) := X;
X->thread.fpsimd_state.cpu := C;
3) X exits, causing X's task_struct to be freed.
4) P forks a new child T, which obtains X's recycled task_struct.
T == X.
T->thread.fpsimd_state.cpu == C (inherited from P).
5) T is scheduled on C.
T's fpsimd state is not loaded, because
per_cpu(fpsimd_last_state, C) == T (== X) &&
T->thread.fpsimd_state.cpu == C.
(This is the check performed by fpsimd_thread_switch().)
So, T gets X's registers because the last registers loaded onto C
were those of X, in (2).
This patch fixes the problem by ensuring that the sched-in check
fails in (5): fpsimd_flush_task_state(T) is called when T is
forked, so that T->thread.fpsimd_state.cpu == C cannot be true.
This relies on the fact that T is not schedulable until after
copy_thread() completes.
Once T's fpsimd state has been loaded on some CPU C there may still
be other cpus D for which per_cpu(fpsimd_last_state, D) ==
&X->thread.fpsimd_state. But D is necessarily != C in this case,
and the check in (5) must fail.
An alternative fix would be to do refcounting on task_struct. This
would result in each CPU holding a reference to the last task whose
fpsimd state was loaded there. It's not clear whether this is
preferable, and it involves higher overhead than the fix proposed
in this patch. It would also move all the task_struct freeing
work into the context switch critical section, or otherwise some
deferred cleanup mechanism would need to be introduced, neither of
which seems obviously justified.
Cc: <stable@vger.kernel.org>
Fixes: 005f78cd8849 ("arm64: defer reloading a task's FPSIMD state to userland resume")
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
[will: word-smithed the comment so it makes more sense]
Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-12-05 14:56:42 +00:00
|
|
|
/*
|
|
|
|
* In case p was allocated the same task_struct pointer as some
|
|
|
|
* other recently-exited task, make sure p is disassociated from
|
|
|
|
* any cpu that may have run that now-exited task recently.
|
|
|
|
* Otherwise we could erroneously skip reloading the FPSIMD
|
|
|
|
* registers for p.
|
|
|
|
*/
|
|
|
|
fpsimd_flush_task_state(p);
|
|
|
|
|
2020-03-13 14:34:56 +05:30
|
|
|
ptrauth_thread_init_kernel(p);
|
|
|
|
|
2022-04-12 10:18:48 -05:00
|
|
|
if (likely(!args->fn)) {
|
2012-10-21 15:56:52 -04:00
|
|
|
*childregs = *current_pt_regs();
|
2012-10-05 12:31:20 +01:00
|
|
|
childregs->regs[0] = 0;
|
2015-05-27 15:39:40 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Read the current TLS pointer from tpidr_el0 as it may be
|
|
|
|
* out-of-sync with the saved value.
|
|
|
|
*/
|
2016-09-08 13:55:38 +01:00
|
|
|
*task_user_tls(p) = read_sysreg(tpidr_el0);
|
2015-05-27 15:39:40 +01:00
|
|
|
|
2024-08-22 16:10:49 +01:00
|
|
|
if (system_supports_poe())
|
|
|
|
p->thread.por_el0 = read_sysreg_s(SYS_POR_EL0);
|
|
|
|
|
2015-05-27 15:39:40 +01:00
|
|
|
if (stack_start) {
|
|
|
|
if (is_compat_thread(task_thread_info(p)))
|
2012-10-18 00:55:54 -04:00
|
|
|
childregs->compat_sp = stack_start;
|
2015-05-27 15:39:40 +01:00
|
|
|
else
|
2012-10-18 00:55:54 -04:00
|
|
|
childregs->sp = stack_start;
|
2012-10-05 12:31:20 +01:00
|
|
|
}
|
2015-05-27 15:39:40 +01:00
|
|
|
|
arm64/fpsimd: Make clone() compatible with ZA lazy saving
Linux is intended to be compatible with userspace written to Arm's
AAPCS64 procedure call standard [1,2]. For the Scalable Matrix Extension
(SME), AAPCS64 was extended with a "ZA lazy saving scheme", where SME's
ZA tile is lazily callee-saved and caller-restored. In this scheme,
TPIDR2_EL0 indicates whether the ZA tile is live or has been saved by
pointing to a "TPIDR2 block" in memory, which has a "za_save_buffer"
pointer. This scheme has been implemented in GCC and LLVM, with
necessary runtime support implemented in glibc and bionic.
AAPCS64 does not specify how the ZA lazy saving scheme is expected to
interact with thread creation mechanisms such as fork() and
pthread_create(), which would be implemented in terms of the Linux clone
syscall. The behaviour implemented by Linux and glibc/bionic doesn't
always compose safely, as explained below.
Currently the clone syscall is implemented such that PSTATE.ZA and the
ZA tile are always inherited by the new task, and TPIDR2_EL0 is
inherited unless the 'flags' argument includes CLONE_SETTLS,
in which case TPIDR2_EL0 is set to 0/NULL. This doesn't make much sense:
(a) TPIDR2_EL0 is part of the calling convention, and changes as control
is passed between functions. It is *NOT* used for thread local
storage, despite superficial similarity to TPIDR_EL0, which is is
used as the TLS register.
(b) TPIDR2_EL0 and PSTATE.ZA are tightly coupled in the procedure call
standard, and some combinations of states are illegal. In general,
manipulating the two independently is not guaranteed to be safe.
In practice, code which is compliant with the procedure call standard
may issue a clone syscall while in the "ZA dormant" state, where
PSTATE.ZA==1 and TPIDR2_EL0 is non-null and indicates that ZA needs to
be saved. This can cause a variety of problems, including:
* If the implementation of pthread_create() passes CLONE_SETTLS, the
new thread will start with PSTATE.ZA==1 and TPIDR2==NULL. Per the
procedure call standard this is not a legitimate state for most
functions. This can cause data corruption (e.g. as code may rely on
PSTATE.ZA being 0 to guarantee that an SMSTART ZA instruction will
zero the ZA tile contents), and may result in other undefined
behaviour.
* If the implementation of pthread_create() does not pass CLONE_SETTLS, the
new thread will start with PSTATE.ZA==1 and TPIDR2 pointing to a
TPIDR2 block on the parent thread's stack. This can result in a
variety of problems, e.g.
- The child may write back to the parent's za_save_buffer, corrupting
its contents.
- The child may read from the TPIDR2 block after the parent has reused
this memory for something else, and consequently the child may abort
or clobber arbitrary memory.
Ideally we'd require that userspace ensures that a task is in the "ZA
off" state (with PSTATE.ZA==0 and TPIDR2_EL0==NULL) prior to issuing a
clone syscall, and have the kernel force this state for new threads.
Unfortunately, contemporary C libraries do not do this, and simply
forcing this state within the implementation of clone would break
fork().
Instead, we can bodge around this by considering the CLONE_VM flag, and
manipulate PSTATE.ZA and TPIDR2_EL0 as a pair. CLONE_VM indicates that
the new task will run in the same address space as its parent, and in
that case it doesn't make sense to inherit a stale pointer to the
parent's TPIDR2 block:
* For fork(), CLONE_VM will not be set, and it is safe to inherit both
PSTATE.ZA and TPIDR2_EL0 as the new task will have its own copy of the
address space, and cannot clobber its parent's stack.
* For pthread_create() and vfork(), CLONE_VM will be set, and discarding
PSTATE.ZA and TPIDR2_EL0 for the new task doesn't break any existing
assumptions in userspace.
Implement this behaviour for clone(). We currently inherit PSTATE.ZA in
arch_dup_task_struct(), but this does not have access to the clone
flags, so move this logic under copy_thread(). Documentation is updated
to describe the new behaviour.
[1] https://github.com/ARM-software/abi-aa/releases/download/2025Q1/aapcs64.pdf
[2] https://github.com/ARM-software/abi-aa/blob/c51addc3dc03e73a016a1e4edf25440bcac76431/aapcs64/aapcs64.rst
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Kiss <daniel.kiss@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Richard Sandiford <richard.sandiford@arm.com>
Cc: Sander De Smalen <sander.desmalen@arm.com>
Cc: Tamas Petz <tamas.petz@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yury Khrustalev <yury.khrustalev@arm.com>
Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
Link: https://lore.kernel.org/r/20250508132644.1395904-14-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-05-08 14:26:33 +01:00
|
|
|
/*
|
|
|
|
* Due to the AAPCS64 "ZA lazy saving scheme", PSTATE.ZA and
|
|
|
|
* TPIDR2 need to be manipulated as a pair, and either both
|
|
|
|
* need to be inherited or both need to be reset.
|
|
|
|
*
|
|
|
|
* Within a process, child threads must not inherit their
|
|
|
|
* parent's TPIDR2 value or they may clobber their parent's
|
|
|
|
* stack at some later point.
|
|
|
|
*
|
|
|
|
* When a process is fork()'d, the child must inherit ZA and
|
|
|
|
* TPIDR2 from its parent in case there was dormant ZA state.
|
|
|
|
*
|
|
|
|
* Use CLONE_VM to determine when the child will share the
|
|
|
|
* address space with the parent, and cannot safely inherit the
|
|
|
|
* state.
|
|
|
|
*/
|
|
|
|
if (system_supports_sme()) {
|
|
|
|
if (!(clone_flags & CLONE_VM)) {
|
|
|
|
p->thread.tpidr2_el0 = read_sysreg_s(SYS_TPIDR2_EL0);
|
|
|
|
ret = copy_thread_za(p, current);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
} else {
|
|
|
|
p->thread.tpidr2_el0 = 0;
|
|
|
|
WARN_ON_ONCE(p->thread.svcr & SVCR_ZA_MASK);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-03-05 11:49:28 +00:00
|
|
|
/*
|
2020-01-02 18:24:08 +01:00
|
|
|
* If a TLS pointer was passed to clone, use it for the new
|
arm64/fpsimd: Make clone() compatible with ZA lazy saving
Linux is intended to be compatible with userspace written to Arm's
AAPCS64 procedure call standard [1,2]. For the Scalable Matrix Extension
(SME), AAPCS64 was extended with a "ZA lazy saving scheme", where SME's
ZA tile is lazily callee-saved and caller-restored. In this scheme,
TPIDR2_EL0 indicates whether the ZA tile is live or has been saved by
pointing to a "TPIDR2 block" in memory, which has a "za_save_buffer"
pointer. This scheme has been implemented in GCC and LLVM, with
necessary runtime support implemented in glibc and bionic.
AAPCS64 does not specify how the ZA lazy saving scheme is expected to
interact with thread creation mechanisms such as fork() and
pthread_create(), which would be implemented in terms of the Linux clone
syscall. The behaviour implemented by Linux and glibc/bionic doesn't
always compose safely, as explained below.
Currently the clone syscall is implemented such that PSTATE.ZA and the
ZA tile are always inherited by the new task, and TPIDR2_EL0 is
inherited unless the 'flags' argument includes CLONE_SETTLS,
in which case TPIDR2_EL0 is set to 0/NULL. This doesn't make much sense:
(a) TPIDR2_EL0 is part of the calling convention, and changes as control
is passed between functions. It is *NOT* used for thread local
storage, despite superficial similarity to TPIDR_EL0, which is is
used as the TLS register.
(b) TPIDR2_EL0 and PSTATE.ZA are tightly coupled in the procedure call
standard, and some combinations of states are illegal. In general,
manipulating the two independently is not guaranteed to be safe.
In practice, code which is compliant with the procedure call standard
may issue a clone syscall while in the "ZA dormant" state, where
PSTATE.ZA==1 and TPIDR2_EL0 is non-null and indicates that ZA needs to
be saved. This can cause a variety of problems, including:
* If the implementation of pthread_create() passes CLONE_SETTLS, the
new thread will start with PSTATE.ZA==1 and TPIDR2==NULL. Per the
procedure call standard this is not a legitimate state for most
functions. This can cause data corruption (e.g. as code may rely on
PSTATE.ZA being 0 to guarantee that an SMSTART ZA instruction will
zero the ZA tile contents), and may result in other undefined
behaviour.
* If the implementation of pthread_create() does not pass CLONE_SETTLS, the
new thread will start with PSTATE.ZA==1 and TPIDR2 pointing to a
TPIDR2 block on the parent thread's stack. This can result in a
variety of problems, e.g.
- The child may write back to the parent's za_save_buffer, corrupting
its contents.
- The child may read from the TPIDR2 block after the parent has reused
this memory for something else, and consequently the child may abort
or clobber arbitrary memory.
Ideally we'd require that userspace ensures that a task is in the "ZA
off" state (with PSTATE.ZA==0 and TPIDR2_EL0==NULL) prior to issuing a
clone syscall, and have the kernel force this state for new threads.
Unfortunately, contemporary C libraries do not do this, and simply
forcing this state within the implementation of clone would break
fork().
Instead, we can bodge around this by considering the CLONE_VM flag, and
manipulate PSTATE.ZA and TPIDR2_EL0 as a pair. CLONE_VM indicates that
the new task will run in the same address space as its parent, and in
that case it doesn't make sense to inherit a stale pointer to the
parent's TPIDR2 block:
* For fork(), CLONE_VM will not be set, and it is safe to inherit both
PSTATE.ZA and TPIDR2_EL0 as the new task will have its own copy of the
address space, and cannot clobber its parent's stack.
* For pthread_create() and vfork(), CLONE_VM will be set, and discarding
PSTATE.ZA and TPIDR2_EL0 for the new task doesn't break any existing
assumptions in userspace.
Implement this behaviour for clone(). We currently inherit PSTATE.ZA in
arch_dup_task_struct(), but this does not have access to the clone
flags, so move this logic under copy_thread(). Documentation is updated
to describe the new behaviour.
[1] https://github.com/ARM-software/abi-aa/releases/download/2025Q1/aapcs64.pdf
[2] https://github.com/ARM-software/abi-aa/blob/c51addc3dc03e73a016a1e4edf25440bcac76431/aapcs64/aapcs64.rst
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Kiss <daniel.kiss@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Richard Sandiford <richard.sandiford@arm.com>
Cc: Sander De Smalen <sander.desmalen@arm.com>
Cc: Tamas Petz <tamas.petz@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yury Khrustalev <yury.khrustalev@arm.com>
Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
Link: https://lore.kernel.org/r/20250508132644.1395904-14-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-05-08 14:26:33 +01:00
|
|
|
* thread.
|
2012-03-05 11:49:28 +00:00
|
|
|
*/
|
arm64/fpsimd: Make clone() compatible with ZA lazy saving
Linux is intended to be compatible with userspace written to Arm's
AAPCS64 procedure call standard [1,2]. For the Scalable Matrix Extension
(SME), AAPCS64 was extended with a "ZA lazy saving scheme", where SME's
ZA tile is lazily callee-saved and caller-restored. In this scheme,
TPIDR2_EL0 indicates whether the ZA tile is live or has been saved by
pointing to a "TPIDR2 block" in memory, which has a "za_save_buffer"
pointer. This scheme has been implemented in GCC and LLVM, with
necessary runtime support implemented in glibc and bionic.
AAPCS64 does not specify how the ZA lazy saving scheme is expected to
interact with thread creation mechanisms such as fork() and
pthread_create(), which would be implemented in terms of the Linux clone
syscall. The behaviour implemented by Linux and glibc/bionic doesn't
always compose safely, as explained below.
Currently the clone syscall is implemented such that PSTATE.ZA and the
ZA tile are always inherited by the new task, and TPIDR2_EL0 is
inherited unless the 'flags' argument includes CLONE_SETTLS,
in which case TPIDR2_EL0 is set to 0/NULL. This doesn't make much sense:
(a) TPIDR2_EL0 is part of the calling convention, and changes as control
is passed between functions. It is *NOT* used for thread local
storage, despite superficial similarity to TPIDR_EL0, which is is
used as the TLS register.
(b) TPIDR2_EL0 and PSTATE.ZA are tightly coupled in the procedure call
standard, and some combinations of states are illegal. In general,
manipulating the two independently is not guaranteed to be safe.
In practice, code which is compliant with the procedure call standard
may issue a clone syscall while in the "ZA dormant" state, where
PSTATE.ZA==1 and TPIDR2_EL0 is non-null and indicates that ZA needs to
be saved. This can cause a variety of problems, including:
* If the implementation of pthread_create() passes CLONE_SETTLS, the
new thread will start with PSTATE.ZA==1 and TPIDR2==NULL. Per the
procedure call standard this is not a legitimate state for most
functions. This can cause data corruption (e.g. as code may rely on
PSTATE.ZA being 0 to guarantee that an SMSTART ZA instruction will
zero the ZA tile contents), and may result in other undefined
behaviour.
* If the implementation of pthread_create() does not pass CLONE_SETTLS, the
new thread will start with PSTATE.ZA==1 and TPIDR2 pointing to a
TPIDR2 block on the parent thread's stack. This can result in a
variety of problems, e.g.
- The child may write back to the parent's za_save_buffer, corrupting
its contents.
- The child may read from the TPIDR2 block after the parent has reused
this memory for something else, and consequently the child may abort
or clobber arbitrary memory.
Ideally we'd require that userspace ensures that a task is in the "ZA
off" state (with PSTATE.ZA==0 and TPIDR2_EL0==NULL) prior to issuing a
clone syscall, and have the kernel force this state for new threads.
Unfortunately, contemporary C libraries do not do this, and simply
forcing this state within the implementation of clone would break
fork().
Instead, we can bodge around this by considering the CLONE_VM flag, and
manipulate PSTATE.ZA and TPIDR2_EL0 as a pair. CLONE_VM indicates that
the new task will run in the same address space as its parent, and in
that case it doesn't make sense to inherit a stale pointer to the
parent's TPIDR2 block:
* For fork(), CLONE_VM will not be set, and it is safe to inherit both
PSTATE.ZA and TPIDR2_EL0 as the new task will have its own copy of the
address space, and cannot clobber its parent's stack.
* For pthread_create() and vfork(), CLONE_VM will be set, and discarding
PSTATE.ZA and TPIDR2_EL0 for the new task doesn't break any existing
assumptions in userspace.
Implement this behaviour for clone(). We currently inherit PSTATE.ZA in
arch_dup_task_struct(), but this does not have access to the clone
flags, so move this logic under copy_thread(). Documentation is updated
to describe the new behaviour.
[1] https://github.com/ARM-software/abi-aa/releases/download/2025Q1/aapcs64.pdf
[2] https://github.com/ARM-software/abi-aa/blob/c51addc3dc03e73a016a1e4edf25440bcac76431/aapcs64/aapcs64.rst
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Kiss <daniel.kiss@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Richard Sandiford <richard.sandiford@arm.com>
Cc: Sander De Smalen <sander.desmalen@arm.com>
Cc: Tamas Petz <tamas.petz@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yury Khrustalev <yury.khrustalev@arm.com>
Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
Link: https://lore.kernel.org/r/20250508132644.1395904-14-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-05-08 14:26:33 +01:00
|
|
|
if (clone_flags & CLONE_SETTLS)
|
2020-01-02 18:24:08 +01:00
|
|
|
p->thread.uw.tp_value = tls;
|
2024-10-01 23:59:01 +01:00
|
|
|
|
|
|
|
ret = copy_thread_gcs(p, args);
|
|
|
|
if (ret != 0)
|
|
|
|
return ret;
|
2012-10-05 12:31:20 +01:00
|
|
|
} else {
|
2020-11-13 12:49:21 +00:00
|
|
|
/*
|
|
|
|
* A kthread has no context to ERET to, so ensure any buggy
|
|
|
|
* ERET is treated as an illegal exception return.
|
|
|
|
*
|
|
|
|
* When a user task is created from a kthread, childregs will
|
|
|
|
* be initialized by start_thread() or start_compat_thread().
|
|
|
|
*/
|
2012-10-05 12:31:20 +01:00
|
|
|
memset(childregs, 0, sizeof(struct pt_regs));
|
2020-11-13 12:49:21 +00:00
|
|
|
childregs->pstate = PSR_MODE_EL1h | PSR_IL_BIT;
|
arm64: stacktrace: unwind exception boundaries
When arm64's stack unwinder encounters an exception boundary, it uses
the pt_regs::stackframe created by the entry code, which has a copy of
the PC and FP at the time the exception was taken. The unwinder doesn't
know anything about pt_regs, and reports the PC from the stackframe, but
does not report the LR.
The LR is only guaranteed to contain the return address at function call
boundaries, and can be used as a scratch register at other times, so the
LR at an exception boundary may or may not be a legitimate return
address. It would be useful to report the LR value regardless, as it can
be helpful when debugging, and in future it will be helpful for reliable
stacktrace support.
This patch changes the way we unwind across exception boundaries,
allowing both the PC and LR to be reported. The entry code creates a
frame_record_meta structure embedded within pt_regs, which the unwinder
uses to find the pt_regs. The unwinder can then extract pt_regs::pc and
pt_regs::lr as two separate unwind steps before continuing with a
regular walk of frame records.
When a PC is unwound from pt_regs::lr, dump_backtrace() will log this
with an "L" marker so that it can be identified easily. For example,
an unwind across an exception boundary will appear as follows:
| el1h_64_irq+0x6c/0x70
| _raw_spin_unlock_irqrestore+0x10/0x60 (P)
| __aarch64_insn_write+0x6c/0x90 (L)
| aarch64_insn_patch_text_nosync+0x28/0x80
... with a (P) entry for pt_regs::pc, and an (L) entry for pt_regs:lr.
Note that the LR may be stale at the point of the exception, for example,
shortly after a return:
| el1h_64_irq+0x6c/0x70
| default_idle_call+0x34/0x180 (P)
| default_idle_call+0x28/0x180 (L)
| do_idle+0x204/0x268
... where the LR points a few instructions before the current PC.
This plays nicely with all the other unwind metadata tracking. With the
ftrace_graph profiler enabled globally, and kretprobes installed on
generic_handle_domain_irq() and do_interrupt_handler(), a backtrace triggered
by magic-sysrq + L reports:
| Call trace:
| show_stack+0x20/0x40 (CF)
| dump_stack_lvl+0x60/0x80 (F)
| dump_stack+0x18/0x28
| nmi_cpu_backtrace+0xfc/0x140
| nmi_trigger_cpumask_backtrace+0x1c8/0x200
| arch_trigger_cpumask_backtrace+0x20/0x40
| sysrq_handle_showallcpus+0x24/0x38 (F)
| __handle_sysrq+0xa8/0x1b0 (F)
| handle_sysrq+0x38/0x50 (F)
| pl011_int+0x460/0x5a8 (F)
| __handle_irq_event_percpu+0x60/0x220 (F)
| handle_irq_event+0x54/0xc0 (F)
| handle_fasteoi_irq+0xa8/0x1d0 (F)
| generic_handle_domain_irq+0x34/0x58 (F)
| gic_handle_irq+0x54/0x140 (FK)
| call_on_irq_stack+0x24/0x58 (F)
| do_interrupt_handler+0x88/0xa0
| el1_interrupt+0x34/0x68 (FK)
| el1h_64_irq_handler+0x18/0x28
| el1h_64_irq+0x6c/0x70
| default_idle_call+0x34/0x180 (P)
| default_idle_call+0x28/0x180 (L)
| do_idle+0x204/0x268
| cpu_startup_entry+0x3c/0x50 (F)
| rest_init+0xe4/0xf0
| start_kernel+0x744/0x750
| __primary_switched+0x88/0x98
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Puranjay Mohan <puranjay12@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241017092538.1859841-11-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17 10:25:38 +01:00
|
|
|
childregs->stackframe.type = FRAME_META_TYPE_FINAL;
|
2019-01-31 14:58:46 +00:00
|
|
|
|
2022-04-12 10:18:48 -05:00
|
|
|
p->thread.cpu_context.x19 = (unsigned long)args->fn;
|
|
|
|
p->thread.cpu_context.x20 = (unsigned long)args->fn_arg;
|
2024-10-01 14:36:17 +01:00
|
|
|
|
|
|
|
if (system_supports_poe())
|
|
|
|
p->thread.por_el0 = POR_EL0_INIT;
|
2012-03-05 11:49:28 +00:00
|
|
|
}
|
|
|
|
p->thread.cpu_context.pc = (unsigned long)ret_from_fork;
|
2012-10-05 12:31:20 +01:00
|
|
|
p->thread.cpu_context.sp = (unsigned long)childregs;
|
arm64: Implement stack trace termination record
Reliable stacktracing requires that we identify when a stacktrace is
terminated early. We can do this by ensuring all tasks have a final
frame record at a known location on their task stack, and checking
that this is the final frame record in the chain.
We'd like to use task_pt_regs(task)->stackframe as the final frame
record, as this is already setup upon exception entry from EL0. For
kernel tasks we need to consistently reserve the pt_regs and point x29
at this, which we can do with small changes to __primary_switched,
__secondary_switched, and copy_process().
Since the final frame record must be at a specific location, we must
create the final frame record in __primary_switched and
__secondary_switched rather than leaving this to start_kernel and
secondary_start_kernel. Thus, __primary_switched and
__secondary_switched will now show up in stacktraces for the idle tasks.
Since the final frame record is now identified by its location rather
than by its contents, we identify it at the start of unwind_frame(),
before we read any values from it.
External debuggers may terminate the stack trace when FP == 0. In the
pt_regs->stackframe, the PC is 0 as well. So, stack traces taken in the
debugger may print an extra record 0x0 at the end. While this is not
pretty, this does not do any harm. This is a small price to pay for
having reliable stack trace termination in the kernel. That said, gdb
does not show the extra record probably because it uses DWARF and not
frame pointers for stack traces.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
[Mark: rebase, use ASM_BUG(), update comments, update commit message]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20210510110026.18061-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2021-05-10 12:00:26 +01:00
|
|
|
/*
|
|
|
|
* For the benefit of the unwinder, set up childregs->stackframe
|
|
|
|
* as the final frame for the new task.
|
|
|
|
*/
|
2024-10-17 10:25:33 +01:00
|
|
|
p->thread.cpu_context.fp = (unsigned long)&childregs->stackframe;
|
2012-03-05 11:49:28 +00:00
|
|
|
|
|
|
|
ptrace_hw_copy_thread(p);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-06-21 16:00:44 +01:00
|
|
|
void tls_preserve_current_state(void)
|
|
|
|
{
|
|
|
|
*task_user_tls(current) = read_sysreg(tpidr_el0);
|
2022-04-19 12:22:20 +01:00
|
|
|
if (system_supports_tpidr2() && !is_compat_task())
|
|
|
|
current->thread.tpidr2_el0 = read_sysreg_s(SYS_TPIDR2_EL0);
|
2017-06-21 16:00:44 +01:00
|
|
|
}
|
|
|
|
|
2012-03-05 11:49:28 +00:00
|
|
|
static void tls_thread_switch(struct task_struct *next)
|
|
|
|
{
|
2017-06-21 16:00:44 +01:00
|
|
|
tls_preserve_current_state();
|
2012-03-05 11:49:28 +00:00
|
|
|
|
2017-11-14 14:33:28 +00:00
|
|
|
if (is_compat_thread(task_thread_info(next)))
|
2018-03-28 10:50:49 +01:00
|
|
|
write_sysreg(next->thread.uw.tp_value, tpidrro_el0);
|
2024-11-14 09:53:32 +00:00
|
|
|
else
|
2017-11-14 14:33:28 +00:00
|
|
|
write_sysreg(0, tpidrro_el0);
|
2012-03-05 11:49:28 +00:00
|
|
|
|
2017-11-14 14:33:28 +00:00
|
|
|
write_sysreg(*task_user_tls(next), tpidr_el0);
|
2022-04-19 12:22:20 +01:00
|
|
|
if (system_supports_tpidr2())
|
|
|
|
write_sysreg_s(next->thread.tpidr2_el0, SYS_TPIDR2_EL0);
|
2012-03-05 11:49:28 +00:00
|
|
|
}
|
|
|
|
|
2019-07-22 14:53:09 +01:00
|
|
|
/*
|
|
|
|
* Force SSBS state on context-switch, since it may be lost after migrating
|
|
|
|
* from a CPU which treats the bit as RES0 in a heterogeneous system.
|
|
|
|
*/
|
|
|
|
static void ssbs_thread_switch(struct task_struct *next)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Nothing to do for kernel threads, but 'regs' may be junk
|
|
|
|
* (e.g. idle task) so check the flags and bail early.
|
|
|
|
*/
|
|
|
|
if (unlikely(next->flags & PF_KTHREAD))
|
|
|
|
return;
|
|
|
|
|
2020-02-06 10:42:58 +00:00
|
|
|
/*
|
|
|
|
* If all CPUs implement the SSBS extension, then we just need to
|
|
|
|
* context-switch the PSTATE field.
|
|
|
|
*/
|
arm64: Avoid cpus_have_const_cap() for ARM64_SSBS
In ssbs_thread_switch() we use cpus_have_const_cap() to check for
ARM64_SSBS, but this is not necessary and alternative_has_cap_*() would
be preferable.
For historical reasons, cpus_have_const_cap() is more complicated than
it needs to be. Before cpucaps are finalized, it will perform a bitmap
test of the system_cpucaps bitmap, and once cpucaps are finalized it
will use an alternative branch. This used to be necessary to handle some
race conditions in the window between cpucap detection and the
subsequent patching of alternatives and static branches, where different
branches could be out-of-sync with one another (or w.r.t. alternative
sequences). Now that we use alternative branches instead of static
branches, these are all patched atomically w.r.t. one another, and there
are only a handful of cases that need special care in the window between
cpucap detection and alternative patching.
Due to the above, it would be nice to remove cpus_have_const_cap(), and
migrate callers over to alternative_has_cap_*(), cpus_have_final_cap(),
or cpus_have_cap() depending on when their requirements. This will
remove redundant instructions and improve code generation, and will make
it easier to determine how each callsite will behave before, during, and
after alternative patching.
The cpus_have_const_cap() check in ssbs_thread_switch() is an
optimization to avoid the overhead of
spectre_v4_enable_task_mitigation() where all CPUs implement SSBS and
naturally preserve the SSBS bit in SPSR_ELx. In the window between
detecting the ARM64_SSBS system-wide and patching alternative branches
it is benign to continue to call spectre_v4_enable_task_mitigation().
This patch replaces the use of cpus_have_const_cap() with
alternative_has_cap_unlikely(), which will avoid generating code to test
the system_cpucaps bitmap and should be better for all subsequent calls
at runtime.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2023-10-16 11:24:50 +01:00
|
|
|
if (alternative_has_cap_unlikely(ARM64_SSBS))
|
2019-07-22 14:53:09 +01:00
|
|
|
return;
|
|
|
|
|
2020-09-18 11:54:33 +01:00
|
|
|
spectre_v4_enable_task_mitigation(next);
|
2019-07-22 14:53:09 +01:00
|
|
|
}
|
|
|
|
|
arm64: split thread_info from task stack
This patch moves arm64's struct thread_info from the task stack into
task_struct. This protects thread_info from corruption in the case of
stack overflows, and makes its address harder to determine if stack
addresses are leaked, making a number of attacks more difficult. Precise
detection and handling of overflow is left for subsequent patches.
Largely, this involves changing code to store the task_struct in sp_el0,
and acquire the thread_info from the task struct. Core code now
implements current_thread_info(), and as noted in <linux/sched.h> this
relies on offsetof(task_struct, thread_info) == 0, enforced by core
code.
This change means that the 'tsk' register used in entry.S now points to
a task_struct, rather than a thread_info as it used to. To make this
clear, the TI_* field offsets are renamed to TSK_TI_*, with asm-offsets
appropriately updated to account for the structural change.
Userspace clobbers sp_el0, and we can no longer restore this from the
stack. Instead, the current task is cached in a per-cpu variable that we
can safely access from early assembly as interrupts are disabled (and we
are thus not preemptible).
Both secondary entry and idle are updated to stash the sp and task
pointer separately.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: James Morse <james.morse@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2016-11-03 20:23:13 +00:00
|
|
|
/*
|
|
|
|
* We store our current task in sp_el0, which is clobbered by userspace. Keep a
|
|
|
|
* shadow copy so that we can restore this upon entry from userspace.
|
|
|
|
*
|
|
|
|
* This is *only* for exception entry from EL0, and is not valid until we
|
|
|
|
* __switch_to() a user task.
|
|
|
|
*/
|
|
|
|
DEFINE_PER_CPU(struct task_struct *, __entry_task);
|
|
|
|
|
|
|
|
static void entry_task_switch(struct task_struct *next)
|
|
|
|
{
|
|
|
|
__this_cpu_write(__entry_task, next);
|
|
|
|
}
|
|
|
|
|
2024-10-01 23:59:00 +01:00
|
|
|
#ifdef CONFIG_ARM64_GCS
|
|
|
|
|
|
|
|
void gcs_preserve_current_state(void)
|
|
|
|
{
|
|
|
|
current->thread.gcspr_el0 = read_sysreg_s(SYS_GCSPR_EL0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void gcs_thread_switch(struct task_struct *next)
|
|
|
|
{
|
|
|
|
if (!system_supports_gcs())
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* GCSPR_EL0 is always readable */
|
|
|
|
gcs_preserve_current_state();
|
|
|
|
write_sysreg_s(next->thread.gcspr_el0, SYS_GCSPR_EL0);
|
|
|
|
|
|
|
|
if (current->thread.gcs_el0_mode != next->thread.gcs_el0_mode)
|
|
|
|
gcs_set_el0_mode(next);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Ensure that GCS memory effects of the 'prev' thread are
|
|
|
|
* ordered before other memory accesses with release semantics
|
|
|
|
* (or preceded by a DMB) on the current PE. In addition, any
|
|
|
|
* memory accesses with acquire semantics (or succeeded by a
|
|
|
|
* DMB) are ordered before GCS memory effects of the 'next'
|
|
|
|
* thread. This will ensure that the GCS memory effects are
|
|
|
|
* visible to other PEs in case of migration.
|
|
|
|
*/
|
|
|
|
if (task_gcs_el0_enabled(current) || task_gcs_el0_enabled(next))
|
|
|
|
gcsb_dsync();
|
|
|
|
}
|
|
|
|
|
|
|
|
#else
|
|
|
|
|
|
|
|
static void gcs_thread_switch(struct task_struct *next)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
#endif
|
|
|
|
|
2020-07-31 18:38:23 +01:00
|
|
|
/*
|
arm64: Implement prctl(PR_{G,S}ET_TSC)
On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and
CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to
implement various erratum workarounds, check whether the CPU needs
a workaround whenever we potentially need to change it.
This is needed for a correct implementation of non-instrumenting
record-replay debugging on arm64 (i.e. rr; https://rr-project.org/).
rr must trap and record any sources of non-determinism from the
userspace program's perspective so it can be replayed later. This
includes the results of syscalls as well as the results of access
to architected timers exposed directly to the program. This prctl
was originally added for x86 by commit 8fb402bccf20 ("generic, x86:
add prctl commands PR_GET_TSC and PR_SET_TSC"), and rr uses it to
trap RDTSC on x86 for the same reason.
We also considered exposing this as a PTRACE_EVENT. However, prctl
seems like a better choice for these reasons:
1) In general an in-process control seems more useful than an
out-of-process control, since anything that you would be able to
do with ptrace could also be done with prctl (tracer can inject a
call to the prctl and handle signal-delivery-stops), and it avoids
needing an additional process (which will complicate debugging
of the ptraced process since it cannot have more than one tracer,
and will be incompatible with ptrace_scope=3) in cases where that
is not otherwise necessary.
2) Consistency with x86_64. Note that on x86_64, RDTSC has been there
since the start, so it's the same situation as on arm64.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I233a1867d1ccebe2933a347552e7eae862344421
Link: https://lore.kernel.org/r/20240824015415.488474-1-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-08-23 18:54:13 -07:00
|
|
|
* Handle sysreg updates for ARM erratum 1418040 which affects the 32bit view of
|
|
|
|
* CNTVCT, various other errata which require trapping all CNTVCT{,_EL0}
|
|
|
|
* accesses and prctl(PR_SET_TSC). Ensure access is disabled iff a workaround is
|
|
|
|
* required or PR_TSC_SIGSEGV is set.
|
2020-07-31 18:38:23 +01:00
|
|
|
*/
|
arm64: Implement prctl(PR_{G,S}ET_TSC)
On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and
CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to
implement various erratum workarounds, check whether the CPU needs
a workaround whenever we potentially need to change it.
This is needed for a correct implementation of non-instrumenting
record-replay debugging on arm64 (i.e. rr; https://rr-project.org/).
rr must trap and record any sources of non-determinism from the
userspace program's perspective so it can be replayed later. This
includes the results of syscalls as well as the results of access
to architected timers exposed directly to the program. This prctl
was originally added for x86 by commit 8fb402bccf20 ("generic, x86:
add prctl commands PR_GET_TSC and PR_SET_TSC"), and rr uses it to
trap RDTSC on x86 for the same reason.
We also considered exposing this as a PTRACE_EVENT. However, prctl
seems like a better choice for these reasons:
1) In general an in-process control seems more useful than an
out-of-process control, since anything that you would be able to
do with ptrace could also be done with prctl (tracer can inject a
call to the prctl and handle signal-delivery-stops), and it avoids
needing an additional process (which will complicate debugging
of the ptraced process since it cannot have more than one tracer,
and will be incompatible with ptrace_scope=3) in cases where that
is not otherwise necessary.
2) Consistency with x86_64. Note that on x86_64, RDTSC has been there
since the start, so it's the same situation as on arm64.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I233a1867d1ccebe2933a347552e7eae862344421
Link: https://lore.kernel.org/r/20240824015415.488474-1-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-08-23 18:54:13 -07:00
|
|
|
static void update_cntkctl_el1(struct task_struct *next)
|
2020-07-31 18:38:23 +01:00
|
|
|
{
|
arm64: Implement prctl(PR_{G,S}ET_TSC)
On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and
CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to
implement various erratum workarounds, check whether the CPU needs
a workaround whenever we potentially need to change it.
This is needed for a correct implementation of non-instrumenting
record-replay debugging on arm64 (i.e. rr; https://rr-project.org/).
rr must trap and record any sources of non-determinism from the
userspace program's perspective so it can be replayed later. This
includes the results of syscalls as well as the results of access
to architected timers exposed directly to the program. This prctl
was originally added for x86 by commit 8fb402bccf20 ("generic, x86:
add prctl commands PR_GET_TSC and PR_SET_TSC"), and rr uses it to
trap RDTSC on x86 for the same reason.
We also considered exposing this as a PTRACE_EVENT. However, prctl
seems like a better choice for these reasons:
1) In general an in-process control seems more useful than an
out-of-process control, since anything that you would be able to
do with ptrace could also be done with prctl (tracer can inject a
call to the prctl and handle signal-delivery-stops), and it avoids
needing an additional process (which will complicate debugging
of the ptraced process since it cannot have more than one tracer,
and will be incompatible with ptrace_scope=3) in cases where that
is not otherwise necessary.
2) Consistency with x86_64. Note that on x86_64, RDTSC has been there
since the start, so it's the same situation as on arm64.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I233a1867d1ccebe2933a347552e7eae862344421
Link: https://lore.kernel.org/r/20240824015415.488474-1-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-08-23 18:54:13 -07:00
|
|
|
struct thread_info *ti = task_thread_info(next);
|
2020-07-31 18:38:23 +01:00
|
|
|
|
arm64: Implement prctl(PR_{G,S}ET_TSC)
On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and
CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to
implement various erratum workarounds, check whether the CPU needs
a workaround whenever we potentially need to change it.
This is needed for a correct implementation of non-instrumenting
record-replay debugging on arm64 (i.e. rr; https://rr-project.org/).
rr must trap and record any sources of non-determinism from the
userspace program's perspective so it can be replayed later. This
includes the results of syscalls as well as the results of access
to architected timers exposed directly to the program. This prctl
was originally added for x86 by commit 8fb402bccf20 ("generic, x86:
add prctl commands PR_GET_TSC and PR_SET_TSC"), and rr uses it to
trap RDTSC on x86 for the same reason.
We also considered exposing this as a PTRACE_EVENT. However, prctl
seems like a better choice for these reasons:
1) In general an in-process control seems more useful than an
out-of-process control, since anything that you would be able to
do with ptrace could also be done with prctl (tracer can inject a
call to the prctl and handle signal-delivery-stops), and it avoids
needing an additional process (which will complicate debugging
of the ptraced process since it cannot have more than one tracer,
and will be incompatible with ptrace_scope=3) in cases where that
is not otherwise necessary.
2) Consistency with x86_64. Note that on x86_64, RDTSC has been there
since the start, so it's the same situation as on arm64.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I233a1867d1ccebe2933a347552e7eae862344421
Link: https://lore.kernel.org/r/20240824015415.488474-1-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-08-23 18:54:13 -07:00
|
|
|
if (test_ti_thread_flag(ti, TIF_TSC_SIGSEGV) ||
|
|
|
|
has_erratum_handler(read_cntvct_el0) ||
|
|
|
|
(IS_ENABLED(CONFIG_ARM64_ERRATUM_1418040) &&
|
|
|
|
this_cpu_has_cap(ARM64_WORKAROUND_1418040) &&
|
|
|
|
is_compat_thread(ti)))
|
2021-12-20 15:41:14 -08:00
|
|
|
sysreg_clear_set(cntkctl_el1, ARCH_TIMER_USR_VCT_ACCESS_EN, 0);
|
2020-07-31 18:38:23 +01:00
|
|
|
else
|
2021-12-20 15:41:14 -08:00
|
|
|
sysreg_clear_set(cntkctl_el1, 0, ARCH_TIMER_USR_VCT_ACCESS_EN);
|
|
|
|
}
|
2020-07-31 18:38:23 +01:00
|
|
|
|
arm64: Implement prctl(PR_{G,S}ET_TSC)
On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and
CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to
implement various erratum workarounds, check whether the CPU needs
a workaround whenever we potentially need to change it.
This is needed for a correct implementation of non-instrumenting
record-replay debugging on arm64 (i.e. rr; https://rr-project.org/).
rr must trap and record any sources of non-determinism from the
userspace program's perspective so it can be replayed later. This
includes the results of syscalls as well as the results of access
to architected timers exposed directly to the program. This prctl
was originally added for x86 by commit 8fb402bccf20 ("generic, x86:
add prctl commands PR_GET_TSC and PR_SET_TSC"), and rr uses it to
trap RDTSC on x86 for the same reason.
We also considered exposing this as a PTRACE_EVENT. However, prctl
seems like a better choice for these reasons:
1) In general an in-process control seems more useful than an
out-of-process control, since anything that you would be able to
do with ptrace could also be done with prctl (tracer can inject a
call to the prctl and handle signal-delivery-stops), and it avoids
needing an additional process (which will complicate debugging
of the ptraced process since it cannot have more than one tracer,
and will be incompatible with ptrace_scope=3) in cases where that
is not otherwise necessary.
2) Consistency with x86_64. Note that on x86_64, RDTSC has been there
since the start, so it's the same situation as on arm64.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I233a1867d1ccebe2933a347552e7eae862344421
Link: https://lore.kernel.org/r/20240824015415.488474-1-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-08-23 18:54:13 -07:00
|
|
|
static void cntkctl_thread_switch(struct task_struct *prev,
|
|
|
|
struct task_struct *next)
|
|
|
|
{
|
|
|
|
if ((read_ti_thread_flags(task_thread_info(prev)) &
|
|
|
|
(_TIF_32BIT | _TIF_TSC_SIGSEGV)) !=
|
|
|
|
(read_ti_thread_flags(task_thread_info(next)) &
|
|
|
|
(_TIF_32BIT | _TIF_TSC_SIGSEGV)))
|
|
|
|
update_cntkctl_el1(next);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int do_set_tsc_mode(unsigned int val)
|
2021-12-20 15:41:14 -08:00
|
|
|
{
|
arm64: Implement prctl(PR_{G,S}ET_TSC)
On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and
CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to
implement various erratum workarounds, check whether the CPU needs
a workaround whenever we potentially need to change it.
This is needed for a correct implementation of non-instrumenting
record-replay debugging on arm64 (i.e. rr; https://rr-project.org/).
rr must trap and record any sources of non-determinism from the
userspace program's perspective so it can be replayed later. This
includes the results of syscalls as well as the results of access
to architected timers exposed directly to the program. This prctl
was originally added for x86 by commit 8fb402bccf20 ("generic, x86:
add prctl commands PR_GET_TSC and PR_SET_TSC"), and rr uses it to
trap RDTSC on x86 for the same reason.
We also considered exposing this as a PTRACE_EVENT. However, prctl
seems like a better choice for these reasons:
1) In general an in-process control seems more useful than an
out-of-process control, since anything that you would be able to
do with ptrace could also be done with prctl (tracer can inject a
call to the prctl and handle signal-delivery-stops), and it avoids
needing an additional process (which will complicate debugging
of the ptraced process since it cannot have more than one tracer,
and will be incompatible with ptrace_scope=3) in cases where that
is not otherwise necessary.
2) Consistency with x86_64. Note that on x86_64, RDTSC has been there
since the start, so it's the same situation as on arm64.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I233a1867d1ccebe2933a347552e7eae862344421
Link: https://lore.kernel.org/r/20240824015415.488474-1-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-08-23 18:54:13 -07:00
|
|
|
bool tsc_sigsegv;
|
|
|
|
|
|
|
|
if (val == PR_TSC_SIGSEGV)
|
|
|
|
tsc_sigsegv = true;
|
|
|
|
else if (val == PR_TSC_ENABLE)
|
|
|
|
tsc_sigsegv = false;
|
|
|
|
else
|
|
|
|
return -EINVAL;
|
|
|
|
|
2021-12-20 15:41:14 -08:00
|
|
|
preempt_disable();
|
arm64: Implement prctl(PR_{G,S}ET_TSC)
On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and
CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to
implement various erratum workarounds, check whether the CPU needs
a workaround whenever we potentially need to change it.
This is needed for a correct implementation of non-instrumenting
record-replay debugging on arm64 (i.e. rr; https://rr-project.org/).
rr must trap and record any sources of non-determinism from the
userspace program's perspective so it can be replayed later. This
includes the results of syscalls as well as the results of access
to architected timers exposed directly to the program. This prctl
was originally added for x86 by commit 8fb402bccf20 ("generic, x86:
add prctl commands PR_GET_TSC and PR_SET_TSC"), and rr uses it to
trap RDTSC on x86 for the same reason.
We also considered exposing this as a PTRACE_EVENT. However, prctl
seems like a better choice for these reasons:
1) In general an in-process control seems more useful than an
out-of-process control, since anything that you would be able to
do with ptrace could also be done with prctl (tracer can inject a
call to the prctl and handle signal-delivery-stops), and it avoids
needing an additional process (which will complicate debugging
of the ptraced process since it cannot have more than one tracer,
and will be incompatible with ptrace_scope=3) in cases where that
is not otherwise necessary.
2) Consistency with x86_64. Note that on x86_64, RDTSC has been there
since the start, so it's the same situation as on arm64.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I233a1867d1ccebe2933a347552e7eae862344421
Link: https://lore.kernel.org/r/20240824015415.488474-1-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-08-23 18:54:13 -07:00
|
|
|
update_thread_flag(TIF_TSC_SIGSEGV, tsc_sigsegv);
|
|
|
|
update_cntkctl_el1(current);
|
2021-12-20 15:41:14 -08:00
|
|
|
preempt_enable();
|
arm64: Implement prctl(PR_{G,S}ET_TSC)
On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and
CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to
implement various erratum workarounds, check whether the CPU needs
a workaround whenever we potentially need to change it.
This is needed for a correct implementation of non-instrumenting
record-replay debugging on arm64 (i.e. rr; https://rr-project.org/).
rr must trap and record any sources of non-determinism from the
userspace program's perspective so it can be replayed later. This
includes the results of syscalls as well as the results of access
to architected timers exposed directly to the program. This prctl
was originally added for x86 by commit 8fb402bccf20 ("generic, x86:
add prctl commands PR_GET_TSC and PR_SET_TSC"), and rr uses it to
trap RDTSC on x86 for the same reason.
We also considered exposing this as a PTRACE_EVENT. However, prctl
seems like a better choice for these reasons:
1) In general an in-process control seems more useful than an
out-of-process control, since anything that you would be able to
do with ptrace could also be done with prctl (tracer can inject a
call to the prctl and handle signal-delivery-stops), and it avoids
needing an additional process (which will complicate debugging
of the ptraced process since it cannot have more than one tracer,
and will be incompatible with ptrace_scope=3) in cases where that
is not otherwise necessary.
2) Consistency with x86_64. Note that on x86_64, RDTSC has been there
since the start, so it's the same situation as on arm64.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I233a1867d1ccebe2933a347552e7eae862344421
Link: https://lore.kernel.org/r/20240824015415.488474-1-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-08-23 18:54:13 -07:00
|
|
|
|
|
|
|
return 0;
|
2020-07-31 18:38:23 +01:00
|
|
|
}
|
|
|
|
|
2024-08-22 16:10:49 +01:00
|
|
|
static void permission_overlay_switch(struct task_struct *next)
|
|
|
|
{
|
|
|
|
if (!system_supports_poe())
|
|
|
|
return;
|
|
|
|
|
|
|
|
current->thread.por_el0 = read_sysreg_s(SYS_POR_EL0);
|
|
|
|
if (current->thread.por_el0 != next->thread.por_el0) {
|
|
|
|
write_sysreg_s(next->thread.por_el0, SYS_POR_EL0);
|
arm64: poe: Handle spurious Overlay faults
We do not currently issue an ISB after updating POR_EL0 when
context-switching it, for instance. The rationale is that if the old
value of POR_EL0 is more restrictive and causes a fault during
uaccess, the access will be retried [1]. In other words, we are
trading an ISB on every context-switching for the (unlikely)
possibility of a spurious fault. We may also miss faults if the new
value of POR_EL0 is more restrictive, but that's considered
acceptable.
However, as things stand, a spurious Overlay fault results in
uaccess failing right away since it causes fault_from_pkey() to
return true. If an Overlay fault is reported, we therefore need to
double check POR_EL0 against vma_pkey(vma) - this is what
arch_vma_access_permitted() already does.
As it turns out, we already perform that explicit check if no
Overlay fault is reported, and we need to keep that check (see
comment added in fault_from_pkey()). Net result: the Overlay ISS2
bit isn't of much help to decide whether a pkey fault occurred.
Remove the check for the Overlay bit from fault_from_pkey() and
add a comment to try and explain the situation. While at it, also
add a comment to permission_overlay_switch() in case anyone gets
surprised by the lack of ISB.
[1] https://lore.kernel.org/linux-arm-kernel/ZtYNGBrcE-j35fpw@arm.com/
Fixes: 160a8e13de6c ("arm64: context switch POR_EL0 register")
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Link: https://lore.kernel.org/r/20250619160042.2499290-2-kevin.brodsky@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-06-19 17:00:41 +01:00
|
|
|
/*
|
|
|
|
* No ISB required as we can tolerate spurious Overlay faults -
|
|
|
|
* the fault handler will check again based on the new value
|
|
|
|
* of POR_EL0.
|
|
|
|
*/
|
2024-08-22 16:10:49 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-07-27 13:52:57 -07:00
|
|
|
/*
|
|
|
|
* __switch_to() checks current->thread.sctlr_user as an optimisation. Therefore
|
|
|
|
* this function must be called with preemption disabled and the update to
|
|
|
|
* sctlr_user must be made in the same preemption disabled block so that
|
|
|
|
* __switch_to() does not see the variable update before the SCTLR_EL1 one.
|
|
|
|
*/
|
|
|
|
void update_sctlr_el1(u64 sctlr)
|
2021-03-18 20:10:52 -07:00
|
|
|
{
|
2021-03-18 20:10:53 -07:00
|
|
|
/*
|
|
|
|
* EnIA must not be cleared while in the kernel as this is necessary for
|
|
|
|
* in-kernel PAC. It will be cleared on kernel exit if needed.
|
|
|
|
*/
|
|
|
|
sysreg_clear_set(sctlr_el1, SCTLR_USER_MASK & ~SCTLR_ELx_ENIA, sctlr);
|
2021-03-18 20:10:52 -07:00
|
|
|
|
|
|
|
/* ISB required for the kernel uaccess routines when setting TCF0. */
|
|
|
|
isb();
|
|
|
|
}
|
|
|
|
|
2012-03-05 11:49:28 +00:00
|
|
|
/*
|
|
|
|
* Thread switching.
|
|
|
|
*/
|
2021-11-29 14:28:43 +00:00
|
|
|
__notrace_funcgraph __sched
|
|
|
|
struct task_struct *__switch_to(struct task_struct *prev,
|
2012-03-05 11:49:28 +00:00
|
|
|
struct task_struct *next)
|
|
|
|
{
|
|
|
|
struct task_struct *last;
|
|
|
|
|
|
|
|
fpsimd_thread_switch(next);
|
|
|
|
tls_thread_switch(next);
|
|
|
|
hw_breakpoint_thread_switch(next);
|
2013-04-03 19:01:01 +01:00
|
|
|
contextidr_thread_switch(next);
|
arm64: split thread_info from task stack
This patch moves arm64's struct thread_info from the task stack into
task_struct. This protects thread_info from corruption in the case of
stack overflows, and makes its address harder to determine if stack
addresses are leaked, making a number of attacks more difficult. Precise
detection and handling of overflow is left for subsequent patches.
Largely, this involves changing code to store the task_struct in sp_el0,
and acquire the thread_info from the task struct. Core code now
implements current_thread_info(), and as noted in <linux/sched.h> this
relies on offsetof(task_struct, thread_info) == 0, enforced by core
code.
This change means that the 'tsk' register used in entry.S now points to
a task_struct, rather than a thread_info as it used to. To make this
clear, the TI_* field offsets are renamed to TSK_TI_*, with asm-offsets
appropriately updated to account for the structural change.
Userspace clobbers sp_el0, and we can no longer restore this from the
stack. Instead, the current task is cached in a per-cpu variable that we
can safely access from early assembly as interrupts are disabled (and we
are thus not preemptible).
Both secondary entry and idle are updated to stash the sp and task
pointer separately.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: James Morse <james.morse@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2016-11-03 20:23:13 +00:00
|
|
|
entry_task_switch(next);
|
2019-07-22 14:53:09 +01:00
|
|
|
ssbs_thread_switch(next);
|
arm64: Implement prctl(PR_{G,S}ET_TSC)
On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and
CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to
implement various erratum workarounds, check whether the CPU needs
a workaround whenever we potentially need to change it.
This is needed for a correct implementation of non-instrumenting
record-replay debugging on arm64 (i.e. rr; https://rr-project.org/).
rr must trap and record any sources of non-determinism from the
userspace program's perspective so it can be replayed later. This
includes the results of syscalls as well as the results of access
to architected timers exposed directly to the program. This prctl
was originally added for x86 by commit 8fb402bccf20 ("generic, x86:
add prctl commands PR_GET_TSC and PR_SET_TSC"), and rr uses it to
trap RDTSC on x86 for the same reason.
We also considered exposing this as a PTRACE_EVENT. However, prctl
seems like a better choice for these reasons:
1) In general an in-process control seems more useful than an
out-of-process control, since anything that you would be able to
do with ptrace could also be done with prctl (tracer can inject a
call to the prctl and handle signal-delivery-stops), and it avoids
needing an additional process (which will complicate debugging
of the ptraced process since it cannot have more than one tracer,
and will be incompatible with ptrace_scope=3) in cases where that
is not otherwise necessary.
2) Consistency with x86_64. Note that on x86_64, RDTSC has been there
since the start, so it's the same situation as on arm64.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I233a1867d1ccebe2933a347552e7eae862344421
Link: https://lore.kernel.org/r/20240824015415.488474-1-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-08-23 18:54:13 -07:00
|
|
|
cntkctl_thread_switch(prev, next);
|
2021-03-18 20:10:54 -07:00
|
|
|
ptrauth_thread_switch_user(next);
|
2024-08-22 16:10:49 +01:00
|
|
|
permission_overlay_switch(next);
|
2024-10-01 23:59:00 +01:00
|
|
|
gcs_thread_switch(next);
|
2012-03-05 11:49:28 +00:00
|
|
|
|
2013-04-24 14:47:02 +01:00
|
|
|
/*
|
2025-04-22 09:18:19 +01:00
|
|
|
* Complete any pending TLB or cache maintenance on this CPU in case the
|
|
|
|
* thread migrates to a different CPU. This full barrier is also
|
|
|
|
* required by the membarrier system call. Additionally it makes any
|
|
|
|
* in-progress pgtable writes visible to the table walker; See
|
|
|
|
* emit_pte_barriers().
|
2013-04-24 14:47:02 +01:00
|
|
|
*/
|
2014-05-02 16:24:10 +01:00
|
|
|
dsb(ish);
|
2012-03-05 11:49:28 +00:00
|
|
|
|
2019-11-27 10:30:15 +00:00
|
|
|
/*
|
|
|
|
* MTE thread switching must happen after the DSB above to ensure that
|
|
|
|
* any asynchronous tag check faults have been logged in the TFSR*_EL1
|
|
|
|
* registers.
|
|
|
|
*/
|
|
|
|
mte_thread_switch(next);
|
2021-03-18 20:10:52 -07:00
|
|
|
/* avoid expensive SCTLR_EL1 accesses if no change */
|
|
|
|
if (prev->thread.sctlr_user != next->thread.sctlr_user)
|
|
|
|
update_sctlr_el1(next->thread.sctlr_user);
|
2019-11-27 10:30:15 +00:00
|
|
|
|
2012-03-05 11:49:28 +00:00
|
|
|
/* the actual thread switch */
|
|
|
|
last = cpu_switch_to(prev, next);
|
|
|
|
|
|
|
|
return last;
|
|
|
|
}
|
|
|
|
|
arm64: Make __get_wchan() use arch_stack_walk()
To enable RELIABLE_STACKTRACE and LIVEPATCH on arm64, we need to
substantially rework arm64's unwinding code. As part of this, we want to
minimize the set of unwind interfaces we expose, and avoid open-coding
of unwind logic outside of stacktrace.c.
Currently, __get_wchan() walks the stack of a blocked task by calling
start_backtrace() with the task's saved PC and FP values, and iterating
unwind steps using unwind_frame(). The initialization is functionally
equivalent to calling arch_stack_walk() with the blocked task, which
will start with the task's saved PC and FP values.
Currently __get_wchan() always performs an initial unwind step, which
will stkip __switch_to(), but as this is now marked as a __sched
function, this no longer needs special handling and will be skipped in
the same way as other sched functions.
Make __get_wchan() use arch_stack_walk(). This simplifies __get_wchan(),
and in future will alow us to make unwind_frame() private to
stacktrace.c. At the same time, we can simplify the try_get_task_stack()
check and avoid the unnecessary `stack_page` variable.
The change to the skipping logic means we may terminate one frame
earlier than previously where there are an excessive number of sched
functions in the trace, but this isn't seen in practice, and wchan is
best-effort anyway, so this should not be a problem.
Other than the above, there should be no functional change as a result
of this patch.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
[Mark: rebase atop wchan changes, elaborate commit message, fix includes]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20211129142849.3056714-6-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2021-11-29 14:28:45 +00:00
|
|
|
struct wchan_info {
|
|
|
|
unsigned long pc;
|
|
|
|
int count;
|
|
|
|
};
|
|
|
|
|
|
|
|
static bool get_wchan_cb(void *arg, unsigned long pc)
|
|
|
|
{
|
|
|
|
struct wchan_info *wchan_info = arg;
|
|
|
|
|
|
|
|
if (!in_sched_functions(pc)) {
|
|
|
|
wchan_info->pc = pc;
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
return wchan_info->count++ < 16;
|
|
|
|
}
|
|
|
|
|
2021-09-29 15:02:14 -07:00
|
|
|
unsigned long __get_wchan(struct task_struct *p)
|
2012-03-05 11:49:28 +00:00
|
|
|
{
|
arm64: Make __get_wchan() use arch_stack_walk()
To enable RELIABLE_STACKTRACE and LIVEPATCH on arm64, we need to
substantially rework arm64's unwinding code. As part of this, we want to
minimize the set of unwind interfaces we expose, and avoid open-coding
of unwind logic outside of stacktrace.c.
Currently, __get_wchan() walks the stack of a blocked task by calling
start_backtrace() with the task's saved PC and FP values, and iterating
unwind steps using unwind_frame(). The initialization is functionally
equivalent to calling arch_stack_walk() with the blocked task, which
will start with the task's saved PC and FP values.
Currently __get_wchan() always performs an initial unwind step, which
will stkip __switch_to(), but as this is now marked as a __sched
function, this no longer needs special handling and will be skipped in
the same way as other sched functions.
Make __get_wchan() use arch_stack_walk(). This simplifies __get_wchan(),
and in future will alow us to make unwind_frame() private to
stacktrace.c. At the same time, we can simplify the try_get_task_stack()
check and avoid the unnecessary `stack_page` variable.
The change to the skipping logic means we may terminate one frame
earlier than previously where there are an excessive number of sched
functions in the trace, but this isn't seen in practice, and wchan is
best-effort anyway, so this should not be a problem.
Other than the above, there should be no functional change as a result
of this patch.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
[Mark: rebase atop wchan changes, elaborate commit message, fix includes]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20211129142849.3056714-6-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2021-11-29 14:28:45 +00:00
|
|
|
struct wchan_info wchan_info = {
|
|
|
|
.pc = 0,
|
|
|
|
.count = 0,
|
|
|
|
};
|
2012-03-05 11:49:28 +00:00
|
|
|
|
arm64: Make __get_wchan() use arch_stack_walk()
To enable RELIABLE_STACKTRACE and LIVEPATCH on arm64, we need to
substantially rework arm64's unwinding code. As part of this, we want to
minimize the set of unwind interfaces we expose, and avoid open-coding
of unwind logic outside of stacktrace.c.
Currently, __get_wchan() walks the stack of a blocked task by calling
start_backtrace() with the task's saved PC and FP values, and iterating
unwind steps using unwind_frame(). The initialization is functionally
equivalent to calling arch_stack_walk() with the blocked task, which
will start with the task's saved PC and FP values.
Currently __get_wchan() always performs an initial unwind step, which
will stkip __switch_to(), but as this is now marked as a __sched
function, this no longer needs special handling and will be skipped in
the same way as other sched functions.
Make __get_wchan() use arch_stack_walk(). This simplifies __get_wchan(),
and in future will alow us to make unwind_frame() private to
stacktrace.c. At the same time, we can simplify the try_get_task_stack()
check and avoid the unnecessary `stack_page` variable.
The change to the skipping logic means we may terminate one frame
earlier than previously where there are an excessive number of sched
functions in the trace, but this isn't seen in practice, and wchan is
best-effort anyway, so this should not be a problem.
Other than the above, there should be no functional change as a result
of this patch.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
[Mark: rebase atop wchan changes, elaborate commit message, fix includes]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20211129142849.3056714-6-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2021-11-29 14:28:45 +00:00
|
|
|
if (!try_get_task_stack(p))
|
2016-11-03 20:23:08 +00:00
|
|
|
return 0;
|
|
|
|
|
arm64: Make __get_wchan() use arch_stack_walk()
To enable RELIABLE_STACKTRACE and LIVEPATCH on arm64, we need to
substantially rework arm64's unwinding code. As part of this, we want to
minimize the set of unwind interfaces we expose, and avoid open-coding
of unwind logic outside of stacktrace.c.
Currently, __get_wchan() walks the stack of a blocked task by calling
start_backtrace() with the task's saved PC and FP values, and iterating
unwind steps using unwind_frame(). The initialization is functionally
equivalent to calling arch_stack_walk() with the blocked task, which
will start with the task's saved PC and FP values.
Currently __get_wchan() always performs an initial unwind step, which
will stkip __switch_to(), but as this is now marked as a __sched
function, this no longer needs special handling and will be skipped in
the same way as other sched functions.
Make __get_wchan() use arch_stack_walk(). This simplifies __get_wchan(),
and in future will alow us to make unwind_frame() private to
stacktrace.c. At the same time, we can simplify the try_get_task_stack()
check and avoid the unnecessary `stack_page` variable.
The change to the skipping logic means we may terminate one frame
earlier than previously where there are an excessive number of sched
functions in the trace, but this isn't seen in practice, and wchan is
best-effort anyway, so this should not be a problem.
Other than the above, there should be no functional change as a result
of this patch.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
[Mark: rebase atop wchan changes, elaborate commit message, fix includes]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20211129142849.3056714-6-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2021-11-29 14:28:45 +00:00
|
|
|
arch_stack_walk(get_wchan_cb, &wchan_info, p, NULL);
|
2019-07-02 14:07:28 +01:00
|
|
|
|
2016-11-03 20:23:08 +00:00
|
|
|
put_task_stack(p);
|
arm64: Make __get_wchan() use arch_stack_walk()
To enable RELIABLE_STACKTRACE and LIVEPATCH on arm64, we need to
substantially rework arm64's unwinding code. As part of this, we want to
minimize the set of unwind interfaces we expose, and avoid open-coding
of unwind logic outside of stacktrace.c.
Currently, __get_wchan() walks the stack of a blocked task by calling
start_backtrace() with the task's saved PC and FP values, and iterating
unwind steps using unwind_frame(). The initialization is functionally
equivalent to calling arch_stack_walk() with the blocked task, which
will start with the task's saved PC and FP values.
Currently __get_wchan() always performs an initial unwind step, which
will stkip __switch_to(), but as this is now marked as a __sched
function, this no longer needs special handling and will be skipped in
the same way as other sched functions.
Make __get_wchan() use arch_stack_walk(). This simplifies __get_wchan(),
and in future will alow us to make unwind_frame() private to
stacktrace.c. At the same time, we can simplify the try_get_task_stack()
check and avoid the unnecessary `stack_page` variable.
The change to the skipping logic means we may terminate one frame
earlier than previously where there are an excessive number of sched
functions in the trace, but this isn't seen in practice, and wchan is
best-effort anyway, so this should not be a problem.
Other than the above, there should be no functional change as a result
of this patch.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
[Mark: rebase atop wchan changes, elaborate commit message, fix includes]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20211129142849.3056714-6-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2021-11-29 14:28:45 +00:00
|
|
|
|
|
|
|
return wchan_info.pc;
|
2012-03-05 11:49:28 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
unsigned long arch_align_stack(unsigned long sp)
|
|
|
|
{
|
|
|
|
if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space)
|
2022-10-09 20:44:02 -06:00
|
|
|
sp -= get_random_u32_below(PAGE_SIZE);
|
2012-03-05 11:49:28 +00:00
|
|
|
return sp & ~0xf;
|
|
|
|
}
|
|
|
|
|
2021-07-30 12:24:38 +01:00
|
|
|
#ifdef CONFIG_COMPAT
|
|
|
|
int compat_elf_check_arch(const struct elf32_hdr *hdr)
|
|
|
|
{
|
|
|
|
if (!system_supports_32bit_el0())
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if ((hdr)->e_machine != EM_ARM)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
if (!((hdr)->e_flags & EF_ARM_EABI_MASK))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Prevent execve() of a 32-bit program from a deadline task
|
|
|
|
* if the restricted affinity mask would be inadmissible on an
|
|
|
|
* asymmetric system.
|
|
|
|
*/
|
|
|
|
return !static_branch_unlikely(&arm64_mismatched_32bit_el0) ||
|
|
|
|
!dl_task_check_affinity(current, system_32bit_el0_cpumask());
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2017-08-20 13:20:48 +03:00
|
|
|
/*
|
|
|
|
* Called from setup_new_exec() after (COMPAT_)SET_PERSONALITY.
|
|
|
|
*/
|
|
|
|
void arch_setup_new_exec(void)
|
|
|
|
{
|
2021-06-08 19:02:57 +01:00
|
|
|
unsigned long mmflags = 0;
|
|
|
|
|
|
|
|
if (is_compat_task()) {
|
|
|
|
mmflags = MMCF_AARCH32;
|
2021-07-30 12:24:38 +01:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Restrict the CPU affinity mask for a 32-bit task so that
|
|
|
|
* it contains only 32-bit-capable CPUs.
|
|
|
|
*
|
|
|
|
* From the perspective of the task, this looks similar to
|
|
|
|
* what would happen if the 64-bit-only CPUs were hot-unplugged
|
|
|
|
* at the point of execve(), although we try a bit harder to
|
|
|
|
* honour the cpuset hierarchy.
|
|
|
|
*/
|
2021-06-08 19:02:57 +01:00
|
|
|
if (static_branch_unlikely(&arm64_mismatched_32bit_el0))
|
2021-07-30 12:24:38 +01:00
|
|
|
force_compatible_cpus_allowed_ptr(current);
|
|
|
|
} else if (static_branch_unlikely(&arm64_mismatched_32bit_el0)) {
|
|
|
|
relax_compatible_cpus_allowed_ptr(current);
|
2021-06-08 19:02:57 +01:00
|
|
|
}
|
arm64: add basic pointer authentication support
This patch adds basic support for pointer authentication, allowing
userspace to make use of APIAKey, APIBKey, APDAKey, APDBKey, and
APGAKey. The kernel maintains key values for each process (shared by all
threads within), which are initialised to random values at exec() time.
The ID_AA64ISAR1_EL1.{APA,API,GPA,GPI} fields are exposed to userspace,
to describe that pointer authentication instructions are available and
that the kernel is managing the keys. Two new hwcaps are added for the
same reason: PACA (for address authentication) and PACG (for generic
authentication).
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com>
Tested-by: Adam Wallis <awallis@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ramana Radhakrishnan <ramana.radhakrishnan@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
[will: Fix sizeof() usage and unroll address key initialisation]
Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-12-07 18:39:25 +00:00
|
|
|
|
2021-06-08 19:02:57 +01:00
|
|
|
current->mm->context.flags = mmflags;
|
2021-03-18 20:10:53 -07:00
|
|
|
ptrauth_thread_init_user();
|
|
|
|
mte_thread_init_user();
|
arm64: Implement prctl(PR_{G,S}ET_TSC)
On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and
CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to
implement various erratum workarounds, check whether the CPU needs
a workaround whenever we potentially need to change it.
This is needed for a correct implementation of non-instrumenting
record-replay debugging on arm64 (i.e. rr; https://rr-project.org/).
rr must trap and record any sources of non-determinism from the
userspace program's perspective so it can be replayed later. This
includes the results of syscalls as well as the results of access
to architected timers exposed directly to the program. This prctl
was originally added for x86 by commit 8fb402bccf20 ("generic, x86:
add prctl commands PR_GET_TSC and PR_SET_TSC"), and rr uses it to
trap RDTSC on x86 for the same reason.
We also considered exposing this as a PTRACE_EVENT. However, prctl
seems like a better choice for these reasons:
1) In general an in-process control seems more useful than an
out-of-process control, since anything that you would be able to
do with ptrace could also be done with prctl (tracer can inject a
call to the prctl and handle signal-delivery-stops), and it avoids
needing an additional process (which will complicate debugging
of the ptraced process since it cannot have more than one tracer,
and will be incompatible with ptrace_scope=3) in cases where that
is not otherwise necessary.
2) Consistency with x86_64. Note that on x86_64, RDTSC has been there
since the start, so it's the same situation as on arm64.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I233a1867d1ccebe2933a347552e7eae862344421
Link: https://lore.kernel.org/r/20240824015415.488474-1-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-08-23 18:54:13 -07:00
|
|
|
do_set_tsc_mode(PR_TSC_ENABLE);
|
2020-09-28 14:03:00 +01:00
|
|
|
|
|
|
|
if (task_spec_ssb_noexec(current)) {
|
|
|
|
arch_prctl_spec_ctrl_set(current, PR_SPEC_STORE_BYPASS,
|
|
|
|
PR_SPEC_ENABLE);
|
|
|
|
}
|
2017-08-20 13:20:48 +03:00
|
|
|
}
|
2019-07-23 19:58:39 +02:00
|
|
|
|
|
|
|
#ifdef CONFIG_ARM64_TAGGED_ADDR_ABI
|
|
|
|
/*
|
|
|
|
* Control the relaxed ABI allowing tagged user addresses into the kernel.
|
|
|
|
*/
|
2019-08-15 16:44:01 +01:00
|
|
|
static unsigned int tagged_addr_disabled;
|
2019-07-23 19:58:39 +02:00
|
|
|
|
2020-07-03 14:25:50 +01:00
|
|
|
long set_tagged_addr_ctrl(struct task_struct *task, unsigned long arg)
|
2019-07-23 19:58:39 +02:00
|
|
|
{
|
2019-11-27 10:30:15 +00:00
|
|
|
unsigned long valid_mask = PR_TAGGED_ADDR_ENABLE;
|
2020-07-03 14:25:50 +01:00
|
|
|
struct thread_info *ti = task_thread_info(task);
|
2019-11-27 10:30:15 +00:00
|
|
|
|
2020-07-03 14:25:50 +01:00
|
|
|
if (is_compat_thread(ti))
|
2019-07-23 19:58:39 +02:00
|
|
|
return -EINVAL;
|
2019-11-27 10:30:15 +00:00
|
|
|
|
2025-06-18 10:29:52 +01:00
|
|
|
if (system_supports_mte()) {
|
2022-02-16 17:32:24 +00:00
|
|
|
valid_mask |= PR_MTE_TCF_SYNC | PR_MTE_TCF_ASYNC \
|
|
|
|
| PR_MTE_TAG_MASK;
|
2019-11-27 10:30:15 +00:00
|
|
|
|
2025-06-18 10:29:52 +01:00
|
|
|
if (cpus_have_cap(ARM64_MTE_STORE_ONLY))
|
|
|
|
valid_mask |= PR_MTE_STORE_ONLY;
|
|
|
|
}
|
|
|
|
|
2019-11-27 10:30:15 +00:00
|
|
|
if (arg & ~valid_mask)
|
2019-07-23 19:58:39 +02:00
|
|
|
return -EINVAL;
|
|
|
|
|
2019-08-15 16:44:01 +01:00
|
|
|
/*
|
|
|
|
* Do not allow the enabling of the tagged address ABI if globally
|
|
|
|
* disabled via sysctl abi.tagged_addr_disabled.
|
|
|
|
*/
|
|
|
|
if (arg & PR_TAGGED_ADDR_ENABLE && tagged_addr_disabled)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2020-07-03 14:25:50 +01:00
|
|
|
if (set_mte_ctrl(task, arg) != 0)
|
2019-11-27 10:30:15 +00:00
|
|
|
return -EINVAL;
|
|
|
|
|
2020-07-03 14:25:50 +01:00
|
|
|
update_ti_thread_flag(ti, TIF_TAGGED_ADDR, arg & PR_TAGGED_ADDR_ENABLE);
|
2019-07-23 19:58:39 +02:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-07-03 14:25:50 +01:00
|
|
|
long get_tagged_addr_ctrl(struct task_struct *task)
|
2019-07-23 19:58:39 +02:00
|
|
|
{
|
2019-11-27 10:30:15 +00:00
|
|
|
long ret = 0;
|
2020-07-03 14:25:50 +01:00
|
|
|
struct thread_info *ti = task_thread_info(task);
|
2019-11-27 10:30:15 +00:00
|
|
|
|
2020-07-03 14:25:50 +01:00
|
|
|
if (is_compat_thread(ti))
|
2019-07-23 19:58:39 +02:00
|
|
|
return -EINVAL;
|
|
|
|
|
2020-07-03 14:25:50 +01:00
|
|
|
if (test_ti_thread_flag(ti, TIF_TAGGED_ADDR))
|
2019-11-27 10:30:15 +00:00
|
|
|
ret = PR_TAGGED_ADDR_ENABLE;
|
2019-07-23 19:58:39 +02:00
|
|
|
|
2020-07-03 14:25:50 +01:00
|
|
|
ret |= get_mte_ctrl(task);
|
2019-11-27 10:30:15 +00:00
|
|
|
|
|
|
|
return ret;
|
2019-07-23 19:58:39 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Global sysctl to disable the tagged user addresses support. This control
|
|
|
|
* only prevents the tagged address ABI enabling via prctl() and does not
|
|
|
|
* disable it for tasks that already opted in to the relaxed ABI.
|
|
|
|
*/
|
|
|
|
|
2025-01-28 13:48:37 +01:00
|
|
|
static const struct ctl_table tagged_addr_sysctl_table[] = {
|
2019-07-23 19:58:39 +02:00
|
|
|
{
|
2019-08-15 16:44:01 +01:00
|
|
|
.procname = "tagged_addr_disabled",
|
2019-07-23 19:58:39 +02:00
|
|
|
.mode = 0644,
|
2019-08-15 16:44:01 +01:00
|
|
|
.data = &tagged_addr_disabled,
|
2019-07-23 19:58:39 +02:00
|
|
|
.maxlen = sizeof(int),
|
|
|
|
.proc_handler = proc_dointvec_minmax,
|
2020-01-24 16:51:27 +01:00
|
|
|
.extra1 = SYSCTL_ZERO,
|
|
|
|
.extra2 = SYSCTL_ONE,
|
2019-07-23 19:58:39 +02:00
|
|
|
},
|
|
|
|
};
|
|
|
|
|
|
|
|
static int __init tagged_addr_init(void)
|
|
|
|
{
|
|
|
|
if (!register_sysctl("abi", tagged_addr_sysctl_table))
|
|
|
|
return -EINVAL;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
core_initcall(tagged_addr_init);
|
|
|
|
#endif /* CONFIG_ARM64_TAGGED_ADDR_ABI */
|
arm64: entry.S: Do not preempt from IRQ before all cpufeatures are enabled
Preempting from IRQ-return means that the task has its PSTATE saved
on the stack, which will get restored when the task is resumed and does
the actual IRQ return.
However, enabling some CPU features requires modifying the PSTATE. This
means that, if a task was scheduled out during an IRQ-return before all
CPU features are enabled, the task might restore a PSTATE that does not
include the feature enablement changes once scheduled back in.
* Task 1:
PAN == 0 ---| |---------------
| |<- return from IRQ, PSTATE.PAN = 0
| <- IRQ |
+--------+ <- preempt() +--
^
|
reschedule Task 1, PSTATE.PAN == 1
* Init:
--------------------+------------------------
^
|
enable_cpu_features
set PSTATE.PAN on all CPUs
Worse than this, since PSTATE is untouched when task switching is done,
a task missing the new bits in PSTATE might affect another task, if both
do direct calls to schedule() (outside of IRQ/exception contexts).
Fix this by preventing preemption on IRQ-return until features are
enabled on all CPUs.
This way the only PSTATE values that are saved on the stack are from
synchronous exceptions. These are expected to be fatal this early, the
exception is BRK for WARN_ON(), but as this uses do_debug_exception()
which keeps IRQs masked, it shouldn't call schedule().
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
[james: Replaced a really cool hack, with an even simpler static key in C.
expanded commit message with Julien's cover-letter ascii art]
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2019-10-15 18:25:44 +01:00
|
|
|
|
2020-03-16 16:50:47 +00:00
|
|
|
#ifdef CONFIG_BINFMT_ELF
|
|
|
|
int arch_elf_adjust_prot(int prot, const struct arch_elf_state *state,
|
|
|
|
bool has_interp, bool is_interp)
|
|
|
|
{
|
2020-03-23 17:01:19 +00:00
|
|
|
/*
|
|
|
|
* For dynamically linked executables the interpreter is
|
|
|
|
* responsible for setting PROT_BTI on everything except
|
|
|
|
* itself.
|
|
|
|
*/
|
2020-03-16 16:50:47 +00:00
|
|
|
if (is_interp != has_interp)
|
|
|
|
return prot;
|
|
|
|
|
|
|
|
if (!(state->flags & ARM64_ELF_BTI))
|
|
|
|
return prot;
|
|
|
|
|
|
|
|
if (prot & PROT_EXEC)
|
|
|
|
prot |= PROT_BTI;
|
|
|
|
|
|
|
|
return prot;
|
|
|
|
}
|
|
|
|
#endif
|
arm64: Implement prctl(PR_{G,S}ET_TSC)
On arm64, this prctl controls access to CNTVCT_EL0, CNTVCTSS_EL0 and
CNTFRQ_EL0 via CNTKCTL_EL1.EL0VCTEN. Since this bit is also used to
implement various erratum workarounds, check whether the CPU needs
a workaround whenever we potentially need to change it.
This is needed for a correct implementation of non-instrumenting
record-replay debugging on arm64 (i.e. rr; https://rr-project.org/).
rr must trap and record any sources of non-determinism from the
userspace program's perspective so it can be replayed later. This
includes the results of syscalls as well as the results of access
to architected timers exposed directly to the program. This prctl
was originally added for x86 by commit 8fb402bccf20 ("generic, x86:
add prctl commands PR_GET_TSC and PR_SET_TSC"), and rr uses it to
trap RDTSC on x86 for the same reason.
We also considered exposing this as a PTRACE_EVENT. However, prctl
seems like a better choice for these reasons:
1) In general an in-process control seems more useful than an
out-of-process control, since anything that you would be able to
do with ptrace could also be done with prctl (tracer can inject a
call to the prctl and handle signal-delivery-stops), and it avoids
needing an additional process (which will complicate debugging
of the ptraced process since it cannot have more than one tracer,
and will be incompatible with ptrace_scope=3) in cases where that
is not otherwise necessary.
2) Consistency with x86_64. Note that on x86_64, RDTSC has been there
since the start, so it's the same situation as on arm64.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I233a1867d1ccebe2933a347552e7eae862344421
Link: https://lore.kernel.org/r/20240824015415.488474-1-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-08-23 18:54:13 -07:00
|
|
|
|
|
|
|
int get_tsc_mode(unsigned long adr)
|
|
|
|
{
|
|
|
|
unsigned int val;
|
|
|
|
|
|
|
|
if (is_compat_task())
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (test_thread_flag(TIF_TSC_SIGSEGV))
|
|
|
|
val = PR_TSC_SIGSEGV;
|
|
|
|
else
|
|
|
|
val = PR_TSC_ENABLE;
|
|
|
|
|
|
|
|
return put_user(val, (unsigned int __user *)adr);
|
|
|
|
}
|
|
|
|
|
|
|
|
int set_tsc_mode(unsigned int val)
|
|
|
|
{
|
|
|
|
if (is_compat_task())
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
return do_set_tsc_mode(val);
|
|
|
|
}
|