linux/tools/testing/memblock
Mike Rapoport (Microsoft) 4c78cc596b memblock: add MEMBLOCK_RSRV_KERN flag
Patch series "kexec: introduce Kexec HandOver (KHO)", v8.

Kexec today considers itself purely a boot loader: When we enter the new
kernel, any state the previous kernel left behind is irrelevant and the
new kernel reinitializes the system.

However, there are use cases where this mode of operation is not what we
actually want.  In virtualization hosts for example, we want to use kexec
to update the host kernel while virtual machine memory stays untouched. 
When we add device assignment to the mix, we also need to ensure that
IOMMU and VFIO states are untouched.  If we add PCIe peer to peer DMA, we
need to do the same for the PCI subsystem.  If we want to kexec while an
SEV-SNP enabled virtual machine is running, we need to preserve the VM
context pages and physical memory.  See "pkernfs: Persisting guest memory
and kernel/device state safely across kexec" Linux Plumbers Conference
2023 presentation for details:

  https://lpc.events/event/17/contributions/1485/

To start us on the journey to support all the use cases above, this patch
implements basic infrastructure to allow hand over of kernel state across
kexec (Kexec HandOver, aka KHO).  As a really simple example target, we
use memblock's reserve_mem.

With this patchset applied, memory that was reserved using "reserve_mem"
command line options remains intact after kexec and it is guaranteed to
reside at the same physical address.

== Alternatives ==

There are alternative approaches to (parts of) the problems above:

  * Memory Pools [1] - preallocated persistent memory region + allocator
  * PRMEM [2] - resizable persistent memory regions with fixed metadata
                pointer on the kernel command line + allocator
  * Pkernfs [3] - preallocated file system for in-kernel data with fixed
                  address location on the kernel command line
  * PKRAM [4] - handover of user space pages using a fixed metadata page
                specified via command line

All of the approaches above fundamentally have the same problem: They
require the administrator to explicitly carve out a physical memory
location because they have no mechanism outside of the kernel command line
to pass data (including memory reservations) between kexec'ing kernels.

KHO provides that base foundation.  We will determine later whether we
still need any of the approaches above for fast bulk memory handover of
for example IOMMU page tables.  But IMHO they would all be users of KHO,
with KHO providing the foundational primitive to pass metadata and bulk
memory reservations as well as provide easy versioning for data.

== Overview ==

We introduce a metadata file that the kernels pass between each other. 
How they pass it is architecture specific.  The file's format is a
Flattened Device Tree (fdt) which has a generator and parser already
included in Linux.  KHO is enabled in the kernel command line by `kho=on`.
When the root user enables KHO through
/sys/kernel/debug/kho/out/finalize, the kernel invokes callbacks to every
KHO users to register preserved memory regions, which contain drivers'
states.

When the actual kexec happens, the fdt is part of the image set that we
boot into.  In addition, we keep "scratch regions" available for kexec:
physically contiguous memory regions that are guaranteed to not have any
memory that KHO would preserve.  The new kernel bootstraps itself using
the scratch regions and sets all handed over memory as in use.  When
drivers initialize that support KHO, they introspect the fdt, restore
preserved memory regions, and retrieve their states stored in the
preserved memory.

== Limitations ==

Currently KHO is only implemented for file based kexec.  The kernel
interfaces in the patch set are already in place to support user space
kexec as well, but it is still not implemented it yet inside kexec tools.

== How to Use ==

To use the code, please boot the kernel with the "kho=on" command line
parameter.  KHO will automatically create scratch regions.  If you want to
set the scratch size explicitly you can use "kho_scratch=" command line
parameter.  For instance, "kho_scratch=16M,512M,256M" will reserve a 16
MiB low memory scratch area, a 512 MiB global scratch region, and 256 MiB
per NUMA node scratch regions on boot.

Make sure to have a reserved memory range requested with reserv_mem
command line option, for example, "reserve_mem=64m:4k:n1".

Then before you invoke file based "kexec -l", finalize KHO FDT:

  # echo 1 > /sys/kernel/debug/kho/out/finalize

You can preview the generated FDT using `dtc`,

  # dtc /sys/kernel/debug/kho/out/fdt
  # dtc /sys/kernel/debug/kho/out/sub_fdts/memblock

`dtc` is available on ubuntu by `sudo apt-get install device-tree-compiler`.

Now kexec into the new kernel,

  # kexec -l Image --initrd=initrd -s
  # kexec -e

(The order of KHO finalization and "kexec -l" does not matter.)

The new kernel will boot up and contain the previous kernel's reserve_mem
contents at the same physical address as the first kernel.

You can also review the FDT passed from the old kernel,

  # dtc /sys/kernel/debug/kho/in/fdt
  # dtc /sys/kernel/debug/kho/in/sub_fdts/memblock


This patch (of 17):

To denote areas that were reserved for kernel use either directly with
memblock_reserve_kern() or via memblock allocations.

Link: https://lore.kernel.org/lkml/20250424083258.2228122-1-changyuanl@google.com/
Link: https://lore.kernel.org/lkml/aAeaJ2iqkrv_ffhT@kernel.org/
Link: https://lore.kernel.org/lkml/35c58191-f774-40cf-8d66-d1e2aaf11a62@intel.com/
Link: https://lore.kernel.org/lkml/20250424093302.3894961-1-arnd@kernel.org/
Link: https://lkml.kernel.org/r/20250509074635.3187114-1-changyuanl@google.com
Link: https://lkml.kernel.org/r/20250509074635.3187114-2-changyuanl@google.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Marc Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Will Deacon <will@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12 23:50:38 -07:00
..
asm
lib
linux memblock tests: Fix mutex related build error 2025-04-07 09:28:01 +03:00
scripts memblock tests: add simulation of physical memory with multiple NUMA nodes 2022-09-18 10:30:20 +03:00
tests memblock: add MEMBLOCK_RSRV_KERN flag 2025-05-12 23:50:38 -07:00
.gitignore memblock tests: Fix compilation error. 2023-01-04 12:26:59 +02:00
internal.h memblock tests: Fix mutex related build error 2025-04-07 09:28:01 +03:00
main.c memblock tests: introduce range tests for memblock_alloc_exact_nid_raw 2022-11-08 09:50:24 +02:00
Makefile memblock test: fix implicit declaration of function 'memparse' 2024-08-06 08:21:25 +03:00
mmzone.c memblock tests: Fix compilation errors. 2023-09-14 08:42:53 +03:00
README memblock test: Modify the obsolete description in README 2022-07-30 11:46:49 +03:00
TODO memblock tests: remove completed TODO item 2022-11-08 09:50:24 +02:00

==================
Memblock simulator
==================

Introduction
============

Memblock is a boot time memory allocator[1] that manages memory regions before
the actual memory management is initialized. Its APIs allow to register physical
memory regions, mark them as available or reserved, allocate a block of memory
within the requested range and/or in specific NUMA node, and many more.

Because it is used so early in the booting process, testing and debugging it is
difficult. This test suite, usually referred as memblock simulator, is
an attempt at testing the memblock mechanism. It runs one monolithic test that
consist of a series of checks that exercise both the basic operations and
allocation functionalities of memblock. The main data structure of the boot time
memory allocator is initialized at the build time, so the checks here reuse its
instance throughout the duration of the test. To ensure that tests don't affect
each other, region arrays are reset in between.

As this project uses the actual memblock code and has to run in user space,
some of the kernel definitions were stubbed by the initial commit that
introduced memblock simulator (commit 16802e55dea9 ("memblock tests: Add
skeleton of the memblock simulator")) and a few preparation commits just
before it. Most of them don't match the kernel implementation, so one should
consult them first before making any significant changes to the project.

Usage
=====

To run the tests, build the main target and run it:

$ make && ./main

A successful run produces no output. It is possible to control the behavior
by passing options from command line. For example, to include verbose output,
append the `-v` options when you run the tests:

$ ./main -v

This will print information about which functions are being tested and the
number of test cases that passed.

For the full list of options from command line, see `./main --help`.

It is also possible to override different configuration parameters to change
the test functions. For example, to simulate enabled NUMA, use:

$ make NUMA=1

For the full list of build options, see `make help`.

Project structure
=================

The project has one target, main, which calls a group of checks for basic and
allocation functions. Tests for each group are defined in dedicated files, as it
can be seen here:

memblock
|-- asm       ------------------,
|-- lib                         |-- implement function and struct stubs
|-- linux     ------------------'
|-- scripts
|    |-- Makefile.include        -- handles `make` parameters
|-- tests
|    |-- alloc_api.(c|h)         -- memblock_alloc tests
|    |-- alloc_helpers_api.(c|h) -- memblock_alloc_from tests
|    |-- alloc_nid_api.(c|h)     -- memblock_alloc_try_nid tests
|    |-- basic_api.(c|h)         -- memblock_add/memblock_reserve/... tests
|    |-- common.(c|h)            -- helper functions for resetting memblock;
|-- main.c        --------------.   dummy physical memory definition
|-- Makefile                     `- test runner
|-- README
|-- TODO
|-- .gitignore

Simulating physical memory
==========================

Some allocation functions clear the memory in the process, so it is required for
memblock to track valid memory ranges. To achieve this, the test suite registers
with memblock memory stored by test_memory struct. It is a small wrapper that
points to a block of memory allocated via malloc. For each group of allocation
tests, dummy physical memory is allocated, added to memblock, and then released
at the end of the test run. The structure of a test runner checking allocation
functions is as follows:

int memblock_alloc_foo_checks(void)
{
	reset_memblock_attributes();     /* data structure reset */
	dummy_physical_memory_init();    /* allocate and register memory */

	(...allocation checks...)

	dummy_physical_memory_cleanup(); /* free the memory */
}

There's no need to explicitly free the dummy memory from memblock via
memblock_free() call. The entry will be erased by reset_memblock_regions(),
called at the beginning of each test.

Known issues
============

1. Requesting a specific NUMA node via memblock_alloc_node() does not work as
   intended. Once the fix is in place, tests for this function can be added.

2. Tests for memblock_alloc_low() can't be easily implemented. The function uses
   ARCH_LOW_ADDRESS_LIMIT marco, which can't be changed to point at the low
   memory of the memory_block.

References
==========

1. Boot time memory management documentation page:
   https://www.kernel.org/doc/html/latest/core-api/boot-time-mm.html