
The process addresses documentation already contains a great deal of
information about mmap/VMA locking and page table traversal and
manipulation. However, it waves its hands about non-VMA traversal. Add a
section for this and explain the caveats around this kind of traversal.

Additionally, commit 6375e95f38 ("mm: pgtable: reclaim empty PTE page in
madvise(MADV_DONTNEED)") caused zapping to also free empty PTE page
tables. Highlight this.
Link: https://lkml.kernel.org/r/20250604180308.137116-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
.. SPDX-License-Identifier: GPL-2.0

=================
Process Addresses
=================

.. toctree::
   :maxdepth: 3

Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
'VMA's of type :c:struct:`!struct vm_area_struct`.

Each VMA describes a virtually contiguous memory range with identical
attributes, each described by a :c:struct:`!struct vm_area_struct`
object. Userland access outside of VMAs is invalid except in the case where an
adjacent stack VMA could be extended to contain the accessed address.

All VMAs are contained within one and only one virtual address space, described
by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that is,
threads) which share the virtual address space. We refer to this as the
:c:struct:`!mm`.

Each mm object contains a maple tree data structure which describes all VMAs
within the virtual address space.

.. note:: An exception to this is the 'gate' VMA which is provided by
          architectures which use :c:struct:`!vsyscall` and is a global static
          object which does not belong to any specific mm.

-------
Locking
-------

The kernel is designed to be highly scalable against concurrent read operations
on VMA **metadata** so a complicated set of locks are required to ensure memory
corruption does not occur.

.. note:: Locking VMAs for their metadata does not have any impact on the memory
          they describe nor the page tables that map them.

Terminology
-----------

* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
  which locks at a process address space granularity which can be acquired via
  :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
  as a read/write semaphore in practice. A VMA read lock is obtained via
  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
  automatically when the mmap write lock is released). To take a VMA write lock
  you **must** have already acquired an :c:func:`!mmap_write_lock`.
* **rmap locks** - When trying to access VMAs through the reverse mapping via a
  :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
  (reachable from a folio via :c:member:`!folio->mapping`), VMAs must be
  stabilised via :c:func:`!anon_vma_[try]lock_read` or
  :c:func:`!anon_vma_[try]lock_write` for anonymous memory and
  :c:func:`!i_mmap_[try]lock_read` or :c:func:`!i_mmap_[try]lock_write` for
  file-backed memory. We refer to these locks as the reverse mapping locks, or
  'rmap locks' for brevity (a sketch of rmap-locked traversal follows this
  list).

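As an illustration, traversal via the file-backed rmap lock might look like the
following (a minimal sketch; the helper and its purpose are hypothetical, while
the locking and interval tree iteration APIs are real):

.. code-block:: c

   #include <linux/fs.h>
   #include <linux/mm.h>
   #include <linux/rmap.h>

   /* Hypothetical helper: visit every VMA mapping a given file range,
    * keeping each VMA stable via the i_mmap rmap lock. */
   static void visit_vmas_mapping_range(struct address_space *mapping,
                                        pgoff_t pgoff_start, pgoff_t pgoff_end)
   {
           struct vm_area_struct *vma;

           i_mmap_lock_read(mapping); /* rmap read lock - stabilises VMAs */
           vma_interval_tree_foreach(vma, &mapping->i_mmap,
                                     pgoff_start, pgoff_end) {
                   /* vma is stable here - safe to read its metadata. */
           }
           i_mmap_unlock_read(mapping);
   }
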
We discuss page table locks separately in the dedicated section below.

The first thing **any** of these locks achieve is to **stabilise** the VMA
within the MM tree. That is, guaranteeing that the VMA object will not be
deleted from under you nor modified (except for some specific fields
described below).

Stabilising a VMA also keeps the address space described by it around.

Lock usage
----------

If you want to **read** VMA metadata fields or just keep the VMA stable, you
must do one of the following:

* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock` (or a
  suitable variant), unlocking it with a matching :c:func:`!mmap_read_unlock` when
  you're done with the VMA, *or*
* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`. This tries to
  acquire the lock atomically so might fail, in which case fall-back logic is
  required to instead obtain an mmap read lock if this returns :c:macro:`!NULL`
  (see the sketch following this list), *or*
* Acquire an rmap lock before traversing the locked interval tree (whether
  anonymous or file-backed) to obtain the required VMA.

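For instance, a page fault-style lookup using the optimistic VMA read lock with
an mmap read lock fall-back might look like this (a minimal sketch; the
surrounding function is hypothetical, while the locking calls are the real
APIs):

.. code-block:: c

   #include <linux/mm.h>
   #include <linux/mmap_lock.h>

   /* Hypothetical example: stabilise the VMA containing addr for reading. */
   static void read_vma_example(struct mm_struct *mm, unsigned long addr)
   {
           struct vm_area_struct *vma;

           vma = lock_vma_under_rcu(mm, addr);
           if (vma) {
                   /* VMA read lock held - the VMA is stable. */
                   vma_end_read(vma);
                   return;
           }

           /* Contended or write-locked - fall back to the mmap read lock. */
           mmap_read_lock(mm);
           vma = find_vma(mm, addr);
           if (vma) {
                   /* The VMA is stable while the mmap read lock is held. */
           }
           mmap_read_unlock(mm);
   }
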
If you want to **write** VMA metadata fields, then things vary depending on the
field (we explore each VMA field in detail below). For the majority you must
(a sketch of this pattern follows the list):

* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock` (or a
  suitable variant), unlocking it with a matching :c:func:`!mmap_write_unlock` when
  you're done with the VMA, *and*
* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish to
  modify, which will be released automatically when :c:func:`!mmap_write_unlock` is
  called.
* If you want to be able to write to **any** field, you must also hide the VMA
  from the reverse mapping by obtaining an **rmap write lock**.

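As an illustration, updating the flags of a single VMA might proceed as follows
(a minimal sketch; the helper and choice of flag are hypothetical, while the
locking and :c:func:`!vm_flags_set` calls are the real APIs):

.. code-block:: c

   #include <linux/mm.h>
   #include <linux/mmap_lock.h>

   /* Hypothetical example: set VM_DONTCOPY on the VMA containing addr. */
   static void write_vma_example(struct mm_struct *mm, unsigned long addr)
   {
           struct vm_area_struct *vma;

           mmap_write_lock(mm);
           vma = find_vma(mm, addr);
           if (vma) {
                   /* Exclude VMA read lock holders (vm_flags_set() would
                    * also write-lock the VMA itself). */
                   vma_start_write(vma);
                   vm_flags_set(vma, VM_DONTCOPY);
           }
           /* Releasing the mmap write lock releases all VMA write locks. */
           mmap_write_unlock(mm);
   }
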
VMA locks are special in that you must obtain an mmap **write** lock **first**
in order to obtain a VMA **write** lock. A VMA **read** lock however can be
obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire then
release an RCU lock to lookup the VMA for you).

This constrains the impact of writers on readers, as a writer can interact with
one VMA while a reader interacts with another simultaneously.

.. note:: The primary users of VMA read locks are page fault handlers, which
          means that without a VMA write lock, page faults will run concurrent with
          whatever you are doing.

Examining all valid lock states:

.. table::

   ========= ======== ========= ======= ===== =========== ==========
   mmap lock VMA lock rmap lock Stable? Read? Write most? Write all?
   ========= ======== ========= ======= ===== =========== ==========
   \-        \-       \-        N       N     N           N
   \-        R        \-        Y       Y     N           N
   \-        \-       R/W       Y       Y     N           N
   R/W       \-/R     \-/R/W    Y       Y     N           N
   W         W        \-/R      Y       Y     Y           N
   W         W        W         Y       Y     Y           Y
   ========= ======== ========= ======= ===== =========== ==========

.. warning:: While it's possible to obtain a VMA lock while holding an mmap read lock,
             attempting to do the reverse is invalid as it can result in deadlock - if
             another task already holds an mmap write lock and attempts to acquire a VMA
             write lock that will deadlock on the VMA read lock.

All of these locks behave as read/write semaphores in practice, so you can
obtain either a read or a write lock for each of these.

.. note:: Generally speaking, a read/write semaphore is a class of lock which
          permits concurrent readers. However a write lock can only be obtained
          once all readers have left the critical region (and pending readers
          made to wait).

          This renders read locks on a read/write semaphore concurrent with other
          readers and write locks exclusive against all others holding the semaphore.

VMA fields
^^^^^^^^^^

We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
easier to explore their locking characteristics:

.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
          are in effect an internal implementation detail.

.. table:: Virtual layout fields

   ===================== ======================================== ===========
   Field                 Description                              Write lock
   ===================== ======================================== ===========
   :c:member:`!vm_start` Inclusive start virtual address of range mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_end`   Exclusive end virtual address of range   mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_pgoff` Describes the page offset into the file, mmap write,
                         the original page offset within the      VMA write,
                         virtual address space (prior to any      rmap write.
                         :c:func:`!mremap`), or PFN if a PFN map
                         and the architecture does not support
                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
   ===================== ======================================== ===========

These fields describe the size, start and end of the VMA, and as such cannot be
modified without first being hidden from the reverse mapping since these fields
are used to locate VMAs within the reverse mapping interval trees.

.. table:: Core fields

   ============================ ========================================= =========================
   Field                        Description                               Write lock
   ============================ ========================================= =========================
   :c:member:`!vm_mm`           Containing mm_struct.                     None - written once on
                                                                          initial map.
   :c:member:`!vm_page_prot`    Architecture-specific page table          mmap write, VMA write.
                                protection bits determined from VMA
                                flags.
   :c:member:`!vm_flags`        Read-only access to VMA flags describing  N/A
                                attributes of the VMA, in union with
                                private writable
                                :c:member:`!__vm_flags`.
   :c:member:`!__vm_flags`      Private, writable access to VMA flags     mmap write, VMA write.
                                field, updated by
                                :c:func:`!vm_flags_*` functions.
   :c:member:`!vm_file`         If the VMA is file-backed, points to a    None - written once on
                                struct file object describing the         initial map.
                                underlying file, if anonymous then
                                :c:macro:`!NULL`.
   :c:member:`!vm_ops`          If the VMA is file-backed, then either    None - Written once on
                                the driver or file-system provides a      initial map by
                                :c:struct:`!struct vm_operations_struct`  :c:func:`!f_ops->mmap()`.
                                object describing callbacks to be
                                invoked on VMA lifetime events.
   :c:member:`!vm_private_data` A :c:member:`!void *` field for           Handled by driver.
                                driver-specific metadata.
   ============================ ========================================= =========================

These are the core fields which describe the MM the VMA belongs to and its attributes.

.. table:: Config-specific fields

   ================================= ===================== ======================================== ===============
   Field                             Configuration option  Description                              Write lock
   ================================= ===================== ======================================== ===============
   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
                                                           :c:struct:`!struct anon_vma_name`        VMA write.
                                                           object providing a name for anonymous
                                                           mappings, or :c:macro:`!NULL` if none
                                                           is set or the VMA is file-backed. The
                                                           underlying object is reference counted
                                                           and can be shared across multiple VMAs
                                                           for scalability.
   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read,
                                                           to perform readahead. This field is      swap-specific
                                                           accessed atomically.                     lock.
   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
                                                           describes the NUMA behaviour of the      VMA write.
                                                           VMA. The underlying object is reference
                                                           counted.
   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read,
                                                           describes the current state of           numab-specific
                                                           NUMA balancing in relation to this VMA.  lock.
                                                           Updated under mmap read lock by
                                                           :c:func:`!task_numa_work`.
   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
                                                           either of zero size if userfaultfd is
                                                           disabled, or containing a pointer
                                                           to an underlying
                                                           :c:type:`!userfaultfd_ctx` object which
                                                           describes userfaultfd metadata.
   ================================= ===================== ======================================== ===============

These fields are present or not depending on whether the relevant kernel
configuration option is set.

.. table:: Reverse mapping fields

   =================================== ========================================= ============================
   Field                               Description                               Write lock
   =================================== ========================================= ============================
   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write, VMA write,
                                       mapping is file-backed, to place the VMA  i_mmap write.
                                       in the
                                       :c:member:`!struct address_space->i_mmap`
                                       red/black interval tree.
   :c:member:`!shared.rb_subtree_last` Metadata used for management of the       mmap write, VMA write,
                                       interval tree if the VMA is file-backed.  i_mmap write.
   :c:member:`!anon_vma_chain`         List of pointers to both forked/CoW’d     mmap read, anon_vma write.
                                       :c:type:`!anon_vma` objects and
                                       :c:member:`!vma->anon_vma` if it is
                                       non-:c:macro:`!NULL`.
   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        When :c:macro:`NULL` and
                                       anonymous folios mapped exclusively to    setting non-:c:macro:`NULL`:
                                       this VMA. Initially set by                mmap read, page_table_lock.
                                       :c:func:`!anon_vma_prepare` serialised
                                       by the :c:macro:`!page_table_lock`. This  When non-:c:macro:`NULL` and
                                       is set as soon as any page is faulted in. setting :c:macro:`NULL`:
                                                                                 mmap write, VMA write,
                                                                                 anon_vma write.
   =================================== ========================================= ============================

These fields are used to both place the VMA within the reverse mapping, and for
anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects
and the :c:struct:`!struct anon_vma` in which folios mapped exclusively to this VMA should
reside.

.. note:: If a file-backed mapping is mapped with :c:macro:`!MAP_PRIVATE` set
          then it can be in both the :c:type:`!anon_vma` and :c:type:`!i_mmap`
          trees at the same time, so all of these fields might be utilised at
          once.

Page tables
-----------

We won't speak exhaustively on the subject but broadly speaking, page tables map
virtual addresses to physical ones through a series of page tables, each of
which contain entries with physical addresses for the next page table level
(along with flags), and at the leaf level the physical addresses of the
underlying physical data pages or a special entry such as a swap entry,
migration entry or other special marker. Offsets into these pages are provided
by the virtual address itself.

In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
pages might eliminate one or two of these levels, but when this is the case we
typically refer to the leaf level as the PTE level regardless.

.. note:: In instances where the architecture supports fewer page tables than
          five the kernel cleverly 'folds' page table levels, that is stubbing
          out functions related to the skipped levels. This allows us to
          conceptually act as if there were always five levels, even if the
          compiler might, in practice, eliminate any code relating to missing
          ones.

There are four key operations typically performed on page tables:

1. **Traversing** page tables - Simply reading page tables in order to traverse
   them. This only requires that the VMA is kept stable, so a lock which
   establishes this suffices for traversal (there are also lockless variants
   which eliminate even this requirement, such as :c:func:`!gup_fast`). There is
   also a special case of page table traversal for non-VMA regions which we
   consider separately below.
2. **Installing** page table mappings - Whether creating a new mapping or
   modifying an existing one in such a way as to change its identity. This
   requires that the VMA is kept stable via an mmap or VMA lock (explicitly not
   rmap locks).
3. **Zapping/unmapping** page table entries - This is what the kernel calls
   clearing page table mappings at the leaf level only, whilst leaving all page
   tables in place. This is a very common operation in the kernel performed on
   file truncation, the :c:macro:`!MADV_DONTNEED` operation via
   :c:func:`!madvise`, and others. This is performed by a number of functions
   including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`
   (see the sketch following this list). The VMA need only be kept stable for
   this operation.
4. **Freeing** page tables - When finally the kernel removes page tables from a
   userland process (typically via :c:func:`!free_pgtables`) extreme care must
   be taken to ensure this is done safely, as this logic finally frees all page
   tables in the specified range, ignoring existing leaf entries (it assumes the
   caller has both zapped the range and prevented any further faults or
   modifications within it).

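For example, a filesystem punching a hole in a file might zap all userland
mappings of the affected range like this (a minimal sketch; the helper is
hypothetical, while :c:func:`!unmap_mapping_range` is the real API):

.. code-block:: c

   #include <linux/fs.h>
   #include <linux/mm.h>

   /* Hypothetical example: zap all userland mappings of a (page-aligned)
    * file range. Leaf entries are cleared, but the page tables which
    * contained them remain in place. */
   static void zap_file_hole(struct address_space *mapping,
                             loff_t offset, loff_t len)
   {
           /* even_cows == 1: also zap private CoW copies of the pages. */
           unmap_mapping_range(mapping, offset, len, 1);
   }
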
.. note:: Modifying mappings for reclaim or migration is performed under rmap
          lock as it, like zapping, does not fundamentally modify the identity
          of what is being mapped.

**Traversing** and **zapping** ranges can be performed holding any one of the
locks described in the terminology section above - that is the mmap lock, the
VMA lock or either of the reverse mapping locks.

That is - as long as you keep the relevant VMA **stable** - you are good to go
ahead and perform these operations on page tables (though internally, kernel
operations that perform writes also acquire internal page table locks to
serialise - see the page table implementation detail section for more details).

.. note:: We free empty PTE tables on zap under the RCU lock - this does not
          change the aforementioned locking requirements around zapping.

When **installing** page table entries, the mmap or VMA lock must be held to
keep the VMA stable. We explore why this is in the page table locking details
section below.

**Freeing** page tables is an entirely internal memory management operation and
has special requirements (see the page freeing section below for more details).

.. warning:: When **freeing** page tables, it must not be possible for VMAs
             containing the ranges those page tables map to be accessible via
             the reverse mapping.

             The :c:func:`!free_pgtables` function removes the relevant VMAs
             from the reverse mappings, but no other VMAs can be permitted to be
             accessible and span the specified range.

Traversing non-VMA page tables
------------------------------

We've focused above on traversal of page tables belonging to VMAs. It is also
possible to traverse page tables which are not represented by VMAs.

Kernel page table mappings themselves are generally managed by whatever part of
the kernel established them and the aforementioned locking rules do not apply -
for instance vmalloc has its own set of locks which are utilised for
establishing and tearing down its page tables.

However, for convenience we provide the :c:func:`!walk_kernel_page_table_range`
function which is synchronised via the mmap lock on the :c:macro:`!init_mm`
kernel instantiation of the :c:struct:`!struct mm_struct` metadata object.

If an operation requires exclusive access, a write lock is used, but if not, a
read lock suffices - we assert only that at least a read lock has been acquired.

Since, aside from vmalloc and memory hot plug, kernel page tables are not torn
down all that often, this usually suffices. However, any caller of this
functionality must ensure that any additionally required locks are acquired in
advance.

We also permit a truly unusual case: the traversal of non-VMA ranges in
**userland**, as provided for by :c:func:`!walk_page_range_debug`.

This has only one user - the general page table dumping logic (implemented in
:c:macro:`!mm/ptdump.c`) - which seeks to expose all mappings for debug purposes
even if they are highly unusual (possibly architecture-specific) and are not
backed by a VMA.

We must take great care in this case, as the :c:func:`!munmap` implementation
detaches VMAs under an mmap write lock before tearing down page tables under a
downgraded mmap read lock.

This means such a traversal could race with page table teardown, and thus an
mmap **write** lock is required.

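For reference, the ordinary VMA-backed walker, :c:func:`!walk_page_range`, uses
the same :c:struct:`!struct mm_walk_ops` callback pattern as these functions. A
minimal sketch (the counting callback is hypothetical; the walk API is real,
and the mmap read lock satisfies the stability requirement for traversal):

.. code-block:: c

   #include <linux/mm.h>
   #include <linux/pagewalk.h>

   /* Hypothetical callback: count present leaf entries in the range. */
   static int count_pte(pte_t *pte, unsigned long addr, unsigned long next,
                        struct mm_walk *walk)
   {
           unsigned long *count = walk->private;

           if (pte_present(ptep_get(pte)))
                   (*count)++;
           return 0;
   }

   static const struct mm_walk_ops count_ops = {
           .pte_entry = count_pte,
   };

   /* Hypothetical example: count mapped pages in [start, end). */
   static unsigned long count_mapped(struct mm_struct *mm,
                                     unsigned long start, unsigned long end)
   {
           unsigned long count = 0;

           mmap_read_lock(mm); /* keeps VMAs in the range stable */
           walk_page_range(mm, start, end, &count_ops, &count);
           mmap_read_unlock(mm);
           return count;
   }
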
Lock ordering
-------------

As we have multiple locks across the kernel which may or may not be taken at the
same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
the **order** in which locks are acquired and released becomes very important.

.. note:: Lock inversion occurs when two threads need to acquire multiple locks,
          but in doing so inadvertently cause a mutual deadlock.

          For example, consider thread 1 which holds lock A and tries to acquire lock B,
          while thread 2 holds lock B and tries to acquire lock A.

          Both threads are now deadlocked on each other. However, had they attempted to
          acquire locks in the same order, one would have waited for the other to
          complete its work and no deadlock would have occurred.

The opening comment in :c:macro:`!mm/rmap.c` describes in detail the required
ordering of locks within memory management code:

.. code-block::

  inode->i_rwsem        (while writing or truncating, not reading or faulting)
    mm->mmap_lock
      mapping->invalidate_lock (in filemap_fault)
        folio_lock
          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
            vma_start_write
              mapping->i_mmap_rwsem
                anon_vma->rwsem
                  mm->page_table_lock or pte_lock
                    swap_lock (in swap_duplicate, swap_info_get)
                      mmlist_lock (in mmput, drain_mmlist and others)
                      mapping->private_lock (in block_dirty_folio)
                        i_pages lock (widely used)
                          lruvec->lru_lock (in folio_lruvec_lock_irq)
                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
                        sb_lock (within inode_lock in fs/fs-writeback.c)
                        i_pages lock (widely used, in set_page_dirty,
                                      in arch-dependent flush_dcache_mmap_lock,
                                      within bdi.wb->list_lock in __sync_single_inode)

There is also a file-system specific lock ordering comment located at the top of
:c:macro:`!mm/filemap.c`:

.. code-block::

  ->i_mmap_rwsem                (truncate_pagecache)
    ->private_lock              (__free_pte->block_dirty_folio)
      ->swap_lock               (exclusive_swap_page, others)
        ->i_pages lock

  ->i_rwsem
    ->invalidate_lock           (acquired by fs in truncate path)
      ->i_mmap_rwsem            (truncate->unmap_mapping_range)

  ->mmap_lock
    ->i_mmap_rwsem
      ->page_table_lock or pte_lock (various, mainly in memory.c)
        ->i_pages lock          (arch-dependent flush_dcache_mmap_lock)

  ->mmap_lock
    ->invalidate_lock           (filemap_fault)
      ->lock_page               (filemap_fault, access_process_vm)

  ->i_rwsem                     (generic_perform_write)
    ->mmap_lock                 (fault_in_readable->do_page_fault)

  bdi->wb.list_lock
    sb_lock                     (fs/fs-writeback.c)
    ->i_pages lock              (__sync_single_inode)

  ->i_mmap_rwsem
    ->anon_vma.lock             (vma_merge)

  ->anon_vma.lock
    ->page_table_lock or pte_lock (anon_vma_prepare and various)

  ->page_table_lock or pte_lock
    ->swap_lock                 (try_to_unmap_one)
    ->private_lock              (try_to_unmap_one)
    ->i_pages lock              (try_to_unmap_one)
    ->lruvec->lru_lock          (follow_page_mask->mark_page_accessed)
    ->lruvec->lru_lock          (check_pte_range->folio_isolate_lru)
    ->private_lock              (folio_remove_rmap_pte->set_page_dirty)
    ->i_pages lock              (folio_remove_rmap_pte->set_page_dirty)
    bdi.wb->list_lock           (folio_remove_rmap_pte->set_page_dirty)
    ->inode->i_lock             (folio_remove_rmap_pte->set_page_dirty)
    bdi.wb->list_lock           (zap_pte_range->set_page_dirty)
    ->inode->i_lock             (zap_pte_range->set_page_dirty)
    ->private_lock              (zap_pte_range->block_dirty_folio)

Please check the current state of these comments which may have changed since
the time of writing of this document.

------------------------------
Locking Implementation Details
------------------------------

.. warning:: Locking rules for PTE-level page tables are very different from
             locking rules for page tables at other levels.

Page table locking details
--------------------------

.. note:: This section explores page table locking requirements for page tables
          encompassed by a VMA. See the above section on non-VMA page table
          traversal for details on how we handle that case.

In addition to the locks described in the terminology section above, we have
additional locks dedicated to page tables:

* **Higher level page table locks** - Higher level page tables, that is PGD, P4D
  and PUD each make use of the process address space granularity
  :c:member:`!mm->page_table_lock` lock when modified.

* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks
  either kept within the folios describing the page tables or allocated
  separately and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is
  set. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are
  mapped into higher memory (if a 32-bit system) and carefully locked via
  :c:func:`!pte_offset_map_lock`.

These locks represent the minimum required to interact with each page table
level, but there are further requirements.

Importantly, note that on a **traversal** of page tables, sometimes no such
locks are taken. However, at the PTE level, at least concurrent page table
deletion must be prevented (using RCU) and the page table must be mapped into
high memory, see below.

Whether care is taken on reading the page table entries depends on the
architecture, see the section on atomicity below.

Locking rules
^^^^^^^^^^^^^

We establish basic locking rules when interacting with page tables:

* When changing a page table entry the page table lock for that page table
  **must** be held, except if you can safely assume nobody can access the page
  tables concurrently (such as on invocation of :c:func:`!free_pgtables`).
* Reads from and writes to page table entries must be *appropriately*
  atomic. See the section on atomicity below for details.
* Populating previously empty entries requires that the mmap or VMA locks are
  held (read or write), doing so with only rmap locks would be dangerous (see
  the warning below).
* As mentioned previously, zapping can be performed while simply keeping the VMA
  stable, that is holding any one of the mmap, VMA or rmap locks.

.. warning:: Populating previously empty entries is dangerous as, when unmapping
             VMAs, :c:func:`!vms_clear_ptes` has a window of time between
             zapping (via :c:func:`!unmap_vmas`) and freeing page tables (via
             :c:func:`!free_pgtables`), where the VMA is still visible in the
             rmap tree. :c:func:`!free_pgtables` assumes that the zap has
             already been performed and removes PTEs unconditionally (along with
             all other page tables in the freed range), so installing new PTE
             entries could leak memory and also cause other unexpected and
             dangerous behaviour.

There are additional rules applicable when moving page tables, which we discuss
in the section on this topic below.

PTE-level page tables are different from page tables at other levels, and there
are extra requirements for accessing them:

* On 32-bit architectures, they may be in high memory (meaning they need to be
  mapped into kernel memory to be accessible).
* When empty, they can be unlinked and RCU-freed while holding an mmap lock or
  rmap lock for reading in combination with the PTE and PMD page table locks.
  In particular, this happens in :c:func:`!retract_page_tables` when handling
  :c:macro:`!MADV_COLLAPSE`.
  So accessing PTE-level page tables requires at least holding an RCU read lock;
  but that only suffices for readers that can tolerate racing with concurrent
  page table updates such that an empty PTE is observed (in a page table that
  has actually already been detached and marked for RCU freeing) while another
  new page table has been installed in the same location and filled with
  entries. Writers normally need to take the PTE lock and revalidate that the
  PMD entry still refers to the same PTE-level page table.
  If the writer does not care whether it is the same PTE-level page table, it
  can take the PMD lock and revalidate that the contents of pmd entry still meet
  the requirements. In particular, this also happens in :c:func:`!retract_page_tables`
  when handling :c:macro:`!MADV_COLLAPSE`.

To access PTE-level page tables, a helper like :c:func:`!pte_offset_map_lock` or
:c:func:`!pte_offset_map` can be used depending on stability requirements.
These map the page table into kernel memory if required, take the RCU lock, and
depending on variant, may also look up or acquire the PTE lock.
See the comment on :c:func:`!__pte_offset_map_lock`.

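For example, a writer might access a PTE-level page table as follows (a minimal
sketch; the helper is hypothetical and the modification elided, while the
mapping/locking helpers are the real APIs):

.. code-block:: c

   #include <linux/mm.h>

   /* Hypothetical example: lock the PTE covering addr for modification,
    * coping with the PTE-level table disappearing from under us. */
   static void modify_pte_example(struct mm_struct *mm, pmd_t *pmd,
                                  unsigned long addr)
   {
           spinlock_t *ptl;
           pte_t *pte;

           pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
           if (!pte)
                   return; /* PTE table removed, e.g. by MADV_COLLAPSE */

           /* PTE lock held and PMD entry revalidated - it is safe to
            * read and modify the entry here. */

           pte_unmap_unlock(pte, ptl);
   }
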
Atomicity
^^^^^^^^^

Regardless of page table locks, the MMU hardware concurrently updates accessed
and dirty bits (perhaps more, depending on architecture). Additionally, page
table traversal operations can run in parallel (though holding the VMA stable),
and functionality like GUP-fast locklessly traverses (that is, reads) page
tables without even keeping the VMA stable at all.

When performing a page table traversal and keeping the VMA stable, whether a
read must be performed once and only once or not depends on the architecture
(for instance x86-64 does not require any special precautions).

If a write is being performed, or if a read informs whether a write takes place
(on an installation of a page table entry say, for instance in
:c:func:`!__pud_install`), special care must always be taken. In these cases we
can never assume that page table locks give us entirely exclusive access, and
must retrieve page table entries once and only once.

If we are reading page table entries, then we need only ensure that the compiler
does not rearrange our loads. This is achieved via :c:func:`!pXXp_get`
functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`, :c:func:`!pudp_get`,
:c:func:`!pmdp_get`, and :c:func:`!ptep_get`.

Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads
the page table entry only once.

However, if we wish to manipulate an existing page table entry and care about
the previously stored data, we must go further and use a hardware atomic
operation as, for example, in :c:func:`!ptep_get_and_clear`.

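To illustrate the difference (a minimal sketch; the surrounding function is
hypothetical and assumes the PTE lock is held as described above, while the
accessors are the real APIs):

.. code-block:: c

   #include <linux/mm.h>

   /* Hypothetical example contrasting a plain atomic read with an atomic
    * read-modify-write of a PTE (PTE lock assumed held). */
   static void pte_atomicity_example(struct mm_struct *mm,
                                     unsigned long addr, pte_t *pte)
   {
           /* Read once - ptep_get() uses READ_ONCE() internally. Fine
            * for checks whose result does not inform a write. */
           pte_t entry = ptep_get(pte);

           if (pte_none(entry))
                   return;

           /* Read *and* clear in a single hardware-atomic step, so
            * concurrent hardware accessed/dirty bit updates are not
            * lost. */
           entry = ptep_get_and_clear(mm, addr, pte);
   }
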
Equally, operations that do not rely on the VMA being held stable, such as
GUP-fast (see :c:func:`!gup_fast` and its various page table level handlers like
:c:func:`!gup_fast_pte_range`), must very carefully interact with page table
entries, using functions such as :c:func:`!ptep_get_lockless` and equivalent for
higher level page table levels.

Writes to page table entries must also be appropriately atomic, as established
by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`,
:c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`.

Equally functions which clear page table entries must be appropriately atomic,
as in :c:func:`!pXX_clear` functions - :c:func:`!pgd_clear`,
:c:func:`!p4d_clear`, :c:func:`!pud_clear`, :c:func:`!pmd_clear`, and
:c:func:`!pte_clear`.

Page table installation
^^^^^^^^^^^^^^^^^^^^^^^

Page table installation is performed with the VMA held stable explicitly by an
mmap or VMA lock in read or write mode (see the warning in the locking rules
section for details as to why).

When allocating a P4D, PUD or PMD and setting the relevant entry in the above
PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
:c:func:`!__pmd_alloc` respectively.

.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
          :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
          references the :c:member:`!mm->page_table_lock`.

Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, a lock embedded in the PMD
physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
:c:func:`!__pte_alloc`.

Finally, modifying the contents of the PTE requires special treatment, as the
PTE page table lock must be acquired whenever we want stable and exclusive
access to entries contained within a PTE, especially when we wish to modify
them.

This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
ensure that the PTE hasn't changed from under us, ultimately invoking
:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
must be released via :c:func:`!pte_unmap_unlock`.

.. note:: There are some variants on this, such as
          :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
          for brevity we do not explore this. See the comment for
          :c:func:`!__pte_offset_map_lock` for more details.

When modifying data in ranges we typically only wish to allocate higher page
tables as necessary, using these locks to avoid races or overwriting anything,
and set/clear data at the PTE level as required (for instance when page faulting
or zapping).

A typical pattern taken when traversing page table entries to install a new
mapping is to optimistically determine whether the page table entry in the table
above is empty, if so, only then acquiring the page table lock and checking
again to see if it was allocated underneath us.

This allows for a traversal with page table locks only being taken when
required. An example of this is :c:func:`!__pud_alloc`.

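In outline, that optimistic pattern has roughly the following shape (a
simplified conceptual sketch of :c:func:`!__pud_alloc`-style logic, omitting
details such as memory barriers and accounting - not the actual
implementation):

.. code-block:: c

   #include <linux/mm.h>
   #include <asm/pgalloc.h>

   /* Conceptual sketch: check without the lock, allocate, then re-check
    * under mm->page_table_lock before installing. */
   static int pud_install_sketch(struct mm_struct *mm, p4d_t *p4d,
                                 unsigned long addr)
   {
           pud_t *new = pud_alloc_one(mm, addr);

           if (!new)
                   return -ENOMEM;

           spin_lock(&mm->page_table_lock);
           if (p4d_none(p4dp_get(p4d)))
                   p4d_populate(mm, p4d, new); /* still empty: install */
           else
                   pud_free(mm, new); /* allocated underneath us: discard */
           spin_unlock(&mm->page_table_lock);
           return 0;
   }
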
At the leaf page table, that is the PTE, we can't entirely rely on this pattern
as we have separate PMD and PTE locks and a THP collapse for instance might have
eliminated the PMD entry as well as the PTE from under us.

This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry
for the PTE, carefully checking it is as expected, before acquiring the
PTE-specific lock, and then *again* checking that the PMD entry is as expected.

If a THP collapse (or similar) were to occur then the lock on both pages would
be acquired, so we can ensure this is prevented while the PTE lock is held.

Installing entries this way ensures mutual exclusion on write.

Page table freeing
^^^^^^^^^^^^^^^^^^

Tearing down page tables themselves is something that requires significant
care. There must be no way that page tables designated for removal can be
traversed or referenced by concurrent tasks.

It is insufficient to simply hold an mmap write lock and VMA lock (which will
prevent racing faults, and rmap operations), as a file-backed mapping can be
truncated under the :c:struct:`!struct address_space->i_mmap_rwsem` alone.

As a result, no VMA which can be accessed via the reverse mapping (either
through the :c:struct:`!struct anon_vma->rb_root` or the :c:member:`!struct
address_space->i_mmap` interval trees) can have its page tables torn down.

The operation is typically performed via :c:func:`!free_pgtables`, which assumes
either the mmap write lock has been taken (as specified by its
:c:member:`!mm_wr_locked` parameter), or that the VMA is already unreachable.

It carefully removes the VMA from all reverse mappings, however it's important
that no new ones overlap these or any route remain to permit access to addresses
within the range whose page tables are being torn down.

Additionally, it assumes that a zap has already been performed and steps have
been taken to ensure that no further page table entries can be installed between
the zap and the invocation of :c:func:`!free_pgtables`.

Since it is assumed that all such steps have been taken, page table entries are
cleared without page table locks (in the :c:func:`!pgd_clear`, :c:func:`!p4d_clear`,
:c:func:`!pud_clear`, and :c:func:`!pmd_clear` functions).

.. note:: It is possible for leaf page tables to be torn down independent of
          the page tables above it as is done by
          :c:func:`!retract_page_tables`, which is performed under the i_mmap
          read lock, PMD, and PTE page table locks, without this level of care.

Page table moving
^^^^^^^^^^^^^^^^^

Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
page tables). Most notable of these is :c:func:`!mremap`, which is capable of
moving higher level page tables.

In these instances, it is required that **all** locks are taken, that is
the mmap lock, the VMA lock and the relevant rmap locks.

You can observe this in the :c:func:`!mremap` implementation in the functions
:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.

VMA lock internals
------------------

Overview
^^^^^^^^

VMA read locking is entirely optimistic - if the lock is contended or a competing
write has started, then we do not obtain a read lock.

A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu`, which first
calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
before releasing the RCU lock via :c:func:`!rcu_read_unlock`.

In cases when the user already holds mmap read lock, :c:func:`!vma_start_read_locked`
and :c:func:`!vma_start_read_locked_nested` can be used. These functions do not
fail due to lock contention but the caller should still check their return values
in case they fail for other reasons.

VMA read locks increment :c:member:`!vma.vm_refcnt` reference counter for their
duration and the caller of :c:func:`!lock_vma_under_rcu` must drop it via
:c:func:`!vma_end_read`.

VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
VMA is about to be modified, unlike :c:func:`!vma_start_read` the lock is always
acquired. An mmap write lock **must** be held for the duration of the VMA write
lock, releasing or downgrading the mmap write lock also releases the VMA write
lock so there is no :c:func:`!vma_end_write` function.

Note that when write-locking a VMA lock, the :c:member:`!vma.vm_refcnt` is temporarily
modified so that readers can detect the presence of a writer. The reference counter is
restored once the vma sequence number used for serialisation is updated.

This ensures the semantics we require - VMA write locks provide exclusive write
access to the VMA.

Implementation details
^^^^^^^^^^^^^^^^^^^^^^

The VMA lock mechanism is designed to be a lightweight means of avoiding the use
of the heavily contended mmap lock. It is implemented using a combination of a
reference counter and sequence numbers belonging to the containing
:c:struct:`!struct mm_struct` and the VMA.

Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
operation, i.e. it tries to acquire a read lock but returns false if it is
unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
called to release the VMA read lock.

Invoking :c:func:`!vma_start_read` requires that :c:func:`!rcu_read_lock` has
been called first, establishing that we are in an RCU critical section upon VMA
read lock acquisition. Once acquired, the RCU lock can be released as it is only
required for lookup. This is abstracted by :c:func:`!lock_vma_under_rcu` which
is the interface a user should use.

Writing requires the mmap to be write-locked and the VMA lock to be acquired via
:c:func:`!vma_start_write`, however the write lock is released by the termination or
downgrade of the mmap write lock so no :c:func:`!vma_end_write` is required.

All this is achieved by the use of per-mm and per-VMA sequence counts, which are
used in order to reduce complexity, especially for operations which write-lock
multiple VMAs at once.

If the mm sequence count, :c:member:`!mm->mm_lock_seq` is equal to the VMA
sequence count :c:member:`!vma->vm_lock_seq` then the VMA is write-locked. If
they differ, then it is not.

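Conceptually, the write-lock check therefore reduces to a sequence number
comparison along these lines (an illustrative sketch only - the real kernel
uses dedicated accessors such as :c:func:`!__is_vma_write_locked` which handle
memory ordering, and the exact field types vary across kernel versions):

.. code-block:: c

   /* Illustrative sketch: a VMA is write-locked if and only if its
    * sequence number matches that of its mm. */
   static bool vma_write_locked_sketch(struct vm_area_struct *vma)
   {
           return vma->vm_lock_seq == vma->vm_mm->mm_lock_seq.sequence;
   }
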
Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked which
also increments :c:member:`!mm->mm_lock_seq` via
:c:func:`!mm_lock_seqcount_end`.

This way, we ensure that, regardless of the VMA's sequence number, a write lock
is never incorrectly indicated and that when we release an mmap write lock we
efficiently release **all** VMA write locks contained within the mmap at the
same time.

Since the mmap write lock is exclusive against others who hold it, the automatic
release of any VMA locks on its release makes sense, as you would never want to
keep VMAs locked across entirely separate write operations. It also maintains
correct lock ordering.

Each time a VMA read lock is acquired, we increment :c:member:`!vma.vm_refcnt`
reference counter and check that the sequence count of the VMA does not match
that of the mm.

If it does, the read lock fails and :c:member:`!vma.vm_refcnt` is dropped.
If it does not, we keep the reference counter raised, excluding writers, but
permitting other readers, who can also obtain this lock under RCU.

Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
are also RCU safe, so the whole read lock operation is guaranteed to function
correctly.

On the write side, we set a bit in :c:member:`!vma.vm_refcnt` which can't be
modified by readers and wait for all readers to drop their reference count.
Once there are no readers, the VMA's sequence number is set to match that of
the mm. During this entire operation mmap write lock is held.

This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
until these are finished and mutual exclusion is achieved.

After setting the VMA's sequence number, the bit in :c:member:`!vma.vm_refcnt`
indicating a writer is cleared. From this point on, VMA's sequence number will
indicate VMA's write-locked state until mmap write lock is dropped or downgraded.

This clever combination of a reference counter and sequence count allows for
fast RCU-based per-VMA lock acquisition (especially on page fault, though
utilised elsewhere) with minimal complexity around lock ordering.

mmap write lock downgrading
---------------------------

When an mmap write lock is held one has exclusive access to resources within the
mmap (with the usual caveats about requiring VMA write locks to avoid races with
tasks holding VMA read locks).

It is then possible to **downgrade** from a write lock to a read lock via
:c:func:`!mmap_write_downgrade` which, similar to :c:func:`!mmap_write_unlock`,
implicitly terminates all VMA write locks via :c:func:`!vma_end_write_all`, but
importantly does not relinquish the mmap lock while downgrading, therefore
keeping the locked virtual address space stable.

An interesting consequence of this is that downgraded locks are exclusive
against any other task possessing a downgraded lock (since a racing task would
have to acquire a write lock first to downgrade it, and the downgraded lock
prevents a new write lock from being obtained until the original lock is
released).

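As an illustration of usage (a minimal sketch; the work performed at each stage
is hypothetical, while the locking calls are the real APIs):

.. code-block:: c

   #include <linux/mmap_lock.h>

   /* Hypothetical example: mutate the address space, then downgrade to a
    * read lock for a longer read-mostly phase, reducing writer hold
    * time. */
   static void downgrade_example(struct mm_struct *mm)
   {
           mmap_write_lock(mm);
           /* ... modify VMAs here under exclusive access ... */

           mmap_write_downgrade(mm); /* also ends all VMA write locks */
           /* ... read-only work - other readers may run concurrently,
            * but no new writer (or downgraded writer) can enter ... */

           mmap_read_unlock(mm); /* the downgraded lock is a read lock */
   }
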
For clarity, we map read (R)/downgraded write (D)/write (W) locks against one
another showing which locks exclude the others:

.. list-table:: Lock exclusivity
   :widths: 5 5 5 5
   :header-rows: 1
   :stub-columns: 1

   * -
     - R
     - D
     - W
   * - R
     - N
     - N
     - Y
   * - D
     - N
     - Y
     - Y
   * - W
     - Y
     - Y
     - Y

Here a Y indicates the locks in the matching row/column are mutually exclusive,
and N indicates that they are not.

Stack expansion
---------------

Stack expansion throws up additional complexities in that we cannot permit there
to be racing page faults, as a result we invoke :c:func:`!vma_start_write` to
prevent this in :c:func:`!expand_downwards` or :c:func:`!expand_upwards`.