License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner, Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and where references to a
license had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied
to a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file-by-file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to the file. She confirmed any determination that was
not immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source.
- The file already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, the file was
considered to have no license information in it, and the top-level
COPYING file license was applied.
For non */uapi/* files that summary was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0                                               11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that were:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL-family license was found in the file, or if it had no licensing
in it (per the prior point). Results summary:
SPDX license identifier                              # files
----------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note                          270
GPL-2.0+ WITH Linux-syscall-note                         169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)       21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)       17
LGPL-2.1+ WITH Linux-syscall-note                         15
GPL-1.0+ WITH Linux-syscall-note                          14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)       5
LGPL-2.0+ WITH Linux-syscall-note                          4
LGPL-2.1 WITH Linux-syscall-note                           3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)                1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based in part on an older version of FOSSology, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to
have copy/paste license identifier errors, and have been fixed to
reflect the correct identifier.
Additionally, Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 files patched in the initial patch
version from earlier this week, with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types). Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
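For reference, the tag forms the script emits differ only in comment
syntax; the '//' form below is the one visible at the top of this file,
and the others are standard kernel practice for headers and build files:

    // SPDX-License-Identifier: GPL-2.0        (C source files)
    /* SPDX-License-Identifier: GPL-2.0 */     (headers and most other C-like files)
    # SPDX-License-Identifier: GPL-2.0         (Makefiles, Kconfig, scripts)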
// SPDX-License-Identifier: GPL-2.0
/*
 * mm/mremap.c
 *
 * (C) Copyright 1996 Linus Torvalds
 *
 * Address space accounting code <alan@lxorguk.ukuu.org.uk>
 * (C) Copyright 2002 Red Hat Inc, All Rights Reserved
 */

#include <linux/mm.h>
#include <linux/mm_inline.h>
#include <linux/hugetlb.h>
#include <linux/shm.h>
#include <linux/ksm.h>
#include <linux/mman.h>
#include <linux/swap.h>
#include <linux/capability.h>
#include <linux/fs.h>
#include <linux/swapops.h>
#include <linux/highmem.h>
#include <linux/security.h>
#include <linux/syscalls.h>
mmu-notifiers: core
With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages.
There are secondary MMUs (with secondary sptes and secondary tlbs) too.
sptes in the kvm case are shadow pagetables, but when I say spte in
mmu-notifier context, I mean "secondary pte". In GRU case there's no
actual secondary pte and there's only a secondary tlb because the GRU
secondary MMU has no knowledge about sptes and every secondary tlb miss
event in the MMU always generates a page fault that has to be resolved by
the CPU (this is not the case of KVM, where a secondary tlb miss will
walk sptes in hardware and it will refill the secondary tlb transparently
to software if the corresponding spte is present). The same way
zap_page_range has to invalidate the pte before freeing the page, the spte
(and secondary tlb) must also be invalidated before any page is freed and
reused.
Currently we take a page_count pin on every page mapped by sptes, but that
means the pages can't be swapped whenever they're mapped by any spte
because they're part of the guest working set. Furthermore a spte unmap
event can immediately lead to a page being freed when the pin is released
(so requiring the same complex and relatively slow tlb_gather smp safe
logic we have in zap_page_range and that can be avoided completely if the
spte unmap event doesn't require an unpin of the page previously mapped in
the secondary MMU).
The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary MMU so
that the secondary MMU code can drop sptes before the pages are freed,
avoiding all page pinning and allowing 100% reliable swapping of guest
physical address space. Furthermore it avoids requiring the code that
tears down the mappings of the secondary MMU to implement logic like
tlb_gather in zap_page_range, which would require many IPIs to flush
other cpu tlbs for each fixed number of sptes unmapped.
To give an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect), the secondary MMU mappings will
be invalidated, and the next secondary-mmu page fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establish an updated
spte or secondary-tlb mapping on the copied page. Or it will set up a
readonly spte or readonly tlb mapping if it's a guest read, i.e. if it
calls get_user_pages with write=0. This is just an example.
This allows mapping any page pointed to by any pte (and in turn visible
in the primary CPU MMU) into a secondary MMU (be it a pure tlb like GRU,
or a full MMU with both sptes and a secondary tlb like the
shadow-pagetable layer of kvm), or into a remote DMA in software like
XPMEM (hence the need to schedule in XPMEM code to send the invalidate
to the remote node, while there is no need to schedule in kvm/gru as
it's an immediate event, like invalidating a primary-mmu pte).
At least for KVM without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.
Dependencies:
1) mm_take_all_locks() to register the mmu notifier when the whole VM
isn't doing anything with "mm". This allows mmu notifier users to keep
track of whether the VM is in the middle of the invalidate_range_begin/end
critical section with an atomic counter, increased in range_begin and
decreased in range_end (see the sketch after this list). No secondary MMU
page fault is allowed to map any spte or secondary tlb reference while
the VM is in the middle of range_begin/end, as any page returned by
get_user_pages in that critical section could later immediately be freed
without any further ->invalidate_page notification
(invalidate_range_begin/end works on ranges and ->invalidate_page isn't
called immediately before freeing the page). To stop all page freeing and
pagetable overwrites, the mmap_sem must be taken in write mode and all
other anon_vma/i_mmap locks must be taken too.
2) It'd be a waste to add branches in the VM if nobody could possibly
run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled
if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
mmu notifiers, but this already allows compiling a KVM external module
against a kernel with mmu notifiers enabled and from the next pull from
kvm.git we'll start using them. And GRU/XPMEM will also be able to
continue the development by enabling KVM=m in their config, until they
submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
are all =n.
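As a rough illustration of the counter pattern described in dependency 1
above, a secondary-MMU driver might do something like the following. This
is a minimal sketch assuming the mmu_notifier callback signatures
introduced by this patch; my_range_depth, my_range_start/end and
my_fault_in are illustrative names, not taken from any in-tree driver,
and a real driver would recheck and retry under its own lock rather than
spin:

    static atomic_t my_range_depth = ATOMIC_INIT(0);

    static void my_range_start(struct mmu_notifier *mn, struct mm_struct *mm,
                               unsigned long start, unsigned long end)
    {
            /* Entering the invalidate_range_begin/end critical section:
             * drop any sptes/secondary-tlb entries covering [start, end). */
            atomic_inc(&my_range_depth);
    }

    static void my_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
                             unsigned long start, unsigned long end)
    {
            atomic_dec(&my_range_depth);
    }

    static const struct mmu_notifier_ops my_mmu_ops = {
            .invalidate_range_start = my_range_start,
            .invalidate_range_end   = my_range_end,
    };

    /* Secondary MMU fault path: no spte/secondary-tlb entry may be
     * established while an invalidate is in flight, since pages returned
     * by get_user_pages in that window could be freed immediately. */
    static void my_fault_in(unsigned long address)
    {
            while (atomic_read(&my_range_depth))
                    cpu_relax();
            /* ... get_user_pages() and install the spte/tlb entry ... */
    }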
The mmu_notifier_register call can fail because mm_take_all_locks may be
interrupted by a signal and return -EINTR. Because mmu_notifier_register
is used when a driver starts up, a failure can be gracefully handled.
Here is an example of the change applied to kvm to register the mmu
notifiers. Usually when a driver starts up, other allocations are
required anyway and -ENOMEM failure paths exist already.
    struct kvm *kvm_arch_create_vm(void)
    {
            struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
    +       int err;

            if (!kvm)
                    return ERR_PTR(-ENOMEM);

            INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

    +       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
    +       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
    +       if (err) {
    +               kfree(kvm);
    +               return ERR_PTR(err);
    +       }
    +
            return kvm;
    }
mmu_notifier_unregister returns void and it's reliable.
The patch also adds a few needed but missing includes that would prevent
the kernel from compiling after these changes on non-x86 archs (x86
didn't need them by luck).
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix mm/filemap_xip.c build]
[akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Chris Wright <chrisw@redhat.com>
Cc: Marcelo Tosatti <marcelo@kvack.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Izik Eidus <izike@qumranet.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
#include <linux/mmu_notifier.h>
#include <linux/uaccess.h>
#include <linux/userfaultfd_k.h>
#include <linux/mempolicy.h>

#include <asm/cacheflush.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>

#include "internal.h"
mm/mremap: introduce and use vma_remap_struct threaded state
A number of functions in the mremap() path both pass around and modify a
large number of parameters, making the code less readable and often
requiring things such as the VMA, size delta, and more to be determined
repeatedly.
Avoid this by using the common pattern of passing a state object through
the operation, updating it as we go. We introduce the vma_remap_struct or
'VRM' for this purpose.
This also gives us the ability to accumulate further state through the
operation that would otherwise require awkward and error-prone pointer
passing.
We can also now trivially define helper functions that operate on a VRM
object.
This pattern has proven itself to be very powerful when implemented for
VMA merge, VMA unmapping and memory mapping operations, so it is
battle-tested and functional.
We both introduce the data structure and use it, introducing helper
functions as needed to make things readable. We move some state such as
the mmap lock and mlock() status to the VRM, introduce a means of
classifying the type of mremap() operation, and de-duplicate the
get_unmapped_area() lookup.
We also neatly thread userfaultfd state throughout the operation.
Note that there is further refactoring to be done, chiefly adjusting
move_vma() to accept a VRM parameter. We defer this as there is
prerequisite work required to be able to do so, which we will do in a
subsequent patch.
Link: https://lkml.kernel.org/r/27951739dc83b2b1523b81fa9c009ba348388d40.1741639347.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
/* Classify the kind of remap operation being performed. */
enum mremap_type {
        MREMAP_INVALID,         /* Initial state. */
        MREMAP_NO_RESIZE,       /* old_len == new_len, if not moved, do nothing. */
        MREMAP_SHRINK,          /* old_len > new_len. */
        MREMAP_EXPAND,          /* old_len < new_len. */
};

/*
 * Describes a VMA mremap() operation and is threaded throughout it.
 *
 * Any of the fields may be mutated by the operation, however these values will
 * always accurately reflect the remap (for instance, we may adjust lengths and
 * delta to account for hugetlb alignment).
 */
struct vma_remap_struct {
        /* User-provided state. */
        unsigned long addr;     /* User-specified address from which we remap. */
        unsigned long old_len;  /* Length of range being remapped. */
        unsigned long new_len;  /* Desired new length of mapping. */
mm/mremap: perform some simple cleanups
Patch series "mm/mremap: permit mremap() move of multiple VMAs", v4.
Historically we've made it a uAPI requirement that mremap() may only
operate on a single VMA at a time.
For instances where VMAs need to be resized, this makes sense, as it
becomes very difficult to determine what a user actually wants should they
indicate a desire to expand or shrink the size of multiple VMAs (truncate?
Adjust sizes individually? Some other strategy?).
However, in instances where a user is moving VMAs, it is restrictive to
disallow this.
This is especially the case when an anonymous mapping remap may or may not
be mergeable depending on whether VMAs have or have not been faulted due
to anon_vma assignment and folio index alignment with vma->vm_pgoff.
Often this can result in surprising impact where a moved region is faulted,
then moved back and a user fails to observe a merge from otherwise
compatible, adjacent VMAs.
This change allows such cases to work without the user having to be
cognizant of whether a prior mremap() move or other VMA operations has
resulted in VMA fragmentation.
In order to do this, this series performs a large amount of refactoring,
most pertinently grouping sanity checks together, separating those that
check input parameters from those relating to VMAs.
We also simplify the post-mmap-lock-drop processing for uffd and
mlock()'d VMAs.
With this done, we can then fairly straightforwardly implement this
functionality.
This works exclusively for mremap() invocations which specify
MREMAP_FIXED. It is not compatible with VMAs which use userfaultfd, as the
notification of the userland fault handler would require us to drop the
mmap lock.
It is also not compatible with file-backed mappings with customised
get_unmapped_area() handlers as these may not honour MREMAP_FIXED.
The input and output addresses ranges must not overlap. We carefully
account for moves which would result in VMA iterator invalidation.
While there can be gaps between VMAs in the input range, there can be no
gap before the first VMA in the range.
This patch (of 10):
We const-ify the vrm flags parameter to indicate this will never change.
We rename resize_is_valid() to remap_is_valid(), as this function does not
only apply to cases where we resize, so it's simply confusing to refer to
that here.
We remove the BUG() from mremap_at(), as we should not BUG() unless we are
certain it'll result in system instability.
We rename vrm_charge() to vrm_calc_charge() to make it clear this simply
calculates the charged number of pages rather than actually adjusting any
state.
We update the comment for vrm_implies_new_addr() to explain that
MREMAP_DONTUNMAP does not require a set address, but will always be moved.
Additionally consistently use 'res' rather than 'ret' for result values.
No functional change intended.
Link: https://lkml.kernel.org/r/cover.1752770784.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/d35ad8ce6b2c33b2f2f4ef7ec415f04a35cba34f.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
        const unsigned long flags;      /* user-specified MREMAP_* flags. */
        unsigned long new_addr;         /* Optionally, desired new address. */

        /* uffd state. */
        struct vm_userfaultfd_ctx *uf;
        struct list_head *uf_unmap_early;
        struct list_head *uf_unmap;

        /* VMA state, determined in do_mremap(). */
        struct vm_area_struct *vma;

        /* Internal state, determined in do_mremap(). */
        unsigned long delta;            /* Absolute delta of old_len,new_len. */
        bool populate_expand;           /* mlock()'d expanded, must populate. */
        enum mremap_type remap_type;    /* expand, shrink, etc. */
        bool mmap_locked;               /* Is mm currently write-locked? */
        unsigned long charged;          /* If VM_ACCOUNT, # pages to account. */
mm/mremap: permit mremap() move of multiple VMAs
Historically we've made it a uAPI requirement that mremap() may only
operate on a single VMA at a time.
For instances where VMAs need to be resized, this makes sense, as it
becomes very difficult to determine what a user actually wants should they
indicate a desire to expand or shrink the size of multiple VMAs (truncate?
Adjust sizes individually? Some other strategy?).
However, in instances where a user is moving VMAs, it is restrictive to
disallow this.
This is especially the case when an anonymous mapping remap may or may not
be mergeable depending on whether VMAs have or have not been faulted due
to anon_vma assignment and folio index alignment with vma->vm_pgoff.
Often this can result in surprising impact where a moved region is
faulted, then moved back and a user fails to observe a merge from
otherwise compatible, adjacent VMAs.
This change allows such cases to work without the user having to be
cognizant of whether a prior mremap() move or other VMA operations has
resulted in VMA fragmentation.
We only permit this for mremap() operations that do NOT change the size of
the VMA and DO specify MREMAP_MAYMOVE | MREMAP_FIXED.
Should no VMA exist in the range, -EFAULT is returned as usual.
If a VMA move spans a single VMA - then there is no functional change.
Otherwise, we place additional requirements upon VMAs:
* They must not have a userfaultfd context associated with them - this
requires dropping the lock to notify users, and we want to perform the
operation with the mmap write lock held throughout.
* If file-backed, they cannot have a custom get_unmapped_area handler -
this might result in MREMAP_FIXED not being honoured, which could result
in unexpected positioning of VMAs in the moved region.
There may be gaps in the range of VMAs that are moved:
          X      Y                          X      Y
        <--->  <->                        <--->  <->
|-------|   |-----|  |-----|      |-------|   |-----|  |-----|
|   A   |   |  B  |  |  C  | ---> |   A'  |   |  B' |  |  C' |
|-------|   |-----|  |-----|      |-------|   |-----|  |-----|
addr                              new_addr
The move will preserve the gaps between each VMA.
Note that any failures encountered will result in a partial move. Since
an mremap() can fail at any time, this might result in only some of the
VMAs being moved.
Note that failures are very rare and typically require an out-of-memory
condition or a mapping limit to be hit, assuming the VMAs being moved are
valid.
We don't try to assess ahead of time whether VMAs are valid according to
the multi VMA rules, as it would be rather unusual for a user to mix
uffd-enabled VMAs and/or VMAs which map unusual driver mappings that
specify custom get_unmapped_area() handlers in an aggregate operation.
So we optimise for the far, far more likely case of the operation being
entirely permissible.
In the case of the move of a single VMA, the above conditions are
permitted. This makes the behaviour identical for a single VMA as before.
Link: https://lkml.kernel.org/r/8cab2f2c202c4208bdfdb562635748bea6eb37bf.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
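To make the newly permitted uAPI behaviour concrete, here is a
hypothetical userspace sketch (illustrative only; it assumes two adjacent
anonymous VMAs created by splitting a mapping with mprotect(), and most
error handling is abbreviated):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            long page = sysconf(_SC_PAGESIZE);

            /* One mapping, split into two VMAs by giving the tail
             * different protection. */
            char *src = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            memset(src, 0xaa, page);
            mprotect(src + page, page, PROT_READ);

            /* Reserve a non-overlapping destination, then move both VMAs
             * in a single MREMAP_MAYMOVE | MREMAP_FIXED call (no resize). */
            char *dst = mmap(NULL, 2 * page, PROT_NONE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            void *moved = mremap(src, 2 * page, 2 * page,
                                 MREMAP_MAYMOVE | MREMAP_FIXED, dst);
            if (moved == MAP_FAILED) {
                    /* Kernels without this change reject moves spanning
                     * more than one VMA. */
                    perror("mremap");
                    return 1;
            }
            printf("moved both VMAs to %p\n", moved);
            return 0;
    }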
        bool vmi_needs_invalidate;      /* Is the VMA iterator invalidated? */
};

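As an aside, the "helper functions that operate on a VRM object"
mentioned in the commit message above can be as simple as the following
sketch (illustrative only; vrm_abs_delta is a hypothetical name, not the
kernel's actual helper):

        /* Illustrative only: absolute length delta of the remap described by vrm. */
        static unsigned long vrm_abs_delta(const struct vma_remap_struct *vrm)
        {
                return vrm->old_len > vrm->new_len ?
                        vrm->old_len - vrm->new_len :
                        vrm->new_len - vrm->old_len;
        }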
static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
{
        pgd_t *pgd;
        p4d_t *p4d;
        pud_t *pud;

        pgd = pgd_offset(mm, addr);
        if (pgd_none_or_clear_bad(pgd))
                return NULL;

        p4d = p4d_offset(pgd, addr);
        if (p4d_none_or_clear_bad(p4d))
                return NULL;

        pud = pud_offset(p4d, addr);
        if (pud_none_or_clear_bad(pud))
                return NULL;

        return pud;
}

static pmd_t *get_old_pmd(struct mm_struct *mm, unsigned long addr)
{
        pud_t *pud;
        pmd_t *pmd;

        pud = get_old_pud(mm, addr);
        if (!pud)
                return NULL;

        pmd = pmd_offset(pud, addr);
thp: mremap support and TLB optimization
This adds THP support to mremap (decreases the number of split_huge_page()
calls).
Here are also some benchmarks with a proggy like this:
===
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#define SIZE (5UL*1024*1024*1024)
int main()
{
        static struct timeval oldstamp, newstamp;
        long diffsec;
        char *p, *p2, *p3, *p4;

        if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
                perror("memalign"), exit(1);
        if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
                perror("memalign"), exit(1);
        if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
                perror("memalign"), exit(1);

        memset(p, 0xff, SIZE);
        memset(p2, 0xff, SIZE);
        memset(p3, 0x77, 4096);

        gettimeofday(&oldstamp, NULL);
        p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
        gettimeofday(&newstamp, NULL);

        diffsec = newstamp.tv_sec - oldstamp.tv_sec;
        diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
        printf("usec %ld\n", diffsec);

        if (p == MAP_FAILED || p4 != p3)
        //if (p == MAP_FAILED)
                perror("mremap"), exit(1);
        if (memcmp(p4, p2, SIZE))
                printf("mremap bug\n"), exit(1);
        printf("ok\n");
        return 0;
}
===
THP on
Performance counter stats for './largepage13' (3 runs):
69195836 dTLB-loads ( +- 3.546% ) (scaled from 50.30%)
60708 dTLB-load-misses ( +- 11.776% ) (scaled from 52.62%)
676266476 dTLB-stores ( +- 5.654% ) (scaled from 69.54%)
29856 dTLB-store-misses ( +- 4.081% ) (scaled from 89.22%)
1055848782 iTLB-loads ( +- 4.526% ) (scaled from 80.18%)
8689 iTLB-load-misses ( +- 2.987% ) (scaled from 58.20%)
7.314454164 seconds time elapsed ( +- 0.023% )
THP off
Performance counter stats for './largepage13' (3 runs):
1967379311 dTLB-loads ( +- 0.506% ) (scaled from 60.59%)
9238687 dTLB-load-misses ( +- 22.547% ) (scaled from 61.87%)
2014239444 dTLB-stores ( +- 0.692% ) (scaled from 60.40%)
3312335 dTLB-store-misses ( +- 7.304% ) (scaled from 67.60%)
6764372065 iTLB-loads ( +- 0.925% ) (scaled from 79.00%)
8202 iTLB-load-misses ( +- 0.475% ) (scaled from 70.55%)
9.693655243 seconds time elapsed ( +- 0.069% )
grep thp /proc/vmstat
thp_fault_alloc 35849
thp_fault_fallback 0
thp_collapse_alloc 3
thp_collapse_alloc_failed 0
thp_split 0
thp_split 0 confirms no thp split despite plenty of hugepages allocated.
The measurement of only the mremap time (so excluding the 3 long
memset and final long 10GB memory accessing memcmp):
THP on
usec 14824
usec 14862
usec 14859
THP off
usec 256416
usec 255981
usec 255847
With an older kernel without the mremap optimizations (the below patch
optimizes the non THP version too).
THP on
usec 392107
usec 390237
usec 404124
THP off
usec 444294
usec 445237
usec 445820
I guess with a threaded program that sends more IPIs on a large SMP
system it'd create an even larger difference.
All debug options are off except DEBUG_VM to avoid skewing the
results.
The only problem with native 2M mremap, as happens above, is that both
the source and destination addresses must be 2M aligned or the hugepmd
can't be moved without a split, but that is a hardware limitation.
[akpm@linux-foundation.org: coding-style nitpicking]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
        if (pmd_none(*pmd))
                return NULL;

        return pmd;
}

static pud_t *alloc_new_pud(struct mm_struct *mm, unsigned long addr)
{
        pgd_t *pgd;
        p4d_t *p4d;

        pgd = pgd_offset(mm, addr);
        p4d = p4d_alloc(mm, pgd, addr);
        if (!p4d)
                return NULL;

        return pud_alloc(mm, p4d, addr);
}

static pmd_t *alloc_new_pmd(struct mm_struct *mm, unsigned long addr)
{
        pud_t *pud;
        pmd_t *pmd;

        pud = alloc_new_pud(mm, addr);
        if (!pud)
                return NULL;

        pmd = pmd_alloc(mm, pud, addr);
        if (!pmd)
                return NULL;

        VM_BUG_ON(pmd_trans_huge(*pmd));

        return pmd;
}

static void take_rmap_locks(struct vm_area_struct *vma)
{
        if (vma->vm_file)
                i_mmap_lock_write(vma->vm_file->f_mapping);
        if (vma->anon_vma)
                anon_vma_lock_write(vma->anon_vma);
}

static void drop_rmap_locks(struct vm_area_struct *vma)
{
        if (vma->anon_vma)
                anon_vma_unlock_write(vma->anon_vma);
        if (vma->vm_file)
                i_mmap_unlock_write(vma->vm_file->f_mapping);
}

static pte_t move_soft_dirty_pte(pte_t pte)
{
        /*
         * Set soft dirty bit so we can notice
         * in userspace the ptes were moved.
         */
#ifdef CONFIG_MEM_SOFT_DIRTY
        if (pte_present(pte))
                pte = pte_mksoft_dirty(pte);
        else if (is_swap_pte(pte))
                pte = pte_swp_mksoft_dirty(pte);
#endif
        return pte;
}
mm: optimize mremap() by PTE batching
Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes are
painted with the contig bit, then ptep_get() will iterate through all 16
entries to collect a/d bits. Hence this optimization will result in a 16x
reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
will eventually call contpte_try_unfold() on every contig block, thus
flushing the TLB for the complete large folio range. Instead, use
get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and
only do them on the starting and ending contig block.
For split folios, there will be no pte batching; nr_ptes will be 1. For
pagetable splitting, the ptes will still point to the same large folio;
for arm64, this results in the optimization described above, and for other
arches (including the general case), a minor improvement is expected due
to a reduction in the number of function calls.
Link: https://lkml.kernel.org/r/20250610035043.75448-3-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Bang Li <libang.li@antgroup.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: bibo mao <maobibo@loongson.cn>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
                pte_t *ptep, pte_t pte, int max_nr)
{
        struct folio *folio;

        if (max_nr == 1)
                return 1;

        folio = vm_normal_folio(vma, addr, pte);
        if (!folio || !folio_test_large(folio))
                return 1;

mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_flags()
Many users (including upcoming ones) don't really need the flags etc, and
can live with the possible overhead of a function call.
So let's provide a basic, non-inlined folio_pte_batch(), to avoid code
bloat while still providing a variant that optimizes out all flag checks
at runtime. folio_pte_batch_flags() will get inlined into
folio_pte_batch(), optimizing out any conditionals that depend on input
flags.
folio_pte_batch() will behave like folio_pte_batch_flags() when no flags
are specified. It's okay to add new users of folio_pte_batch_flags(), but
using folio_pte_batch() if applicable is preferred.
So, before this change, folio_pte_batch() was inlined into the C file
optimized by propagating constants within the resulting object file.
With this change, we now also have a folio_pte_batch() that is optimized
by propagating all constants. But instead of having one instance per
object file, we have a single shared one.
In zap_present_ptes(), where we care about performance, the compiler
already seems to generate a call to a common inlined folio_pte_batch()
variant, shared with fork() code. So calling the new non-inlined variant
should not make a difference.
While at it, drop the "addr" parameter that is unused.
Link: https://lkml.kernel.org/r/20250702104926.212243-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/linux-mm/20250503182858.5a02729fcffd6d4723afcfc2@linux-foundation.org/
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
        return folio_pte_batch(folio, ptep, pte, max_nr);
}

static int move_ptes(struct pagetable_move_control *pmc,
                unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
{
        struct vm_area_struct *vma = pmc->old;
        bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
        struct mm_struct *mm = vma->vm_mm;
mm: call pointers to ptes as ptep
Patch series "Optimize mremap() for large folios", v4.
Currently move_ptes() iterates through ptes one by one. If the underlying
folio mapped by the ptes is large, we can process those ptes in a batch
using folio_pte_batch(), thus clearing and setting the PTEs in one go.
For arm64 specifically, this results in a 16x reduction in the number of
ptep_get() calls (since on a contig block, ptep_get() on arm64 will
iterate through all 16 entries to collect a/d bits), and we also elide
extra TLBIs through get_and_clear_full_ptes, replacing ptep_get_and_clear.
Mapping 1M of memory with 64K folios, memsetting it, remapping it to src +
1M, and munmapping it 10,000 times, the average execution time reduces
from 1.9 to 1.2 seconds, giving a 37% performance optimization, on Apple
M3 (arm64). No regression is observed for small folios.
Test program for reference:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>
#define SIZE (1UL << 20) // 1M
int main(void) {
        void *new_addr, *addr;

        for (int i = 0; i < 10000; ++i) {
                addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (addr == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                memset(addr, 0xAA, SIZE);
                new_addr = mremap(addr, SIZE, SIZE,
                                  MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE);
                if (new_addr != (addr + SIZE)) {
                        perror("mremap");
                        return 1;
                }
                munmap(new_addr, SIZE);
        }
}
This patch (of 2):
Avoid confusion between pte_t* and pte_t data types by suffixing pointer
type variables with p. No functional change.
Link: https://lkml.kernel.org/r/20250610035043.75448-1-dev.jain@arm.com
Link: https://lkml.kernel.org/r/20250610035043.75448-2-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Bang Li <libang.li@antgroup.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: bibo mao <maobibo@loongson.cn>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
        pte_t *old_ptep, *new_ptep;
        pte_t old_pte, pte;
        pmd_t dummy_pmdval;
[PATCH] mm: split page table lock
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
a many-threaded application which concurrently initializes different parts of
a large anonymous area.
This patch corrects that, by using a separate spinlock per page table page, to
guard the page table entries in that page, instead of using the mm's single
page_table_lock. (But even then, page_table_lock is still used to guard page
table allocation, and anon_vma allocation.)
In this implementation, the spinlock is tucked inside the struct page of the
page table page: with a BUILD_BUG_ON in case it overflows - which it would in
the case of 32-bit PA-RISC with spinlock debugging enabled.
Splitting the lock is not quite for free: another cacheline access. Ideally,
I suppose we would use split ptlock only for multi-threaded processes on
multi-cpu machines; but deciding that dynamically would have its own costs.
So for now enable it by config, at some number of cpus - since the Kconfig
language doesn't support inequalities, let preprocessor compare that with
NR_CPUS. But I don't think it's worth being user-configurable: for good
testing of both split and unsplit configs, split now at 4 cpus, and perhaps
change that to 8 later.
There is a benefit even for singly threaded processes: kswapd can be attacking
one part of the mm while another part is busy faulting.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
        spinlock_t *old_ptl, *new_ptl;
mremap: fix race between mremap() and page cleaning
Prior to 3.15, there was a race between zap_pte_range() and
page_mkclean() where writes to a page could be lost. Dave Hansen
discovered by inspection that there is a similar race between
move_ptes() and page_mkclean().
We've been able to reproduce the issue by enlarging the race window with
a msleep(), but have not been able to hit it without modifying the code.
So, we think it's a real issue, but is difficult or impossible to hit in
practice.
The zap_pte_range() issue is fixed by commit 1cf35d47712d("mm: split
'tlb_flush_mmu()' into tlb flushing and memory freeing parts"). And
this patch is to fix the race between page_mkclean() and mremap().
Here is one possible way to hit the race: suppose a process mmapped a
file with READ | WRITE and SHARED, it has two threads and they are bound
to 2 different CPUs, e.g. CPU1 and CPU2. mmap returned X, then thread
1 did a write to addr X so that CPU1 now has a writable TLB for addr X
on it. Thread 2 starts mremapping from addr X to Y while thread 1
cleaned the page and then did another write to the old addr X again.
The 2nd write from thread 1 could succeed but the value will get lost.
        thread 1                                thread 2
        (bound to CPU1)                         (bound to CPU2)

        1: write 1 to addr X to get a
           writeable TLB on this CPU
                                                2: mremap starts

                                                3: move_ptes emptied PTE for addr X
                                                   and setup new PTE for addr Y and
                                                   then dropped PTL for X and Y

                                                4: page laundering for N by doing
                                                   fadvise FADV_DONTNEED. When done,
                                                   pageframe N is deemed clean.

        5: *write 2 to addr X

        6: tlb flush for addr X
                                                7: munmap (Y, pagesize) to make the
                                                   page unmapped

                                                8: fadvise with FADV_DONTNEED again
                                                   to kick the page off the pagecache

        9: pread the page from file to verify
           the value. If 1 is there, it means
           we have lost the written 2.
*the write may or may not cause a segmentation fault; it depends on
whether the TLB entry is still on the CPU.
Please note that this is only one specific way the race could occur; it
doesn't mean that the race can only occur in exactly the above config,
e.g. more than 2 threads could be involved and fadvise() could be done
in another thread, etc.
For anonymous pages, the race is between mremap() and page reclaim: for
THP, a huge PMD is moved by mremap to a new huge PMD, then the new huge
PMD gets unmapped/split/paged out before the tlb flush happened for the
old huge PMD in move_page_tables(), and we could still write data to it.
A normal anonymous page has a similar situation.
To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and,
if any is found, do the flush before dropping the PTL. If we do the
flush for every move_ptes()/move_huge_pmd() call then we do not need to
do the flush in move_page_tables() for the whole range. But if we
didn't, we still need to do the whole range flush.
Alternatively, we can track which part of the range is flushed in
move_ptes()/move_huge_pmd() and which didn't to avoid flushing the whole
range in move_page_tables(). But that would require multiple tlb
flushes for the different sub-ranges and should be less efficient than
the single whole range flush.
KBuild test on my Sandybridge desktop doesn't show any noticeable change.
v4.9-rc4:
real 5m14.048s
user 32m19.800s
sys 4m50.320s
With this commit:
real 5m13.888s
user 32m19.330s
sys 4m51.200s
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
        bool force_flush = false;
        unsigned long old_addr = pmc->old_addr;
        unsigned long new_addr = pmc->new_addr;
        unsigned long old_end = old_addr + extent;
|
|
|
unsigned long len = old_end - old_addr;
|
mm: optimize mremap() by PTE batching
Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes are
painted with the contig bit, then ptep_get() will iterate through all 16
entries to collect a/d bits. Hence this optimization will result in a 16x
reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
will eventually call contpte_try_unfold() on every contig block, thus
flushing the TLB for the complete large folio range. Instead, use
get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and
only do them on the starting and ending contig block.
For split folios, there will be no pte batching; nr_ptes will be 1. For
pagetable splitting, the ptes will still point to the same large folio;
for arm64, this results in the optimization described above, and for other
arches (including the general case), a minor improvement is expected due
to a reduction in the number of function calls.
Link: https://lkml.kernel.org/r/20250610035043.75448-3-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Bang Li <libang.li@antgroup.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: bibo mao <maobibo@loongson.cn>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-10 09:20:43 +05:30
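Condensed from the annotated lines that follow, the batched loop introduced here looks roughly like this (a simplified sketch: uffd-wp handling and error paths are omitted; mremap_folio_pte_batch() is the local helper visible below, and a later patch wraps get_and_clear_full_ptes() as get_and_clear_ptes()):

    for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
                               new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
        nr_ptes = 1;
        max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
        old_pte = ptep_get(old_ptep);
        if (pte_none(old_pte))
            continue;

        if (pte_present(old_pte)) {
            /* Batch over consecutive PTEs mapping the same large folio. */
            nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
                                             old_pte, max_nr_ptes);
            force_flush = true;
        }
        /* Clear the whole batch at once; on arm64 this elides the
         * per-contig-block TLBIs that ptep_get_and_clear() would issue. */
        pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, /* full = */ 0);
        pte = move_pte(pte, old_addr, new_addr);
        pte = move_soft_dirty_pte(pte);
        set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
    }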
|
|
|
int max_nr_ptes;
|
|
|
|
int nr_ptes;
|
2023-06-08 18:32:47 -07:00
|
|
|
int err = 0;
|
2005-04-16 15:20:36 -07:00
|
|
|
|
2012-10-08 16:31:50 -07:00
|
|
|
/*
|
2014-12-12 16:54:24 -08:00
|
|
|
* When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
|
2012-10-08 16:31:50 -07:00
|
|
|
* locks to ensure that rmap will always observe either the old or the
|
|
|
|
* new ptes. This is the easiest way to avoid races with
|
|
|
|
* truncate_pagecache(), page migration, etc...
|
|
|
|
*
|
|
|
|
* When need_rmap_locks is false, we use other ways to avoid
|
|
|
|
* such races:
|
|
|
|
*
|
|
|
|
* - During exec() shift_arg_pages(), we use a specially tagged vma
|
2020-04-01 21:07:52 -07:00
|
|
|
* which rmap call sites look for using vma_is_temporary_stack().
|
2012-10-08 16:31:50 -07:00
|
|
|
*
|
|
|
|
* - During mremap(), new_vma is often known to be placed after vma
|
|
|
|
* in rmap traversal order. This ensures rmap will always observe
|
|
|
|
* either the old pte, or the new pte, or both (the page table locks
|
|
|
|
* serialize access to individual ptes, but only rmap traversal
|
|
|
|
* order guarantees that we won't miss both the old and new ptes).
|
|
|
|
*/
|
2025-03-10 20:50:40 +00:00
|
|
|
if (pmc->need_rmap_locks)
|
2016-05-19 17:12:57 -07:00
|
|
|
take_rmap_locks(vma);
|
2005-04-16 15:20:36 -07:00
|
|
|
|
[PATCH] mm: split page table lock
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
a many-threaded application which concurrently initializes different parts of
a large anonymous area.
This patch corrects that, by using a separate spinlock per page table page, to
guard the page table entries in that page, instead of using the mm's single
page_table_lock. (But even then, page_table_lock is still used to guard page
table allocation, and anon_vma allocation.)
In this implementation, the spinlock is tucked inside the struct page of the
page table page: with a BUILD_BUG_ON in case it overflows - which it would in
the case of 32-bit PA-RISC with spinlock debugging enabled.
Splitting the lock is not quite for free: another cacheline access. Ideally,
I suppose we would use split ptlock only for multi-threaded processes on
multi-cpu machines; but deciding that dynamically would have its own costs.
So for now enable it by config, at some number of cpus - since the Kconfig
language doesn't support inequalities, let preprocessor compare that with
NR_CPUS. But I don't think it's worth being user-configurable: for good
testing of both split and unsplit configs, split now at 4 cpus, and perhaps
change that to 8 later.
There is a benefit even for singly threaded processes: kswapd can be attacking
one part of the mm while another part is busy faulting.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-29 18:16:40 -07:00
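In rough terms, the split ptlock means the PTE lock is taken from the page-table page itself rather than from the mm. A conceptual sketch of the lock selection (the pte_lockptr_sketch() name is hypothetical and this is not the exact kernel macro definition):

    static inline spinlock_t *pte_lockptr_sketch(struct mm_struct *mm, pmd_t *pmd)
    {
    #if USE_SPLIT_PTE_PTLOCKS
        /* Per page-table-page spinlock, embedded in that page's struct page. */
        return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
    #else
        /* Small configurations keep the single per-mm lock. */
        return &mm->page_table_lock;
    #endif
    }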
|
|
|
/*
|
|
|
|
* We don't have to worry about the ordering of src and dst
|
2020-06-08 21:33:54 -07:00
|
|
|
* pte locks because exclusive mmap_lock prevents deadlock.
|
[PATCH] mm: split page table lock
2005-10-29 18:16:40 -07:00
|
|
|
*/
|
mm: call pointers to ptes as ptep
Patch series "Optimize mremap() for large folios", v4.
Currently move_ptes() iterates through ptes one by one. If the underlying
folio mapped by the ptes is large, we can process those ptes in a batch
using folio_pte_batch(), thus clearing and setting the PTEs in one go.
For arm64 specifically, this results in a 16x reduction in the number of
ptep_get() calls (since on a contig block, ptep_get() on arm64 will
iterate through all 16 entries to collect a/d bits), and we also elide
extra TLBIs through get_and_clear_full_ptes, replacing ptep_get_and_clear.
Mapping 1M of memory with 64K folios, memsetting it, remapping it to src +
1M, and munmapping it 10,000 times, the average execution time reduces
from 1.9 to 1.2 seconds, giving a 37% performance optimization, on Apple
M3 (arm64). No regression is observed for small folios.
Test program for reference:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>
#define SIZE (1UL << 20) // 1M
int main(void) {
    void *new_addr, *addr;
    for (int i = 0; i < 10000; ++i) {
        addr = mmap((void *)(1UL << 30), SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        memset(addr, 0xAA, SIZE);
        new_addr = mremap(addr, SIZE, SIZE, MREMAP_MAYMOVE | MREMAP_FIXED, addr + SIZE);
        if (new_addr != (addr + SIZE)) {
            perror("mremap");
            return 1;
        }
        munmap(new_addr, SIZE);
    }
}
This patch (of 2):
Avoid confusion between pte_t* and pte_t data types by suffixing pointer
type variables with p. No functional change.
Link: https://lkml.kernel.org/r/20250610035043.75448-1-dev.jain@arm.com
Link: https://lkml.kernel.org/r/20250610035043.75448-2-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Bang Li <libang.li@antgroup.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: bibo mao <maobibo@loongson.cn>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-10 09:20:42 +05:30
|
|
|
old_ptep = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
|
|
|
|
if (!old_ptep) {
|
2023-06-08 18:32:47 -07:00
|
|
|
err = -EAGAIN;
|
|
|
|
goto out;
|
|
|
|
}
|
2024-09-26 14:46:22 +08:00
|
|
|
/*
|
|
|
|
* Now new_pte is none, so hpage_collapse_scan_file() path can not find
|
|
|
|
* this by traversing file->f_mapping, so there is no concurrency with
|
|
|
|
* retract_page_tables(). In addition, we already hold the exclusive
|
|
|
|
* mmap_lock, so this new_pte page is stable, so there is no need to get
|
|
|
|
* pmdval and do pmd_same() check.
|
|
|
|
*/
|
mm: call pointers to ptes as ptep
2025-06-10 09:20:42 +05:30
|
|
|
new_ptep = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
|
2024-09-26 14:46:22 +08:00
|
|
|
&new_ptl);
|
mm: call pointers to ptes as ptep
2025-06-10 09:20:42 +05:30
|
|
|
if (!new_ptep) {
|
|
|
|
pte_unmap_unlock(old_ptep, old_ptl);
|
2023-06-08 18:32:47 -07:00
|
|
|
err = -EAGAIN;
|
|
|
|
goto out;
|
|
|
|
}
|
[PATCH] mm: split page table lock
2005-10-29 18:16:40 -07:00
|
|
|
if (new_ptl != old_ptl)
|
2006-07-03 00:25:08 -07:00
|
|
|
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
|
2017-08-02 13:31:52 -07:00
|
|
|
flush_tlb_batched_pending(vma->vm_mm);
|
2006-09-30 23:29:33 -07:00
|
|
|
arch_enter_lazy_mmu_mode();
|
2005-10-29 18:16:00 -07:00
|
|
|
|
mm: optimize mremap() by PTE batching
2025-06-10 09:20:43 +05:30
|
|
|
for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
|
|
|
|
new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
|
mm: call pointers to ptes as ptep
2025-06-10 09:20:42 +05:30
|
|
|
VM_WARN_ON_ONCE(!pte_none(*new_ptep));
|
2025-05-29 15:56:48 +00:00
|
|
|
|
mm: optimize mremap() by PTE batching
2025-06-10 09:20:43 +05:30
|
|
|
nr_ptes = 1;
|
|
|
|
max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
|
|
|
|
old_pte = ptep_get(old_ptep);
|
|
|
|
if (pte_none(old_pte))
|
2005-10-29 18:16:00 -07:00
|
|
|
continue;
|
mremap: fix race between mremap() and page cleanning
2016-11-10 17:16:33 +08:00
|
|
|
|
|
|
|
/*
|
2018-10-12 15:22:59 -07:00
|
|
|
* If we are remapping a valid PTE, make sure
|
2016-11-29 13:27:31 +08:00
|
|
|
* to flush TLB before we drop the PTL for the
|
2018-10-12 15:22:59 -07:00
|
|
|
* PTE.
|
2016-11-29 13:27:31 +08:00
|
|
|
*
|
2018-10-12 15:22:59 -07:00
|
|
|
* NOTE! Both old and new PTL matter: the old one
|
2024-06-04 19:48:22 +08:00
|
|
|
* for racing with folio_mkclean(), the new one to
|
2018-10-12 15:22:59 -07:00
|
|
|
* make sure the physical page stays valid until
|
|
|
|
* the TLB entry for the old mapping has been
|
|
|
|
* flushed.
|
mremap: fix race between mremap() and page cleanning
2016-11-10 17:16:33 +08:00
|
|
|
*/
|
mm: optimize mremap() by PTE batching
2025-06-10 09:20:43 +05:30
|
|
|
if (pte_present(old_pte)) {
|
|
|
|
nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
|
|
|
|
old_pte, max_nr_ptes);
|
mremap: fix race between mremap() and page cleanning
2016-11-10 17:16:33 +08:00
|
|
|
force_flush = true;
|
mm: optimize mremap() by PTE batching
2025-06-10 09:20:43 +05:30
|
|
|
}
|
mm: add get_and_clear_ptes() and clear_ptes()
Patch series "Optimizations for khugepaged", v4.
If the underlying folio mapped by the ptes is large, we can process those
ptes in a batch using folio_pte_batch().
For arm64 specifically, this results in a 16x reduction in the number of
ptep_get() calls, since on a contig block, ptep_get() on arm64 will
iterate through all 16 entries to collect a/d bits. Next, ptep_clear()
will cause a TLBI for every contig block in the range via
contpte_try_unfold(). Instead, use clear_ptes() to only do the TLBI at
the first and last contig block of the range.
For split folios, there will be no pte batching; the batch size returned
by folio_pte_batch() will be 1. For folios whose page tables have been
split, the ptes will still point to the same large folio; for arm64 this
results in the optimization described above, and for other arches a minor
improvement is expected from fewer function calls and batched atomic
operations.
This patch (of 3):
Let's add variants to be used where "full" does not apply -- which will
be the majority of cases in the future. "full" really only applies if
we are about to tear down a full MM.
Use get_and_clear_ptes() in existing code, clear_ptes() users will
be added next.
Link: https://lkml.kernel.org/r/20250724052301.23844-2-dev.jain@arm.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24 10:52:59 +05:30
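Based on the description above, the new helper is presumably a thin wrapper that passes full == 0 to the existing batched primitive, along these lines (a sketch, not a verbatim copy of the header):

    static inline pte_t get_and_clear_ptes(struct mm_struct *mm, unsigned long addr,
                                           pte_t *ptep, unsigned int nr)
    {
        /* Like get_and_clear_full_ptes(), but for the common case where
         * we are not tearing down the whole mm ("full" == 0). */
        return get_and_clear_full_ptes(mm, addr, ptep, nr, 0);
    }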
|
|
|
pte = get_and_clear_ptes(mm, old_addr, old_ptep, nr_ptes);
|
2024-03-27 15:33:01 +01:00
|
|
|
pte = move_pte(pte, old_addr, new_addr);
|
2013-08-27 12:37:18 +04:00
|
|
|
pte = move_soft_dirty_pte(pte);
|
2025-01-07 14:47:52 +00:00
|
|
|
|
|
|
|
if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
|
mm: call pointers to ptes as ptep
2025-06-10 09:20:42 +05:30
|
|
|
pte_clear(mm, new_addr, new_ptep);
|
2025-01-07 14:47:52 +00:00
|
|
|
else {
|
|
|
|
if (need_clear_uffd_wp) {
|
|
|
|
if (pte_present(pte))
|
|
|
|
pte = pte_clear_uffd_wp(pte);
|
|
|
|
else if (is_swap_pte(pte))
|
|
|
|
pte = pte_swp_clear_uffd_wp(pte);
|
|
|
|
}
|
mm: optimize mremap() by PTE batching
2025-06-10 09:20:43 +05:30
|
|
|
set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
|
2025-01-07 14:47:52 +00:00
|
|
|
}
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
2005-10-29 18:16:00 -07:00
|
|
|
|
2006-09-30 23:29:33 -07:00
|
|
|
arch_leave_lazy_mmu_mode();
|
2018-10-12 15:22:59 -07:00
|
|
|
if (force_flush)
|
|
|
|
flush_tlb_range(vma, old_end - len, old_end);
|
[PATCH] mm: split page table lock
2005-10-29 18:16:40 -07:00
|
|
|
if (new_ptl != old_ptl)
|
|
|
|
spin_unlock(new_ptl);
|
mm: call pointers to ptes as ptep
2025-06-10 09:20:42 +05:30
|
|
|
pte_unmap(new_ptep - 1);
|
|
|
|
pte_unmap_unlock(old_ptep - 1, old_ptl);
|
2023-06-08 18:32:47 -07:00
|
|
|
out:
|
2025-03-10 20:50:40 +00:00
|
|
|
if (pmc->need_rmap_locks)
|
2016-05-19 17:12:57 -07:00
|
|
|
drop_rmap_locks(vma);
|
2023-06-08 18:32:47 -07:00
|
|
|
return err;
|
2005-04-16 15:20:36 -07:00
|
|
|
}
|
|
|
|
|
2021-07-07 18:10:18 -07:00
|
|
|
#ifndef arch_supports_page_table_move
|
|
|
|
#define arch_supports_page_table_move arch_supports_page_table_move
|
|
|
|
static inline bool arch_supports_page_table_move(void)
|
|
|
|
{
|
|
|
|
return IS_ENABLED(CONFIG_HAVE_MOVE_PMD) ||
|
|
|
|
IS_ENABLED(CONFIG_HAVE_MOVE_PUD);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2019-01-03 15:28:38 -08:00
|
|
|
#ifdef CONFIG_HAVE_MOVE_PMD
|
2025-03-10 20:50:40 +00:00
|
|
|
static bool move_normal_pmd(struct pagetable_move_control *pmc,
|
|
|
|
pmd_t *old_pmd, pmd_t *new_pmd)
|
2019-01-03 15:28:38 -08:00
|
|
|
{
|
|
|
|
spinlock_t *old_ptl, *new_ptl;
|
2025-03-10 20:50:40 +00:00
|
|
|
struct vm_area_struct *vma = pmc->old;
|
2019-01-03 15:28:38 -08:00
|
|
|
struct mm_struct *mm = vma->vm_mm;
|
mm/mremap: fix move_normal_pmd/retract_page_tables race
In mremap(), move_page_tables() looks at the type of the PMD entry and the
specified address range to figure out by which method the next chunk of
page table entries should be moved.
At that point, the mmap_lock is held in write mode, but no rmap locks are
held yet. For PMD entries that point to page tables and are fully covered
by the source address range, move_pgt_entry(NORMAL_PMD, ...) is called,
which first takes rmap locks, then does move_normal_pmd().
move_normal_pmd() takes the necessary page table locks at source and
destination, then moves an entire page table from the source to the
destination.
The problem is: The rmap locks, which protect against concurrent page
table removal by retract_page_tables() in the THP code, are only taken
after the PMD entry has been read and it has been decided how to move it.
So we can race as follows (with two processes that have mappings of the
same tmpfs file that is stored on a tmpfs mount with huge=advise); note
that process A accesses page tables through the MM while process B does it
through the file rmap:
process A process B
========= =========
mremap
mremap_to
move_vma
move_page_tables
get_old_pmd
alloc_new_pmd
*** PREEMPT ***
madvise(MADV_COLLAPSE)
do_madvise
madvise_walk_vmas
madvise_vma_behavior
madvise_collapse
hpage_collapse_scan_file
collapse_file
retract_page_tables
i_mmap_lock_read(mapping)
pmdp_collapse_flush
i_mmap_unlock_read(mapping)
move_pgt_entry(NORMAL_PMD, ...)
take_rmap_locks
move_normal_pmd
drop_rmap_locks
When this happens, move_normal_pmd() can end up creating bogus PMD entries
in the line `pmd_populate(mm, new_pmd, pmd_pgtable(pmd))`. The effect
depends on arch-specific and machine-specific details; on x86, you can end
up with physical page 0 mapped as a page table, which is likely
exploitable for user->kernel privilege escalation.
Fix the race by letting process B recheck that the PMD still points to a
page table after the rmap locks have been taken. Otherwise, we bail and
let the caller fall back to the PTE-level copying path, which will then
bail immediately at the pmd_none() check.
Bug reachability: Reaching this bug requires that you can create
shmem/file THP mappings - anonymous THP uses different code that doesn't
zap stuff under rmap locks. File THP is gated on an experimental config
flag (CONFIG_READ_ONLY_THP_FOR_FS), so on normal distro kernels you need
shmem THP to hit this bug. As far as I know, getting shmem THP normally
requires that you can mount your own tmpfs with the right mount flags,
which would require creating your own user+mount namespace; though I don't
know if some distros maybe enable shmem THP by default or something like
that.
Bug impact: This issue can likely be used for user->kernel privilege
escalation when it is reachable.
Link: https://lkml.kernel.org/r/20241007-move_normal_pmd-vs-collapse-fix-2-v1-1-5ead9631f2ea@google.com
Fixes: 1d65b771bc08 ("mm/khugepaged: retract_page_tables() without mmap or vma lock")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Closes: https://project-zero.issues.chromium.org/371047675
Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-07 23:42:04 +02:00
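As a rough illustration of the interaction described above, the two racing paths could be driven from userspace along the following lines. This is a hypothetical sketch, not a reliable reproducer: the sizes, iteration counts and the memfd-backed shmem mapping are illustrative assumptions, and actually hitting the window additionally depends on shmem THP configuration and timing.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdlib.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25        /* Linux 6.1+; define if headers are older */
#endif

#define SZ (4UL << 20)          /* two PMD-sized (2 MiB) chunks */

int main(void)
{
        int fd = memfd_create("thp", 0);        /* shared shmem object */
        char *a, *dst;

        if (fd < 0 || ftruncate(fd, SZ))
                exit(1);
        a = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (a == MAP_FAILED)
                exit(1);
        for (unsigned long i = 0; i < SZ; i += 4096)
                a[i] = 1;       /* populate page tables */

        if (fork() == 0) {
                /* "Process B": retract page tables via the file rmap. */
                char *b = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
                if (b == MAP_FAILED)
                        _exit(1);
                for (int i = 0; i < 100000; i++)
                        madvise(b, SZ, MADV_COLLAPSE);
                _exit(0);
        }

        /* "Process A": repeatedly move a chunk of its own mapping around. */
        dst = mmap(NULL, SZ, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        for (int i = 0; i < 100000; i++) {
                mremap(a, SZ / 2, SZ / 2, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
                mremap(dst, SZ / 2, SZ / 2, MREMAP_MAYMOVE | MREMAP_FIXED, a);
        }
        wait(NULL);
        return 0;
}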
        bool res = false;
        pmd_t pmd;

        if (!arch_supports_page_table_move())
                return false;
        /*
         * The destination pmd shouldn't be established, free_pgtables()
         * should have released it.
         *
         * However, there's a case during execve() where we use mremap
         * to move the initial stack, and in that case the target area
         * may overlap the source area (always moving down).
         *
         * If everything is PMD-aligned, that works fine, as moving
         * each pmd down will clear the source pmd. But if we first
         * have a few 4kB-only pages that get moved down, and then
         * hit the "now the rest is PMD-aligned, let's do everything
         * one pmd at a time", we will still have the old (now empty
         * of any 4kB pages, but still there) PMD in the page table
         * tree.
         *
         * Warn on it once - because we really should try to figure
         * out how to do this better - but then say "I won't move
         * this pmd".
         *
         * One alternative might be to just unmap the target pmd at
         * this point, and verify that it really is empty. We'll see.
         */
        if (WARN_ON_ONCE(!pmd_none(*new_pmd)))
                return false;

        /* If this pmd belongs to a uffd vma with remap events disabled, we need
         * to ensure that the uffd-wp state is cleared from all pgtables. This
         * means recursing into lower page tables in move_page_tables(), and we
         * can reuse the existing code if we simply treat the entry as "not
         * moved".
         */
        if (vma_has_uffd_without_event_remap(vma))
                return false;

        /*
         * We don't have to worry about the ordering of src and dst
         * ptlocks because exclusive mmap_lock prevents deadlock.
         */
        old_ptl = pmd_lock(mm, old_pmd);
        new_ptl = pmd_lockptr(mm, new_pmd);
        if (new_ptl != old_ptl)
                spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

        pmd = *old_pmd;

        /* Racing with collapse? */
        if (unlikely(!pmd_present(pmd) || pmd_leaf(pmd)))
                goto out_unlock;

        /* Clear the pmd */
        pmd_clear(old_pmd);
        res = true;

        VM_BUG_ON(!pmd_none(*new_pmd));

        pmd_populate(mm, new_pmd, pmd_pgtable(pmd));
        flush_tlb_range(vma, pmc->old_addr, pmc->old_addr + PMD_SIZE);
out_unlock:
        if (new_ptl != old_ptl)
                spin_unlock(new_ptl);
        spin_unlock(old_ptl);

        return res;
}
#else
static inline bool move_normal_pmd(struct pagetable_move_control *pmc,
                pmd_t *old_pmd, pmd_t *new_pmd)
{
        return false;
}
#endif

#if CONFIG_PGTABLE_LEVELS > 2 && defined(CONFIG_HAVE_MOVE_PUD)
static bool move_normal_pud(struct pagetable_move_control *pmc,
                pud_t *old_pud, pud_t *new_pud)
{
        spinlock_t *old_ptl, *new_ptl;
        struct vm_area_struct *vma = pmc->old;
        struct mm_struct *mm = vma->vm_mm;
        pud_t pud;

        if (!arch_supports_page_table_move())
                return false;
        /*
         * The destination pud shouldn't be established, free_pgtables()
         * should have released it.
         */
        if (WARN_ON_ONCE(!pud_none(*new_pud)))
                return false;

        /* If this pud belongs to a uffd vma with remap events disabled, we need
         * to ensure that the uffd-wp state is cleared from all pgtables. This
         * means recursing into lower page tables in move_page_tables(), and we
         * can reuse the existing code if we simply treat the entry as "not
         * moved".
         */
        if (vma_has_uffd_without_event_remap(vma))
                return false;

        /*
         * We don't have to worry about the ordering of src and dst
         * ptlocks because exclusive mmap_lock prevents deadlock.
         */
        old_ptl = pud_lock(mm, old_pud);
        new_ptl = pud_lockptr(mm, new_pud);
        if (new_ptl != old_ptl)
                spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

        /* Clear the pud */
        pud = *old_pud;
        pud_clear(old_pud);

        VM_BUG_ON(!pud_none(*new_pud));

        pud_populate(mm, new_pud, pud_pgtable(pud));
        flush_tlb_range(vma, pmc->old_addr, pmc->old_addr + PUD_SIZE);
        if (new_ptl != old_ptl)
                spin_unlock(new_ptl);
        spin_unlock(old_ptl);

        return true;
}
#else
static inline bool move_normal_pud(struct pagetable_move_control *pmc,
                pud_t *old_pud, pud_t *new_pud)
{
        return false;
}
#endif

#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
static bool move_huge_pud(struct pagetable_move_control *pmc,
                pud_t *old_pud, pud_t *new_pud)
{
        spinlock_t *old_ptl, *new_ptl;
        struct vm_area_struct *vma = pmc->old;
        struct mm_struct *mm = vma->vm_mm;
        pud_t pud;

        /*
         * The destination pud shouldn't be established, free_pgtables()
         * should have released it.
         */
        if (WARN_ON_ONCE(!pud_none(*new_pud)))
                return false;

        /*
         * We don't have to worry about the ordering of src and dst
         * ptlocks because exclusive mmap_lock prevents deadlock.
         */
        old_ptl = pud_lock(mm, old_pud);
        new_ptl = pud_lockptr(mm, new_pud);
        if (new_ptl != old_ptl)
                spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

        /* Clear the pud */
        pud = *old_pud;
        pud_clear(old_pud);

        VM_BUG_ON(!pud_none(*new_pud));

        /* Set the new pud */
        /* mark soft_dirty when we add pud level soft dirty support */
        set_pud_at(mm, pmc->new_addr, new_pud, pud);
        flush_pud_tlb_range(vma, pmc->old_addr, pmc->old_addr + HPAGE_PUD_SIZE);
        if (new_ptl != old_ptl)
                spin_unlock(new_ptl);
        spin_unlock(old_ptl);

        return true;
}
#else
static bool move_huge_pud(struct pagetable_move_control *pmc,
                pud_t *old_pud, pud_t *new_pud)
{
        WARN_ON_ONCE(1);
        return false;
}
#endif

enum pgt_entry {
        NORMAL_PMD,
        HPAGE_PMD,
        NORMAL_PUD,
        HPAGE_PUD,
};

/*
 * Returns an extent of the corresponding size for the pgt_entry specified if
 * valid. Else returns a smaller extent bounded by the end of the source and
 * destination pgt_entry.
 */
static __always_inline unsigned long get_extent(enum pgt_entry entry,
                        struct pagetable_move_control *pmc)
{
        unsigned long next, extent, mask, size;
        unsigned long old_addr = pmc->old_addr;
        unsigned long old_end = pmc->old_end;
        unsigned long new_addr = pmc->new_addr;

        switch (entry) {
        case HPAGE_PMD:
        case NORMAL_PMD:
                mask = PMD_MASK;
                size = PMD_SIZE;
                break;
        case HPAGE_PUD:
        case NORMAL_PUD:
                mask = PUD_MASK;
                size = PUD_SIZE;
                break;
        default:
                BUILD_BUG();
                break;
        }

        next = (old_addr + size) & mask;
        /* even if next overflowed, extent below will be ok */
        extent = next - old_addr;
        if (extent > old_end - old_addr)
                extent = old_end - old_addr;
        next = (new_addr + size) & mask;
        if (extent > next - new_addr)
                extent = next - new_addr;
        return extent;
}

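To make the extent clamping concrete, the following standalone userspace sketch mirrors the PMD case of the arithmetic above with hypothetical addresses (PMD_SIZE and PMD_MASK are assumed to be the usual x86-64 2 MiB values; pmd_extent() is an illustrative helper, not kernel code):

#include <assert.h>
#include <stdio.h>

#define PMD_SIZE  (1UL << 21)           /* 2 MiB, typical x86-64 value */
#define PMD_MASK  (~(PMD_SIZE - 1))

/* Mirrors the extent computation in get_extent() for the PMD case. */
static unsigned long pmd_extent(unsigned long old_addr, unsigned long old_end,
                                unsigned long new_addr)
{
        unsigned long next, extent;

        next = (old_addr + PMD_SIZE) & PMD_MASK;        /* next PMD boundary after old_addr */
        extent = next - old_addr;
        if (extent > old_end - old_addr)                /* don't run past the source range */
                extent = old_end - old_addr;
        next = (new_addr + PMD_SIZE) & PMD_MASK;        /* next PMD boundary after new_addr */
        if (extent > next - new_addr)                   /* nor past the destination boundary */
                extent = next - new_addr;
        return extent;
}

int main(void)
{
        /* old_addr is 4 KiB short of a PMD boundary: only that sliver is movable here. */
        assert(pmd_extent(0x1ff000, 0x400000, 0x300000) == 0x1000);
        /* Fully aligned source and destination: a whole PMD's worth can be moved. */
        assert(pmd_extent(0x200000, 0x600000, 0x800000) == PMD_SIZE);
        printf("ok\n");
        return 0;
}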
/*
 * Should move_pgt_entry() acquire the rmap locks? This is either expressed in
 * the PMC, or overridden in the case of normal, larger page tables.
 */
static bool should_take_rmap_locks(struct pagetable_move_control *pmc,
                                   enum pgt_entry entry)
{
        switch (entry) {
        case NORMAL_PMD:
        case NORMAL_PUD:
                return true;
        default:
                return pmc->need_rmap_locks;
        }
}

/*
 * Attempts to speed up the move by moving entry at the level corresponding to
 * pgt_entry. Returns true if the move was successful, else false.
 */
static bool move_pgt_entry(struct pagetable_move_control *pmc,
                           enum pgt_entry entry, void *old_entry, void *new_entry)
{
        bool moved = false;
        bool need_rmap_locks = should_take_rmap_locks(pmc, entry);

        /* See comment in move_ptes() */
        if (need_rmap_locks)
                take_rmap_locks(pmc->old);

        switch (entry) {
        case NORMAL_PMD:
                moved = move_normal_pmd(pmc, old_entry, new_entry);
                break;
        case NORMAL_PUD:
                moved = move_normal_pud(pmc, old_entry, new_entry);
                break;
        case HPAGE_PMD:
                moved = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
                        move_huge_pmd(pmc->old, pmc->old_addr, pmc->new_addr, old_entry,
                                      new_entry);
                break;
        case HPAGE_PUD:
                moved = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
                        move_huge_pud(pmc, old_entry, new_entry);
                break;
        default:
                WARN_ON_ONCE(1);
                break;
        }

        if (need_rmap_locks)
                drop_rmap_locks(pmc->old);

        return moved;
}

mm/mremap: optimize the start addresses in move_page_tables()
Patch series "Optimize mremap during mutual alignment within PMD", v6.
This patchset optimizes the start addresses in move_page_tables() and
tests the changes. It addresses a warning [1] that occurs due to a
downward, overlapping move on a mutually-aligned offset within a PMD
during exec. By initiating the copy process at the PMD level when such
alignment is present, we can prevent this warning and speed up the copying
process at the same time. Linus Torvalds suggested this idea. Check the
individual patches for more details. [1]
https://lore.kernel.org/all/ZB2GTBD%2FLWTrkOiO@dhcp22.suse.cz/
This patch (of 7):
Recently, we see reports [1] of a warning that triggers due to
move_page_tables() doing a downward and overlapping move on a
mutually-aligned offset within a PMD. By mutual alignment, I mean the
source and destination addresses of the mremap are at the same offset
within a PMD.
This mutual alignment along with the fact that the move is downward is
sufficient to cause a warning related to having an allocated PMD that does
not have PTEs in it.
This warning will only trigger when there is mutual alignment in the move
operation. A solution, as suggested by Linus Torvalds [2], is to initiate
the copy process at the PMD level whenever such alignment is present.
Implementing this approach will not only prevent the warning from being
triggered, but it will also optimize the operation as this method should
enhance the speed of the copy process whenever there's a possibility to
start copying at the PMD level.
Some more points:
a. The optimization can be done only when both the source and
destination of the mremap do not have anything mapped below it up to a
PMD boundary. I add support to detect that.
b. #1 is not a problem for the call to move_page_tables() from exec.c as
nothing is expected to be mapped below the source. However, for
non-overlapping mutually aligned moves as triggered by mremap(2), I
added support for checking such cases.
c. I currently only optimize for PMD moves, in the future I/we can build
on this work and do PUD moves as well if there is a need for this. But I
want to take it one step at a time.
d. We need to be careful about mremap of ranges within the VMA itself.
For this purpose, I added checks to determine if the address after
alignment falls within its VMA itself.
[1] https://lore.kernel.org/all/ZB2GTBD%2FLWTrkOiO@dhcp22.suse.cz/
[2] https://lore.kernel.org/all/CAHk-=whd7msp8reJPfeGNyt0LiySMT0egExx3TVZSX3Ok6X=9g@mail.gmail.com/
Link: https://lkml.kernel.org/r/20230903151328.2981432-1-joel@joelfernandes.org
Link: https://lkml.kernel.org/r/20230903151328.2981432-2-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-03 15:13:22 +00:00
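A hedged userspace-style sketch of the alignment decision the series introduces is shown below. It only mirrors the "is realigning down to a PMD boundary worthwhile, and are the two addresses mutually aligned?" logic of can_realign_addr(); the kernel additionally checks that no other mapping lives below the aligned address (see can_align_down()). PMD_SIZE/PMD_MASK and the sample addresses are assumed values for illustration:

#include <stdbool.h>
#include <stdio.h>

#define PMD_SIZE (1UL << 21)            /* 2 MiB */
#define PMD_MASK (~(PMD_SIZE - 1))

/* Illustrative helper: decision logic only, no VMA lookups. */
static bool worth_realigning(unsigned long old_addr, unsigned long new_addr,
                             unsigned long len_in)
{
        unsigned long old_align = old_addr & ~PMD_MASK;         /* offset into the PMD */
        unsigned long new_align = new_addr & ~PMD_MASK;
        unsigned long old_align_next = PMD_SIZE - old_align;    /* distance to next boundary */

        if (len_in < old_align_next)    /* copy never reaches a PMD boundary */
                return false;
        if (old_align == 0)             /* already aligned, nothing to do */
                return false;
        return old_align == new_align;  /* must be mutually aligned within the PMD */
}

int main(void)
{
        /* Same 0x1000 offset within the PMD on both sides, large enough move. */
        printf("%d\n", worth_realigning(0x201000, 0x801000, 0x300000)); /* 1 */
        /* Different offsets: the PMD-level fast path can't start aligned. */
        printf("%d\n", worth_realigning(0x201000, 0x802000, 0x300000)); /* 0 */
        return 0;
}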
/*
 * A helper to check if aligning down is OK. The aligned address should fall
 * on *no mapping*. For the stack moving down, that's a special move within
 * the VMA that is created to span the source and destination of the move,
 * so we make an exception for it.
*/
static bool can_align_down(struct pagetable_move_control *pmc,
                           struct vm_area_struct *vma, unsigned long addr_to_align,
                           unsigned long mask)
{
        unsigned long addr_masked = addr_to_align & mask;

        /*
         * If @addr_to_align of either source or destination is not the beginning
         * of the corresponding VMA, we can't align down or we will destroy part
         * of the current mapping.
         */
        if (!pmc->for_stack && vma->vm_start != addr_to_align)
                return false;

        /* In the stack case we explicitly permit in-VMA alignment. */
        if (pmc->for_stack && addr_masked >= vma->vm_start)
                return true;

        /*
         * Make sure the realignment doesn't cause the address to fall on an
         * existing mapping.
         */
        return find_vma_intersection(vma->vm_mm, addr_masked, vma->vm_start) == NULL;
}

/*
 * Determine if we are in fact able to realign for efficiency to a higher page
 * table boundary.
 */
static bool can_realign_addr(struct pagetable_move_control *pmc,
                             unsigned long pagetable_mask)
{
        unsigned long align_mask = ~pagetable_mask;
        unsigned long old_align = pmc->old_addr & align_mask;
        unsigned long new_align = pmc->new_addr & align_mask;
        unsigned long pagetable_size = align_mask + 1;
        unsigned long old_align_next = pagetable_size - old_align;

        /*
         * We don't want to have to go hunting for VMAs from the end of the old
         * VMA to the next page table boundary, also we want to make sure the
         * operation is worthwhile.
         *
         * So ensure that we only perform this realignment if the end of the
         * range being copied reaches or crosses the page table boundary.
         *
         * boundary                         boundary
         *    .<- old_align ->              .
         *    .              |----------------.-----------|
         *    .              |          vma   .           |
         *    .              |----------------.-----------|
         *    .              <----------------.----------->
         *    .                      len_in   .
         *    <------------------------------->
         *    .         pagetable_size        .
         *    .              <---------------->
         *    .           old_align_next      .
         */
        if (pmc->len_in < old_align_next)
                return false;

        /* Skip if the addresses are already aligned. */
        if (old_align == 0)
                return false;


        /* Only realign if the new and old addresses are mutually aligned. */
        if (old_align != new_align)
                return false;

        /* Ensure realignment doesn't cause overlap with existing mappings. */
        if (!can_align_down(pmc, pmc->old, pmc->old_addr, pagetable_mask) ||
            !can_align_down(pmc, pmc->new, pmc->new_addr, pagetable_mask))
                return false;

        return true;
}

/*
 * Opportunistically realign to specified boundary for faster copy.
 *
 * Consider an mremap() of a VMA with page table boundaries as below, and no
 * preceding VMAs from the lower page table boundary to the start of the VMA,
 * with the end of the range reaching or crossing the page table boundary.
 *
 *   boundary                        boundary
 *      .               |----------------.-----------|
 *      .               |          vma   .           |
 *      .               |----------------.-----------|
 *      .        pmc->old_addr           .       pmc->old_end
 *      .               <---------------------------->
 *      .                   move these page tables
 *
 * If we proceed with moving page tables in this scenario, we will have a lot of
 * work to do traversing old page tables and establishing new ones in the
 * destination across multiple lower level page tables.
 *
 * The idea here is simply to align pmc->old_addr, pmc->new_addr down to the
 * page table boundary, so we can simply copy a single page table entry for the
 * aligned portion of the VMA instead:
 *
 *   boundary                        boundary
 *      .               |----------------.-----------|
 *      .               |          vma   .           |
 *      .               |----------------.-----------|
 * pmc->old_addr        .                .       pmc->old_end
 *      <------------------------------------------->
 *      .                   move these page tables
 */
static void try_realign_addr(struct pagetable_move_control *pmc,
                             unsigned long pagetable_mask)
{
        if (!can_realign_addr(pmc, pagetable_mask))
                return;

        /*
         * Simply align to page table boundaries. Note that we do NOT update the
         * pmc->old_end value, and since the move_page_tables() operation spans
         * from [old_addr, old_end) (offsetting new_addr as it is performed),
         * this simply changes the start of the copy, not the end.
         */
        pmc->old_addr &= pagetable_mask;
        pmc->new_addr &= pagetable_mask;
}
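
/*
 * Illustrative sketch, not part of mm/mremap.c: the mutual-alignment predicate
 * and the realignment above boil down to simple mask arithmetic. Assuming a
 * 2 MiB PMD span (e.g. x86-64 with 4 KiB pages), the checks look roughly like
 * this; names prefixed example_ are hypothetical.
 */
#if 0	/* example only, not built */
#define EXAMPLE_PMD_SIZE	(2UL << 20)
#define EXAMPLE_PMD_MASK	(~(EXAMPLE_PMD_SIZE - 1))

/* Mutually aligned: same offset within a PMD-sized region. */
static bool example_mutually_aligned(unsigned long old_addr,
                                     unsigned long new_addr)
{
        return (old_addr & ~EXAMPLE_PMD_MASK) == (new_addr & ~EXAMPLE_PMD_MASK);
}

/* Realign: start the copy at the PMD boundary below both addresses. */
static void example_align_down(unsigned long *old_addr, unsigned long *new_addr)
{
        *old_addr &= EXAMPLE_PMD_MASK;
        *new_addr &= EXAMPLE_PMD_MASK;
}
#endif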

/* Is the page table move operation done? */
static bool pmc_done(struct pagetable_move_control *pmc)
{
        return pmc->old_addr >= pmc->old_end;
}

/* Advance to the next page table, offset by extent bytes. */
static void pmc_next(struct pagetable_move_control *pmc, unsigned long extent)
{
        pmc->old_addr += extent;
        pmc->new_addr += extent;
}

/*
 * Determine how many bytes in the specified input range have had their page
 * tables moved so far.
 */
static unsigned long pmc_progress(struct pagetable_move_control *pmc)
{
        unsigned long orig_old_addr = pmc->old_end - pmc->len_in;
        unsigned long old_addr = pmc->old_addr;

        /*
         * Prevent negative return values when {old,new}_addr was realigned but
         * we broke out of the loop in move_page_tables() for the first PMD
         * itself.
         */
        return old_addr < orig_old_addr ? 0 : old_addr - orig_old_addr;
}

unsigned long move_page_tables(struct pagetable_move_control *pmc)
{
        unsigned long extent;
        struct mmu_notifier_range range;
        pmd_t *old_pmd, *new_pmd;
        pud_t *old_pud, *new_pud;
        struct mm_struct *mm = pmc->old->vm_mm;

        if (!pmc->len_in)
                return 0;

        if (is_vm_hugetlb_page(pmc->old))
                return move_hugetlb_page_tables(pmc->old, pmc->new, pmc->old_addr,
                                                pmc->new_addr, pmc->len_in);

        /*
         * If possible, realign addresses to PMD boundary for faster copy.
         * Only realign if the mremap copying hits a PMD boundary.
         */
        try_realign_addr(pmc, PMD_MASK);

        flush_cache_range(pmc->old, pmc->old_addr, pmc->old_end);
        mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, mm,
                                pmc->old_addr, pmc->old_end);
        mmu_notifier_invalidate_range_start(&range);

        for (; !pmc_done(pmc); pmc_next(pmc, extent)) {
                cond_resched();

                /*
                 * If extent is PUD-sized try to speed up the move by moving at the
                 * PUD level if possible.
                 */
                extent = get_extent(NORMAL_PUD, pmc);

                old_pud = get_old_pud(mm, pmc->old_addr);
                if (!old_pud)
                        continue;
                new_pud = alloc_new_pud(mm, pmc->new_addr);
                if (!new_pud)
                        break;
                if (pud_trans_huge(*old_pud)) {
                        if (extent == HPAGE_PUD_SIZE) {
                                move_pgt_entry(pmc, HPAGE_PUD, old_pud, new_pud);
                                /* We ignore and continue on error? */
                                continue;
                        }
                } else if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
                        if (move_pgt_entry(pmc, NORMAL_PUD, old_pud, new_pud))
                                continue;
                }

                extent = get_extent(NORMAL_PMD, pmc);
                old_pmd = get_old_pmd(mm, pmc->old_addr);
                if (!old_pmd)
                        continue;
                new_pmd = alloc_new_pmd(mm, pmc->new_addr);
                if (!new_pmd)
                        break;
again:
                if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) {
                        if (extent == HPAGE_PMD_SIZE &&
                            move_pgt_entry(pmc, HPAGE_PMD, old_pmd, new_pmd))
                                continue;
                        split_huge_pmd(pmc->old, old_pmd, pmc->old_addr);
                } else if (IS_ENABLED(CONFIG_HAVE_MOVE_PMD) &&
                           extent == PMD_SIZE) {
                        /*
                         * If the extent is PMD-sized, try to speed the move by
                         * moving at the PMD level if possible.
                         */
                        if (move_pgt_entry(pmc, NORMAL_PMD, old_pmd, new_pmd))
                                continue;
thp: mremap support and TLB optimization
This adds THP support to mremap (decreases the number of split_huge_page()
calls).
Here are also some benchmarks with a small test program like this:
===
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (5UL*1024*1024*1024)

int main()
{
	static struct timeval oldstamp, newstamp;
	long diffsec;
	char *p, *p2, *p3, *p4;
	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
		perror("memalign"), exit(1);

	memset(p, 0xff, SIZE);
	memset(p2, 0xff, SIZE);
	memset(p3, 0x77, 4096);
	gettimeofday(&oldstamp, NULL);
	p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
	gettimeofday(&newstamp, NULL);
	diffsec = newstamp.tv_sec - oldstamp.tv_sec;
	diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
	printf("usec %ld\n", diffsec);
	if (p == MAP_FAILED || p4 != p3)
	//if (p == MAP_FAILED)
		perror("mremap"), exit(1);
	if (memcmp(p4, p2, SIZE))
		printf("mremap bug\n"), exit(1);
	printf("ok\n");
	return 0;
}
===
THP on
Performance counter stats for './largepage13' (3 runs):
69195836 dTLB-loads ( +- 3.546% ) (scaled from 50.30%)
60708 dTLB-load-misses ( +- 11.776% ) (scaled from 52.62%)
676266476 dTLB-stores ( +- 5.654% ) (scaled from 69.54%)
29856 dTLB-store-misses ( +- 4.081% ) (scaled from 89.22%)
1055848782 iTLB-loads ( +- 4.526% ) (scaled from 80.18%)
8689 iTLB-load-misses ( +- 2.987% ) (scaled from 58.20%)
7.314454164 seconds time elapsed ( +- 0.023% )
THP off
Performance counter stats for './largepage13' (3 runs):
1967379311 dTLB-loads ( +- 0.506% ) (scaled from 60.59%)
9238687 dTLB-load-misses ( +- 22.547% ) (scaled from 61.87%)
2014239444 dTLB-stores ( +- 0.692% ) (scaled from 60.40%)
3312335 dTLB-store-misses ( +- 7.304% ) (scaled from 67.60%)
6764372065 iTLB-loads ( +- 0.925% ) (scaled from 79.00%)
8202 iTLB-load-misses ( +- 0.475% ) (scaled from 70.55%)
9.693655243 seconds time elapsed ( +- 0.069% )
grep thp /proc/vmstat
thp_fault_alloc 35849
thp_fault_fallback 0
thp_collapse_alloc 3
thp_collapse_alloc_failed 0
thp_split 0
thp_split 0 confirms no thp split despite plenty of hugepages allocated.
Measuring only the mremap time (so excluding the three long memsets and the
final long 10GB memory-accessing memcmp):
THP on
usec 14824
usec 14862
usec 14859
THP off
usec 256416
usec 255981
usec 255847
With an older kernel without the mremap optimizations (the patch below
optimizes the non-THP version too):
THP on
usec 392107
usec 390237
usec 404124
THP off
usec 444294
usec 445237
usec 445820
I guess a threaded program that sends more IPIs on a large SMP system would
show an even larger difference.
All debug options are off except DEBUG_VM, to avoid skewing the results.
The only problem for a native 2M mremap, as happens above, is that both the
source and destination addresses must be 2M aligned or the hugepmd can't be
moved without a split, but that is a hardware limitation.
[akpm@linux-foundation.org: coding-style nitpicking]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:08:30 -07:00
                }
                if (pmd_none(*old_pmd))
                        continue;
                if (pte_alloc(pmc->new->vm_mm, new_pmd))
                        break;
                if (move_ptes(pmc, extent, old_pmd, new_pmd) < 0)
                        goto again;
        }

        mmu_notifier_invalidate_range_end(&range);

        return pmc_progress(pmc);
}

mm/mremap: introduce and use vma_remap_struct threaded state
A number of mremap() calls both pass around and modify a large number of
parameters, making the code less readable and often requiring things such as
the VMA, size delta, and more to be repeatedly determined.
Avoid this by using the common pattern of passing a state object through
the operation, updating it as we go. We introduce the vma_remap_struct or
'VRM' for this purpose.
This also gives us the ability to accumulate further state through the
operation that would otherwise require awkward and error-prone pointer
passing.
We can also now trivially define helper functions that operate on a VRM
object.
This pattern has proven itself to be very powerful when implemented for
VMA merge, VMA unmapping and memory mapping operations, so it is
battle-tested and functional.
We both introduce the data structure and use it, introducing helper
functions as needed to make things readable, we move some state such as
mmap lock and mlock() status to the VRM, we introduce a means of
classifying the type of mremap() operation and de-duplicate the
get_unmapped_area() lookup.
We also neatly thread userfaultfd state throughout the operation.
Note that there is further refactoring to be done, chiefly adjusting
move_vma() to accept a VRM parameter. We defer this, as there is prerequisite
work required before we can do so, which we will address in a subsequent
patch.
Link: https://lkml.kernel.org/r/27951739dc83b2b1523b81fa9c009ba348388d40.1741639347.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-10 20:50:36 +00:00
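To illustrate the pattern, a minimal sketch follows (the real vma_remap_struct
carries more state than shown here, and the example_* declarations are
hypothetical, included only to contrast the two calling conventions):
===
/* Before: many parameters passed around and modified across helpers. */
static unsigned long example_old_style(unsigned long addr, unsigned long old_len,
				       unsigned long new_len, unsigned long flags,
				       unsigned long new_addr, bool *locked);

/* After: one threaded state object, updated as the operation progresses. */
struct vma_remap_struct /* abridged */ {
	unsigned long addr, old_len;		/* source range */
	unsigned long new_addr, new_len;	/* destination range */
	unsigned long delta;			/* size difference */
	unsigned long flags;			/* MREMAP_* flags */
	struct vm_area_struct *vma;		/* VMA being operated on */
	unsigned long charged;			/* pages charged, if accounted */
};

static unsigned long example_new_style(struct vma_remap_struct *vrm);
===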

/* Set vrm->delta to the difference in VMA size specified by user. */
static void vrm_set_delta(struct vma_remap_struct *vrm)
{
        vrm->delta = abs_diff(vrm->old_len, vrm->new_len);
}

/* Determine what kind of remap this is - shrink, expand or no resize at all. */
static enum mremap_type vrm_remap_type(struct vma_remap_struct *vrm)
{
        if (vrm->delta == 0)
                return MREMAP_NO_RESIZE;

        if (vrm->old_len > vrm->new_len)
                return MREMAP_SHRINK;

        return MREMAP_EXPAND;
}

/*
 * When moving a VMA to vrm->new_addr, does this result in the new and old VMAs
 * overlapping?
 */
static bool vrm_overlaps(struct vma_remap_struct *vrm)
{
        unsigned long start_old = vrm->addr;
        unsigned long start_new = vrm->new_addr;
        unsigned long end_old = vrm->addr + vrm->old_len;
        unsigned long end_new = vrm->new_addr + vrm->new_len;

        /*
         * start_old    end_old
         *     |-----------|
         *     |           |
         *     |-----------|
         *             |-------------|
         *             |             |
         *             |-------------|
         *         start_new      end_new
         */
        if (end_old > start_new && end_new > start_old)
                return true;

        return false;
}
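
/*
 * Illustrative sketch, not part of mm/mremap.c: the helpers above applied to
 * concrete numbers, assuming 4 KiB pages. Moving 8 pages from 0x10000 to
 * 0x14000 overlaps (end_old = 0x18000 > start_new = 0x14000 and end_new =
 * 0x1c000 > start_old = 0x10000); moving them to 0x18000 does not, as the
 * half-open ranges merely touch.
 */
#if 0	/* example only, not built */
static void example_overlap_check(void)
{
	struct vma_remap_struct vrm = {
		.addr		= 0x10000,
		.old_len	= 8 << PAGE_SHIFT,
		.new_addr	= 0x14000,
		.new_len	= 8 << PAGE_SHIFT,
	};

	WARN_ON(!vrm_overlaps(&vrm));	/* ranges intersect */

	vrm.new_addr = 0x18000;		/* destination starts at end_old */
	WARN_ON(vrm_overlaps(&vrm));	/* no overlap */
}
#endif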
mm/mremap: perform some simple cleanups
Patch series "mm/mremap: permit mremap() move of multiple VMAs", v4.
Historically we've made it a uAPI requirement that mremap() may only
operate on a single VMA at a time.
For instances where VMAs need to be resized, this makes sense, as it
becomes very difficult to determine what a user actually wants should they
indicate a desire to expand or shrink the size of multiple VMAs (truncate?
Adjust sizes individually? Some other strategy?).
However, in instances where a user is moving VMAs, it is restrictive to
disallow this.
This is especially the case when remapping anonymous mappings, which may or
may not be mergeable depending on whether their VMAs have been faulted, due to
anon_vma assignment and folio index alignment with vma->vm_pgoff.
Often this has a surprising impact: a moved region is faulted, then moved
back, and the user fails to observe a merge of otherwise compatible, adjacent
VMAs.
This change allows such cases to work without the user having to be cognizant
of whether a prior mremap() move or other VMA operations have resulted in VMA
fragmentation.
In order to do this, this series performs a large amount of refactoring, most
pertinently grouping sanity checks together, separating those that check input
parameters from those relating to VMAs.
We also simplify the post-mmap-lock-drop processing for uffd and mlock()'d
VMAs.
With this done, we can then fairly straightforwardly implement this
functionality.
This works exclusively for mremap() invocations which specify
MREMAP_FIXED. It is not compatible with VMAs which use userfaultfd, as the
notification of the userland fault handler would require us to drop the
mmap lock.
It is also not compatible with file-backed mappings with customised
get_unmapped_area() handlers as these may not honour MREMAP_FIXED.
The input and output address ranges must not overlap. We carefully
account for moves which would result in VMA iterator invalidation.
While there can be gaps between VMAs in the input range, there can be no
gap before the first VMA in the range.
This patch (of 10):
We const-ify the vrm flags parameter to indicate this will never change.
We rename resize_is_valid() to remap_is_valid(), as this function does not
only apply to cases where we resize, so it's simply confusing to refer to
that here.
We remove the BUG() from mremap_at(), as we should not BUG() unless we are
certain it'll result in system instability.
We rename vrm_charge() to vrm_calc_charge() to make it clear this simply
calculates the charged number of pages rather than actually adjusting any
state.
We update the comment for vrm_implies_new_addr() to explain that
MREMAP_DONTUNMAP does not require a set address, but will always be moved.
Additionally consistently use 'res' rather than 'ret' for result values.
No functional change intended.
Link: https://lkml.kernel.org/r/cover.1752770784.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/d35ad8ce6b2c33b2f2f4ef7ec415f04a35cba34f.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-17 17:55:51 +01:00
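As a userspace illustration of the capability this series adds, the sketch
below creates two adjacent anonymous VMAs (kept distinct by differing
protections) and moves both with a single MREMAP_FIXED call. This is a hedged
example: it is only expected to succeed on kernels carrying this series, while
older kernels typically fail such a multi-VMA move with EFAULT.
===
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

#define PG 4096UL

int main(void)
{
	/* Two adjacent anonymous mappings; the differing protections keep
	 * them as two separate VMAs rather than letting them merge. */
	char *a = mmap(NULL, 4 * PG, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (a == MAP_FAILED)
		perror("mmap"), exit(1);
	if (mprotect(a + 2 * PG, 2 * PG, PROT_READ))
		perror("mprotect"), exit(1);

	/* Reserve a destination range, then move both VMAs in one call;
	 * MREMAP_FIXED unmaps whatever is at the target first. */
	char *dst = mmap(NULL, 4 * PG, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (dst == MAP_FAILED)
		perror("mmap dst"), exit(1);

	if (mremap(a, 4 * PG, 4 * PG,
		   MREMAP_MAYMOVE | MREMAP_FIXED, dst) == MAP_FAILED)
		perror("mremap (multi-VMA move)"), exit(1);

	printf("moved both VMAs to %p\n", dst);
	return 0;
}
===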

/*
 * Will a new address definitely be assigned? This is the case either if the
 * user specifies it via MREMAP_FIXED, or if MREMAP_DONTUNMAP is used,
 * indicating we will always determine a target address.
 */
static bool vrm_implies_new_addr(struct vma_remap_struct *vrm)
{
        return vrm->flags & (MREMAP_FIXED | MREMAP_DONTUNMAP);
}

/*
 * Find an unmapped area for the requested vrm->new_addr.
 *
 * If MREMAP_FIXED then this is equivalent to a MAP_FIXED mmap() call. If only
 * MREMAP_DONTUNMAP is set, then this is equivalent to providing a hint to
 * mmap(), otherwise this is equivalent to mmap() specifying a NULL address.
 *
 * Returns 0 on success (with vrm->new_addr updated), or an error code upon
 * failure.
 */
static unsigned long vrm_set_new_addr(struct vma_remap_struct *vrm)
{
        struct vm_area_struct *vma = vrm->vma;
        unsigned long map_flags = 0;
        /* Page Offset _into_ the VMA. */
        pgoff_t internal_pgoff = (vrm->addr - vma->vm_start) >> PAGE_SHIFT;
        pgoff_t pgoff = vma->vm_pgoff + internal_pgoff;
        unsigned long new_addr = vrm_implies_new_addr(vrm) ? vrm->new_addr : 0;
        unsigned long res;

        if (vrm->flags & MREMAP_FIXED)
                map_flags |= MAP_FIXED;
        if (vma->vm_flags & VM_MAYSHARE)
                map_flags |= MAP_SHARED;

        res = get_unmapped_area(vma->vm_file, new_addr, vrm->new_len, pgoff,
                                map_flags);
        if (IS_ERR_VALUE(res))
                return res;

        vrm->new_addr = res;
        return 0;
}
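
/*
 * Illustrative userspace sketch, not part of mm/mremap.c (it would live in its
 * own program): how the cases described above look from the caller's side.
 * MREMAP_FIXED places the mapping exactly at the requested address, while
 * MREMAP_DONTUNMAP without MREMAP_FIXED treats the address purely as a hint
 * and leaves the old range mapped, though emptied of its pages. Passing a
 * NULL hint here is an assumption that we have no placement preference.
 */
#if 0	/* example only, not built */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	size_t len = 2 * 4096;
	char *old = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (old == MAP_FAILED)
		perror("mmap"), exit(1);
	old[0] = 'x';

	/* Hint only: the kernel picks the final address, old stays mapped. */
	char *new = mremap(old, len, len,
			   MREMAP_MAYMOVE | MREMAP_DONTUNMAP, NULL);
	if (new == MAP_FAILED)
		perror("mremap"), exit(1);

	printf("pages moved to %p; %p remains mapped (now empty)\n", new, old);
	return 0;
}
#endif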

/*
 * Keep track of pages which have been added to the memory mapping. If the VMA
 * is accounted, also check to see if there is sufficient memory.
 *
 * Returns true on success, false if insufficient memory to charge.
 */
static bool vrm_calc_charge(struct vma_remap_struct *vrm)
{
        unsigned long charged;

        if (!(vrm->vma->vm_flags & VM_ACCOUNT))
                return true;

        /*
         * If we don't unmap the old mapping, then we account the entirety of
         * the length of the new one. Otherwise it's just the delta in size.
         */
        if (vrm->flags & MREMAP_DONTUNMAP)
                charged = vrm->new_len >> PAGE_SHIFT;
        else
                charged = vrm->delta >> PAGE_SHIFT;

        /* This accounts 'charged' pages of memory. */
        if (security_vm_enough_memory_mm(current->mm, charged))
                return false;

        vrm->charged = charged;
        return true;
}

/*
 * an error has occurred so we will not be using vrm->charged memory. Unaccount
 * this memory if the VMA is accounted.
 */
static void vrm_uncharge(struct vma_remap_struct *vrm)
{
        if (!(vrm->vma->vm_flags & VM_ACCOUNT))
                return;

        vm_unacct_memory(vrm->charged);
        vrm->charged = 0;
}

/*
 * Update mm exec_vm, stack_vm, data_vm, and locked_vm fields as needed to
 * account for 'bytes' memory used, and if locked, indicate this in the VRM so
 * we can handle this correctly later.
 */
static void vrm_stat_account(struct vma_remap_struct *vrm,
                             unsigned long bytes)
{
        unsigned long pages = bytes >> PAGE_SHIFT;
        struct mm_struct *mm = current->mm;
        struct vm_area_struct *vma = vrm->vma;

        vm_stat_account(mm, vma->vm_flags, pages);
        if (vma->vm_flags & VM_LOCKED)
                mm->locked_vm += pages;
}

/*
 * Perform checks before attempting to write a VMA prior to it being
 * moved.
 */
mm/mremap: complete refactor of move_vma()
We invoke ksm_madvise() with an intentionally dummy flags field, so there is
no need to pass one around.
Additionally, the code tries to be 'clever' with account_start,
account_end, using these to both check that vma->vm_start != 0 and that we
ought to account the newly split portion of VMA post-move, either before
or after it.
We need to do this because we intentionally removed VM_ACCOUNT on the VMA
prior to unmapping, so we don't erroneously unaccount memory (we have
already calculated the correct amount to account and accounted it; any
subsequent subtraction would be incorrect).
This patch significantly expands the comment (from 2002!) about
'concealing' the flag to make it abundantly clear what's going on, as well
as adding and expanding a number of other comments also.
We can remove account_start, account_end by instead tracking when we
account (i.e. vma->vm_flags has the VM_ACCOUNT flag set, and this is not
an MREMAP_DONTUNMAP operation), and figuring out when to reinstate the
VM_ACCOUNT flag on prior/subsequent VMAs separately.
We additionally break the function into logical pieces and attack the very
confusing error handling logic (where, for instance, new_addr is set to
err).
After this change the code is considerably more readable and easy to
manipulate.
Link: https://lkml.kernel.org/r/e7eaa307e444ba2b04d94fd985c907c8e896f893.1741639347.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-10 20:50:38 +00:00
static unsigned long prep_move_vma(struct vma_remap_struct *vrm)
{
        unsigned long err = 0;
        struct vm_area_struct *vma = vrm->vma;
        unsigned long old_addr = vrm->addr;
        unsigned long old_len = vrm->old_len;
        vm_flags_t dummy = vma->vm_flags;

        /*
         * We'd prefer to avoid failure later on in do_munmap:
         * which may split one vma into three before unmapping.
         */
        if (current->mm->map_count >= sysctl_max_map_count - 3)
                return -ENOMEM;

        if (vma->vm_ops && vma->vm_ops->may_split) {
                if (vma->vm_start != old_addr)
                        err = vma->vm_ops->may_split(vma, old_addr);
                if (!err && vma->vm_end != old_addr + old_len)
                        err = vma->vm_ops->may_split(vma, old_addr + old_len);
                if (err)
                        return err;
        }

        /*
         * Advise KSM to break any KSM pages in the area to be moved:
         * it would be confusing if they were to turn up at the new
         * location, where they happen to coincide with different KSM
         * pages recently unmapped. But leave vma->vm_flags as it was,
         * so KSM can come around to merge on vma and new_vma afterwards.
         */
        err = ksm_madvise(vma, old_addr, old_addr + old_len,
                          MADV_UNMERGEABLE, &dummy);
        if (err)
                return err;

        return 0;
}

/*
 * Unmap the source VMA for a VMA move, turning it from a copy into a move,
 * being careful to ensure we do not underflow the memory account while doing
 * so if this is an accountable move.
 *
 * This is best-effort: if we fail to unmap, we simply try to correct the
 * accounting and exit.
 */
static void unmap_source_vma(struct vma_remap_struct *vrm)
{
        struct mm_struct *mm = current->mm;
        unsigned long addr = vrm->addr;
        unsigned long len = vrm->old_len;
        struct vm_area_struct *vma = vrm->vma;
	VMA_ITERATOR(vmi, mm, addr);
	int err;
	unsigned long vm_start;
	unsigned long vm_end;
	/*
	 * It might seem odd that we check for MREMAP_DONTUNMAP here, given this
	 * function implies that we unmap the original VMA, which seems
	 * contradictory.
	 *
	 * However, this occurs when this operation was attempted and an error
	 * arose, in which case we _do_ wish to unmap the _new_ VMA, which means
	 * we actually _do_ want it to be unaccounted.
	 */
	bool accountable_move = (vma->vm_flags & VM_ACCOUNT) &&
		!(vrm->flags & MREMAP_DONTUNMAP);
	/*
	 * So we perform a trick here to prevent incorrect accounting. Any merge
	 * or new VMA allocation performed in copy_vma() does not adjust
	 * accounting, it is expected that callers handle this.
	 *
	 * And indeed we already have, accounting appropriately in the case of
	 * both in vrm_charge().
	 *
	 * However, when we unmap the existing VMA (to effect the move), this
	 * code will, if the VMA has VM_ACCOUNT set, attempt to unaccount
	 * removed pages.
	 *
	 * To avoid this we temporarily clear this flag, reinstating on any
	 * portions of the original VMA that remain.
	 */
	if (accountable_move) {
		vm_flags_clear(vma, VM_ACCOUNT);
		/* We are about to split vma, so store the start/end. */
		vm_start = vma->vm_start;
		vm_end = vma->vm_end;
	}

	err = do_vmi_munmap(&vmi, mm, addr, len, vrm->uf_unmap, /* unlock= */false);
	vrm->vma = NULL; /* Invalidated. */
	vrm->vmi_needs_invalidate = true;
	if (err) {
		/* OOM: unable to split vma, just get accounts right */
		vm_acct_memory(len >> PAGE_SHIFT);
		return;
	}

	/*
	 * If we mremap() from a VMA like this:
	 *
	 *    addr  end
	 *     |     |
	 *     v     v
	 * |-------------|
	 * |             |
	 * |-------------|
	 *
	 * Having cleared VM_ACCOUNT from the whole VMA, after we unmap above
	 * we'll end up with:
	 *
	 *    addr  end
	 *     |     |
	 *     v     v
	 * |---|     |---|
	 * | A |     | B |
	 * |---|     |---|
	 *
	 * The VMI is still pointing at addr, so vma_prev() will give us A, and
	 * a subsequent or lone vma_next() will give us B.
	 *
	 * do_vmi_munmap() will have restored the VMI back to addr.
	 */
	if (accountable_move) {
		unsigned long end = addr + len;

		if (vm_start < addr) {
			struct vm_area_struct *prev = vma_prev(&vmi);

			vm_flags_set(prev, VM_ACCOUNT); /* Acquires VMA lock. */
		}

		if (vm_end > end) {
			struct vm_area_struct *next = vma_next(&vmi);

			vm_flags_set(next, VM_ACCOUNT); /* Acquires VMA lock. */
		}
	}
}

/*
 * Copy vrm->vma over to vrm->new_addr possibly adjusting size as part of the
 * process. Additionally handle an error occurring on moving of page tables,
 * where we reset vrm state to cause unmapping of the new VMA.
 *
 * Outputs the newly installed VMA to new_vma_ptr. Returns 0 on success or an
 * error code.
 */
static int copy_vma_and_data(struct vma_remap_struct *vrm,
			     struct vm_area_struct **new_vma_ptr)
{
	unsigned long internal_offset = vrm->addr - vrm->vma->vm_start;
	unsigned long internal_pgoff = internal_offset >> PAGE_SHIFT;
	unsigned long new_pgoff = vrm->vma->vm_pgoff + internal_pgoff;
	unsigned long moved_len;
	struct vm_area_struct *vma = vrm->vma;
	struct vm_area_struct *new_vma;
	int err = 0;
	PAGETABLE_MOVE(pmc, NULL, NULL, vrm->addr, vrm->new_addr, vrm->old_len);
	new_vma = copy_vma(&vma, vrm->new_addr, vrm->new_len, new_pgoff,
			   &pmc.need_rmap_locks);
	if (!new_vma) {
		vrm_uncharge(vrm);
		*new_vma_ptr = NULL;
		return -ENOMEM;
	}

	/* By merging, we may have invalidated any iterator in use. */
	if (vma != vrm->vma)
		vrm->vmi_needs_invalidate = true;

	vrm->vma = vma;
	pmc.old = vma;
	pmc.new = new_vma;

	moved_len = move_page_tables(&pmc);
	if (moved_len < vrm->old_len)
		err = -ENOMEM;
	else if (vma->vm_ops && vma->vm_ops->mremap)
		err = vma->vm_ops->mremap(new_vma);

	if (unlikely(err)) {
		PAGETABLE_MOVE(pmc_revert, new_vma, vma, vrm->new_addr,
			       vrm->addr, moved_len);

		/*
		 * On error, move entries back from new area to old,
		 * which will succeed since page tables still there,
		 * and then proceed to unmap new area instead of old.
		 */
		pmc_revert.need_rmap_locks = true;
		move_page_tables(&pmc_revert);

		vrm->vma = new_vma;
		vrm->old_len = vrm->new_len;
		vrm->addr = vrm->new_addr;
	} else {
		mremap_userfaultfd_prep(new_vma, vrm->uf);
	}

	fixup_hugetlb_reservations(vma);
	*new_vma_ptr = new_vma;
	return err;
}

/*
 * Perform final tasks for MREMAP_DONTUNMAP operation, clearing mlock() and
 * account flags on remaining VMA by convention (it cannot be mlock()'d any
 * longer, as pages in range are no longer mapped), and removing anon_vma_chain
 * links from it (if the entire VMA was copied over).
 */
static void dontunmap_complete(struct vma_remap_struct *vrm,
			       struct vm_area_struct *new_vma)
{
	unsigned long start = vrm->addr;
	unsigned long end = vrm->addr + vrm->old_len;
	unsigned long old_start = vrm->vma->vm_start;
	unsigned long old_end = vrm->vma->vm_end;

	/*
	 * We always clear VM_LOCKED[ONFAULT] | VM_ACCOUNT on the old
	 * vma.
	 */
	vm_flags_clear(vrm->vma, VM_LOCKED_MASK | VM_ACCOUNT);

	/*
	 * anon_vma links of the old vma are no longer needed after its page
	 * table has been moved.
	 */
	if (new_vma != vrm->vma && start == old_start && end == old_end)
		unlink_anon_vmas(vrm->vma);

	/* Because we won't unmap we don't need to touch locked_vm. */
}

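To make the MREMAP_DONTUNMAP semantics handled by dontunmap_complete() concrete, a minimal user-space sketch (illustrative only, assuming Linux 5.7+; the fallback define covers libcs that predate the flag): the pages move to the kernel-chosen new address, while the old private anonymous mapping stays in place, empty and with its mlock/accounting flags cleared.

#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MREMAP_DONTUNMAP
#define MREMAP_DONTUNMAP 4	/* uapi value; older libcs may not define it. */
#endif

int main(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	char *old = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	assert(old != MAP_FAILED);
	memset(old, 0x5a, len);

	/*
	 * Move the pages but keep the source mapping: 'old' remains mapped
	 * (now empty), while the contents become visible at 'new'.
	 */
	char *new = mremap(old, len, len, MREMAP_MAYMOVE | MREMAP_DONTUNMAP);
	assert(new != MAP_FAILED);
	assert(new[0] == (char)0x5a);
	assert(old[0] == 0);	/* old faults in fresh zero pages */

	return 0;
}
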
static unsigned long move_vma(struct vma_remap_struct *vrm)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *new_vma;
	unsigned long hiwater_vm;
	int err;

	err = prep_move_vma(vrm);
	if (err)
		return err;

	/*
	 * If accounted, determine the number of bytes the operation will
	 * charge.
	 */
	if (!vrm_calc_charge(vrm))
		return -ENOMEM;

	/* We don't want racing faults. */
	vma_start_write(vrm->vma);

	/* Perform copy step. */
	err = copy_vma_and_data(vrm, &new_vma);
	/*
	 * If we established the copied-to VMA, we attempt to recover from the
	 * error by setting the destination VMA to the source VMA and unmapping
	 * it below.
	 */
	if (err && !new_vma)
		return err;

	/*
	 * If we failed to move page tables we still do total_vm increment
	 * since do_munmap() will decrement it by old_len == new_len.
	 *
	 * Since total_vm is about to be raised artificially high for a
	 * moment, we need to restore high watermark afterwards: if stats
	 * are taken meanwhile, total_vm and hiwater_vm appear too high.
	 * If this were a serious issue, we'd add a flag to do_munmap().
	 */
	hiwater_vm = mm->hiwater_vm;

	vrm_stat_account(vrm, vrm->new_len);
	if (unlikely(!err && (vrm->flags & MREMAP_DONTUNMAP)))
		dontunmap_complete(vrm, new_vma);
	else
		unmap_source_vma(vrm);

	mm->hiwater_vm = hiwater_vm;

	return err ? (unsigned long)err : vrm->new_addr;
}

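Before the shrink helper that follows, a minimal user-space sketch (illustrative only) of the operation it implements: when old_len > new_len, mremap() shrinks the mapping in place and the tail is simply unmapped.

#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, 4 * page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	assert(p != MAP_FAILED);

	/*
	 * old_len > new_len: no flags are needed, the mapping is shrunk in
	 * place and the tail [p + page, p + 4 * page) is unmapped.
	 */
	void *q = mremap(p, 4 * page, page, 0);
	assert(q == p);

	return 0;
}
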
/*
mm/mremap: introduce and use vma_remap_struct threaded state
A number of mremap() calls both pass around and modify a large number of
parameters, making the code less readable and often repeatedly having to
determine things such as VMA, size delta, and more.
Avoid this by using the common pattern of passing a state object through
the operation, updating it as we go. We introduce the vma_remap_struct or
'VRM' for this purpose.
This also gives us the ability to accumulate further state through the
operation that would otherwise require awkward and error-prone pointer
passing.
We can also now trivially define helper functions that operate on a VRM
object.
This pattern has proven itself to be very powerful when implemented for
VMA merge, VMA unmapping and memory mapping operations, so it is
battle-tested and functional.
We both introduce the data structure and use it, introducing helper
functions as needed to make things readable, we move some state such as
mmap lock and mlock() status to the VRM, we introduce a means of
classifying the type of mremap() operation and de-duplicate the
get_unmapped_area() lookup.
We also neatly thread userfaultfd state throughout the operation.
Note that there is further refactoring to be done, chiefly adjust
move_vma() to accept a VRM parameter. We defer this as there is
pre-requisite work required to be able to do so which we will do in a
subsequent patch.
Link: https://lkml.kernel.org/r/27951739dc83b2b1523b81fa9c009ba348388d40.1741639347.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-10 20:50:36 +00:00
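A schematic, userspace-compilable sketch of the 'threaded state' pattern the
commit describes: one struct carries both the caller-supplied request and the
state accumulated during the operation, and helpers take only that struct.
The field set and the classification helper are illustrative stand-ins, not
the kernel's actual vma_remap_struct.

#include <stdbool.h>
#include <stddef.h>

struct vma_remap_state {
	/* Caller-supplied parameters. */
	unsigned long addr, old_len, new_len, flags, new_addr;
	/* Derived / accumulated state, updated as the operation proceeds. */
	unsigned long delta;
	bool mmap_locked;
	bool mlocked;
	void *vma;		/* stand-in for struct vm_area_struct * */
};

/* Classify the operation once, instead of recomputing it in every callee. */
enum remap_type { REMAP_NO_RESIZE, REMAP_SHRINK, REMAP_EXPAND };

static enum remap_type classify(const struct vma_remap_state *st)
{
	if (st->new_len == st->old_len)
		return REMAP_NO_RESIZE;
	return st->new_len < st->old_len ? REMAP_SHRINK : REMAP_EXPAND;
}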
|
|
|
* The user has requested that the VMA be shrunk (i.e., old_len > new_len), so
|
|
|
|
* execute this, optionally dropping the mmap lock when we do so.
|
2024-10-18 13:41:14 -04:00
|
|
|
*
|
|
|
|
* In both cases this invalidates the VMA, however if we don't drop the lock,
|
|
|
|
* then load the correct VMA into vrm->vma afterwards.
|
|
|
|
*/
|
|
|
|
static unsigned long shrink_vma(struct vma_remap_struct *vrm,
|
|
|
|
bool drop_lock)
|
|
|
|
{
|
|
|
|
struct mm_struct *mm = current->mm;
|
|
|
|
unsigned long unmap_start = vrm->addr + vrm->new_len;
|
|
|
|
unsigned long unmap_bytes = vrm->delta;
|
|
|
|
unsigned long res;
|
|
|
|
VMA_ITERATOR(vmi, mm, unmap_start);
|
|
|
|
|
|
|
|
VM_BUG_ON(vrm->remap_type != MREMAP_SHRINK);
|
|
|
|
|
|
|
|
res = do_vmi_munmap(&vmi, mm, unmap_start, unmap_bytes,
|
|
|
|
vrm->uf_unmap, drop_lock);
|
|
|
|
vrm->vma = NULL; /* Invalidated. */
|
|
|
|
if (res)
|
|
|
|
return res;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we've not dropped the lock, then we should reload the VMA to
|
|
|
|
* replace the invalidated VMA with the one that may have now been
|
|
|
|
* split.
|
|
|
|
*/
|
|
|
|
if (drop_lock) {
|
|
|
|
vrm->mmap_locked = false;
|
|
|
|
} else {
|
|
|
|
vrm->vma = vma_lookup(mm, vrm->addr);
|
|
|
|
if (!vrm->vma)
|
|
|
|
return -EFAULT;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* mremap_to() - remap a vma to a new location.
|
2024-10-18 13:41:14 -04:00
|
|
|
* Returns: The new address of the vma or an error.
|
|
|
|
*/
|
|
|
|
static unsigned long mremap_to(struct vma_remap_struct *vrm)
|
2009-11-24 07:28:07 -05:00
|
|
|
{
|
|
|
|
struct mm_struct *mm = current->mm;
|
|
|
|
unsigned long err;
|
2009-11-24 07:28:07 -05:00
|
|
|
|
|
|
|
if (vrm->flags & MREMAP_FIXED) {
|
mseal: add mseal syscall
The new mseal() is a syscall on 64-bit CPUs, with the following signature:
int mseal(void *addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.
mseal() blocks the following operations for the given memory range.
1> Unmapping, moving to another location, and shrinking the size,
via munmap() and mremap(); these can leave an empty space that can
then be replaced with a VMA carrying a new set of attributes.
2> Moving or expanding a different VMA into the current location,
via mremap().
3> Modifying a VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific
risks to sealed VMAs. It is included anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed VMA.
5> mprotect() and pkey_mprotect().
6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous
memory, when users don't have write permission to the memory. Those
behaviors can alter region contents by discarding pages, effectively a
memset(0) for anonymous memory.
The following input during the RFC was incorporated into this patch:
Jann Horn: raising awareness and providing valuable insights on the
destructive madvise operations.
Linus Torvalds: assisting in defining system call signature and scope.
Liam R. Howlett: perf optimization.
Theo de Raadt: sharing the experiences and insight gained from
implementing mimmutable() in OpenBSD.
Finally, the idea that inspired this patch comes from Stephen Röttger's
work in Chrome V8 CFI.
[jeffxu@chromium.org: add branch prediction hint, per Pedro]
Link: https://lkml.kernel.org/r/20240423192825.1273679-2-jeffxu@chromium.org
Link: https://lkml.kernel.org/r/20240415163527.626541-3-jeffxu@chromium.org
Signed-off-by: Jeff Xu <jeffxu@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jeff Xu <jeffxu@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jorge Lucangeli Obes <jorgelo@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Stephen Röttger <sroettger@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Amer Al Shanawany <amer.shanawany@gmail.com>
Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-15 16:35:21 +00:00
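For reference, a minimal userspace sketch of calling mseal() as described
above. It assumes a kernel that provides the syscall; the fallback syscall
number 462 is the x86-64 value and is only used if the headers do not define
__NR_mseal.

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_mseal
#define __NR_mseal 462		/* x86-64 syscall number; assumption */
#endif

int main(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Seal the mapping; flags are reserved and must be zero. */
	if (syscall(__NR_mseal, p, len, 0) != 0) {
		perror("mseal");
		return 1;
	}

	/* Attempts to change or unmap the sealed range now fail. */
	if (mprotect(p, len, PROT_READ) != 0)
		printf("mprotect on sealed VMA failed: %s\n", strerror(errno));
	if (munmap(p, len) != 0)
		printf("munmap on sealed VMA failed: %s\n", strerror(errno));
	return 0;
}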
|
|
|
/*
|
|
|
|
* In mremap_to().
|
|
|
|
 * The VMA is moved to the dst address, so munmap dst first.
|
|
|
|
* do_munmap will check if dst is sealed.
|
|
|
|
*/
|
|
|
|
err = do_munmap(mm, vrm->new_addr, vrm->new_len,
|
|
|
|
vrm->uf_unmap_early);
|
|
|
|
vrm->vma = NULL; /* Invalidated. */
|
mm/mremap: permit mremap() move of multiple VMAs
Historically we've made it a uAPI requirement that mremap() may only
operate on a single VMA at a time.
For instances where VMAs need to be resized, this makes sense, as it
becomes very difficult to determine what a user actually wants should they
indicate a desire to expand or shrink the size of multiple VMAs (truncate?
Adjust sizes individually? Some other strategy?).
However, in instances where a user is moving VMAs, it is restrictive to
disallow this.
This is especially the case when an anonymous mapping being remapped may or
may not be mergeable, depending on whether VMAs have or have not been
faulted, due to anon_vma assignment and folio index alignment with
vma->vm_pgoff.
Often this can result in surprising impact where a moved region is
faulted, then moved back and a user fails to observe a merge from
otherwise compatible, adjacent VMAs.
This change allows such cases to work without the user having to be
cognizant of whether a prior mremap() move or other VMA operations has
resulted in VMA fragmentation.
We only permit this for mremap() operations that do NOT change the size of
the VMA and DO specify MREMAP_MAYMOVE | MREMAP_FIXED.
Should no VMA exist in the range, -EFAULT is returned as usual.
If a VMA move spans a single VMA - then there is no functional change.
Otherwise, we place additional requirements upon VMAs:
* They must not have a userfaultfd context associated with them - this
requires dropping the lock to notify users, and we want to perform the
operation with the mmap write lock held throughout.
* If file-backed, they cannot have a custom get_unmapped_area handler -
this might result in MREMAP_FIXED not being honoured, which could result
in unexpected positioning of VMAs in the moved region.
There may be gaps in the range of VMAs that are moved:
X Y X Y
<---> <-> <---> <->
|-------| |-----| |-----| |-------| |-----| |-----|
| A | | B | | C | ---> | A' | | B' | | C' |
|-------| |-----| |-----| |-------| |-----| |-----|
addr new_addr
The move will preserve the gaps between each VMA.
Note that any failures encountered will result in a partial move. Since
an mremap() can fail at any time, this might result in only some of the
VMAs being moved.
Note that failures are very rare and typically require an out-of-memory
condition or a mapping limit to be hit, assuming the VMAs being
moved are valid.
We don't try to assess ahead of time whether VMAs are valid according to
the multi VMA rules, as it would be rather unusual for a user to mix
uffd-enabled VMAs and/or VMAs which map unusual driver mappings that
specify custom get_unmapped_area() handlers in an aggregate operation.
So we optimise for the far, far more likely case of the operation being
entirely permissible.
In the case of the move of a single VMA, the above conditions are
permitted. This makes the behaviour identical for a single VMA as before.
Link: https://lkml.kernel.org/r/8cab2f2c202c4208bdfdb562635748bea6eb37bf.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-17 17:55:59 +01:00
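A hedged userspace sketch of the multi-VMA move described above: the source
range is deliberately fragmented into three VMAs by varying protections, then
the whole range is moved in one size-preserving mremap(MREMAP_MAYMOVE |
MREMAP_FIXED) call. It assumes a kernel that includes this change; older
kernels fail the call for ranges spanning more than one VMA.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = 3 * page;

	char *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/* Reserve a destination so MREMAP_FIXED has a known target address. */
	char *dst = mmap(NULL, len, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED || dst == MAP_FAILED)
		return 1;

	/* Changing the middle page's protection splits the source into
	 * three distinct VMAs. */
	mprotect(src + page, page, PROT_READ);

	/* Size-preserving MREMAP_MAYMOVE | MREMAP_FIXED may now span them all. */
	char *moved = mremap(src, len, len,
			     MREMAP_MAYMOVE | MREMAP_FIXED, dst);
	if (moved == MAP_FAILED) {
		perror("mremap");
		return 1;
	}
	printf("moved three VMAs to %p\n", (void *)moved);
	return 0;
}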
|
|
|
vrm->vmi_needs_invalidate = true;
|
|
|
|
if (err)
|
|
|
|
return err;
|
2009-11-24 07:28:07 -05:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we remap a portion of a VMA elsewhere in the same VMA,
|
|
|
|
* this can invalidate the old VMA. Reset.
|
|
|
|
*/
|
|
|
|
vrm->vma = vma_lookup(mm, vrm->addr);
|
|
|
|
if (!vrm->vma)
|
|
|
|
return -EFAULT;
|
2009-11-24 07:28:07 -05:00
|
|
|
}
|
|
|
|
|
|
|
|
if (vrm->remap_type == MREMAP_SHRINK) {
|
|
|
|
err = shrink_vma(vrm, /* drop_lock= */false);
|
|
|
|
if (err)
|
|
|
|
return err;
|
2009-11-24 07:28:07 -05:00
|
|
|
|
|
|
|
/* Set up for the move now that the shrink has been executed. */
|
|
|
|
vrm->old_len = vrm->new_len;
|
mm/mremap: add MREMAP_DONTUNMAP to mremap()
When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is set,
the source mapping will not be removed. The remap operation will be
performed as it would have been normally by moving over the page tables to
the new mapping. The old vma will have any locked flags cleared, have no
pagetables, and any userfaultfds that were watching that range will
continue watching it.
For a mapping that is shared or not anonymous, MREMAP_DONTUNMAP will cause
the mremap() call to fail. Because MREMAP_DONTUNMAP always results in
moving a VMA, you MUST use the MREMAP_MAYMOVE flag. It is not possible to
resize a VMA while also moving it with MREMAP_DONTUNMAP, so old_len must
always equal new_len; otherwise the call returns -EINVAL.
We hope to use this in Chrome OS where with userfaultfd we could write an
anonymous mapping to disk without having to STOP the process or worry
about VMA permission changes.
This feature also has a use case in Android, Lokesh Gidra has said that
"As part of using userfaultfd for GC, We'll have to move the physical
pages of the java heap to a separate location. For this purpose mremap
will be used. Without the MREMAP_DONTUNMAP flag, when I mremap the java
heap, its virtual mapping will be removed as well. Therefore, we'll
require performing mmap immediately after. This is not only time
consuming but also opens a time window where a native thread may call mmap
and reserve the java heap's address range for its own usage. This flag
solves the problem."
[bgeffon@google.com: v6]
Link: http://lkml.kernel.org/r/20200218173221.237674-1-bgeffon@google.com
[bgeffon@google.com: v7]
Link: http://lkml.kernel.org/r/20200221174248.244748-1-bgeffon@google.com
Signed-off-by: Brian Geffon <bgeffon@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Michael S . Tsirkin" <mst@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Deacon <will@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Jesse Barnes <jsbarnes@google.com>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Link: http://lkml.kernel.org/r/20200207201856.46070-1-bgeffon@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-01 21:09:17 -07:00
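A short userspace sketch of MREMAP_DONTUNMAP as described above: the mapping
is moved, the old range stays mapped (and faults in as zeroes), and old_len
must equal new_len. It assumes the libc exposes MREMAP_DONTUNMAP (Linux 5.7+);
the fallback value 4 comes from the uapi headers.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MREMAP_DONTUNMAP
#define MREMAP_DONTUNMAP 4	/* uapi value; fallback for older libcs */
#endif

int main(void)
{
	size_t len = 2 * (size_t)sysconf(_SC_PAGESIZE);
	char *old = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (old == MAP_FAILED)
		return 1;
	old[0] = 'x';

	/* MREMAP_DONTUNMAP requires MREMAP_MAYMOVE and old_len == new_len. */
	char *new = mremap(old, len, len, MREMAP_MAYMOVE | MREMAP_DONTUNMAP);
	if (new == MAP_FAILED) {
		perror("mremap");
		return 1;
	}

	/* The page contents moved along with the page tables ... */
	printf("moved: new[0] = %c\n", new[0]);
	/* ... while the old range remains mapped but reads back as zeroes. */
	printf("left behind: old[0] = %d\n", old[0]);
	return 0;
}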
|
|
|
}
|
|
|
|
|
|
|
|
/* MREMAP_DONTUNMAP expands by old_len since old_len == new_len */
|
|
|
|
if (vrm->flags & MREMAP_DONTUNMAP) {
|
|
|
|
vm_flags_t vm_flags = vrm->vma->vm_flags;
|
|
|
|
unsigned long pages = vrm->old_len >> PAGE_SHIFT;
|
2009-12-03 15:23:11 -05:00
|
|
|
|
|
|
|
if (!may_expand_vm(mm, vm_flags, pages))
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
2009-11-24 08:43:52 -05:00
|
|
|
|
|
|
|
err = vrm_set_new_addr(vrm);
|
|
|
|
if (err)
|
|
|
|
return err;
|
|
|
|
|
2025-03-10 20:50:37 +00:00
|
|
|
return move_vma(vrm);
|
2009-11-24 07:28:07 -05:00
|
|
|
}
|
|
|
|
|
2009-11-24 07:43:18 -05:00
|
|
|
static int vma_expandable(struct vm_area_struct *vma, unsigned long delta)
|
|
|
|
{
|
2009-11-24 08:25:18 -05:00
|
|
|
unsigned long end = vma->vm_end + delta;
|
2022-09-06 19:49:03 +00:00
|
|
|
|
2009-12-03 15:23:11 -05:00
|
|
|
if (end < vma->vm_end) /* overflow */
|
2009-11-24 08:25:18 -05:00
|
|
|
return 0;
|
2022-09-06 19:49:03 +00:00
|
|
|
if (find_vma_intersection(vma->vm_mm, vma->vm_end, end))
|
2009-11-24 08:25:18 -05:00
|
|
|
return 0;
|
|
|
|
if (get_unmapped_area(NULL, vma->vm_start, end - vma->vm_start,
|
|
|
|
0, MAP_FIXED) & ~PAGE_MASK)
|
2009-11-24 07:43:18 -05:00
|
|
|
return 0;
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Determine whether we are actually able to execute an in-place expansion. */
|
|
|
|
static bool vrm_can_expand_in_place(struct vma_remap_struct *vrm)
|
2025-03-10 20:50:35 +00:00
|
|
|
{
|
|
|
|
/* Number of bytes from vrm->addr to end of VMA. */
|
|
|
|
unsigned long suffix_bytes = vrm->vma->vm_end - vrm->addr;
|
|
|
|
|
|
|
|
/* If end of range aligns to end of VMA, we can just expand in-place. */
|
|
|
|
if (suffix_bytes != vrm->old_len)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/* Check whether this is feasible. */
|
|
|
|
if (!vma_expandable(vrm->vma, vrm->delta))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
return true;
|
2025-03-10 20:50:35 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We know we can expand the VMA in-place by delta pages, so do so.
|
|
|
|
*
|
|
|
|
* If we discover the VMA is locked, update mm_struct statistics accordingly and
|
|
|
|
* indicate so to the caller.
|
|
|
|
*/
|
|
|
|
static unsigned long expand_vma_in_place(struct vma_remap_struct *vrm)
|
2025-03-10 20:50:35 +00:00
|
|
|
{
|
|
|
|
struct mm_struct *mm = current->mm;
|
|
|
|
struct vm_area_struct *vma = vrm->vma;
|
2025-03-10 20:50:35 +00:00
|
|
|
VMA_ITERATOR(vmi, mm, vma->vm_end);
|
|
|
|
|
mm/mremap: perform some simple cleanups
Patch series "mm/mremap: permit mremap() move of multiple VMAs", v4.
Historically we've made it a uAPI requirement that mremap() may only
operate on a single VMA at a time.
For instances where VMAs need to be resized, this makes sense, as it
becomes very difficult to determine what a user actually wants should they
indicate a desire to expand or shrink the size of multiple VMAs (truncate?
Adjust sizes individually? Some other strategy?).
However, in instances where a user is moving VMAs, it is restrictive to
disallow this.
This is especially the case when an anonymous mapping being remapped may or
may not be mergeable, depending on whether VMAs have or have not been
faulted, due to anon_vma assignment and folio index alignment with
vma->vm_pgoff.
Often this can result in surprising impact where a moved region is faulted,
then moved back and a user fails to observe a merge from otherwise
compatible, adjacent VMAs.
This change allows such cases to work without the user having to be
cognizant of whether a prior mremap() move or other VMA operations has
resulted in VMA fragmentation.
In order to do this, this series performs a large amount of refactoring,
most pertinently grouping sanity checks together, separating those that
check input parameters from those relating to VMAs.
We also simplify the post-mmap-lock-drop processing for uffd and mlock()'d
VMAs.
With this done, we can then fairly straightforwardly implement this
functionality.
This works exclusively for mremap() invocations which specify
MREMAP_FIXED. It is not compatible with VMAs which use userfaultfd, as the
notification of the userland fault handler would require us to drop the
mmap lock.
It is also not compatible with file-backed mappings with customised
get_unmapped_area() handlers as these may not honour MREMAP_FIXED.
The input and output address ranges must not overlap. We carefully
account for moves which would result in VMA iterator invalidation.
While there can be gaps between VMAs in the input range, there can be no
gap before the first VMA in the range.
This patch (of 10):
We const-ify the vrm flags parameter to indicate this will never change.
We rename resize_is_valid() to remap_is_valid(), as this function does not
only apply to cases where we resize, so it's simply confusing to refer to
that here.
We remove the BUG() from mremap_at(), as we should not BUG() unless we are
certain it'll result in system instability.
We rename vrm_charge() to vrm_calc_charge() to make it clear this simply
calculates the charged number of pages rather than actually adjusting any
state.
We update the comment for vrm_implies_new_addr() to explain that
MREMAP_DONTUNMAP does not require a set address, but will always be moved.
Additionally consistently use 'res' rather than 'ret' for result values.
No functional change intended.
Link: https://lkml.kernel.org/r/cover.1752770784.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/d35ad8ce6b2c33b2f2f4ef7ec415f04a35cba34f.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-17 17:55:51 +01:00
|
|
|
if (!vrm_calc_charge(vrm))
|
2025-03-10 20:50:37 +00:00
|
|
|
return -ENOMEM;
|
2025-03-10 20:50:35 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Function vma_merge_extend() is called on the
|
|
|
|
 * extension we are adding to the already existing vma;
|
|
|
|
* vma_merge_extend() will merge this extension with the
|
|
|
|
* already existing vma (expand operation itself) and
|
|
|
|
* possibly also with the next vma if it becomes
|
|
|
|
* adjacent to the expanded vma and otherwise
|
|
|
|
* compatible.
|
|
|
|
*/
|
2025-03-30 17:20:48 +01:00
|
|
|
vma = vma_merge_extend(&vmi, vma, vrm->delta);
|
2025-03-10 20:50:35 +00:00
|
|
|
if (!vma) {
|
2025-03-10 20:50:37 +00:00
|
|
|
vrm_uncharge(vrm);
|
2025-03-10 20:50:35 +00:00
|
|
|
return -ENOMEM;
|
|
|
|
}
|
2025-03-30 17:20:48 +01:00
|
|
|
vrm->vma = vma;
|
2025-03-10 20:50:35 +00:00
|
|
|
|
2025-03-10 20:50:37 +00:00
|
|
|
vrm_stat_account(vrm, vrm->delta);
|
2025-03-10 20:50:35 +00:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool align_hugetlb(struct vma_remap_struct *vrm)
|
2025-03-10 20:50:35 +00:00
|
|
|
{
|
|
|
|
struct hstate *h __maybe_unused = hstate_vma(vrm->vma);
|
2025-03-10 20:50:35 +00:00
|
|
|
|
|
|
|
vrm->old_len = ALIGN(vrm->old_len, huge_page_size(h));
|
|
|
|
vrm->new_len = ALIGN(vrm->new_len, huge_page_size(h));
|
2025-03-10 20:50:35 +00:00
|
|
|
|
|
|
|
/* addrs must be huge page aligned */
|
        if (vrm->addr & ~huge_page_mask(h))
                return false;
        if (vrm->new_addr & ~huge_page_mask(h))
                return false;

        /*
         * Don't allow remap expansion, because the underlying hugetlb
         * reservation is not yet capable to handle split reservation.
         */
        if (vrm->new_len > vrm->old_len)
                return false;

        return true;
}

/*
 * We are mremap()'ing without specifying a fixed address to move to, but are
 * requesting that the VMA's size be increased.
 *
 * Try to do so in-place, if this fails, then move the VMA to a new location to
 * action the change.
 */
static unsigned long expand_vma(struct vma_remap_struct *vrm)
{
        unsigned long err;

        /*
         * [addr, old_len) spans precisely to the end of the VMA, so try to
         * expand it in-place.
         */
        if (vrm_can_expand_in_place(vrm)) {
                err = expand_vma_in_place(vrm);
                if (err)
                        return err;

                /* OK we're done! */
                return vrm->addr;
        }

        /*
         * We weren't able to just expand or shrink the area,
         * we need to create a new one and move it.
         */

        /* We're not allowed to move the VMA, so error out. */
        if (!(vrm->flags & MREMAP_MAYMOVE))
                return -ENOMEM;

        /* Find a new location to move the VMA to. */
        err = vrm_set_new_addr(vrm);
        if (err)
                return err;

        return move_vma(vrm);
}

/*
 * Attempt to resize the VMA in-place, if we cannot, then move the VMA to the
 * first available address to perform the operation.
 */
static unsigned long mremap_at(struct vma_remap_struct *vrm)
{
        unsigned long res;

        switch (vrm->remap_type) {
        case MREMAP_INVALID:
                break;
        case MREMAP_NO_RESIZE:
                /* NO-OP CASE - resizing to the same size. */
                return vrm->addr;
        case MREMAP_SHRINK:
                /*
                 * SHRINK CASE. Can always be done in-place.
                 *
                 * Simply unmap the shrunken portion of the VMA. This does all
                 * the needed commit accounting, and we indicate that the mmap
                 * lock should be dropped.
                 */
                res = shrink_vma(vrm, /* drop_lock= */true);
                if (res)
                        return res;

                return vrm->addr;
        case MREMAP_EXPAND:
                return expand_vma(vrm);
        }

mm/mremap: perform some simple cleanups
Patch series "mm/mremap: permit mremap() move of multiple VMAs", v4.
Historically we've made it a uAPI requirement that mremap() may only
operate on a single VMA at a time.
For instances where VMAs need to be resized, this makes sense, as it
becomes very difficult to determine what a user actually wants should they
indicate a desire to expand or shrink the size of multiple VMAs (truncate?
Adjust sizes individually? Some other strategy?).
However, in instances where a user is moving VMAs, it is restrictive to
disallow this.
This is especially the case when anonymous mapping remap may or may not be
mergeable depending on whether VMAs have or have not been faulted due to
anon_vma assignment and folio index alignment with vma->vm_pgoff.
Often this can result in surprising impact where a moved region is faulted,
then moved back and a user fails to observe a merge from otherwise
compatible, adjacent VMAs.
This change allows such cases to work without the user having to be
cognizant of whether a prior mremap() move or other VMA operations have
resulted in VMA fragmentation.
In order to do this, this series performs a large amount of refactoring,
most pertinently by grouping sanity checks together and separating those that
check input parameters from those relating to VMAs.
We also simplify the post-mmap lock drop processing for uffd and mlock()'d
VMAs.
With this done, we can then fairly straightforwardly implement this
functionality.
This works exclusively for mremap() invocations which specify
MREMAP_FIXED. It is not compatible with VMAs which use userfaultfd, as the
notification of the userland fault handler would require us to drop the
mmap lock.
It is also not compatible with file-backed mappings with customised
get_unmapped_area() handlers as these may not honour MREMAP_FIXED.
The input and output address ranges must not overlap. We carefully
account for moves which would result in VMA iterator invalidation.
While there can be gaps between VMAs in the input range, there can be no
gap before the first VMA in the range.
This patch (of 10):
We const-ify the vrm flags parameter to indicate this will never change.
We rename resize_is_valid() to remap_is_valid(), as this function does not
only apply to cases where we resize, so it's simply confusing to refer to
that here.
We remove the BUG() from mremap_at(), as we should not BUG() unless we are
certain it'll result in system instability.
We rename vrm_charge() to vrm_calc_charge() to make it clear this simply
calculates the charged number of pages rather than actually adjusting any
state.
We update the comment for vrm_implies_new_addr() to explain that
MREMAP_DONTUNMAP does not require a set address, but will always be moved.
Additionally consistently use 'res' rather than 'ret' for result values.
No functional change intended.
Link: https://lkml.kernel.org/r/cover.1752770784.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/d35ad8ce6b2c33b2f2f4ef7ec415f04a35cba34f.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-17 17:55:51 +01:00
        /* Should not be possible. */
        WARN_ON_ONCE(1);
        return -EINVAL;
}

/*
 * Will this operation result in the VMA being expanded or moved and thus need
 * to map a new portion of virtual address space?
 */
static bool vrm_will_map_new(struct vma_remap_struct *vrm)
{
        if (vrm->remap_type == MREMAP_EXPAND)
                return true;

        if (vrm_implies_new_addr(vrm))
                return true;

        return false;
}

mm/mremap: permit mremap() move of multiple VMAs
Historically we've made it a uAPI requirement that mremap() may only
operate on a single VMA at a time.
For instances where VMAs need to be resized, this makes sense, as it
becomes very difficult to determine what a user actually wants should they
indicate a desire to expand or shrink the size of multiple VMAs (truncate?
Adjust sizes individually? Some other strategy?).
However, in instances where a user is moving VMAs, it is restrictive to
disallow this.
This is especially the case when anonymous mapping remap may or may not be
mergeable depending on whether VMAs have or have not been faulted due to
anon_vma assignment and folio index alignment with vma->vm_pgoff.
Often this can result in surprising impact where a moved region is
faulted, then moved back and a user fails to observe a merge from
otherwise compatible, adjacent VMAs.
This change allows such cases to work without the user having to be
cognizant of whether a prior mremap() move or other VMA operations have
resulted in VMA fragmentation.
We only permit this for mremap() operations that do NOT change the size of
the VMA and DO specify MREMAP_MAYMOVE | MREMAP_FIXED.
Should no VMA exist in the range, -EFAULT is returned as usual.
If a VMA move spans a single VMA - then there is no functional change.
Otherwise, we place additional requirements upon VMAs:
* They must not have a userfaultfd context associated with them - this
requires dropping the lock to notify users, and we want to perform the
operation with the mmap write lock held throughout.
* If file-backed, they cannot have a custom get_unmapped_area handler -
this might result in MREMAP_FIXED not being honoured, which could result
in unexpected positioning of VMAs in the moved region.
There may be gaps in the range of VMAs that are moved:
            X         Y                        X         Y
          <--->      <->                     <--->      <->
  |-------|   |-----|   |-----|      |-------|   |-----|   |-----|
  |   A   |   |  B  |   |  C  | ---> |  A'   |   |  B' |   |  C' |
  |-------|   |-----|   |-----|      |-------|   |-----|   |-----|
  addr                               new_addr
The move will preserve the gaps between each VMA.
Note that any failures encountered will result in a partial move. Since
an mremap() can fail at any time, this might result in only some of the
VMAs being moved.
Note that failures are very rare and typically require an out of memory
condition or a mapping limit condition to be hit, assuming the VMAs being
moved are valid.
We don't try to assess ahead of time whether VMAs are valid according to
the multi VMA rules, as it would be rather unusual for a user to mix
uffd-enabled VMAs and/or VMAs which map unusual driver mappings that
specify custom get_unmapped_area() handlers in an aggregate operation.
So we optimise for the far, far more likely case of the operation being
entirely permissible.
In the case of the move of a single VMA, the above conditions are
permitted. This makes the behaviour identical for a single VMA as before.
Link: https://lkml.kernel.org/r/8cab2f2c202c4208bdfdb562635748bea6eb37bf.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-17 17:55:59 +01:00
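To make the uAPI change concrete, here is an illustrative userspace sketch (not part of the kernel tree; error handling omitted and the destination address chosen naively): two anonymous VMAs separated by a one-page gap are moved in a single size-preserving mremap() call with MREMAP_MAYMOVE | MREMAP_FIXED. On kernels with this change the whole span moves and the gap is preserved; older kernels fail once the range spans more than one VMA.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long pg = sysconf(_SC_PAGESIZE);

        /* Build A | gap | B by mapping three pages and punching out the middle. */
        char *base = mmap(NULL, 3 * pg, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        munmap(base + pg, pg);

        /* Find a plausible free destination for the whole three-page span. */
        char *dst = mmap(NULL, 3 * pg, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        munmap(dst, 3 * pg);

        /* Same old/new size, fixed destination: a pure multi-VMA move. */
        void *res = mremap(base, 3 * pg, 3 * pg,
                           MREMAP_MAYMOVE | MREMAP_FIXED, dst);
        printf("mremap -> %p\n", res);
        return res == MAP_FAILED;
}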
/* Does this remap ONLY move mappings? */
static bool vrm_move_only(struct vma_remap_struct *vrm)
{
        if (!(vrm->flags & MREMAP_FIXED))
                return false;

        if (vrm->old_len != vrm->new_len)
                return false;

        return true;
}

static void notify_uffd(struct vma_remap_struct *vrm, bool failed)
{
        struct mm_struct *mm = current->mm;

        /* Regardless of success/failure, we always notify of any unmaps. */
        userfaultfd_unmap_complete(mm, vrm->uf_unmap_early);
        if (failed)
                mremap_userfaultfd_fail(vrm->uf);
        else
                mremap_userfaultfd_complete(vrm->uf, vrm->addr,
                                            vrm->new_addr, vrm->old_len);
        userfaultfd_unmap_complete(mm, vrm->uf_unmap);
}

static bool vma_multi_allowed(struct vm_area_struct *vma)
{
        struct file *file;

        /*
         * We can't support moving multiple uffd VMAs as notify requires
         * mmap lock to be dropped.
         */
        if (userfaultfd_armed(vma))
                return false;

        /*
         * Custom get unmapped area might result in MREMAP_FIXED not
         * being obeyed.
         */
        file = vma->vm_file;
        if (file && !vma_is_shmem(vma) && !is_vm_hugetlb_page(vma)) {
                const struct file_operations *fop = file->f_op;

                if (fop->get_unmapped_area)
                        return false;
        }

        return true;
}

static int check_prep_vma(struct vma_remap_struct *vrm)
{
        struct vm_area_struct *vma = vrm->vma;
        struct mm_struct *mm = current->mm;
        unsigned long addr = vrm->addr;
        unsigned long old_len, new_len, pgoff;

        if (!vma)
                return -EFAULT;

        /* If mseal()'d, mremap() is prohibited. */
        if (vma_is_sealed(vma))
                return -EPERM;

        /* Align to hugetlb page size, if required. */
        if (is_vm_hugetlb_page(vma) && !align_hugetlb(vrm))
                return -EINVAL;

        vrm_set_delta(vrm);
        vrm->remap_type = vrm_remap_type(vrm);
        /* For convenience, we set new_addr even if VMA won't move. */
        if (!vrm_implies_new_addr(vrm))
                vrm->new_addr = addr;

        /* Below only meaningful if we expand or move a VMA. */
        if (!vrm_will_map_new(vrm))
                return 0;

        old_len = vrm->old_len;
        new_len = vrm->new_len;

        /*
         * !old_len is a special case where an attempt is made to 'duplicate'
         * a mapping. This makes no sense for private mappings as it will
         * instead create a fresh/new mapping unrelated to the original. This
         * is contrary to the basic idea of mremap which creates new mappings
         * based on the original. There are no known use cases for this
         * behavior. As a result, fail such attempts.
         */
        if (!old_len && !(vma->vm_flags & (VM_SHARED | VM_MAYSHARE))) {
                pr_warn_once("%s (%d): attempted to duplicate a private mapping with mremap. This is not supported.\n",
                             current->comm, current->pid);
                return -EINVAL;
        }

        if ((vrm->flags & MREMAP_DONTUNMAP) &&
            (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
                return -EINVAL;

        /*
         * We permit crossing of boundaries for the range being unmapped due to
         * a shrink.
         */
        if (vrm->remap_type == MREMAP_SHRINK)
                old_len = new_len;

        /*
         * We can't remap across the end of VMAs, as another VMA may be
         * adjacent:
         *
         *             addr    vma->vm_end
         *         |-----.----------|
         *         |     .          |
         *         |-----.----------|
         *               .<--------->xxx>
         *                   old_len
         *
         * We also require that vma->vm_start <= addr < vma->vm_end.
         */
        if (old_len > vma->vm_end - addr)
                return -EFAULT;

        if (new_len == old_len)
                return 0;

        /* We are expanding and the VMA is mlock()'d so we need to populate. */
        if (vma->vm_flags & VM_LOCKED)
                vrm->populate_expand = true;

        /* Need to be careful about a growing mapping */
        pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
        pgoff += vma->vm_pgoff;
        if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
                return -EINVAL;

        if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
                return -EFAULT;

        if (!mlock_future_ok(mm, vma->vm_flags, vrm->delta))
                return -EAGAIN;

        if (!may_expand_vm(mm, vma->vm_flags, vrm->delta >> PAGE_SHIFT))
                return -ENOMEM;

        return 0;
}

/*
 * Are the parameters passed to mremap() valid? If so return 0, otherwise return
 * error.
 */
static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
{
        unsigned long addr = vrm->addr;
        unsigned long flags = vrm->flags;

        /* Ensure no unexpected flag values. */
        if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP))
                return -EINVAL;

        /* Start address must be page-aligned. */
        if (offset_in_page(addr))
                return -EINVAL;

        /*
         * We allow a zero old-len as a special case
         * for DOS-emu "duplicate shm area" thing. But
         * a zero new-len is nonsensical.
         */
        if (!vrm->new_len)
                return -EINVAL;

        /* Is the new length or address silly? */
        if (vrm->new_len > TASK_SIZE ||
            vrm->new_addr > TASK_SIZE - vrm->new_len)
                return -EINVAL;

        /* Remainder of checks are for cases with specific new_addr. */
        if (!vrm_implies_new_addr(vrm))
                return 0;

        /* The new address must be page-aligned. */
        if (offset_in_page(vrm->new_addr))
                return -EINVAL;

        /* A fixed address implies a move. */
        if (!(flags & MREMAP_MAYMOVE))
                return -EINVAL;

        /* MREMAP_DONTUNMAP does not allow resizing in the process. */
        if (flags & MREMAP_DONTUNMAP && vrm->old_len != vrm->new_len)
                return -EINVAL;

        /* Target VMA must not overlap source VMA. */
        if (vrm_overlaps(vrm))
                return -EINVAL;

        /*
         * move_vma() need us to stay 4 maps below the threshold, otherwise
         * it will bail out at the very beginning.
         * That is a problem if we have already unmaped the regions here
         * (new_addr, and old_addr), because userspace will not know the
         * state of the vma's after it gets -ENOMEM.
         * So, to avoid such scenario we can pre-compute if the whole
         * operation has high chances to success map-wise.
         * Worst-scenario case is when both vma's (new_addr and old_addr) get
         * split in 3 before unmapping it.
         * That means 2 more maps (1 for each) to the ones we already hold.
         * Check whether current map count plus 2 still leads us to 4 maps below
         * the threshold, otherwise return -ENOMEM here to be more safe.
         */
        if ((current->mm->map_count + 2) >= sysctl_max_map_count - 3)
                return -ENOMEM;

        return 0;
}

static unsigned long remap_move(struct vma_remap_struct *vrm)
{
        struct vm_area_struct *vma;
        unsigned long start = vrm->addr;
        unsigned long end = vrm->addr + vrm->old_len;
        unsigned long new_addr = vrm->new_addr;
        bool allowed = true, seen_vma = false;
        unsigned long target_addr = new_addr;
        unsigned long res = -EFAULT;
        unsigned long last_end;
        VMA_ITERATOR(vmi, current->mm, start);

        /*
         * When moving VMAs we allow for batched moves across multiple VMAs,
         * with all VMAs in the input range [addr, addr + old_len) being moved
         * (and split as necessary).
         */
        for_each_vma_range(vmi, vma, end) {
                /* Account for start, end not aligned with VMA start, end. */
                unsigned long addr = max(vma->vm_start, start);
                unsigned long len = min(end, vma->vm_end) - addr;
                unsigned long offset, res_vma;

                if (!allowed)
                        return -EFAULT;

                /* No gap permitted at the start of the range. */
                if (!seen_vma && start < vma->vm_start)
                        return -EFAULT;

                /*
                 * To sensibly move multiple VMAs, accounting for the fact that
                 * get_unmapped_area() may align even MAP_FIXED moves, we simply
                 * attempt to move such that the gaps between source VMAs remain
                 * consistent in destination VMAs, e.g.:
                 *
                 *           X         Y                        X         Y
                 *         <--->      <->                     <--->      <->
                 * |-------|   |-----|   |-----|      |-------|   |-----|   |-----|
                 * |   A   |   |  B  |   |  C  | ---> |  A'   |   |  B' |   |  C' |
                 * |-------|   |-----|   |-----|      |-------|   |-----|   |-----|
                 *                                    new_addr
                 *
                 * So we map B' at A'->vm_end + X, and C' at B'->vm_end + Y.
                 */
                offset = seen_vma ? vma->vm_start - last_end : 0;
                last_end = vma->vm_end;

                vrm->vma = vma;
                vrm->addr = addr;
                vrm->new_addr = target_addr + offset;
                vrm->old_len = vrm->new_len = len;

                allowed = vma_multi_allowed(vma);
                if (seen_vma && !allowed)
                        return -EFAULT;

                res_vma = check_prep_vma(vrm);
                if (!res_vma)
                        res_vma = mremap_to(vrm);
                if (IS_ERR_VALUE(res_vma))
                        return res_vma;

                if (!seen_vma) {
                        VM_WARN_ON_ONCE(allowed && res_vma != new_addr);
                        res = res_vma;
                }

                /* mmap lock is only dropped on shrink. */
                VM_WARN_ON_ONCE(!vrm->mmap_locked);
                /* This is a move, no expand should occur. */
                VM_WARN_ON_ONCE(vrm->populate_expand);

                if (vrm->vmi_needs_invalidate) {
                        vma_iter_invalidate(&vmi);
                        vrm->vmi_needs_invalidate = false;
                }
                seen_vma = true;
                target_addr = res_vma + vrm->new_len;
        }

        return res;
}

static unsigned long do_mremap(struct vma_remap_struct *vrm)
{
        struct mm_struct *mm = current->mm;
        unsigned long res;
mm/mremap: use an explicit uffd failure path for mremap
Right now it appears that the code is relying upon the returned
destination address having bits outside PAGE_MASK to indicate whether an
error value is specified, and decrementing the increased refcount on the
uffd ctx if so.
This is not a safe means of determining an error value, so instead, be
specific. It makes far more sense to do so in a dedicated error path, so
add mremap_userfaultfd_fail() for this purpose and use this when an error
arises.
A vm_userfaultfd_ctx is not established until we are at the point where
mremap_userfaultfd_prep() is invoked in copy_vma_and_data(), so this is a
no-op until this happens.
That is - uffd remap notification only occurs if the VMA is actually moved
- at which point a UFFD_EVENT_REMAP event is raised.
No errors can occur after this point currently, though it's certainly not
guaranteed this will always remain the case, and we mustn't rely on this.
However, the reason for needing to handle this case is that, when an error
arises on a VMA move at the point of adjusting page tables, we revert this
operation, and propagate the error.
At this point, it is not correct to raise a uffd remap event, and we must
handle it.
This refactoring makes it abundantly clear what we are doing.
We assume vrm->new_addr is always valid, which a prior change made the
case even for mremap() invocations which don't move the VMA; however, given
that no uffd context would be set up in this case, it's immaterial to this
change anyway.
No functional change intended.
Link: https://lkml.kernel.org/r/a70e8a1f7bce9f43d1431065b414e0f212297297.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-17 17:55:55 +01:00
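The bool failed declared just below exists to drive that dedicated error path. A hedged sketch of how it is intended to be consumed at the tail of do_mremap() follows; the remainder of the function lies outside this excerpt, so the exact placement is an assumption.

        /* ... the remap work has left its result in res ... */
        failed = IS_ERR_VALUE(res);

        if (vrm->mmap_locked)
                mmap_write_unlock(mm);

        /* Raises UFFD_EVENT_REMAP on success, drops the uffd ctx ref on failure. */
        notify_uffd(vrm, failed);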
        bool failed;
mm: untag user pointers passed to memory syscalls
This patch is a part of a series that extends kernel ABI to allow to pass
tagged user pointers (with the top byte set to something else other than
0x00) as syscall arguments.
This patch allows tagged pointers to be passed to the following memory
syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
mremap, msync, munlock, move_pages.
The mmap and mremap syscalls do not currently accept tagged addresses.
Architectures may interpret the tag as a background colour for the
corresponding vma.
Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jens Wiklander <jens.wiklander@linaro.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 16:48:30 -07:00
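For context on the note above: the listed syscalls accept tagged pointers by stripping the tag at entry, roughly as in the simplified, illustrative fragment below (it mirrors the pattern the patch describes rather than quoting any specific file, and omits flag handling); mremap() deliberately skips the untagging step, which is why tagged addresses are not accepted here.

        SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
        {
                /* Tagged user pointer: strip the top-byte tag before use. */
                start = untagged_addr(start);

                /* ... flag validation and the usual mlock work follow ... */
                return do_mlock(start, len, VM_LOCKED);
        }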

        vrm->old_len = PAGE_ALIGN(vrm->old_len);
        vrm->new_len = PAGE_ALIGN(vrm->new_len);

mm/mremap: perform some simple cleanups
Patch series "mm/mremap: permit mremap() move of multiple VMAs", v4.
Historically we've made it a uAPI requirement that mremap() may only
operate on a single VMA at a time.
For instances where VMAs need to be resized, this makes sense, as it
becomes very difficult to determine what a user actually wants should they
indicate a desire to expand or shrink the size of multiple VMAs (truncate?
Adjust sizes individually? Some other strategy?).
However, in instances where a user is moving VMAs, it is restrictive to
disallow this.
This is especially the case when anonymous mapping remap may or may not be
mergeable depending on whether VMAs have or have not been faulted due to
anon_vma assignment and folio index alignment with vma->vm_pgoff.
Often this can result in surprising impact where a moved region is faulted,
then moved back and a user fails to observe a merge from otherwise
compatible, adjacent VMAs.
This change allows such cases to work without the user having to be
cognizant of whether a prior mremap() move or other VMA operations has
resulted in VMA fragmentation.
In order to do this, this series performs a large amount of refactoring,
most pertinently grouping sanity checks together, separating those that
check input parameters from those relating to VMAs. We also simplify the
post-mmap lock drop processing for uffd and mlock()'d VMAs.
With this done, we can then fairly straightforwardly implement this
functionality.
This works exclusively for mremap() invocations which specify
MREMAP_FIXED. It is not compatible with VMAs which use userfaultfd, as the
notification of the userland fault handler would require us to drop the
mmap lock.
It is also not compatible with file-backed mappings with customised
get_unmapped_area() handlers as these may not honour MREMAP_FIXED.
The input and output address ranges must not overlap. We carefully
account for moves which would result in VMA iterator invalidation.
While there can be gaps between VMAs in the input range, there can be no
gap before the first VMA in the range.
This patch (of 10):
We const-ify the vrm flags parameter to indicate this will never change.
We rename resize_is_valid() to remap_is_valid(), as this function does not
only apply to cases where we resize, so it's simply confusing to refer to
that here.
We remove the BUG() from mremap_at(), as we should not BUG() unless we are
certain it'll result in system instability.
We rename vrm_charge() to vrm_calc_charge() to make it clear this simply
calculates the charged number of pages rather than actually adjusting any
state.
We update the comment for vrm_implies_new_addr() to explain that
MREMAP_DONTUNMAP does not require a set address, but the mapping will always
be moved.
Additionally consistently use 'res' rather than 'ret' for result values.
No functional change intended.
Link: https://lkml.kernel.org/r/cover.1752770784.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/d35ad8ce6b2c33b2f2f4ef7ec415f04a35cba34f.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-17 17:55:51 +01:00
	res = check_mremap_params(vrm);
	if (res)
		return res;

	if (mmap_write_lock_killable(mm))
		return -EINTR;
mm/mremap: introduce and use vma_remap_struct threaded state
The mremap() implementation passes around and modifies a large number of
parameters across its helpers, making the code less readable and forcing us
to repeatedly determine things such as the VMA, the size delta, and more.
Avoid this by using the common pattern of passing a state object through
the operation, updating it as we go. We introduce the vma_remap_struct or
'VRM' for this purpose.
This also gives us the ability to accumulate further state through the
operation that would otherwise require awkward and error-prone pointer
passing.
We can also now trivially define helper functions that operate on a VRM
object.
This pattern has proven itself to be very powerful when implemented for
VMA merge, VMA unmapping and memory mapping operations, so it is
battle-tested and functional.
We both introduce the data structure and use it, adding helper functions as
needed to keep things readable. We move some state, such as the mmap lock and
mlock() status, into the VRM, introduce a means of classifying the type of
mremap() operation, and de-duplicate the get_unmapped_area() lookup.
We also neatly thread userfaultfd state throughout the operation.
Note that there is further refactoring to be done, chiefly adjusting
move_vma() to accept a VRM parameter. We defer this, as there is prerequisite
work required to be able to do so, which we will do in a subsequent patch.
Link: https://lkml.kernel.org/r/27951739dc83b2b1523b81fa9c009ba348388d40.1741639347.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-10 20:50:36 +00:00
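For orientation, a minimal sketch of the kind of state the VRM threads
through the operation. The field list below is inferred from the fields
referenced elsewhere in this listing; the exact types and the complete
upstream definition are assumptions:

struct vma_remap_struct {
	/* User-supplied parameters. */
	unsigned long addr;		/* old address, already untagged */
	unsigned long old_len;
	unsigned long new_len;
	unsigned long flags;
	unsigned long new_addr;

	/* userfaultfd bookkeeping threaded through the operation. */
	struct vm_userfaultfd_ctx *uf;
	struct list_head *uf_unmap_early;
	struct list_head *uf_unmap;

	/* State accumulated as the operation proceeds. */
	struct vm_area_struct *vma;	/* VMA being remapped */
	unsigned long delta;		/* change in size */
	bool mmap_locked;		/* do we hold the mmap write lock? */
	bool populate_expand;		/* mlock'd region expanded, populate later */
	enum mremap_type remap_type;	/* classification, e.g. MREMAP_INVALID */
};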
	vrm->mmap_locked = true;
mm/mremap: permit mremap() move of multiple VMAs
Historically we've made it a uAPI requirement that mremap() may only
operate on a single VMA at a time.
For instances where VMAs need to be resized, this makes sense, as it
becomes very difficult to determine what a user actually wants should they
indicate a desire to expand or shrink the size of multiple VMAs (truncate?
Adjust sizes individually? Some other strategy?).
However, in instances where a user is moving VMAs, it is restrictive to
disallow this.
This is especially the case when anonymous mapping remap may or may not be
mergeable depending on whether VMAs have or have not been faulted due to
anon_vma assignment and folio index alignment with vma->vm_pgoff.
Often this can result in surprising impact where a moved region is
faulted, then moved back and a user fails to observe a merge from
otherwise compatible, adjacent VMAs.
This change allows such cases to work without the user having to be
cognizant of whether a prior mremap() move or other VMA operations has
resulted in VMA fragmentation.
We only permit this for mremap() operations that do NOT change the size of
the VMA and DO specify MREMAP_MAYMOVE | MREMAP_FIXED.
Should no VMA exist in the range, -EFAULT is returned as usual.
If a VMA move spans a single VMA - then there is no functional change.
Otherwise, we place additional requirements upon VMAs:
* They must not have a userfaultfd context associated with them - this
requires dropping the lock to notify users, and we want to perform the
operation with the mmap write lock held throughout.
* If file-backed, they cannot have a custom get_unmapped_area handler -
this might result in MREMAP_FIXED not being honoured, which could result
in unexpected positioning of VMAs in the moved region.
There may be gaps in the range of VMAs that are moved:
          X        Y                        X        Y
        <--->    <->                      <--->    <->
|-------|   |-----|  |-----|      |-------|   |-----|  |-----|
|   A   |   |  B  |  |  C  | ---> |   A'  |   |  B' |  |  C' |
|-------|   |-----|  |-----|      |-------|   |-----|  |-----|
addr                              new_addr
The move will preserve the gaps between each VMA.
Note that any failures encountered will result in a partial move. Since
an mremap() can fail at any time, this might result in only some of the
VMAs being moved.
Note that failures are very rare and typically require an out-of-memory
condition or a mapping limit to be hit, assuming the VMAs being moved are
valid.
We don't try to assess ahead of time whether VMAs are valid according to
the multi VMA rules, as it would be rather unusual for a user to mix
uffd-enabled VMAs and/or VMAs which map unusual driver mappings that
specify custom get_unmapped_area() handlers in an aggregate operation.
So we optimise for the far, far more likely case of the operation being
entirely permissible.
In the case of a move spanning only a single VMA, the above restrictions do
not apply, so the behaviour for a single VMA is identical to before.
Link: https://lkml.kernel.org/r/8cab2f2c202c4208bdfdb562635748bea6eb37bf.1752770784.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-17 17:55:59 +01:00
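A hedged userspace sketch of the newly permitted case: a span covering two
separate anonymous VMAs moved in one call. The page size, the way the VMAs
are split, and the lack of error handling are illustrative only:

#define _GNU_SOURCE
#include <sys/mman.h>

#define PAGE 4096UL

void *move_two_vmas(void)
{
	/* Three anonymous pages; a destination region to reserve the address. */
	char *src = mmap(NULL, 3 * PAGE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *dst = mmap(NULL, 3 * PAGE, PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Split src into two VMAs by changing protections on the last page. */
	mprotect(src + 2 * PAGE, PAGE, PROT_READ);

	/*
	 * old_len == new_len with MREMAP_MAYMOVE | MREMAP_FIXED: the move now
	 * spans both VMAs instead of being rejected as it would previously be.
	 */
	return mremap(src, 3 * PAGE, 3 * PAGE,
		      MREMAP_MAYMOVE | MREMAP_FIXED, dst);
}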
	if (vrm_move_only(vrm)) {
		res = remap_move(vrm);
	} else {
		vrm->vma = vma_lookup(current->mm, vrm->addr);
		res = check_prep_vma(vrm);
		if (res)
			goto out;
		/* Actually execute mremap. */
		res = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
	}
out:
	failed = IS_ERR_VALUE(res);

	if (vrm->mmap_locked)
		mmap_write_unlock(mm);

	/* VMA mlock'd + was expanded, so populate expanded region. */
	if (!failed && vrm->populate_expand)
		mm_populate(vrm->new_addr + vrm->old_len, vrm->delta);
	notify_uffd(vrm, failed);
	return res;
}
/*
 * Expand (or shrink) an existing mapping, potentially moving it at the
 * same time (controlled by the MREMAP_MAYMOVE flag and available VM space)
 *
 * MREMAP_FIXED option added 5-Dec-1999 by Benjamin LaHaise
 * This option implies MREMAP_MAYMOVE.
 */
SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
		unsigned long, new_len, unsigned long, flags,
		unsigned long, new_addr)
{
	struct vm_userfaultfd_ctx uf = NULL_VM_UFFD_CTX;
	LIST_HEAD(uf_unmap_early);
	LIST_HEAD(uf_unmap);
	/*
	 * There is a deliberate asymmetry here: we strip the pointer tag
	 * from the old address but leave the new address alone. This is
	 * for consistency with mmap(), where we prevent the creation of
	 * aliasing mappings in userspace by leaving the tag bits of the
	 * mapping address intact. A non-zero tag will cause the subsequent
	 * range checks to reject the address as invalid.
	 *
	 * See Documentation/arch/arm64/tagged-address-abi.rst for more
	 * information.
	 */
	struct vma_remap_struct vrm = {
		.addr = untagged_addr(addr),
		.old_len = old_len,
		.new_len = new_len,
		.flags = flags,
		.new_addr = new_addr,

		.uf = &uf,
		.uf_unmap_early = &uf_unmap_early,
		.uf_unmap = &uf_unmap,

		.remap_type = MREMAP_INVALID, /* We set later. */
	};

	return do_mremap(&vrm);
}
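To illustrate the asymmetry the comment above describes from the userspace
side (a sketch: an arm64 process with the tagged address ABI enabled is
assumed, TAG() is a hypothetical helper, and the exact errno is hedged since
the comment only says the range checks reject the address):

#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define TAG(p)	((void *)((uintptr_t)(p) | (0x2aUL << 56)))

void demo_tag_asymmetry(void *old_addr, void *new_addr, size_t len)
{
	/* Accepted: the kernel strips the tag from the old address. */
	mremap(TAG(old_addr), len, len, MREMAP_MAYMOVE | MREMAP_FIXED, new_addr);

	/*
	 * Rejected: the new address keeps its tag and fails validation,
	 * most likely with -EINVAL.
	 */
	mremap(old_addr, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, TAG(new_addr));
}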