Commit graph

1077 commits

Linus Torvalds
beace86e61

Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "As usual, many cleanups. The below blurbiage describes 42 patchsets.
  21 of those are partially or fully cleanup work. "cleans up",
  "cleanup", "maintainability", "rationalizes", etc.

  I never knew the MM code was so dirty.

  "mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes)
     addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly
     mapped VMAs were not eligible for merging with existing adjacent
     VMAs.
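
     A minimal sketch of the prctl in question (userspace C; error
     handling kept minimal):

        #include <stdio.h>
        #include <sys/prctl.h>

        #ifndef PR_SET_MEMORY_MERGE
        #define PR_SET_MEMORY_MERGE 67
        #endif

        int main(void)
        {
                /* Opt the whole process into KSM merging; with this fix,
                 * VMAs mapped after the call are also eligible to merge
                 * with existing adjacent VMAs. */
                if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0))
                        perror("PR_SET_MEMORY_MERGE");
                return 0;
        }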

  "mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park)
     adds a new kernel module which simplifies the setup and usage of
     DAMON in production environments.

  "stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig)
     is a cleanup to the writeback code which removes a couple of
     pointers from struct writeback_control.

  "drivers/base/node.c: optimization and cleanups" (Donet Tom)
     contains largely uncorrelated cleanups to the NUMA node setup and
     management code.

  "mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman)
     does some maintenance work on the userfaultfd code.

  "Readahead tweaks for larger folios" (Ryan Roberts)
     implements some tuneups for pagecache readahead when it is reading
     into order>0 folios.

  "selftests/mm: Tweaks to the cow test" (Mark Brown)
     provides some cleanups and consistency improvements to the
     selftests code.

  "Optimize mremap() for large folios" (Dev Jain)
     does that. A 37% reduction in execution time was measured in a
     memset+mremap+munmap microbenchmark.

  "Remove zero_user()" (Matthew Wilcox)
     expunges zero_user() in favor of the more modern memzero_page().

  "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand)
     addresses some warts which David noticed in the huge page code.
     These were not known to be causing any issues at this time.

  "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park)
     provides some cleanup and consolidation work in DAMON.

  "use vm_flags_t consistently" (Lorenzo Stoakes)
     uses vm_flags_t in places where we were inappropriately using other
     types.

  "mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy)
     increases the reliability of large page allocation in the memfd
     code.

  "mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple)
     removes several now-unneeded PFN_* flags.

  "mm/damon: decouple sysfs from core" (SeongJae Park)
     implements some cleanup and maintainability work in the DAMON
     sysfs layer.

  "madvise cleanup" (Lorenzo Stoakes)
     does quite a lot of cleanup/maintenance work in the madvise() code.

  "madvise anon_name cleanups" (Vlastimil Babka)
     provides additional cleanups on top of Lorenzo's effort.

  "Implement numa node notifier" (Oscar Salvador)
     creates a standalone notifier for NUMA node memory state changes.
     Previously these were lumped under the more general memory
     on/offline notifier.

  "Make MIGRATE_ISOLATE a standalone bit" (Zi Yan)
     cleans up the pageblock isolation code and fixes a potential issue
     which doesn't seem to cause any problems in practice.

  "selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park)
     adds additional drgn- and python-based DAMON selftests which are
     more comprehensive than the existing selftest suite.

  "Misc rework on hugetlb faulting path" (Oscar Salvador)
     fixes a rather obscure deadlock in the hugetlb fault code and
     follows that fix with a series of cleanups.

  "cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport)
     rationalizes and cleans up the highmem-specific code in the CMA
     allocator.

  "mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand)
     provides cleanups and future-preparedness to the migration code.

  "mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park)
     adds some tracepoints to some DAMON auto-tuning code.

  "mm/damon: fix misc bugs in DAMON modules" (SeongJae Park)
     does that.

  "mm/damon: misc cleanups" (SeongJae Park)
     also does what it claims.

  "mm: folio_pte_batch() improvements" (David Hildenbrand)
     cleans up the large folio PTE batching code.

  "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park)
     facilitates dynamic alteration of DAMON's inter-node allocation
     policy.

  "Remove unmap_and_put_page()" (Vishal Moola)
     provides a couple of page->folio conversions.

  "mm: per-node proactive reclaim" (Davidlohr Bueso)
     implements a per-node control of proactive reclaim - beyond the
     current memcg-based implementation.

  "mm/damon: remove damon_callback" (SeongJae Park)
     replaces the damon_callback interface with a more general and
     powerful damon_call()+damos_walk() interface.

  "mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes)
     implements a number of mremap cleanups (of course) in preparation
     for adding new mremap() functionality: newly permit the remapping
     of multiple VMAs when the user is specifying MREMAP_FIXED. It still
     excludes some specialized situations where this cannot be performed
     reliably.
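
     A minimal sketch of the newly permitted call shape (illustrative;
     old_addr is assumed to span several VMAs):

        #define _GNU_SOURCE
        #include <sys/mman.h>

        /* Move a range that may cover multiple VMAs to a fixed new
         * address; returns MAP_FAILED on error. Previously mremap()
         * refused ranges that crossed VMA boundaries. */
        void *move_span(void *old_addr, size_t len, void *new_addr)
        {
                return mremap(old_addr, len, len,
                              MREMAP_MAYMOVE | MREMAP_FIXED, new_addr);
        }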

  "drop hugetlb_free_pgd_range()" (Anthony Yznaga)
     switches some sparc hugetlb code over to the generic version and
     removes the thus-unneeded hugetlb_free_pgd_range().

  "mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park)
     augments the present userspace-requested update of DAMON sysfs
     monitoring files. Automatic update is now provided, along with a
     tunable to control the update interval.

  "Some randome fixes and cleanups to swapfile" (Kemeng Shi)
     does what is claims.

  "mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand)
     provides (and uses) a means by which debug-style functions can grab
     a copy of a pageframe and inspect it locklessly without tripping
     over the races inherent in operating on the live pageframe
     directly.

  "use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan)
     addresses the large contention issues which can be triggered by
     reads from that procfs file. Latencies are reduced by more than
     half in some situations. The series also introduces several new
     selftests for the /proc/pid/maps interface.

  "__folio_split() clean up" (Zi Yan)
     cleans up __folio_split()!

  "Optimize mprotect() for large folios" (Dev Jain)
     provides some quite large (>3x) speedups to mprotect() when dealing
     with large folios.

  "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian)
     does some cleanup work in the selftests code.

  "tools/testing: expand mremap testing" (Lorenzo Stoakes)
     extends the mremap() selftest in several ways, including adding
     more checking of Lorenzo's recently added "permit mremap() move of
     multiple VMAs" feature.

  "selftests/damon/sysfs.py: test all parameters" (SeongJae Park)
     extends the DAMON sysfs interface selftest so that it tests all
     possible user-requested parameters, rather than the present minimal
     subset"

* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
  MAINTAINERS: add missing headers to memory policy & migration section
  MAINTAINERS: add missing file to cgroup section
  MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
  MAINTAINERS: add missing zsmalloc file
  MAINTAINERS: add missing files to page alloc section
  MAINTAINERS: add missing shrinker files
  MAINTAINERS: move memremap.[ch] to hotplug section
  MAINTAINERS: add missing mm_slot.h file THP section
  MAINTAINERS: add missing interval_tree.c to memory mapping section
  MAINTAINERS: add missing percpu-internal.h file to per-cpu section
  mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
  selftests/damon: introduce _common.sh to host shared function
  selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
  selftests/damon/sysfs.py: test non-default parameters runtime commit
  selftests/damon/sysfs.py: generalize DAMON context commit assertion
  selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
  selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
  selftests/damon/sysfs.py: test DAMOS filters commitment
  selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
  selftests/damon/sysfs.py: test DAMOS destinations commitment
  ...
2025-07-31 14:57:54 -07:00
Linus Torvalds
7031769e10

Merge tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull mmap_prepare updates from Christian Brauner:
 "Last cycle we introduce f_op->mmap_prepare() in c84bf6dd2b ("mm:
  introduce new .mmap_prepare() file callback").

  This is preferred to the existing f_op->mmap() hook as it does require
  a VMA to be established yet, thus allowing the mmap logic to invoke
  this hook far, far earlier, prior to inserting a VMA into the virtual
  address space, or performing any other heavy handed operations.

  This allows for much simpler unwinding on error, and for there to be a
  single attempt at merging a VMA rather than having to possibly
  reattempt a merge based on potentially altered VMA state.

  Far more importantly, it prevents inappropriate manipulation of
  incompletely initialised VMA state, which is something that has been
  the cause of bugs and complexity in the past.

  The intent is to gradually deprecate f_op->mmap, and in that vein this
  series converts the majority of file systems to using f_op->mmap_prepare.

  Prerequisite steps are taken - firstly ensuring all checks for mmap
  capabilities use the file_has_valid_mmap_hooks() helper rather than
  directly checking for f_op->mmap (which is now not a valid check) and
  secondly updating daxdev_mapping_supported() to not require a VMA
  parameter to allow ext4 and xfs to be converted.

  Commit bb666b7c27 ("mm: add mmap_prepare() compatibility layer for
  nested file systems") handles the nasty edge-case of nested file
  systems like overlayfs, which introduces a compatibility shim to allow
  f_op->mmap_prepare() to be invoked from an f_op->mmap() callback.

  This allows for nested filesystems to continue to function correctly
  with all file systems regardless of which callback is used. Once we
  finally convert all file systems, this shim can be removed.

  As a result, ecryptfs, fuse, and overlayfs remain unaltered so they
  can nest all other file systems.

  We additionally do not update resctrl - as this requires an update to
  remap_pfn_range() (or an alternative to it) which we defer to a later
  series, equally we do not update cramfs which needs a mixed mapping
  insertion with the same issue, nor do we update procfs, hugetlbfs,
  sysfs or kernfs all of which require VMAs for internal state and hooks.
  We shall return to all of these later"
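
A sketch of what a typical conversion in this series looks like
(hypothetical 'myfs'; the generic helpers are the ones the series
introduces):

    /* Before: the deprecated hook, invoked with a fully constructed
     * VMA already in hand. */
    static const struct file_operations myfs_fops_old = {
            .mmap         = generic_file_mmap,
    };

    /* After: .mmap_prepare runs before any VMA is inserted, so there
     * is nothing to unwind on failure and the VMA merge is attempted
     * only once. */
    static const struct file_operations myfs_fops_new = {
            .mmap_prepare = generic_file_mmap_prepare,
    };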

* tag 'vfs-6.17-rc1.mmap_prepare' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  doc: update porting, vfs documentation to describe mmap_prepare()
  fs: replace mmap hook with .mmap_prepare for simple mappings
  fs: convert most other generic_file_*mmap() users to .mmap_prepare()
  fs: convert simple use of generic_file_*_mmap() to .mmap_prepare()
  mm/filemap: introduce generic_file_*_mmap_prepare() helpers
  fs/xfs: transition from deprecated .mmap hook to .mmap_prepare
  fs/ext4: transition from deprecated .mmap hook to .mmap_prepare
  fs/dax: make it possible to check dev dax support without a VMA
  fs: consistently use can_mmap_file() helper
  mm/nommu: use file_has_valid_mmap_hooks() helper
  mm: rename call_mmap/mmap_prepare to vfs_mmap/mmap_prepare
2025-07-28 13:43:25 -07:00
Taotao Chen
e9d8e2bf23
fs: change write_begin/write_end interface to take struct kiocb *
Change the address_space_operations callbacks write_begin() and
write_end() to take struct kiocb * as the first argument instead of
struct file *.

Update all affected function prototypes, implementations, call sites,
and related documentation across VFS, filesystems, and block layer.

Part of a series refactoring address_space_operations write_begin and
write_end callbacks to use struct kiocb for passing write context and
flags.
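
Sketched from the description above (details such as const-ness may
differ; see include/linux/fs.h for the authoritative prototypes):

    int (*write_begin)(const struct kiocb *iocb,
                       struct address_space *mapping, loff_t pos,
                       unsigned len, struct folio **foliop, void **fsdata);
    int (*write_end)(const struct kiocb *iocb,
                     struct address_space *mapping, loff_t pos,
                     unsigned len, unsigned copied, struct folio *folio,
                     void *fsdata);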

Signed-off-by: Taotao Chen <chentaotao@didiglobal.com>
Link: https://lore.kernel.org/20250716093559.217344-4-chentaotao@didiglobal.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-16 14:48:18 +02:00
Lorenzo Stoakes
bfbe71109f mm: update core kernel code to use vm_flags_t consistently
The core kernel code is currently very inconsistent in its use of
vm_flags_t vs.  unsigned long.  This prevents us from changing the type of
vm_flags_t in the future and is simply not correct, so correct this.

While this results in rather a lot of churn, it is a critical
pre-requisite for a future planned change to VMA flag type.

Additionally, update VMA userland tests to account for the changes.

To make review easier and to break things into smaller parts, driver and
architecture-specific changes are left for a subsequent commit.

The code has been adjusted to cascade the changes across all calling code
as far as is needed.

Overall, this patch does not introduce any functional change.
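
The conversion pattern, sketched on a hypothetical helper:

    /* Before: flags passed around as a bare integer type. */
    static bool can_do_thing(struct vm_area_struct *vma,
                             unsigned long vm_flags);

    /* After: the dedicated type, so a future change to the definition
     * of vm_flags_t propagates automatically. */
    static bool can_do_thing(struct vm_area_struct *vma,
                             vm_flags_t vm_flags);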

Link: https://lkml.kernel.org/r/d1588e7bb96d1ea3fe7b9df2c699d5b4592d901d.1750274467.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Kees Cook <kees@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:42:13 -07:00
Ryan Roberts
38b0ece6d7 mm/filemap: allow arch to request folio size for exec memory
Change the readahead code so that, when readahead is requested for an
executable mapping, it does a synchronous read into a set of folios with
an arch-specified order, in a naturally aligned manner.  We no longer
center the read on the faulting page but simply align it down to the
previous natural boundary.  Additionally, we don't bother with an
asynchronous part.

On arm64 if memory is physically contiguous and naturally aligned to the
"contpte" size, we can use contpte mappings, which improves utilization of
the TLB.  When paired with the "multi-size THP" feature, this works well
to reduce dTLB pressure.  However iTLB pressure is still high due to
executable mappings having a low likelihood of being in the required folio
size and mapping alignment, even when the filesystem supports readahead
into large folios (e.g.  XFS).

The reason for the low likelihood is that the current readahead algorithm
starts with an order-0 folio and increases the folio order by 2 every time
the readahead mark is hit.  But most executable memory tends to be
accessed randomly and so the readahead mark is rarely hit and most
executable folios remain order-0.

So let's special-case the read(ahead) logic for executable mappings.  The
trade-off is performance improvement (due to more efficient storage of the
translations in iTLB) vs potential for making reclaim more difficult (due
to the folios being larger so if a part of the folio is hot the whole
thing is considered hot).  But executable memory is a small portion of the
overall system memory so I doubt this will even register from a reclaim
perspective.

I've chosen 64K folio size for arm64 which benefits both the 4K and 16K
base page size configs.  Crucially the same amount of data is still read
(usually 128K) so I'm not expecting any read amplification issues.  I
don't anticipate any write amplification because text is always RO.

Note that the text region of an ELF file could be populated into the page
cache for other reasons than taking a fault in a mmapped area.  The most
common case is due to the loader read()ing the header which can be shared
with the beginning of text.  So some text will still remain in small
folios, but this simple, best effort change provides good performance
improvements as is.

Confine this special-case approach to the bounds of the VMA.  This
prevents wasting memory for any padding that might exist in the file
between sections.  Previously the padding would have been contained in
order-0 folios and would be easy to reclaim.  But now it would be part of
a larger folio so more difficult to reclaim.  Solve this by simply not
reading it into memory in the first place.
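
To make the alignment policy concrete, a small userspace sketch (the
order-4 preference mirrors the arm64 choice above; this is arithmetic
only, not the kernel's code):

    #include <stdio.h>

    int main(void)
    {
            unsigned long order = 4;          /* 64K folios on 4K pages */
            unsigned long nr = 1UL << order;  /* 16 pages per folio */
            unsigned long fault_index = 37;   /* faulting page offset */
            unsigned long start = fault_index & ~(nr - 1); /* align down */

            /* Naturally aligned sync read; no centering, no async part. */
            printf("fault at page %lu -> sync read pages [%lu, %lu)\n",
                   fault_index, start, start + nr);
            return 0;
    }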

Benchmarking
============

The below shows pgbench and redis benchmarks on Graviton3 arm64 system.

First, confirmation that this patch causes more text to be contained in
64K folios:

+----------------------+---------------+---------------+---------------+
| File-backed folios by|  system boot  |    pgbench    |     redis     |
| size as percentage of+-------+-------+-------+-------+-------+-------+
| all mapped text mem  |before | after |before | after |before | after |
+======================+=======+=======+=======+=======+=======+=======+
| base-page-4kB        |   78% |   30% |   78% |   11% |   73% |   14% |
| thp-aligned-8kB      |    1% |    0% |    0% |    0% |    1% |    0% |
| thp-aligned-16kB     |   17% |    4% |   17% |    3% |   20% |    4% |
| thp-aligned-32kB     |    1% |    1% |    1% |    2% |    1% |    1% |
| thp-aligned-64kB     |    3% |   63% |    3% |   81% |    4% |   77% |
| thp-aligned-128kB    |    0% |    1% |    1% |    1% |    1% |    2% |
| thp-unaligned-64kB   |    0% |    0% |    0% |    1% |    0% |    1% |
| thp-unaligned-128kB  |    0% |    1% |    0% |    0% |    0% |    0% |
| thp-partial          |    0% |    0% |    0% |    1% |    0% |    1% |
+----------------------+-------+-------+-------+-------+-------+-------+
| cont-aligned-64kB    |    4% |   65% |    4% |   83% |    6% |   79% |
+----------------------+-------+-------+-------+-------+-------+-------+

The above shows that for both workloads (each isolated with cgroups) as
well as the general system state after boot, the amount of text backed by
4K and 16K folios reduces and the amount backed by 64K folios increases
significantly.  And the amount of text that is contpte-mapped
significantly increases (see last row).

And this is reflected in performance improvement.  "(I)" indicates a
statistically significant improvement.  Note TPS and Reqs/sec are rates so
bigger is better, ms is time so smaller is better:

+-------------+-------------------------------------------+------------+
| Benchmark   | Result Class                              | Improvement|
+=============+===========================================+============+
| pts/pgbench | Scale: 1 Clients: 1 RO (TPS)              |  (I) 3.47% |
|             | Scale: 1 Clients: 1 RO - Latency (ms)     |     -2.88% |
|             | Scale: 1 Clients: 250 RO (TPS)            |  (I) 5.02% |
|             | Scale: 1 Clients: 250 RO - Latency (ms)   | (I) -4.79% |
|             | Scale: 1 Clients: 1000 RO (TPS)           |  (I) 6.16% |
|             | Scale: 1 Clients: 1000 RO - Latency (ms)  | (I) -5.82% |
|             | Scale: 100 Clients: 1 RO (TPS)            |      2.51% |
|             | Scale: 100 Clients: 1 RO - Latency (ms)   |     -3.51% |
|             | Scale: 100 Clients: 250 RO (TPS)          |  (I) 4.75% |
|             | Scale: 100 Clients: 250 RO - Latency (ms) | (I) -4.44% |
|             | Scale: 100 Clients: 1000 RO (TPS)         |  (I) 6.34% |
|             | Scale: 100 Clients: 1000 RO - Latency (ms)| (I) -5.95% |
+-------------+-------------------------------------------+------------+
| pts/redis   | Test: GET Connections: 50 (Reqs/sec)      |  (I) 3.20% |
|             | Test: GET Connections: 1000 (Reqs/sec)    |  (I) 2.55% |
|             | Test: LPOP Connections: 50 (Reqs/sec)     |  (I) 4.59% |
|             | Test: LPOP Connections: 1000 (Reqs/sec)   |  (I) 4.81% |
|             | Test: LPUSH Connections: 50 (Reqs/sec)    |  (I) 5.31% |
|             | Test: LPUSH Connections: 1000 (Reqs/sec)  |  (I) 4.36% |
|             | Test: SADD Connections: 50 (Reqs/sec)     |  (I) 2.64% |
|             | Test: SADD Connections: 1000 (Reqs/sec)   |  (I) 4.15% |
|             | Test: SET Connections: 50 (Reqs/sec)      |  (I) 3.11% |
|             | Test: SET Connections: 1000 (Reqs/sec)    |  (I) 3.36% |
+-------------+-------------------------------------------+------------+

[ryan.roberts@arm.com: fix use-after-free]
  Link: https://lkml.kernel.org/r/ea7f9da7-9a9f-4b85-9d0a-35b320f5ed25@arm.com
[ryan.roberts@arm.com: use the vma_pages() helper instead of open-coding]
  Link: https://lkml.kernel.org/r/0e0f674b-3b7e-494f-ae7a-fc9dbb98dad4@arm.com
Link: https://lkml.kernel.org/r/20250609092729.274960-6-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Will Deacon <will@kernel.org>
Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:42:03 -07:00
Ryan Roberts
c4602f9fa7 mm/readahead: store folio order in struct file_ra_state
Previously the folio order of the previous readahead request was inferred
from the folio whose readahead marker was hit.  But due to the way we
sometimes have to round to non-natural boundaries, this first folio in
the readahead block is often smaller than the preferred order for that
request.  This means that for cases where the initial sync readahead is
poorly aligned, the folio order will ramp up much more slowly.

So instead, let's store the order in struct file_ra_state so we are not
affected by any required alignment.  We previously made enough room in the
struct for a 16-bit order field.  This should be plenty big enough since we
are limited to MAX_PAGECACHE_ORDER anyway, which is certainly never larger
than ~20.

Since we now pass order in struct file_ra_state, page_cache_ra_order() no
longer needs its new_order parameter, so let's remove that.

Worked example:

Here we are touching pages 17-256 sequentially just as we did in the
previous commit, but now that we are remembering the preferred order
explicitly, we no longer have the slow ramp up problem.  Note specifically
that we no longer have 2 rounds (2x ~128K) of order-2 folios:

TYPE    STARTOFFS     ENDOFFS        SIZE  STARTPG    ENDPG   NRPG  ORDER  RA
-----  ----------  ----------  ----------  -------  -------  -----  -----  --
HOLE   0x00000000  0x00001000        4096        0        1      1
FOLIO  0x00001000  0x00002000        4096        1        2      1      0
FOLIO  0x00002000  0x00003000        4096        2        3      1      0
FOLIO  0x00003000  0x00004000        4096        3        4      1      0
FOLIO  0x00004000  0x00005000        4096        4        5      1      0
FOLIO  0x00005000  0x00006000        4096        5        6      1      0
FOLIO  0x00006000  0x00007000        4096        6        7      1      0
FOLIO  0x00007000  0x00008000        4096        7        8      1      0
FOLIO  0x00008000  0x00009000        4096        8        9      1      0
FOLIO  0x00009000  0x0000a000        4096        9       10      1      0
FOLIO  0x0000a000  0x0000b000        4096       10       11      1      0
FOLIO  0x0000b000  0x0000c000        4096       11       12      1      0
FOLIO  0x0000c000  0x0000d000        4096       12       13      1      0
FOLIO  0x0000d000  0x0000e000        4096       13       14      1      0
FOLIO  0x0000e000  0x0000f000        4096       14       15      1      0
FOLIO  0x0000f000  0x00010000        4096       15       16      1      0
FOLIO  0x00010000  0x00011000        4096       16       17      1      0
FOLIO  0x00011000  0x00012000        4096       17       18      1      0
FOLIO  0x00012000  0x00013000        4096       18       19      1      0
FOLIO  0x00013000  0x00014000        4096       19       20      1      0
FOLIO  0x00014000  0x00015000        4096       20       21      1      0
FOLIO  0x00015000  0x00016000        4096       21       22      1      0
FOLIO  0x00016000  0x00017000        4096       22       23      1      0
FOLIO  0x00017000  0x00018000        4096       23       24      1      0
FOLIO  0x00018000  0x00019000        4096       24       25      1      0
FOLIO  0x00019000  0x0001a000        4096       25       26      1      0
FOLIO  0x0001a000  0x0001b000        4096       26       27      1      0
FOLIO  0x0001b000  0x0001c000        4096       27       28      1      0
FOLIO  0x0001c000  0x0001d000        4096       28       29      1      0
FOLIO  0x0001d000  0x0001e000        4096       29       30      1      0
FOLIO  0x0001e000  0x0001f000        4096       30       31      1      0
FOLIO  0x0001f000  0x00020000        4096       31       32      1      0
FOLIO  0x00020000  0x00021000        4096       32       33      1      0
FOLIO  0x00021000  0x00022000        4096       33       34      1      0
FOLIO  0x00022000  0x00024000        8192       34       36      2      1
FOLIO  0x00024000  0x00028000       16384       36       40      4      2
FOLIO  0x00028000  0x0002c000       16384       40       44      4      2
FOLIO  0x0002c000  0x00030000       16384       44       48      4      2
FOLIO  0x00030000  0x00034000       16384       48       52      4      2
FOLIO  0x00034000  0x00038000       16384       52       56      4      2
FOLIO  0x00038000  0x0003c000       16384       56       60      4      2
FOLIO  0x0003c000  0x00040000       16384       60       64      4      2
FOLIO  0x00040000  0x00050000       65536       64       80     16      4
FOLIO  0x00050000  0x00060000       65536       80       96     16      4
FOLIO  0x00060000  0x00080000      131072       96      128     32      5
FOLIO  0x00080000  0x000a0000      131072      128      160     32      5
FOLIO  0x000a0000  0x000c0000      131072      160      192     32      5
FOLIO  0x000c0000  0x000e0000      131072      192      224     32      5
FOLIO  0x000e0000  0x00100000      131072      224      256     32      5
FOLIO  0x00100000  0x00120000      131072      256      288     32      5
FOLIO  0x00120000  0x00140000      131072      288      320     32      5  Y
HOLE   0x00140000  0x00800000     7077888      320     2048   1728

Link: https://lkml.kernel.org/r/20250609092729.274960-5-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:42:03 -07:00
Ryan Roberts
f5e8b140cd mm/readahead: make space in struct file_ra_state
We need to be able to store the preferred folio order associated with a
readahead request in the struct file_ra_state so that we can more
accurately increase the order across subsequent readahead requests.  But
struct file_ra_state is per-struct file, so we don't really want to
increase its size.

mmap_miss is currently 32 bits but it is only counted up to 10 *
MMAP_LOTSAMISS, which is currently defined as 1000.  So 16 bits should be
plenty.  Redefine it to unsigned short, making room for order as an
unsigned short in a follow-up commit.
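
For orientation, the resulting layout is roughly as follows (sketched
from this and the following commit's descriptions; see
include/linux/fs.h for the authoritative definition):

    struct file_ra_state {
            pgoff_t start;             /* where readahead started */
            unsigned int size;         /* # of readahead pages */
            unsigned int async_size;   /* readahead-marked tail */
            unsigned int ra_pages;     /* maximum readahead window */
            unsigned short mmap_miss;  /* was unsigned int; 16 bits suffice */
            unsigned short order;      /* preferred folio order (next commit) */
            loff_t prev_pos;           /* position of previous read */
    };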

Link: https://lkml.kernel.org/r/20250609092729.274960-4-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:42:03 -07:00
Chi Zhiling
bbcaee20e0 readahead: fix return value of page_cache_next_miss() when no hole is found
max_scan in page_cache_next_miss() always decreases to zero when no hole
is found, causing the return value to be index + 0 rather than the
intended index + max_scan.

Fix this by preserving the max_scan value throughout the loop.

Jan said "From what I know and have seen in the past, wrong responses
from page_cache_next_miss() can lead to readahead window reduction and
thus reduced read speeds."
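
The shape of the bug and of the fix, reduced to a runnable userspace
schematic (not the kernel's exact loop):

    #include <stdio.h>

    /* Buggy shape: scanning consumes the bound, so when no hole is
     * found the function returns index + 0 instead of index + max_scan. */
    static unsigned long next_miss_buggy(unsigned long index,
                                         unsigned long max_scan)
    {
            while (max_scan)
                    max_scan--;      /* pretend every page is present */
            return index + max_scan; /* always index + 0 */
    }

    /* Fixed shape: preserve the caller's bound for the return value. */
    static unsigned long next_miss_fixed(unsigned long index,
                                         unsigned long max_scan)
    {
            unsigned long remaining = max_scan;

            while (remaining)
                    remaining--;
            return index + max_scan; /* one past the scanned window */
    }

    int main(void)
    {
            printf("buggy: %lu, fixed: %lu\n",
                   next_miss_buggy(100, 16), next_miss_fixed(100, 16));
            return 0;
    }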

Link: https://lkml.kernel.org/r/20250605054935.2323451-1-chizhiling@163.com
Fixes: 901a269ff3 ("filemap: fix page_cache_next_miss() when no hole found")
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:41:59 -07:00
Lorenzo Stoakes
5b44297bcf
mm/filemap: introduce generic_file_*_mmap_prepare() helpers
Since commit c84bf6dd2b ("mm: introduce new .mmap_prepare() file
callback"), the f_op->mmap() hook has been deprecated in favour of
f_op->mmap_prepare().

The generic mmap handlers are very simple, so we can very easily convert
these in advance of converting file systems which use them.

This patch does so.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/30622c1f0b98c66840bc8c02668bda276a810b70.1750099179.git.lorenzo.stoakes@oracle.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-17 13:47:44 +02:00
Linus Torvalds
fd1f847350

Merge tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull more MM updates from Andrew Morton:

 - "zram: support algorithm-specific parameters" from Sergey Senozhatsky
   adds infrastructure for passing algorithm-specific parameters into
   zram. A single parameter `winbits' is implemented at this time.

 - "memcg: nmi-safe kmem charging" from Shakeel Butt makes memcg
   charging nmi-safe, which is required by BFP, which can operate in NMI
   context.

 - "Some random fixes and cleanup to shmem" from Kemeng Shi implements
   small fixes and cleanups in the shmem code.

 - "Skip mm selftests instead when kernel features are not present" from
   Zi Yan fixes some issues in the MM selftest code.

 - "mm/damon: build-enable essential DAMON components by default" from
   SeongJae Park reworks DAMON Kconfig to make it easier to enable
   CONFIG_DAMON.

 - "sched/numa: add statistics of numa balance task migration" from Libo
   Chen adds more info into sysfs and procfs files to improve visibility
   into the NUMA balancer's task migration activity.

 - "selftests/mm: cow and gup_longterm cleanups" from Mark Brown
   provides various updates to some of the MM selftests to make them
   play better with the overall containing framework.

* tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (43 commits)
  mm/khugepaged: clean up refcount check using folio_expected_ref_count()
  selftests/mm: fix test result reporting in gup_longterm
  selftests/mm: report unique test names for each cow test
  selftests/mm: add helper for logging test start and results
  selftests/mm: use standard ksft_finished() in cow and gup_longterm
  selftests/damon/_damon_sysfs: skip testcases if CONFIG_DAMON_SYSFS is disabled
  sched/numa: add statistics of numa balance task
  sched/numa: fix task swap by skipping kernel threads
  tools/testing: check correct variable in open_procmap()
  tools/testing/vma: add missing function stub
  mm/gup: update comment explaining why gup_fast() disables IRQs
  selftests/mm: two fixes for the pfnmap test
  mm/khugepaged: fix race with folio split/free using temporary reference
  mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order
  mmu_notifiers: remove leftover stub macros
  selftests/mm: deduplicate test names in madv_populate
  kcov: rust: add flags for KCOV with Rust
  mm: rust: make CONFIG_MMU ifdefs more narrow
  mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables()
  mm/damon/Kconfig: enable CONFIG_DAMON by default
  ...
2025-06-02 16:00:26 -07:00
Linus Torvalds
fcd0bb8e99

Merge tag 'vfs-6.16-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:

 - Fix the AT_HANDLE_CONNECTABLE option so filesystems that don't know
   how to decode a connected non-dir dentry fail the request

 - Use repr(transparent) to ensure identical layout between the C and
   Rust implementation of struct file

 - Add a missing xas_pause() into the dax code employing
   wait_entry_unlocked_exclusive()

 - Fix FOP_DONTCACHE which we disabled for v6.15.

   A folio could get redirtied and/or scheduled for writeback after the
   initial dropbehind test. Change the test accordingly to handle these
   cases so we can re-enable FOP_DONTCACHE again

* tag 'vfs-6.16-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  exportfs: require ->fh_to_parent() to encode connectable file handles
  rust: file: improve safety comments
  rust: file: mark `LocalFile` as `repr(transparent)`
  fs/dax: Fix "don't skip locked entries when scanning entries"
  iomap: don't lose folio dropbehind state for overwrites
  mm/filemap: unify dropbehind flag testing and clearing
  mm/filemap: unify read/write dropbehind naming
  Revert "Disable FOP_DONTCACHE for now due to bugs"
  mm/filemap: use filemap_end_dropbehind() for read invalidation
  mm/filemap: gate dropbehind invalidate on folio !dirty && !writeback
2025-06-02 12:49:16 -07:00
Matthew Wilcox (Oracle)
acc53a0b4c mm: rename page->index to page->__folio_index
All users of page->index have been converted to no longer refer to it.
Update a few pieces of documentation that were missed and prevent new
users from appearing (or at least make them easy to grep for).

Link: https://lkml.kernel.org/r/20250514181508.3019795-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:06 -07:00
Jens Axboe
a1d98e4ffb
mm/filemap: unify dropbehind flag testing and clearing
The read and write side does this a bit differently, unify it such that
the _{read,write} helpers check the bit before locking, and the generic
handler is in charge of clearing the bit and invalidating, once under
the folio lock.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/20250527133255.452431-6-axboe@kernel.dk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-27 21:06:24 +02:00
Jens Axboe
1da7a06d9c
mm/filemap: unify read/write dropbehind naming
The read side is filemap_end_dropbehind_read(), while the write side
used folio_ as the prefix rather than filemap_. The read side makes more
sense, so unify the naming such that the write side follows that.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/20250527133255.452431-5-axboe@kernel.dk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-27 21:06:24 +02:00
Jens Axboe
25b065a744
mm/filemap: use filemap_end_dropbehind() for read invalidation
Use the filemap_end_dropbehind() helper rather than calling
folio_unmap_invalidate() directly, as we need to check if the folio has
been redirtied or marked for writeback once the folio lock has been
re-acquired.

Cc: stable@vger.kernel.org
Reported-by: Trond Myklebust <trondmy@hammerspace.com>
Fixes: 8026e49bff ("mm/filemap: add read support for RWF_DONTCACHE")
Link: https://lore.kernel.org/linux-fsdevel/ba8a9805331ce258a622feaca266b163db681a10.camel@hammerspace.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/20250527133255.452431-3-axboe@kernel.dk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-27 21:06:24 +02:00
Jens Axboe
095f627add
mm/filemap: gate dropbehind invalidate on folio !dirty && !writeback
It's possible for the folio to either get marked for writeback or
redirtied. Add a helper, filemap_end_dropbehind(), which guards the
folio_unmap_invalidate() call behind a check for the folio being both
non-dirty and not under writeback AFTER the folio lock has been
acquired. Use this helper in folio_end_dropbehind_write().
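
A sketch of the resulting guard (illustrative; gfp and error handling
simplified, folio lock assumed held):

    static void filemap_end_dropbehind(struct folio *folio)
    {
            struct address_space *mapping = folio->mapping;

            VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);

            /* Only drop the folio if it is still clean and not queued
             * for writeback now that the lock is held. */
            if (mapping && !folio_test_dirty(folio) &&
                !folio_test_writeback(folio))
                    folio_unmap_invalidate(mapping, folio, 0);
    }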

Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: fb7d3bc414 ("mm/filemap: drop streaming/uncached pages when writeback completes")
Link: https://lore.kernel.org/linux-fsdevel/20250525083209.GS2023217@ZenIV/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/20250527133255.452431-2-axboe@kernel.dk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-27 21:06:24 +02:00
Baolin Wang
698c0089cd mm: convert do_set_pmd() to take a folio
In do_set_pmd(), we always use the folio->page to build PMD mappings for
the entire folio.  Since all callers of do_set_pmd() already hold a stable
folio, converting do_set_pmd() to take a folio is safe and more
straightforward.

In addition, to ensure the extensibility of do_set_pmd() for supporting
larger folios beyond PMD size, we keep the 'page' parameter to specify
which page within the folio should be mapped.

No functional changes expected.
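
The resulting interface, as described above (sketch):

    /* Callers already hold a stable reference to 'folio'; 'page' selects
     * which page within the folio anchors the PMD mapping. */
    vm_fault_t do_set_pmd(struct vm_fault *vmf, struct folio *folio,
                          struct page *page);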

Link: https://lkml.kernel.org/r/9b488f4ecb4d3fd8634e3d448dd0ed6964482480.1747017104.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-22 14:55:37 -07:00
Vishal Moola (Oracle)
8ab1b16023 mm: fix filemap_get_folios_contig returning batches of identical folios
filemap_get_folios_contig() is supposed to return distinct folios found
within [start, end].  Large folios in the Xarray become multi-index
entries.  xas_next() can iterate through the sub-indexes before finding a
sibling entry and breaking out of the loop.

This can result in a returned folio_batch containing an indeterminate
number of duplicate folios, which forces the callers to skeptically handle
the returned batch.  This is inefficient and incurs a large maintenance
overhead.

We can fix this by calling xas_advance() after we have successfully added
a folio to the batch to ensure our Xarray is positioned such that it will
correctly find the next folio - similar to filemap_get_read_batch().
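
In sketch form, the loop body becomes (illustrative, not the exact
patch):

    if (!folio_batch_add(fbatch, folio))
            break;          /* batch is full */
    /* Park the cursor on the folio's last sub-index so the next
     * xas_next() steps to the following folio rather than another
     * sibling entry of this multi-index (large) folio. */
    xas_advance(&xas, folio_next_index(folio) - 1);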

Link: https://lkml.kernel.org/r/Z-8s1-kiIDkzgRbc@fedora
Fixes: 35b471467f ("filemap: add filemap_get_folios_contig()")
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reported-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Closes: https://lkml.kernel.org/r/b714e4de-2583-4035-b829-72cfb5eb6fc6@gmx.com
Tested-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vivek Kasireddy <vivek.kasireddy@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-11 17:32:40 -07:00
Linus Torvalds
eb0ece1602 - The 6 patch series "Enable strict percpu address space checks" from
Uros Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.
 
   This has caused a small amount of fallout - two or three issues were
   reported.  In all cases the calling code was found to be incorrect.
 
 - The 4 patch series "Some cleanup for memcg" from Chen Ridong
   implements some relatively monir cleanups for the memcontrol code.
 
 - The 17 patch series "mm: fixes for device-exclusive entries (hmm)"
   from David Hildenbrand fixes a boatload of issues which David found
   when using device-exclusive PTE entries when THP is enabled.  More
   work is needed, but this makes things better - our own HMM selftests
   now succeed.
 
 - The 2 patch series "mm: zswap: remove z3fold and zbud" from Yosry
   Ahmed removes the z3fold and zbud implementations.  They have been
   deprecated for half a year and nobody has complained.
 
 - The 5 patch series "mm: further simplify VMA merge operation" from
   Lorenzo Stoakes implements numerous simplifications in this area.  No
   runtime effects are anticipated.
 
 - The 4 patch series "mm/madvise: remove redundant mmap_lock operations
   from process_madvise()" from SeongJae Park rationalizes the locking in
   the madvise() implementation.  Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.
 
 - The 12 patch series "Tiny cleanup and improvements about SWAP code"
   from Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.
 
 - The 2 patch series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak user-visible
   output.
 
 - The 2 patch series "mm/damon/paddr: fix large folios access and
   schemes handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.
 
 - The 3 patch series "mm/damon/core: fix wrong and/or useless
   damos_walk() behaviors" from SeongJae Park fixes a few issues with the
   accuracy of kdamond's walking of DAMON regions.
 
 - The 3 patch series "expose mapping wrprotect, fix fb_defio use" from
   Lorenzo Stoakes changes the interaction between framebuffer deferred-io
   and core MM.  No functional changes are anticipated - this is
   preparatory work for the future removal of page structure fields.
 
 - The 4 patch series "mm/damon: add support for hugepage_size DAMOS
   filter" from Usama Arif adds a DAMOS filter which permits the filtering
   by huge page sizes.
 
 - The 4 patch series "mm: permit guard regions for file-backed/shmem
   mappings" from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state.  The feature now covers shmem and
   file-backed mappings.
 
 - The 4 patch series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping for
   pte-mapped large folios.
 
 - The 18 patch series "reimplement per-vma lock as a refcount" from
   Suren Baghdasaryan puts the vm_lock back into the vma.  Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy.  This patchset provides small (0-10%) improvements on one
   microbenchmark.
 
 - The 5 patch series "Docs/mm/damon: misc DAMOS filters documentation
   fixes and improves" from SeongJae Park does some maintenance work on the
   DAMON docs.
 
 - The 27 patch series "hugetlb/CMA improvements for large systems" from
   Frank van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for unmapped
   pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the
   page's mapped/unmapped status.
 
 - The 19 patch series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.
 
 - The 12 patch series "selftests/mm: Some cleanups from trying to run
   them" from Brendan Jackman fixes a pile of unrelated issues which
   Brendan encountered while running our selftests.
 
 - The 2 patch series "fs/proc/task_mmu: add guard region bit to pagemap"
   from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.
 
 - The 7 patch series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply wasn't
   being effective.
 
 - The 5 patch series "mm: cleanups for device-exclusive entries (hmm)"
   from David Hildenbrand implements a number of unrelated cleanups in this
   code.
 
 - The 5 patch series "mm: Rework generic PTDUMP configs" from Anshuman
   Khandual implements a number of preparatory cleanups to the
   GENERIC_PTDUMP Kconfig logic.
 
 - The 8 patch series "mm/damon: auto-tune aggregation interval" from
   SeongJae Park implements a feedback-driven automatic tuning feature
   for DAMON's aggregation interval.
 
 - The 5 patch series "Fix lazy mmu mode" from Ryan Roberts fixes some
   issues in powerpc, sparc and x86 lazy MMU implementations.  Ryan did
   this in preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.
 
 - The 2 patch series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the code
   easier to follow.
 
 - The 3 patch series "page_counter cleanup and size reduction" from
   Shakeel Butt cleans up the page_counter code and fixes a size increase
   which we accidentally added late last year.
 
 - The 3 patch series "Add a command line option that enables control of
   how many threads should be used to allocate huge pages" from Thomas
   Prescher does that.  It allows the careful operator to significantly
   reduce boot time by tuning the parallelization of huge page
   initialization.
 
 - The 3 patch series "Fix calculations in trace_balance_dirty_pages()
   for cgwb" from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.
 
 - The 9 patch series "mm/damon: make allow filters after reject filters
   useful and intuitive" from SeongJae Park improves the handling of allow
   and reject filters.  Behaviour is made more consistent and the
   documentation is updated accordingly.
 
 - The 5 patch series "Switch zswap to object read/write APIs" from Yosry
   Ahmed updates zswap to the new object read/write APIs and thus permits
   the removal of some legacy code from zpool and zsmalloc.
 
 - The 6 patch series "Some trivial cleanups for shmem" from Baolin Wang
   does as it claims.
 
 - The 20 patch series "fs/dax: Fix ZONE_DEVICE page reference counts"
   from Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permitting the removal of a number of special-case
   checks.
 
 - The 4 patch series "refactor mremap and fix bug" from Lorenzo Stoakes
   is a preparatory refactoring and cleanup of the mremap() code.
 
 - The 20 patch series "mm: MM owner tracking for large folios (!hugetlb)
   + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.
 
 - The 8 patch series "mm/damon: add sysfs dirs for managing DAMOS
   filters based on handling layers" from SeongJae Park adds a couple of
   new sysfs directories to ease the management of DAMON/DAMOS filters.
 
 - The 13 patch series "arch, mm: reduce code duplication in mem_init()"
   from Mike Rapoport consolidates many per-arch implementations of
   mem_init() into generic code, where that is practical.
 
 - The 13 patch series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.
 
 - The 3 patch series "mm: page_ext: Introduce new iteration API" from
   Luiz Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.
 
 - The 8 patch series "Buddy allocator like (or non-uniform) folio split"
   from Zi Yan reworks the code to split a folio into smaller folios.  The
   main benefit is lessened memory consumption: fewer post-split folios are
   generated.
 
 - The 2 patch series "Minimize xa_node allocation during xarray split"
   from Zi Yan reduces the number of xarray xa_nodes which are generated
   during an xarray split.
 
 - The 2 patch series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.
 
 - The 3 patch series "Add tracepoints for lowmem reserves, watermarks
   and totalreserve_pages" from Martin Liu adds some more tracepoints to
   the page allocator code.
 
 - The 4 patch series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which SeongJae
   observed during his earlier madvise work.
 
 - The 3 patch series "mm/hwpoison: Fix regressions in memory failure
   handling" from Shuai Xue addresses two quite serious regressions which
   Shuai has observed in the memory-failure implementation.
 
 - The 5 patch series "mm: reliable huge page allocator" from Johannes
   Weiner makes huge page allocations cheaper and more reliable by reducing
   fragmentation.
 
 - The 5 patch series "Minor memcg cleanups & prep for memdescs" from
   Matthew Wilcox is preparatory work for the future implementation of
   memdescs.
 
 - The 4 patch series "track memory used by balloon drivers" from Nico
   Pache introduces a way to track memory used by our various balloon
   drivers.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for active
   pages" from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.
 
 - The 2 patch series "Adding Proactive Memory Reclaim Statistics" from
   Hao Jia separates the proactive reclaim statistics from the direct
   reclaim statistics.
 
 - The 2 patch series "mm/vmscan: don't try to reclaim hwpoison folio"
   from Jinjiang Tu fixes our handling of hwpoisoned pages within the
   reclaim code.
 -----BEGIN PGP SIGNATURE-----
 
 iHQEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nZaAAKCRDdBJ7gKXxA
 jsOWAPiP4r7CJHMZRK4eyJOkvS1a1r+TsIarrFZtjwvf/GIfAQCEG+JDxVfUaUSF
 Ee93qSSLR1BkNdDw+931Pu0mXfbnBw==
 =Pn2K
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - The series "Enable strict percpu address space checks" from Uros
   Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.

   This has caused a small amount of fallout - two or three issues were
   reported. In all cases the calling code was found to be incorrect.

 - The series "Some cleanup for memcg" from Chen Ridong implements some
   relatively minor cleanups for the memcontrol code.

 - The series "mm: fixes for device-exclusive entries (hmm)" from David
   Hildenbrand fixes a boatload of issues which David found when using
   device-exclusive PTE entries when THP is enabled. More work is
   needed, but this makes things better - our own HMM selftests now
   succeed.

 - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
   removes the z3fold and zbud implementations. They have been deprecated
   for half a year and nobody has complained.

 - The series "mm: further simplify VMA merge operation" from Lorenzo
   Stoakes implements numerous simplifications in this area. No runtime
   effects are anticipated.

 - The series "mm/madvise: remove redundant mmap_lock operations from
   process_madvise()" from SeongJae Park rationalizes the locking in the
   madvise() implementation. Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.

 - The series "Tiny cleanup and improvements about SWAP code" from
   Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.

 - The series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak
   user-visible output.

 - The series "mm/damon/paddr: fix large folios access and schemes
   handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.

 - The series "mm/damon/core: fix wrong and/or useless damos_walk()
   behaviors" from SeongJae Park fixes a few issues with the accuracy of
   kdamond's walking of DAMON regions.

 - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
   Stoakes changes the interaction between framebuffer deferred-io and
   core MM. No functional changes are anticipated - this is preparatory
   work for the future removal of page structure fields.

 - The series "mm/damon: add support for hugepage_size DAMOS filter"
   from Usama Arif adds a DAMOS filter which permits the filtering by
   huge page sizes.

 - The series "mm: permit guard regions for file-backed/shmem mappings"
   from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state. The feature now covers shmem and
   file-backed mappings.

 - The series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping
   for pte-mapped large folios.

 - The series "reimplement per-vma lock as a refcount" from Suren
   Baghdasaryan puts the vm_lock back into the vma. Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy. This patchset provides small (0-10%) improvements on one
   microbenchmark.

 - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
   improves" from SeongJae Park does some maintenance work on the DAMON
   docs.

 - The series "hugetlb/CMA improvements for large systems" from Frank
   van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.

 - The series "mm/damon: introduce DAMOS filter type for unmapped pages"
   from SeongJae Park enables users of DAMON/DAMOS to filter by the
   page's mapped/unmapped status.

 - The series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.

 - The series "selftests/mm: Some cleanups from trying to run them" from
   Brendan Jackman fixes a pile of unrelated issues which Brendan
   encountered while running our selftests.

 - The series "fs/proc/task_mmu: add guard region bit to pagemap" from
   Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.

 - The series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply
   wasn't being effective.

 - The series "mm: cleanups for device-exclusive entries (hmm)" from
   David Hildenbrand implements a number of unrelated cleanups in this
   code.

 - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
   implements a number of preparatory cleanups to the GENERIC_PTDUMP
   Kconfig logic.

 - The series "mm/damon: auto-tune aggregation interval" from SeongJae
   Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval.

 - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
   powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
   preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.

 - The series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the
   code easier to follow.

 - The series "page_counter cleanup and size reduction" from Shakeel
   Butt cleans up the page_counter code and fixes a size increase which
   we accidentally added late last year.

 - The series "Add a command line option that enables control of how
   many threads should be used to allocate huge pages" from Thomas
   Prescher does that. It allows the careful operator to significantly
   reduce boot time by tuning the parallelization of huge page
   initialization.

 - The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
   from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.

 - The series "mm/damon: make allow filters after reject filters useful
   and intuitive" from SeongJae Park improves the handling of allow and
   reject filters. Behaviour is made more consistent and the documentation
   is updated accordingly.

 - The series "Switch zswap to object read/write APIs" from Yosry Ahmed
   updates zswap to the new object read/write APIs and thus permits the
   removal of some legacy code from zpool and zsmalloc.

 - The series "Some trivial cleanups for shmem" from Baolin Wang does as
   it claims.

 - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
   Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permitting the removal of a number of special-case
   checks.

 - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
   preparatory refactoring and cleanup of the mremap() code.

 - The series "mm: MM owner tracking for large folios (!hugetlb) +
   CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.

 - The series "mm/damon: add sysfs dirs for managing DAMOS filters based
   on handling layers" from SeongJae Park adds a couple of new sysfs
   directories to ease the management of DAMON/DAMOS filters.

 - The series "arch, mm: reduce code duplication in mem_init()" from
   Mike Rapoport consolidates many per-arch implementations of
   mem_init() into generic code, where that is practical.

 - The series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.

 - The series "mm: page_ext: Introduce new iteration API" from Luiz
   Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.

 - The series "Buddy allocator like (or non-uniform) folio split" from
   Zi Yan reworks the code to split a folio into smaller folios. The
   main benefit is lessened memory consumption: fewer post-split folios
   are generated.

 - The series "Minimize xa_node allocation during xarry split" from Zi
   Yan reduces the number of xarray xa_nodes which are generated during
   an xarray split.

 - The series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.

 - The series "Add tracepoints for lowmem reserves, watermarks and
   totalreserve_pages" from Martin Liu adds some more tracepoints to the
   page allocator code.

 - The series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which
   SeongJae observed during his earlier madvise work.

 - The series "mm/hwpoison: Fix regressions in memory failure handling"
   from Shuai Xue addresses two quite serious regressions which Shuai
   has observed in the memory-failure implementation.

 - The series "mm: reliable huge page allocator" from Johannes Weiner
   makes huge page allocations cheaper and more reliable by reducing
   fragmentation.

 - The series "Minor memcg cleanups & prep for memdescs" from Matthew
   Wilcox is preparatory work for the future implementation of memdescs.

 - The series "track memory used by balloon drivers" from Nico Pache
   introduces a way to track memory used by our various balloon drivers.

 - The series "mm/damon: introduce DAMOS filter type for active pages"
   from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.

 - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
   separates the proactive reclaim statistics from the direct reclaim
   statistics.

 - The series "mm/vmscan: don't try to reclaim hwpoison folio" from
   Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
   code.

* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
  mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
  x86/mm: restore early initialization of high_memory for 32-bits
  mm/vmscan: don't try to reclaim hwpoison folio
  mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
  cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
  mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
  selftests/mm: speed up split_huge_page_test
  selftests/mm: uffd-unit-tests support for hugepages > 2M
  docs/mm/damon/design: document active DAMOS filter type
  mm/damon: implement a new DAMOS filter type for active pages
  fs/dax: don't disassociate zero page entries
  MM documentation: add "Unaccepted" meminfo entry
  selftests/mm: add commentary about 9pfs bugs
  fork: use __vmalloc_node() for stack allocation
  docs/mm: Physical Memory: Populate the "Zones" section
  xen: balloon: update the NR_BALLOON_PAGES state
  hv_balloon: update the NR_BALLOON_PAGES state
  balloon_compaction: update the NR_BALLOON_PAGES state
  meminfo: add a per node counter for balloon drivers
  mm: remove references to folio in __memcg_kmem_uncharge_page()
  ...
2025-04-01 09:29:18 -07:00
Linus Torvalds
81d8e5e213 f2fs-for-6.15-rc1
In this round, there are three major updates: 1) folio conversion, 2)
 refactoring for mount API conversion, 3) some performance improvements such
 as direct IO, checkpoint speed, and IO priority hints. For stability, there
 are patches which add more sanity checks and fix some major issues like
 i_size in atomic write operations and write pointer recovery in zoned
 devices.
 
 Enhancements:
  - huge folio conversion work by Matthew Wilcox
  - clean up for mount API conversion by Eric Sandeen
  - improve direct IO speed in the overwrite case
  - add some sanity check on node consistency
  - set highest IO priority for checkpoint thread
  - keep POSIX_FADV_NOREUSE ranges and add sysfs entry to reclaim pages
  - add ioctl to get IO priority hint
  - add carve_out sysfs node for fsstat
 
 Bug fixes:
  - disable nat_bits during umount to avoid potential nat entry corruption
  - fix missing i_size update on atomic writes
  - fix missing discard for active segments
  - fix running out of free segments
  - fix out-of-bounds access in f2fs_truncate_inode_blocks()
  - call f2fs_recover_quota_end() correctly
  - fix potential deadloop in prepare_compress_overwrite()
  - fix the missing write pointer correction for zoned device
  - fix to avoid panic when fallocate fails for pinfile
  - don't retry IO for corrupted data scenario
 
 There are many other clean up patches and minor bug fixes as usual.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmfhjuYACgkQQBSofoJI
 UNJF+hAAip0Sf6bXwt43KR3lmbXg/YBTtPDACOs375Pn8pfRVsAPIAf5e1AnlBET
 rA0KpqUrEZHXenQFF9n4LIoHYqfuUw3EMempuvp4qgQcx15Kaajw+EGWVLYstNy2
 9dELc9DA55f/i1uvHezGfQFy6hMfXasf+tkaYk0z0ZYDStYboLgkVAY+F869q1Dl
 D16Y7Lna12h4eCxDdssIPDsLjFH/2LDn7SmhXsnpZxwK0Zx8JSo83WxXHnzXHxz4
 vUkukInKgqcBDvf6ufW/YF3/tqSs20XEXNK3cI1vyHx1dwij6Us+G0n0WJHL30QE
 zsliecR0X28JP1rJ3ldCjr+4Kxm6/u/Uwpinm2Fm1jL67UY4TZcaARBE8k20I6ND
 j/L1+sXrIdZ1aILM9/bwCjgXiVFdZbvlfGfpTj1duAkQgRd+/s9cjPlo5j5HTad5
 XJmzJz6YOaUNarhP/E31Z9SV9M2kEcmoDOTxKBg6ZcMWXZ27B6Z0ag92prd0GWk6
 rWDuVj0eP/LDY1QedbHRPbg1D84jVgcnldfPaf9ptln993skJGdgS0dkegTqMR0L
 H8RgpWOzWZI53gnQdePdej8diHmD8uRTrLf/oABm//GzTHC5BdwVpArOl1DCuLFF
 YkMtVEicgnWRmL3PgODAdTJXaDi3uGvT116i+lfMlUn9p+t93r0=
 =E+TE
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, there are three major updates: (1) folio conversion,
  (2) refactoring for mount API conversion, (3) some performance
   improvements such as direct IO, checkpoint speed, and IO priority
  hints.

  For stability, there are patches which add more sanity checks and
   fix some major issues like i_size in atomic write operations and
  write pointer recovery in zoned devices.

  Enhancements:
    - huge folio conversion work by Matthew Wilcox
   - clean up for mount API conversion by Eric Sandeen
   - improve direct IO speed in the overwrite case
   - add some sanity check on node consistency
   - set highest IO priority for checkpoint thread
   - keep POSIX_FADV_NOREUSE ranges and add sysfs entry to reclaim pages
   - add ioctl to get IO priority hint
   - add carve_out sysfs node for fsstat

  Bug fixes:
   - disable nat_bits during umount to avoid potential nat entry corruption
   - fix missing i_size update on atomic writes
   - fix missing discard for active segments
   - fix running out of free segments
   - fix out-of-bounds access in f2fs_truncate_inode_blocks()
   - call f2fs_recover_quota_end() correctly
   - fix potential deadloop in prepare_compress_overwrite()
   - fix the missing write pointer correction for zoned device
    - fix to avoid panic when fallocate fails for pinfile
   - don't retry IO for corrupted data scenario

  There are many other clean up patches and minor bug fixes as usual"

* tag 'f2fs-for-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (68 commits)
  f2fs: fix missing discard for active segments
  f2fs: optimize f2fs DIO overwrites
  f2fs: fix to avoid atomicity corruption of atomic file
  f2fs: pass sbi rather than sb to parse_options()
  f2fs: pass sbi rather than sb to quota qf_name helpers
  f2fs: defer readonly check vs norecovery
  f2fs: Pass sbi rather than sb to f2fs_set_test_dummy_encryption
  f2fs: make LAZYTIME a mount option flag
  f2fs: make INLINECRYPT a mount option flag
  f2fs: factor out an f2fs_default_check function
  f2fs: consolidate unsupported option handling errors
  f2fs: use f2fs_sb_has_device_alias during option parsing
  f2fs: add carve_out sysfs node
  f2fs: fix to avoid running out of free segments
  f2fs: Remove f2fs_write_node_page()
  f2fs: Remove f2fs_write_meta_page()
  f2fs: Remove f2fs_write_data_page()
  f2fs: Remove check for ->writepage
  Revert "f2fs: rebuild nat_bits during umount"
  f2fs: fix to avoid accessing uninitialized curseg
  ...
2025-03-27 12:55:54 -07:00
Linus Torvalds
592329e5e9 Summary
* Move vm_table members out of kernel/sysctl.c
 
   All vm_table array members have moved to their respective subsystems leading
   to the removal of vm_table from kernel/sysctl.c. This increases modularity by
   placing the ctl_tables closer to where they are actually used and at the same
   time reducing the chances of merge conflicts in kernel/sysctl.c.
 
 * ctl_table range fixes
 
   Replace the proc_handler function that checks variable ranges in
   coredump_sysctls and vdso_table with the one that actually uses the extra{1,2}
   pointers as min/max values. This tightens the range of the values that users
   can pass into the kernel, effectively preventing {under,over}flows.
 
 * Misc fixes
 
   Correct grammar errors and typos in test messages. Update sysctl files in
   MAINTAINERS. Constified alignment_tbl and removed the array size from its
   declaration.
 
 * Testing
 
   - These have all been in linux-next for at least 1 month
   - They have gone through 0-day
   - Ran all these through sysctl selftests in x86_64
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmfhV8EACgkQupfNUreW
 QU/udAv/VCXGkndQsJ5biXpXYFnokX0gIEaYzzHiqrFycZqr8ys0/wWzc+ar1LjF
 Jvanl2uKB0mUviLKt7Gk0+Hri+PJlYIrbx+5K5eo2wsKUUxFykqLLm59y/orPODl
 gyPQjKNpHJb7COsnEc3Lrq/fvol4NPHlcBPXG8NwehccTeBHZ1ninfo+pSnxh3o8
 kI3GSLLxD4K9AgBl5QuVWH4gU7o//u7lUkKzy03NW+2jmuRv3dRcYF7IdgMINNee
 AeXnygdSBxLzECBvmkfNdyg+AmL8hdsmzbsIh7UuJDvxLlQOInVLZa+sXBotCOIc
 TImCrr1Ws1OuGrD0kpH+21tJvc8pNFWt61QlulObQdrLndWHdZEGyGOusLpXTwbn
 jIWZmMvzk1foSwdgzwPFzUqPEpW3FrBVDo4Z4kenBDrCp56QTX7hGRvkNYJNKvot
 Ue+i8BeHR/Gm/p+UMqgsSTOaNJXTqZhFqwJQVzxU/9LN/vkS0On6fbjgBd5X6Pn+
 a5dlc9gy
 =0bcX
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl

Pull sysctl updates from Joel Granados:

 - Move vm_table members out of kernel/sysctl.c

   All vm_table array members have moved to their respective subsystems
   leading to the removal of vm_table from kernel/sysctl.c. This
   increases modularity by placing the ctl_tables closer to where they
   are actually used and at the same time reducing the chances of merge
   conflicts in kernel/sysctl.c.

 - ctl_table range fixes

   Replace the proc_handler function that checks variable ranges in
   coredump_sysctls and vdso_table with the one that actually uses the
   extra{1,2} pointers as min/max values. This tightens the range of the
   values that users can pass into the kernel, effectively preventing
   {under,over}flows (see the sketch after the next item).

 - Misc fixes

   Correct grammar errors and typos in test messages. Update sysctl
   files in MAINTAINERS. Constified alignment_tbl and removed the array
   size from its declaration.
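
   As a sketch of the ctl_table range-fix pattern above (based on the
   core_pipe_limit change; the exact entry in the tree may differ), a
   bounded entry looks roughly like this:

      static const struct ctl_table coredump_sysctls[] = {
              {
                      .procname     = "core_pipe_limit",
                      .data         = &core_pipe_limit,
                      .maxlen       = sizeof(unsigned int),
                      .mode         = 0644,
                      /* range-checking handler, bounded by extra{1,2} */
                      .proc_handler = proc_dointvec_minmax,
                      .extra1       = SYSCTL_ZERO,
                      .extra2       = SYSCTL_INT_MAX,
              },
      };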

* tag 'sysctl-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: (22 commits)
  selftests/sysctl: fix wording of help messages
  selftests: fix spelling/grammar errors in sysctl/sysctl.sh
  MAINTAINERS: Update sysctl file list in MAINTAINERS
  sysctl: Fix underflow value setting risk in vm_table
  coredump: Fixes core_pipe_limit sysctl proc_handler
  sysctl: remove unneeded include
  sysctl: remove the vm_table
  sh: vdso: move the sysctl to arch/sh/kernel/vsyscall/vsyscall.c
  x86: vdso: move the sysctl to arch/x86/entry/vdso/vdso32-setup.c
  fs: dcache: move the sysctl to fs/dcache.c
  sunrpc: simplify rpcauth_cache_shrink_count()
  fs: drop_caches: move sysctl to fs/drop_caches.c
  fs: fs-writeback: move sysctl to fs/fs-writeback.c
  mm: nommu: move sysctl to mm/nommu.c
  security: min_addr: move sysctl to security/min_addr.c
  mm: mmap: move sysctl to mm/mmap.c
  mm: util: move sysctls to mm/util.c
  mm: vmscan: move vmscan sysctls to mm/vmscan.c
  mm: swap: move sysctl to mm/swap.c
  mm: filemap: move sysctl to mm/filemap.c
  ...
2025-03-26 21:02:05 -07:00
Linus Torvalds
76b6905c11 15 hotfixes. 7 are cc:stable and the remainder address post-6.13 issues
or aren't considered necessary for -stable kernels.
 
 13 are for MM and the other two are for squashfs and procfs.
 
 All are singletons.  Please see the individual changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ9jkDgAKCRDdBJ7gKXxA
 jtgjAP0ZIVQ8wEL/txkrNyMT5JrWCK1XfjsfPNJ0afkcl4Oo/QEA8rnJHT/hX8aY
 Wr+aY2CoCsFEOsW4oDRUPrUYpkUBqA8=
 =AwLJ
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2025-03-17-20-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc hotfixes from Andrew Morton:
 "15 hotfixes. 7 are cc:stable and the remainder address post-6.13
  issues or aren't considered necessary for -stable kernels.

  13 are for MM and the other two are for squashfs and procfs.

  All are singletons. Please see the individual changelogs for details"

* tag 'mm-hotfixes-stable-2025-03-17-20-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/page_alloc: fix memory accept before watermarks gets initialized
  mm: decline to manipulate the refcount on a slab page
  memcg: drain obj stock on cpu hotplug teardown
  mm/huge_memory: drop beyond-EOF folios with the right number of refs
  selftests/mm: run_vmtests.sh: fix half_ufd_size_MB calculation
  mm: fix error handling in __filemap_get_folio() with FGP_NOWAIT
  mm: memcontrol: fix swap counter leak from offline cgroup
  mm/vma: do not register private-anon mappings with khugepaged during mmap
  squashfs: fix invalid pointer dereference in squashfs_cache_delete
  mm/migrate: fix shmem xarray update during migration
  mm/hugetlb: fix surplus pages in dissolve_free_huge_page()
  mm/damon/core: initialize damos->walk_completed in damon_new_scheme()
  mm/damon: respect core layer filters' allowance decision on ops layer
  filemap: move prefaulting out of hot write path
  proc: fix UAF in proc_get_inode()
2025-03-17 22:27:27 -07:00
Zi Yan
200a89c159 mm/filemap: use xas_try_split() in __filemap_add_folio()
Patch series "Minimize xa_node allocation during xarray split", v3.

When splitting a multi-index entry in XArray from order-n to order-m,
existing xas_split_alloc()+xas_split() approach requires 2^(n %
XA_CHUNK_SHIFT) xa_node allocations.  But its callers,
__filemap_add_folio() and shmem_split_large_entry(), use at most 1
xa_node.  To minimize xa_node allocation and remove the limitation of no
split from order-12 (or above) to order-0 (or anything between 0 and
5)[1], xas_try_split() was added[2], which allocates (n / XA_CHUNK_SHIFT -
m / XA_CHUNK_SHIFT) xa_nodes.  It is used for non-uniform folio split, but
can be used by __filemap_add_folio() and shmem_split_large_entry().

xas_split_alloc() and xas_split() split an order-9 to order-0:

         ---------------------------------
         |   |   |   |   |   |   |   |   |
         | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
         |   |   |   |   |   |   |   |   |
         ---------------------------------
           |   |                   |   |
     -------   ---               ---   -------
     |           |     ...       |           |
     V           V               V           V
----------- -----------     ----------- -----------
| xa_node | | xa_node | ... | xa_node | | xa_node |
----------- -----------     ----------- -----------

xas_try_split() splits an order-9 to order-0:
   ---------------------------------
   |   |   |   |   |   |   |   |   |
   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
   |   |   |   |   |   |   |   |   |
   ---------------------------------
     |
     |
     V
-----------
| xa_node |
-----------

xas_try_split() is designed to be called iteratively with n = m + 1.
xas_try_split_min_order() is added to minimize the number of calls to
xas_try_split() by telling the caller the next minimal order to split to
instead of n - 1.  Splitting order-n to order-m when m = l * XA_CHUNK_SHIFT
does not require xa_node allocation, and requires 1 xa_node when n = l *
XA_CHUNK_SHIFT and m = n - 1, so it is OK to use xas_try_split() with n >
m + 1 when no new xa_node is needed.

xfstests quick group test passed on xfs and tmpfs.

[1] https://lore.kernel.org/linux-mm/Z6YX3RznGLUD07Ao@casper.infradead.org/
[2] https://lore.kernel.org/linux-mm/20250226210032.2044041-1-ziy@nvidia.com/


This patch (of 2):

During __filemap_add_folio(), a shadow entry covering n slots is present
and a folio covering m slots (m < n) is to be added.  Instead of
splitting all n slots, only the m slots covered by the folio need to be
split, and the remaining n-m shadow entries can be retained with orders
ranging from m to n-1.  This method only requires

	(n/XA_CHUNK_SHIFT) - (m/XA_CHUNK_SHIFT)

new xa_nodes instead of

	(n % XA_CHUNK_SHIFT) * ((n/XA_CHUNK_SHIFT) - (m/XA_CHUNK_SHIFT))

new xa_nodes, compared to the original xas_split_alloc() + xas_split()
one.  For example, to insert an order-0 folio when an order-9 shadow entry
is present (assuming XA_CHUNK_SHIFT is 6), 1 xa_node is needed instead of
8.

xas_try_split_min_order() is introduced to reduce the number of calls to
xas_try_split() during split.
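
A rough sketch of that iterative use, modeled on the
shmem_split_large_entry() caller (the loop shape is an assumption based
on this description; xas_set_order() and xas_error() are existing XArray
helpers):

        /* Split @entry from @order down to @new_order, stepwise. */
        unsigned int cur_order = order;

        while (cur_order > new_order) {
                /* cheapest next target rather than cur_order - 1 */
                unsigned int split_order =
                        max(new_order, xas_try_split_min_order(cur_order));

                xas_set_order(&xas, index, split_order);
                xas_try_split(&xas, entry, cur_order);
                if (xas_error(&xas))
                        break;
                cur_order = split_order;
        }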

Link: https://lkml.kernel.org/r/20250314222113.711703-1-ziy@nvidia.com
Link: https://lkml.kernel.org/r/20250314222113.711703-2-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17 22:07:01 -07:00
Guanjun
b23ceebd63 filemap: remove redundant folio_test_large check in filemap_free_folio
The folio_test_large() check in filemap_free_folio() is unnecessary
because folio_nr_pages(), which is called internally, already performs
this check.  Removing the redundant condition simplifies the code and avoids
double validation.

This change improves code readability and reduces unnecessary operations
in the folio freeing path.
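
After the change, the function reduces to roughly the following (a
sketch; folio_nr_pages() returns 1 for small folios, so the explicit
large-folio branch is not needed):

        static void filemap_free_folio(struct address_space *mapping,
                                       struct folio *folio)
        {
                void (*free_folio)(struct folio *);

                free_folio = mapping->a_ops->free_folio;
                if (free_folio)
                        free_folio(folio);

                /* folio_nr_pages() already handles the small-folio case */
                folio_put_refs(folio, folio_nr_pages(folio));
        }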

Link: https://lkml.kernel.org/r/20250213055612.490993-1-guanjun@linux.alibaba.com
Signed-off-by: Guanjun <guanjun@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16 22:06:16 -07:00
Raphael S. Carvalho
182db972c9 mm: fix error handling in __filemap_get_folio() with FGP_NOWAIT
original report:
https://lore.kernel.org/all/CAKhLTr1UL3ePTpYjXOx2AJfNk8Ku2EdcEfu+CH1sf3Asr=B-Dw@mail.gmail.com/T/

When doing buffered writes with FGP_NOWAIT, under memory pressure, the
system returned ENOMEM despite there being plenty of available memory, to
be reclaimed from page cache.  The user space used io_uring interface,
which in turn submits I/O with FGP_NOWAIT (the fast path).

retsnoop pointed to iomap_get_folio:

00:34:16.180612 -> 00:34:16.180651 TID/PID 253786/253721
(reactor-1/combined_tests):

                    entry_SYSCALL_64_after_hwframe+0x76
                    do_syscall_64+0x82
                    __do_sys_io_uring_enter+0x265
                    io_submit_sqes+0x209
                    io_issue_sqe+0x5b
                    io_write+0xdd
                    xfs_file_buffered_write+0x84
                    iomap_file_buffered_write+0x1a6
    32us [-ENOMEM]  iomap_write_begin+0x408
iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x…
pos=0 len=4096 foliop=0xffffb32c296b7b80
!    4us [-ENOMEM]  iomap_get_folio
iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x…
pos=0 len=4096

This is likely a regression caused by 66dabbb65d ("mm: return an ERR_PTR
from __filemap_get_folio"), which moved error handling from
iomap_get_folio() to __filemap_get_folio(), but broke FGP_NOWAIT
handling, so ENOMEM escapes to user space.  Had it correctly
returned -EAGAIN with NOWAIT, either io_uring or user space itself would
be able to retry the request.

It's not enough to patch io_uring since the iomap interface is the one
responsible for it, and pwritev2(RWF_NOWAIT) and AIO interfaces must
return the proper error too.

The patch was tested with the scylladb test suite (its original
reproducer), and the tests all pass now under memory pressure.
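
The fix amounts to remapping the error in __filemap_get_folio()'s
allocation-failure path, along these lines (a sketch of the shape, not
the verbatim diff):

        if (err) {
                /*
                 * A failed nonblocking allocation does not mean the
                 * system is out of memory; report -EAGAIN so io_uring,
                 * pwritev2(RWF_NOWAIT) and AIO callers can retry.
                 */
                if ((fgp_flags & FGP_NOWAIT) && err == -ENOMEM)
                        err = -EAGAIN;
                return ERR_PTR(err);
        }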

Link: https://lkml.kernel.org/r/20250224143700.23035-1-raphaelsc@scylladb.com
Fixes: 66dabbb65d ("mm: return an ERR_PTR from __filemap_get_folio")
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16 17:40:25 -07:00
Dave Hansen
665575cff0 filemap: move prefaulting out of hot write path
There is a generic anti-pattern that shows up in the VFS and several
filesystems where the hot write paths touch userspace twice when they
could get away with doing it once.

Dave Chinner suggested that they should all be fixed up[1].  I agree[2]. 
But, the series to do that fixup spans a bunch of filesystems and a lot of
people.  This patch fixes common code that absolutely everyone uses.  It
has measurable performance benefits[3].

I think this patch can go in and not be held up by the others.

I will post them separately to their separate maintainers for
consideration. But, honestly, I'm not going to lose any sleep if
the maintainers don't pick those up.

1. https://lore.kernel.org/all/Z5f-x278Z3wTIugL@dread.disaster.area/
2. https://lore.kernel.org/all/20250129181749.C229F6F3@davehans-spike.ostc.intel.com/
3. https://lore.kernel.org/all/202502121529.d62a409e-lkp@intel.com/


This patch:

There is a bit of a sordid history here. I originally wrote
998ef75ddb ("fs: do not prefault sys_write() user buffer pages")
to fix a performance issue that showed up on early SMAP hardware.
But that was reverted with 00a3d660cb because it exposed an
underlying filesystem bug.

This is a reimplementation of the original commit along with some
simplification and comment improvements.

The basic problem is that the generic write path has two userspace
accesses: one to prefault the write source buffer and then another to
perform the actual write. On x86, this means an extra STAC/CLAC pair.
These are relatively expensive instructions because they function as
barriers.

Keep the prefaulting behavior but move it into the slow path that gets
run when the write did not make any progress. This avoids livelocks
that can happen when the write's source and destination target the
same folio. Contrary to the existing comments, the fault-in does not
prevent deadlocks. That's accomplished by using an "atomic" usercopy
that disables page faults.

The end result is that the generic write fast path now touches
userspace once instead of twice.

0day has shown some improvements on a couple of microbenchmarks:

	https://lore.kernel.org/all/202502121529.d62a409e-lkp@intel.com/
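
Schematically, the write loop now only faults the source buffer in when
the page-fault-disabled copy makes no progress (a sketch, not the
verbatim kernel code):

        copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);

        if (unlikely(copied == 0)) {
                /*
                 * Slow path: fault the source buffer in here, off the
                 * fast path, then loop and retry the copy.
                 */
                if (fault_in_iov_iter_readable(i, bytes) == bytes) {
                        status = -EFAULT;
                        break;
                }
        }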

Link: https://lkml.kernel.org/r/20250228203722.CAEB63AC@davehans-spike.ostc.intel.com
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/yxyuijjfd6yknryji2q64j3keq2ygw6ca6fs5jwyolklzvo45s@4u63qqqyosy2/
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16 17:40:23 -07:00
Amir Goldstein
252256e416 Revert "fanotify: disable readahead if we have pre-content watches"
This reverts commit fac84846a2.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250312073852.2123409-7-amir73il@gmail.com
2025-03-13 16:31:12 +01:00
Amir Goldstein
955fbe0ef1 Revert "fsnotify: generate pre-content permission event on page fault"
This reverts commit 8392bc2ff8.

In the use case of a buffered write whose input buffer is an mmapped file
on a filesystem with a pre-content mark, the prefaulting of the buffer can
happen under the filesystem freeze protection (obtained in vfs_write()),
which breaks the assumptions of the pre-content hook and introduces a
potential deadlock between the HSM handler in userspace and filesystem
freezing.

Now that we have pre-content hooks at file mmap() time, disable the
pre-content event hooks on page fault to avoid the potential deadlock.

Reported-by: syzbot+7229071b47908b19d5b7@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-fsdevel/7ehxrhbvehlrjwvrduoxsao5k3x4aw275patsb3krkwuq573yv@o2hskrfawbnc/
Fixes: 8392bc2ff8 ("fsnotify: generate pre-content permission event on page fault")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250312073852.2123409-5-amir73il@gmail.com
2025-03-13 16:30:55 +01:00
Linus Torvalds
00a7d39898 fs/pipe: add simpler helpers for common cases
The fix to atomically read the pipe head and tail state when not holding
the pipe mutex has caused a number of headaches due to the size change
of the involved types.

It turns out that we don't have _that_ many places that access these
fields directly and were affected, but we have more than we strictly
should have, because our low-level helper functions have been designed
to have intimate knowledge of how the pipes work.

And as a result, that random noise of direct 'pipe->head' and
'pipe->tail' accesses makes it harder to pinpoint any actual potential
problem spots remaining.

For example, we didn't have a "is the pipe full" helper function, but
instead had a "given these pipe buffer indexes and this pipe size, is
the pipe full".  That's because some low-level pipe code does actually
want that much more complicated interface.

But most other places literally just want a "is the pipe full" helper,
and not having it meant that those places ended up being unnecessarily
much too aware of this all.

It would have been much better if only the very core pipe code that
cared had been the one aware of this all.

So let's fix it - better late than never.  This just introduces the
trivial wrappers for "is this pipe full or empty" and to get how many
pipe buffers are used, so that instead of writing

        if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))

the places that literally just want to know if a pipe is full can just
say

        if (pipe_is_full(pipe))

instead.  The existing trivial cases were converted with a 'sed' script.

This cuts down on the places that access pipe->head and pipe->tail
directly outside of the pipe code (and core splice code) quite a lot.

The splice code in particular still revels in doing the direct low-level
accesses, and the fuse driver's fuse_dev_splice_write() code also seems
a bit unnecessarily eager to go very low-level, but it's at least a bit better
than it used to be.
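
The wrappers themselves are trivial by design; they look roughly like
this (sketch, built on the existing pipe_full()/pipe_empty()/
pipe_occupancy() helpers; the usage-helper name is as added here):

        static inline bool pipe_is_full(const struct pipe_inode_info *pipe)
        {
                return pipe_full(pipe->head, pipe->tail, pipe->max_usage);
        }

        static inline bool pipe_is_empty(const struct pipe_inode_info *pipe)
        {
                return pipe_empty(pipe->head, pipe->tail);
        }

        static inline unsigned int pipe_buf_usage(const struct pipe_inode_info *pipe)
        {
                return pipe_occupancy(pipe->head, pipe->tail);
        }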

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-03-06 18:25:35 -10:00
Matthew Wilcox (Oracle)
d96e2802a8 mm: Remove wait_on_page_locked()
This compatibility wrapper has no callers left, so remove it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04 17:02:27 +00:00
Jingbo Xu
8510edf191
mm/filemap: fix miscalculated file range for filemap_fdatawrite_range_kick()
iocb->ki_pos has already been advanced by the number of written bytes in
generic_perform_write(), so the written range now lies behind it.

Besides, __filemap_fdatawrite_range() accepts the inclusive end of the
data range.
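
Concretely, the kick at the call site (generic_write_sync()) needs to
step back over the bytes just written and pass an inclusive end, along
these lines (sketch):

        /* ki_pos was already advanced by @count written bytes */
        filemap_fdatawrite_range_kick(iocb->ki_filp->f_mapping,
                                      iocb->ki_pos - count,
                                      iocb->ki_pos - 1);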

Fixes: 1d44575765 ("mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue")
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250218120209.88093-2-jefflexu@linux.alibaba.com
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-21 14:09:47 +01:00
Kaixiong Yu
73aa354af2 mm: filemap: move sysctl to mm/filemap.c
This moves the filemap related sysctl to mm/filemap.c, and
removes the redundant external variable declaration.
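
A sketch of the resulting pattern in mm/filemap.c, assuming the moved
knob is page_lock_unfairness (the table and init-function names here are
illustrative, not necessarily the exact ones used):

        static const struct ctl_table filemap_sysctl_table[] = {
                {
                        .procname     = "page_lock_unfairness",
                        .data         = &sysctl_page_lock_unfairness,
                        .maxlen       = sizeof(sysctl_page_lock_unfairness),
                        .mode         = 0644,
                        .proc_handler = proc_dointvec_minmax,
                        .extra1       = SYSCTL_ZERO,
                },
        };

        static int __init filemap_sysctl_init(void)
        {
                /* registers under /proc/sys/vm/, as before the move */
                register_sysctl_init("vm", filemap_sysctl_table);
                return 0;
        }
        fs_initcall(filemap_sysctl_init);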

Signed-off-by: Kaixiong Yu <yukaixiong@huawei.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-02-07 16:53:03 +01:00
Linus Torvalds
9c5968db9e The various patchsets are summarized below. Plus of course many
individual patches which are described in their changelogs.
 
 - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the
   page allocator so we end up with the ability to allocate and free
   zero-refcount pages.  So that callers (ie, slab) can avoid a refcount
   inc & dec.
 
 - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use
   large folios other than PMD-sized ones.
 
 - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and
   fixes for this small built-in kernel selftest.
 
 - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of
   the mapletree code.
 
 - "mm: fix format issues and param types" from Keren Sun implements a
   few minor code cleanups.
 
 - "simplify split calculation" from Wei Yang provides a few fixes and a
   test for the mapletree code.
 
 - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes
   continues the work of moving vma-related code into the (relatively) new
   mm/vma.c.
 
 - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
   Hildenbrand cleans up and rationalizes handling of gfp flags in the page
   allocator.
 
 - "readahead: Reintroduce fix for improper RA window sizing" from Jan
   Kara is a second attempt at fixing a readahead window sizing issue.  It
   should reduce the amount of unnecessary reading.
 
 - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
   addresses an issue where "huge" amounts of pte pagetables are
   accumulated
   (https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/).
   Qi's series addresses this windup by synchronously freeing PTE memory
   within the context of madvise(MADV_DONTNEED).
 
 - "selftest/mm: Remove warnings found by adding compiler flags" from
   Muhammad Usama Anjum fixes some build warnings in the selftests code
   when optional compiler warnings are enabled.
 
 - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David
   Hildenbrand tightens the allocator's observance of __GFP_HARDWALL.
 
 - "pkeys kselftests improvements" from Kevin Brodsky implements various
   fixes and cleanups in the MM selftests code, mainly pertaining to the
   pkeys tests.
 
 - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
   estimate application working set size.
 
 - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
   provides some cleanups to memcg's hugetlb charging logic.
 
 - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
   removes the global swap cgroup lock.  A speedup of 10% for a tmpfs-based
   kernel build was demonstrated.
 
 - "zram: split page type read/write handling" from Sergey Senozhatsky
   has several fixes and cleanups for zram in the area of zram_write_page().
   A watchdog softlockup warning was eliminated.
 
 - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky
   cleans up the pagetable destructor implementations.  A rare
   use-after-free race is fixed.
 
 - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
   simplifies and cleans up the debugging code in the VMA merging logic.
 
 - "Account page tables at all levels" from Kevin Brodsky cleans up and
   regularizes the pagetable ctor/dtor handling.  This results in
   improvements in accounting accuracy.
 
 - "mm/damon: replace most damon_callback usages in sysfs with new core
   functions" from SeongJae Park cleans up and generalizes DAMON's sysfs
   file interface logic.
 
 - "mm/damon: enable page level properties based monitoring" from
   SeongJae Park increases the amount of information which is presented in
   response to DAMOS actions.
 
 - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes
   DAMON's long-deprecated debugfs interfaces.  Thus the migration to sysfs
   is completed.
 
 - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter
   Xu cleans up and generalizes the hugetlb reservation accounting.
 
 - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
   removes a never-used feature of the alloc_pages_bulk() interface.
 
 - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
   extends DAMOS filters to support not only exclusion (rejecting), but
   also inclusion (allowing) behavior.
 
 - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
   "introduces a new memory descriptor for zswap.zpool that currently
   overlaps with struct page for now.  This is part of the effort to reduce
   the size of struct page and to enable dynamic allocation of memory
   descriptors."
 
 - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and
   simplifies the swap allocator locking.  A speedup of 400% was
   demonstrated for one workload.  As was a 35% reduction for kernel build
   time with swap-on-zram.
 
 - "mm: update mips to use do_mmap(), make mmap_region() internal" from
   Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
   mmap_region() can be made MM-internal.
 
 - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU
   regressions and otherwise improves MGLRU performance.
 
 - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park
   updates DAMON documentation.
 
 - "Cleanup for memfd_create()" from Isaac Manjarres does that thing.
 
 - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand
   provides various cleanups in the areas of hugetlb folios, THP folios and
   migration.
 
 - "Uncached buffered IO" from Jens Axboe implements the new
   RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache
   reading and writing.  To permit userspace to address issues with
   massive buildup of useless pagecache when reading/writing fast devices.
 
 - "selftests/mm: virtual_address_range: Reduce memory" from Thomas
   Weißschuh fixes and optimizes some of the MM selftests.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5a+cwAKCRDdBJ7gKXxA
 jtoyAP9R58oaOKPJuTizEKKXvh/RpMyD6sYcz/uPpnf+cKTZxQEAqfVznfWlw/Lz
 uC3KRZYhmd5YrxU4o+qjbzp9XWX/xAE=
 =Ib2s
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "The various patchsets are summarized below. Plus of course many
  individual patches which are described in their changelogs.

   - "Allocate and free frozen pages" from Matthew Wilcox reorganizes
     the page allocator so we end up with the ability to allocate and
     free zero-refcount pages. So that callers (ie, slab) can avoid a
     refcount inc & dec

   - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
     use large folios other than PMD-sized ones

   - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
     and fixes for this small built-in kernel selftest

   - "mas_anode_descend() related cleanup" from Wei Yang tidies up part
     of the mapletree code

   - "mm: fix format issues and param types" from Keren Sun implements a
     few minor code cleanups

   - "simplify split calculation" from Wei Yang provides a few fixes and
     a test for the mapletree code

   - "mm/vma: make more mmap logic userland testable" from Lorenzo
     Stoakes continues the work of moving vma-related code into the
     (relatively) new mm/vma.c

   - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
     Hildenbrand cleans up and rationalizes handling of gfp flags in the
     page allocator

   - "readahead: Reintroduce fix for improper RA window sizing" from Jan
     Kara is a second attempt at fixing a readahead window sizing issue.
     It should reduce the amount of unnecessary reading

   - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
     addresses an issue where "huge" amounts of pte pagetables are
     accumulated:

       https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/

     Qi's series addresses this windup by synchronously freeing PTE
     memory within the context of madvise(MADV_DONTNEED)

   - "selftest/mm: Remove warnings found by adding compiler flags" from
     Muhammad Usama Anjum fixes some build warnings in the selftests
     code when optional compiler warnings are enabled

   - "mm: don't use __GFP_HARDWALL when migrating remote pages" from
     David Hildenbrand tightens the allocator's observance of
     __GFP_HARDWALL

   - "pkeys kselftests improvements" from Kevin Brodsky implements
     various fixes and cleanups in the MM selftests code, mainly
     pertaining to the pkeys tests

   - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
     estimate application working set size

   - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
     provides some cleanups to memcg's hugetlb charging logic

   - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
     removes the global swap cgroup lock. A speedup of 10% for a
     tmpfs-based kernel build was demonstrated

   - "zram: split page type read/write handling" from Sergey Senozhatsky
     has several fixes and cleanups for zram in the area of
     zram_write_page(). A watchdog softlockup warning was eliminated

   - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
     Brodsky cleans up the pagetable destructor implementations. A rare
     use-after-free race is fixed

   - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
     simplifies and cleans up the debugging code in the VMA merging
     logic

   - "Account page tables at all levels" from Kevin Brodsky cleans up
     and regularizes the pagetable ctor/dtor handling. This results in
     improvements in accounting accuracy

   - "mm/damon: replace most damon_callback usages in sysfs with new
     core functions" from SeongJae Park cleans up and generalizes
     DAMON's sysfs file interface logic

   - "mm/damon: enable page level properties based monitoring" from
     SeongJae Park increases the amount of information which is
     presented in response to DAMOS actions

   - "mm/damon: remove DAMON debugfs interface" from SeongJae Park
     removes DAMON's long-deprecated debugfs interfaces. Thus the
     migration to sysfs is completed

   - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
     Peter Xu cleans up and generalizes the hugetlb reservation
     accounting

   - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
     removes a never-used feature of the alloc_pages_bulk() interface

   - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
     extends DAMOS filters to support not only exclusion (rejecting),
     but also inclusion (allowing) behavior

   - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
     introduces a new memory descriptor for zswap.zpool that currently
     overlaps with struct page for now. This is part of the effort to
     reduce the size of struct page and to enable dynamic allocation of
     memory descriptors

   - "mm, swap: rework of swap allocator locks" from Kairui Song redoes
     and simplifies the swap allocator locking. A speedup of 400% was
     demonstrated for one workload. As was a 35% reduction for kernel
     build time with swap-on-zram

   - "mm: update mips to use do_mmap(), make mmap_region() internal"
     from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
     mmap_region() can be made MM-internal

   - "mm/mglru: performance optimizations" from Yu Zhao fixes a few
     MGLRU regressions and otherwise improves MGLRU performance

   - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
     Park updates DAMON documentation

   - "Cleanup for memfd_create()" from Isaac Manjarres does that thing

   - "mm: hugetlb+THP folio and migration cleanups" from David
     Hildenbrand provides various cleanups in the areas of hugetlb
     folios, THP folios and migration

   - "Uncached buffered IO" from Jens Axboe implements the new
     RWF_DONTCACHE flag which provides synchronous dropbehind for
     pagecache reading and writing. To permit userspace to address
     issues with massive buildup of useless pagecache when
     reading/writing fast devices

   - "selftests/mm: virtual_address_range: Reduce memory" from Thomas
     Weißschuh fixes and optimizes some of the MM selftests"

* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
  mm/compaction: fix UBSAN shift-out-of-bounds warning
  s390/mm: add missing ctor/dtor on page table upgrade
  kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
  tools: add VM_WARN_ON_VMG definition
  mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
  seqlock: add missing parameter documentation for raw_seqcount_try_begin()
  mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
  mm/page_alloc: remove the incorrect and misleading comment
  zram: remove zcomp_stream_put() from write_incompressible_page()
  mm: separate move/undo parts from migrate_pages_batch()
  mm/kfence: use str_write_read() helper in get_access_type()
  selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
  kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
  selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
  selftests/mm: vm_util: split up /proc/self/smaps parsing
  selftests/mm: virtual_address_range: unmap chunks after validation
  selftests/mm: virtual_address_range: mmap() without PROT_WRITE
  selftests/memfd/memfd_test: fix possible NULL pointer dereference
  mm: add FGP_DONTCACHE folio creation flag
  mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
  ...
2025-01-26 18:36:23 -08:00
Jens Axboe
d94d23fdd7 mm: add FGP_DONTCACHE folio creation flag
Callers can pass this in for uncached folio creation, in which case if a
folio is newly created it gets marked as uncached.  If a folio exists for
this index and lookup succeeds, then it will not get marked as uncached. 
If an !uncached lookup finds a folio already marked uncached, clear the
flag.  For that case, there are competing uncached and cached users of
the folio, and it should not get pruned.
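
Assuming FGP_DONTCACHE maps onto the dropbehind folio flag, the lookup
semantics described above reduce to roughly the following (a sketch, not
the verbatim code):

        if (!folio) {
                /* newly created: honour the caller's uncached request */
                folio = filemap_alloc_folio(gfp, order);
                if (folio && (fgp_flags & FGP_DONTCACHE))
                        __folio_set_dropbehind(folio);
        } else if (!(fgp_flags & FGP_DONTCACHE) &&
                   folio_test_dropbehind(folio)) {
                /* a cached user appeared; don't prune this folio later */
                folio_clear_dropbehind(folio);
        }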

Link: https://lkml.kernel.org/r/20241220154831.1086649-13-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:44 -08:00
Jens Axboe
dddc559f2e mm/filemap: add filemap_fdatawrite_range_kick() helper
Works like filemap_fdatawrite_range(), except it's a non-integrity data
writeback and hence only starts writeback on the specified range.  Will
help facilitate generically starting uncached writeback from
generic_write_sync(), as header dependencies preclude doing this inline
from fs.h.
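
The helper itself is a thin wrapper; a sketch consistent with the
description above:

        int filemap_fdatawrite_range_kick(struct address_space *mapping,
                                          loff_t start, loff_t end)
        {
                /* non-integrity writeback: start I/O, don't wait on it */
                return __filemap_fdatawrite_range(mapping, start, end,
                                                  WB_SYNC_NONE);
        }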

Link: https://lkml.kernel.org/r/20241220154831.1086649-11-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:43 -08:00
Jens Axboe
fb7d3bc414 mm/filemap: drop streaming/uncached pages when writeback completes
If the folio is marked as streaming, drop pages when writeback completes. 
Intended to be used with RWF_DONTCACHE, to avoid needing sync writes for
uncached IO.

Link: https://lkml.kernel.org/r/20241220154831.1086649-10-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:43 -08:00
Jens Axboe
8026e49bff mm/filemap: add read support for RWF_DONTCACHE
Add RWF_DONTCACHE as a read operation flag, which means that any data read
will be removed from the page cache upon completion.  Uses the page cache
to synchronize, and simply prunes folios that were instantiated when the
operation completes.  While it would be possible to use private pages for
this, using the page cache as synchronization is handy for a variety of
reasons:

1) No special truncate magic is needed
2) Async buffered reads need some place to serialize, using the page
   cache is a lot easier than writing extra code for this
3) The pruning cost is pretty reasonable

and the code to support this is much simpler as a result.

You can think of uncached buffered IO as being the much more attractive
cousin of O_DIRECT - it has none of the restrictions of O_DIRECT.  Yes, it
will copy the data, but unlike regular buffered IO, it doesn't run into
the unpredictability of the page cache in terms of reclaim.  As an
example, on a test box with 32 drives, reading them with buffered IO looks
as follows:

Reading bs 65536, uncached 0
  1s: 145945MB/sec
  2s: 158067MB/sec
  3s: 157007MB/sec
  4s: 148622MB/sec
  5s: 118824MB/sec
  6s: 70494MB/sec
  7s: 41754MB/sec
  8s: 90811MB/sec
  9s: 92204MB/sec
 10s: 95178MB/sec
 11s: 95488MB/sec
 12s: 95552MB/sec
 13s: 96275MB/sec

where it's quite easy to see when the page cache filled up: performance
went from good to erratic, and finally settled at a much lower rate.
Looking at top while this is ongoing, we see:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
7535 root      20   0  267004      0      0 S  3199   0.0   8:40.65 uncached
3326 root      20   0       0      0      0 R 100.0   0.0   0:16.40 kswapd4
3327 root      20   0       0      0      0 R 100.0   0.0   0:17.22 kswapd5
3328 root      20   0       0      0      0 R 100.0   0.0   0:13.29 kswapd6
3332 root      20   0       0      0      0 R 100.0   0.0   0:11.11 kswapd10
3339 root      20   0       0      0      0 R 100.0   0.0   0:16.25 kswapd17
3348 root      20   0       0      0      0 R 100.0   0.0   0:16.40 kswapd26
3343 root      20   0       0      0      0 R 100.0   0.0   0:16.30 kswapd21
3344 root      20   0       0      0      0 R 100.0   0.0   0:11.92 kswapd22
3349 root      20   0       0      0      0 R 100.0   0.0   0:16.28 kswapd27
3352 root      20   0       0      0      0 R  99.7   0.0   0:11.89 kswapd30
3353 root      20   0       0      0      0 R  96.7   0.0   0:16.04 kswapd31
3329 root      20   0       0      0      0 R  96.4   0.0   0:11.41 kswapd7
3345 root      20   0       0      0      0 R  96.4   0.0   0:13.40 kswapd23
3330 root      20   0       0      0      0 S  91.1   0.0   0:08.28 kswapd8
3350 root      20   0       0      0      0 S  86.8   0.0   0:11.13 kswapd28
3325 root      20   0       0      0      0 S  76.3   0.0   0:07.43 kswapd3
3341 root      20   0       0      0      0 S  74.7   0.0   0:08.85 kswapd19
3334 root      20   0       0      0      0 S  71.7   0.0   0:10.04 kswapd12
3351 root      20   0       0      0      0 R  60.5   0.0   0:09.59 kswapd29
3323 root      20   0       0      0      0 R  57.6   0.0   0:11.50 kswapd1
[...]

which is just showing a partial list of the 32 kswapd threads that are
running mostly full tilt, burning ~28 full CPU cores.

If the same test case is run with RWF_DONTCACHE set for the buffered read,
the output looks as follows:

Reading bs 65536, uncached 1
  1s: 153144MB/sec
  2s: 156760MB/sec
  3s: 158110MB/sec
  4s: 158009MB/sec
  5s: 158043MB/sec
  6s: 157638MB/sec
  7s: 157999MB/sec
  8s: 158024MB/sec
  9s: 157764MB/sec
 10s: 157477MB/sec
 11s: 157417MB/sec
 12s: 157455MB/sec
 13s: 157233MB/sec
 14s: 156692MB/sec

which is just chugging along at ~155GB/sec of read performance. Looking
at top, we see:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
7961 root      20   0  267004      0      0 S  3180   0.0   5:37.95 uncached
8024 axboe     20   0   14292   4096      0 R   1.0   0.0   0:00.13 top

where only the test app is using CPU; no reclaim is taking place outside
of the main thread.  Not only is performance 65% better, it's also using
half the CPU to do it.

Link: https://lkml.kernel.org/r/20241220154831.1086649-9-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:43 -08:00
Jens Axboe
f598cdaafc mm/filemap: use page_cache_sync_ra() to kick off read-ahead
Rather than use the page_cache_sync_readahead() helper, define our own
ractl and use page_cache_sync_ra() directly.  In preparation for needing
to modify ractl inside filemap_get_pages().

No functional changes in this patch.
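
The shape of the change, sketched (not the verbatim hunk):

	/* before */
	page_cache_sync_readahead(mapping, ra, file, index,
				  last_index - index);

	/* after: a local ractl that filemap_get_pages() can modify */
	DEFINE_READAHEAD(ractl, file, ra, mapping, index);
	page_cache_sync_ra(&ractl, last_index - index);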

Link: https://lkml.kernel.org/r/20241220154831.1086649-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:42 -08:00
Jens Axboe
9ad6344568 mm/filemap: change filemap_create_folio() to take a struct kiocb
Patch series "Uncached buffered IO", v8.

5 years ago I posted patches adding support for RWF_UNCACHED, as a way to
do buffered IO that isn't page cache persistent.  The approach back then
was to have private pages for IO, and then get rid of them once IO was
done.  But that then runs into all the issues that O_DIRECT has, in terms
of synchronizing with the page cache.

So here's a new approach to the same concept, but using the page cache as
synchronization.  Due to excessive bike shedding on the naming, this is
now named RWF_DONTCACHE, and is less special in that it's just page cache
IO, except it prunes the ranges once IO is completed.

Why do this, you may ask?  The tldr is that device speeds are only getting
faster, while reclaim is not.  Doing normal buffered IO can be very
unpredictable, and suck up a lot of resources on the reclaim side.  This
leads people to use O_DIRECT as a work-around, which has its own set of
restrictions in terms of size, offset, and length of IO.  It's also
inherently synchronous, and now you need async IO as well.  While the
latter isn't necessarily a big problem as we have good options available
there, it also should not be a requirement when all you want to do is read
or write some data without caching.

Even on desktop type systems, a normal NVMe device can fill the entire
page cache in seconds.  On the big system I used for testing, there's a
lot more RAM, but also a lot more devices.  As can be seen in some of the
results in the following patches, you can still fill RAM in seconds even
when there's 1TB of it.  Hence this problem isn't solely a "big
hyperscaler system" issue, it's common across the board.

Common for both reads and writes with RWF_DONTCACHE is that they use the
page cache for IO.  Reads work just like a normal buffered read would,
with the only exception being that the touched ranges will get pruned
after data has been copied.  For writes, the ranges will get writeback
kicked off before the syscall returns, and then writeback completion will
prune the range.  Hence writes aren't synchronous, and it's easy to
pipeline writes using RWF_DONTCACHE.  Folios that aren't instantiated by
RWF_DONTCACHE IO are left untouched.  This means that uncached IO will
take advantage of the page cache for uptodate data, but not leave anything
it instantiated/created in cache.

File systems need to support this.  This patchset adds support for the
generic read path, which covers file systems like ext4.  Patches exist to
add support for iomap/XFS and btrfs as well, which sit on top of this
series.  If RWF_DONTCACHE IO is attempted on a file system that doesn't
support it, -EOPNOTSUPP is returned.  Hence the user can rely on it either
working as designed, or flagging an error if that's not the case.  The
intent here is to give the application a sensible fallback path - eg, it
may fall back to O_DIRECT if appropriate, or just live with the fact that
uncached IO isn't available and do normal buffered IO.

Adding "support" to other file systems should be trivial, most of the time
just a one-liner adding FOP_DONTCACHE to the fop_flags in the
file_operations struct, if the file system is using either iomap or the
generic filemap helpers for reading and writing.
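
For instance, for a hypothetical "myfs" already built on the generic
filemap helpers, the opt-in would be along these lines:

	static const struct file_operations myfs_file_operations = {
		.read_iter	= generic_file_read_iter,
		.write_iter	= generic_file_write_iter,
		.mmap		= generic_file_mmap,
		/* declare that RWF_DONTCACHE IO is supported */
		.fop_flags	= FOP_DONTCACHE,
	};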

Performance results are in patch 8 for reads, and you can find the write
side results in the XFS patch adding support for DONTCACHE writes for XFS:

https://git.kernel.dk/cgit/linux/commit/?h=buffered-uncached-fs.10&id=257e92de795fdff7d7e256501e024fac6da6a7f4

with the tldr being that I see about a 65% improvement in performance for
both, with fully predictable IO times.  CPU reduction is substantial as
well, with no kswapd activity at all for reclaim when using uncached IO.

Using it from applications is trivial - just set RWF_DONTCACHE for the
read or write, using pwritev2(2) or preadv2(2).  For io_uring, same thing,
just set RWF_DONTCACHE in sqe->rw_flags for a buffered read/write
operation.  And that's it.
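
A hedged userspace example (assuming headers that define RWF_DONTCACHE;
the fallback #define mirrors the expected uapi value, and error handling
is trimmed):

	#define _GNU_SOURCE
	#include <sys/uio.h>

	#ifndef RWF_DONTCACHE
	#define RWF_DONTCACHE	0x00000080	/* assumed uapi value */
	#endif

	static ssize_t read_dontcache(int fd, void *buf, size_t len,
				      off_t off)
	{
		struct iovec iov = { .iov_base = buf, .iov_len = len };

		/*
		 * A normal buffered read, except any pages it
		 * instantiated are pruned once the copy completes.
		 */
		return preadv2(fd, &iov, 1, off, RWF_DONTCACHE);
	}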

Patches 1..7 are just prep patches, and should have no functional changes
at all.  Patch 8 adds support for the filemap path for RWF_DONTCACHE
reads, and patches 9..12 are just prep patches for supporting the write
side of uncached writes.  In the below mentioned branch, there are then
patches to adopt uncached reads and writes for xfs, btrfs, and ext4.  The
latter currently relies on a bit of a hack for passing whether this is an
uncached write or not through ->write_begin(), which can hopefully go away
once ext4 adopts iomap for buffered writes.  I call this a hack as it's
not the prettiest way to do it; however, it is fully solid and will work
just fine.

Passes full xfstests and fsx overnight runs with no issues observed.  That
includes runs where the VM doing the testing was itself backed by
RWF_DONTCACHE IO on the host.
I'll post fsstress and fsx patches for RWF_DONTCACHE separately.  As far
as I'm concerned, no further work needs doing here.

And git tree for the patches is here:

https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached.10

with the file system patches on top adding support for xfs/btrfs/ext4
here:

https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached-fs.10


This patch (of 12):

Rather than pass in both the file and position directly from the kiocb,
just take a struct kiocb instead.  With the kiocb being passed in, skip
passing in the address_space separately as well.  While doing so, move the
ki_flags checking into filemap_create_folio() as well.  In preparation for
actually needing the kiocb in the function.

No functional changes in this patch.
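
Roughly, the signature goes from the former to the latter (a sketch, not
the verbatim diff):

	/* before */
	static int filemap_create_folio(struct file *file,
			struct address_space *mapping, loff_t pos,
			struct folio_batch *fbatch);

	/* after: file, mapping, and position all come from the kiocb */
	static int filemap_create_folio(struct kiocb *iocb,
			struct folio_batch *fbatch);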

Link: https://lkml.kernel.org/r/20241220154831.1086649-1-axboe@kernel.dk
Link: https://lkml.kernel.org/r/20241220154831.1086649-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Chris Mason <clm@meta.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:42 -08:00
Linus Torvalds
8883957b3c Merge tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify pre-content notification support from Jan Kara:
 "This introduces a new fsnotify event (FS_PRE_ACCESS) that gets
  generated before a file contents is accessed.

  The event is synchronous so if there is listener for this event, the
  kernel waits for reply. On success the execution continues as usual,
  on failure we propagate the error to userspace. This allows userspace
  to fill in file content on demand from slow storage. The context in
  which the events are generated has been picked so that we don't hold
  any locks and thus there's no risk of a deadlock for the userspace
  handler.

  The new pre-content event is available only for users with global
  CAP_SYS_ADMIN capability (similarly to other parts of fanotify
  functionality) and it is the administrator's responsibility to make sure
  the userspace event handler doesn't do stupid stuff that can DoS the
  system.

  Based on your feedback from the last submission, the fsnotify code has
  been improved: file->f_mode now encodes whether a pre-content event
  needs to be generated for the file, so the fast path for files that
  nobody watches only grows an additional file->f_mode check. As a bonus
  this also removes the checks for whether the old FS_ACCESS event needs
  to be generated from the fast path. Also, the place where the event is
  generated during page fault has been moved, so filemap_fault() now
  generates the event if and only if there is no uptodate folio in the
  page cache.

  Also we have dropped the FS_PRE_MODIFY event, as current real-world
  users of the pre-content functionality don't really use it, so let's
  start with the minimal useful feature set"

* tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits)
  fanotify: Fix crash in fanotify_init(2)
  fs: don't block write during exec on pre-content watched files
  fs: enable pre-content events on supported file systems
  ext4: add pre-content fsnotify hook for DAX faults
  btrfs: disable defrag on pre-content watched files
  xfs: add pre-content fsnotify hook for DAX faults
  fsnotify: generate pre-content permission event on page fault
  mm: don't allow huge faults for files with pre content watches
  fanotify: disable readahead if we have pre-content watches
  fanotify: allow to set errno in FAN_DENY permission response
  fanotify: report file range info with pre-content events
  fanotify: introduce FAN_PRE_ACCESS permission event
  fsnotify: generate pre-content permission event on truncate
  fsnotify: pass optional file access range in pre-content event
  fsnotify: introduce pre-content permission events
  fanotify: reserve event bit of deprecated FAN_DIR_MODIFY
  fanotify: rename a misnamed constant
  fanotify: don't skip extra event info if no info_mode is set
  fsnotify: check if file is actually being watched for pre-content events on open
  fsnotify: opt-in for permission events at file open time
  ...
2025-01-23 13:36:06 -08:00
Linus Torvalds
5f537664e7 cachestat: fix page cache statistics permission checking
When the 'cachestat()' system call was added in commit cf264e1329
("cachestat: implement cachestat syscall"), it was meant to be a much
more convenient (and performant) version of mincore() that didn't need
mapping things into the user virtual address space in order to work.

But it ended up missing the "check for writability or ownership" fix for
mincore(), done in commit 134fca9063 ("mm/mincore.c: make mincore()
more conservative").

This just adds equivalent logic to 'cachestat()', modified for the file
context (rather than vma).
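
A hedged sketch of that logic, reusing the helpers the mincore() fix
leans on (not necessarily the verbatim hunk):

	static bool can_do_cachestat(struct file *f)
	{
		/*
		 * Same rules as mincore(): a writable fd, file
		 * ownership/capability, or the ability to open the
		 * file for writing.
		 */
		if (f->f_mode & FMODE_WRITE)
			return true;
		if (inode_owner_or_capable(file_mnt_idmap(f),
					   file_inode(f)))
			return true;
		return file_permission(f, MAY_WRITE) == 0;
	}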

Reported-by: Sudheendra Raghav Neela <sneela@tugraz.at>
Fixes: cf264e1329 ("cachestat: implement cachestat syscall")
Tested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-01-21 20:30:19 -08:00
Dr. David Alan Gilbert
1168b2bec7 filemap: remove unused folio_add_wait_queue
folio_add_wait_queue() has been unused since 2021's commit 850cba069c
("cachefiles: Delete the cachefiles driver pending rewrite")

Remove it.

Link: https://lkml.kernel.org/r/20241116151446.95555-1-linux@treblig.org
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13 22:40:38 -08:00
Matthew Wilcox (Oracle)
1c47c57818 mm: fix assertion in folio_end_read()
We only need to assert that the uptodate flag is clear if we're going to
set it.  This hasn't been a problem before now because we have only used
folio_end_read() when completing with an error, but it's convenient to use
it in squashfs if we discover the folio is already uptodate.
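
The fix amounts to moving the assertion under the success branch,
sketched against folio_end_read()'s flag mask (the surrounding function
is elided):

	unsigned long mask = 1 << PG_locked;

	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
	if (likely(success)) {
		/* only assert uptodate is clear if we're setting it */
		VM_BUG_ON_FOLIO(folio_test_uptodate(folio), folio);
		mask |= 1 << PG_uptodate;
	}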

Link: https://lkml.kernel.org/r/20250110163300.3346321-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12 19:03:38 -08:00
Marco Nelissen
f505e6c91e filemap: avoid truncating 64-bit offset to 32 bits
On 32-bit kernels, folio_seek_hole_data() was inadvertently truncating a
64-bit value to 32 bits, leading to a possible infinite loop when writing
to an xfs filesystem.
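
The bug class, illustrated (this is not the actual code from
folio_seek_hole_data()):

	/*
	 * On 32-bit, a 4-byte intermediate silently drops the upper
	 * half of a 64-bit file offset.
	 */
	loff_t offset = 5ULL << 30;	/* a position beyond 4GiB */
	unsigned long pos = offset;	/* 32 bits here: now 1GiB */
	/* a seek loop keyed on pos may then never advance */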

Link: https://lkml.kernel.org/r/20250102190540.1356838-1-marco.nelissen@gmail.com
Fixes: 54fa39ac2e ("iomap: use mapping_seek_hole_data")
Signed-off-by: Marco Nelissen <marco.nelissen@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12 19:03:35 -08:00
Kairui Song
62e72d2cf7 mm, madvise: fix potential workingset node list_lru leaks
Since commit 5abc1e37af ("mm: list_lru: allocate list_lru_one only when
needed"), all list_lru users need to allocate the items using the new
infrastructure that provides list_lru info for slab allocation, ensuring
that the corresponding memcg list_lru is allocated before use.

For workingset shadow nodes (which are xa_node), users are converted to
use the new infrastructure by commit 9bbdc0f324 ("xarray: use
kmem_cache_alloc_lru to allocate xa_node").  The xas->xa_lru will be set
correctly for filemap users.  However, there is a missing case: xa_node
allocations caused by madvise(..., MADV_COLLAPSE).

madvise(..., MADV_COLLAPSE) will also read in the absent parts of the file
mapping, and there will be xa_nodes allocated for the caller's memcg (assuming
it's not rootcg).  However, these allocations won't trigger memcg list_lru
allocation because the proper xas info was not set.

If nothing else has allocated other xa_nodes for that memcg to trigger
list_lru creation, and memory pressure starts to evict file pages,
workingset_update_node will try to add these xa_nodes to their
corresponding memcg list_lru, which does not exist (NULL).  So they will
be added to rootcg's list_lru instead.

This shouldn't be a significant issue in practice, but it is unexpected
behavior, and these xa_nodes will not be reclaimed effectively.  It may
also lead to incorrect counting of the list_lru->nr_items counter.

This problem wasn't exposed until recent commit 28e98022b3
("mm/list_lru: simplify reparenting and initial allocation") added a
sanity check: only a dying memcg may have a NULL list_lru when
list_lru_{add,del} is called.  This problem triggered that WARNING.

So make madvise(..., MADV_COLLAPSE) also call xas_set_lru() to pass the
list_lru which we may want to insert the xa_node into later.  Also move
mapping_set_update to mm/internal.h, turning it into a macro, to avoid
including extra headers in mm/internal.h.
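
The moved helper, roughly (hedged; the body shown matches how filemap
already wires up shadow node tracking):

	/* mm/internal.h */
	#define mapping_set_update(xas, mapping) do {			\
		if (!dax_mapping(mapping) && !shmem_mapping(mapping)) {	\
			xas_set_update(xas, workingset_update_node);	\
			xas_set_lru(xas, &shadow_nodes);		\
		}							\
	} while (0)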

Link: https://lkml.kernel.org/r/20241222122936.67501-1-ryncsn@gmail.com
Fixes: 9bbdc0f324 ("xarray: use kmem_cache_alloc_lru to allocate xa_node")
Reported-by: syzbot+38a0cbd267eff2d286ff@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/675d01e9.050a0220.37aaf.00be.GAE@google.com/
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-30 17:59:11 -08:00
Josef Bacik
8392bc2ff8 fsnotify: generate pre-content permission event on page fault
FS_PRE_ACCESS will be generated on page fault depending on the faulting
method. This pre-content event is meant to be used by hierarchical storage
managers that want to fill in the file content on first read access.

Export a simple helper for file systems that have their own ->fault()
to use, and a more complicated helper to do the fancy things in
filemap_fault().
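
For a file system with its own ->fault(), usage could look roughly like
this; the filemap_fsnotify_fault() name is assumed from this series, and
"myfs" is hypothetical:

	static vm_fault_t myfs_fault(struct vm_fault *vmf)
	{
		/* emit FS_PRE_ACCESS before any content is paged in */
		vm_fault_t ret = filemap_fsnotify_fault(vmf);

		if (unlikely(ret))
			return ret;
		return filemap_fault(vmf);
	}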

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/aa56c50ce81b1fd18d7f5d71dd2dfced5eba9687.1731684329.git.josef@toxicpanda.com
2024-12-11 17:28:41 +01:00
Josef Bacik
fac84846a2 fanotify: disable readahead if we have pre-content watches
With page faults we can trigger readahead on the file, and then
subsequent faults can find these pages and map them in without emitting
an fanotify event.  To avoid this case, disable
readahead if we have pre-content watches on the file.  This way we are
guaranteed to get an event for every range we attempt to access on a
pre-content watched file.
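
Conceptually, the readahead entry points gain an early-out along these
lines (fsnotify_file_has_pre_content_watches() is assumed from this
series; a sketch only):

	/* at the top of page_cache_ra_unbounded() and friends */
	if (fsnotify_file_has_pre_content_watches(ractl->file))
		return;	/* no readahead: every access must take a real
			   fault and generate its FS_PRE_ACCESS event */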

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/70a54e859f555e54bc7a47b32fe5aca92b085615.1731684329.git.josef@toxicpanda.com
2024-12-10 12:03:17 +01:00
David Hildenbrand
3203b3ab0f mm/filemap: don't call folio_test_locked() without a reference in next_uptodate_folio()
The folio can get freed + buddy-merged + reallocated in the meantime,
resulting in us calling folio_test_locked() possibly on a tail page.

This makes const_folio_flags trigger VM_BUG_ON_PGFLAGS() when stumbling
over the tail page.

Could this result in other issues?  Doesn't look like it.  False positives
and false negatives don't really matter, because such a folio would get
skipped either way once we detect that it has been reallocated in the
meantime.

Fix it by performing the folio_test_locked() check after grabbing a
reference.  If this ever becomes a real problem, we could add a special
helper that racily checks if the bit is set even on tail pages ...  but
let's hope that's not required so we can just handle it cleaner: work on
the folio after we hold a reference.

Do we really need the folio_test_locked() check if we are going to trylock
briefly after?  Well, we can at least avoid an xas_reload().
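
The shape of the fix, sketched:

	/* before: flag test on a folio we hold no reference to */
	if (folio_test_locked(folio))
		continue;
	if (!folio_try_get(folio))
		continue;

	/* after: take the reference first, then look at the flags */
	if (!folio_try_get(folio))
		continue;
	if (folio_test_locked(folio)) {
		folio_put(folio);
		continue;
	}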

It's a bit unclear which exact change introduced that issue.  Likely, ever
since we made PG_locked obey the PF_NO_TAIL policy it could have been
triggered in some way.

Link: https://lkml.kernel.org/r/20241129125303.4033164-1-david@redhat.com
Fixes: 48c935ad88 ("page-flags: define PG_locked behavior on compound pages")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: syzbot+9f9a7f73fb079b2387a6@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/674184c9.050a0220.1cc393.0001.GAE@google.com/
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Hillf Danton <hdanton@sina.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05 19:54:47 -08:00
Linus Torvalds
5c00ff742b Merge tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - The series "zram: optimal post-processing target selection" from
   Sergey Senozhatsky improves zram's post-processing selection
   algorithm. This leads to improved memory savings.

 - Wei Yang has gone to town on the mapletree code, contributing several
   series which clean up the implementation:
	- "refine mas_mab_cp()"
	- "Reduce the space to be cleared for maple_big_node"
	- "maple_tree: simplify mas_push_node()"
	- "Following cleanup after introduce mas_wr_store_type()"
	- "refine storing null"

 - The series "selftests/mm: hugetlb_fault_after_madv improvements" from
   David Hildenbrand fixes this selftest for s390.

 - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
   implements some rationalizations and cleanups in the page mapping
   code.

 - The series "mm: optimize shadow entries removal" from Shakeel Butt
   optimizes the file truncation code by speeding up the handling of
   shadow entries.

 - The series "Remove PageKsm()" from Matthew Wilcox completes the
   migration of this flag over to being a folio-based flag.

 - The series "Unify hugetlb into arch_get_unmapped_area functions" from
   Oscar Salvador implements a bunch of consolidations and cleanups in
   the hugetlb code.

 - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
   takes away the wp-fault time practice of turning a huge zero page
   into small pages. Instead we replace the whole thing with a THP. More
   consistent, cleaner, and it potentially saves a large number of
   pagefaults.

 - The series "percpu: Add a test case and fix for clang" from Andy
   Shevchenko enhances and fixes the kernel's built in percpu test code.

 - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
   optimizes mremap() by avoiding doing things which we didn't need to
   do.

 - The series "Improve the tmpfs large folio read performance" from
   Baolin Wang teaches tmpfs to copy data into userspace at the folio
   size rather than as individual pages. A 20% speedup was observed.

 - The series "mm/damon/vaddr: Fix issue in
   damon_va_evenly_split_region()" from Zheng Yejian fixes DAMON
   splitting.

 - The series "memcg-v1: fully deprecate charge moving" from Shakeel
   Butt removes the long-deprecated memcgv2 charge moving feature.

 - The series "fix error handling in mmap_region() and refactor" from
   Lorenzo Stoakes cleans up some of the mmap() error handling and
   addresses some potential performance issues.

 - The series "x86/module: use large ROX pages for text allocations"
   from Mike Rapoport teaches x86 to use large pages for
   read-only-execute module text.

 - The series "page allocation tag compression" from Suren Baghdasaryan
   is followon maintenance work for the new page allocation profiling
   feature.

 - The series "page->index removals in mm" from Matthew Wilcox remove
   most references to page->index in mm/. A slow march towards shrinking
   struct page.

 - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
   interface tests" from Andrew Paniakin performs maintenance work for
   DAMON's self testing code.

 - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
   improves zswap's batching of compression and decompression. It is a
   step along the way towards using Intel IAA hardware acceleration for
   this zswap operation.

 - The series "kasan: migrate the last module test to kunit" from
   Sabyrzhan Tasbolatov completes the migration of the KASAN built-in
   tests over to the KUnit framework.

 - The series "implement lightweight guard pages" from Lorenzo Stoakes
   permits userspace to place fault-generating guard pages within a
   single VMA, rather than requiring that multiple VMAs be created for
   this. Improved efficiencies for userspace memory allocators are
   expected.

 - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
   tracepoints to provide increased visibility into memcg stats flushing
   activity.

 - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
   fixes a zram buglet which potentially affected performance.

 - The series "mm: add more kernel parameters to control mTHP" from
   Maíra Canal enhances our ability to control/configure multisize THP
   from the kernel boot command line.

 - The series "kasan: few improvements on kunit tests" from Sabyrzhan
   Tasbolatov has a couple of fixups for the KASAN KUnit tests.

 - The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
   from Kairui Song optimizes list_lru memory utilization when lockdep
   is enabled.

* tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits)
  cma: enforce non-zero pageblock_order during cma_init_reserved_mem()
  mm/kfence: add a new kunit test test_use_after_free_read_nofault()
  zram: fix NULL pointer in comp_algorithm_show()
  memcg/hugetlb: add hugeTLB counters to memcg
  vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event
  mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount
  zram: ZRAM_DEF_COMP should depend on ZRAM
  MAINTAINERS/MEMORY MANAGEMENT: add document files for mm
  Docs/mm/damon: recommend academic papers to read and/or cite
  mm: define general function pXd_init()
  kmemleak: iommu/iova: fix transient kmemleak false positive
  mm/list_lru: simplify the list_lru walk callback function
  mm/list_lru: split the lock to per-cgroup scope
  mm/list_lru: simplify reparenting and initial allocation
  mm/list_lru: code clean up for reparenting
  mm/list_lru: don't export list_lru_add
  mm/list_lru: don't pass unnecessary key parameters
  kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller
  kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW
  kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols
  ...
2024-11-23 09:58:07 -08:00
Linus Torvalds
0f25f0e4ef the bulk of struct fd memory safety stuff

Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull 'struct fd' class updates from Al Viro:
 "The bulk of struct fd memory safety stuff

  Making sure that struct fd instances are destroyed in the same scope
  where they'd been created, getting rid of reassignments and passing
  them by reference, converting to CLASS(fd{,_pos,_raw}).

  We are getting very close to having the memory safety of that stuff
  trivial to verify"

* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
  deal with the last remaining boolean uses of fd_file()
  css_set_fork(): switch to CLASS(fd_raw, ...)
  memcg_write_event_control(): switch to CLASS(fd)
  assorted variants of irqfd setup: convert to CLASS(fd)
  do_pollfd(): convert to CLASS(fd)
  convert do_select()
  convert vfs_dedupe_file_range().
  convert cifs_ioctl_copychunk()
  convert media_request_get_by_fd()
  convert spu_run(2)
  switch spufs_calls_{get,put}() to CLASS() use
  convert cachestat(2)
  convert do_preadv()/do_pwritev()
  fdget(), more trivial conversions
  fdget(), trivial conversions
  privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget()
  o2hb_region_dev_store(): avoid goto around fdget()/fdput()
  introduce "fd_pos" class, convert fdget_pos() users to it.
  fdget_raw() users: switch to CLASS(fd_raw)
  convert vmsplice() to CLASS(fd)
  ...
2024-11-18 12:24:06 -08:00