- The 4 patch series "mm: ksm: prevent KSM from breaking merging of new
VMAs" from Lorenzo Stoakes addresses an issue with KSM's
PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for
merging with existing adjacent VMAs.
- The 4 patch series "mm/damon: introduce DAMON_STAT for simple and
practical access monitoring" from SeongJae Park adds a new kernel module
which simplifies the setup and usage of DAMON in production
environments.
- The 6 patch series "stop passing a writeback_control to swap/shmem
writeout" from Christoph Hellwig is a cleanup to the writeback code
which removes a couple of pointers from struct writeback_control.
- The 7 patch series "drivers/base/node.c: optimization and cleanups"
from Donet Tom contains largely uncorrelated cleanups to the NUMA node
setup and management code.
- The 4 patch series "mm: userfaultfd: assorted fixes and cleanups" from
Tal Zussman does some maintenance work on the userfaultfd code.
- The 5 patch series "Readahead tweaks for larger folios" from Ryan
Roberts implements some tuneups for pagecache readahead when it is
reading into order>0 folios.
- The 4 patch series "selftests/mm: Tweaks to the cow test" from Mark
Brown provides some cleanups and consistency improvements to the
selftests code.
- The 4 patch series "Optimize mremap() for large folios" from Dev Jain
does that. A 37% reduction in execution time was measured in a
memset+mremap+munmap microbenchmark.
- The 5 patch series "Remove zero_user()" from Matthew Wilcox expunges
zero_user() in favor of the more modern memzero_page().
- The 3 patch series "mm/huge_memory: vmf_insert_folio_*() and
vmf_insert_pfn_pud() fixes" from David Hildenbrand addresses some warts
which David noticed in the huge page code. These were not known to be
causing any issues at this time.
- The 3 patch series "mm/damon: use alloc_migrate_target() for
DAMOS_MIGRATE_{HOT,COLD" from SeongJae Park provides some cleanup and
consolidation work in DAMON.
- The 3 patch series "use vm_flags_t consistently" from Lorenzo Stoakes
uses vm_flags_t in places where we were inappropriately using other
types.
- The 3 patch series "mm/memfd: Reserve hugetlb folios before
allocation" from Vivek Kasireddy increases the reliability of large page
allocation in the memfd code.
- The 14 patch series "mm: Remove pXX_devmap page table bit and pfn_t
type" from Alistair Popple removes several now-unneeded PFN_* flags.
- The 5 patch series "mm/damon: decouple sysfs from core" from SeongJae
Park implememnts some cleanup and maintainability work in the DAMON
sysfs layer.
- The 5 patch series "madvise cleanup" from Lorenzo Stoakes does quite a
lot of cleanup/maintenance work in the madvise() code.
- The 4 patch series "madvise anon_name cleanups" from Vlastimil Babka
provides additional cleanups on top or Lorenzo's effort.
- The 11 patch series "Implement numa node notifier" from Oscar Salvador
creates a standalone notifier for NUMA node memory state changes.
Previously these were lumped under the more general memory on/offline
notifier.
- The 6 patch series "Make MIGRATE_ISOLATE a standalone bit" from Zi Yan
cleans up the pageblock isolation code and fixes a potential issue which
doesn't seem to cause any problems in practice.
- The 5 patch series "selftests/damon: add python and drgn based DAMON
sysfs functionality tests" from SeongJae Park adds additional drgn- and
python-based DAMON selftests which are more comprehensive than the
existing selftest suite.
- The 5 patch series "Misc rework on hugetlb faulting path" from Oscar
Salvador fixes a rather obscure deadlock in the hugetlb fault code and
follows that fix with a series of cleanups.
- The 3 patch series "cma: factor out allocation logic from
__cma_declare_contiguous_nid" from Mike Rapoport rationalizes and cleans
up the highmem-specific code in the CMA allocator.
- The 28 patch series "mm/migration: rework movable_ops page migration
(part 1)" from David Hildenbrand provides cleanups and
future-preparedness to the migration code.
- The 2 patch series "mm/damon: add trace events for auto-tuned
monitoring intervals and DAMOS quota" from SeongJae Park adds some
tracepoints to some DAMON auto-tuning code.
- The 6 patch series "mm/damon: fix misc bugs in DAMON modules" from
SeongJae Park does that.
- The 6 patch series "mm/damon: misc cleanups" from SeongJae Park also
does what it claims.
- The 4 patch series "mm: folio_pte_batch() improvements" from David
Hildenbrand cleans up the large folio PTE batching code.
- The 13 patch series "mm/damon/vaddr: Allow interleaving in
migrate_{hot,cold} actions" from SeongJae Park facilitates dynamic
alteration of DAMON's inter-node allocation policy.
- The 3 patch series "Remove unmap_and_put_page()" from Vishal Moola
provides a couple of page->folio conversions.
- The 4 patch series "mm: per-node proactive reclaim" from Davidlohr
Bueso implements a per-node control of proactive reclaim - beyond the
current memcg-based implementation.
- The 14 patch series "mm/damon: remove damon_callback" from SeongJae
Park replaces the damon_callback interface with a more general and
powerful damon_call()+damos_walk() interface.
- The 10 patch series "mm/mremap: permit mremap() move of multiple VMAs"
from Lorenzo Stoakes implements a number of mremap cleanups (of course)
in preparation for adding new mremap() functionality: newly permit the
remapping of multiple VMAs when the user is specifying MREMAP_FIXED. It
still excludes some specialized situations where this cannot be
performed reliably.
- The 3 patch series "drop hugetlb_free_pgd_range()" from Anthony Yznaga
switches some sparc hugetlb code over to the generic version and removes
the thus-unneeded hugetlb_free_pgd_range().
- The 4 patch series "mm/damon/sysfs: support periodic and automated
stats update" from SeongJae Park augments the present
userspace-requested update of DAMON sysfs monitoring files. Automatic
update is now provided, along with a tunable to control the update
interval.
- The 4 patch series "Some randome fixes and cleanups to swapfile" from
Kemeng Shi does what is claims.
- The 4 patch series "mm: introduce snapshot_page" from Luiz Capitulino
and David Hildenbrand provides (and uses) a means by which debug-style
functions can grab a copy of a pageframe and inspect it locklessly
without tripping over the races inherent in operating on the live
pageframe directly.
- The 6 patch series "use per-vma locks for /proc/pid/maps reads" from
Suren Baghdasaryan addresses the large contention issues which can be
triggered by reads from that procfs file. Latencies are reduced by more
than half in some situations. The series also introduces several new
selftests for the /proc/pid/maps interface.
- The 6 patch series "__folio_split() clean up" from Zi Yan cleans up
__folio_split()!
- The 7 patch series "Optimize mprotect() for large folios" from Dev
Jain provides some quite large (>3x) speedups to mprotect() when dealing
with large folios.
- The 2 patch series "selftests/mm: reuse FORCE_READ to replace "asm
volatile("" : "+r" (XXX));" and some cleanup" from wang lian does some
cleanup work in the selftests code.
- The 3 patch series "tools/testing: expand mremap testing" from Lorenzo
Stoakes extends the mremap() selftest in several ways, including adding
more checking of Lorenzo's recently added "permit mremap() move of
multiple VMAs" feature.
- The 22 patch series "selftests/damon/sysfs.py: test all parameters"
from SeongJae Park extends the DAMON sysfs interface selftest so that it
tests all possible user-requested parameters. Rather than the present
minimal subset.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaIqcCgAKCRDdBJ7gKXxA
jkVBAQCCn9DR1QP0CRk961ot0cKzOgioSc0aA03DPb2KXRt2kQEAzDAz0ARurFhL
8BzbvI0c+4tntHLXvIlrC33n9KWAOQM=
=XsFy
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"As usual, many cleanups. The below blurbiage describes 42 patchsets.
21 of those are partially or fully cleanup work. "cleans up",
"cleanup", "maintainability", "rationalizes", etc.
I never knew the MM code was so dirty.
"mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes)
addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly
mapped VMAs were not eligible for merging with existing adjacent
VMAs.
"mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park)
adds a new kernel module which simplifies the setup and usage of
DAMON in production environments.
"stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig)
is a cleanup to the writeback code which removes a couple of
pointers from struct writeback_control.
"drivers/base/node.c: optimization and cleanups" (Donet Tom)
contains largely uncorrelated cleanups to the NUMA node setup and
management code.
"mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman)
does some maintenance work on the userfaultfd code.
"Readahead tweaks for larger folios" (Ryan Roberts)
implements some tuneups for pagecache readahead when it is reading
into order>0 folios.
"selftests/mm: Tweaks to the cow test" (Mark Brown)
provides some cleanups and consistency improvements to the
selftests code.
"Optimize mremap() for large folios" (Dev Jain)
does that. A 37% reduction in execution time was measured in a
memset+mremap+munmap microbenchmark.
"Remove zero_user()" (Matthew Wilcox)
expunges zero_user() in favor of the more modern memzero_page().
"mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand)
addresses some warts which David noticed in the huge page code.
These were not known to be causing any issues at this time.
"mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park)
provides some cleanup and consolidation work in DAMON.
"use vm_flags_t consistently" (Lorenzo Stoakes)
uses vm_flags_t in places where we were inappropriately using other
types.
"mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy)
increases the reliability of large page allocation in the memfd
code.
"mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple)
removes several now-unneeded PFN_* flags.
"mm/damon: decouple sysfs from core" (SeongJae Park)
implememnts some cleanup and maintainability work in the DAMON
sysfs layer.
"madvise cleanup" (Lorenzo Stoakes)
does quite a lot of cleanup/maintenance work in the madvise() code.
"madvise anon_name cleanups" (Vlastimil Babka)
provides additional cleanups on top or Lorenzo's effort.
"Implement numa node notifier" (Oscar Salvador)
creates a standalone notifier for NUMA node memory state changes.
Previously these were lumped under the more general memory
on/offline notifier.
"Make MIGRATE_ISOLATE a standalone bit" (Zi Yan)
cleans up the pageblock isolation code and fixes a potential issue
which doesn't seem to cause any problems in practice.
"selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park)
adds additional drgn- and python-based DAMON selftests which are
more comprehensive than the existing selftest suite.
"Misc rework on hugetlb faulting path" (Oscar Salvador)
fixes a rather obscure deadlock in the hugetlb fault code and
follows that fix with a series of cleanups.
"cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport)
rationalizes and cleans up the highmem-specific code in the CMA
allocator.
"mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand)
provides cleanups and future-preparedness to the migration code.
"mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park)
adds some tracepoints to some DAMON auto-tuning code.
"mm/damon: fix misc bugs in DAMON modules" (SeongJae Park)
does that.
"mm/damon: misc cleanups" (SeongJae Park)
also does what it claims.
"mm: folio_pte_batch() improvements" (David Hildenbrand)
cleans up the large folio PTE batching code.
"mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park)
facilitates dynamic alteration of DAMON's inter-node allocation
policy.
"Remove unmap_and_put_page()" (Vishal Moola)
provides a couple of page->folio conversions.
"mm: per-node proactive reclaim" (Davidlohr Bueso)
implements a per-node control of proactive reclaim - beyond the
current memcg-based implementation.
"mm/damon: remove damon_callback" (SeongJae Park)
replaces the damon_callback interface with a more general and
powerful damon_call()+damos_walk() interface.
"mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes)
implements a number of mremap cleanups (of course) in preparation
for adding new mremap() functionality: newly permit the remapping
of multiple VMAs when the user is specifying MREMAP_FIXED. It still
excludes some specialized situations where this cannot be performed
reliably.
"drop hugetlb_free_pgd_range()" (Anthony Yznaga)
switches some sparc hugetlb code over to the generic version and
removes the thus-unneeded hugetlb_free_pgd_range().
"mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park)
augments the present userspace-requested update of DAMON sysfs
monitoring files. Automatic update is now provided, along with a
tunable to control the update interval.
"Some randome fixes and cleanups to swapfile" (Kemeng Shi)
does what is claims.
"mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand)
provides (and uses) a means by which debug-style functions can grab
a copy of a pageframe and inspect it locklessly without tripping
over the races inherent in operating on the live pageframe
directly.
"use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan)
addresses the large contention issues which can be triggered by
reads from that procfs file. Latencies are reduced by more than
half in some situations. The series also introduces several new
selftests for the /proc/pid/maps interface.
"__folio_split() clean up" (Zi Yan)
cleans up __folio_split()!
"Optimize mprotect() for large folios" (Dev Jain)
provides some quite large (>3x) speedups to mprotect() when dealing
with large folios.
"selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian)
does some cleanup work in the selftests code.
"tools/testing: expand mremap testing" (Lorenzo Stoakes)
extends the mremap() selftest in several ways, including adding
more checking of Lorenzo's recently added "permit mremap() move of
multiple VMAs" feature.
"selftests/damon/sysfs.py: test all parameters" (SeongJae Park)
extends the DAMON sysfs interface selftest so that it tests all
possible user-requested parameters. Rather than the present minimal
subset"
* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
MAINTAINERS: add missing headers to mempory policy & migration section
MAINTAINERS: add missing file to cgroup section
MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
MAINTAINERS: add missing zsmalloc file
MAINTAINERS: add missing files to page alloc section
MAINTAINERS: add missing shrinker files
MAINTAINERS: move memremap.[ch] to hotplug section
MAINTAINERS: add missing mm_slot.h file THP section
MAINTAINERS: add missing interval_tree.c to memory mapping section
MAINTAINERS: add missing percpu-internal.h file to per-cpu section
mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
selftests/damon: introduce _common.sh to host shared function
selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
selftests/damon/sysfs.py: test non-default parameters runtime commit
selftests/damon/sysfs.py: generalize DAMON context commit assertion
selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
selftests/damon/sysfs.py: test DAMOS filters commitment
selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
selftests/damon/sysfs.py: test DAMOS destinations commitment
...
DAMOS quota goal uses 'nid' field when the metric is
DAMOS_QUOTA_NODE_MEM_{USED,FREE}_BP. But the goal commit function is not
updating the goal's nid field. Fix it.
Link: https://lkml.kernel.org/r/20250719181932.72944-1-sj@kernel.org
Fixes: 0e1c773b50 ("mm/damon/core: introduce damos quota goal metrics for memory node utilization") [6.16.x]
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
All damon_callback usages are replicated by damon_call() and damos_walk().
Time to say goodbye. Remove damon_callback.
Link: https://lkml.kernel.org/r/20250712195016.151108-15-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When kdamond_fn() completes, the targets are kept. Those are kept to let
callers do additional cleanups if they need. There are no such additional
cleanups though. DAMON sysfs interface deallocates those in
before_terminate() callback, to reduce unnecessary memory usage, for
[f]vaddr use case. Just destroy the targets for every case in the core
layer. This saves more memory and simplifies the logic.
Link: https://lkml.kernel.org/r/20250712195016.151108-13-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Implement cleanup_target() callback for [f]vaddr, which calls put_pid()
for each target that will be destroyed. Also remove redundant put_pid()
calls in core, sysfs and sample modules, which were required to be done
redundantly due to the lack of such self cleanup in vaddr.
Link: https://lkml.kernel.org/r/20250712195016.151108-11-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_operations.cleanup() is documented to be called for kdamond
termination, but also being called for targets destruction, which is done
for any damon_ctx destruction. Nobody is using the callback for now,
though. Remove the cleanup() call under the destruction.
Link: https://lkml.kernel.org/r/20250712195016.151108-9-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_call() can be useful for reading or writing DAMON internal data for
one time. A common pattern of DAMON core usage from DAMON modules is
doing such reads and writes repeatedly, for example, to periodically
update the DAMOS stats. To do that with damon_call(), callers should call
damon_call() repeatedly, with their own delay loop. Each caller doing
that is repetitive. Introduce a repeat mode damon_call(). Callers can
use the mode by setting a new field in damon_call_control. If the mode is
turned on, damon_call() returns success immediately, and DAMON repeats
invoking the callback function inside the kdamond main loop.
Link: https://lkml.kernel.org/r/20250712195016.151108-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: remove damon_callback".
damon_callback was the only way for communicating with DAMON for contexts
running on its worker thread. The interface is flexible and simple. But
as DAMON evolves with more features, damon_callback has become somewhat
too old. With runtime parameters update, for example, its lack of
synchronization support was found to be inconvenient. Arguably it is also
not easy to use correctly since the callers should understand when each
callback is called, and implication of the return values from the
callbacks.
To replace it, damon_call() and damos_walk() are introduced. And those
replaced a few damon_callback use cases. Some use cases of damon_callback
such as parallel or repetitive DAMON internal data reading and additional
cleanups cannot simply be replaced by damon_call() and damos_walk(),
though.
To allow those replaceable, extend damon_call() for parallel and/or
repeated callbacks and modify the core/ops layers for additional resources
cleanup. With the updates, replace the remaining damon_callback usages
and finally say goodbye to damon_callback.
This patch (of 14):
Calling damon_call() while it is serving for another parallel thread
immediately fails with -EBUSY. The caller should call it again, later.
Each caller implementing such retry logic would be redundant. Accept
parallel damon_call() requests and do the wait instead of the caller.
Link: https://lkml.kernel.org/r/20250712195016.151108-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250712195016.151108-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When committing new scheme parameters from the sysfs, copy the
migrate_dests struct of the source schemes into the destination schemes.
Link: https://lkml.kernel.org/r/20250709005952.17776-8-bijan311@gmail.com
Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a new field to 'struct damos', namely migrate_dests, to allow DAMON
API callers specify multiple migration destination nodes and their
weights. Also update 'struct damos' creation and destruction functions
accordingly to initialize the new field and free up the API
caller-allocated buffers on those, respectively.
Link: https://lkml.kernel.org/r/20250709005952.17776-3-bijan311@gmail.com
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When committing new scheme parameters from the sysfs, the target_nid field
of the damos struct would not be copied. This would result in the
target_nid field to retain its original value, despite being updated in
the sysfs interface.
This patch fixes this issue by copying target_nid in damos_commit().
Link: https://lkml.kernel.org/r/20250709004729.17252-1-bijan311@gmail.com
Fixes: 83dc7bbaec ("mm/damon/sysfs: use damon_commit_ctx()")
Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON core implements a static function to see if a given DAMON context is
running. DAMON sysfs interface is implementing the same one on its own.
Make the core function non-static and reuse it from the DAMON sysfs
interface.
Link: https://lkml.kernel.org/r/20250705175000.56259-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Aim-oriented DAMOS quota auto-tuning is an important and recommended
feature for DAMOS users. Add a trace event for the observability of the
tuned quota and tuning itself.
[sj@kernel.org: initialize sidx in damos_trace_esz()]
Link: https://lkml.kernel.org/r/20250705172003.52324-1-sj@kernel.org
[sj@kernel.org: make damos_esz unconditional trace event]
Link: https://lkml.kernel.org/r/20250709182843.35812-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250704221408.38510-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: add trace events for auto-tuned monitoring
intervals and DAMOS quota".
The aim-oriented auto-tuning features for monitoring intervals and DAMOS
quota are important and recommended. Add tracepoints for observabilities
of those tuned values and the tuning itself.
This patch (of 2):
Aim-oriented monitoring intervals auto-tuning is an important and
recommended feature for DAMON users. Add a trace event for the
observability of the tuned intervals and tuning itself.
Link: https://lkml.kernel.org/r/20250704221408.38510-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250704221408.38510-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The current implementation allows having zero size regions with no special
reasons, but damon_get_intervals_score() gets crashed by divide by zero
when the region size is zero.
[ 29.403950] Oops: divide error: 0000 [#1] SMP NOPTI
This patch fixes the bug, but does not disallow zero size regions to keep
the backward compatibility since disallowing zero size regions might be a
breaking change for some users.
In addition, the same crash can happen when intervals_goal.access_bp is
zero so this should be fixed in stable trees as well.
Link: https://lkml.kernel.org/r/20250702000205.1921-5-honggyu.kim@sk.com
Fixes: f04b0fedbe ("mm/damon/core: implement intervals auto-tuning")
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON sysfs interface internally uses damon_call() to update DAMON
parameters as users requested, online. However, DAMON core cancels any
damon_call() requests when it is deactivated by DAMOS watermarks.
As a result, users cannot change DAMON parameters online while DAMON is
deactivated. Note that users can turn DAMON off and on with different
watermarks to work around. Since deactivated DAMON is nearly same to
stopped DAMON, the work around should have no big problem. Anyway, a bug
is a bug.
There is no real good reason to cancel the damon_call() request under
DAMOS deactivation. Fix it by simply handling the request as normal,
rather than cancelling under the situation.
Link: https://lkml.kernel.org/r/20250629204914.54114-1-sj@kernel.org
Fixes: 42b7491af1 ("mm/damon/core: introduce damon_call()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [6.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When the number of the monitoring targets in running contexts is reduced,
there may be DAMOS quotas referencing the targets that will be destroyed.
Applying the scheme action for such DAMOS scheme will be skipped forever
looking for the starting part of the region for the destroyed monitoring
target.
To fix this issue, when the monitoring target is destroyed, reset the
starting part for all DAMOS quotas that reference the target.
Link: https://lkml.kernel.org/r/20250517141852.142802-1-akinobu.mita@gmail.com
Fixes: da87878010 ("mm/damon/sysfs: support online inputs update")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: minor fixups and improvements for code, tests, and
documents".
Yet another batch of miscellaneous DAMON changes. Fix and improve minor
problems in code, tests and documents.
This patch (of 6):
For a bug such as double aggregation reset[1], ->nr_accesses and/or
->nr_accesses_bp of damon_region could be corrupted. Such corruption can
make monitoring results pretty inaccurate, so the root causing bug should
be investigated. Meanwhile, the corruption itself can easily be fixed but
silently fixing it will hide the bug.
Fix the corruption as soon as found, but WARN_ONCE() so that we can be
aware of the existence of the bug while keeping the system running in a
more sane way.
Link: https://lkml.kernel.org/r/20250513002715.40126-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250513002715.40126-2-sj@kernel.org
Link: https://lore.kernel.org/20250302214145.356806-1-sj@kernel.org [1]
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: auto-tune DAMOS for NUMA setups including tiered
memory".
Utilizing DAMON for memory tiering usually requires manual tuning and/or
tedious controls. Let it self-tune hotness and coldness thresholds for
promotion and demotion aiming high utilization of high memory tiers, by
introducing new DAMOS quota goal metrics representing the used and the
free memory ratios of specific NUMA nodes. And introduce a sample DAMON
module that demonstrates how the new feature can be used for memory
tiering use cases.
Backgrounds
===========
A type of tiered memory system exposes the memory tiers as NUMA nodes. A
straightforward pages placement strategy for such systems is placing
access-hot and cold pages on upper and lower tiers, reespectively,
pursuing higher utilization of upper tiers. Since access temperature can
be dynamic, periodically finding and migrating hot pages and cold pages to
proper tiers (promoting and demoting) is also required. Linux kernel
provides several features for such dynamic and transparent pages
placement.
Page Faults and LRU
-------------------
One widely known way is using NUMA balancing in tiering mode (a.k.a
NUMAB-2) and reclaim-based demotion features. In the setup, NUMAB-2 finds
hot pages using access check-purpose page faults (a.k.a prot_none) and
promote those inside each process' context, until there is no more pages
to promote, or the upper tier is filled up and memory pressure happens.
In the latter case, LRU-based reclaim logic wakes up as a response to the
memory pressure and demotes cold pages to lower tiers in asynchronous
(kswapd) and/or synchronous ways (direct reclaim).
DAMON
-----
Yet another available solution is using DAMOS with migrate_hot and
migrate_cold DAMOS actions for promotions and demotions, respectively. To
make it optimum, users need to specify aggressiveness and access
temperature thresholds for promotions and demotions in a good balance that
results in high utilization of upper tiers. The number of parameters is
not small, and optimum parameter values depend on characteristics of the
underlying hardware and the workload. As a result, it often requires
manual, time consuming and repetitive tuning of the DAMOS schemes for
given workloads and systems combinations.
Self-tuned DAMON-based Memory Tiering
=====================================
To solve such manual tuning problems, DAMOS provides aim-oriented
feedback-driven quotas self-tuning. Using the feature, we design a
self-tuned DAMON-based memory tiering for general multi-tier memory
systems.
For each memory tier node, if it has a lower tier, run a DAMOS scheme that
demotes cold pages of the node, auto-tuning the aggressiveness aiming an
amount of free space of the node. The free space is for keeping the
headroom that avoids significant memory pressure during upper tier memory
usage spike, and promoting hot pages from the lower tier.
For each memory tier node, if it has an upper tier, run a DAMOS scheme
that promotes hot pages of the current node to the upper tier, auto-tuning
the aggressiveness aiming a high utilization ratio of the upper tier. The
target ratio is to ensure higher tiers are utilized as much as possible.
It should match with the headroom for demotion scheme, but have slight
overlap, to ensure promotion and demotion are not entirely stopped.
The aim-oriented aggressiveness auto-tuning of DAMOS is already available.
Hence, to make such tiering solution implementation, only new quota goal
metrics for utilization and free space ratio of specific NUMA node need to
be developed.
Discussions
===========
The design imposes below discussion points.
Expected Behaviors
------------------
The system will let upper tier memory node accommodates as many hot data
as possible. If total amount of the data is less than the top tier
memory's promotion/demotion target utilization, entire data will be just
placed on the top tier. Promotion scheme will do nothing since there is
no data to promote. Demotion scheme will also do nothing since the free
space ratio of the top tier is higher than the goal.
Only if the amount of data is larger than the top tier's utilization
ratio, demotion scheme will demote cold pages and ensure the headroom free
space. Since the promotion and demotion schemes for a single node has
small overlap at their target utilization and free space goals, promotions
and demotions will continue working with a moderate aggressiveness level.
It will keep all data is placed on access hotness under dynamic access
pattern, while minimizing the migration overhead.
In any case, each node will keep headroom free space and as many upper
tiers are utilized as possible.
Ease of Use
-----------
Users still need to set the target utilization and free space ratio, but
it will be easier to set. We argue 99.7 % utilization and 0.5 % free
space ratios can be good default values. It can be easily adjusted based
on desired headroom size of given use case. Users are also still required
to answer the minimum coldness and hotness thresholds. Together with
monitoring intervals auto-tuning[2], DAMON will always show meaningful
amount of hot and cold memory. And DAMOS quota's prioritization mechanism
will make good decision as long as the source information is that
colorful. Hence, users can very naively set the minimum criterias. We
believe any access observation and no access observation within last one
aggregation interval is enough for minimum hot and cold regions criterias.
General Tiered Memory Setup Applicability
-----------------------------------------
The design can be applied to any number of tiers having any performance
characteristics, as long as they can be hierarchical. Hence, applying the
system to different tiered memory system will be straightforward. Note
that this assumes only single CPU NUMA node case. Because today's DAMON
is not aware of which CPU made each access, applying this on systems
having multiple CPU NUMA nodes can be complicated. We are planning to
extend DAMON for the use case, but that's out of the scope of this patch
series.
How To Use
----------
Users can implement the auto-tuned DAMON-based memory tiering using DAMON
sysfs interface. It can be easily done using DAMON user-space tool like
user-space tool. Below evaluation results section shows an example DAMON
user-space tool command for that.
For wider and simpler deployment, having a kernel module that sets up and
run the DAMOS schemes via DAMON kernel API can be useful. The module can
enable the memory tiering at boot time via kernel command line parameter
or at run time with single command. This patch series implements a sample
DAMON kernel module that shows how such module can be implemented.
Comparison To Page Faults and LRU-based Approaches
--------------------------------------------------
The existing page faults based promotion (NUMAB-2) does hot pages
detection and migration in the process context. When there are many pages
to promote, it can block the progress of the application's real works.
DAMOS works in asynchronous worker thread, so it doesn't block the real
works.
NUMAB-2 doesn't provide a way to control aggressiveness of promotion other
than the maximum amount of pages to promote per given time widnow. If hot
pages are found, promotions can happen in the upper-bound speed,
regardless of upper tier's memory pressure. If the maximum speed is not
well set for the given workload, it can result in slow promotion or
unnecessary memory pressure. Self-tuned DAMON-based memory tiering
alleviates the problem by adjusting the speed based on current utilization
of the upper tier.
LRU-based demotion can be triggered in both asynchronous (kswapd) and
synchronous (direct reclaim) ways. Other than the way of finding cold
pages, asynchronous LRU-based demotion and DAMON-based demotion has no big
difference. DAMON-based demotion can make a better balancing with
DAMON-based promotion, though. The LRU-based demotion can do better than
DAMON-based demotion when the tier is having significant memory pressure.
It would be wise to use DAMON-based demotion as a proactive and primary
one, but utilizing LRU-based demotions together as a fast backup solution.
Evaluation
==========
In short, under a setup that requires fast and frequent promotions,
self-tuned DAMON-based memory tiering's hot pages promotion improves
performance about 4.42 %. We believe this shows self-tuned DAMON-based
promotion's effectiveness. Meanwhile, NUMAB-2's hot pages promotion
degrades the performance about 7.34 %. We suspect the degradation is
mostly due to NUMAB-2's synchronous nature that can block the
application's progress, which highlights the advantage of DAMON-based
solution's asynchronous nature.
Note that the test was done with the RFC version of this patch series. We
don't run it again since this patch series got no meaningful change after
the RFC, while the test takes pretty long time.
Setup
-----
Hardware. Use a machine that equips 250 GiB DRAM memory tier and 50 GiB
CXL memory tier. The tiers are exposed as NUMA nodes 0 and 1,
respectively.
Kernel. Use Linux kernel v6.13 that modified as following. Add all DAMON
patches that available on mm tree of 2025-03-15, and this patch series.
Also modify it to ignore mempolicy() system calls, to avoid bad effects
from application's traditional NUMA systems assumed optimizations.
Workload. Use a modified version of Taobench benchmark[3] that available
on DCPerf benchmark suite. It represents an in-memory caching workload.
We set its 'memsize', 'warmup_time', and 'test_time' parameter as 340 GiB,
2,500 seconds and 1,440 seconds. The parameters are chosen to ensure the
workload uses more than DRAM memory tier. Its RSS under the parameter
grows to 270 GiB within the warmup time.
It turned out the workload has a very static access pattrn. Only about 13
% of the RSS is frequently accessed from the beginning to end. Hence
promotion shows no meaningful performance difference regardless of
different design and implementations. We therefore modify the kernel to
periodically demote up to 10 GiB hot pages and promote up to 10 GiB cold
pages once per minute. The intention is to simulate periodic access
pattern changes. The hotness and coldness threshold is very naively set
so that it is more like random access pattern change rather than strict
hot/cold pages exchange. This is why we call the workload as "modified".
It is implemented as two DAMOS schemes each running on an asynchronous
thread. It can be reproduced with DAMON user-space tool like below.
# ./damo start \
--ops paddr --numa_node 0 --monitoring_intervals 10s 200s 200s \
--damos_action migrate_hot 1 \
--damos_quota_interval 60s --damos_quota_space 10G \
--ops paddr --numa_node 1 --monitoring_intervals 10s 200s 200s \
--damos_action migrate_cold 0 \
--damos_quota_interval 60s --damos_quota_space 10G \
--nr_schemes 1 1 --nr_targets 1 1 --nr_ctxs 1 1
System configurations. Use below variant system configurations.
- Baseline. No memory tiering features are turned on.
- Numab_tiering. On the baseline, enable NUMAB-2 and relcaim-based
demotion. In detail, following command is executed:
echo 2 > /proc/sys/kernel/numa_balancing;
echo 1 > /sys/kernel/mm/numa/demotion_enabled;
echo 7 > /proc/sys/vm/zone_reclaim_mode
- DAMON_tiering. On the baseline, utilize self-tuned DAMON-based memory
tiering implementation via DAMON user-space tool. It utilizes two
kernel threads, namely promotion thread and demotion thread. Demotion
thread monitors access pattern of DRAM node using DAMON with
auto-tuned monitoring intervals aiming 4% DAMON-observed access ratio,
and demote coldest pages up to 200 MiB per second aiming 0.5% free
space of DRAM node. Promotion thread monitors CXL node using same
intervals auto-tuning, and promote hot pages in same way but aiming
for 99.7% utilization of DRAM node. Because DAMON provides only
best-effort accuracy, add young page DAMOS filters to allow only and
reject all young pages at promoting and demoting, respectively. It
can be reproduced with DAMON user-space tool like below.
# ./damo start \
--numa_node 0 --monitoring_intervals_goal 4% 3 5ms 10s \
--damos_action migrate_cold 1 --damos_access_rate 0% 0% \
--damos_apply_interval 1s \
--damos_quota_interval 1s --damos_quota_space 200MB \
--damos_quota_goal node_mem_free_bp 0.5% 0 \
--damos_filter reject young \
--numa_node 1 --monitoring_intervals_goal 4% 3 5ms 10s \
--damos_action migrate_hot 0 --damos_access_rate 5% max \
--damos_apply_interval 1s \
--damos_quota_interval 1s --damos_quota_space 200MB \
--damos_quota_goal node_mem_used_bp 99.7% 0 \
--damos_filter allow young \
--damos_nr_quota_goals 1 1 --damos_nr_filters 1 1 \
--nr_targets 1 1 --nr_schemes 1 1 --nr_ctxs 1 1
Measurment Results
------------------
On each system configuration, run the modified version of Taobench and
collect 'score'. 'score' is a metric that calculated and provided by
Taobench to represents the performance of the run on the system. To
handle the measurement errors, repeat the measurement five times. The
results are as below.
Config Score Stdev (%) Normalized
Baseline 1.6165 0.0319 1.9764 1.0000
Numab_tiering 1.4976 0.0452 3.0209 0.9264
DAMON_tiering 1.6881 0.0249 1.4767 1.0443
'Config' column shows the system config of the measurement. 'Score'
column shows the 'score' measurement in average of the five runs on the
system config. 'Stdev' column shows the standsard deviation of the five
measurements of the scores. '(%)' column shows the 'Stdev' to 'Score'
ratio in percentage. Finally, 'Normalized' column shows the averaged
score values of the configs that normalized to that of 'Baseline'.
The periodic hot pages demotion and cold pages promotion that was
conducted to simulate dynamic access pattern was started from the
beginning of the workload. It resulted in the DRAM tier utilization
always under the watermark, and hence no real demotion was happened for
all test runs. This means the above results show no difference between
LRU-based and DAMON-based demotions. Only difference between NUMAB-2 and
DAMON-based promotions are represented on the results.
Numab_tiering config degraded the performance about 7.36 %. We suspect
this happened because NUMAB-2's synchronous promotion was blocking the
Taobench's real work progress.
DAMON_tiering config improved the performance about 4.43 %. We believe
this shows effectiveness of DAMON-based promotion that didn't block
Taobench's real work progress due to its asynchronous nature. Also this
means DAMON's monitoring results are accurate enough to provide visible
amount of improvement.
Evaluation Limitations
----------------------
As mentioned above, this evaluation shows only comparison of promotion
mechanisms. DAMON-based tiering is recommended to be used together with
reclaim-based demotion as a faster backup under significant memory
pressure, though.
From some perspective, the modified version of Taobench may seems making
the picture distorted too much. It would be better to evaluate with more
realistic workload, or more finely tuned micro benchmarks.
Patch Sequence
==============
The first patch (patch 1) implements two new quota goal metrics on core
layer and expose it to DAMON core kernel API. The second and third ones
(patches 2 and 3) further link it to DAMON sysfs interface. Three
following patches (patches 4-6) document the new feature and sysfs file on
design, usage, and ABI documents. The final one (patch 7) implements a
working version of a self-tuned DAMON-based memory tiering solution in an
incomplete but easy to understand form as a kernel module under
samples/damon/ directory.
References
==========
[1] https://lore.kernel.org/20231112195602.61525-1-sj@kernel.org/
[2] https://lore.kernel.org/20250303221726.484227-1-sj@kernel.org
[3] https://github.com/facebookresearch/DCPerf/blob/main/packages/tao_bench/README.md
This patch (of 7):
Used and free space ratios for specific NUMA nodes can be useful inputs
for NUMA-specific DAMOS schemes' aggressiveness self-tuning feedback loop.
Implement DAMOS quota goal metrics for such self-tuned schemes.
Link: https://lkml.kernel.org/r/20250420194030.75838-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250420194030.75838-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Yunjeong Mun <yunjeong.mun@sk.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The function logic is not complex, so using goto is unnecessary. Replace
it with a straightforward if-else to simplify control flow and improve
readability.
Link: https://lkml.kernel.org/r/Z9vxcPCw8tDsjKw1@OneApple
Signed-off-by: Taotao Chen <chentaotao@didiglobal.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The operations layer hook was introduced to let operations set do any
aggregation data reset if needed. But it is not really be used now.
Remove it.
Link: https://lkml.kernel.org/r/20250306175908.66300-14-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The hook was introduced to let DAMON kernel API users access DAMOS
schemes-eligible regions in a safe way. Now it is no more used by anyone,
and the functionality is provided in a better way by damos_walk(). Remove
it.
Link: https://lkml.kernel.org/r/20250306175908.66300-13-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The callback was used by DAMON sysfs interface for reading DAMON internal
data. But it is no more being used, and damon_call() can do similar works
in a better way. Remove it.
Link: https://lkml.kernel.org/r/20250306175908.66300-12-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The function pointer field was added to be used as a place to do some
initialization works just before DAMON starts working. However, nobody is
using it now. Remove it.
Link: https://lkml.kernel.org/r/20250306175908.66300-11-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently all DAMON kernel API callers do online DAMON parameters commit
from damon_callback->after_aggregation because only those are safe place
to call the DAMON monitoring attributes update function, namely
damon_set_attrs().
Because damon_callback hooks provide no synchronization, the callers work
in asynchronous ways or implement their own inefficient and complicated
synchronization mechanisms. It also means online DAMON parameters commit
can take up to one aggregation interval. On large systems having long
aggregation intervals, that can be too slow. The synchronization can be
done in more efficient and simple way while removing the latency
constraint if it can be done using damon_call().
The fact that damon_call() can be executed in the middle of the
aggregation makes damon_set_attrs() unsafe to be called from it, though.
Two real problems can occur in the case. First, converting the not yet
completely aggregated nr_accesses for new user-set intervals can arguably
degrade the accuracy or at least make the logic complicated. Second,
kdamond_reset_aggregated() will not be called after the monitoring results
update, so next aggregation starts from unclean state. This can result in
inconsistent and unexpected nr_accesses_bp.
Make it safe as follows. Catch the middle-of-the-aggregation case from
damon_set_attrs() by checking the passed_sample_intervals and
next_aggregationsis of the context. And pass the information to
nr_accesses conversion logic. The logic works as before if it is not the
case (called after the current aggregation is completed). If it is the
case (committing parameters in the middle of the aggregation), it drops
the nr_accesses information that so far aggregated, and make the status
same to the beginning of this aggregation, but as if the last aggregation
was started with the updated sampling/aggregation intervals.
The middle-of-aggregastion check introduce yet another edge case, though.
This happens because kdamond_tune_intervals() can also call
damon_set_attrs() with the middle-of-aggregation check. Consider
damon_call() for parameters commit and kdamond_tune_intervals() are called
in same iteration of kdamond main loop. Because kdamond_tune_interval()
is called for aggregation intervals, it should be the end of the
aggregation. The first damon_set_attrs() call from kdamond_call()
understands it is the end of the aggregation and correctly handle it.
But, because the damon_set_attrs() updated next_aggregation_sis of the
context. Hence, the second damon_set_attrs() invocation from
kdamond_tune_interval() believes it is called in the middle of the
aggregation. It therefore resets aggregated information so far. After
that, kdamond_reset_interval() is called and double-reset the aggregated
information. Avoid this case, too, by setting the next_aggregation_sis
before kdamond_tune_intervals() is invoked.
Link: https://lkml.kernel.org/r/20250306175908.66300-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kdamond_call() callers may iterate the regions, so better to call it when
the number of regions is as small as possible. It is when
kdamond_merge_regions() is finished. Invoke it on the point.
This change is also aimed to make future changes for carrying online
parameters commit with damon_call() easier. The commit operation should
be able to make sequence between other aggregation interval based
operations including regioins merging and aggregation reset. Placing
damon_call() invocation after the regions merging makes the sequence
handling simpler.
Link: https://lkml.kernel.org/r/20250306175908.66300-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damos_filter_for_ops() can be useful to avoid putting wrong type of
filters in wrong place. Make it be exposed to DAMON kernel API callers.
Link: https://lkml.kernel.org/r/20250305222733.59089-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Decide whether to allow or reject by default on core and opertions layer
handled filters evaluation stages. It is decided as the opposite of the
last installed filter's behavior. If there is no filter at all, allow by
default. If there is any operations layer handled filters, core layer's
filtering stage sets allowing as the default behavior regardless of the
last filter of core layer-handling ones, since the last filter of core
layer handled filters in the case is not really the last filter of the
entire filtering stage.
Also, make the core layer's DAMOS filters handling stage uses the newly
set behavior field.
[sj@kernel.org: setup damos->{core,ops}_filters_default_reject for initial start]
Link: https://lkml.kernel.org/r/20250315222610.35245-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250304211913.53574-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damos->ops_filters has introduced to be used for all operations layer
handled filters. But DAMON kernel API callers can put any type of DAMOS
filters to any of damos->filters and damos->ops_filters. DAMON user-space
ABI users have no way to use ->ops_filters at all. Update
damos_add_filter(), which should be used by API callers to install DAMOS
filters, to add filters to ->filters and ->ops_filters depending on their
handling layer. The change forces both API callers and ABI users to use
proper lists since ABI users use the API internally.
Link: https://lkml.kernel.org/r/20250304211913.53574-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON kernel API callers should use damon_commit_ctx() to install DAMON
parameters including DAMOS filters. But damos_commit_ops_filters(), which
is called by damon_commit_ctx() for filters installing, is not handling
damos->ops_filters. Hence, no DAMON kernel API caller can use
damos->ops_filters. Do the committing of the ops_filters to make it
usable.
Link: https://lkml.kernel.org/r/20250304211913.53574-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: make allow filters after reject filters useful and
intuitive".
DAMOS filters do allow or reject elements of memory for given DAMOS scheme
only if those match the filter criterias. For elements that don't match
any DAMOS filter, 'allowing' is the default behavior. This makes
allow-filters that don't have any reject-filter after them meaningless
sources of overhead. The decision was made to keep the behavior
consistent with that before the introduction of allow-filters. This,
however, makes usage of DAMOS filters confusing and inefficient. It is
more intuitive and still consistent behavior to reject by default unless
there is no filter at all or the last filter is a reject filter. Update
the filtering logic in the way and update documents to clarify the
behavior.
Note that this is changing the old behavior. But the old behavior for the
problematic filter combination was definitely confusing, inefficient and
anyway useless. Also, the behavior has relatively recently introduced.
It is difficult to anticipate any user that depends on the behavior.
Hence this is not a user-breaking behavior change but an obvious
improvement.
This patch (of 9):
DAMOS filters can be categorized into two groups depending on which layer
they are handled, namely core layer and ops layer. The groups are
important because the filtering behavior depends on evaluation sequence of
filters, and core layer-handled filters are evaluated before operations
layer-handled ones.
The behavior is clearly documented, but the implementation is bit
inefficient and complicated. All filters are maintained in a single list
(damos->filters) in mix. Filters evaluation logics in core layer and
operations layer iterates all the filters on the list, while skipping
filters that should be not handled by the layer of the logic. It is
inefficient. Making future extensions having differentiations for filters
of different handling layers will also be complicated.
Add a new list that will be used for having all operations layer-handled
DAMOS filters to DAMOS scheme data structure. Also add the support of its
initialization and basic traversal functions.
Link: https://lkml.kernel.org/r/20250304211913.53574-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250304211913.53574-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Implement the DAMON sampling and aggregation intervals auto-tuning
mechanism as briefly described on 'struct damon_intervals_goal'. The core
part for deciding the direction and amount of the changes is implemented
reusing the feedback loop function which is being used for DAMOS quotas
auto-tuning. Unlike the DAMOS quotas auto-tuning use case, limit the
maximum decreasing amount after the adjustment to 50% of the current
value, though. This is because the intervals have no good merits at rapid
reductions since it could unnecessarily increase the monitoring overhead.
Link: https://lkml.kernel.org/r/20250303221726.484227-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: auto-tune aggregation interval".
DAMON requires time-consuming and repetitive aggregation interval tuning.
Introduce a feature for automating it using a feedback loop that aims an
amount of observed access events, like auto-exposing cameras.
Background: Access Frequency Monitoring and Aggregation Interval
================================================================
DAMON checks if each memory element (damon_region) is accessed or not for
every user-specified time interval called 'sampling interval'. It
aggregates the check intervals on per-element counter called
'nr_accesses'. DAMON users can read the counters to get the access
temperature of a given element. The counters are reset for every another
user-specified time interval called 'aggregation interval'.
This can be illustrated as DAMON continuously capturing a snapshot of
access events that happen and captured within the last aggregation
interval. This implies the aggregation interval plays a key role for the
quality of the snapshots, like the camera exposure time. If it is too
short, the amount of access events that happened and captured for each
snapshot is small, so each snapshot will show no many interesting things
but just a cold and dark world with hopefuly one pale blue dot or two. If
it is too long, too many events are aggregated in a single shot, so each
snapshot will look like world of flames, or Muspellheim. It will be
difficult to find practical insights in both cases.
Problem: Time Consuming and Repetitive Tuning
=============================================
The appropriate length of the aggregation interval depends on how
frequently the system and workloads are making access events that DAMON
can observe. Hence, users have to tune the interval with excessive amount
of tests with the target system and workloads. If the system and
workloads are changed, the tuning should be done again. If the
characteristic of the workloads is dynamic, it becomes more challenging.
It is therefore time-consuming and repetitive.
The tuning challenge mainly stems from the wrong question. It is not
asking users what quality of monitoring results they want, but how DAMON
should operate for their hidden goal. To make the right answer, users
need to fully understand DAMON's mechanisms and the characteristics of
their workloads. Users shouldn't be asked to understand the underlying
mechanism. Understanding the characteristics of the workloads shouldn't
be the role of users but DAMON.
Aim-oriented Feedback-driven Auto-Tuning
=========================================
Fortunately, the appropriate length of the aggregation interval can be
inferred using a feedback loop. If the current snapshots are showing no
much intresting information, in other words, if it shows only rare access
events, increasing the aggregation interval helps, and vice versa. We
tested this theory on a few real-world workloads, and documented one of
the experience with an official DAMON monitoring intervals tuning
guideline. Since it is a simple theory that requires repeatable tries, it
can be a good job for machines.
Based on the guideline's theory, we design an automation of aggregation
interval tuning, in a way similar to that of camera auto-exposure feature.
It defines the amount of interesting information as the ratio of
DAMON-observed access events that DAMON actually observed to theoretical
maximum amount of it within each snapshot. Events are accounted in byte
and sampling attempts granularity. For example, let's say there is a
region of 'X' bytes size. DAMON tried access check smapling for the
region 'Y' times in total for a given aggregation. Among the 'Y'
attempts, 'Z' times it shown positive results. Then, the theoritical
maximum number of access events for the region is 'X * Y'. And the number
of access events that DAMON has observed for the region is 'X * Z'. The
abount of the interesting information is '(X * Z / X * Y)'. Note that
each snapshot would have multiple regions.
Users can set an arbitrary value of the ratio as their target. Once the
target is set, the automation periodically measures the current value of
the ratio and increase or decrease the aggregation interval if the ratio
value is lower or higher than the target. The amount of the change is
proportion to the distance between the current adn the target values.
To avoid auto-tuning goes too long way, let users set the minimum and the
maximum aggregation interval times. Changing only aggregation interval
while sampling interval is kept makes the maximum level of access
frequency in each snapshot, or discernment of regions inconsistent. Also,
unnecessarily short sampling interval causes meaningless monitoring
overhed. The automation therefore adjusts the sampling interval together
with aggregation interval, while keeping the ratio between the two
intervals. Users can set the ratio, or the discernment.
Discussion
==========
The modified question (aimed amount of access events, or lights, in each
snapshot) is easy to answer by both the users and the kernel. If users
are interested in finding more cold regions, the value should be lower,
and vice versa. If users have no idea, kernel can suggest a fair default
value based on some theories and experiments. For example, based on the
Pareto principle (80/20 rule), we could expect 20% target ratio will
capture 80% of real access events. Since 80% might be too high, applying
the rule once again, 4% (20% * 20%) may capture about 56% (80% * 80%) of
real access events.
Sampling to aggregation intervals ratio and min/max aggregation intervals
are also arguably easy to answer. What users want is discernment of
regions for efficient system operation, for examples, X amount of colder
regions or Y amount of warmer regions, not exactly how many times each
cache line is accessed in nanoseconds degree. The appropriate min/max
aggregation interval can relatively naively set, and may better to set for
aimed monitoring overhead. Since sampling interval is directly deciding
the overhead, setting it based on the sampling interval can be easy. With
my experiences, I'd argue the intervals ratio 0.05, and 5 milliseconds to
20 seconds sampling interval range (100 milliseconds to 400 seconds
aggregation interval) can be a good default suggestion.
Evaluation
==========
On a machine running a real world server workload, I ran DAMON to monitor
its physical address space for about 23 hours, with this feature turned
on. We set it to tune sampling interval in a range from 5 milliseconds to
10 seconds, aiming 4 % DAMON-observed access ratio per three aggregation
intervals. The exact command I used is as below.
damo start --monitoring_intervals_goal 4% 3 5ms 10s --damos_action stat
During the test run, DAMON continuously updated sampling and aggregation
intervals as designed, within the given range. For all the time, DAMON
was able to find the intervals that meets the target access events ratio
in the given intervals range (sampling interval between 5 milliseconds and
10 seconds).
For most of the time, tuned sampling interval was converged in 300-400
milliseconds. It made only small amount of changes within the range. The
average of the tuned sampling interval during the test was about 380
milliseconds.
The workload periodically gets less load and decreases its CPU usage.
Presumably this also caused it making less memory access events.
Reactively to such event,s DAMON also increased the intervals as expected.
It was still able to find the optimum interval that satisfying the target
access ratio within the given intervals range. Usually it was converged
to about 5 seconds. Once the workload gets normal amount of load again,
DAMON reactively reduced the intervals to the normal range.
I collected and visualized DAMON's monitoring results on the server a few
times. Every time the visualized access pattern looked not biased to only
cold or hot pages but diverse and balanced. Let me show some of the
snapshots that I collected at the nearly end of the test (after about 23
hours have passed since starting DAMON on the server).
The recency histogram looks as below. Please note that this visualization
shows only a very coarse grained information. For more details about the
visualization format, please refer to DAMON user-space tool
documentation[1].
# ./damo report access --style recency-sz-hist --tried_regions_of 0 0 0 --access_rate 0 0
<last accessed time (us)> <total size>
[-19 h 7 m 45.514 s, -17 h 12 m 58.963 s) 6.198 GiB |**** |
[-17 h 12 m 58.963 s, -15 h 18 m 12.412 s) 0 B | |
[-15 h 18 m 12.412 s, -13 h 23 m 25.860 s) 0 B | |
[-13 h 23 m 25.860 s, -11 h 28 m 39.309 s) 0 B | |
[-11 h 28 m 39.309 s, -9 h 33 m 52.757 s) 0 B | |
[-9 h 33 m 52.757 s, -7 h 39 m 6.206 s) 0 B | |
[-7 h 39 m 6.206 s, -5 h 44 m 19.654 s) 0 B | |
[-5 h 44 m 19.654 s, -3 h 49 m 33.103 s) 0 B | |
[-3 h 49 m 33.103 s, -1 h 54 m 46.551 s) 0 B | |
[-1 h 54 m 46.551 s, -0 ns) 16.967 GiB |********* |
[-0 ns, --6886551440000 ns) 38.835 GiB |********************|
memory bw estimate: 9.425 GiB per second
total size: 62.000 GiB
It shows about 38 GiB of memory was accessed at least once within last
aggregation interval (given ~300 milliseconds tuned sampling interval,
this is about six seconds). This is about 61 % of the total memory. In
other words, DAMON found warmest 61 % memory of the system. The number is
particularly interesting given our Pareto principle based theory for the
tuning goal value. We set it as 20 % of 20 % (4 %), thinking it would
capture 80 % of 80 % (64 %) real access events. And it foudn 61 % hot
memory, or working set. Nevertheless, to make the theory clearer, much
more discussion and tests would be needed. At the moment, nonetheless, we
can say making the target value higher helps finding more hot memory
regions.
The histogram also shows an amount of cold memory. About 17 GiB memory of
the system has not accessed at least for last aggregation interval (about
six seconds), and at most for about last two hours. The real longest
unaccessed time of the 17 GiB memory was about 19 minutes, though. This
is a limitation of this visualization format.
It further found very cold 6 GiB memory. It has not accessed at least for
last 17 hours and at most 19 hours.
What about hot memory distribution? To see this, I capture and visualize
the snapshot in access temperature histogram. Again, please refer to the
DAMON user-space tool documentation[1] for the format and what access
temperature mean. Both the visualization and metric shows only very
coarse grained and limited information. The resulting histogram look like
below.
# ./damo report access --style temperature-sz-hist --tried_regions_of 0 0 0
<temperature> <total size>
[-6,840,763,776,000, -5,501,580,939,800) 6.198 GiB |*** |
[-5,501,580,939,800, -4,162,398,103,600) 0 B | |
[-4,162,398,103,600, -2,823,215,267,400) 0 B | |
[-2,823,215,267,400, -1,484,032,431,200) 0 B | |
[-1,484,032,431,200, -144,849,595,000) 0 B | |
[-144,849,595,000, 1,194,333,241,200) 55.802 GiB |********************|
[1,194,333,241,200, 2,533,516,077,400) 4.000 KiB |* |
[2,533,516,077,400, 3,872,698,913,600) 4.000 KiB |* |
[3,872,698,913,600, 5,211,881,749,800) 8.000 KiB |* |
[5,211,881,749,800, 6,551,064,586,000) 12.000 KiB |* |
[6,551,064,586,000, 7,890,247,422,200) 4.000 KiB |* |
memory bw estimate: 5.178 GiB per second
total size: 62.000 GiB
We can see most of the memory is in similar access temperature range, and
definitely some pages are extremely hot.
To see the picture in more detail, let's capture and visualize the
snapshot per DAMON-region, sorted by their access temperature. The total
number of the regions was about 300. Due to the limited space, I'm
showing only a few parts of the output here.
# ./damo report access --style hot --tried_regions_of 0 0 0
heatmap: 00000000888888889999999888888888888888888888888888888888888888888888888888888888
# min/max temperatures: -6,827,258,184,000, 17,589,052,500, column size: 793.600 MiB
|999999999999999999999999999999999999999| 4.000 KiB access 100 % 18 h 9 m 43.918 s
|999999999999999999999999999999999999999| 8.000 KiB access 100 % 17 h 56 m 5.351 s
|999999999999999999999999999999999999999| 4.000 KiB access 100 % 15 h 24 m 19.634 s
|999999999999999999999999999999999999999| 4.000 KiB access 100 % 14 h 10 m 55.606 s
|999999999999999999999999999999999999999| 4.000 KiB access 100 % 11 h 34 m 18.993 s
[...]
|99999999999999999999999999999| 8.000 KiB access 100 % 1 m 27.945 s
|11111111111111111111111111111| 80.000 KiB access 15 % 1 m 21.180 s
|00000000000000000000000000000| 24.000 KiB access 5 % 1 m 21.180 s
|00000000000000000000000000000| 5.919 GiB access 10 % 1 m 14.415 s
|99999999999999999999999999999| 12.000 KiB access 100 % 1 m 7.650 s
[...]
|0| 4.000 KiB access 5 % 0 ns
|0| 12.000 KiB access 5 % 0 ns
|0| 188.000 KiB access 0 % 0 ns
|0| 24.000 KiB access 0 % 0 ns
|0| 48.000 KiB access 0 % 0 ns
[...]
|0000000000000000000000000000000| 8.000 KiB access 0 % 6 m 45.901 s
|00000000000000000000000000000000| 36.000 KiB access 0 % 7 m 26.491 s
|00000000000000000000000000000000| 4.000 KiB access 0 % 12 m 37.682 s
|000000000000000000000000000000000| 8.000 KiB access 0 % 18 m 9.168 s
|000000000000000000000000000000000| 16.000 KiB access 0 % 19 m 3.288 s
|0000000000000000000000000000000000000000| 6.198 GiB access 0 % 18 h 57 m 52.582 s
memory bw estimate: 8.798 GiB per second
total size: 62.000 GiB
We can see DAMON found small and extremely hot regions that accessed for
all access check sampling (once per about 300 milliseconds) for more than
10 hours. The access temperature rapidly decreases. DAMON was also able
to find small and big regions that not accessed for up to about 19
minutes. It even found an outlier cold region of 6 GiB that not accessed
for about 19 hours. It is unclear what the outlier region is, as of this
writing.
For the testing, DAMON was consuming about 0.1% of single CPU time. This
is again expected results, since DAMON was using about 370 milliseconds
sampling interval in most case.
# ps -p $kdamond_pid -o %cpu
%CPU
0.1
I also ran similar tests against kernel build workload and an in-memory
cache workload benchmark[2]. Detialed results including tuned intervals
and captured access pattern were of course different sicne those depend on
the workloads. But the auto-tuning feature was always working as expected
like the above results for the real world workload.
To wrap up, with intervals auto-tuning feature, DAMON was able to capture
access pattern snapshots of a quality on a real world server workload.
The auto-tuning feature was able to adaptively react to the dynamic access
patterns of the workload and reliably provide consistent monitoring
results without manual human interventions. Also, the auto-tuning made
DAMON consumes only necessary amount of resource for the required quality.
References
==========
[1] https://github.com/damonitor/damo/blob/next/USAGE.md#access-report-styles
[2] https://github.com/facebookresearch/DCPerf/blob/main/packages/tao_bench/README.md
This patch (of 8):
Add data structures for DAMON sampling and aggregation intervals automatic
tuning that aims specific amount of DAMON-observed access events per
snapshot. In more detail, define the data structure for the tuning goal,
link it to the monitoring attributes data structure so that DAMON kernel
API callers can make the request, and update parameters setup DAMON
function to respect the new parameter.
Link: https://lkml.kernel.org/r/20250303221726.484227-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250303221726.484227-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: add support for hugepage_size DAMOS filter", v5.
hugepage_size DAMOS filter can be used to gather statistics to check if
memory regions of specific access tempratures are backed by hugepages of a
size in a specific range. This filter can help to observe and prove the
effectivenes of different schemes for shrinking/collapsing hugepages.
This patch (of 4):
This is to gather statistics to check if memory regions of specific access
tempratures are backed by pages of a size in a specific range. This
filter can help to observe and prove the effectivenes of different schemes
for shrinking/collapsing hugepages.
[sj@kernel.org: add kernel-doc comment for damos_filter->sz_range]
Link: https://lkml.kernel.org/r/20250218223058.52459-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250211124437.278873-1-usamaarif642@gmail.com
Link: https://lkml.kernel.org/r/20250211124437.278873-2-usamaarif642@gmail.com
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damos_walk_control can be installed while DAMOS is walking the regions.
This means the walk callback function invocations can be started from a
region at the middle of the regions list. This makes it hard to be used
reliably. Particularly, DAMOS tried regions update for collecting
monitoring results gets problematic results. Increase the
walk_control_lock critical section to do walking in entire regions
granularity.
Link: https://lkml.kernel.org/r/20250210182737.134994-4-sj@kernel.org
Fixes: bf0eaba0ff ("mm/damon/core: implement damos_walk()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damos_walk() invokes callback functions of schemes until all schemes
finishes at least one round of walks. If there are multiple DAMOS schemes
having different apply_interval, the callback functions for longer apply
interval scheme will be called for more than a round of the walk.
The behavior is different from the document (see damos_walk() kernel-doc
comment), and not useful. Make the behavior be same to the documented
one, by stopping invoking the callback if the walk for the given scheme is
completed.
Link: https://lkml.kernel.org/r/20250210182737.134994-3-sj@kernel.org
Fixes: bf0eaba0ff ("mm/damon/core: implement damos_walk()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon/core: fix wrong and/or useless damos_walk()
behaviors".
damos_walk() can finish working earlier or later than expected, and start
earlier than practical. First two behaviors are clearly wrong behavior
(doesn't follow the documentation) and all three behaviors are only making
the feature useless. Fix those.
This patch (of 3):
damos->walk_completed is only set, not unset. This can cause next
damos_walk() finish earlier than expected. Unset it after all
walk_completed is confirmed.
Link: https://lkml.kernel.org/r/20250210182737.134994-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250210182737.134994-2-sj@kernel.org
Fixes: bf0eaba0ff ("mm/damon/core: implement damos_walk()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
'paddr' DAMON operations set can apply a DAMOS scheme's action to a large
folio multiple times in single DAMOS-regions-walk if the folio is laid on
multiple DAMON regions. Add a field for DAMOS scheme object that can be
used by the underlying ops to know what was the last entity that the
scheme's action has applied. The core layer unsets the field when each
DAMOS-regions-walk is done for the given scheme. And update 'paddr' ops
to use the infrastructure to avoid the problem.
Link: https://lkml.kernel.org/r/20250207212033.45269-3-sj@kernel.org
Fixes: 57223ac295 ("mm/damon/paddr: support the pageout scheme")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reported-by: Usama Arif <usamaarif642@gmail.com>
Closes: https://lore.kernel.org/20250203225604.44742-3-usamaarif642@gmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The function for allocating and initialize a 'struct damos' object,
damon_new_scheme(), is not initializing damos->walk_completed field. Only
damos_walk_complete() is setting the field. Hence the field will be
eventually set and used correctly from second damos_walk() call for the
scheme. But the first damos_walk() could mistakenly not walk on the
regions. Actually, a common usage of DAMOS for taking an access pattern
snapshot is installing a monitoring-purpose DAMOS scheme, doing
damos_walk() to retrieve the snapshot, and then removing the scheme.
DAMON user-space tool (damo) also gets runtime snapshot in the way. Hence
the problem can continuously happen in such use cases. Initialize it
properly in the allocation function.
Link: https://lkml.kernel.org/r/20250228174450.41472-1-sj@kernel.org
Fixes: bf0eaba0ff ("mm/damon/core: implement damos_walk()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Filtering decisions are made in filters evaluation order. Once a decision
is made by a filter, filters that scheduled to be evaluated after the
decision-made filter should just respect it. This is the intended and
documented behavior. Since core layer-handled filters are evaluated
before operations layer-handled filters, decisions made on core layer
should respected by ops layer.
In case of reject filters, the decision is respected, since core
layer-rejected regions are not passed to ops layer. But in case of allow
filters, ops layer filters don't know if the region has passed to them
because it was allowed by core filters or just because it didn't match to
any core layer. The current wrong implementation assumes it was due to
not matched by any core filters. As a reuslt, the decision is not
respected. Pass the missing information to ops layer using a new filed in
'struct damos', and make the ops layer filters respect it.
Link: https://lkml.kernel.org/r/20250228175336.42781-1-sj@kernel.org
Fixes: 491fee286e ("mm/damon/core: support damos_filter->allow")
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Remove hard-coded strings by using the str_high_low() helper function.
Link: https://lkml.kernel.org/r/20250116204216.106999-2-thorsten.blum@linux.dev
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON API users should set damos_filter->allow manually to use a DAMOS
allow-filter, since damos_new_filter() unsets the field always. It is
cumbersome and easy to mistake. Add an arugment for setting the field to
damos_new_filter().
Link: https://lkml.kernel.org/r/20250109175126.57878-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMOS filters supports allowing behavior, but the core layer's DAMOS
filters handling logic still assumes only rejecting (filtering-out)
behavior. Update the logic to aware of and respect the behavioral
decision by reading damos_filter->allow when making the decision to
exclude a region or not.
Link: https://lkml.kernel.org/r/20250109175126.57878-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMOS filters work as only exclusive (reject) filters. This makes it easy
to be confused, and restrictive at combining multiple filters for covering
various types of memory.
Add a field named 'allow' to damos_filter. The field will be used to
indicate whether the filter should work for inclusion or exclusion. To
keep the old behavior, set it as 'false' (work as exclusive filter) by
default, from damos_new_filter().
Following two commits will make the core and operations set layers, which
handles damos_filter objects, respect the field, respectively.
Link: https://lkml.kernel.org/r/20250109175126.57878-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Total size of memory that passed DAMON operations set layer-handled DAMOS
filters per scheme is provided to DAMON core API and ABI (sysfs interface)
users. Having it per-region in non-accumulated way can provide it in
finer granularity. Provide it to damos_walk() core API users, by passing
the data to damos_walk_control->walk_fn().
Link: https://lkml.kernel.org/r/20250106193401.109161-13-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Implement a new per-DAMOS scheme statistic field, namely
sz_ops_filter_passed, using the changed damon_operations->apply_scheme()
interface. It counts total bytes of memory that given DAMOS action tried
to be applied, and passed the operations layer handled region-internal
filters of the scheme. DAMON API users can access it using DAMON-internal
safe access features such as damon_call() and/or damos_walk().
Link: https://lkml.kernel.org/r/20250106193401.109161-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Some DAMOS filter types including those for young page, anon page, and
belonging memcg are handled by underlying DAMON operations set
implementation, via damon_operations->apply_scheme() interface. How many
bytes of the region have passed the filter can be useful for DAMOS scheme
tuning and access pattern monitoring. Modify the interface to let the
callback implementation reports back the number if possible.
Link: https://lkml.kernel.org/r/20250106193401.109161-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce a new core layer interface, damos_walk(). It aims to replace
some damon_callback usages that access DAMOS schemes applied regions of
ongoing kdamond with additional synchronizations. It receives a function
pointer and asks kdamond to invoke it for any region that it tried to
apply any DAMOS action within one scheme apply interval for every scheme
of it. The function further waits until the kdamond finishes the
invocations for every scheme, or cancels the request, and returns.
The kdamond invokes the function as requested within the main loop. If it
is deactivated by DAMOS watermarks or going out of the main loop, it marks
the request as canceled, so that damos_walk() can wakeup and return.
Link: https://lkml.kernel.org/r/20250103174400.54890-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce a new DAMON core API function, damon_call(). It aims to replace
some damon_callback usages that access damon_ctx of ongoing kdamond with
additional synchronizations. It receives a function pointer, let the
parallel kdamond invokes the function, and returns after the invocation is
finished, or canceled due to some races.
kdamond invokes the function inside the main loop after sampling is done.
If it is deactivated by DAMOS watermarks or already out of the main loop,
mark the request as canceled so that damon_call() can wakeup and return.
Link: https://lkml.kernel.org/r/20250103174400.54890-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>