linux/mm
Qi Zheng 6375e95f38 mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)
Now in order to pursue high performance, applications mostly use some
high-performance user-mode memory allocators, such as jemalloc or
tcmalloc.  These memory allocators use madvise(MADV_DONTNEED or MADV_FREE)
to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will
release page table memory, which may cause huge page table memory usage.

The following are a memory usage snapshot of one process which actually
happened on our server:

        VIRT:  55t
        RES:   590g
        VmPTE: 110g

In this case, most of the page table entries are empty.  For such a PTE
page where all entries are empty, we can actually free it back to the
system for others to use.

As a first step, this commit aims to synchronously free the empty PTE
pages in madvise(MADV_DONTNEED) case.  We will detect and free empty PTE
pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
cases other than madvise(MADV_DONTNEED).

Once an empty PTE is detected, we first try to hold the pmd lock within
the pte lock.  If successful, we clear the pmd entry directly (fast path).
Otherwise, we wait until the pte lock is released, then re-hold the pmd
and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
whether the PTE page is empty and free it (slow path).

For other cases such as madvise(MADV_FREE), consider scanning and freeing
empty PTE pages asynchronously in the future.

The following code snippet can show the effect of optimization:

        mmap 50G
        while (1) {
                for (; i < 1024 * 25; i++) {
                        touch 2M memory
                        madvise MADV_DONTNEED 2M
                }
        }

As we can see, the memory usage of VmPTE is reduced:

                        before                          after
VIRT                   50.0 GB                        50.0 GB
RES                     3.1 MB                         3.1 MB
VmPTE                102640 KB                         240 KB

[zhengqi.arch@bytedance.com: fix uninitialized symbol 'ptl']
  Link: https://lkml.kernel.org/r/20241206112348.51570-1-zhengqi.arch@bytedance.com
  Link: https://lore.kernel.org/linux-mm/224e6a4e-43b5-4080-bdd8-b0a6fb2f0853@stanley.mountain/
Link: https://lkml.kernel.org/r/92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Zach O'Keefe <zokeefe@google.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13 22:40:48 -08:00
..
damon mm/damon/core: remove duplicate list_empty quota->goals check 2025-01-13 22:40:34 -08:00
kasan mm:kasan: fix sparse warnings: Should it be static? 2025-01-13 22:40:42 -08:00
kfence mm/kfence: add a new kunit test test_use_after_free_read_nofault() 2024-11-14 22:49:19 -08:00
kmsan mm, kasan, kmsan: instrument copy_from/to_kernel_nofault 2024-11-06 20:11:14 -08:00
backing-dev.c
balloon_compaction.c
bootmem_info.c bootmem: stop using page->index 2024-11-07 14:38:07 -08:00
cma.c cma: enforce non-zero pageblock_order during cma_init_reserved_mem() 2024-11-14 22:49:19 -08:00
cma.h mm: change type of cma_area_count to unsigned int 2025-01-13 22:40:35 -08:00
cma_debug.c
cma_sysfs.c
compaction.c mm/page_alloc: move set_page_refcounted() to callers of post_alloc_hook() 2025-01-13 22:40:31 -08:00
debug.c mm: open-code page_folio() in dump_page() 2024-12-05 19:54:45 -08:00
debug_page_alloc.c
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries 2024-09-17 01:07:01 -07:00
dmapool.c
dmapool_test.c
early_ioremap.c
execmem.c alloc_tag: populate memory for module tags as needed 2024-11-07 14:25:16 -08:00
fadvise.c fdget(), trivial conversions 2024-11-03 01:28:06 -05:00
fail_page_alloc.c fault-inject: improve build for CONFIG_FAULT_INJECTION=n 2024-09-01 20:43:33 -07:00
failslab.c fault-inject: improve build for CONFIG_FAULT_INJECTION=n 2024-09-01 20:43:33 -07:00
filemap.c filemap: remove unused folio_add_wait_queue 2025-01-13 22:40:38 -08:00
folio-compat.c mm/writeback: add folio_mark_dirty_lock() 2024-11-05 11:14:32 +01:00
gup.c mm/gup: handle NULL pages in unpin_user_pages() 2024-12-05 19:54:42 -08:00
gup_test.c
gup_test.h
highmem.c
hmm.c
huge_memory.c mm: migrate: remove unused argument vma from migrate_misplaced_folio() 2025-01-13 22:40:30 -08:00
hugetlb.c mm/hugetlb: don't map folios writable without VM_WRITE when copying during fork() 2025-01-13 22:40:46 -08:00
hugetlb_cgroup.c mm/hugetlb_cgroup: avoid useless return in void function 2025-01-13 22:40:34 -08:00
hugetlb_vmemmap.c mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO 2024-09-01 20:25:45 -07:00
hugetlb_vmemmap.h
hwpoison-inject.c
init-mm.c
internal.h mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED) 2025-01-13 22:40:48 -08:00
interval_tree.c
io-mapping.c
ioremap.c
Kconfig mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED) 2025-01-13 22:40:48 -08:00
Kconfig.debug slub: Introduce CONFIG_SLUB_RCU_DEBUG 2024-08-27 14:12:51 +02:00
khugepaged.c mm: khugepaged: recheck pmd state in retract_page_tables() 2025-01-13 22:40:46 -08:00
kmemleak.c mm/kmemleak: fix percpu memory leak detection failure 2025-01-12 19:03:34 -08:00
ksm.c - The series "zram: optimal post-processing target selection" from 2024-11-23 09:58:07 -08:00
list_lru.c mm/list_lru: fix false warning of negative counter 2024-12-30 17:59:10 -08:00
maccess.c kasan: migrate copy_user_test to kunit 2024-11-11 00:26:44 -08:00
madvise.c mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED) 2025-01-13 22:40:48 -08:00
Makefile mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED) 2025-01-13 22:40:48 -08:00
mapping_dirty_helpers.c
memblock.c memblock: allow zero threshold in validate_numa_converage() 2024-12-01 21:08:56 +02:00
memcontrol-v1.c mm: remove the non-useful else after a break in a if statement 2025-01-13 22:40:40 -08:00
memcontrol-v1.h mm: memcg: declare do_memsw_account inline 2024-12-05 19:54:46 -08:00
memcontrol.c memcg/hugetlb: add hugeTLB counters to memcg 2024-11-14 22:49:19 -08:00
memfd.c mm: reinstate ability to map write-sealed memfd mappings read-only 2024-12-30 17:59:06 -08:00
memory-failure.c mm/memory-failure: replace sprintf() with sysfs_emit() 2024-11-11 00:26:46 -08:00
memory-tiers.c memory tiers: use default_dram_perf_ref_source in log message 2024-09-26 14:01:44 -07:00
memory.c mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED) 2025-01-13 22:40:48 -08:00
memory_hotplug.c mm/page_isolation: don't pass gfp flags to start_isolate_page_range() 2025-01-13 22:40:44 -08:00
mempolicy.c mm/mempolicy: add alloc_frozen_pages() 2025-01-13 22:40:33 -08:00
mempool.c
memremap.c
memtest.c
migrate.c mm: migrate: remove unused argument vma from migrate_misplaced_folio() 2025-01-13 22:40:30 -08:00
migrate_device.c mm: remap unused subpages to shared zeropage when splitting isolated thp 2024-09-09 16:39:03 -07:00
mincore.c
mlock.c mm/mlock: set the correct prev on failure 2024-11-07 14:14:58 -08:00
mm_init.c memblock: updates for 6.13-rc1 2024-11-27 11:13:25 -08:00
mm_slot.h
mmap.c mm/vma: move __vm_munmap() to mm/vma.c 2025-01-13 22:40:43 -08:00
mmap_lock.c mm: mmap_lock: optimize mmap_lock tracepoints 2025-01-13 22:40:34 -08:00
mmu_gather.c
mmu_notifier.c mm: move internal core VMA manipulation functions to own file 2024-09-01 20:25:54 -07:00
mmzone.c mm: improve code consistency with zonelist_* helper functions 2024-09-01 20:25:55 -07:00
mprotect.c mm: add PTE_MARKER_GUARD PTE marker 2024-11-11 00:26:44 -08:00
mremap.c mm: clear uffd-wp PTE/PMD state on mremap() 2025-01-12 19:03:37 -08:00
mseal.c mm: madvise: implement lightweight guard page mechanism 2024-11-11 00:26:45 -08:00
msync.c
nommu.c nommu: pass NULL argument to vma_iter_prealloc() 2024-11-11 17:20:23 -08:00
numa.c mm: make range-to-target_node lookup facility a part of numa_memblks 2024-09-03 21:15:32 -07:00
numa_emulation.c mm: introduce numa_emulation 2024-09-03 21:15:31 -07:00
numa_memblks.c mm: numa_clear_kernel_node_hotplug: Add NUMA_NO_NODE check for node id 2024-10-28 21:40:40 -07:00
oom_kill.c mm: move mm flags to mm_types.h 2024-11-05 16:56:26 -08:00
page-writeback.c mm: fix div by zero in bdi_ratio_from_pages 2025-01-12 19:03:36 -08:00
page_alloc.c mm/page_alloc: forward the gfp flags from alloc_contig_range() to post_alloc_hook() 2025-01-13 22:40:45 -08:00
page_counter.c mm, memcg: cg2 memory{.swap,}.peak write handlers 2024-09-01 20:25:53 -07:00
page_ext.c mm: don't account memmap per-node 2024-08-15 22:16:14 -07:00
page_frag_cache.c mm/page_alloc: export free_frozen_pages() instead of free_unref_page() 2025-01-13 22:40:31 -08:00
page_idle.c
page_io.c mm: add per-order mTHP swpin counters 2024-11-11 00:26:43 -08:00
page_isolation.c mm/page_isolation: don't pass gfp flags to start_isolate_page_range() 2025-01-13 22:40:44 -08:00
page_owner.c
page_poison.c
page_reporting.c
page_reporting.h
page_table_check.c
page_vma_mapped.c mm: mass constification of folio/page pointers 2024-11-07 14:38:07 -08:00
pagewalk.c mm: pagewalk: add the ability to install PTEs 2024-11-11 00:26:44 -08:00
percpu-internal.h
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c mm: use page->private instead of page->index in percpu 2024-11-07 14:38:07 -08:00
pgalloc-track.h
pgtable-generic.c mm: add RCU annotation to pte_offset_map(_lock) 2024-12-18 19:04:43 -08:00
process_vm_access.c mm: refactor mm_access() to not return NULL 2024-11-05 16:56:23 -08:00
pt_reclaim.c mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED) 2025-01-13 22:40:48 -08:00
ptdump.c
readahead.c readahead: properly shorten readahead when falling back to do_page_cache_ra() 2025-01-13 22:40:45 -08:00
rmap.c mm: mass constification of folio/page pointers 2024-11-07 14:38:07 -08:00
rodata_test.c mm/rodata_test: verify test data is unchanged, rather than non-zero 2025-01-13 22:40:38 -08:00
secretmem.c secretmem: disable memfd_secret() if arch cannot set direct map 2024-10-09 12:47:19 -07:00
shmem.c mm: shmem: add a kernel command line to change the default huge policy for tmpfs 2025-01-13 22:40:37 -08:00
shmem_quota.c shmem_quota: build the object file conditionally to the config option 2024-09-01 20:25:45 -07:00
show_mem.c mm/show_mem: use str_yes_no() helper in show_free_areas() 2024-11-07 14:38:08 -08:00
shrinker.c mm: shrinker: avoid memleak in alloc_shrinker_info 2024-10-31 20:27:04 -07:00
shrinker_debug.c mm: shrinker: use min() to improve shrinker_debugfs_scan_write() 2024-09-03 21:15:40 -07:00
shuffle.c
shuffle.h
slab.h mm/slub: Avoid list corruption when removing a slab from the full list 2024-11-16 21:19:39 +01:00
slab_common.c slab updates for 6.13 2024-11-25 16:51:24 -08:00
slub.c kasan: make kasan_record_aux_stack_noalloc() the default behaviour 2025-01-13 22:40:36 -08:00
sparse-vmemmap.c mm: define general function pXd_init() 2024-11-11 17:22:27 -08:00
sparse.c bootmem: stop using page->index 2024-11-07 14:38:07 -08:00
swap.c mm/page_alloc: export free_frozen_pages() instead of free_unref_page() 2025-01-13 22:40:31 -08:00
swap.h mm: fix swap_read_folio_zeromap() for large folios with partial zeromap 2024-09-17 01:07:01 -07:00
swap_cgroup.c mm: swap_cgroup: get rid of __lookup_swap_cgroup() 2025-01-13 22:40:41 -08:00
swap_slots.c
swap_state.c mm: swap: use str_true_false() helper function 2024-11-06 20:11:14 -08:00
swapfile.c mm, swap: fix allocation and scanning race with swapoff 2024-11-14 15:25:07 -08:00
truncate.c - The series "zram: optimal post-processing target selection" from 2024-11-23 09:58:07 -08:00
usercopy.c
userfaultfd.c mm: userfaultfd: recheck dst_pmd entry in move_pages_pte() 2025-01-13 22:40:46 -08:00
util.c mm/util: make memdup_user_nul() similar to memdup_user() 2024-12-30 17:59:11 -08:00
vma.c mm/vma: move __vm_munmap() to mm/vma.c 2025-01-13 22:40:43 -08:00
vma.h mm/vma: move __vm_munmap() to mm/vma.c 2025-01-13 22:40:43 -08:00
vma_internal.h mm/vma: move brk() internals to mm/vma.c 2025-01-13 22:40:42 -08:00
vmalloc.c vmalloc: fix accounting with i915 2024-12-18 19:04:45 -08:00
vmpressure.c
vmscan.c mm: vmscan : pgdemote vmstat is not getting updated when MGLRU is enabled. 2025-01-12 19:03:38 -08:00
vmstat.c vmstat: disable vmstat_work on vmstat_cpu_down_prep() 2025-01-12 19:03:38 -08:00
workingset.c mm/list_lru: simplify the list_lru walk callback function 2024-11-11 17:22:26 -08:00
z3fold.c mm/z3fold: add __percpu annotation to *unbuddied pointer in struct z3fold_pool 2024-09-01 20:25:56 -07:00
zbud.c
zpool.c
zsmalloc.c mm/zsmalloc: use memcpy_from/to_page whereever possible 2024-11-07 14:38:07 -08:00
zswap.c mm/zswap: add LRU_STOP to comment about dropping the lru lock 2025-01-13 22:40:30 -08:00