linux/mm
Huang Ying e39bb6be9f NUMA Balancing: add page promotion counter
Patch series "NUMA balancing: optimize memory placement for memory tiering system", v13

With the advent of various new memory types, some machines will have
multiple types of memory, e.g.  DRAM and PMEM (persistent memory).  The
memory subsystem of these machines can be called memory tiering system,
because the performance of the different types of memory are different.

After commit c221c0b030 ("device-dax: "Hotplug" persistent memory for
use like normal RAM"), the PMEM could be used as the cost-effective
volatile memory in separate NUMA nodes.  In a typical memory tiering
system, there are CPUs, DRAM and PMEM in each physical NUMA node.  The
CPUs and the DRAM will be put in one logical node, while the PMEM will
be put in another (faked) logical node.

To optimize the system overall performance, the hot pages should be
placed in DRAM node.  To do that, we need to identify the hot pages in
the PMEM node and migrate them to DRAM node via NUMA migration.

In the original NUMA balancing, there are already a set of existing
mechanisms to identify the pages recently accessed by the CPUs in a node
and migrate the pages to the node.  So we can reuse these mechanisms to
build the mechanisms to optimize the page placement in the memory
tiering system.  This is implemented in this patchset.

At the other hand, the cold pages should be placed in PMEM node.  So, we
also need to identify the cold pages in the DRAM node and migrate them
to PMEM node.

In commit 26aa2d199d ("mm/migrate: demote pages during reclaim"), a
mechanism to demote the cold DRAM pages to PMEM node under memory
pressure is implemented.  Based on that, the cold DRAM pages can be
demoted to PMEM node proactively to free some memory space on DRAM node
to accommodate the promoted hot PMEM pages.  This is implemented in this
patchset too.

We have tested the solution with the pmbench memory accessing benchmark
with the 80:20 read/write ratio and the Gauss access address
distribution on a 2 socket Intel server with Optane DC Persistent Memory
Model.  The test results shows that the pmbench score can improve up to
95.9%.

This patch (of 3):

In a system with multiple memory types, e.g.  DRAM and PMEM, the CPU
and DRAM in one socket will be put in one NUMA node as before, while
the PMEM will be put in another NUMA node as described in the
description of the commit c221c0b030 ("device-dax: "Hotplug"
persistent memory for use like normal RAM").  So, the NUMA balancing
mechanism will identify all PMEM accesses as remote access and try to
promote the PMEM pages to DRAM.

To distinguish the number of the inter-type promoted pages from that of
the inter-socket migrated pages.  A new vmstat count is added.  The
counter is per-node (count in the target node).  So this can be used to
identify promotion imbalance among the NUMA nodes.

Link: https://lkml.kernel.org/r/20220301085329.3210428-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20220221084529.1052339-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20220221084529.1052339-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:09 -07:00
..
damon mm/damon: hide kernel pointer from tracepoint event 2022-01-15 16:30:33 +02:00
kasan lib/stackdepot: always do filter_irq_stacks() in stack_depot_save() 2022-01-22 08:33:38 +02:00
kfence kfence: make test case compatible with run time set sample interval 2022-02-11 17:55:00 -08:00
backing-dev.c remove congestion tracking framework 2022-03-22 15:57:01 -07:00
balloon_compaction.c
bootmem_info.c bootmem: Use page->index instead of page->freelist 2022-01-06 12:27:03 +01:00
cma.c mm/cma: provide option to opt out from exposing pages on activation failure 2022-03-22 15:57:09 -07:00
cma.h mm/cma: provide option to opt out from exposing pages on activation failure 2022-03-22 15:57:09 -07:00
cma_debug.c
cma_sysfs.c
compaction.c mm: compaction: cleanup the compaction trace events 2022-03-22 15:57:09 -07:00
debug.c mm,fs: split dump_mapping() out from dump_page() 2022-01-15 16:30:26 +02:00
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable: remove pte entry from the page table 2022-02-04 09:25:04 -08:00
dmapool.c mm/dmapool.c: revert "make dma pool to use kmalloc_node" 2022-01-15 16:30:28 +02:00
early_ioremap.c
fadvise.c remove inode_congested() 2022-03-22 15:57:01 -07:00
failslab.c
filemap.c tmpfs: do not allocate pages on read 2022-03-22 15:57:02 -07:00
folio-compat.c filemap: Add filemap_release_folio() 2022-01-04 13:15:34 -05:00
frontswap.c frontswap: remove support for multiple ops 2022-01-22 08:33:38 +02:00
gup.c mm/gup: remove unused get_user_pages_locked() 2022-03-22 15:57:01 -07:00
gup_test.c
gup_test.h
highmem.c Fixes for 5.16 folios: 2021-11-25 10:13:56 -08:00
hmm.c mm/hmm.c: allow VM_MIXEDMAP to work with hmm_range_fault 2022-01-15 16:30:31 +02:00
huge_memory.c mm/thp: refix __split_huge_pmd_locked() for migration PMD 2022-03-22 15:57:09 -07:00
hugetlb.c userfaultfd: provide unmasked address on page-fault 2022-03-22 15:57:08 -07:00
hugetlb_cgroup.c hugetlb: add hugetlb.*.numa_stat file 2022-01-15 16:30:29 +02:00
hugetlb_vmemmap.c mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key 2022-03-22 15:57:08 -07:00
hugetlb_vmemmap.h
hwpoison-inject.c mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler 2022-03-22 15:57:07 -07:00
init-mm.c
internal.h mm/sparse: make mminit_validate_memmodel_limits() static 2022-03-22 15:57:05 -07:00
interval_tree.c
io-mapping.c
ioremap.c
Kconfig mm/hugetlb: generalize ARCH_WANT_GENERAL_HUGETLB 2022-03-22 15:57:08 -07:00
Kconfig.debug mm: page table check 2022-01-15 16:30:28 +02:00
khugepaged.c mm/page_table_check: check entries at pmd levels 2022-02-04 09:25:04 -08:00
kmemleak.c mm/kmemleak: avoid scanning potential huge holes 2022-02-04 09:25:05 -08:00
ksm.c mm: ksm: fix use-after-free kasan report in ksm_might_need_to_copy 2022-01-15 16:30:31 +02:00
list_lru.c mm/list_lru: optimize memcg_reparent_list_lru_node() 2022-03-22 15:57:08 -07:00
maccess.c
madvise.c mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler 2022-03-22 15:57:07 -07:00
Makefile mm: remove cleancache 2022-01-22 08:33:38 +02:00
mapping_dirty_helpers.c mm: move tlb_flush_pending inline helpers to mm_inline.h 2022-01-15 16:30:27 +02:00
memblock.c memblock: use kfree() to release kmalloced memblock regions 2022-02-20 08:45:39 +02:00
memcontrol.c mm: memcontrol: fix cannot alloc the maximum memcg ID 2022-03-22 15:57:03 -07:00
memfd.c memfd: fix F_SEAL_WRITE after shmem huge page allocated 2022-03-05 11:08:32 -08:00
memory-failure.c mm/memory-failure.c: make non-LRU movable pages unhandlable 2022-03-22 15:57:07 -07:00
memory.c userfaultfd: provide unmasked address on page-fault 2022-03-22 15:57:08 -07:00
memory_hotplug.c mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key 2022-03-22 15:57:08 -07:00
mempolicy.c mempolicy: mbind_range() set_policy() after vma_merge() 2022-03-22 15:57:09 -07:00
mempool.c
memremap.c mm/memremap: avoid calling kasan_remove_zero_shadow() for device private memory 2022-03-22 15:57:01 -07:00
memtest.c
migrate.c NUMA Balancing: add page promotion counter 2022-03-22 15:57:09 -07:00
mincore.c
mlock.c mm/mlock: fix potential imbalanced rlimit ucounts adjustment 2022-03-22 15:57:07 -07:00
mm_init.c
mmap.c mm/mmap: remove obsolete comment in ksys_mmap_pgoff 2022-03-22 15:57:05 -07:00
mmap_lock.c
mmu_gather.c mm: move tlb_flush_pending inline helpers to mm_inline.h 2022-01-15 16:30:27 +02:00
mmu_notifier.c
mmzone.c mm/mmzone.c: use try_cmpxchg() in page_cpupid_xchg_last() 2022-03-22 15:57:05 -07:00
mprotect.c mm: refactor vm_area_struct::anon_vma_name usage code 2022-03-05 11:08:32 -08:00
mremap.c mm/mremap:: use vma_lookup() instead of find_vma() 2022-03-22 15:57:05 -07:00
msync.c
nommu.c Merge branch 'akpm' (patches from Andrew) 2021-11-06 14:08:17 -07:00
oom_kill.c mm/oom_kill: remove unneeded is_memcg_oom check 2022-03-22 15:57:09 -07:00
page-writeback.c mm/writeback: minor clean up for highmem_dirtyable_memory 2022-03-22 15:57:01 -07:00
page_alloc.c mm/hwpoison-inject: support injecting hwpoison to free page 2022-03-22 15:57:07 -07:00
page_counter.c mm/page_counter: remove an incorrect call to propagate_protected_usage() 2022-01-15 16:30:27 +02:00
page_ext.c mm: make some vars and functions static or __init 2022-01-15 16:30:31 +02:00
page_idle.c
page_io.c delayacct: support swapin delay accounting for swapping without blkio 2022-01-20 08:52:55 +02:00
page_isolation.c Revert "mm/page_isolation: unset migratetype directly for non Buddy page" 2022-02-04 09:25:04 -08:00
page_owner.c lib/stackdepot: allow optional init and stack_table allocation by kvmalloc() 2022-01-22 08:33:37 +02:00
page_poison.c
page_reporting.c
page_reporting.h
page_table_check.c mm/page_table_check: check entries at pmd levels 2022-02-04 09:25:04 -08:00
page_vma_mapped.c
pagewalk.c
percpu-internal.h mm: memcg/percpu: account extra objcg space to memory cgroups 2022-01-15 16:30:31 +02:00
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c bitmap patches for 5.17-rc1 2022-01-23 06:20:44 +02:00
pgalloc-track.h
pgtable-generic.c mm: move tlb_flush_pending inline helpers to mm_inline.h 2022-01-15 16:30:27 +02:00
process_vm_access.c
ptdump.c mm: sparsemem: use page table lock to protect kernel pmd operations 2022-03-22 15:57:08 -07:00
readahead.c remove inode_congested() 2022-03-22 15:57:01 -07:00
rmap.c mm/rmap: fix potential batched TLB flush race 2022-01-15 16:30:31 +02:00
rodata_test.c
secretmem.c
shmem.c mm: shmem: fix missing cache flush in shmem_mfill_atomic_pte() 2022-03-22 15:57:04 -07:00
shuffle.c
shuffle.h
slab.c mm: introduce kmem_cache_alloc_lru 2022-03-22 15:57:03 -07:00
slab.h mm: introduce kmem_cache_alloc_lru 2022-03-22 15:57:03 -07:00
slab_common.c Merge branch 'akpm' (patches from Andrew) 2022-01-15 20:37:06 +02:00
slob.c mm: introduce kmem_cache_alloc_lru 2022-03-22 15:57:03 -07:00
slub.c mm: introduce kmem_cache_alloc_lru 2022-03-22 15:57:03 -07:00
sparse-vmemmap.c mm: sparsemem: move vmemmap related to HugeTLB to CONFIG_HUGETLB_PAGE_FREE_VMEMMAP 2022-03-22 15:57:08 -07:00
sparse.c mm/sparse: make mminit_validate_memmodel_limits() static 2022-03-22 15:57:05 -07:00
swap.c mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu 2022-03-22 15:57:08 -07:00
swap_cgroup.c
swap_slots.c treewide: Add missing includes masked by cgroup -> bpf dependency 2021-12-03 10:58:13 -08:00
swap_state.c mm: swap: get rid of livelock in swapin readahead 2022-03-17 11:02:13 -07:00
swapfile.c userfaultfd: provide unmasked address on page-fault 2022-03-22 15:57:08 -07:00
truncate.c mm: remove cleancache 2022-01-22 08:33:38 +02:00
usercopy.c mm: Convert check_heap_object() to use struct slab 2022-01-06 12:25:51 +01:00
userfaultfd.c mm: userfaultfd: fix missing cache flush in mcopy_atomic_pte() and __mcopy_atomic() 2022-03-22 15:57:04 -07:00
util.c mm: Consider __GFP_NOWARN flag for oversized kvmalloc() calls 2022-03-04 10:00:37 -08:00
vmacache.c
vmalloc.c mm/vmalloc.c: fix "unused function" warning 2022-03-22 15:57:05 -07:00
vmpressure.c mm/vmpressure: fix data-race with memcg->socket_pressure 2021-11-06 13:30:40 -07:00
vmscan.c mm: vmscan: fix documentation for page_check_references() 2022-03-22 15:57:09 -07:00
vmstat.c NUMA Balancing: add page promotion counter 2022-03-22 15:57:09 -07:00
workingset.c mm: workingset: replace IRQ-off check with a lockdep assert. 2022-03-22 15:57:08 -07:00
z3fold.c
zbud.c
zpool.c zpool: remove the list of pools_head 2022-01-15 16:30:31 +02:00
zsmalloc.c zsmalloc: replace get_cpu_var with local_lock 2022-01-22 08:33:37 +02:00
zswap.c frontswap: remove support for multiple ops 2022-01-22 08:33:38 +02:00