============================
Transparent Hugepage Support
============================

Objective
=========

Performance critical computing applications dealing with large memory
working sets are already running on top of libhugetlbfs and in turn
hugetlbfs. Transparent HugePage Support (THP) is an alternative means
of using huge pages for the backing of virtual memory. It supports the
automatic promotion and demotion of page sizes and does not have the
shortcomings of hugetlbfs.

Currently THP only works for anonymous memory mappings and
tmpfs/shmem, but in the future it can expand to other filesystems.

.. note::
   In the examples below we presume that the basic page size is 4K and
   the huge page size is 2M, although the actual numbers may vary
   depending on the CPU architecture.

The reason applications run faster is because of two factors. The
first factor is almost completely irrelevant and it's not of
significant interest because it also has the downside of requiring
larger clear-page copy-page operations in page faults, which is a
potentially negative effect. The first factor consists of taking a
single page fault for each 2M virtual region touched by userland (thus
reducing the enter/exit kernel frequency by a factor of 512). This
only matters the first time the memory is accessed for the lifetime of
a memory mapping. The second, long lasting and much more important
factor affects all subsequent accesses to the memory for the whole
runtime of the application. The second factor consists of two
components:

1) the TLB miss will run faster (especially with virtualization using
   nested pagetables but almost always also on bare metal without
   virtualization)

2) a single TLB entry will be mapping a much larger amount of virtual
   memory in turn reducing the number of TLB misses. With
   virtualization and nested pagetables the TLB can use entries of
   larger size only if both KVM and the Linux guest are using
   hugepages, but a significant speedup already happens if only one of
   the two is using hugepages, just because of the fact the TLB miss
   is going to run faster.

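The factor-of-512 arithmetic above can be sanity-checked directly; a
minimal sketch, assuming a 4K base page and a 2M huge page (the real
sizes vary by architecture):

```shell
# One page fault per 2M huge page instead of one per 4K base page.
base=$((4 * 1024))          # base page size in bytes (assumed 4K)
huge=$((2 * 1024 * 1024))   # huge page size in bytes (assumed 2M)
echo "faults reduced by a factor of $((huge / base))"   # prints 512
```

The same ratio is why a single TLB entry covers 512 times more virtual
memory when it maps a huge page.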
Modern kernels support "multi-size THP" (mTHP), which introduces the
ability to allocate memory in blocks that are bigger than a base page
but smaller than traditional PMD-size (as described above), in
increments of a power-of-2 number of pages. mTHP can back anonymous
memory (for example 16K, 32K, 64K, etc). These THPs continue to be
PTE-mapped, but in many cases can still provide similar benefits to
those outlined above: Page faults are significantly reduced (by a
factor of e.g. 4, 8, 16, etc), but latency spikes are much less
prominent because the size of each page isn't as huge as the PMD-sized
variant and there is less memory to clear in each page fault. Some
architectures also employ TLB compression mechanisms to squeeze more
entries in when a set of PTEs are virtually and physically contiguous
and appropriately aligned. In this case, TLB misses will occur less
often.

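The reduction factors quoted above (4, 8, 16, etc) are simply the
number of 4K base pages covered by each mTHP block; a small
illustrative sketch:

```shell
# Page-fault reduction factor for a few example mTHP sizes,
# assuming a 4K base page: one fault covers size/4K base pages.
for kb in 16 32 64 128; do
    echo "${kb}K mTHP -> one fault covers $((kb / 4)) base pages"
done
```
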
THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside a task's address space. Unless THP is completely
disabled, there is the ``khugepaged`` daemon that scans memory and
collapses sequences of basic pages into PMD-sized huge pages.

The THP behaviour is controlled via the :ref:`sysfs <thp_sysfs>`
interface and via the madvise(2) and prctl(2) system calls.

Compared to the reservation approach of hugetlbfs, Transparent
Hugepage Support maximizes the usefulness of free memory by allowing
all unused memory to be used as cache or for other movable (or even
unmovable) entities. It doesn't require a reservation to prevent
hugepage allocation failures from being noticeable from userland. It
allows paging and all other advanced VM features to be available on
the hugepages. It requires no modifications for applications to take
advantage of it.

Applications however can be further optimized to take advantage of
this feature, as for example they've been optimized before to avoid a
flood of mmap system calls for every malloc(4k). Optimizing userland
is by far not mandatory and khugepaged already can take care of long
lived page allocations even for hugepage unaware applications that
deal with large amounts of memory.

In certain cases when hugepages are enabled system wide, applications
may end up allocating more memory resources. An application may mmap a
large region but only touch 1 byte of it; in that case a 2M page might
be allocated instead of a 4k page for no good reason. This is why it's
possible to disable hugepages system-wide and to only have them inside
MADV_HUGEPAGE madvise regions.

Embedded systems should enable hugepages only inside madvise regions
to eliminate any risk of wasting precious bytes of memory while still
benefiting from the performance gains.

Applications that get a lot of benefit from hugepages and that don't
risk losing memory by using hugepages should use
madvise(MADV_HUGEPAGE) on their critical mmapped regions.

.. _thp_sysfs:

sysfs
=====

Global THP controls
-------------------

Transparent Hugepage Support for anonymous memory can be disabled
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
regions (to avoid the risk of consuming more memory resources) or
enabled system wide. This can be achieved per-supported-THP-size with
one of::

	echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
	echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled
	echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled

where <size> is the hugepage size being addressed, the available sizes
for which vary by system.

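The available sizes can be discovered by listing the per-size
directories; a minimal sketch, guarded so it prints nothing on systems
where these sysfs entries are absent:

```shell
# List the THP sizes supported by the running kernel, if any.
for d in /sys/kernel/mm/transparent_hugepage/hugepages-*kB; do
    [ -d "$d" ] || continue    # glob did not match: no such entries
    echo "supported size: ${d##*/}"
done
```
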
.. note::
   Setting "never" in all sysfs THP controls does **not** disable
   Transparent Huge Pages globally. This is because ``madvise(...,
   MADV_COLLAPSE)`` ignores these settings and collapses ranges to
   PMD-sized huge pages unconditionally.

For example::

	echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

Alternatively it is possible to specify that a given hugepage size
will inherit the top-level "enabled" value::

	echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled

For example::

	echo inherit >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

The top-level setting (for use with "inherit") can be set by issuing
one of the following commands::

	echo always >/sys/kernel/mm/transparent_hugepage/enabled
	echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
	echo never >/sys/kernel/mm/transparent_hugepage/enabled

By default, PMD-sized hugepages have enabled="inherit" and all other
hugepage sizes have enabled="never". If enabling multiple hugepage
sizes, the kernel will select the most appropriate enabled size for a
given allocation.

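The current per-size configuration can be read back from the same
files; reading ``enabled`` shows all possible values with the active
one in brackets. A sketch, again guarded for systems without these
controls:

```shell
# Print the active "enabled" selection for every supported THP size,
# e.g. "hugepages-2048kB: always inherit madvise [never]".
for f in /sys/kernel/mm/transparent_hugepage/hugepages-*kB/enabled; do
    [ -r "$f" ] || continue
    d=${f%/enabled}
    printf '%s: %s\n' "${d##*/}" "$(cat "$f")"
done
```
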
It's also possible to limit defrag efforts in the VM to generate
anonymous hugepages in case they're not immediately free to madvise
regions, or to never try to defrag memory and simply fall back to
regular pages unless hugepages are immediately available. Clearly if
we spend CPU time to defrag memory, we would expect to gain even more
by the fact we use hugepages later instead of regular pages. This
isn't always guaranteed, but it may be more likely in case the
allocation is for a MADV_HUGEPAGE region.

::

	echo always >/sys/kernel/mm/transparent_hugepage/defrag
	echo defer >/sys/kernel/mm/transparent_hugepage/defrag
	echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
	echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
	echo never >/sys/kernel/mm/transparent_hugepage/defrag

always
	means that an application requesting THP will stall on
	allocation failure and directly reclaim pages and compact
	memory in an effort to allocate a THP immediately. This may be
	desirable for virtual machines that benefit heavily from THP
	use and are willing to delay the VM start to utilise them.

defer
	means that an application will wake kswapd in the background
	to reclaim pages and wake kcompactd to compact memory so that
	THP is available in the near future. It's the responsibility
	of khugepaged to then install the THP pages later.

defer+madvise
	will enter direct reclaim and compaction like ``always``, but
	only for regions that have used madvise(MADV_HUGEPAGE); all
	other regions will wake kswapd in the background to reclaim
	pages and wake kcompactd to compact memory so that THP is
	available in the near future.

madvise
	will enter direct reclaim like ``always`` but only for regions
	that have used madvise(MADV_HUGEPAGE). This is the default
	behaviour.

never
	should be self-explanatory. Note that ``madvise(...,
	MADV_COLLAPSE)`` can still cause transparent huge pages to be
	obtained even if this mode is specified everywhere.

By default the kernel tries to use a huge, PMD-mappable zero page on
read page faults to anonymous mappings. It's possible to disable the
huge zero page by writing 0 or enable it back by writing 1::

	echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
	echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page

Some userspace (such as a test program, or an optimized memory
allocation library) may want to know the size (in bytes) of a
PMD-mappable transparent hugepage::

	cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size

2024-08-30 11:03:40 +01:00
|
|
|
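As a rough illustration of what that value means, the arithmetic below uses the example sizes assumed throughout this document (4K base pages, 2M PMD-sized THP); on a live system the first value would come from reading ``hpage_pmd_size`` as shown above.

```shell
# Illustrative constants only: a real tool would read hpage_pmd_size from
# sysfs instead of hard-coding the typical x86-64 value.
hpage_size=2097152   # typical hpage_pmd_size (2M)
page_size=4096       # typical base page size
pages_per_thp=$((hpage_size / page_size))
echo "one PMD-sized THP spans ${pages_per_thp} base pages"
```

This is also the factor by which the first-touch page-fault count drops, as described earlier in this document.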
All THPs at fault and collapse time will be added to _deferred_list,
and will therefore be split under memory pressure if they are considered
"underused". A THP is underused if the number of zero-filled pages in
the THP is above max_ptes_none (see below). It is possible to disable
this behaviour by writing 0 to shrink_underused, and enable it by writing
1 to it::

        echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
        echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused

khugepaged will be automatically started when PMD-sized THP is enabled
(either the per-size anon control or the top-level control is set
to "always" or "madvise"), and it'll be automatically shut down when
PMD-sized THP is disabled (when both the per-size anon control and the
top-level control are "never").

Khugepaged controls
-------------------

.. note::
   khugepaged currently only searches for opportunities to collapse to
   PMD-sized THP and no attempt is made to collapse to other THP
   sizes.

khugepaged usually runs at low frequency, so while one may not want to
invoke defrag algorithms synchronously during page faults, it should be
worthwhile to invoke defrag at least in khugepaged. However, it's also
possible to disable defrag in khugepaged by writing 0, or to enable it
by writing 1::

        echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
        echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag

You can also control how many pages khugepaged should scan at each
pass::

        /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

and how many milliseconds to wait in khugepaged between each pass (you
can set this to 0 to run khugepaged at 100% utilization of one core)::

        /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

and how many milliseconds to wait in khugepaged if there's a hugepage
allocation failure, to throttle the next allocation attempt::

        /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

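Taken together, the knobs above can be tuned as a set. The sketch below writes one hypothetical "collapse more aggressively" profile; the values are illustrative, not recommendations. It targets a throwaway mock directory so it can run unprivileged; on a live system, point ``kh`` at the real khugepaged sysfs path instead.

```shell
# Mock stand-in for /sys/kernel/mm/transparent_hugepage/khugepaged so the
# sketch runs without root; the knob names match the files described above.
kh=$(mktemp -d)
echo 1     > "$kh/defrag"                 # allow defrag inside khugepaged
echo 4096  > "$kh/pages_to_scan"          # scan more pages per pass
echo 1000  > "$kh/scan_sleep_millisecs"   # pause 1s between passes
echo 10000 > "$kh/alloc_sleep_millisecs"  # back off 10s after alloc failure
cat "$kh/pages_to_scan"
```

Each write maps one-to-one onto an ``echo value > knob`` against the real directory.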
The khugepaged progress can be seen in the number of pages collapsed (note
that this counter may not be an exact count of the number of pages
collapsed, since "collapsed" could mean multiple things: (1) a PTE mapping
being replaced by a PMD mapping, or (2) all 4K physical pages replaced by
one 2M hugepage. Each may happen independently, or together, depending on
the type of memory and the failures that occur. As such, this value should
be interpreted roughly as a sign of progress, and counters in /proc/vmstat
consulted for more accurate accounting)::

        /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed

for each pass::

        /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans

``max_ptes_none`` specifies how many extra small pages (that are
not already mapped) can be allocated when collapsing a group
of small pages into one large page::

        /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

A higher value allows programs to consume additional memory when pages
are collapsed. A lower value reduces the THP performance gain. The
CPU-time cost of max_ptes_none itself is negligible and can be ignored.

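To put a rough number on that memory cost: with 512 PTEs per PMD, a ``max_ptes_none`` of 511 (the usual default on such systems) permits a collapse in which all but one page is newly allocated. A hedged back-of-envelope, assuming 4K base pages:

```shell
max_ptes_none=511   # assumed default: 512 PTEs per PMD, minus one
page_size=4096      # assumed base page size
waste=$((max_ptes_none * page_size))
echo "worst case per collapse: ${waste} bytes (~$((waste / 1024)) KiB) newly allocated"
```

In other words, a single collapse may allocate almost a full 2M hugepage of memory the program never touched; lowering the knob bounds that waste at the cost of fewer collapses.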
``max_ptes_swap`` specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page::

        /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

A higher value can cause excessive swap IO and waste
memory. A lower value can prevent THPs from being
collapsed, resulting in fewer pages being collapsed into
THPs, and lower memory access performance.

``max_ptes_shared`` specifies how many pages can be shared across multiple
processes. khugepaged might treat pages of THPs as shared if any page of
that THP is shared. Exceeding the number would block the collapse::

        /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared

A higher value may increase memory footprint for some workloads.

Boot parameters
===============

You can change the sysfs boot time default for the top-level "enabled"
control by passing the parameter ``transparent_hugepage=always`` or
``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the
kernel command line.

Alternatively, each supported anonymous THP size can be controlled by
passing ``thp_anon=<size>[KMG],<size>[KMG]:<state>;<size>[KMG]-<size>[KMG]:<state>``,
where ``<size>`` is the THP size (it must be a power-of-2 multiple of
PAGE_SIZE and a supported anonymous THP size) and ``<state>`` is one of
``always``, ``madvise``, ``never`` or ``inherit``.

For example, the following will set 16K, 32K, 64K THP to ``always``,
set 128K, 512K to ``inherit``, set 256K to ``madvise`` and 1M, 2M
to ``never``::

        thp_anon=16K-64K:always;128K,512K:inherit;256K:madvise;1M-2M:never

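To make the mapping from that command line to the resulting per-size sysfs state concrete, the sketch below emulates it against a mock directory tree (a stand-in for ``/sys/kernel/mm/transparent_hugepage``). It illustrates the intended outcome of each range/state clause; it is not kernel code.

```shell
sysfs=$(mktemp -d)   # mock of /sys/kernel/mm/transparent_hugepage
for kb in 16 32 64 128 256 512 1024 2048; do
    mkdir -p "$sysfs/hugepages-${kb}kB"
done
for kb in 16 32 64;  do echo always  > "$sysfs/hugepages-${kb}kB/enabled"; done  # 16K-64K:always
for kb in 128 512;   do echo inherit > "$sysfs/hugepages-${kb}kB/enabled"; done  # 128K,512K:inherit
echo madvise > "$sysfs/hugepages-256kB/enabled"                                  # 256K:madvise
for kb in 1024 2048; do echo never   > "$sysfs/hugepages-${kb}kB/enabled"; done  # 1M-2M:never
cat "$sysfs/hugepages-256kB/enabled"
```

After boot, the real per-size ``enabled`` files would read back exactly these values.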
``thp_anon=`` may be specified multiple times to configure all THP sizes as
required. If ``thp_anon=`` is specified at least once, any anon THP sizes
not explicitly configured on the command line are implicitly set to
``never``.

The ``transparent_hugepage`` setting only affects the global toggle. If
``thp_anon`` is not specified, PMD_ORDER THP will default to ``inherit``.
However, if a valid ``thp_anon`` setting is provided by the user, the
PMD_ORDER THP policy will be overridden. If the policy for PMD_ORDER
is not defined within a valid ``thp_anon``, its policy will default to
``never``.

Similarly to ``transparent_hugepage``, you can control the hugepage
allocation policy for the internal shmem mount by using the kernel parameter
``transparent_hugepage_shmem=<policy>``, where ``<policy>`` is one of the
six valid policies for shmem (``always``, ``within_size``, ``advise``,
``never``, ``deny``, and ``force``).

Similarly to ``transparent_hugepage_shmem``, you can control the default
hugepage allocation policy for the tmpfs mount by using the kernel parameter
``transparent_hugepage_tmpfs=<policy>``, where ``<policy>`` is one of the
four valid policies for tmpfs (``always``, ``within_size``, ``advise``,
``never``). The tmpfs mount default policy is ``never``.

In the same manner as ``thp_anon`` controls each supported anonymous THP
size, ``thp_shmem`` controls each supported shmem THP size. ``thp_shmem``
has the same format as ``thp_anon``, but also supports the policy
``within_size``.

``thp_shmem=`` may be specified multiple times to configure all THP sizes
as required. If ``thp_shmem=`` is specified at least once, any shmem THP
sizes not explicitly configured on the command line are implicitly set to
``never``.

The ``transparent_hugepage_shmem`` setting only affects the global toggle.
If ``thp_shmem`` is not specified, PMD_ORDER hugepage will default to
``inherit``. However, if a valid ``thp_shmem`` setting is provided by the
user, the PMD_ORDER hugepage policy will be overridden. If the policy for
PMD_ORDER is not defined within a valid ``thp_shmem``, its policy will
default to ``never``.

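Putting the boot parameters above together, a combined command line might look like the hypothetical fragment below; the chosen policies are arbitrary illustrations, not recommendations.

```shell
# Hypothetical combined kernel command line fragment; on many distros this
# would be appended to GRUB_CMDLINE_LINUX in /etc/default/grub.
cmdline='transparent_hugepage=madvise transparent_hugepage_shmem=within_size'
cmdline="$cmdline thp_anon=2M:always thp_shmem=2M:within_size"
echo "$cmdline"
```

Here the global anon toggle is ``madvise``, but the explicit ``thp_anon`` clause overrides the PMD-sized anon policy to ``always``, per the precedence rules described above.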
Hugepages in tmpfs/shmem
========================

Traditionally, tmpfs only supported a single huge page size ("PMD"). Today,
it also supports smaller sizes just like anonymous memory, often referred
to as "multi-size THP" (mTHP). Huge pages of any size are commonly
represented in the kernel as "large folios".

While there is fine control over the huge page sizes to use for the internal
shmem mount (see below), ordinary tmpfs mounts will make use of all available
huge page sizes without any control over the exact sizes, behaving more like
other file systems.

tmpfs mounts
------------

The THP allocation policy for tmpfs mounts can be adjusted using the mount
option ``huge=``. It can have the following values:

always
    Attempt to allocate huge pages every time we need a new page;

never
    Do not allocate huge pages. Note that ``madvise(..., MADV_COLLAPSE)``
    can still cause transparent huge pages to be obtained even if this mode
    is specified everywhere;

within_size
    Only allocate huge page if it will be fully within i_size.
    Also respect madvise() hints;

advise
    Only allocate huge pages if requested with madvise();

Remember that the kernel may use huge pages of all available sizes, and
that ordinary tmpfs mounts offer no fine-grained per-size control of the
kind available for the internal shmem mount.

The default policy in the past was ``never``, but it can now be adjusted
using the kernel parameter ``transparent_hugepage_tmpfs=<policy>``.

``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
``huge=never`` will not attempt to break up huge pages at all, just stop more
from being allocated.

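As a concrete (hypothetical) sequence, mounting a tmpfs with ``huge=within_size`` and later disabling further huge allocations could look like this; both commands require root and ``/mnt/mytmpfs`` is a made-up mount point, so the sketch only prints the commands rather than running them.

```shell
mnt=/mnt/mytmpfs   # hypothetical mount point
printf '%s\n' \
    "mount -t tmpfs -o huge=within_size tmpfs $mnt" \
    "mount -o remount,huge=never $mnt"
```

Note that, per the paragraph above, the remount to ``huge=never`` leaves already-allocated huge pages intact.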
In addition to policies listed above, the sysfs knob
/sys/kernel/mm/transparent_hugepage/shmem_enabled will affect the
allocation policy of tmpfs mounts, when set to the following values:

deny
    For use in emergencies, to force the huge option off from
    all mounts;
force
    Force the huge option on for all - very useful for testing;

shmem / internal tmpfs
----------------------

The internal tmpfs mount is used for SysV SHM, memfds, shared anonymous
mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, and Ashmem.

To control the THP allocation policy for this internal tmpfs mount, the
sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled and the knobs
per THP size in
'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled'
can be used.

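For example, to make the global knob govern every size, each per-size knob would be set to 'inherit'. The loop below demonstrates the pattern against a mock directory tree so it can run unprivileged; on a real system ``root`` would be /sys/kernel/mm/transparent_hugepage, the sizes present would depend on the architecture, and the writes would require root.

```shell
root=$(mktemp -d)   # mock of /sys/kernel/mm/transparent_hugepage
for kb in 64 128 256 512 1024 2048; do   # sizes here are illustrative
    mkdir -p "$root/hugepages-${kb}kB"
done
for d in "$root"/hugepages-*kB; do       # same glob works on the real path
    echo inherit > "$d/shmem_enabled"
done
cat "$root/hugepages-2048kB/shmem_enabled"
```

Globbing over ``hugepages-*kB`` is convenient precisely because the set of supported sizes varies across kernels and architectures.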
The global knob has the same semantics as the ``huge=`` mount options
for tmpfs mounts, except that the different huge page sizes can be controlled
individually, and will only use the setting of the global knob when the
per-size knob is set to 'inherit'.

The options 'force' and 'deny' are dropped for the individual sizes, which
are rather testing artifacts from the old ages.

always
    Attempt to allocate <size> huge pages every time we need a new page;

inherit
    Inherit the top-level "shmem_enabled" value. By default, PMD-sized
    hugepages have enabled="inherit" and all other hugepage sizes have
    enabled="never";

never
    Do not allocate <size> huge pages. Note that ``madvise(...,
    MADV_COLLAPSE)`` can still cause transparent huge pages to be obtained
    even if this mode is specified everywhere;


within_size
    Only allocate <size> huge page if it will be fully within i_size.
    Also respect madvise() hints;

advise
    Only allocate <size> huge pages if requested with madvise();
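
As an illustration (a sketch only; the set of ``hugepages-<size>kB``
directories depends on the CPU architecture and kernel, and writing these
files requires root), the per-size shmem policy can be inspected and changed
from the shell:

```shell
# List the current shmem_enabled policy for every supported mTHP size.
for f in /sys/kernel/mm/transparent_hugepage/hugepages-*kB/shmem_enabled; do
    printf '%s: %s\n' "$f" "$(cat "$f")"
done

# Example: allow 64kB shmem huge pages only when fully within i_size
# (assumes a hugepages-64kB directory exists on this system).
echo within_size > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/shmem_enabled
```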

Need of application restart
===========================

The transparent_hugepage/enabled and
transparent_hugepage/hugepages-<size>kB/enabled values and tmpfs mount
option only affect future behavior. So to make them effective you need
to restart any application that could have been using hugepages. This
also applies to the regions registered in khugepaged.

Monitoring usage
================

The number of PMD-sized anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
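
For example, a small helper can pull out just that field (a sketch; the field
name is as documented above, the helper name and file argument are ours and
exist mainly to make the snippet easy to test):

```shell
# Print the AnonHugePages value (in kB) from a meminfo-format file.
# Defaults to /proc/meminfo; a file argument eases testing.
anon_huge_kb() {
    awk '$1 == "AnonHugePages:" { print $2 }' "${1:-/proc/meminfo}"
}

# Demonstrate on a small meminfo-style sample (illustrative values):
printf 'MemTotal:       16384 kB\nAnonHugePages:   2048 kB\n' > /tmp/meminfo.sample
anon_huge_kb /tmp/meminfo.sample   # prints: 2048
```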

To identify what applications are using PMD-sized anonymous transparent huge
pages, it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages
fields for each mapping. (Note that AnonHugePages only applies to traditional
PMD-sized THP for historical reasons and should have been called
AnonHugePmdMapped).

The number of file transparent huge pages mapped to userspace is available
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
To identify what applications are mapping file transparent huge pages, it
is necessary to read ``/proc/PID/smaps`` and count the FilePmdMapped fields
for each mapping.

Note that reading the smaps file is expensive and reading it
frequently will incur overhead.
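
The per-mapping counting described above can be scripted in one pass, which
keeps the expensive smaps read down to a single scan (a sketch; the field
names are the ones documented here, the helper name is ours):

```shell
# Sum a given smaps field (e.g. AnonHugePages, FilePmdMapped) across all
# mappings in a smaps-format file; prints the total in kB.
sum_smaps_field() {
    awk -v f="$1:" '$1 == f { kb += $2 } END { print kb + 0 }' "$2"
}

# Demonstrate on a small smaps-style sample (illustrative values);
# against a live process you would pass /proc/<PID>/smaps instead.
printf 'AnonHugePages:  2048 kB\nAnonHugePages:  0 kB\nAnonHugePages:  4096 kB\n' > /tmp/smaps.sample
sum_smaps_field AnonHugePages /tmp/smaps.sample   # prints: 6144
```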

There are a number of counters in ``/proc/vmstat`` that may be used to
monitor how successfully the system is providing huge pages for use.

thp_fault_alloc
    is incremented every time a huge page is successfully
    allocated and charged to handle a page fault.

thp_collapse_alloc
    is incremented by khugepaged when it has found
    a range of pages to collapse into one huge page and has
    successfully allocated a new huge page to store the data.

thp_fault_fallback
    is incremented if a page fault fails to allocate or charge
    a huge page and instead falls back to using small pages.

thp_fault_fallback_charge
    is incremented if a page fault fails to charge a huge page and
    instead falls back to using small pages even though the
    allocation was successful.

thp_collapse_alloc_failed
    is incremented if khugepaged found a range
    of pages that should be collapsed into one huge page but failed
    the allocation.

thp_file_alloc
    is incremented every time a shmem huge page is successfully
    allocated (Note that despite being named after "file", the counter
    measures only shmem).

thp_file_fallback
    is incremented if an attempt to allocate a shmem huge page
    fails and small pages are used instead. (Note that
    despite being named after "file", the counter measures only shmem).

thp_file_fallback_charge
    is incremented if a shmem huge page cannot be charged and instead
    falls back to using small pages even though the allocation was
    successful. (Note that despite being named after "file", the
    counter measures only shmem).

thp_file_mapped
    is incremented every time a file or shmem huge page is mapped into
    user address space.

thp_split_page
    is incremented every time a huge page is split into base
    pages. This can happen for a variety of reasons but a common
    reason is that a huge page is old and is being reclaimed.
    This action implies splitting all PMDs the page is mapped with.

thp_split_page_failed
    is incremented if the kernel fails to split a huge
    page. This can happen if the page was pinned by somebody.

thp_deferred_split_page
    is incremented when a huge page is put onto the split
    queue. This happens when a huge page is partially unmapped and
    splitting it would free up some memory. Pages on the split queue
    are going to be split under memory pressure.

thp_underused_split_page
    is incremented when a huge page on the split queue was split
    because it was underused. A THP is underused if the number of
    zero pages in the THP is above a certain threshold
    (/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none).

thp_split_pmd
    is incremented every time a PMD is split into a table of PTEs.
    This can happen, for instance, when an application calls mprotect()
    or munmap() on part of a huge page. It doesn't split the huge page,
    only the page table entry.

thp_zero_page_alloc
    is incremented every time a huge zero page used for thp is
    successfully allocated. Note, it doesn't count every map of
    the huge zero page, only its allocation.

thp_zero_page_alloc_failed
    is incremented if the kernel fails to allocate a
    huge zero page and falls back to using small pages.

thp_swpout
    is incremented every time a huge page is swapped out in one
    piece without splitting.

thp_swpout_fallback
    is incremented if a huge page has to be split before swapout,
    usually because the kernel failed to allocate some contiguous
    swap space for the huge page.
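
As one way to put these counters to use, the anonymous-fault fallback rate
can be derived from ``thp_fault_alloc`` and ``thp_fault_fallback`` (a sketch;
the counter names come from this document, the percentage is our own derived
metric and the helper name is ours):

```shell
# Compute the percentage of THP page faults that fell back to small pages,
# from a vmstat-format file (defaults to /proc/vmstat).
thp_fallback_pct() {
    awk '$1 == "thp_fault_alloc"    { a = $2 }
         $1 == "thp_fault_fallback" { f = $2 }
         END { if (a + f > 0) printf "%.1f\n", 100 * f / (a + f); else print "0.0" }' \
        "${1:-/proc/vmstat}"
}

# Demonstrate on a small vmstat-style sample (illustrative values):
printf 'thp_fault_alloc 900\nthp_fault_fallback 100\n' > /tmp/vmstat.sample
thp_fallback_pct /tmp/vmstat.sample   # prints: 10.0
```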

In /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/stats, there are
also individual counters for each huge page size, which can be utilized to
monitor the system's effectiveness in providing huge pages for use. Each
counter has its own corresponding file.
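
For example, one such per-size counter can be read across all sizes at once
(a sketch; the ``stats`` layout is as described above, while the helper name
and the optional base-directory argument are ours, the latter only to make
the snippet testable):

```shell
# Print "<size>: <value>" for one per-size stats counter, e.g. anon_fault_alloc.
show_mthp_stat() {
    local stat=$1 base=${2:-/sys/kernel/mm/transparent_hugepage}
    local d
    for d in "$base"/hugepages-*kB; do
        [ -r "$d/stats/$stat" ] && printf '%s: %s\n' "${d##*/}" "$(cat "$d/stats/$stat")"
    done
}

# Demonstrate against a mock sysfs tree; on a real system you would simply
# run: show_mthp_stat anon_fault_alloc
mkdir -p /tmp/thp/hugepages-64kB/stats /tmp/thp/hugepages-2048kB/stats
echo 5 > /tmp/thp/hugepages-64kB/stats/anon_fault_alloc
echo 7 > /tmp/thp/hugepages-2048kB/stats/anon_fault_alloc
show_mthp_stat anon_fault_alloc /tmp/thp
```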

anon_fault_alloc
    is incremented every time a huge page is successfully
    allocated and charged to handle a page fault.

anon_fault_fallback
    is incremented if a page fault fails to allocate or charge
    a huge page and instead falls back to using huge pages with
    lower orders or small pages.

anon_fault_fallback_charge
    is incremented if a page fault fails to charge a huge page and
    instead falls back to using huge pages with lower orders or
    small pages even though the allocation was successful.

zswpout
    is incremented every time a huge page is swapped out to zswap in one
    piece without splitting.

swpin
    is incremented every time a huge page is swapped in from a non-zswap
    swap device in one piece.

swpin_fallback
    is incremented if swapin fails to allocate or charge a huge page
    and instead falls back to using huge pages with lower orders or
    small pages.

swpin_fallback_charge
    is incremented if swapin fails to charge a huge page and instead
    falls back to using huge pages with lower orders or small pages
    even though the allocation was successful.

swpout
    is incremented every time a huge page is swapped out to a non-zswap
    swap device in one piece without splitting.

swpout_fallback
    is incremented if a huge page has to be split before swapout,
    usually because the kernel failed to allocate some contiguous
    swap space for the huge page.
|
|
|
|
|
shmem_alloc
        is incremented every time a shmem huge page is successfully
        allocated.

shmem_fallback
        is incremented if a shmem huge page is attempted to be allocated
        but fails and instead falls back to using small pages.

shmem_fallback_charge
        is incremented if a shmem huge page cannot be charged and instead
        falls back to using small pages even though the allocation was
        successful.

split
        is incremented every time a huge page is successfully split into
        smaller orders. This can happen for a variety of reasons but a
        common reason is that a huge page is old and is being reclaimed.

split_failed
        is incremented if the kernel fails to split a huge page. This
        can happen if the page was pinned by somebody.

split_deferred
        is incremented when a huge page is put onto the split queue.
        This happens when a huge page is partially unmapped and splitting
        it would free up some memory. Pages on the split queue are going
        to be split under memory pressure, if splitting is possible.

nr_anon
        the number of anonymous THPs we have in the whole system. These
        THPs might be currently entirely mapped or have partially
        unmapped/unused subpages.

nr_anon_partially_mapped
        the number of anonymous THPs which are likely partially mapped,
        possibly wasting memory, and have been queued for deferred memory
        reclamation. Note that in some corner cases (e.g., failed
        migration), we might detect an anonymous THP as "partially
        mapped" and count it here, even though it is not actually
        partially mapped anymore.

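Each of these per-size counters is exposed as a plain decimal file under
the per-size sysfs directory. A minimal C sketch of reading one of them
follows; the ``hugepages-2048kB`` path is only an example (available
sizes depend on the architecture and kernel), and the ``read_counter``
helper is an illustrative name, not a kernel API::

    #include <stdio.h>

    /* Read a single decimal counter from a sysfs-style file.
     * Returns 0 on success, -1 if the file is missing or unreadable. */
    static int read_counter(const char *path, unsigned long long *val)
    {
            FILE *f = fopen(path, "r");
            int ok;

            if (!f)
                    return -1;
            ok = fscanf(f, "%llu", val) == 1;
            fclose(f);
            return ok ? 0 : -1;
    }

    int main(void)
    {
            /* Per-size stat files live under
             * /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/stats/ */
            const char *path = "/sys/kernel/mm/transparent_hugepage/"
                               "hugepages-2048kB/stats/anon_fault_fallback";
            unsigned long long v;

            if (read_counter(path, &v) == 0)
                    printf("anon_fault_fallback (2M): %llu\n", v);
            else
                    printf("counter not available on this kernel\n");
            return 0;
    }

The same helper works for any of the counters above by changing the
final path component.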
As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
monitor this overhead.

compact_stall
        is incremented every time a process stalls to run
        memory compaction so that a huge page is free for use.

compact_success
        is incremented if the system compacted memory and
        freed a huge page for use.

compact_fail
        is incremented if the system tries to compact memory
        but failed.

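The compaction counters are plain ``name value`` lines in
``/proc/vmstat``. A small C sketch that extracts them is shown below;
the ``vmstat_value`` helper is an illustrative name, and the counters
print as -1 when the kernel was built without compaction support::

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    /* Find "key value" at the start of a line in /proc/vmstat-style
     * text. Returns -1 if the key is absent. */
    static long long vmstat_value(const char *text, const char *key)
    {
            size_t klen = strlen(key);
            const char *p = text;

            while (p) {
                    if (strncmp(p, key, klen) == 0 && p[klen] == ' ')
                            return strtoll(p + klen + 1, NULL, 10);
                    p = strchr(p, '\n');
                    if (p)
                            p++;
            }
            return -1;
    }

    int main(void)
    {
            static char buf[65536];
            size_t n = 0;
            FILE *f = fopen("/proc/vmstat", "r");
            const char *keys[] = { "compact_stall", "compact_success",
                                   "compact_fail" };
            int i;

            if (f) {
                    n = fread(buf, 1, sizeof(buf) - 1, f);
                    fclose(f);
            }
            buf[n] = '\0';
            for (i = 0; i < 3; i++)
                    printf("%s = %lld\n", keys[i], vmstat_value(buf, keys[i]));
            return 0;
    }

Sampling these values before and after a workload run shows how much
compaction the workload triggered.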
It is possible to establish how long the stalls were using the function
tracer to record how long was spent in __alloc_pages() and
using the mm_page_alloc tracepoint to identify which allocations were
for huge pages.

Optimizing the applications
===========================

To be guaranteed that the kernel will map a THP immediately in any
memory region, the mmap region has to be hugepage naturally
aligned. posix_memalign() can provide that guarantee.

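A minimal sketch of such an allocation, assuming the 2M huge page size
from the note at the top of this document (the actual size varies by
architecture)::

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define HPAGE_SIZE (2UL * 1024 * 1024) /* assumes 2M huge pages */

    int main(void)
    {
            void *buf;
            size_t len = 8 * HPAGE_SIZE;

            /* A hugepage-aligned region lets the kernel back it with
             * THPs starting from the first byte. */
            if (posix_memalign(&buf, HPAGE_SIZE, len) != 0) {
                    perror("posix_memalign");
                    return 1;
            }

            /* Optional: explicitly request THP for this range; this is
             * what "madvise" mode keys off, and it is harmless when THP
             * is set to "always". */
            if (madvise(buf, len, MADV_HUGEPAGE) != 0)
                    perror("madvise"); /* non-fatal: THP may be disabled */

            memset(buf, 0, len); /* touching the range triggers the faults */
            printf("aligned: %s\n",
                   ((uintptr_t)buf % HPAGE_SIZE) == 0 ? "yes" : "no");
            free(buf);
            return 0;
    }

Without the alignment, only the hugepage-aligned middle portion of the
mapping is eligible for THP backing.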
Hugetlbfs
=========

You can use hugetlbfs on a kernel that has transparent hugepage
support enabled just fine as always. No difference can be noted in
hugetlbfs other than there will be less overall fragmentation. All
usual features belonging to hugetlbfs are preserved and
unaffected. libhugetlbfs will also work fine as usual.