mirror of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2025-08-05 16:54:27 +00:00

Merge tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
   creating a pte which addresses the first page in a folio and reduces
   the amount of plumbing which architectures must implement to provide
   this.

 - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
   largely unrelated folio infrastructure changes which clean things up
   and better prepare us for future work.

 - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
   Price adds early-init code to prevent x86 from leaving physical
   memory unused when physical address regions are not aligned to memory
   block size.

 - "mm/compaction: allow more aggressive proactive compaction" from
   Michal Clapinski provides some tuning of the (sadly, hard-coded (more
   sadly, not auto-tuned)) thresholds for our invocation of proactive
   compaction. In a simple test case, the reduction of a guest VM's
   memory consumption was dramatic.

 - "Minor cleanups and improvements to swap freeing code" from Kemeng
   Shi provides some code cleanups and a small efficiency improvement to
   this part of our swap handling code.

 - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
   adds the ability for a ptracer to modify syscall arguments. At this
   time we can alter only "system call information that are used by
   strace system call tampering, namely, syscall number, syscall
   arguments, and syscall return value".

   This series should have been incorporated into mm.git's "non-MM"
   branch, but I goofed.

 - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
   Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
   against /proc/pid/pagemap. This permits CRIU to more efficiently get
   at the info about guard regions.

 - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
   implements that fix. No runtime effect is expected because
   validate_page_before_insert() happens to fix up this error.

 - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
   Hildenbrand basically brings uprobe text poking into the current
   decade. Remove a bunch of hand-rolled implementation in favor of
   using more current facilities.

 - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
   Khandual provides enhancements and generalizations to the pte dumping
   code. This might be needed when 128-bit Page Table Descriptors are
   enabled for ARM.

 - "Always call constructor for kernel page tables" from Kevin Brodsky
   ensures that the ctor/dtor is always called for kernel pgtables, as
   it already is for user pgtables.

   This permits the addition of more functionality such as "insert hooks
   to protect page tables". This change does result in various
   architectures performing unnecessary work, but this is fixed up where
   it is anticipated to occur.

 - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
   Ryhl adds plumbing to permit Rust access to core MM structures.

 - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
   Stoakes takes advantage of some VMA merging opportunities which we've
   been missing for 15 years.

 - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
   SeongJae Park optimizes process_madvise()'s TLB flushing.

   Instead of flushing each address range in the provided iovec, we
   batch the flushing across all the iovec entries. The syscall's cost
   was approximately halved with a microbenchmark which was designed to
   load this particular operation.

 - "Track node vacancy to reduce worst case allocation counts" from
   Sidhartha Kumar makes the maple tree smarter about its node
   preallocation.

   stress-ng mmap performance increased by single-digit percentages and
   the amount of unnecessarily preallocated memory was dramatically
   reduced.

 - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
   a few unnecessary things which Baoquan noted when reading the code.

 - "Enhance sysfs handling for memory hotplug in weighted interleave"
   from Rakie Kim "enhances the weighted interleave policy in the memory
   management subsystem by improving sysfs handling, fixing memory
   leaks, and introducing dynamic sysfs updates for memory hotplug
   support". Fixes things on error paths which we are unlikely to hit.

 - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
   from SeongJae Park introduces new DAMOS quota goal metrics which
   eliminate the manual tuning which is required when utilizing DAMON
   for memory tiering.

 - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
   provides cleanups and small efficiency improvements which Baoquan
   found via code inspection.

 - "vmscan: enforce mems_effective during demotion" from Gregory Price
   changes reclaim to respect cpuset.mems_effective during demotion when
   possible, because presently, reclaim explicitly ignores
   cpuset.mems_effective when demoting, which may cause the cpuset
   settings to be violated.

   This is useful for isolating workloads on a multi-tenant system from
   certain classes of memory more consistently.

 - "Clean up split_huge_pmd_locked() and remove unnecessary folio
   pointers" from Gavin Guo provides minor cleanups and efficiency gains
   in the huge page splitting and migrating code.

 - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
   for `struct mem_cgroup', yielding improved memory utilization.

 - "add max arg to swappiness in memory.reclaim and lru_gen" from
   Zhongkun He adds a new "max" argument to the "swappiness=" argument
   for memory.reclaim and MGLRU's lru_gen.

   This directs proactive reclaim to reclaim from only anon folios
   rather than file-backed folios.

 - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
   first step on the path to permitting the kernel to maintain existing
   VMs while replacing the host kernel via file-based kexec. At this
   time only memblock's reserve_mem is preserved.

 - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
   and uses a smarter way of looping over a pfn range, by skipping
   ranges of invalid pfns.

 - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
   cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
   when a task is pinned to a single NUMA node.

   Dramatic performance benefits were seen in some real world cases.

 - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
   Garg addresses a warning which occurs during memory compaction when
   using JFS.

 - "move all VMA allocation, freeing and duplication logic to mm" from
   Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
   appropriate mm/vma.c.

 - "mm, swap: clean up swap cache mapping helper" from Kairui Song
   provides code consolidation and cleanups related to the folio_index()
   function.

 - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.

 - "memcg: Fix test_memcg_min/low test failures" from Waiman Long
   addresses some bogus failures which are being reported by the
   test_memcontrol selftest.

 - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
   Stoakes commences the deprecation of file_operations.mmap() in favor
   of the new file_operations.mmap_prepare().

   The latter is more restrictive and prevents drivers from messing with
   things in ways which, amongst other problems, may defeat VMA merging.

 - "memcg: decouple memcg and objcg stocks" from Shakeel Butt decouples
   the per-cpu memcg charge cache from the objcg's one.

   This is a step along the way to making memcg and objcg charging
   NMI-safe, which is a BPF requirement.

 - "mm/damon: minor fixups and improvements for code, tests, and
   documents" from SeongJae Park is yet another batch of miscellaneous
   DAMON changes. Fix and improve minor problems in code, tests and
   documents.

 - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
   stats to be irq safe. Another step along the way to making memcg
   charging and stats updates NMI-safe, a BPF requirement.

 - "Let unmap_hugepage_range() and several related functions take folio
   instead of page" from Fan Ni provides folio conversions in the
   hugetlb code.

* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
  mm: pcp: increase pcp->free_count threshold to trigger free_high
  mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
  mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
  mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
  mm/hugetlb: pass folio instead of page to unmap_ref_private()
  memcg: objcg stock trylock without irq disabling
  memcg: no stock lock for cpu hot-unplug
  memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
  memcg: make count_memcg_events re-entrant safe against irqs
  memcg: make mod_memcg_state re-entrant safe against irqs
  memcg: move preempt disable to callers of memcg_rstat_updated
  memcg: memcg_rstat_updated re-entrant safe against irqs
  mm: khugepaged: decouple SHMEM and file folios' collapse
  selftests/eventfd: correct test name and improve messages
  alloc_tag: check mem_profiling_support in alloc_tag_init
  Docs/damon: update titles and brief introductions to explain DAMOS
  selftests/damon/_damon_sysfs: read tried regions directories in order
  mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
  mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
  mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
  ...
868 lines
24 KiB
C
// SPDX-License-Identifier: GPL-2.0
/*
 * linux/mm/swap_state.c
 *
 * Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
 * Swap reorganised 29.12.95, Stephen Tweedie
 *
 * Rewritten to use page cache, (C) 1998 Stephen Tweedie
 */
#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/kernel_stat.h>
#include <linux/mempolicy.h>
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/init.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/backing-dev.h>
#include <linux/blkdev.h>
#include <linux/migrate.h>
#include <linux/vmalloc.h>
#include <linux/huge_mm.h>
#include <linux/shmem_fs.h>
#include "internal.h"
#include "swap.h"

/*
 * swapper_space is a fiction, retained to simplify the path through
 * vmscan's shrink_folio_list.
 */
static const struct address_space_operations swap_aops = {
	.dirty_folio = noop_dirty_folio,
#ifdef CONFIG_MIGRATION
	.migrate_folio = migrate_folio,
#endif
};

struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
static bool enable_vma_readahead __read_mostly = true;

#define SWAP_RA_ORDER_CEILING 5

#define SWAP_RA_WIN_SHIFT (PAGE_SHIFT / 2)
#define SWAP_RA_HITS_MASK ((1UL << SWAP_RA_WIN_SHIFT) - 1)
#define SWAP_RA_HITS_MAX SWAP_RA_HITS_MASK
#define SWAP_RA_WIN_MASK (~PAGE_MASK & ~SWAP_RA_HITS_MASK)

#define SWAP_RA_HITS(v) ((v) & SWAP_RA_HITS_MASK)
#define SWAP_RA_WIN(v) (((v) & SWAP_RA_WIN_MASK) >> SWAP_RA_WIN_SHIFT)
#define SWAP_RA_ADDR(v) ((v) & PAGE_MASK)

#define SWAP_RA_VAL(addr, win, hits) \
	(((addr) & PAGE_MASK) | \
	 (((win) << SWAP_RA_WIN_SHIFT) & SWAP_RA_WIN_MASK) | \
	 ((hits) & SWAP_RA_HITS_MASK))

/* Initial readahead hits is 4 to start up with a small window */
#define GET_SWAP_RA_VAL(vma) \
	(atomic_long_read(&(vma)->swap_readahead_info) ? : 4)

static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);

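/*
 * Editorial sketch, not part of swap_state.c: the SWAP_RA_* macros above
 * pack a page-aligned address, a readahead window and a hit counter into a
 * single unsigned long. The user-space code below mirrors that layout so
 * the bit arithmetic can be checked in isolation. It assumes 4 KiB pages
 * (PAGE_SHIFT == 12), which is an assumption and not something this file
 * states, and every DEMO_* and demo_* name is hypothetical.
 */

```c
/* Assumption: 4 KiB pages; the kernel's PAGE_SHIFT is per-architecture. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_MASK (~((1UL << DEMO_PAGE_SHIFT) - 1))

/* Same derivations as SWAP_RA_WIN_SHIFT, SWAP_RA_HITS_MASK, SWAP_RA_WIN_MASK. */
#define DEMO_RA_WIN_SHIFT (DEMO_PAGE_SHIFT / 2)
#define DEMO_RA_HITS_MASK ((1UL << DEMO_RA_WIN_SHIFT) - 1)
#define DEMO_RA_WIN_MASK (~DEMO_PAGE_MASK & ~DEMO_RA_HITS_MASK)

/* Pack (addr, win, hits) the way SWAP_RA_VAL() does. */
static inline unsigned long demo_ra_pack(unsigned long addr, unsigned long win,
					 unsigned long hits)
{
	return (addr & DEMO_PAGE_MASK) |
	       ((win << DEMO_RA_WIN_SHIFT) & DEMO_RA_WIN_MASK) |
	       (hits & DEMO_RA_HITS_MASK);
}

/* Page-aligned address, as SWAP_RA_ADDR() extracts it. */
static inline unsigned long demo_ra_addr(unsigned long v)
{
	return v & DEMO_PAGE_MASK;
}

/* Readahead window, as SWAP_RA_WIN() extracts it. */
static inline unsigned long demo_ra_win(unsigned long v)
{
	return (v & DEMO_RA_WIN_MASK) >> DEMO_RA_WIN_SHIFT;
}

/* Hit counter, as SWAP_RA_HITS() extracts it. */
static inline unsigned long demo_ra_hits(unsigned long v)
{
	return v & DEMO_RA_HITS_MASK;
}
```

/*
 * Under these assumptions the low 12 bits split into two 6-bit fields:
 * demo_ra_pack(0x12345678UL, 8, 5) yields 0x12345205, which unpacks to
 * address 0x12345000, window 8 and hits 5. A window or hit count that
 * does not fit in 6 bits is silently masked, not saturated, so callers
 * have to clamp first, as swap_cache_get_folio() does with
 * SWAP_RA_HITS_MAX.
 */
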
void show_swap_cache_info(void)
{
	printk("%lu pages in swap cache\n", total_swapcache_pages());
	printk("Free swap = %ldkB\n", K(get_nr_swap_pages()));
	printk("Total swap = %lukB\n", K(total_swap_pages));
}

void *get_shadow_from_swap_cache(swp_entry_t entry)
{
	struct address_space *address_space = swap_address_space(entry);
	pgoff_t idx = swap_cache_index(entry);
	void *shadow;

	shadow = xa_load(&address_space->i_pages, idx);
	if (xa_is_value(shadow))
		return shadow;
	return NULL;
}

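/*
 * Editorial sketch, not part of swap_state.c: get_shadow_from_swap_cache()
 * above hands back a slot only when xa_is_value() reports a value entry
 * (a workingset shadow) and not a folio pointer. The user-space code
 * below models that distinction with a low-bit tag, which is how the
 * XArray's value entries are generally described. Treat the exact
 * encoding as an assumption; all demo_* names are hypothetical.
 */

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Encode an integer as a tagged "value entry": shift left, set bit 0. */
static inline void *demo_mk_value(unsigned long v)
{
	return (void *)(uintptr_t)((v << 1) | 1);
}

/* Real pointers are at least 2-byte aligned, so bit 0 marks a value. */
static inline bool demo_is_value(const void *entry)
{
	return ((uintptr_t)entry & 1) != 0;
}

/* Recover the integer stored by demo_mk_value(). */
static inline unsigned long demo_to_value(const void *entry)
{
	return (unsigned long)((uintptr_t)entry >> 1);
}

/*
 * Same shape as get_shadow_from_swap_cache(): return the slot contents
 * only if they are a value entry, otherwise NULL (empty slot or pointer).
 */
static inline void *demo_get_shadow(void *slot)
{
	return demo_is_value(slot) ? slot : NULL;
}
```

/*
 * In this sketch an empty slot (NULL) and an ordinary aligned pointer both
 * come back as NULL, while demo_get_shadow(demo_mk_value(42)) returns the
 * tagged entry, from which demo_to_value() recovers 42.
 */
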
/*
 * add_to_swap_cache resembles filemap_add_folio on swapper_space,
 * but sets SwapCache flag and 'swap' instead of mapping and index.
 */
int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
			gfp_t gfp, void **shadowp)
{
	struct address_space *address_space = swap_address_space(entry);
	pgoff_t idx = swap_cache_index(entry);
	XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
	unsigned long i, nr = folio_nr_pages(folio);
	void *old;

	xas_set_update(&xas, workingset_update_node);

	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
	VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
	VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio);

	folio_ref_add(folio, nr);
	folio_set_swapcache(folio);
	folio->swap = entry;

	do {
		xas_lock_irq(&xas);
		xas_create_range(&xas);
		if (xas_error(&xas))
			goto unlock;
		for (i = 0; i < nr; i++) {
			VM_BUG_ON_FOLIO(xas.xa_index != idx + i, folio);
			if (shadowp) {
				old = xas_load(&xas);
				if (xa_is_value(old))
					*shadowp = old;
			}
			xas_store(&xas, folio);
			xas_next(&xas);
		}
		address_space->nrpages += nr;
		__node_stat_mod_folio(folio, NR_FILE_PAGES, nr);
		__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr);
unlock:
		xas_unlock_irq(&xas);
	} while (xas_nomem(&xas, gfp));

	if (!xas_error(&xas))
		return 0;

	folio_clear_swapcache(folio);
	folio_ref_sub(folio, nr);
	return xas_error(&xas);
}

/*
 * This must be called only on folios that have
 * been verified to be in the swap cache.
 */
void __delete_from_swap_cache(struct folio *folio,
			swp_entry_t entry, void *shadow)
{
	struct address_space *address_space = swap_address_space(entry);
	int i;
	long nr = folio_nr_pages(folio);
	pgoff_t idx = swap_cache_index(entry);
	XA_STATE(xas, &address_space->i_pages, idx);

	xas_set_update(&xas, workingset_update_node);

	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
	VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
	VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio);

	for (i = 0; i < nr; i++) {
		void *entry = xas_store(&xas, shadow);
		VM_BUG_ON_PAGE(entry != folio, entry);
		xas_next(&xas);
	}
	folio->swap.val = 0;
	folio_clear_swapcache(folio);
	address_space->nrpages -= nr;
	__node_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
	__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
}

/*
 * This must be called only on folios that have
 * been verified to be in the swap cache and locked.
 * It will never put the folio into the free list,
 * the caller has a reference on the folio.
 */
void delete_from_swap_cache(struct folio *folio)
{
	swp_entry_t entry = folio->swap;
	struct address_space *address_space = swap_address_space(entry);

	xa_lock_irq(&address_space->i_pages);
	__delete_from_swap_cache(folio, entry, NULL);
	xa_unlock_irq(&address_space->i_pages);

	put_swap_folio(folio, entry);
	folio_ref_sub(folio, folio_nr_pages(folio));
}

void clear_shadow_from_swap_cache(int type, unsigned long begin,
				unsigned long end)
{
	unsigned long curr = begin;
	void *old;

	for (;;) {
		swp_entry_t entry = swp_entry(type, curr);
		unsigned long index = curr & SWAP_ADDRESS_SPACE_MASK;
		struct address_space *address_space = swap_address_space(entry);
		XA_STATE(xas, &address_space->i_pages, index);

		xas_set_update(&xas, workingset_update_node);

		xa_lock_irq(&address_space->i_pages);
		xas_for_each(&xas, old, min(index + (end - curr), SWAP_ADDRESS_SPACE_PAGES)) {
			if (!xa_is_value(old))
				continue;
			xas_store(&xas, NULL);
		}
		xa_unlock_irq(&address_space->i_pages);

		/* search the next swapcache until we meet end */
		curr = ALIGN((curr + 1), SWAP_ADDRESS_SPACE_PAGES);
		if (curr > end)
			break;
	}
}

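/*
 * Editorial sketch, not part of swap_state.c: clear_shadow_from_swap_cache()
 * above walks the swap offsets [begin, end] one address space at a time,
 * jumping to the next SWAP_ADDRESS_SPACE_PAGES boundary after each pass.
 * The user-space code below reproduces only that index arithmetic and
 * records which range each pass would scan. The shift of 14 is an
 * assumption (this file does not state it), and the DEMO_* and demo_*
 * names are hypothetical.
 */

```c
#include <stddef.h>

/* Assumption: each swap address space covers 2^14 swap slots. */
#define DEMO_SPACE_SHIFT 14
#define DEMO_SPACE_PAGES (1UL << DEMO_SPACE_SHIFT)
#define DEMO_SPACE_MASK (DEMO_SPACE_PAGES - 1)
/* Round x up to a multiple of a (a must be a power of two). */
#define DEMO_ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

struct demo_chunk {
	unsigned long space;	/* which address space is visited */
	unsigned long first;	/* first index scanned inside it */
	unsigned long last;	/* upper bound passed to the iterator */
};

/*
 * Record up to max chunks in out[] and return how many passes the loop
 * makes in total for the swap offsets begin..end.
 */
static inline size_t demo_walk(unsigned long begin, unsigned long end,
			       struct demo_chunk *out, size_t max)
{
	unsigned long curr = begin;
	size_t n = 0;

	for (;;) {
		unsigned long index = curr & DEMO_SPACE_MASK;
		unsigned long last = index + (end - curr);

		/* Same clamp as the min() in the kernel loop. */
		if (last > DEMO_SPACE_PAGES)
			last = DEMO_SPACE_PAGES;
		if (n < max) {
			out[n].space = curr >> DEMO_SPACE_SHIFT;
			out[n].first = index;
			out[n].last = last;
		}
		n++;

		/* Jump to the start of the next address space. */
		curr = DEMO_ALIGN(curr + 1, DEMO_SPACE_PAGES);
		if (curr > end)
			break;
	}
	return n;
}
```

/*
 * With a shift of 14, demo_walk(16380, 16390, ...) makes two passes:
 * space 0 from index 16380 with bound 16384, then space 1 from index 0
 * with bound 6. The bound is inclusive in xas_for_each(), so the first
 * pass may probe one index past the space, which simply finds nothing.
 */
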
/*
 * If we are the only user, then try to free up the swap cache.
 *
 * It's ok to check the swapcache flag without the folio lock
 * here because we are going to recheck again inside
 * folio_free_swap() _with_ the lock.
 * - Marcelo
 */
void free_swap_cache(struct folio *folio)
{
	if (folio_test_swapcache(folio) && !folio_mapped(folio) &&
	    folio_trylock(folio)) {
		folio_free_swap(folio);
		folio_unlock(folio);
	}
}

/*
 * Freeing a folio and also freeing any swap cache associated with
 * this folio if it is the last user.
 */
void free_folio_and_swap_cache(struct folio *folio)
{
	free_swap_cache(folio);
	if (!is_huge_zero_folio(folio))
		folio_put(folio);
}

/*
 * Passed an array of pages, drop them all from swapcache and then release
 * them. They are removed from the LRU and freed if this is their last use.
 */
void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
{
	struct folio_batch folios;
	unsigned int refs[PAGEVEC_SIZE];

	folio_batch_init(&folios);
	for (int i = 0; i < nr; i++) {
		struct folio *folio = page_folio(encoded_page_ptr(pages[i]));

		free_swap_cache(folio);
		refs[folios.nr] = 1;
		if (unlikely(encoded_page_flags(pages[i]) &
			     ENCODED_PAGE_BIT_NR_PAGES_NEXT))
			refs[folios.nr] = encoded_nr_pages(pages[++i]);

		if (folio_batch_add(&folios, folio) == 0)
			folios_put_refs(&folios, refs);
	}
	if (folios.nr)
		folios_put_refs(&folios, refs);
}

static inline bool swap_use_vma_readahead(void)
{
	return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
}

/*
 * Lookup a swap entry in the swap cache. A found folio will be returned
 * unlocked and with its refcount incremented - we rely on the kernel
 * lock getting page table operations atomic even if we drop the folio
 * lock before returning.
 *
 * Caller must lock the swap device or hold a reference to keep it valid.
 */
struct folio *swap_cache_get_folio(swp_entry_t entry,
		struct vm_area_struct *vma, unsigned long addr)
{
	struct folio *folio;

	folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
	if (!IS_ERR(folio)) {
		bool vma_ra = swap_use_vma_readahead();
		bool readahead;

		/*
		 * At the moment, we don't support PG_readahead for anon THP
		 * so let's bail out rather than confusing the readahead stat.
		 */
		if (unlikely(folio_test_large(folio)))
			return folio;

		readahead = folio_test_clear_readahead(folio);
		if (vma && vma_ra) {
			unsigned long ra_val;
			int win, hits;

			ra_val = GET_SWAP_RA_VAL(vma);
			win = SWAP_RA_WIN(ra_val);
			hits = SWAP_RA_HITS(ra_val);
			if (readahead)
				hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
			atomic_long_set(&vma->swap_readahead_info,
					SWAP_RA_VAL(addr, win, hits));
		}

		if (readahead) {
			count_vm_event(SWAP_RA_HIT);
			if (!vma || !vma_ra)
				atomic_inc(&swapin_readahead_hits);
		}
	} else {
		folio = NULL;
	}

	return folio;
}

/**
 * filemap_get_incore_folio - Find and get a folio from the page or swap caches.
 * @mapping: The address_space to search.
 * @index: The page cache index.
 *
 * This differs from filemap_get_folio() in that it will also look for the
 * folio in the swap cache.
 *
 * Return: The found folio or ERR_PTR(-ENOENT) if it is not found.
 */
struct folio *filemap_get_incore_folio(struct address_space *mapping,
		pgoff_t index)
{
	swp_entry_t swp;
	struct swap_info_struct *si;
	struct folio *folio = filemap_get_entry(mapping, index);

	if (!folio)
		return ERR_PTR(-ENOENT);
	if (!xa_is_value(folio))
		return folio;
	if (!shmem_mapping(mapping))
		return ERR_PTR(-ENOENT);

	swp = radix_to_swp_entry(folio);
	/* There might be swapin error entries in shmem mapping. */
	if (non_swap_entry(swp))
		return ERR_PTR(-ENOENT);
	/* Prevent swapoff from happening to us */
	si = get_swap_device(swp);
	if (!si)
		return ERR_PTR(-ENOENT);
	index = swap_cache_index(swp);
	folio = filemap_get_folio(swap_address_space(swp), index);
	put_swap_device(si);
	return folio;
}

struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
|
|
struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
|
|
bool skip_if_exists)
|
|
{
|
|
struct swap_info_struct *si = swp_swap_info(entry);
|
|
struct folio *folio;
|
|
struct folio *new_folio = NULL;
|
|
struct folio *result = NULL;
|
|
void *shadow = NULL;
|
|
|
|
*new_page_allocated = false;
|
|
for (;;) {
|
|
int err;
|
|
/*
|
|
* First check the swap cache. Since this is normally
|
|
* called after swap_cache_get_folio() failed, re-calling
|
|
* that would confuse statistics.
|
|
*/
|
|
		folio = filemap_get_folio(swap_address_space(entry),
					  swap_cache_index(entry));
		if (!IS_ERR(folio))
			goto got_folio;

		/*
		 * Just skip readahead for an unused swap slot.
		 */
		if (!swap_entry_swapped(si, entry))
			goto put_and_return;

		/*
		 * Get a new folio to read into from swap. Allocate it now if
		 * new_folio does not exist, before marking swap_map
		 * SWAP_HAS_CACHE, when -EEXIST will cause any racers to loop
		 * around until we add it to cache.
		 */
		if (!new_folio) {
			new_folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
			if (!new_folio)
				goto put_and_return;
		}

		/*
		 * Swap entry may have been freed since our caller observed it.
		 */
		err = swapcache_prepare(entry, 1);
		if (!err)
			break;
		else if (err != -EEXIST)
			goto put_and_return;

		/*
		 * Protect against a recursive call to __read_swap_cache_async()
		 * on the same entry waiting forever here because SWAP_HAS_CACHE
		 * is set but the folio is not in the swap cache yet. This can
		 * happen today if mem_cgroup_swapin_charge_folio() below
		 * triggers reclaim through zswap, which may call
		 * __read_swap_cache_async() in the writeback path.
		 */
		if (skip_if_exists)
			goto put_and_return;

		/*
		 * We might race against __delete_from_swap_cache(), and
		 * stumble across a swap_map entry whose SWAP_HAS_CACHE
		 * has not yet been cleared. Or race against another
		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
		 * in swap_map, but not yet added its folio to swap cache.
		 */
		schedule_timeout_uninterruptible(1);
	}

	/*
	 * The swap entry is ours to swap in. Prepare the new folio.
	 */
	__folio_set_locked(new_folio);
	__folio_set_swapbacked(new_folio);

	if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry))
		goto fail_unlock;

	/* May fail (-ENOMEM) if XArray node allocation failed. */
	if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
		goto fail_unlock;

	memcg1_swapin(entry, 1);

	if (shadow)
		workingset_refault(new_folio, shadow);

	/* Caller will initiate read into locked new_folio */
	folio_add_lru(new_folio);
	*new_page_allocated = true;
	folio = new_folio;
got_folio:
	result = folio;
	goto put_and_return;

fail_unlock:
	put_swap_folio(new_folio, entry);
	folio_unlock(new_folio);
put_and_return:
	if (!(*new_page_allocated) && new_folio)
		folio_put(new_folio);
	return result;
}

/*
 * Locate a page of swap in physical memory, reserving swap cache space
 * and reading the disk if it is not already cached.
 * A failure return means that either the page allocation failed or that
 * the swap entry is no longer in use.
 *
 * get/put_swap_device() aren't needed to call this function, because
 * it takes the swap device reference itself and swap_read_folio() holds
 * the swap cache folio lock.
 */
struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
		struct vm_area_struct *vma, unsigned long addr,
		struct swap_iocb **plug)
{
	struct swap_info_struct *si;
	bool page_allocated;
	struct mempolicy *mpol;
	pgoff_t ilx;
	struct folio *folio;

	si = get_swap_device(entry);
	if (!si)
		return NULL;

	mpol = get_vma_policy(vma, addr, 0, &ilx);
	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
					&page_allocated, false);
	mpol_cond_put(mpol);

	if (page_allocated)
		swap_read_folio(folio, plug);

	put_swap_device(si);
	return folio;
}

static unsigned int __swapin_nr_pages(unsigned long prev_offset,
				      unsigned long offset,
				      int hits,
				      int max_pages,
				      int prev_win)
{
	unsigned int pages, last_ra;

	/*
	 * This heuristic has been found to work well on both sequential and
	 * random loads, swapping to hard disk or to SSD: please don't ask
	 * what the "+ 2" means, it just happens to work well, that's all.
	 */
	pages = hits + 2;
	if (pages == 2) {
		/*
		 * We can have no readahead hits to judge by: but must not get
		 * stuck here forever, so check for an adjacent offset instead
		 * (and don't even bother to check whether swap type is same).
		 */
		if (offset != prev_offset + 1 && offset != prev_offset - 1)
			pages = 1;
	} else {
		unsigned int roundup = 4;
		while (roundup < pages)
			roundup <<= 1;
		pages = roundup;
	}

	if (pages > max_pages)
		pages = max_pages;

	/* Don't shrink readahead too fast */
	last_ra = prev_win / 2;
	if (pages < last_ra)
		pages = last_ra;

	return pages;
}

static unsigned long swapin_nr_pages(unsigned long offset)
{
	static unsigned long prev_offset;
	unsigned int hits, pages, max_pages;
	static atomic_t last_readahead_pages;

	max_pages = 1 << READ_ONCE(page_cluster);
	if (max_pages <= 1)
		return 1;

	hits = atomic_xchg(&swapin_readahead_hits, 0);
	pages = __swapin_nr_pages(READ_ONCE(prev_offset), offset, hits,
				  max_pages,
				  atomic_read(&last_readahead_pages));
	if (!hits)
		WRITE_ONCE(prev_offset, offset);
	atomic_set(&last_readahead_pages, pages);

	return pages;
}

/**
 * swap_cluster_readahead - swap in pages in hope we need them soon
 * @entry: swap entry of this memory
 * @gfp_mask: memory allocation flags
 * @mpol: NUMA memory allocation policy to be applied
 * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
 *
 * Returns the struct folio for @entry, after queueing swapin.
 *
 * Primitive swap readahead code. We simply read an aligned block of
 * (1 << page_cluster) entries in the swap area. This method is chosen
 * because it doesn't cost us any seek time. We also make sure to queue
 * the 'original' request together with the readahead ones...
 *
 * Note: it is intentional that the same NUMA policy and interleave index
 * are used for every page of the readahead: neighbouring pages on swap
 * are fairly likely to have been swapped out from the same node.
 */
struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
				    struct mempolicy *mpol, pgoff_t ilx)
{
	struct folio *folio;
	unsigned long entry_offset = swp_offset(entry);
	unsigned long offset = entry_offset;
	unsigned long start_offset, end_offset;
	unsigned long mask;
	struct swap_info_struct *si = swp_swap_info(entry);
	struct blk_plug plug;
	struct swap_iocb *splug = NULL;
	bool page_allocated;

	mask = swapin_nr_pages(offset) - 1;
	if (!mask)
		goto skip;

	/* Read a page_cluster sized and aligned cluster around offset. */
	start_offset = offset & ~mask;
	end_offset = offset | mask;
	if (!start_offset)	/* First page is swap header. */
		start_offset++;
	if (end_offset >= si->max)
		end_offset = si->max - 1;

	blk_start_plug(&plug);
	for (offset = start_offset; offset <= end_offset ; offset++) {
		/* Ok, do the async read-ahead now */
		folio = __read_swap_cache_async(
				swp_entry(swp_type(entry), offset),
				gfp_mask, mpol, ilx, &page_allocated, false);
		if (!folio)
			continue;
		if (page_allocated) {
			swap_read_folio(folio, &splug);
			if (offset != entry_offset) {
				folio_set_readahead(folio);
				count_vm_event(SWAP_RA);
			}
		}
		folio_put(folio);
	}
	blk_finish_plug(&plug);
	swap_read_unplug(splug);
	lru_add_drain();	/* Push any new pages onto the LRU now */
skip:
	/* The page was likely read above, so no need for plugging here */
	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
					&page_allocated, false);
	if (unlikely(page_allocated))
		swap_read_folio(folio, NULL);
	return folio;
}

int init_swap_address_space(unsigned int type, unsigned long nr_pages)
{
	struct address_space *spaces, *space;
	unsigned int i, nr;

	nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
	spaces = kvcalloc(nr, sizeof(struct address_space), GFP_KERNEL);
	if (!spaces)
		return -ENOMEM;
	for (i = 0; i < nr; i++) {
		space = spaces + i;
		xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
		atomic_set(&space->i_mmap_writable, 0);
		space->a_ops = &swap_aops;
		/* swap cache doesn't use writeback related tags */
		mapping_set_no_writeback_tags(space);
	}
	nr_swapper_spaces[type] = nr;
	swapper_spaces[type] = spaces;

	return 0;
}

void exit_swap_address_space(unsigned int type)
{
	int i;
	struct address_space *spaces = swapper_spaces[type];

	for (i = 0; i < nr_swapper_spaces[type]; i++)
		VM_WARN_ON_ONCE(!mapping_empty(&spaces[i]));
	kvfree(spaces);
	nr_swapper_spaces[type] = 0;
	swapper_spaces[type] = NULL;
}

static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
			   unsigned long *end)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long ra_val;
	unsigned long faddr, prev_faddr, left, right;
	unsigned int max_win, hits, prev_win, win;

	max_win = 1 << min(READ_ONCE(page_cluster), SWAP_RA_ORDER_CEILING);
	if (max_win == 1)
		return 1;

	faddr = vmf->address;
	ra_val = GET_SWAP_RA_VAL(vma);
	prev_faddr = SWAP_RA_ADDR(ra_val);
	prev_win = SWAP_RA_WIN(ra_val);
	hits = SWAP_RA_HITS(ra_val);
	win = __swapin_nr_pages(PFN_DOWN(prev_faddr), PFN_DOWN(faddr), hits,
				max_win, prev_win);
	atomic_long_set(&vma->swap_readahead_info, SWAP_RA_VAL(faddr, win, 0));
	if (win == 1)
		return 1;

	if (faddr == prev_faddr + PAGE_SIZE)
		left = faddr;
	else if (prev_faddr == faddr + PAGE_SIZE)
		left = faddr - (win << PAGE_SHIFT) + PAGE_SIZE;
	else
		left = faddr - (((win - 1) / 2) << PAGE_SHIFT);
	right = left + (win << PAGE_SHIFT);
	if ((long)left < 0)
		left = 0;
	*start = max3(left, vma->vm_start, faddr & PMD_MASK);
	*end = min3(right, vma->vm_end, (faddr & PMD_MASK) + PMD_SIZE);

	return win;
}

/**
 * swap_vma_readahead - swap in pages in hope we need them soon
 * @targ_entry: swap entry of the targeted memory
 * @gfp_mask: memory allocation flags
 * @mpol: NUMA memory allocation policy to be applied
 * @targ_ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
 * @vmf: fault information
 *
 * Returns the struct folio for @targ_entry, after queueing swapin.
 *
 * Primitive swap readahead code. We simply read in a few pages whose
 * virtual addresses are around the fault address in the same vma.
 *
 * Caller must hold read mmap_lock if vmf->vma is not NULL.
 */
static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
		struct mempolicy *mpol, pgoff_t targ_ilx, struct vm_fault *vmf)
{
	struct blk_plug plug;
	struct swap_iocb *splug = NULL;
	struct folio *folio;
	pte_t *pte = NULL, pentry;
	int win;
	unsigned long start, end, addr;
	swp_entry_t entry;
	pgoff_t ilx;
	bool page_allocated;

	win = swap_vma_ra_win(vmf, &start, &end);
	if (win == 1)
		goto skip;

	ilx = targ_ilx - PFN_DOWN(vmf->address - start);

	blk_start_plug(&plug);
	for (addr = start; addr < end; ilx++, addr += PAGE_SIZE) {
		if (!pte++) {
			pte = pte_offset_map(vmf->pmd, addr);
			if (!pte)
				break;
		}
		pentry = ptep_get_lockless(pte);
		if (!is_swap_pte(pentry))
			continue;
		entry = pte_to_swp_entry(pentry);
		if (unlikely(non_swap_entry(entry)))
			continue;
		pte_unmap(pte);
		pte = NULL;
		folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
						&page_allocated, false);
		if (!folio)
			continue;
		if (page_allocated) {
			swap_read_folio(folio, &splug);
			if (addr != vmf->address) {
				folio_set_readahead(folio);
				count_vm_event(SWAP_RA);
			}
		}
		folio_put(folio);
	}
	if (pte)
		pte_unmap(pte);
	blk_finish_plug(&plug);
	swap_read_unplug(splug);
	lru_add_drain();
skip:
	/* The folio was likely read above, so no need for plugging here */
	folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
					&page_allocated, false);
	if (unlikely(page_allocated))
		swap_read_folio(folio, NULL);
	return folio;
}

/**
 * swapin_readahead - swap in pages in hope we need them soon
 * @entry: swap entry of this memory
 * @gfp_mask: memory allocation flags
 * @vmf: fault information
 *
 * Returns the struct folio for @entry, after queueing swapin.
 *
 * This is the main entry point for swap readahead. Depending on
 * configuration, it reads ahead either cluster-based (i.e. by physical
 * position on the swap device) or VMA-based (i.e. by virtual address
 * around the faulting address).
 */
struct folio *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
				struct vm_fault *vmf)
{
	struct mempolicy *mpol;
	pgoff_t ilx;
	struct folio *folio;

	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
	folio = swap_use_vma_readahead() ?
		swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
		swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
	mpol_cond_put(mpol);

	return folio;
}

#ifdef CONFIG_SYSFS
static ssize_t vma_ra_enabled_show(struct kobject *kobj,
				     struct kobj_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%s\n", str_true_false(enable_vma_readahead));
}
static ssize_t vma_ra_enabled_store(struct kobject *kobj,
				      struct kobj_attribute *attr,
				      const char *buf, size_t count)
{
	ssize_t ret;

	ret = kstrtobool(buf, &enable_vma_readahead);
	if (ret)
		return ret;

	return count;
}
static struct kobj_attribute vma_ra_enabled_attr = __ATTR_RW(vma_ra_enabled);

static struct attribute *swap_attrs[] = {
	&vma_ra_enabled_attr.attr,
	NULL,
};

static const struct attribute_group swap_attr_group = {
	.attrs = swap_attrs,
};

static int __init swap_init_sysfs(void)
{
	int err;
	struct kobject *swap_kobj;

	swap_kobj = kobject_create_and_add("swap", mm_kobj);
	if (!swap_kobj) {
		pr_err("failed to create swap kobject\n");
		return -ENOMEM;
	}
	err = sysfs_create_group(swap_kobj, &swap_attr_group);
	if (err) {
		pr_err("failed to register swap group\n");
		goto delete_obj;
	}
	return 0;

delete_obj:
	kobject_put(swap_kobj);
	return err;
}
subsys_initcall(swap_init_sysfs);
#endif