=====================
Split page table lock
=====================

Originally, the mm->page_table_lock spinlock protected all page tables of the
mm_struct. But this approach leads to poor page fault scalability of
multi-threaded applications due to high contention on the lock. To improve
scalability, the split page table lock was introduced.

With split page table lock we have a separate per-table lock to serialize
access to the table. At the moment we use split lock for PTE and PMD
tables. Access to higher level tables is protected by mm->page_table_lock.

There are helpers to lock/unlock a table and other accessor functions:

- pte_offset_map_lock()
        maps PTE and takes PTE table lock, returns pointer to PTE with
        pointer to its PTE table lock, or returns NULL if no PTE table;
- pte_offset_map_ro_nolock()
        maps PTE, returns pointer to PTE with pointer to its PTE table
        lock (not taken), or returns NULL if no PTE table;
- pte_offset_map_rw_nolock()
        maps PTE, returns pointer to PTE with pointer to its PTE table
        lock (not taken) and the value of its pmd entry, or returns NULL
        if no PTE table;
- pte_offset_map()
        maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
- pte_unmap()
        unmaps PTE table;
- pte_unmap_unlock()
        unlocks and unmaps PTE table;
- pte_alloc_map_lock()
        allocates PTE table if needed and takes its lock, returns pointer to
        PTE with pointer to its lock, or returns NULL if allocation failed;
- pmd_lock()
        takes PMD table lock, returns pointer to taken lock;
- pmd_lockptr()
        returns pointer to PMD table lock;
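
For example, the typical pattern for inspecting or updating a single PTE
under its split lock looks roughly like this (a minimal sketch: the function
name example_touch_pte() is made up for illustration, and mm, pmd and addr
are assumed to come from the caller's page table walk)::

  static bool example_touch_pte(struct mm_struct *mm, pmd_t *pmd,
                                unsigned long addr)
  {
          spinlock_t *ptl;
          pte_t *pte;

          pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
          if (!pte)
                  return false;           /* no PTE table here */

          /* ... inspect or modify *pte while the table lock is held ... */

          pte_unmap_unlock(pte, ptl);     /* drops the lock and the mapping */
          return true;
  }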

Split page table lock for PTE tables is enabled at compile time if
CONFIG_SPLIT_PTLOCK_CPUS (usually 4) is less than or equal to NR_CPUS.
If split lock is disabled, all tables are guarded by mm->page_table_lock.

Split page table lock for PMD tables is enabled if it's enabled for PTE
tables and the architecture supports it (see below).
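
Historically, these two conditions were spelled out as preprocessor macros
along the following lines; recent kernels express the same logic through
Kconfig, so treat this purely as an illustration of the rule above::

  /* Illustration only -- not the authoritative definition. */
  #define USE_SPLIT_PTE_PTLOCKS   (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
  #define USE_SPLIT_PMD_PTLOCKS   (USE_SPLIT_PTE_PTLOCKS && \
                                   IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))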

Hugetlb and split page table lock
=================================

Hugetlb can support several page sizes. We use split lock only for PMD
level, not for PUD.

Hugetlb-specific helpers:

- huge_pte_lock()
        takes pmd split lock for PMD_SIZE page, mm->page_table_lock
        otherwise;
- huge_pte_lockptr()
        returns pointer to table lock;
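
For example (a minimal sketch: example_walk_huge_pte() is a made-up name,
and h, mm and ptep are assumed to come from the caller)::

  static void example_walk_huge_pte(struct hstate *h, struct mm_struct *mm,
                                    pte_t *ptep)
  {
          spinlock_t *ptl;

          /* PMD split lock for a PMD_SIZE page, mm->page_table_lock otherwise */
          ptl = huge_pte_lock(h, mm, ptep);

          /* ... inspect or modify the hugetlb entry at ptep ... */

          spin_unlock(ptl);
  }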

Support of split page table lock by an architecture
===================================================

There is no need to specially enable PTE split page table lock: everything
required is done by pagetable_pte_ctor() and pagetable_dtor(), which
must be called on PTE table allocation / freeing.
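
As an illustration, an allocation/freeing pair in the style of the
asm-generic helpers could look like this (a sketch only: example_pte_alloc()
and example_pte_free() are made-up names, and the exact constructor /
destructor signatures should be taken from include/asm-generic/pgalloc.h in
your tree)::

  static pgtable_t example_pte_alloc(struct mm_struct *mm)
  {
          struct ptdesc *ptdesc;

          ptdesc = pagetable_alloc(GFP_PGTABLE_USER, 0);
          if (!ptdesc)
                  return NULL;
          if (!pagetable_pte_ctor(ptdesc)) {      /* sets up the split lock */
                  pagetable_free(ptdesc);
                  return NULL;
          }
          return ptdesc_page(ptdesc);
  }

  static void example_pte_free(struct mm_struct *mm, pgtable_t pte_page)
  {
          struct ptdesc *ptdesc = page_ptdesc(pte_page);

          pagetable_dtor(ptdesc);                 /* tears down the split lock */
          pagetable_free(ptdesc);
  }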

Make sure the architecture doesn't use the slab allocator for page table
allocation: slab uses page->slab_cache for its pages.
This field shares storage with page->ptl.

PMD split lock only makes sense if you have more than two page table
levels.

PMD split lock enabling requires a pagetable_pmd_ctor() call on PMD table
allocation and pagetable_dtor() on freeing.

Allocation usually happens in pmd_alloc_one(), freeing in pmd_free() and
pmd_free_tlb(), but make sure you cover all PMD table allocation / freeing
paths: e.g. X86_PAE preallocates a few PMDs in pgd_alloc().
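
A PMD-level sketch in the same spirit, modelled on the generic
pmd_alloc_one() (example_pmd_alloc() is a made-up name; check
include/asm-generic/pgalloc.h for the exact form)::

  static pmd_t *example_pmd_alloc(struct mm_struct *mm, unsigned long addr)
  {
          struct ptdesc *ptdesc;

          ptdesc = pagetable_alloc(GFP_PGTABLE_USER, 0);
          if (!ptdesc)
                  return NULL;
          if (!pagetable_pmd_ctor(ptdesc)) {      /* can fail -- see NOTE below */
                  pagetable_free(ptdesc);
                  return NULL;
          }
          return ptdesc_address(ptdesc);
  }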

With everything in place you can set CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK.

NOTE: pagetable_pte_ctor() and pagetable_pmd_ctor() can fail -- failure
must be handled properly.

page->ptl
=========

page->ptl is used to access the split page table lock, where 'page' is the
struct page of the page containing the table. It shares storage with
page->private (and a few other fields in the union).

To avoid increasing the size of struct page and to get the best performance,
we use a trick:

- if spinlock_t fits into long, we use page->ptl as spinlock, so we
  can avoid indirect access and save a cache line.
- if size of spinlock_t is bigger than size of long, we use page->ptl as
  pointer to spinlock_t and allocate it dynamically. This allows using
  split lock with DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC enabled, but costs
  one more cache line for indirect access;
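
Conceptually the two cases look like this (an illustrative fragment only:
example_ptl_storage is a made-up name, and the real field lives inside the
union in struct page / struct ptdesc in include/linux/mm_types.h)::

  struct example_ptl_storage {
  #if ALLOC_SPLIT_PTLOCKS         /* spinlock_t does not fit into a long */
          spinlock_t *ptl;        /* allocated separately; extra cache line */
  #else
          spinlock_t ptl;         /* embedded directly, no indirection */
  #endif
  };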

The spinlock_t is allocated in pagetable_pte_ctor() for PTE tables and in
pagetable_pmd_ctor() for PMD tables.

Please, never access page->ptl directly -- use the appropriate helper.
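
For example, to work on a PMD table, take its lock through the accessor
rather than through page->ptl (a minimal sketch; example_update_pmd_table()
is a made-up name)::

  static void example_update_pmd_table(struct mm_struct *mm, pmd_t *pmd)
  {
          spinlock_t *ptl = pmd_lock(mm, pmd);    /* finds and takes the lock */

          /* ... modify the PMD table ... */

          spin_unlock(ptl);
  }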