linux/arch/x86/include/asm/pgtable-3level.h

210 lines
6.5 KiB
C
Raw Normal View History

License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 15:07:57 +01:00
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _ASM_X86_PGTABLE_3LEVEL_H
#define _ASM_X86_PGTABLE_3LEVEL_H
/*
* Intel Physical Address Extension (PAE) Mode - three-level page
* tables on PPro+ CPUs.
*
* Copyright (C) 1999 Ingo Molnar <mingo@redhat.com>
*/
#define pte_ERROR(e) \
pr_err("%s:%d: bad pte %p(%08lx%08lx)\n", \
__FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low)
#define pmd_ERROR(e) \
pr_err("%s:%d: bad pmd %p(%016Lx)\n", \
__FILE__, __LINE__, &(e), pmd_val(e))
#define pgd_ERROR(e) \
pr_err("%s:%d: bad pgd %p(%016Lx)\n", \
__FILE__, __LINE__, &(e), pgd_val(e))
#define pxx_xchg64(_pxx, _ptr, _val) ({ \
_pxx##val_t *_p = (_pxx##val_t *)_ptr; \
_pxx##val_t _o = *_p; \
do { } while (!try_cmpxchg64(_p, &_o, (_val))); \
native_make_##_pxx(_o); \
})
/*
* Rules for using set_pte: the pte being assigned *must* be
* either not present or in a state where the hardware will
* not attempt to update the pte. In places where this is
* not possible, use pte_get_and_clear to obtain the old pte
* value and then use set_pte to update it. -ben
*/
static inline void native_set_pte(pte_t *ptep, pte_t pte)
{
WRITE_ONCE(ptep->pte_high, pte.pte_high);
smp_wmb();
WRITE_ONCE(ptep->pte_low, pte.pte_low);
}
static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
{
pxx_xchg64(pte, ptep, native_pte_val(pte));
}
static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
{
pxx_xchg64(pmd, pmdp, native_pmd_val(pmd));
}
static inline void native_set_pud(pud_t *pudp, pud_t pud)
{
#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
pud.p4d.pgd = pti_set_user_pgtbl(&pudp->p4d.pgd, pud.p4d.pgd);
#endif
pxx_xchg64(pud, pudp, native_pud_val(pud));
}
[PATCH] x86/PAE: Fix pte_clear for the >4GB RAM case Proposed fix for ptep_get_and_clear_full PAE bug. Pte_clear had the same bug, so use the same fix for both. Turns out pmd_clear had it as well, but pgds are not affected. The problem is rather intricate. Page table entries in PAE mode are 64-bits wide, but the only atomic 8-byte write operation available in 32-bit mode is cmpxchg8b, which is expensive (at least on P4), and thus avoided. But it can happen that the processor may prefetch entries into the TLB in the middle of an operation which clears a page table entry. So one must always clear the P-bit in the low word of the page table entry first when clearing it. Since the sequence *ptep = __pte(0) leaves the order of the write dependent on the compiler, it must be coded explicitly as a clear of the low word followed by a clear of the high word. Further, there must be a write memory barrier here to enforce proper ordering by the compiler (and, in the future, by the processor as well). On > 4GB memory machines, the implementation of pte_clear for PAE was clearly deficient, as it could leave virtual mappings of physical memory above 4GB aliased to memory below 4GB in the TLB. The implementation of ptep_get_and_clear_full has a similar bug, although not nearly as likely to occur, since the mappings being cleared are in the process of being destroyed, and should never be dereferenced again. But, as luck would have it, it is possible to trigger bugs even without ever dereferencing these bogus TLB mappings, even if the clear is followed fairly soon after with a TLB flush or invalidation. The problem is that memory above 4GB may now be aliased into the first 4GB of memory, and in fact, may hit a region of memory with non-memory semantics. These regions include AGP and PCI space. As such, these memory regions are not cached by the processor. This introduces the bug. The processor can speculate memory operations, including memory writes, as long as they are committed with the proper ordering. Speculating a memory write to a linear address that has a bogus TLB mapping is possible. Normally, the speculation is harmless. But for cached memory, it does leave the falsely speculated cacheline unmodified, but in a dirty state. This cache line will be eventually written back. If this cacheline happens to intersect a region of memory that is not protected by the cache coherency protocol, it can corrupt data in I/O memory, which is generally a very bad thing to do, and can cause total system failure or just plain undefined behavior. These bugs are extremely unlikely, but the severity is of such magnitude, and the fix so simple that I think fixing them immediately is justified. Also, they are nearly impossible to debug. Signed-off-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-27 11:32:29 -07:00
/*
* For PTEs and PDEs, we must clear the P-bit first when clearing a page table
* entry, so clear the bottom half first and enforce ordering with a compiler
* barrier.
*/
static inline void native_pte_clear(struct mm_struct *mm, unsigned long addr,
pte_t *ptep)
[PATCH] x86/PAE: Fix pte_clear for the >4GB RAM case Proposed fix for ptep_get_and_clear_full PAE bug. Pte_clear had the same bug, so use the same fix for both. Turns out pmd_clear had it as well, but pgds are not affected. The problem is rather intricate. Page table entries in PAE mode are 64-bits wide, but the only atomic 8-byte write operation available in 32-bit mode is cmpxchg8b, which is expensive (at least on P4), and thus avoided. But it can happen that the processor may prefetch entries into the TLB in the middle of an operation which clears a page table entry. So one must always clear the P-bit in the low word of the page table entry first when clearing it. Since the sequence *ptep = __pte(0) leaves the order of the write dependent on the compiler, it must be coded explicitly as a clear of the low word followed by a clear of the high word. Further, there must be a write memory barrier here to enforce proper ordering by the compiler (and, in the future, by the processor as well). On > 4GB memory machines, the implementation of pte_clear for PAE was clearly deficient, as it could leave virtual mappings of physical memory above 4GB aliased to memory below 4GB in the TLB. The implementation of ptep_get_and_clear_full has a similar bug, although not nearly as likely to occur, since the mappings being cleared are in the process of being destroyed, and should never be dereferenced again. But, as luck would have it, it is possible to trigger bugs even without ever dereferencing these bogus TLB mappings, even if the clear is followed fairly soon after with a TLB flush or invalidation. The problem is that memory above 4GB may now be aliased into the first 4GB of memory, and in fact, may hit a region of memory with non-memory semantics. These regions include AGP and PCI space. As such, these memory regions are not cached by the processor. This introduces the bug. The processor can speculate memory operations, including memory writes, as long as they are committed with the proper ordering. Speculating a memory write to a linear address that has a bogus TLB mapping is possible. Normally, the speculation is harmless. But for cached memory, it does leave the falsely speculated cacheline unmodified, but in a dirty state. This cache line will be eventually written back. If this cacheline happens to intersect a region of memory that is not protected by the cache coherency protocol, it can corrupt data in I/O memory, which is generally a very bad thing to do, and can cause total system failure or just plain undefined behavior. These bugs are extremely unlikely, but the severity is of such magnitude, and the fix so simple that I think fixing them immediately is justified. Also, they are nearly impossible to debug. Signed-off-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-27 11:32:29 -07:00
{
WRITE_ONCE(ptep->pte_low, 0);
[PATCH] x86/PAE: Fix pte_clear for the >4GB RAM case Proposed fix for ptep_get_and_clear_full PAE bug. Pte_clear had the same bug, so use the same fix for both. Turns out pmd_clear had it as well, but pgds are not affected. The problem is rather intricate. Page table entries in PAE mode are 64-bits wide, but the only atomic 8-byte write operation available in 32-bit mode is cmpxchg8b, which is expensive (at least on P4), and thus avoided. But it can happen that the processor may prefetch entries into the TLB in the middle of an operation which clears a page table entry. So one must always clear the P-bit in the low word of the page table entry first when clearing it. Since the sequence *ptep = __pte(0) leaves the order of the write dependent on the compiler, it must be coded explicitly as a clear of the low word followed by a clear of the high word. Further, there must be a write memory barrier here to enforce proper ordering by the compiler (and, in the future, by the processor as well). On > 4GB memory machines, the implementation of pte_clear for PAE was clearly deficient, as it could leave virtual mappings of physical memory above 4GB aliased to memory below 4GB in the TLB. The implementation of ptep_get_and_clear_full has a similar bug, although not nearly as likely to occur, since the mappings being cleared are in the process of being destroyed, and should never be dereferenced again. But, as luck would have it, it is possible to trigger bugs even without ever dereferencing these bogus TLB mappings, even if the clear is followed fairly soon after with a TLB flush or invalidation. The problem is that memory above 4GB may now be aliased into the first 4GB of memory, and in fact, may hit a region of memory with non-memory semantics. These regions include AGP and PCI space. As such, these memory regions are not cached by the processor. This introduces the bug. The processor can speculate memory operations, including memory writes, as long as they are committed with the proper ordering. Speculating a memory write to a linear address that has a bogus TLB mapping is possible. Normally, the speculation is harmless. But for cached memory, it does leave the falsely speculated cacheline unmodified, but in a dirty state. This cache line will be eventually written back. If this cacheline happens to intersect a region of memory that is not protected by the cache coherency protocol, it can corrupt data in I/O memory, which is generally a very bad thing to do, and can cause total system failure or just plain undefined behavior. These bugs are extremely unlikely, but the severity is of such magnitude, and the fix so simple that I think fixing them immediately is justified. Also, they are nearly impossible to debug. Signed-off-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-27 11:32:29 -07:00
smp_wmb();
WRITE_ONCE(ptep->pte_high, 0);
[PATCH] x86/PAE: Fix pte_clear for the >4GB RAM case Proposed fix for ptep_get_and_clear_full PAE bug. Pte_clear had the same bug, so use the same fix for both. Turns out pmd_clear had it as well, but pgds are not affected. The problem is rather intricate. Page table entries in PAE mode are 64-bits wide, but the only atomic 8-byte write operation available in 32-bit mode is cmpxchg8b, which is expensive (at least on P4), and thus avoided. But it can happen that the processor may prefetch entries into the TLB in the middle of an operation which clears a page table entry. So one must always clear the P-bit in the low word of the page table entry first when clearing it. Since the sequence *ptep = __pte(0) leaves the order of the write dependent on the compiler, it must be coded explicitly as a clear of the low word followed by a clear of the high word. Further, there must be a write memory barrier here to enforce proper ordering by the compiler (and, in the future, by the processor as well). On > 4GB memory machines, the implementation of pte_clear for PAE was clearly deficient, as it could leave virtual mappings of physical memory above 4GB aliased to memory below 4GB in the TLB. The implementation of ptep_get_and_clear_full has a similar bug, although not nearly as likely to occur, since the mappings being cleared are in the process of being destroyed, and should never be dereferenced again. But, as luck would have it, it is possible to trigger bugs even without ever dereferencing these bogus TLB mappings, even if the clear is followed fairly soon after with a TLB flush or invalidation. The problem is that memory above 4GB may now be aliased into the first 4GB of memory, and in fact, may hit a region of memory with non-memory semantics. These regions include AGP and PCI space. As such, these memory regions are not cached by the processor. This introduces the bug. The processor can speculate memory operations, including memory writes, as long as they are committed with the proper ordering. Speculating a memory write to a linear address that has a bogus TLB mapping is possible. Normally, the speculation is harmless. But for cached memory, it does leave the falsely speculated cacheline unmodified, but in a dirty state. This cache line will be eventually written back. If this cacheline happens to intersect a region of memory that is not protected by the cache coherency protocol, it can corrupt data in I/O memory, which is generally a very bad thing to do, and can cause total system failure or just plain undefined behavior. These bugs are extremely unlikely, but the severity is of such magnitude, and the fix so simple that I think fixing them immediately is justified. Also, they are nearly impossible to debug. Signed-off-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-27 11:32:29 -07:00
}
static inline void native_pmd_clear(pmd_t *pmdp)
[PATCH] x86/PAE: Fix pte_clear for the >4GB RAM case Proposed fix for ptep_get_and_clear_full PAE bug. Pte_clear had the same bug, so use the same fix for both. Turns out pmd_clear had it as well, but pgds are not affected. The problem is rather intricate. Page table entries in PAE mode are 64-bits wide, but the only atomic 8-byte write operation available in 32-bit mode is cmpxchg8b, which is expensive (at least on P4), and thus avoided. But it can happen that the processor may prefetch entries into the TLB in the middle of an operation which clears a page table entry. So one must always clear the P-bit in the low word of the page table entry first when clearing it. Since the sequence *ptep = __pte(0) leaves the order of the write dependent on the compiler, it must be coded explicitly as a clear of the low word followed by a clear of the high word. Further, there must be a write memory barrier here to enforce proper ordering by the compiler (and, in the future, by the processor as well). On > 4GB memory machines, the implementation of pte_clear for PAE was clearly deficient, as it could leave virtual mappings of physical memory above 4GB aliased to memory below 4GB in the TLB. The implementation of ptep_get_and_clear_full has a similar bug, although not nearly as likely to occur, since the mappings being cleared are in the process of being destroyed, and should never be dereferenced again. But, as luck would have it, it is possible to trigger bugs even without ever dereferencing these bogus TLB mappings, even if the clear is followed fairly soon after with a TLB flush or invalidation. The problem is that memory above 4GB may now be aliased into the first 4GB of memory, and in fact, may hit a region of memory with non-memory semantics. These regions include AGP and PCI space. As such, these memory regions are not cached by the processor. This introduces the bug. The processor can speculate memory operations, including memory writes, as long as they are committed with the proper ordering. Speculating a memory write to a linear address that has a bogus TLB mapping is possible. Normally, the speculation is harmless. But for cached memory, it does leave the falsely speculated cacheline unmodified, but in a dirty state. This cache line will be eventually written back. If this cacheline happens to intersect a region of memory that is not protected by the cache coherency protocol, it can corrupt data in I/O memory, which is generally a very bad thing to do, and can cause total system failure or just plain undefined behavior. These bugs are extremely unlikely, but the severity is of such magnitude, and the fix so simple that I think fixing them immediately is justified. Also, they are nearly impossible to debug. Signed-off-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-27 11:32:29 -07:00
{
WRITE_ONCE(pmdp->pmd_low, 0);
[PATCH] x86/PAE: Fix pte_clear for the >4GB RAM case Proposed fix for ptep_get_and_clear_full PAE bug. Pte_clear had the same bug, so use the same fix for both. Turns out pmd_clear had it as well, but pgds are not affected. The problem is rather intricate. Page table entries in PAE mode are 64-bits wide, but the only atomic 8-byte write operation available in 32-bit mode is cmpxchg8b, which is expensive (at least on P4), and thus avoided. But it can happen that the processor may prefetch entries into the TLB in the middle of an operation which clears a page table entry. So one must always clear the P-bit in the low word of the page table entry first when clearing it. Since the sequence *ptep = __pte(0) leaves the order of the write dependent on the compiler, it must be coded explicitly as a clear of the low word followed by a clear of the high word. Further, there must be a write memory barrier here to enforce proper ordering by the compiler (and, in the future, by the processor as well). On > 4GB memory machines, the implementation of pte_clear for PAE was clearly deficient, as it could leave virtual mappings of physical memory above 4GB aliased to memory below 4GB in the TLB. The implementation of ptep_get_and_clear_full has a similar bug, although not nearly as likely to occur, since the mappings being cleared are in the process of being destroyed, and should never be dereferenced again. But, as luck would have it, it is possible to trigger bugs even without ever dereferencing these bogus TLB mappings, even if the clear is followed fairly soon after with a TLB flush or invalidation. The problem is that memory above 4GB may now be aliased into the first 4GB of memory, and in fact, may hit a region of memory with non-memory semantics. These regions include AGP and PCI space. As such, these memory regions are not cached by the processor. This introduces the bug. The processor can speculate memory operations, including memory writes, as long as they are committed with the proper ordering. Speculating a memory write to a linear address that has a bogus TLB mapping is possible. Normally, the speculation is harmless. But for cached memory, it does leave the falsely speculated cacheline unmodified, but in a dirty state. This cache line will be eventually written back. If this cacheline happens to intersect a region of memory that is not protected by the cache coherency protocol, it can corrupt data in I/O memory, which is generally a very bad thing to do, and can cause total system failure or just plain undefined behavior. These bugs are extremely unlikely, but the severity is of such magnitude, and the fix so simple that I think fixing them immediately is justified. Also, they are nearly impossible to debug. Signed-off-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-27 11:32:29 -07:00
smp_wmb();
WRITE_ONCE(pmdp->pmd_high, 0);
[PATCH] x86/PAE: Fix pte_clear for the >4GB RAM case Proposed fix for ptep_get_and_clear_full PAE bug. Pte_clear had the same bug, so use the same fix for both. Turns out pmd_clear had it as well, but pgds are not affected. The problem is rather intricate. Page table entries in PAE mode are 64-bits wide, but the only atomic 8-byte write operation available in 32-bit mode is cmpxchg8b, which is expensive (at least on P4), and thus avoided. But it can happen that the processor may prefetch entries into the TLB in the middle of an operation which clears a page table entry. So one must always clear the P-bit in the low word of the page table entry first when clearing it. Since the sequence *ptep = __pte(0) leaves the order of the write dependent on the compiler, it must be coded explicitly as a clear of the low word followed by a clear of the high word. Further, there must be a write memory barrier here to enforce proper ordering by the compiler (and, in the future, by the processor as well). On > 4GB memory machines, the implementation of pte_clear for PAE was clearly deficient, as it could leave virtual mappings of physical memory above 4GB aliased to memory below 4GB in the TLB. The implementation of ptep_get_and_clear_full has a similar bug, although not nearly as likely to occur, since the mappings being cleared are in the process of being destroyed, and should never be dereferenced again. But, as luck would have it, it is possible to trigger bugs even without ever dereferencing these bogus TLB mappings, even if the clear is followed fairly soon after with a TLB flush or invalidation. The problem is that memory above 4GB may now be aliased into the first 4GB of memory, and in fact, may hit a region of memory with non-memory semantics. These regions include AGP and PCI space. As such, these memory regions are not cached by the processor. This introduces the bug. The processor can speculate memory operations, including memory writes, as long as they are committed with the proper ordering. Speculating a memory write to a linear address that has a bogus TLB mapping is possible. Normally, the speculation is harmless. But for cached memory, it does leave the falsely speculated cacheline unmodified, but in a dirty state. This cache line will be eventually written back. If this cacheline happens to intersect a region of memory that is not protected by the cache coherency protocol, it can corrupt data in I/O memory, which is generally a very bad thing to do, and can cause total system failure or just plain undefined behavior. These bugs are extremely unlikely, but the severity is of such magnitude, and the fix so simple that I think fixing them immediately is justified. Also, they are nearly impossible to debug. Signed-off-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-27 11:32:29 -07:00
}
mm, x86: add support for PUD-sized transparent hugepages The current transparent hugepage code only supports PMDs. This patch adds support for transparent use of PUDs with DAX. It does not include support for anonymous pages. x86 support code also added. Most of this patch simply parallels the work that was done for huge PMDs. The only major difference is how the new ->pud_entry method in mm_walk works. The ->pmd_entry method replaces the ->pte_entry method, whereas the ->pud_entry method works along with either ->pmd_entry or ->pte_entry. The pagewalk code takes care of locking the PUD before calling ->pud_walk, so handlers do not need to worry whether the PUD is stable. [dave.jiang@intel.com: fix SMP x86 32bit build for native_pud_clear()] Link: http://lkml.kernel.org/r/148719066814.31111.3239231168815337012.stgit@djiang5-desk3.ch.intel.com [dave.jiang@intel.com: native_pud_clear missing on i386 build] Link: http://lkml.kernel.org/r/148640375195.69754.3315433724330910314.stgit@djiang5-desk3.ch.intel.com Link: http://lkml.kernel.org/r/148545059381.17912.8602162635537598445.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Tested-by: Alexander Kapshuk <alexander.kapshuk@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 14:57:02 -08:00
static inline void native_pud_clear(pud_t *pudp)
{
}
static inline void pud_clear(pud_t *pudp)
{
set_pud(pudp, __pud(0));
/*
* According to Intel App note "TLBs, Paging-Structure Caches,
* and Their Invalidation", April 2007, document 317080-001,
* section 8.1: in PAE mode we explicitly have to flush the
* TLB via cr3 if the top-level pgd is changed...
*
* Currently all places where pud_clear() is called either have
* flush_tlb_mm() followed or don't need TLB flush (x86_64 code or
* pud_clear_bad()), so we don't need TLB flush here.
*/
}
#ifdef CONFIG_SMP
static inline pte_t native_ptep_get_and_clear(pte_t *ptep)
{
return pxx_xchg64(pte, ptep, 0ULL);
}
static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
{
return pxx_xchg64(pmd, pmdp, 0ULL);
}
static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
{
return pxx_xchg64(pud, pudp, 0ULL);
}
#else
#define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)
#define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
#define native_pudp_get_and_clear(xp) native_local_pudp_get_and_clear(xp)
#endif
#ifndef pmdp_establish
#define pmdp_establish pmdp_establish
static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp, pmd_t pmd)
{
pmd_t old;
/*
* If pmd has present bit cleared we can get away without expensive
* cmpxchg64: we can update pmdp half-by-half without racing with
* anybody.
*/
if (!(pmd_val(pmd) & _PAGE_PRESENT)) {
/* xchg acts as a barrier before setting of the high bits */
old.pmd_low = xchg(&pmdp->pmd_low, pmd.pmd_low);
old.pmd_high = READ_ONCE(pmdp->pmd_high);
WRITE_ONCE(pmdp->pmd_high, pmd.pmd_high);
return old;
}
return pxx_xchg64(pmd, pmdp, pmd.pmd);
}
#endif
/*
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
* are !pte_none() && !pte_present().
*
* Format of swap PTEs:
*
* 6 6 6 6 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3
* 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2
* < type -> <---------------------- offset ----------------------
*
* 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
* 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
* --------------------------------------------> 0 E 0 0 0 0 0 0 0
*
* E is the exclusive marker that is not stored in swap entries.
*/
#define SWP_TYPE_BITS 5
#define _SWP_TYPE_MASK ((1U << SWP_TYPE_BITS) - 1)
#define SWP_OFFSET_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
/* We always extract/encode the offset by shifting it all the way up, and then down again */
#define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)
mm/x86: use SWP_TYPE_BITS in 3-level swap macros Patch series "mm: Remember a/d bits for migration entries", v4. Problem ======= When migrating a page, right now we always mark the migrated page as old & clean. However that could lead to at least two problems: (1) We lost the real hot/cold information while we could have persisted. That information shouldn't change even if the backing page is changed after the migration, (2) There can be always extra overhead on the immediate next access to any migrated page, because hardware MMU needs cycles to set the young bit again for reads, and dirty bits for write, as long as the hardware MMU supports these bits. Many of the recent upstream works showed that (2) is not something trivial and actually very measurable. In my test case, reading 1G chunk of memory - jumping in page size intervals - could take 99ms just because of the extra setting on the young bit on a generic x86_64 system, comparing to 4ms if young set. This issue is originally reported by Andrea Arcangeli. Solution ======== To solve this problem, this patchset tries to remember the young/dirty bits in the migration entries and carry them over when recovering the ptes. We have the chance to do so because in many systems the swap offset is not really fully used. Migration entries use swp offset to store PFN only, while the PFN is normally not as large as swp offset and normally smaller. It means we do have some free bits in swp offset that we can use to store things like A/D bits, and that's how this series tried to approach this problem. max_swapfile_size() is used here to detect per-arch offset length in swp entries. We'll automatically remember the A/D bits when we find that we have enough swp offset field to keep both the PFN and the extra bits. Since max_swapfile_size() can be slow, the last two patches cache the results for it and also swap_migration_ad_supported as a whole. Known Issues / TODOs ==================== We still haven't taught madvise() to recognize the new A/D bits in migration entries, namely MADV_COLD/MADV_FREE. E.g. when MADV_COLD upon a migration entry. It's not clear yet on whether we should clear the A bit, or we should just drop the entry directly. We didn't teach idle page tracking on the new migration entries, because it'll need larger rework on the tree on rmap pgtable walk. However it should make it already better because before this patchset page will be old page after migration, so the series will fix potential false negative of idle page tracking when pages were migrated before observing. The other thing is migration A/D bits will not start to working for private device swap entries. The code is there for completeness but since private device swap entries do not yet have fields to store A/D bits, even if we'll persistent A/D across present pte switching to migration entry, we'll lose it again when the migration entry converted to private device swap entry. Tests ===== After the patchset applied, the immediate read access test [1] of above 1G chunk after migration can shrink from 99ms to 4ms. The test is done by moving 1G pages from node 0->1->0 then read it in page size jumps. The test is with Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz. Similar effect can also be measured when writting the memory the 1st time after migration. After applying the patchset, both initial immediate read/write after page migrated will perform similarly like before migration happened. Patch Layout ============ Patch 1-2: Cleanups from either previous versions or on swapops.h macros. Patch 3-4: Prepare for the introduction of migration A/D bits Patch 5: The core patch to remember young/dirty bit in swap offsets. Patch 6-7: Cache relevant fields to make migration_entry_supports_ad() fast. [1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c This patch (of 7): Replace all the magic "5" with the macro. Link: https://lkml.kernel.org/r/20220811161331.37055-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20220811161331.37055-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-11 12:13:25 -04:00
#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
#define __swp_type(x) (((x).val) & _SWP_TYPE_MASK)
mm/x86: use SWP_TYPE_BITS in 3-level swap macros Patch series "mm: Remember a/d bits for migration entries", v4. Problem ======= When migrating a page, right now we always mark the migrated page as old & clean. However that could lead to at least two problems: (1) We lost the real hot/cold information while we could have persisted. That information shouldn't change even if the backing page is changed after the migration, (2) There can be always extra overhead on the immediate next access to any migrated page, because hardware MMU needs cycles to set the young bit again for reads, and dirty bits for write, as long as the hardware MMU supports these bits. Many of the recent upstream works showed that (2) is not something trivial and actually very measurable. In my test case, reading 1G chunk of memory - jumping in page size intervals - could take 99ms just because of the extra setting on the young bit on a generic x86_64 system, comparing to 4ms if young set. This issue is originally reported by Andrea Arcangeli. Solution ======== To solve this problem, this patchset tries to remember the young/dirty bits in the migration entries and carry them over when recovering the ptes. We have the chance to do so because in many systems the swap offset is not really fully used. Migration entries use swp offset to store PFN only, while the PFN is normally not as large as swp offset and normally smaller. It means we do have some free bits in swp offset that we can use to store things like A/D bits, and that's how this series tried to approach this problem. max_swapfile_size() is used here to detect per-arch offset length in swp entries. We'll automatically remember the A/D bits when we find that we have enough swp offset field to keep both the PFN and the extra bits. Since max_swapfile_size() can be slow, the last two patches cache the results for it and also swap_migration_ad_supported as a whole. Known Issues / TODOs ==================== We still haven't taught madvise() to recognize the new A/D bits in migration entries, namely MADV_COLD/MADV_FREE. E.g. when MADV_COLD upon a migration entry. It's not clear yet on whether we should clear the A bit, or we should just drop the entry directly. We didn't teach idle page tracking on the new migration entries, because it'll need larger rework on the tree on rmap pgtable walk. However it should make it already better because before this patchset page will be old page after migration, so the series will fix potential false negative of idle page tracking when pages were migrated before observing. The other thing is migration A/D bits will not start to working for private device swap entries. The code is there for completeness but since private device swap entries do not yet have fields to store A/D bits, even if we'll persistent A/D across present pte switching to migration entry, we'll lose it again when the migration entry converted to private device swap entry. Tests ===== After the patchset applied, the immediate read access test [1] of above 1G chunk after migration can shrink from 99ms to 4ms. The test is done by moving 1G pages from node 0->1->0 then read it in page size jumps. The test is with Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz. Similar effect can also be measured when writting the memory the 1st time after migration. After applying the patchset, both initial immediate read/write after page migrated will perform similarly like before migration happened. Patch Layout ============ Patch 1-2: Cleanups from either previous versions or on swapops.h macros. Patch 3-4: Prepare for the introduction of migration A/D bits Patch 5: The core patch to remember young/dirty bit in swap offsets. Patch 6-7: Cache relevant fields to make migration_entry_supports_ad() fast. [1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c This patch (of 7): Replace all the magic "5" with the macro. Link: https://lkml.kernel.org/r/20220811161331.37055-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20220811161331.37055-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-11 12:13:25 -04:00
#define __swp_offset(x) ((x).val >> SWP_TYPE_BITS)
#define __swp_entry(type, offset) ((swp_entry_t){((type) & _SWP_TYPE_MASK) \
| (offset) << SWP_TYPE_BITS})
/*
* Normally, __swp_entry() converts from arch-independent swp_entry_t to
* arch-dependent swp_entry_t, and __swp_entry_to_pte() just stores the result
* to pte. But here we have 32bit swp_entry_t and 64bit pte, and need to use the
* whole 64 bits. Thus, we shift the "real" arch-dependent conversion to
* __swp_entry_to_pte() through the following helper macro based on 64bit
* __swp_entry().
*/
#define __swp_pteval_entry(type, offset) ((pteval_t) { \
(~(pteval_t)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
| ((pteval_t)(type) << (64 - SWP_TYPE_BITS)) })
#define __swp_entry_to_pte(x) ((pte_t){ .pte = \
__swp_pteval_entry(__swp_type(x), __swp_offset(x)) })
/*
* Analogically, __pte_to_swp_entry() doesn't just extract the arch-dependent
* swp_entry_t, but also has to convert it from 64bit to the 32bit
* intermediate representation, using the following macros based on 64bit
* __swp_type() and __swp_offset().
*/
#define __pteval_swp_type(x) ((unsigned long)((x).pte >> (64 - SWP_TYPE_BITS)))
#define __pteval_swp_offset(x) ((unsigned long)(~((x).pte) << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT))
#define __pte_to_swp_entry(pte) (__swp_entry(__pteval_swp_type(pte), \
__pteval_swp_offset(pte)))
/* We borrow bit 7 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE _PAGE_PSE
x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation When PTEs are set to PROT_NONE the kernel just clears the Present bit and preserves the PFN, which creates attack surface for L1TF speculation speculation attacks. This is important inside guests, because L1TF speculation bypasses physical page remapping. While the host has its own migitations preventing leaking data from other VMs into the guest, this would still risk leaking the wrong page inside the current guest. This uses the same technique as Linus' swap entry patch: while an entry is is in PROTNONE state invert the complete PFN part part of it. This ensures that the the highest bit will point to non existing memory. The invert is done by pte/pmd_modify and pfn/pmd/pud_pte for PROTNONE and pte/pmd/pud_pfn undo it. This assume that no code path touches the PFN part of a PTE directly without using these primitives. This doesn't handle the case that MMIO is on the top of the CPU physical memory. If such an MMIO region was exposed by an unpriviledged driver for mmap it would be possible to attack some real memory. However this situation is all rather unlikely. For 32bit non PAE the inversion is not done because there are really not enough bits to protect anything. Q: Why does the guest need to be protected when the HyperVisor already has L1TF mitigations? A: Here's an example: Physical pages 1 2 get mapped into a guest as GPA 1 -> PA 2 GPA 2 -> PA 1 through EPT. The L1TF speculation ignores the EPT remapping. Now the guest kernel maps GPA 1 to process A and GPA 2 to process B, and they belong to different users and should be isolated. A sets the GPA 1 PA 2 PTE to PROT_NONE to bypass the EPT remapping and gets read access to the underlying physical page. Which in this case points to PA 2, so it can read process B's data, if it happened to be in L1, so isolation inside the guest is broken. There's nothing the hypervisor can do about this. This mitigation has to be done in the guest itself. [ tglx: Massaged changelog ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Dave Hansen <dave.hansen@intel.com>
2018-06-13 15:48:24 -07:00
#include <asm/pgtable-invert.h>
#endif /* _ASM_X86_PGTABLE_3LEVEL_H */