License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied
to a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
The criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source.
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, the file was
considered to have no license information in it, and the top level
COPYING file license was applied.
For non */uapi/* files that summary was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0                                               11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was "GPL-2.0". The results of that were:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL-family license was found in the file or it had no licensing in
it (per the prior point). Results summary:
SPDX license identifier                              # files
----------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note                          270
GPL-2.0+ WITH Linux-syscall-note                         169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)       21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)       17
LGPL-2.1+ WITH Linux-syscall-note                         15
GPL-1.0+ WITH Linux-syscall-note                          14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)       5
LGPL-2.0+ WITH Linux-syscall-note                          4
LGPL-2.1 WITH Linux-syscall-note                           3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)                1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later.
In total, over 70 hours of logged manual review was done on the
spreadsheet by Kate, Philippe and Thomas to determine the SPDX license
identifiers to apply to the source files, with confirmation in some cases
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally, Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types). Finally, Greg ran the script using the .csv files to
generate the patches.
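For illustration, a sketch of the two tag forms as they appear in the
resulting tree (the C++-style comment for .c files is visible in the file
below; the block-comment form for headers follows the usual SPDX
convention for files that may be included from assembly):

    // SPDX-License-Identifier: GPL-2.0        (first line of a .c file)
    /* SPDX-License-Identifier: GPL-2.0 */     (first line of a .h file)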
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
// SPDX-License-Identifier: GPL-2.0
/*
 * Slab allocator functions that are independent of the allocator strategy
 *
 * (C) 2012 Christoph Lameter <cl@gentwo.org>
 */
#include <linux/slab.h>

#include <linux/mm.h>
#include <linux/poison.h>
#include <linux/interrupt.h>
#include <linux/memory.h>
#include <linux/cache.h>
#include <linux/compiler.h>
#include <linux/kfence.h>
#include <linux/module.h>
#include <linux/cpu.h>
#include <linux/uaccess.h>
#include <linux/seq_file.h>
#include <linux/dma-mapping.h>
#include <linux/swiotlb.h>
#include <linux/proc_fs.h>
#include <linux/debugfs.h>
#include <linux/kmemleak.h>
#include <linux/kasan.h>
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
#include <asm/page.h>
#include <linux/memcontrol.h>
#include <linux/stackdepot.h>
#include <trace/events/rcu.h>

#include "../kernel/rcu/rcu.h"
#include "internal.h"
#include "slab.h"

#define CREATE_TRACE_POINTS
#include <trace/events/kmem.h>

enum slab_state slab_state;
LIST_HEAD(slab_caches);
DEFINE_MUTEX(slab_mutex);
struct kmem_cache *kmem_cache;

/*
 * Set of flags that will prevent slab merging
 */
#define SLAB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
                SLAB_TRACE | SLAB_TYPESAFE_BY_RCU | SLAB_NOLEAKTRACE | \
                SLAB_FAILSLAB | SLAB_NO_MERGE)

#define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
                         SLAB_CACHE_DMA32 | SLAB_ACCOUNT)

/*
 * Merge control. If this is set then no merging of slab caches will occur.
 */
static bool slab_nomerge = !IS_ENABLED(CONFIG_SLAB_MERGE_DEFAULT);

static int __init setup_slab_nomerge(char *str)
{
        slab_nomerge = true;
        return 1;
}

static int __init setup_slab_merge(char *str)
{
        slab_nomerge = false;
        return 1;
}

__setup_param("slub_nomerge", slub_nomerge, setup_slab_nomerge, 0);
__setup_param("slub_merge", slub_merge, setup_slab_merge, 0);

__setup("slab_nomerge", setup_slab_nomerge);
__setup("slab_merge", setup_slab_merge);

/*
 * Determine the size of a slab object
 */
unsigned int kmem_cache_size(struct kmem_cache *s)
{
        return s->object_size;
}
EXPORT_SYMBOL(kmem_cache_size);

#ifdef CONFIG_DEBUG_VM

static bool kmem_cache_is_duplicate_name(const char *name)
{
        struct kmem_cache *s;

        list_for_each_entry(s, &slab_caches, list) {
                if (!strcmp(s->name, name))
                        return true;
        }

        return false;
}

static int kmem_cache_sanity_check(const char *name, unsigned int size)
{
        if (!name || in_interrupt() || size > KMALLOC_MAX_SIZE) {
                pr_err("kmem_cache_create(%s) integrity check failed\n", name);
                return -EINVAL;
        }

        /* Duplicate names will confuse slabtop, et al */
        WARN(kmem_cache_is_duplicate_name(name),
                        "kmem_cache of name '%s' already exists\n", name);

        WARN_ON(strchr(name, ' '));     /* It confuses parsers */
        return 0;
}
#else
static inline int kmem_cache_sanity_check(const char *name, unsigned int size)
{
        return 0;
}
#endif

/*
 * Figure out what the alignment of the objects will be given a set of
 * flags, a user specified alignment and the size of the objects.
 */
static unsigned int calculate_alignment(slab_flags_t flags,
                unsigned int align, unsigned int size)
{
        /*
         * If the user wants hardware cache aligned objects then follow that
         * suggestion if the object is sufficiently large.
         *
         * The hardware cache alignment cannot override the specified
         * alignment though. If that is greater then use it.
         */
        if (flags & SLAB_HWCACHE_ALIGN) {
                unsigned int ralign;

                ralign = cache_line_size();
                while (size <= ralign / 2)
                        ralign /= 2;
                align = max(align, ralign);
        }

        align = max(align, arch_slab_minalign());

        return ALIGN(align, sizeof(void *));
}

/*
 * Find a mergeable slab cache
 */
int slab_unmergeable(struct kmem_cache *s)
{
        if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE))
                return 1;

        if (s->ctor)
                return 1;

#ifdef CONFIG_HARDENED_USERCOPY
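        /* Caches with a usercopy whitelist region are never merged. */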
        if (s->usersize)
                return 1;
#endif

        /*
         * We may have set a slab to be unmergeable during bootstrap.
         */
        if (s->refcount < 0)
                return 1;

        return 0;
}

struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
                slab_flags_t flags, const char *name, void (*ctor)(void *))
{
        struct kmem_cache *s;

        if (slab_nomerge)
                return NULL;

        if (ctor)
                return NULL;

        flags = kmem_cache_flags(flags, name);

        if (flags & SLAB_NEVER_MERGE)
                return NULL;

        size = ALIGN(size, sizeof(void *));
        align = calculate_alignment(flags, align, size);
        size = ALIGN(size, align);

        list_for_each_entry_reverse(s, &slab_caches, list) {
                if (slab_unmergeable(s))
                        continue;

                if (size > s->size)
                        continue;

                if ((flags & SLAB_MERGE_SAME) != (s->flags & SLAB_MERGE_SAME))
                        continue;
                /*
                 * Check if alignment is compatible.
                 * Courtesy of Adrian Drzewiecki
                 */
                if ((s->size & ~(align - 1)) != s->size)
                        continue;

                if (s->size - size >= sizeof(void *))
                        continue;

                return s;
        }
        return NULL;
}

static struct kmem_cache *create_cache(const char *name,
                                       unsigned int object_size,
                                       struct kmem_cache_args *args,
                                       slab_flags_t flags)
{
        struct kmem_cache *s;
        int err;

        /* If a custom freelist pointer is requested make sure it's sane. */
        err = -EINVAL;
        if (args->use_freeptr_offset &&
            (args->freeptr_offset >= object_size ||
             !(flags & SLAB_TYPESAFE_BY_RCU) ||
             !IS_ALIGNED(args->freeptr_offset, __alignof__(freeptr_t))))
                goto out;

        err = -ENOMEM;
        s = kmem_cache_zalloc(kmem_cache, GFP_KERNEL);
        if (!s)
                goto out;
        err = do_kmem_cache_create(s, name, object_size, args, flags);
        if (err)
                goto out_free_cache;

        s->refcount = 1;
        list_add(&s->list, &slab_caches);
        return s;

out_free_cache:
        kmem_cache_free(kmem_cache, s);
out:
        return ERR_PTR(err);
}

/**
 * __kmem_cache_create_args - Create a kmem cache.
 * @name: A string which is used in /proc/slabinfo to identify this cache.
 * @object_size: The size of objects to be created in this cache.
 * @args: Additional arguments for the cache creation (see
 *        &struct kmem_cache_args).
 * @flags: See the descriptions of individual flags. The common ones are listed
 *         in the description below.
 *
 * Not to be called directly, use the kmem_cache_create() wrapper with the same
 * parameters.
 *
 * Commonly used @flags:
 *
 * &SLAB_ACCOUNT - Account allocations to memcg.
 *
 * &SLAB_HWCACHE_ALIGN - Align objects on cache line boundaries.
 *
 * &SLAB_RECLAIM_ACCOUNT - Objects are reclaimable.
 *
 * &SLAB_TYPESAFE_BY_RCU - Slab page (not individual objects) freeing delayed
 * by a grace period - see the full description before using.
 *
 * Context: Cannot be called within an interrupt, but can be interrupted.
 *
 * Return: a pointer to the cache on success, NULL on failure.
 */
struct kmem_cache *__kmem_cache_create_args(const char *name,
                                            unsigned int object_size,
                                            struct kmem_cache_args *args,
                                            slab_flags_t flags)
{
        struct kmem_cache *s = NULL;
        const char *cache_name;
        int err;

#ifdef CONFIG_SLUB_DEBUG
        /*
         * If no slab_debug was enabled globally, the static key is not yet
         * enabled by setup_slub_debug(). Enable it if the cache is being
         * created with any of the debugging flags passed explicitly.
         * It's also possible that this is the first cache created with
         * SLAB_STORE_USER and we should init stack_depot for it.
         */
        if (flags & SLAB_DEBUG_FLAGS)
                static_branch_enable(&slub_debug_enabled);
        if (flags & SLAB_STORE_USER)
                stack_depot_init();
#else
        flags &= ~SLAB_DEBUG_FLAGS;
#endif

        mutex_lock(&slab_mutex);

        err = kmem_cache_sanity_check(name, object_size);
        if (err) {
                goto out_unlock;
        }

        if (flags & ~SLAB_FLAGS_PERMITTED) {
                err = -EINVAL;
                goto out_unlock;
        }

        /* Fail closed on bad usersize or useroffset values. */
        if (!IS_ENABLED(CONFIG_HARDENED_USERCOPY) ||
            WARN_ON(!args->usersize && args->useroffset) ||
            WARN_ON(object_size < args->usersize ||
                    object_size - args->usersize < args->useroffset))
                args->usersize = args->useroffset = 0;

        if (!args->usersize)
                s = __kmem_cache_alias(name, object_size, args->align, flags,
                                       args->ctor);
        if (s)
                goto out_unlock;

        cache_name = kstrdup_const(name, GFP_KERNEL);
        if (!cache_name) {
                err = -ENOMEM;
                goto out_unlock;
        }

        args->align = calculate_alignment(flags, args->align, object_size);
        s = create_cache(cache_name, object_size, args, flags);
        if (IS_ERR(s)) {
                err = PTR_ERR(s);
                kfree_const(cache_name);
        }

out_unlock:
        mutex_unlock(&slab_mutex);

        if (err) {
                if (flags & SLAB_PANIC)
                        panic("%s: Failed to create slab '%s'. Error %d\n",
                                __func__, name, err);
                else {
                        pr_warn("%s(%s) failed with error %d\n",
                                __func__, name, err);
                        dump_stack();
                }
                return NULL;
        }
        return s;
}
EXPORT_SYMBOL(__kmem_cache_create_args);

static struct kmem_cache *kmem_buckets_cache __ro_after_init;

/**
 * kmem_buckets_create - Create a set of caches that handle dynamic sized
 *                       allocations via kmem_buckets_alloc()
 * @name: A prefix string which is used in /proc/slabinfo to identify this
 *        cache. The individual caches will have their sizes as the suffix.
 * @flags: SLAB flags (see kmem_cache_create() for details).
 * @useroffset: Starting offset within an allocation that may be copied
 *              to/from userspace.
 * @usersize: How many bytes, starting at @useroffset, may be copied
 *            to/from userspace.
 * @ctor: A constructor for the objects, run when new allocations are made.
 *
 * Cannot be called within an interrupt, but can be interrupted.
 *
 * Return: a pointer to the cache on success, NULL on failure. When
 * CONFIG_SLAB_BUCKETS is not enabled, ZERO_SIZE_PTR is returned, and
 * subsequent calls to kmem_buckets_alloc() will fall back to kmalloc().
 * (i.e. callers only need to check for NULL on failure.)
 */
kmem_buckets *kmem_buckets_create(const char *name, slab_flags_t flags,
                                  unsigned int useroffset,
                                  unsigned int usersize,
                                  void (*ctor)(void *))
{
        unsigned long mask = 0;
        unsigned int idx;
        kmem_buckets *b;

        BUILD_BUG_ON(ARRAY_SIZE(kmalloc_caches[KMALLOC_NORMAL]) > BITS_PER_LONG);

        /*
         * When the separate buckets API is not built in, just return
         * a non-NULL value for the kmem_buckets pointer, which will be
         * unused when performing allocations.
         */
        if (!IS_ENABLED(CONFIG_SLAB_BUCKETS))
                return ZERO_SIZE_PTR;

        if (WARN_ON(!kmem_buckets_cache))
                return NULL;

        b = kmem_cache_alloc(kmem_buckets_cache, GFP_KERNEL|__GFP_ZERO);
        if (WARN_ON(!b))
                return NULL;

        flags |= SLAB_NO_MERGE;

        for (idx = 0; idx < ARRAY_SIZE(kmalloc_caches[KMALLOC_NORMAL]); idx++) {
                char *short_size, *cache_name;
                unsigned int cache_useroffset, cache_usersize;
                unsigned int size, aligned_idx;

		if (!kmalloc_caches[KMALLOC_NORMAL][idx])
			continue;

		size = kmalloc_caches[KMALLOC_NORMAL][idx]->object_size;
		if (!size)
			continue;

		short_size = strchr(kmalloc_caches[KMALLOC_NORMAL][idx]->name, '-');
		if (WARN_ON(!short_size))
			goto fail;

		if (useroffset >= size) {
			cache_useroffset = 0;
			cache_usersize = 0;
		} else {
			cache_useroffset = useroffset;
			cache_usersize = min(size - cache_useroffset, usersize);
		}

		aligned_idx = __kmalloc_index(size, false);
		if (!(*b)[aligned_idx]) {
			cache_name = kasprintf(GFP_KERNEL, "%s-%s", name, short_size + 1);
			if (WARN_ON(!cache_name))
				goto fail;
			(*b)[aligned_idx] = kmem_cache_create_usercopy(cache_name, size,
							0, flags, cache_useroffset,
							cache_usersize, ctor);
			kfree(cache_name);
			if (WARN_ON(!(*b)[aligned_idx]))
				goto fail;
			set_bit(aligned_idx, &mask);
		}
		if (idx != aligned_idx)
			(*b)[idx] = (*b)[aligned_idx];
	}

	return b;

fail:
	for_each_set_bit(idx, &mask, ARRAY_SIZE(kmalloc_caches[KMALLOC_NORMAL]))
		kmem_cache_destroy((*b)[idx]);
	kmem_cache_free(kmem_buckets_cache, b);

	return NULL;
}
EXPORT_SYMBOL(kmem_buckets_create);
/*
 * For a given kmem_cache, kmem_cache_destroy() should only be called
 * once or there will be a use-after-free problem. The actual deletion
 * and release of the kobject does not need slab_mutex or cpu_hotplug_lock
 * protection. So they are now done without holding those locks.
 */
static void kmem_cache_release(struct kmem_cache *s)
{
	kfence_shutdown_cache(s);
	if (__is_defined(SLAB_SUPPORTS_SYSFS) && slab_state >= FULL)
		sysfs_slab_release(s);
	else
		slab_kmem_cache_release(s);
memcg: zap memcg_slab_caches and memcg_slab_mutex
mem_cgroup->memcg_slab_caches is a list of kmem caches corresponding to
the given cgroup. Currently, it is only used on css free in order to
destroy all caches corresponding to the memory cgroup being freed. The
list is protected by memcg_slab_mutex. The mutex is also used to protect
kmem_cache->memcg_params->memcg_caches arrays and synchronizes
kmem_cache_destroy vs memcg_unregister_all_caches.
However, we can perfectly get on without these two. To destroy all caches
corresponding to a memory cgroup, we can walk over the global list of kmem
caches, slab_caches, and we can do all the synchronization stuff using the
slab_mutex instead of the memcg_slab_mutex. This patch therefore gets rid
of the memcg_slab_caches and memcg_slab_mutex.
Apart from this nice cleanup, it also:
- assures that rcu_barrier() is called once at max when a root cache is
destroyed or a memory cgroup is freed, no matter how many caches have
SLAB_DESTROY_BY_RCU flag set;
- fixes the race between kmem_cache_destroy and kmem_cache_create that
exists, because memcg_cleanup_cache_params, which is called from
kmem_cache_destroy after checking that kmem_cache->refcount=0,
releases the slab_mutex, which gives kmem_cache_create a chance to
make an alias to a cache doomed to be destroyed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-10 14:11:47 -08:00
}

void slab_kmem_cache_release(struct kmem_cache *s)
{
	__kmem_cache_release(s);
	kfree_const(s->name);
	kmem_cache_free(kmem_cache, s);
}

void kmem_cache_destroy(struct kmem_cache *s)
{
	int err;

	if (unlikely(!s) || !kasan_check_byte(s))
		return;

	/* in-flight kfree_rcu()'s may include objects from our cache */
	kvfree_rcu_barrier();

	if (IS_ENABLED(CONFIG_SLUB_RCU_DEBUG) &&
	    (s->flags & SLAB_TYPESAFE_BY_RCU)) {
		/*
		 * Under CONFIG_SLUB_RCU_DEBUG, when objects in a
		 * SLAB_TYPESAFE_BY_RCU slab are freed, SLUB will internally
		 * defer their freeing with call_rcu().
		 * Wait for such call_rcu() invocations here before actually
		 * destroying the cache.
		 *
		 * It doesn't matter that we haven't looked at the slab refcount
		 * yet - slabs with SLAB_TYPESAFE_BY_RCU can't be merged, so
		 * the refcount should be 1 here.
		 */
		rcu_barrier();
	}

	cpus_read_lock();
	mutex_lock(&slab_mutex);
mm/slab_common: fix slab_caches list corruption after kmem_cache_destroy()
After the commit in Fixes:, if a module that created a slab cache does not
release all of its allocated objects before destroying the cache (at rmmod
time), we might end up releasing the kmem_cache object without removing it
from the slab_caches list thus corrupting the list as kmem_cache_destroy()
ignores the return value from shutdown_cache(), which in turn never removes
the kmem_cache object from slabs_list in case __kmem_cache_shutdown() fails
to release all of the cache's slabs.
This is easily observable on a kernel built with CONFIG_DEBUG_LIST=y
as after that ill-fated release the system will immediately trip on list_add,
or list_del, assertions similar to the one shown below as soon as another
kmem_cache gets created, or destroyed:
[ 1041.213632] list_del corruption. next->prev should be ffff89f596fb5768, but was 52f1e5016aeee75d. (next=ffff89f595a1b268)
[ 1041.219165] ------------[ cut here ]------------
[ 1041.221517] kernel BUG at lib/list_debug.c:62!
[ 1041.223452] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 1041.225408] CPU: 2 PID: 1852 Comm: rmmod Kdump: loaded Tainted: G B W OE 6.5.0 #15
[ 1041.228244] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20230524-3.fc37 05/24/2023
[ 1041.231212] RIP: 0010:__list_del_entry_valid+0xae/0xb0
Another quick way to trigger this issue, in a kernel with CONFIG_SLUB=y,
is to set slub_debug to poison the released objects and then just run
cat /proc/slabinfo after removing the module that leaks slab objects,
in which case the kernel will panic:
[ 50.954843] general protection fault, probably for non-canonical address 0xa56b6b6b6b6b6b8b: 0000 [#1] PREEMPT SMP PTI
[ 50.961545] CPU: 2 PID: 1495 Comm: cat Kdump: loaded Tainted: G B W OE 6.5.0 #15
[ 50.966808] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20230524-3.fc37 05/24/2023
[ 50.972663] RIP: 0010:get_slabinfo+0x42/0xf0
This patch fixes this issue by properly checking shutdown_cache()'s
return value before taking the kmem_cache_release() branch.
Fixes: 0495e337b703 ("mm/slab_common: Deleting kobject in kmem_cache_destroy() without holding slab_mutex/cpu_hotplug_lock")
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-09-08 19:06:49 -04:00
	s->refcount--;
	if (s->refcount) {
		mutex_unlock(&slab_mutex);
		cpus_read_unlock();
		return;
	}

	/* free asan quarantined objects */
	kasan_cache_shutdown(s);

	err = __kmem_cache_shutdown(s);
	if (!slab_in_kunit_test())
		WARN(err, "%s %s: Slab cache still has objects when called from %pS",
		     __func__, s->name, (void *)_RET_IP_);

	list_del(&s->list);

	mutex_unlock(&slab_mutex);
	cpus_read_unlock();

	if (slab_state >= FULL)
		sysfs_slab_unlink(s);
	debugfs_slab_release(s);

	if (err)
		return;

	if (s->flags & SLAB_TYPESAFE_BY_RCU)
		rcu_barrier();

	kmem_cache_release(s);
}
EXPORT_SYMBOL(kmem_cache_destroy);
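To make the destroy-time expectations above concrete, here is a minimal, hypothetical module lifecycle (a sketch, not code from this file; the "foo" names and the SLAB_HWCACHE_ALIGN flag are assumptions):

#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>

struct foo { int id; };

static struct kmem_cache *foo_cachep;

static int __init foo_init(void)
{
	foo_cachep = kmem_cache_create("foo", sizeof(struct foo), 0,
				       SLAB_HWCACHE_ALIGN, NULL);
	return foo_cachep ? 0 : -ENOMEM;
}

static void __exit foo_exit(void)
{
	/*
	 * Every object must already have been kmem_cache_free()d here.
	 * Otherwise __kmem_cache_shutdown() fails, the WARN above fires,
	 * and kmem_cache_destroy() returns without releasing the cache.
	 */
	kmem_cache_destroy(foo_cachep);
}

module_init(foo_init);
module_exit(foo_exit);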
slab: get_online_mems for kmem_cache_{create,destroy,shrink}
When we create a sl[au]b cache, we allocate kmem_cache_node structures
for each online NUMA node. To handle nodes taken online/offline, we
register memory hotplug notifier and allocate/free kmem_cache_node
corresponding to the node that changes its state for each kmem cache.
To synchronize between the two paths we hold the slab_mutex during both
the cache creation/destruction path and while tuning per-node parts of
kmem caches in memory hotplug handler, but that's not quite right,
because it does not guarantee that a newly created cache will have all
kmem_cache_nodes initialized in case it races with memory hotplug. For
instance, in case of slub:
    CPU0                                CPU1
    ----                                ----
    kmem_cache_create:                  online_pages:
     __kmem_cache_create:                slab_memory_callback:
                                          slab_mem_going_online_callback:
                                           lock slab_mutex
                                           for each slab_caches list entry
                                               allocate kmem_cache node
                                           unlock slab_mutex
      lock slab_mutex
      init_kmem_cache_nodes:
       for_each_node_state(node, N_NORMAL_MEMORY)
        allocate kmem_cache node
      add kmem_cache to slab_caches list
      unlock slab_mutex
                                         online_pages (continued):
                                          node_states_set_node
As a result we'll get a kmem cache with not all kmem_cache_nodes
allocated.
To avoid issues like that we should hold get/put_online_mems() during
the whole kmem cache creation/destruction/shrink paths, just like we
deal with cpu hotplug. This patch does the trick.
Note that after it's applied, there is no need to take the slab_mutex
for kmem_cache_shrink any more, so it is removed from there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:07:20 -07:00
/**
 * kmem_cache_shrink - Shrink a cache.
 * @cachep: The cache to shrink.
 *
 * Releases as many slabs as possible for a cache.
 * To help debugging, a zero exit status indicates all slabs were released.
 *
 * Return: %0 if all slabs were released, non-zero otherwise
 */
int kmem_cache_shrink(struct kmem_cache *cachep)
{
mm: kasan: initial memory quarantine implementation
Quarantine isolates freed objects in a separate queue. The objects are
returned to the allocator later, which helps to detect use-after-free
errors.
When the object is freed, its state changes from KASAN_STATE_ALLOC to
KASAN_STATE_QUARANTINE. The object is poisoned and put into quarantine
instead of being returned to the allocator, therefore every subsequent
access to that object triggers a KASAN error, and the error handler is
able to say where the object has been allocated and deallocated.
When it's time for the object to leave quarantine, its state becomes
KASAN_STATE_FREE and it's returned to the allocator. From now on the
allocator may reuse it for another allocation. Before that happens,
it's still possible to detect a use-after free on that object (it
retains the allocation/deallocation stacks).
When the allocator reuses this object, the shadow is unpoisoned and old
allocation/deallocation stacks are wiped. Therefore a use of this
object, even an incorrect one, won't trigger ASan warning.
Without the quarantine, it's not guaranteed that the objects aren't
reused immediately, that's why the probability of catching a
use-after-free is lower than with quarantine in place.
Quarantine isolates freed objects in a separate queue. The objects are
returned to the allocator later, which helps to detect use-after-free
errors.
Freed objects are first added to per-cpu quarantine queues. When a
cache is destroyed or memory shrinking is requested, the objects are
moved into the global quarantine queue. Whenever a kmalloc call allows
memory reclaiming, the oldest objects are popped out of the global queue
until the total size of objects in quarantine is less than 3/4 of the
maximum quarantine size (which is a fraction of installed physical
memory).
As long as an object remains in the quarantine, KASAN is able to report
accesses to it, so the chance of reporting a use-after-free is
increased. Once the object leaves quarantine, the allocator may reuse
it, in which case the object is unpoisoned and KASAN can't detect
incorrect accesses to it.
Right now quarantine support is only enabled in SLAB allocator.
Unification of KASAN features in SLAB and SLUB will be done later.
This patch is based on the "mm: kasan: quarantine" patch originally
prepared by Dmitry Chernenkov. A number of improvements have been
suggested by Andrey Ryabinin.
[glider@google.com: v9]
Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 16:59:11 -07:00
	kasan_cache_shrink(cachep);
mm, slab, slub: stop taking memory hotplug lock
Since commit 03afc0e25f7f ("slab: get_online_mems for
kmem_cache_{create,destroy,shrink}") we are taking memory hotplug lock for
SLAB and SLUB when creating, destroying or shrinking a cache. It is quite
a heavy lock and it's best to avoid it if possible, as we had several
issues with lockdep complaining about ordering in the past, see e.g.
e4f8e513c3d3 ("mm/slub: fix a deadlock in show_slab_objects()").
The problem scenario in 03afc0e25f7f (solved by the memory hotplug lock)
can be summarized as follows: while there's slab_mutex synchronizing new
kmem cache creation and SLUB's MEM_GOING_ONLINE callback
slab_mem_going_online_callback(), we may miss creation of kmem_cache_node
for the hotplugged node in the new kmem cache, because the hotplug
callback doesn't yet see the new cache, and cache creation in
init_kmem_cache_nodes() only inits kmem_cache_node for nodes in the
N_NORMAL_MEMORY nodemask, which however may not yet include the new node,
as that happens only later after the MEM_GOING_ONLINE callback.
Instead of using get/put_online_mems(), the problem can be solved by SLUB
maintaining its own nodemask of nodes for which it has allocated the
per-node kmem_cache_node structures. This nodemask would generally mirror
the N_NORMAL_MEMORY nodemask, but would be updated only under SLUB's
control in its memory hotplug callbacks under the slab_mutex. This patch
adds such nodemask and its handling.
Commit 03afc0e25f7f mentions "issues like [the one above]", but there
don't appear to be further issues. All the paths (shared for SLAB and
SLUB) taking the memory hotplug locks are also taking the slab_mutex,
except kmem_cache_shrink() where 03afc0e25f7f replaced slab_mutex with
get/put_online_mems().
We however cannot simply restore slab_mutex in kmem_cache_shrink(), as
SLUB can enter the function from a write to sysfs 'shrink' file, thus
holding kernfs lock, and in kmem_cache_create() the kernfs lock is nested
within slab_mutex. But on closer inspection we don't actually need to
protect kmem_cache_shrink() from hotplug callbacks: While SLUB's
__kmem_cache_shrink() does for_each_kmem_cache_node(), missing a new node
added in parallel hotplug is not fatal, and parallel hotremove does not
free kmem_cache_node's anymore after the previous patch, so use-after-free
cannot happen. The per-node shrinking itself is protected by
n->list_lock. Same is true for SLAB, and SLOB is no-op.
SLAB also doesn't need the memory hotplug locking, which it only gained by
03afc0e25f7f through the shared paths in slab_common.c. Its memory
hotplug callbacks are also protected by slab_mutex against races with
these paths. The problem of SLUB relying on N_NORMAL_MEMORY doesn't apply
to SLAB, as its setup_kmem_cache_nodes relies on N_ONLINE, and the new
node is already set there during the MEM_GOING_ONLINE callback, so no
special care is needed for SLAB.
As such, this patch removes all get/put_online_mems() usage by the slab
subsystem.
Link: https://lkml.kernel.org/r/20210113131634.3671-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <cai@redhat.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 12:01:12 -08:00
	return __kmem_cache_shrink(cachep);
}
EXPORT_SYMBOL(kmem_cache_shrink);
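Since the return value encodes whether any slabs stayed populated, a caller can act on it; a hedged sketch (the helper name and message are illustrative):

#include <linux/printk.h>
#include <linux/slab.h>

static void foo_trim_cache(struct kmem_cache *cachep, const char *what)
{
	/* Non-zero means some slabs still hold live objects. */
	if (kmem_cache_shrink(cachep))
		pr_debug("%s: cache not fully shrunk\n", what);
}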
bool slab_is_available(void)
{
	return slab_state >= UP;
}

#ifdef CONFIG_PRINTK
static void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
{
	if (__kfence_obj_info(kpp, object, slab))
		return;
	__kmem_obj_info(kpp, object, slab);
}
mm: Add mem_dump_obj() to print source of memory block
There are kernel facilities such as per-CPU reference counts that give
error messages in generic handlers or callbacks, whose messages are
unenlightening. In the case of per-CPU reference-count underflow, this
is not a problem when creating a new use of this facility because in that
case the bug is almost certainly in the code implementing that new use.
However, trouble arises when deploying across many systems, which might
exercise corner cases that were not seen during development and testing.
Here, it would be really nice to get some kind of hint as to which of
several uses the underflow was caused by.
This commit therefore exposes a mem_dump_obj() function that takes
a pointer to memory (which must still be allocated if it has been
dynamically allocated) and prints available information on where that
memory came from. This pointer can reference the middle of the block as
well as the beginning of the block, as needed by things like RCU callback
functions and timer handlers that might not know where the beginning of
the memory block is. These functions and handlers can use mem_dump_obj()
to print out better hints as to where the problem might lie.
The information printed can depend on kernel configuration. For example,
the allocation return address can be printed only for slab and slub,
and even then only when the necessary debug has been enabled. For slab,
build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
to the next power of two or use the SLAB_STORE_USER when creating the
kmem_cache structure. For slub, build with CONFIG_SLUB_DEBUG=y and
boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
if more focused use is desired. Also for slub, use CONFIG_STACKTRACE
to enable printing of the allocation-time stack trace.
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Reported-by: Andrii Nakryiko <andrii@kernel.org>
[ paulmck: Convert to printing and change names per Joonsoo Kim. ]
[ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
[ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
[ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
[ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ]
[ paulmck: Explicitly check for small pointers per Naresh Kamboju. ]
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-12-07 17:41:02 -08:00
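A hypothetical caller of the facility described above prints its own preamble without a trailing newline and lets the dump continue the same output line; a minimal sketch (the helper name and message are assumptions):

#include <linux/mm.h>
#include <linux/printk.h>

/* Illustrative debug helper; not part of this file. */
static void foo_report_bad_ref(void *p)
{
	/* mem_dump_obj() continues this line via pr_cont(). */
	pr_err("foo: unexpected reference %px", p);
	mem_dump_obj(p);
}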
/**
 * kmem_dump_obj - Print available slab provenance information
 * @object: slab object for which to find provenance information.
 *
 * This function uses pr_cont(), so that the caller is expected to have
 * printed out whatever preamble is appropriate. The provenance information
 * depends on the type of object and on how much debugging is enabled.
 * For a slab-cache object, the fact that it is a slab object is printed,
 * and, if available, the slab name, return address, and stack trace from
 * the allocation and last free path of that object.
 *
 * Return: %true if the pointer is to a not-yet-freed object from
 * kmalloc() or kmem_cache_alloc(), either %true or %false if the pointer
 * is to an already-freed object, and %false otherwise.
 */
bool kmem_dump_obj(void *object)
{
	char *cp = IS_ENABLED(CONFIG_MMU) ? "" : "/vmalloc";
	int i;
	struct slab *slab;
	unsigned long ptroffset;
	struct kmem_obj_info kp = { };

	/* Some arches consider ZERO_SIZE_PTR to be a valid address. */
	if (object < (void *)PAGE_SIZE || !virt_addr_valid(object))
		return false;
	slab = virt_to_slab(object);
	if (!slab)
		return false;

	kmem_obj_info(&kp, object, slab);
	if (kp.kp_slab_cache)
		pr_cont(" slab%s %s", cp, kp.kp_slab_cache->name);
	else
		pr_cont(" slab%s", cp);
	if (is_kfence_address(object))
		pr_cont(" (kfence)");
        if (kp.kp_objp)
                pr_cont(" start %px", kp.kp_objp);
        if (kp.kp_data_offset)
                pr_cont(" data offset %lu", kp.kp_data_offset);
        if (kp.kp_objp) {
                ptroffset = ((char *)object - (char *)kp.kp_objp) - kp.kp_data_offset;
                pr_cont(" pointer offset %lu", ptroffset);
        }
        if (kp.kp_slab_cache && kp.kp_slab_cache->object_size)
                pr_cont(" size %u", kp.kp_slab_cache->object_size);
        if (kp.kp_ret)
                pr_cont(" allocated at %pS\n", kp.kp_ret);
        else
                pr_cont("\n");
        for (i = 0; i < ARRAY_SIZE(kp.kp_stack); i++) {
                if (!kp.kp_stack[i])
                        break;
                pr_info(" %pS\n", kp.kp_stack[i]);
        }

        if (kp.kp_free_stack[0])
                pr_cont(" Free path:\n");

        for (i = 0; i < ARRAY_SIZE(kp.kp_free_stack); i++) {
                if (!kp.kp_free_stack[i])
                        break;
                pr_info(" %pS\n", kp.kp_free_stack[i]);
        }

        return true;
}
EXPORT_SYMBOL_GPL(kmem_dump_obj);
#endif
/* Create a cache during boot when no slab services are available yet */
void __init create_boot_cache(struct kmem_cache *s, const char *name,
                unsigned int size, slab_flags_t flags,
                unsigned int useroffset, unsigned int usersize)
{
        int err;
mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)
In most configurations, kmalloc() happens to return naturally aligned
(i.e. aligned to the block size itself) blocks for power of two sizes.
That means some kmalloc() users might unknowingly rely on that
alignment, until stuff breaks when the kernel is built with e.g.
CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned. Then
developers have to devise workarounds such as their own kmem caches with
specified alignment [1], which is not always practical, as recently
evidenced in [2].
The topic has been discussed at LSF/MM 2019 [3]. Adding a
'kmalloc_aligned()' variant would not help with code unknowingly relying
on the implicit alignment. For slab implementations it would either
require creating more kmalloc caches, or allocate a larger size and only
give back part of it. That would be wasteful, especially with a generic
alignment parameter (in contrast with a fixed alignment to size).
Ideally we should provide to mm users what they need without difficult
workarounds or own reimplementations, so let's make the kmalloc()
alignment to size explicitly guaranteed for power-of-two sizes under all
configurations. What does this mean for the three available allocators?
* SLAB object layout happens to be mostly unchanged by the patch. The
implicitly provided alignment could be compromised with
CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for
caches with alignment larger than unsigned long long. Practically on at
least x86 this includes kmalloc caches as they use cache line alignment,
which is larger than that. Still, this patch ensures alignment on all
arches and cache sizes.
* SLUB layout is also unchanged unless redzoning is enabled through
CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache.
With this patch, explicit alignment is guaranteed with redzoning as
well. This will result in more memory being wasted, but that should be
acceptable in a debugging scenario.
* SLOB has no implicit alignment so this patch adds it explicitly for
kmalloc(). The potential downside is increased fragmentation. While
pathological allocation scenarios are certainly possible, in my testing,
after booting an x86_64 kernel+userspace with virtme, around 16MB of memory
was consumed by slab pages both before and after the patch, with
difference in the noise.
[1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/
[2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/
[3] https://lwn.net/Articles/787740/
[akpm@linux-foundation.org: documentation fixlet, per Matthew]
Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: David Sterba <dsterba@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-06 17:58:45 -07:00
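A minimal sketch of the guarantee (illustrative check only, not part of the patch): a power-of-two kmalloc() size comes back aligned to that size under all configurations.

#include <linux/kernel.h>
#include <linux/slab.h>

static void check_natural_alignment(void)
{
        void *p = kmalloc(4096, GFP_KERNEL);    /* power-of-two size */

        if (p) {
                /* With the guarantee above, p is aligned to 4096 bytes. */
                WARN_ON(!IS_ALIGNED((unsigned long)p, 4096));
                kfree(p);
        }
}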
        unsigned int align = ARCH_KMALLOC_MINALIGN;
        struct kmem_cache_args kmem_args = {};
slab, rust: extend kmalloc() alignment guarantees to remove Rust padding
Slab allocators have been guaranteeing natural alignment for
power-of-two sizes since commit 59bb47985c1d ("mm, sl[aou]b: guarantee
natural alignment for kmalloc(power-of-two)"), while any other sizes are
guaranteed to be aligned only to ARCH_KMALLOC_MINALIGN bytes (although
in practice they are aligned to more than that in non-debug scenarios).
Rust's allocator API specifies size and alignment per allocation, which
have to satisfy the following rules, per Alice Ryhl [1]:
1. The alignment is a power of two.
2. The size is non-zero.
3. When you round up the size to the next multiple of the alignment,
then it must not overflow the signed type isize / ssize_t.
In order to map this to kmalloc()'s guarantees, some requested
allocation sizes have to be padded to the next power-of-two size [2].
For example, an allocation of size 96 and alignment of 32 will be padded
to an allocation of size 128, because the existing kmalloc-96 bucket
doesn't guarantee alignment above ARCH_KMALLOC_MINALIGN. Without slab
debugging active, the layout of the kmalloc-96 slabs however naturally
aligns the objects to 32 bytes, so extending the size to 128 bytes is
wasteful.
To improve the situation we can extend the kmalloc() alignment
guarantees in a way that
1) doesn't change the current slab layout (and thus does not increase
internal fragmentation) when slab debugging is not active
2) reduces waste in the Rust allocator use case
3) is a superset of the current guarantee for power-of-two sizes.
The extended guarantee is that alignment is at least the largest
power-of-two divisor of the requested size. For power-of-two sizes the
largest divisor is the size itself, but let's keep this case documented
separately for clarity.
For current kmalloc size buckets, it means kmalloc-96 will guarantee
alignment of 32 bytes and kmalloc-192 will guarantee 64 bytes.
This covers the rules 1 and 2 above of Rust's API as long as the size is
a multiple of the alignment. The Rust layer should now only need to
round up the size to the next multiple if it isn't, while enforcing the
rule 3.
Implementation-wise, this changes the alignment calculation in
create_boot_cache(). While at it, also do the calculation only for caches
with the SLAB_KMALLOC flag, because the function is also used to create
the initial kmem_cache and kmem_cache_node caches, where no alignment
guarantee is necessary.
In the Rust allocator's krealloc_aligned(), remove the code that padded
sizes to the next power of two (suggested by Alice Ryhl) as it's no
longer necessary with the new guarantees.
Reported-by: Alice Ryhl <aliceryhl@google.com>
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Link: https://lore.kernel.org/all/CAH5fLggjrbdUuT-H-5vbQfMazjRDpp2%2Bk3%3DYhPyS17ezEqxwcw@mail.gmail.com/ [1]
Link: https://lore.kernel.org/all/CAH5fLghsZRemYUwVvhk77o6y1foqnCeDzW4WZv6ScEWna2+_jw@mail.gmail.com/ [2]
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-07-03 09:25:21 +02:00
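A small sketch of the extended guarantee (the helper is illustrative): the promised alignment is the largest power-of-two divisor of the requested size, which is exactly what the ffs()-based calculation added to create_boot_cache() below computes.

#include <linux/bitops.h>

/* Largest power-of-two divisor of a size: 96 -> 32, 192 -> 64, and any
 * power-of-two size -> the size itself.
 */
static unsigned int example_kmalloc_min_align(unsigned int size)
{
        return 1U << (ffs(size) - 1);
}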
        /*
         * kmalloc caches guarantee alignment of at least the largest
         * power-of-two divisor of the size. For power-of-two sizes,
         * it is the size itself.
         */
        if (flags & SLAB_KMALLOC)
                align = max(align, 1U << (ffs(size) - 1));
        kmem_args.align = calculate_alignment(flags, align, size);

#ifdef CONFIG_HARDENED_USERCOPY
        kmem_args.useroffset = useroffset;
        kmem_args.usersize = usersize;
#endif

        err = do_kmem_cache_create(s, name, size, &kmem_args, flags);

        if (err)
                panic("Creation of kmalloc slab %s size=%u failed. Reason %d\n",
                                        name, size, err);

        s->refcount = -1;       /* Exempt from merging for now */
}

static struct kmem_cache *__init create_kmalloc_cache(const char *name,
                                                      unsigned int size,
                                                      slab_flags_t flags)
{
        struct kmem_cache *s = kmem_cache_zalloc(kmem_cache, GFP_NOWAIT);

        if (!s)
                panic("Out of memory when creating slab %s\n", name);

        create_boot_cache(s, name, size, flags | SLAB_KMALLOC, 0, size);
        list_add(&s->list, &slab_caches);
        s->refcount = 1;
        return s;
}

kmem_buckets kmalloc_caches[NR_KMALLOC_TYPES] __ro_after_init =
{ /* initialization for https://llvm.org/pr42570 */ };
EXPORT_SYMBOL(kmalloc_caches);
Randomized slab caches for kmalloc()
When exploiting memory vulnerabilities, "heap spraying" is a common
technique targeting those related to dynamic memory allocation (i.e. the
"heap"), and it plays an important role in a successful exploitation.
Basically, it is to overwrite the memory area of a vulnerable object by
triggering allocations in other subsystems or modules and thereby
getting a reference to the targeted memory location. It's usable on
various types of vulnerability, including use-after-free (UAF), heap
out-of-bounds write, and so on.
There are (at least) two reasons why the heap can be sprayed: 1) generic
slab caches are shared among different subsystems and modules, and
2) dedicated slab caches could be merged with the generic ones.
Currently these two factors cannot be prevented at a low cost: the first
one is a widely used memory allocation mechanism, and shutting down slab
merging completely via `slub_nomerge` would be overkill.
To efficiently prevent heap spraying, we propose the following approach:
to create multiple copies of generic slab caches that will never be
merged, and a random one of them will be used at allocation. The random
selection is based on the address of code that calls `kmalloc()`, which
means it is static at runtime (rather than dynamically determined at
each time of allocation, which could be bypassed by repeatedly spraying
in brute force). In other words, the randomness of cache selection will
be with respect to the code address rather than time, i.e. allocations
in different code paths would most likely pick different caches,
although kmalloc() at each place would use the same cache copy whenever
it is executed. In this way, the vulnerable object and memory allocated
in other subsystems and modules will (most probably) be on different
slab caches, which prevents the object from being sprayed.
Meanwhile, the static random selection is further enhanced with a
per-boot random seed, which prevents the attacker from finding a usable
kmalloc that happens to pick the same cache with the vulnerable
subsystem/module by analyzing the open source code. In other words, with
the per-boot seed, the random selection is static during each time the
system starts and runs, but not across different system startups.
The performance overhead has been tested on a 40-core x86 server by
comparing the results of `perf bench all` between the kernels with and
without this patch based on the latest linux-next kernel, which shows
minor difference. A subset of benchmarks are listed below:
sched/ sched/ syscall/ mem/ mem/
messaging pipe basic memcpy memset
(sec) (sec) (sec) (GB/sec) (GB/sec)
control1 0.019 5.459 0.733 15.258789 51.398026
control2 0.019 5.439 0.730 16.009221 48.828125
control3 0.019 5.282 0.735 16.009221 48.828125
control_avg 0.019 5.393 0.733 15.759077 49.684759
experiment1 0.019 5.374 0.741 15.500992 46.502976
experiment2 0.019 5.440 0.746 16.276042 51.398026
experiment3 0.019 5.242 0.752 15.258789 51.398026
experiment_avg 0.019 5.352 0.746 15.678608 49.766343
The overhead of memory usage was measured by executing `free` after boot
on a QEMU VM with 1GB total memory, and as expected, it's positively
correlated with # of cache copies:
control 4 copies 8 copies 16 copies
total 969.8M 968.2M 968.2M 968.2M
used 20.0M 21.9M 24.1M 26.7M
free 936.9M 933.6M 931.4M 928.6M
available 932.2M 928.8M 926.6M 923.9M
Co-developed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: GONG, Ruiqi <gongruiqi@huaweicloud.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Dennis Zhou <dennis@kernel.org> # percpu
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-07-14 14:44:22 +08:00
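A conceptual sketch of the cache selection described above (not the exact in-tree implementation): hash the call-site address together with the per-boot seed, so a given kmalloc() call site sticks to one cache copy within a boot, while the mapping changes across boots.

#include <linux/hash.h>

static inline unsigned int example_pick_kmalloc_copy(unsigned long caller_ip,
                                                     unsigned long boot_seed,
                                                     unsigned int nr_copies)
{
        /* Static per call site within one boot; differs between boots. */
        return hash_64(caller_ip ^ boot_seed, 32) % nr_copies;
}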
#ifdef CONFIG_RANDOM_KMALLOC_CACHES
unsigned long random_kmalloc_seed __ro_after_init;
EXPORT_SYMBOL(random_kmalloc_seed);
#endif

/*
 * Conversion table for small slabs sizes / 8 to the index in the
 * kmalloc array. This is necessary for slabs < 192 since we have non power
 * of two cache sizes there. The size of larger slabs can be determined using
 * fls.
 */
u8 kmalloc_size_index[24] __ro_after_init = {
        3,      /* 8 */
        4,      /* 16 */
        5,      /* 24 */
        5,      /* 32 */
        6,      /* 40 */
        6,      /* 48 */
        6,      /* 56 */
        6,      /* 64 */
        1,      /* 72 */
        1,      /* 80 */
        1,      /* 88 */
        1,      /* 96 */
        7,      /* 104 */
        7,      /* 112 */
        7,      /* 120 */
        7,      /* 128 */
        2,      /* 136 */
        2,      /* 144 */
        2,      /* 152 */
        2,      /* 160 */
        2,      /* 168 */
        2,      /* 176 */
        2,      /* 184 */
        2       /* 192 */
};
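As a worked illustration of the table above (the helper is hypothetical; the real lookup is done by the kmalloc slab selection code): a 72-byte request maps to (72 - 1) / 8 = 8, and kmalloc_size_index[8] is 1, i.e. the 96-byte cache.

/* Valid for sizes 1..192 bytes; larger sizes use fls() instead. */
static unsigned int example_small_kmalloc_index(size_t size)
{
        return kmalloc_size_index[(size - 1) / 8];
}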
slab: Introduce kmalloc_size_roundup()
In the effort to help the compiler reason about buffer sizes, the
__alloc_size attribute was added to allocators. This improves the scope
of the compiler's ability to apply CONFIG_UBSAN_BOUNDS and (in the near
future) CONFIG_FORTIFY_SOURCE. For most allocations, this works well,
as the vast majority of callers are not expecting to use more memory
than what they asked for.
There is, however, one common exception to this: anticipatory resizing
of kmalloc allocations. These cases all use ksize() to determine the
actual bucket size of a given allocation (e.g. 128 when 126 was asked
for). This comes in two styles in the kernel:
1) An allocation has been determined to be too small, and needs to be
resized. Instead of the caller choosing its own next best size, it
wants to minimize the number of calls to krealloc(), so it just uses
ksize() plus some additional bytes, forcing the realloc into the next
bucket size, from which it can learn how large it is now. For example:
data = krealloc(data, ksize(data) + 1, gfp);
data_len = ksize(data);
2) The minimum size of an allocation is calculated, but since it may
grow in the future, just use all the space available in the chosen
bucket immediately, to avoid needing to reallocate later. A good
example of this is skbuff's allocators:
data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
...
/* kmalloc(size) might give us more room than requested.
* Put skb_shared_info exactly at the end of allocated zone,
* to allow max possible filling before reallocation.
*/
osize = ksize(data);
size = SKB_WITH_OVERHEAD(osize);
In both cases, the "how much was actually allocated?" question is answered
_after_ the allocation, where the compiler hinting is not in an easy place
to make the association any more. This mismatch between the compiler's
view of the buffer length and the code's intention about how much it is
going to actually use has already caused problems[1]. It is possible to
fix this by reordering the use of the "actual size" information.
We can serve the needs of users of ksize() and still have accurate buffer
length hinting for the compiler by doing the bucket size calculation
_before_ the allocation. Code can instead ask "how large an allocation
would I get for a given size?".
Introduce kmalloc_size_roundup(), to serve this function so we can start
replacing the "anticipatory resizing" uses of ksize().
[1] https://github.com/ClangBuiltLinux/linux/issues/1599
https://github.com/KSPP/linux/issues/183
[ vbabka@suse.cz: add SLOB version ]
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-23 13:28:08 -07:00
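A short sketch of the intended usage pattern (the wrapper is illustrative): ask for the bucket size first, then allocate and use that full size, so the compiler's size hint matches what the code actually uses.

#include <linux/slab.h>

static void *example_alloc_rounded(size_t needed, size_t *avail)
{
        /* e.g. needed = 126 rounds up to 128 */
        *avail = kmalloc_size_roundup(needed);
        return kmalloc(*avail, GFP_KERNEL);
}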
size_t kmalloc_size_roundup(size_t size)
{
        if (size && size <= KMALLOC_MAX_CACHE_SIZE) {
                /*
                 * The flags don't matter since size_index is common to all.
                 * Neither does the caller for just getting ->object_size.
                 */
                return kmalloc_slab(size, NULL, GFP_KERNEL, 0)->object_size;
        }
        /* Above the smaller buckets, size is a multiple of page size. */
        if (size && size <= KMALLOC_MAX_SIZE)
                return PAGE_SIZE << get_order(size);

        /*
         * Return 'size' for 0 - kmalloc() returns ZERO_SIZE_PTR
         * and very large size - kmalloc() may fail.
         */
        return size;
slab: Introduce kmalloc_size_roundup()
In the effort to help the compiler reason about buffer sizes, the
__alloc_size attribute was added to allocators. This improves the scope
of the compiler's ability to apply CONFIG_UBSAN_BOUNDS and (in the near
future) CONFIG_FORTIFY_SOURCE. For most allocations, this works well,
as the vast majority of callers are not expecting to use more memory
than what they asked for.
There is, however, one common exception to this: anticipatory resizing
of kmalloc allocations. These cases all use ksize() to determine the
actual bucket size of a given allocation (e.g. 128 when 126 was asked
for). This comes in two styles in the kernel:
1) An allocation has been determined to be too small, and needs to be
resized. Instead of the caller choosing its own next best size, it
wants to minimize the number of calls to krealloc(), so it just uses
ksize() plus some additional bytes, forcing the realloc into the next
bucket size, from which it can learn how large it is now. For example:
data = krealloc(data, ksize(data) + 1, gfp);
data_len = ksize(data);
2) The minimum size of an allocation is calculated, but since it may
grow in the future, just use all the space available in the chosen
bucket immediately, to avoid needing to reallocate later. A good
example of this is skbuff's allocators:
data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
...
/* kmalloc(size) might give us more room than requested.
* Put skb_shared_info exactly at the end of allocated zone,
* to allow max possible filling before reallocation.
*/
osize = ksize(data);
size = SKB_WITH_OVERHEAD(osize);
In both cases, the "how much was actually allocated?" question is answered
_after_ the allocation, where the compiler hinting is not in an easy place
to make the association any more. This mismatch between the compiler's
view of the buffer length and the code's intention about how much it is
going to actually use has already caused problems[1]. It is possible to
fix this by reordering the use of the "actual size" information.
We can serve the needs of users of ksize() and still have accurate buffer
length hinting for the compiler by doing the bucket size calculation
_before_ the allocation. Code can instead ask "how large an allocation
would I get for a given size?".
Introduce kmalloc_size_roundup(), to serve this function so we can start
replacing the "anticipatory resizing" uses of ksize().
[1] https://github.com/ClangBuiltLinux/linux/issues/1599
https://github.com/KSPP/linux/issues/183
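As a minimal, hedged sketch of the reordered pattern (the helper and variable
names below are illustrative, not taken from any real caller):
#include <linux/slab.h>

/*
 * Ask "how large an allocation would I get?" before allocating, so the
 * compiler's __alloc_size hint covers every byte the code intends to use.
 */
static void *alloc_full_bucket(size_t min_len, size_t *out_len, gfp_t gfp)
{
	size_t bucket = kmalloc_size_roundup(min_len);	/* e.g. 126 -> 128 */
	void *buf = kmalloc(bucket, gfp);

	if (buf)
		*out_len = bucket;	/* use the whole bucket, no ksize() afterwards */
	return buf;
}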
[ vbabka@suse.cz: add SLOB version ]
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-23 13:28:08 -07:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(kmalloc_size_roundup);
|
|
|
|
|
mm, slab: make kmalloc_info[] contain all types of names
Patch series "mm, slab: Make kmalloc_info[] contain all types of names", v6.
There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM
and KMALLOC_DMA.
The name of KMALLOC_NORMAL is contained in kmalloc_info[].name,
but the names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically
generated by kmalloc_cache_name().
Patch1 predefines the names of all types of kmalloc to save
the time spent dynamically generating names.
These changes make sense, and the time spent by new_kmalloc_cache()
has been reduced by approximately 36.3%.
Time spent by new_kmalloc_cache()
(CPU cycles)
5.3-rc7 66264
5.3-rc7+patch 42188
This patch (of 3):
There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM and
KMALLOC_DMA.
The name of KMALLOC_NORMAL is contained in kmalloc_info[].name, but the
names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically generated by
kmalloc_cache_name().
This patch predefines the names of all types of kmalloc to save the time
spent dynamically generating names.
Besides, remove the kmalloc_cache_name() that is no longer used.
Link: http://lkml.kernel.org/r/1569241648-26908-2-git-send-email-lpf.vector@gmail.com
Signed-off-by: Pengfei Li <lpf.vector@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-30 17:49:21 -08:00
|
|
|
#ifdef CONFIG_ZONE_DMA
|
2021-06-28 19:37:38 -07:00
|
|
|
#define KMALLOC_DMA_NAME(sz) .name[KMALLOC_DMA] = "dma-kmalloc-" #sz,
|
|
|
|
#else
|
|
|
|
#define KMALLOC_DMA_NAME(sz)
|
|
|
|
#endif
|
|
|
|
|
2024-07-01 11:31:15 -04:00
|
|
|
#ifdef CONFIG_MEMCG
|
2021-06-28 19:37:38 -07:00
|
|
|
#define KMALLOC_CGROUP_NAME(sz) .name[KMALLOC_CGROUP] = "kmalloc-cg-" #sz,
|
2019-11-30 17:49:21 -08:00
|
|
|
#else
|
2021-06-28 19:37:38 -07:00
|
|
|
#define KMALLOC_CGROUP_NAME(sz)
|
|
|
|
#endif
|
|
|
|
|
2022-11-15 18:19:28 +01:00
|
|
|
#ifndef CONFIG_SLUB_TINY
|
|
|
|
#define KMALLOC_RCL_NAME(sz) .name[KMALLOC_RECLAIM] = "kmalloc-rcl-" #sz,
|
|
|
|
#else
|
|
|
|
#define KMALLOC_RCL_NAME(sz)
|
|
|
|
#endif
|
|
|
|
|
Randomized slab caches for kmalloc()
When exploiting memory vulnerabilities, "heap spraying" is a common
technique targeting those related to dynamic memory allocation (i.e. the
"heap"), and it plays an important role in a successful exploitation.
Basically, it overwrites the memory area of a vulnerable object by
triggering allocations in other subsystems or modules, thereby obtaining
a reference to the targeted memory location. It is usable with various
types of vulnerability, including use-after-free (UAF), heap out-of-bounds
writes, and so on.
There are (at least) two reasons why the heap can be sprayed: 1) generic
slab caches are shared among different subsystems and modules, and
2) dedicated slab caches could be merged with the generic ones.
Currently these two factors cannot be prevented at a low cost: the first
one is a widely used memory allocation mechanism, and shutting down slab
merging completely via `slub_nomerge` would be overkill.
To efficiently prevent heap spraying, we propose the following approach:
create multiple copies of the generic slab caches that will never be
merged, and use a random one of them at allocation time. The random
selection is based on the address of the code that calls `kmalloc()`,
which means it is static at runtime (rather than dynamically determined
at each allocation, which could be bypassed by repeated brute-force
spraying). In other words, the randomness of cache selection will
be with respect to the code address rather than time, i.e. allocations
in different code paths would most likely pick different caches,
although kmalloc() at each place would use the same cache copy whenever
it is executed. In this way, the vulnerable object and memory allocated
in other subsystems and modules will (most probably) be on different
slab caches, which prevents the object from being sprayed.
Meanwhile, the static random selection is further hardened with a
per-boot random seed, which prevents an attacker from finding, by
analyzing the open source code, a usable kmalloc call site that happens
to pick the same cache as the vulnerable subsystem/module. In other
words, with the per-boot seed, the random selection is fixed for the
lifetime of each boot, but differs across system startups.
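As a hedged sketch of this selection scheme (not the exact kernel macro; the
copy count and helper name below are illustrative):
#include <linux/hash.h>
#include <linux/log2.h>

#define KMALLOC_COPIES_NR	16		/* illustrative number of cache copies */

extern unsigned long random_kmalloc_seed;	/* set once per boot */

/* Same caller address + same boot => same copy; a reboot reshuffles the map. */
static inline unsigned int pick_kmalloc_copy(unsigned long caller_ip)
{
	return hash_64(caller_ip ^ random_kmalloc_seed, ilog2(KMALLOC_COPIES_NR));
}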
The performance overhead has been measured on a 40-core x86 server by
comparing the results of `perf bench all` between kernels with and
without this patch, based on the latest linux-next kernel; the difference
is minor. A subset of the benchmarks is listed below:
sched/ sched/ syscall/ mem/ mem/
messaging pipe basic memcpy memset
(sec) (sec) (sec) (GB/sec) (GB/sec)
control1 0.019 5.459 0.733 15.258789 51.398026
control2 0.019 5.439 0.730 16.009221 48.828125
control3 0.019 5.282 0.735 16.009221 48.828125
control_avg 0.019 5.393 0.733 15.759077 49.684759
experiment1 0.019 5.374 0.741 15.500992 46.502976
experiment2 0.019 5.440 0.746 16.276042 51.398026
experiment3 0.019 5.242 0.752 15.258789 51.398026
experiment_avg 0.019 5.352 0.746 15.678608 49.766343
The overhead of memory usage was measured by executing `free` after boot
on a QEMU VM with 1GB total memory, and as expected, it's positively
correlated with # of cache copies:
control 4 copies 8 copies 16 copies
total 969.8M 968.2M 968.2M 968.2M
used 20.0M 21.9M 24.1M 26.7M
free 936.9M 933.6M 931.4M 928.6M
available 932.2M 928.8M 926.6M 923.9M
Co-developed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: GONG, Ruiqi <gongruiqi@huaweicloud.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Dennis Zhou <dennis@kernel.org> # percpu
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-07-14 14:44:22 +08:00
|
|
|
#ifdef CONFIG_RANDOM_KMALLOC_CACHES
|
|
|
|
#define __KMALLOC_RANDOM_CONCAT(a, b) a ## b
|
|
|
|
#define KMALLOC_RANDOM_NAME(N, sz) __KMALLOC_RANDOM_CONCAT(KMA_RAND_, N)(sz)
|
|
|
|
#define KMA_RAND_1(sz) .name[KMALLOC_RANDOM_START + 1] = "kmalloc-rnd-01-" #sz,
|
|
|
|
#define KMA_RAND_2(sz) KMA_RAND_1(sz) .name[KMALLOC_RANDOM_START + 2] = "kmalloc-rnd-02-" #sz,
|
|
|
|
#define KMA_RAND_3(sz) KMA_RAND_2(sz) .name[KMALLOC_RANDOM_START + 3] = "kmalloc-rnd-03-" #sz,
|
|
|
|
#define KMA_RAND_4(sz) KMA_RAND_3(sz) .name[KMALLOC_RANDOM_START + 4] = "kmalloc-rnd-04-" #sz,
|
|
|
|
#define KMA_RAND_5(sz) KMA_RAND_4(sz) .name[KMALLOC_RANDOM_START + 5] = "kmalloc-rnd-05-" #sz,
|
|
|
|
#define KMA_RAND_6(sz) KMA_RAND_5(sz) .name[KMALLOC_RANDOM_START + 6] = "kmalloc-rnd-06-" #sz,
|
|
|
|
#define KMA_RAND_7(sz) KMA_RAND_6(sz) .name[KMALLOC_RANDOM_START + 7] = "kmalloc-rnd-07-" #sz,
|
|
|
|
#define KMA_RAND_8(sz) KMA_RAND_7(sz) .name[KMALLOC_RANDOM_START + 8] = "kmalloc-rnd-08-" #sz,
|
|
|
|
#define KMA_RAND_9(sz) KMA_RAND_8(sz) .name[KMALLOC_RANDOM_START + 9] = "kmalloc-rnd-09-" #sz,
|
|
|
|
#define KMA_RAND_10(sz) KMA_RAND_9(sz) .name[KMALLOC_RANDOM_START + 10] = "kmalloc-rnd-10-" #sz,
|
|
|
|
#define KMA_RAND_11(sz) KMA_RAND_10(sz) .name[KMALLOC_RANDOM_START + 11] = "kmalloc-rnd-11-" #sz,
|
|
|
|
#define KMA_RAND_12(sz) KMA_RAND_11(sz) .name[KMALLOC_RANDOM_START + 12] = "kmalloc-rnd-12-" #sz,
|
|
|
|
#define KMA_RAND_13(sz) KMA_RAND_12(sz) .name[KMALLOC_RANDOM_START + 13] = "kmalloc-rnd-13-" #sz,
|
|
|
|
#define KMA_RAND_14(sz) KMA_RAND_13(sz) .name[KMALLOC_RANDOM_START + 14] = "kmalloc-rnd-14-" #sz,
|
|
|
|
#define KMA_RAND_15(sz) KMA_RAND_14(sz) .name[KMALLOC_RANDOM_START + 15] = "kmalloc-rnd-15-" #sz,
|
|
|
|
#else // CONFIG_RANDOM_KMALLOC_CACHES
|
|
|
|
#define KMALLOC_RANDOM_NAME(N, sz)
|
|
|
|
#endif
|
|
|
|
|
2019-11-30 17:49:21 -08:00
|
|
|
#define INIT_KMALLOC_INFO(__size, __short_size) \
|
|
|
|
{ \
|
|
|
|
.name[KMALLOC_NORMAL] = "kmalloc-" #__short_size, \
|
2022-11-15 18:19:28 +01:00
|
|
|
KMALLOC_RCL_NAME(__short_size) \
|
2021-06-28 19:37:38 -07:00
|
|
|
KMALLOC_CGROUP_NAME(__short_size) \
|
|
|
|
KMALLOC_DMA_NAME(__short_size) \
|
2023-07-14 14:44:22 +08:00
|
|
|
KMALLOC_RANDOM_NAME(RANDOM_KMALLOC_CACHES_NR, __short_size) \
|
2019-11-30 17:49:21 -08:00
|
|
|
.size = __size, \
|
|
|
|
}
|
|
|
|
|
2015-06-24 16:55:54 -07:00
|
|
|
/*
|
2023-12-15 11:41:48 +08:00
|
|
|
* kmalloc_info[] is to make slab_debug=,kmalloc-xx option work at boot time.
|
2022-08-17 19:18:19 +09:00
|
|
|
* kmalloc_index() supports up to 2^21=2MB, so the final entry of the table is
|
|
|
|
* kmalloc-2M.
|
2015-06-24 16:55:54 -07:00
|
|
|
*/
|
2017-02-22 15:41:05 -08:00
|
|
|
const struct kmalloc_info_struct kmalloc_info[] __initconst = {
|
2019-11-30 17:49:21 -08:00
|
|
|
INIT_KMALLOC_INFO(0, 0),
|
|
|
|
INIT_KMALLOC_INFO(96, 96),
|
|
|
|
INIT_KMALLOC_INFO(192, 192),
|
|
|
|
INIT_KMALLOC_INFO(8, 8),
|
|
|
|
INIT_KMALLOC_INFO(16, 16),
|
|
|
|
INIT_KMALLOC_INFO(32, 32),
|
|
|
|
INIT_KMALLOC_INFO(64, 64),
|
|
|
|
INIT_KMALLOC_INFO(128, 128),
|
|
|
|
INIT_KMALLOC_INFO(256, 256),
|
|
|
|
INIT_KMALLOC_INFO(512, 512),
|
|
|
|
INIT_KMALLOC_INFO(1024, 1k),
|
|
|
|
INIT_KMALLOC_INFO(2048, 2k),
|
|
|
|
INIT_KMALLOC_INFO(4096, 4k),
|
|
|
|
INIT_KMALLOC_INFO(8192, 8k),
|
|
|
|
INIT_KMALLOC_INFO(16384, 16k),
|
|
|
|
INIT_KMALLOC_INFO(32768, 32k),
|
|
|
|
INIT_KMALLOC_INFO(65536, 64k),
|
|
|
|
INIT_KMALLOC_INFO(131072, 128k),
|
|
|
|
INIT_KMALLOC_INFO(262144, 256k),
|
|
|
|
INIT_KMALLOC_INFO(524288, 512k),
|
|
|
|
INIT_KMALLOC_INFO(1048576, 1M),
|
2022-08-17 19:18:19 +09:00
|
|
|
INIT_KMALLOC_INFO(2097152, 2M)
|
2015-06-24 16:55:54 -07:00
|
|
|
};
|
|
|
|
|
2013-01-10 19:12:17 +00:00
|
|
|
/*
|
2015-06-24 16:55:57 -07:00
|
|
|
* Patch up the size_index table if we have strange large alignment
|
|
|
|
* requirements for the kmalloc array. This is only the case for
|
|
|
|
* MIPS it seems. The standard arches will not generate any code here.
|
|
|
|
*
|
|
|
|
* Largest permitted alignment is 256 bytes due to the way we
|
|
|
|
* handle the index determination for the smaller caches.
|
|
|
|
*
|
|
|
|
* Make sure that nothing crazy happens if someone starts tinkering
|
|
|
|
* around with ARCH_KMALLOC_MINALIGN
|
2013-01-10 19:12:17 +00:00
|
|
|
*/
|
2015-06-24 16:55:57 -07:00
|
|
|
void __init setup_kmalloc_cache_index_table(void)
|
2013-01-10 19:12:17 +00:00
|
|
|
{
|
2018-04-05 16:20:44 -07:00
|
|
|
unsigned int i;
|
2013-01-10 19:12:17 +00:00
|
|
|
|
2013-01-10 19:14:19 +00:00
|
|
|
BUILD_BUG_ON(KMALLOC_MIN_SIZE > 256 ||
|
2022-02-17 17:16:09 +08:00
|
|
|
!is_power_of_2(KMALLOC_MIN_SIZE));
|
2013-01-10 19:14:19 +00:00
|
|
|
|
|
|
|
for (i = 8; i < KMALLOC_MIN_SIZE; i += 8) {
|
2018-04-05 16:20:44 -07:00
|
|
|
unsigned int elem = size_index_elem(i);
|
2013-01-10 19:14:19 +00:00
|
|
|
|
2023-11-13 12:02:02 +01:00
|
|
|
if (elem >= ARRAY_SIZE(kmalloc_size_index))
|
2013-01-10 19:14:19 +00:00
|
|
|
break;
|
2023-11-13 12:02:02 +01:00
|
|
|
kmalloc_size_index[elem] = KMALLOC_SHIFT_LOW;
|
2013-01-10 19:14:19 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (KMALLOC_MIN_SIZE >= 64) {
|
|
|
|
/*
|
2022-01-14 14:09:25 -08:00
|
|
|
* The 96 byte sized cache is not used if the alignment
|
2013-01-10 19:14:19 +00:00
|
|
|
* is 64 byte.
|
|
|
|
*/
|
|
|
|
for (i = 64 + 8; i <= 96; i += 8)
|
2023-11-13 12:02:02 +01:00
|
|
|
kmalloc_size_index[size_index_elem(i)] = 7;
|
2013-01-10 19:14:19 +00:00
|
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
if (KMALLOC_MIN_SIZE >= 128) {
|
|
|
|
/*
|
|
|
|
* The 192 byte sized cache is not used if the alignment
|
|
|
|
* is 128 byte. Redirect kmalloc to use the 256 byte cache
|
|
|
|
* instead.
|
|
|
|
*/
|
|
|
|
for (i = 128 + 8; i <= 192; i += 8)
|
2023-11-13 12:02:02 +01:00
|
|
|
kmalloc_size_index[size_index_elem(i)] = 8;
|
2013-01-10 19:14:19 +00:00
|
|
|
}
|
2015-06-24 16:55:57 -07:00
|
|
|
}
|
|
|
|
|
2023-06-12 16:31:48 +01:00
|
|
|
static unsigned int __kmalloc_minalign(void)
|
|
|
|
{
|
2023-10-06 17:39:34 +01:00
|
|
|
unsigned int minalign = dma_get_cache_alignment();
|
|
|
|
|
2023-08-01 08:23:57 +02:00
|
|
|
if (IS_ENABLED(CONFIG_DMA_BOUNCE_UNALIGNED_KMALLOC) &&
|
|
|
|
is_swiotlb_allocated())
|
2023-10-06 17:39:34 +01:00
|
|
|
minalign = ARCH_KMALLOC_MINALIGN;
|
|
|
|
|
|
|
|
return max(minalign, arch_slab_minalign());
|
2023-06-12 16:31:48 +01:00
|
|
|
}
|
|
|
|
|
2024-01-30 09:41:07 +08:00
|
|
|
static void __init
|
|
|
|
new_kmalloc_cache(int idx, enum kmalloc_cache_type type)
|
2015-06-29 09:28:08 -05:00
|
|
|
{
|
2024-01-30 09:41:07 +08:00
|
|
|
slab_flags_t flags = 0;
|
2023-06-12 16:31:48 +01:00
|
|
|
unsigned int minalign = __kmalloc_minalign();
|
|
|
|
unsigned int aligned_size = kmalloc_info[idx].size;
|
|
|
|
int aligned_idx = idx;
|
|
|
|
|
2022-11-15 18:19:28 +01:00
|
|
|
if ((KMALLOC_RECLAIM != KMALLOC_NORMAL) && (type == KMALLOC_RECLAIM)) {
|
mm, slab/slub: introduce kmalloc-reclaimable caches
Kmem caches can be created with a SLAB_RECLAIM_ACCOUNT flag, which
indicates they contain objects which can be reclaimed under memory
pressure (typically through a shrinker). This makes the slab pages
accounted as NR_SLAB_RECLAIMABLE in vmstat, which is also reflected in the
MemAvailable meminfo counter and in overcommit decisions. The slab pages
are also allocated with __GFP_RECLAIMABLE, which is good for
anti-fragmentation through grouping pages by mobility.
The generic kmalloc-X caches are created without this flag, but are
sometimes also used for objects that can be reclaimed, which, due to their
varying size, cannot have a dedicated kmem cache with the
SLAB_RECLAIM_ACCOUNT flag. A prominent example is dcache external names,
which prompted the creation
of a new, manually managed vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES
in commit f1782c9bc547 ("dcache: account external names as indirectly
reclaimable memory").
To better handle this and any other similar cases, this patch introduces
SLAB_RECLAIM_ACCOUNT variants of kmalloc caches, named kmalloc-rcl-X.
They are used whenever the kmalloc() call passes __GFP_RECLAIMABLE among
gfp flags. They are added to the kmalloc_caches array as a new type.
Allocations with both __GFP_DMA and __GFP_RECLAIMABLE will use a dma type
cache.
This change only applies to SLAB and SLUB, not SLOB. This is fine, since
SLOB targets tiny systems, and this patch does add some overhead in the
form of kmem management objects.
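A hedged illustration of the behaviour described above (the helper below is
made up for the example): passing __GFP_RECLAIMABLE steers a variable-sized
allocation into a kmalloc-rcl-<n> cache instead of the plain kmalloc-<n> one,
so it is accounted as reclaimable:
#include <linux/slab.h>
#include <linux/string.h>

/* Names cached behind a shrinker: allocate them as reclaimable. */
static char *dup_reclaimable_name(const char *src)
{
	size_t len = strlen(src) + 1;
	char *name = kmalloc(len, GFP_KERNEL | __GFP_RECLAIMABLE);

	if (name)
		memcpy(name, src, len);
	return name;
}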
Link: http://lkml.kernel.org/r/20180731090649.16028-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 15:05:38 -07:00
|
|
|
flags |= SLAB_RECLAIM_ACCOUNT;
|
2024-07-01 11:31:15 -04:00
|
|
|
} else if (IS_ENABLED(CONFIG_MEMCG) && (type == KMALLOC_CGROUP)) {
|
2022-01-14 14:05:29 -08:00
|
|
|
if (mem_cgroup_kmem_disabled()) {
|
2021-06-28 19:37:38 -07:00
|
|
|
kmalloc_caches[type][idx] = kmalloc_caches[KMALLOC_NORMAL][idx];
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
flags |= SLAB_ACCOUNT;
|
mm/slab_common: move dma-kmalloc caches creation into new_kmalloc_cache()
There are four types of kmalloc_caches: KMALLOC_NORMAL, KMALLOC_CGROUP,
KMALLOC_RECLAIM, and KMALLOC_DMA. While the first three types are
created using new_kmalloc_cache(), KMALLOC_DMA caches are created in a
separate logic. Let KMALLOC_DMA caches be also created using
new_kmalloc_cache(), to enhance readability.
Historically, there were only KMALLOC_NORMAL caches and KMALLOC_DMA
caches in the first place, and they were initialized in two separate
logics. However, when KMALLOC_RECLAIM was introduced in v4.20 via
commit 1291523f2c1d ("mm, slab/slub: introduce kmalloc-reclaimable
caches") and KMALLOC_CGROUP was introduced in v5.14 via
commit 494c1dfe855e ("mm: memcg/slab: create a new set of kmalloc-cg-<n>
caches"), their creations were merged with KMALLOC_NORMAL's only.
KMALLOC_DMA creation logic should be merged with them, too.
By merging KMALLOC_DMA initialization with other types, the following
two changes might occur:
1. The order in which dma-kmalloc-<n> caches are added to the slab_caches
list may now be sorted by size, i.e. the order in which they appear in
/proc/slabinfo may change as well.
2. slab_state will be set to UP after KMALLOC_DMA is created.
In the case of SLUB, freelist randomization depends on slab_state >= UP,
and therefore the KMALLOC_DMA caches' freelists will not be randomized at
creation time; that work is deferred to init_freelist_randomization().
Co-developed-by: JaeSang Yoo <jsyoo5b@gmail.com>
Signed-off-by: JaeSang Yoo <jsyoo5b@gmail.com>
Signed-off-by: Ohhoon Kwon <ohkwon1043@gmail.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220410162511.656541-1-ohkwon1043@gmail.com
2022-04-11 01:25:11 +09:00
|
|
|
} else if (IS_ENABLED(CONFIG_ZONE_DMA) && (type == KMALLOC_DMA)) {
|
|
|
|
flags |= SLAB_CACHE_DMA;
|
2021-06-28 19:37:38 -07:00
|
|
|
}
|
2018-10-26 15:05:38 -07:00
|
|
|
|
2023-07-14 14:44:22 +08:00
|
|
|
#ifdef CONFIG_RANDOM_KMALLOC_CACHES
|
|
|
|
if (type >= KMALLOC_RANDOM_START && type <= KMALLOC_RANDOM_END)
|
|
|
|
flags |= SLAB_NO_MERGE;
|
|
|
|
#endif
|
|
|
|
|
2021-06-28 19:37:41 -07:00
|
|
|
/*
|
2024-07-01 11:31:15 -04:00
|
|
|
* If CONFIG_MEMCG is enabled, disable cache merging for
|
2021-06-28 19:37:41 -07:00
|
|
|
* KMALLOC_NORMAL caches.
|
|
|
|
*/
|
2024-07-01 11:31:15 -04:00
|
|
|
if (IS_ENABLED(CONFIG_MEMCG) && (type == KMALLOC_NORMAL))
|
2023-06-13 12:28:21 +02:00
|
|
|
flags |= SLAB_NO_MERGE;
|
|
|
|
|
2023-06-12 16:31:48 +01:00
|
|
|
if (minalign > ARCH_KMALLOC_MINALIGN) {
|
|
|
|
aligned_size = ALIGN(aligned_size, minalign);
|
|
|
|
aligned_idx = __kmalloc_index(aligned_size, false);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!kmalloc_caches[type][aligned_idx])
|
|
|
|
kmalloc_caches[type][aligned_idx] = create_kmalloc_cache(
|
|
|
|
kmalloc_info[aligned_idx].name[type],
|
|
|
|
aligned_size, flags);
|
|
|
|
if (idx != aligned_idx)
|
|
|
|
kmalloc_caches[type][idx] = kmalloc_caches[type][aligned_idx];
|
2015-06-29 09:28:08 -05:00
|
|
|
}
|
|
|
|
|
2015-06-24 16:55:57 -07:00
|
|
|
/*
|
|
|
|
* Create the kmalloc array. Some of the regular kmalloc arrays
|
|
|
|
* may already have been created because they were needed to
|
|
|
|
* enable allocations for slab creation.
|
|
|
|
*/
|
2024-01-30 09:41:07 +08:00
|
|
|
void __init create_kmalloc_caches(void)
|
2015-06-24 16:55:57 -07:00
|
|
|
{
|
2019-11-30 17:49:28 -08:00
|
|
|
int i;
|
|
|
|
enum kmalloc_cache_type type;
|
2015-06-24 16:55:57 -07:00
|
|
|
|
2021-06-28 19:37:38 -07:00
|
|
|
/*
|
2024-07-01 11:31:15 -04:00
|
|
|
* Including KMALLOC_CGROUP if CONFIG_MEMCG defined
|
2021-06-28 19:37:38 -07:00
|
|
|
*/
|
2022-04-11 01:25:11 +09:00
|
|
|
for (type = KMALLOC_NORMAL; type < NR_KMALLOC_TYPES; type++) {
|
2024-04-24 23:04:21 +09:00
|
|
|
/* Caches that are NOT of the two-to-the-power-of size. */
|
2024-04-24 23:04:22 +09:00
|
|
|
if (KMALLOC_MIN_SIZE <= 32)
|
2024-04-24 23:04:21 +09:00
|
|
|
new_kmalloc_cache(1, type);
|
2024-04-24 23:04:22 +09:00
|
|
|
if (KMALLOC_MIN_SIZE <= 64)
|
2024-04-24 23:04:21 +09:00
|
|
|
new_kmalloc_cache(2, type);
|
|
|
|
|
|
|
|
/* Caches that are of the two-to-the-power-of size. */
|
2024-04-24 23:04:22 +09:00
|
|
|
for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++)
|
|
|
|
new_kmalloc_cache(i, type);
|
2013-05-03 18:04:18 +00:00
|
|
|
}
|
2023-07-14 14:44:22 +08:00
|
|
|
#ifdef CONFIG_RANDOM_KMALLOC_CACHES
|
|
|
|
random_kmalloc_seed = get_random_u64();
|
|
|
|
#endif
|
2013-05-03 18:04:18 +00:00
|
|
|
|
2013-01-10 19:12:17 +00:00
|
|
|
/* Kmalloc array is now usable */
|
|
|
|
slab_state = UP;
|
mm/slab: Introduce kmem_buckets_create() and family
Dedicated caches are available for fixed size allocations via
kmem_cache_alloc(), but for dynamically sized allocations there is only
the global kmalloc API's set of buckets available. This means it isn't
possible to separate specific sets of dynamically sized allocations into
a separate collection of caches.
This leads to a use-after-free exploitation weakness in the Linux
kernel since many heap memory spraying/grooming attacks depend on using
userspace-controllable dynamically sized allocations to collide with
fixed size allocations that end up in the same cache.
While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
against these kinds of "type confusion" attacks, including for fixed
same-size heap objects, we can create a complementary deterministic
defense for dynamically sized allocations that are directly user
controlled. Addressing these cases is limited in scope, so isolating these
kinds of interfaces will not become an unbounded game of whack-a-mole. For
example, many pass through memdup_user(), making isolation there very
effective.
In order to isolate user-controllable dynamically-sized
allocations from the common system kmalloc allocations, introduce
kmem_buckets_create(), which behaves like kmem_cache_create(). Introduce
kmem_buckets_alloc(), which behaves like kmem_cache_alloc(). Introduce
kmem_buckets_alloc_track_caller() for where caller tracking is
needed. Introduce kmem_buckets_valloc() for cases where vmalloc fallback
is needed. Note that these caches are specifically flagged with
SLAB_NO_MERGE, since merging would defeat the entire purpose of the
mitigation.
This can also be used in the future to extend allocation profiling's use
of code tagging to implement per-caller allocation cache isolation[1]
even for dynamic allocations.
Memory allocation pinning[2] is still needed to plug the Use-After-Free
cross-allocator weakness (where attackers can arrange to free an
entire slab page and have it reallocated to a different cache),
but that is an existing and separate issue which is complementary
to this improvement. Development continues for that feature via the
SLAB_VIRTUAL[3] series (which could also provide guard pages -- another
complementary improvement).
Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
Link: https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@google.com/ [3]
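A minimal usage sketch under the description above (the subsystem name and
the exact kmem_buckets_create() parameter list here are assumptions for
illustration, not quoted from the patch):
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/slab.h>

static kmem_buckets *blob_buckets;	/* hypothetical per-subsystem bucket set */

static int __init blob_init(void)
{
	/* a never-merged set of kmalloc buckets for user-sized allocations */
	blob_buckets = kmem_buckets_create("blob_data", SLAB_ACCOUNT, 0, 0, NULL);
	return blob_buckets ? 0 : -ENOMEM;
}

static void *blob_alloc(size_t user_len)
{
	/* behaves like kmalloc(), but isolated from the global kmalloc buckets */
	return kmem_buckets_alloc(blob_buckets, user_len, GFP_KERNEL);
}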
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-07-01 12:13:01 -07:00
|
|
|
|
|
|
|
if (IS_ENABLED(CONFIG_SLAB_BUCKETS))
|
|
|
|
kmem_buckets_cache = kmem_cache_create("kmalloc_buckets",
|
|
|
|
sizeof(kmem_buckets),
|
|
|
|
0, SLAB_NO_MERGE, NULL);
|
2013-01-10 19:12:17 +00:00
|
|
|
}
|
2022-08-17 19:18:19 +09:00
|
|
|
|
2022-09-29 11:30:55 +02:00
|
|
|
/**
|
|
|
|
* __ksize -- Report full size of underlying allocation
|
2022-10-31 10:29:20 +01:00
|
|
|
* @object: pointer to the object
|
2022-09-29 11:30:55 +02:00
|
|
|
*
|
|
|
|
* This should only be used internally to query the true size of allocations.
|
|
|
|
* It is not meant to be a way to discover the usable size of an allocation
|
|
|
|
* after the fact. Instead, use kmalloc_size_roundup(). Using memory beyond
|
|
|
|
* the originally requested allocation size may trigger KASAN, UBSAN_BOUNDS,
|
|
|
|
* and/or FORTIFY_SOURCE.
|
|
|
|
*
|
2022-10-31 10:29:20 +01:00
|
|
|
* Return: size of the actual memory used by @object in bytes
|
2022-09-29 11:30:55 +02:00
|
|
|
*/
|
2022-08-17 19:18:21 +09:00
|
|
|
size_t __ksize(const void *object)
|
|
|
|
{
|
|
|
|
struct folio *folio;
|
|
|
|
|
|
|
|
if (unlikely(object == ZERO_SIZE_PTR))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
folio = virt_to_folio(object);
|
|
|
|
|
2022-08-17 19:18:26 +09:00
|
|
|
if (unlikely(!folio_test_slab(folio))) {
|
|
|
|
if (WARN_ON(folio_size(folio) <= KMALLOC_MAX_CACHE_SIZE))
|
|
|
|
return 0;
|
|
|
|
if (WARN_ON(object != folio_address(folio)))
|
|
|
|
return 0;
|
2022-08-17 19:18:21 +09:00
|
|
|
return folio_size(folio);
|
2022-08-17 19:18:26 +09:00
|
|
|
}
|
2022-08-17 19:18:21 +09:00
|
|
|
|
mm/slub: extend redzone check to extra allocated kmalloc space than requested
kmalloc rounds the request size up to a fixed size (mostly a power of 2),
so there can be extra space beyond what was requested; its size is the
actual buffer size minus the original request size.
To better detect out-of-bounds accesses or abuse of this space, add a
redzone sanity check for it.
In the current kernel, some kmalloc users already know this space exists
and utilize it after calling ksize() to learn the real size of the
allocated buffer. So we skip the sanity check for objects on which
ksize() has been called, treating them as legitimate users. Kees Cook is
working on sanitizing all these use cases by switching them to
kmalloc_size_roundup() to avoid ambiguous usages. Once that is done, this
special handling for ksize() can be removed.
In some cases, the free pointer can be stored inside the latter part of
the object data area, which may overlap the redzone (for small kmalloc
object sizes). As suggested by Hyeonggon Yoo, force the free pointer into
the metadata area when kmalloc redzone debugging is enabled, so that all
kmalloc objects are covered by the redzone check.
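A small, hedged illustration of the layout being checked (sizes are examples
only):
#include <linux/slab.h>

static void redzone_example(void)
{
	/*
	 * A 100-byte request lands in the 128-byte kmalloc bucket. With
	 * slub_debug redzoning, bytes 100..127 hold the redzone pattern and
	 * are verified on free -- unless ksize() has been called on the
	 * object, in which case the whole 128 bytes may legitimately be used.
	 */
	char *p = kmalloc(100, GFP_KERNEL);

	kfree(p);
}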
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-10-21 11:24:05 +08:00
|
|
|
#ifdef CONFIG_SLUB_DEBUG
|
|
|
|
skip_orig_size_check(folio_slab(folio)->slab_cache, object);
|
|
|
|
#endif
|
|
|
|
|
2022-08-17 19:18:21 +09:00
|
|
|
return slab_ksize(folio_slab(folio)->slab_cache);
|
|
|
|
}
|
2022-08-17 19:18:22 +09:00
|
|
|
|
2020-08-06 23:18:28 -07:00
|
|
|
gfp_t kmalloc_fix_flags(gfp_t flags)
|
|
|
|
{
|
|
|
|
gfp_t invalid_mask = flags & GFP_SLAB_BUG_MASK;
|
|
|
|
|
|
|
|
flags &= ~GFP_SLAB_BUG_MASK;
|
|
|
|
pr_warn("Unexpected gfp: %#x (%pGg). Fixing up to gfp: %#x (%pGg). Fix your code!\n",
|
|
|
|
invalid_mask, &invalid_mask, flags, &flags);
|
|
|
|
dump_stack();
|
|
|
|
|
|
|
|
return flags;
|
|
|
|
}
|
|
|
|
|
2016-07-26 15:21:56 -07:00
|
|
|
#ifdef CONFIG_SLAB_FREELIST_RANDOM
|
|
|
|
/* Randomize a generic freelist */
|
2023-04-16 20:22:55 +03:00
|
|
|
static void freelist_randomize(unsigned int *list,
|
2018-04-05 16:21:46 -07:00
|
|
|
unsigned int count)
|
2016-07-26 15:21:56 -07:00
|
|
|
{
|
|
|
|
unsigned int rand;
|
2018-04-05 16:21:46 -07:00
|
|
|
unsigned int i;
|
2016-07-26 15:21:56 -07:00
|
|
|
|
|
|
|
for (i = 0; i < count; i++)
|
|
|
|
list[i] = i;
|
|
|
|
|
|
|
|
/* Fisher-Yates shuffle */
|
|
|
|
for (i = count - 1; i > 0; i--) {
|
2023-04-16 20:22:55 +03:00
|
|
|
rand = get_random_u32_below(i + 1);
|
2016-07-26 15:21:56 -07:00
|
|
|
swap(list[i], list[rand]);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Create a random sequence per cache */
|
|
|
|
int cache_random_seq_create(struct kmem_cache *cachep, unsigned int count,
|
|
|
|
gfp_t gfp)
|
|
|
|
{
|
|
|
|
|
|
|
|
if (count < 2 || cachep->random_seq)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
cachep->random_seq = kcalloc(count, sizeof(unsigned int), gfp);
|
|
|
|
if (!cachep->random_seq)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2023-04-16 20:22:55 +03:00
|
|
|
freelist_randomize(cachep->random_seq, count);
|
2016-07-26 15:21:56 -07:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Destroy the per-cache random freelist sequence */
|
|
|
|
void cache_random_seq_destroy(struct kmem_cache *cachep)
|
|
|
|
{
|
|
|
|
kfree(cachep->random_seq);
|
|
|
|
cachep->random_seq = NULL;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_SLAB_FREELIST_RANDOM */
|
|
|
|
|
2023-10-02 17:43:38 +02:00
|
|
|
#ifdef CONFIG_SLUB_DEBUG
|
2018-06-14 15:27:58 -07:00
|
|
|
#define SLABINFO_RIGHTS (0400)
|
2013-07-04 08:33:24 +08:00
|
|
|
|
2014-12-10 15:44:19 -08:00
|
|
|
static void print_slabinfo_header(struct seq_file *m)
|
2012-10-19 18:20:26 +04:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Output format version, so at least we can change it
|
|
|
|
* without _too_ many complaints.
|
|
|
|
*/
|
|
|
|
seq_puts(m, "slabinfo - version: 2.1\n");
|
2016-03-17 14:19:47 -07:00
|
|
|
seq_puts(m, "# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>");
|
2012-10-19 18:20:26 +04:00
|
|
|
seq_puts(m, " : tunables <limit> <batchcount> <sharedfactor>");
|
|
|
|
seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>");
|
|
|
|
seq_putc(m, '\n');
|
|
|
|
}
|
|
|
|
|
2022-01-14 14:04:01 -08:00
|
|
|
static void *slab_start(struct seq_file *m, loff_t *pos)
|
2012-10-19 18:20:25 +04:00
|
|
|
{
|
|
|
|
mutex_lock(&slab_mutex);
|
2020-08-06 23:21:20 -07:00
|
|
|
return seq_list_start(&slab_caches, *pos);
|
2012-10-19 18:20:25 +04:00
|
|
|
}
|
|
|
|
|
2022-01-14 14:04:01 -08:00
|
|
|
static void *slab_next(struct seq_file *m, void *p, loff_t *pos)
|
2012-10-19 18:20:25 +04:00
|
|
|
{
|
2020-08-06 23:21:20 -07:00
|
|
|
return seq_list_next(p, &slab_caches, pos);
|
2012-10-19 18:20:25 +04:00
|
|
|
}
|
|
|
|
|
2022-01-14 14:04:01 -08:00
|
|
|
static void slab_stop(struct seq_file *m, void *p)
|
2012-10-19 18:20:25 +04:00
|
|
|
{
|
|
|
|
mutex_unlock(&slab_mutex);
|
|
|
|
}
|
|
|
|
|
2014-12-10 15:44:19 -08:00
|
|
|
static void cache_show(struct kmem_cache *s, struct seq_file *m)
|
2012-10-19 18:20:25 +04:00
|
|
|
{
|
2012-10-19 18:20:27 +04:00
|
|
|
struct slabinfo sinfo;
|
|
|
|
|
|
|
|
memset(&sinfo, 0, sizeof(sinfo));
|
|
|
|
get_slabinfo(s, &sinfo);
|
|
|
|
|
|
|
|
seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d",
|
2020-08-06 23:21:27 -07:00
|
|
|
s->name, sinfo.active_objs, sinfo.num_objs, s->size,
|
2012-10-19 18:20:27 +04:00
|
|
|
sinfo.objects_per_slab, (1 << sinfo.cache_order));
|
|
|
|
|
|
|
|
seq_printf(m, " : tunables %4u %4u %4u",
|
|
|
|
sinfo.limit, sinfo.batchcount, sinfo.shared);
|
|
|
|
seq_printf(m, " : slabdata %6lu %6lu %6lu",
|
|
|
|
sinfo.active_slabs, sinfo.num_slabs, sinfo.shared_avail);
|
|
|
|
seq_putc(m, '\n');
|
2012-10-19 18:20:25 +04:00
|
|
|
}
|
|
|
|
|
2014-12-10 15:42:16 -08:00
|
|
|
static int slab_show(struct seq_file *m, void *p)
|
2012-12-18 14:23:01 -08:00
|
|
|
{
|
2020-08-06 23:21:20 -07:00
|
|
|
struct kmem_cache *s = list_entry(p, struct kmem_cache, list);
|
2012-12-18 14:23:01 -08:00
|
|
|
|
2020-08-06 23:21:20 -07:00
|
|
|
if (p == slab_caches.next)
|
2014-12-10 15:42:16 -08:00
|
|
|
print_slabinfo_header(m);
|
2020-08-06 23:21:27 -07:00
|
|
|
cache_show(s, m);
|
2014-12-10 15:44:19 -08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-11-15 17:32:07 -08:00
|
|
|
void dump_unreclaimable_slab(void)
|
|
|
|
{
|
2020-12-14 19:03:47 -08:00
|
|
|
struct kmem_cache *s;
|
2017-11-15 17:32:07 -08:00
|
|
|
struct slabinfo sinfo;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Here acquiring slab_mutex is risky since we don't prefer to get
|
|
|
|
* sleep in oom path. But, without mutex hold, it may introduce a
|
|
|
|
* risk of crash.
|
|
|
|
* Use mutex_trylock to protect the list traverse, dump nothing
|
|
|
|
* without acquiring the mutex.
|
|
|
|
*/
|
|
|
|
if (!mutex_trylock(&slab_mutex)) {
|
|
|
|
pr_warn("excessive unreclaimable slab but cannot dump stats\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
pr_info("Unreclaimable slab info:\n");
|
|
|
|
pr_info("Name Used Total\n");
|
|
|
|
|
2020-12-14 19:03:47 -08:00
|
|
|
list_for_each_entry(s, &slab_caches, list) {
|
2020-08-06 23:21:27 -07:00
|
|
|
if (s->flags & SLAB_RECLAIM_ACCOUNT)
|
2017-11-15 17:32:07 -08:00
|
|
|
continue;
|
|
|
|
|
|
|
|
get_slabinfo(s, &sinfo);
|
|
|
|
|
|
|
|
if (sinfo.num_objs > 0)
|
2020-08-06 23:21:27 -07:00
|
|
|
pr_info("%-17s %10luKB %10luKB\n", s->name,
|
2017-11-15 17:32:07 -08:00
|
|
|
(sinfo.active_objs * s->size) / 1024,
|
|
|
|
(sinfo.num_objs * s->size) / 1024);
|
|
|
|
}
|
|
|
|
mutex_unlock(&slab_mutex);
|
|
|
|
}
|
|
|
|
|
2012-10-19 18:20:25 +04:00
|
|
|
/*
|
|
|
|
* slabinfo_op - iterator that generates /proc/slabinfo
|
|
|
|
*
|
|
|
|
* Output layout:
|
|
|
|
* cache-name
|
|
|
|
* num-active-objs
|
|
|
|
* total-objs
|
|
|
|
* object size
|
|
|
|
* num-active-slabs
|
|
|
|
* total-slabs
|
|
|
|
* num-pages-per-slab
|
|
|
|
* + further values on SMP and with statistics enabled
|
|
|
|
*/
|
|
|
|
static const struct seq_operations slabinfo_op = {
|
2014-12-10 15:42:16 -08:00
|
|
|
.start = slab_start,
|
2013-07-08 08:08:28 +08:00
|
|
|
.next = slab_next,
|
|
|
|
.stop = slab_stop,
|
2014-12-10 15:42:16 -08:00
|
|
|
.show = slab_show,
|
2012-10-19 18:20:25 +04:00
|
|
|
};
|
|
|
|
|
|
|
|
static int slabinfo_open(struct inode *inode, struct file *file)
|
|
|
|
{
|
|
|
|
return seq_open(file, &slabinfo_op);
|
|
|
|
}
|
|
|
|
|
2020-02-03 17:37:17 -08:00
|
|
|
static const struct proc_ops slabinfo_proc_ops = {
|
proc: faster open/read/close with "permanent" files
Now that "struct proc_ops" exist we can start putting there stuff which
could not fly with VFS "struct file_operations"...
Most of the fs/proc/inode.c file is dedicated to making open/read/.../close
reliable in the event of disappearing /proc entries, which usually happens
when a module is being removed. Files like /proc/cpuinfo, which never
disappear, simply do not need such protection.
Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
"permanent" files.
Enable "permanent" flag for
/proc/cpuinfo
/proc/kmsg
/proc/modules
/proc/slabinfo
/proc/stat
/proc/sysvipc/*
/proc/swaps
More will come once I figure out a foolproof way to prevent module
authors from marking their stuff "permanent" for performance reasons
when it is not.
This should help with scalability: benchmark is "read /proc/cpuinfo R times
by N threads scattered over the system".
N R t, s (before) t, s (after)
-----------------------------------------------------
64 4096 1.582458 1.530502 -3.2%
256 4096 6.371926 6.125168 -3.9%
1024 4096 25.64888 24.47528 -4.6%
Benchmark source:
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
int N;
const char *filename;
int R;
int xxx = 0;
int glue(int n)
{
cpu_set_t m;
CPU_ZERO(&m);
CPU_SET(n, &m);
return sched_setaffinity(0, sizeof(cpu_set_t), &m);
}
void f(int n)
{
glue(n % NR_CPUS);
while (*(volatile int *)&xxx == 0) {
}
for (int i = 0; i < R; i++) {
int fd = open(filename, O_RDONLY);
char buf[4096];
ssize_t rv = read(fd, buf, sizeof(buf));
asm volatile ("" :: "g" (rv));
close(fd);
}
}
int main(int argc, char *argv[])
{
if (argc < 4) {
std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R
";
return 1;
}
N = atoi(argv[1]);
filename = argv[2];
R = atoi(argv[3]);
for (int i = 0; i < NR_CPUS; i++) {
if (glue(i) == 0)
break;
}
std::vector<std::thread> T;
T.reserve(N);
for (int i = 0; i < N; i++) {
T.emplace_back(f, i);
}
auto t0 = std::chrono::system_clock::now();
{
*(volatile int *)&xxx = 1;
for (auto& t: T) {
t.join();
}
}
auto t1 = std::chrono::system_clock::now();
std::chrono::duration<double> dt = t1 - t0;
std::cout << dt.count() << '
';
return 0;
}
P.S.:
Explicit randomization marker is added because adding non-function pointer
will silently disable structure layout randomization.
[akpm@linux-foundation.org: coding style fixes]
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joe Perches <joe@perches.com>
Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-06 20:09:01 -07:00
|
|
|
.proc_flags = PROC_ENTRY_PERMANENT,
|
2020-02-03 17:37:17 -08:00
|
|
|
.proc_open = slabinfo_open,
|
|
|
|
.proc_read = seq_read,
|
|
|
|
.proc_lseek = seq_lseek,
|
|
|
|
.proc_release = seq_release,
|
2012-10-19 18:20:25 +04:00
|
|
|
};
|
|
|
|
|
|
|
|
static int __init slab_proc_init(void)
{
	proc_create("slabinfo", SLABINFO_RIGHTS, NULL, &slabinfo_proc_ops);
	return 0;
}
module_init(slab_proc_init);

#endif /* CONFIG_SLUB_DEBUG */

/**
 * kfree_sensitive - Clear sensitive information in memory before freeing
 * @p: object to free memory of
 *
 * The memory of the object @p points to is zeroed before it is freed.
 * If @p is %NULL, kfree_sensitive() does nothing.
 *
 * Note: this function zeroes the whole allocated buffer which can be a good
 * deal bigger than the requested buffer size passed to kmalloc(). So be
 * careful when using this function in performance sensitive code.
 */
void kfree_sensitive(const void *p)
{
	size_t ks;
	void *mem = (void *)p;

	ks = ksize(mem);
	if (ks) {
		kasan_unpoison_range(mem, ks);
		memzero_explicit(mem, ks);
	}
	kfree(mem);
}
EXPORT_SYMBOL(kfree_sensitive);

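A minimal caller sketch, assuming kernel context with <linux/slab.h> and <linux/string.h> available; the function and buffer names are illustrative only. It shows the intended pattern for short-lived secrets:

static int copy_and_use_key(const u8 *key, size_t len)
{
	u8 *tmp = kmemdup(key, len, GFP_KERNEL);

	if (!tmp)
		return -ENOMEM;

	/* ... derive session state from tmp ... */

	/* Zero the whole allocation (possibly larger than len), then free it. */
	kfree_sensitive(tmp);
	return 0;
}
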
size_t ksize(const void *objp)
{
	/*
	 * We need to first check that the pointer to the object is valid.
	 * The KASAN report printed from ksize() is more useful than one
	 * printed later, when the behaviour could be undefined due to
	 * a potential use-after-free or double-free.
	 *
	 * We use kasan_check_byte(), which is supported for the hardware
	 * tag-based KASAN mode, unlike kasan_check_read/write().
	 *
	 * If the pointed to memory is invalid, we return 0 to avoid users of
	 * ksize() writing to and potentially corrupting the memory region.
	 *
	 * We want to perform the check before __ksize(), to avoid potentially
	 * crashing in __ksize() due to accessing invalid metadata.
	 */
	if (unlikely(ZERO_OR_NULL_PTR(objp)) || !kasan_check_byte(objp))
		return 0;

	return kfence_ksize(objp) ?: __ksize(objp);
}
EXPORT_SYMBOL(ksize);

#ifdef CONFIG_BPF_SYSCALL
#include <linux/btf.h>

__bpf_kfunc_start_defs();

__bpf_kfunc struct kmem_cache *bpf_get_kmem_cache(u64 addr)
{
	struct slab *slab;

	if (!virt_addr_valid((void *)(long)addr))
		return NULL;

	slab = virt_to_slab((void *)(long)addr);
	return slab ? slab->slab_cache : NULL;
}

__bpf_kfunc_end_defs();
#endif /* CONFIG_BPF_SYSCALL */

/* Tracepoints definitions. */
EXPORT_TRACEPOINT_SYMBOL(kmalloc);
EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);
EXPORT_TRACEPOINT_SYMBOL(kfree);
EXPORT_TRACEPOINT_SYMBOL(kmem_cache_free);

#ifndef CONFIG_KVFREE_RCU_BATCHED

void kvfree_call_rcu(struct rcu_head *head, void *ptr)
{
	if (head) {
		kasan_record_aux_stack(ptr);
		call_rcu(head, kvfree_rcu_cb);
		return;
	}

	// kvfree_rcu(one_arg) call.
	might_sleep();
	synchronize_rcu();
	kvfree(ptr);
}
EXPORT_SYMBOL_GPL(kvfree_call_rcu);

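For reference, callers normally reach this function through the kvfree_rcu() macros rather than calling it directly. A minimal sketch of the two calling conventions, assuming kernel context with <linux/rcupdate.h> and <linux/slab.h>, a reasonably recent kernel that provides kvfree_rcu_mightsleep(), and illustrative struct/field names:

struct my_node {
	struct list_head link;
	struct rcu_head rcu;	/* needed for the two-argument form */
	int payload;
};

static void drop_node(struct my_node *n)
{
	/* Two-argument form: never sleeps, uses the embedded rcu_head. */
	kvfree_rcu(n, rcu);
}

static void drop_blob(void *blob)
{
	/* Single-argument form: may sleep, for objects with no rcu_head. */
	kvfree_rcu_mightsleep(blob);
}
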
void __init kvfree_rcu_init(void)
{
}

#else /* CONFIG_KVFREE_RCU_BATCHED */

/*
 * This rcu parameter is runtime-read-only. It reflects
 * a minimum allowed number of objects which can be cached
 * per-CPU. Object size is equal to one page. This value
 * can be changed at boot time.
 */
static int rcu_min_cached_objs = 5;
module_param(rcu_min_cached_objs, int, 0444);

// A page shrinker can ask for pages to be freed to make them
// available for other parts of the system. This usually happens
// under low memory conditions, and in that case we should also
// defer page-cache filling for a short time period.
//
// The default value is 5 seconds, which is long enough to reduce
// interference with the shrinker while it asks other systems to
// drain their caches.
static int rcu_delay_page_cache_fill_msec = 5000;
module_param(rcu_delay_page_cache_fill_msec, int, 0444);

static struct workqueue_struct *rcu_reclaim_wq;

/* Maximum number of jiffies to wait before draining a batch. */
#define KFREE_DRAIN_JIFFIES (5 * HZ)
#define KFREE_N_BATCHES 2
#define FREE_N_CHANNELS 2

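As a worked example of the drain interval (illustrative only, since HZ is a build-time configuration value):

/* With CONFIG_HZ=250: KFREE_DRAIN_JIFFIES = 5 * 250 = 1250 jiffies, i.e. five seconds. */
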
/**
 * struct kvfree_rcu_bulk_data - single block to store kvfree_rcu() pointers
 * @list: List node. All blocks are linked between each other
 * @gp_snap: Snapshot of RCU state for objects placed to this bulk
 * @nr_records: Number of active pointers in the array
 * @records: Array of the kvfree_rcu() pointers
 */
struct kvfree_rcu_bulk_data {
	struct list_head list;
	struct rcu_gp_oldstate gp_snap;
	unsigned long nr_records;
	void *records[] __counted_by(nr_records);
};

/*
 * This macro defines how many entries the "records" array
 * will contain. It is based on the fact that the size of
 * kvfree_rcu_bulk_data structure becomes exactly one page.
 */
#define KVFREE_BULK_MAX_ENTR \
	((PAGE_SIZE - sizeof(struct kvfree_rcu_bulk_data)) / sizeof(void *))

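To make the comment above concrete, a rough back-of-the-envelope count, assuming 4 KiB pages, a 64-bit kernel, struct rcu_gp_oldstate holding two unsigned longs, and no structure padding:

/*
 * Illustrative arithmetic only:
 *   sizeof(struct kvfree_rcu_bulk_data)
 *     = 16 (list_head) + 16 (rcu_gp_oldstate) + 8 (nr_records) = 40 bytes
 *   KVFREE_BULK_MAX_ENTR = (4096 - 40) / 8 = 507 pointers per page
 */
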
/**
 * struct kfree_rcu_cpu_work - single batch of kfree_rcu() requests
 * @rcu_work: Let queue_rcu_work() invoke workqueue handler after grace period
 * @head_free: List of kfree_rcu() objects waiting for a grace period
 * @head_free_gp_snap: Grace-period snapshot to check for attempted premature frees.
 * @bulk_head_free: Bulk-List of kvfree_rcu() objects waiting for a grace period
 * @krcp: Pointer to @kfree_rcu_cpu structure
 */
struct kfree_rcu_cpu_work {
	struct rcu_work rcu_work;
	struct rcu_head *head_free;
	struct rcu_gp_oldstate head_free_gp_snap;
	struct list_head bulk_head_free[FREE_N_CHANNELS];
	struct kfree_rcu_cpu *krcp;
};

/**
 * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace period
 * @head: List of kfree_rcu() objects not yet waiting for a grace period
 * @head_gp_snap: Snapshot of RCU state for objects placed to "@head"
 * @bulk_head: Bulk-List of kvfree_rcu() objects not yet waiting for a grace period
 * @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period
 * @lock: Synchronize access to this structure
 * @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES
 * @initialized: The @rcu_work fields have been initialized
 * @head_count: Number of objects in rcu_head singular list
 * @bulk_count: Number of objects in bulk-list
 * @bkvcache:
 *	A simple cache list that contains objects for reuse purpose.
 *	In order to save some per-cpu space the list is singular.
 *	Even though it is lockless an access has to be protected by the
 *	per-cpu lock.
 * @page_cache_work: A work to refill the cache when it is empty
 * @backoff_page_cache_fill: Delay cache refills
 * @work_in_progress: Indicates that page_cache_work is running
 * @hrtimer: A hrtimer for scheduling a page_cache_work
 * @nr_bkv_objs: number of allocated objects at @bkvcache.
 *
 * This is a per-CPU structure. The reason that it is not included in
 * the rcu_data structure is to permit this code to be extracted from
 * the RCU files. Such extraction could allow further optimization of
 * the interactions with the slab allocators.
 */
struct kfree_rcu_cpu {
	// Objects queued on a linked list
	// through their rcu_head structures.
	struct rcu_head *head;
	unsigned long head_gp_snap;
	atomic_t head_count;

	// Objects queued on a bulk-list.
	struct list_head bulk_head[FREE_N_CHANNELS];
	atomic_t bulk_count[FREE_N_CHANNELS];

	struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
	raw_spinlock_t lock;
	struct delayed_work monitor_work;
	bool initialized;

	struct delayed_work page_cache_work;
	atomic_t backoff_page_cache_fill;
	atomic_t work_in_progress;
	struct hrtimer hrtimer;

	struct llist_head bkvcache;
	int nr_bkv_objs;
};

static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
	.lock = __RAW_SPIN_LOCK_UNLOCKED(krc.lock),
};

static __always_inline void
debug_rcu_bhead_unqueue(struct kvfree_rcu_bulk_data *bhead)
{
#ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD
	int i;

	for (i = 0; i < bhead->nr_records; i++)
		debug_rcu_head_unqueue((struct rcu_head *)(bhead->records[i]));
#endif
}

static inline struct kfree_rcu_cpu *
krc_this_cpu_lock(unsigned long *flags)
{
	struct kfree_rcu_cpu *krcp;

	local_irq_save(*flags);	// For safely calling this_cpu_ptr().
	krcp = this_cpu_ptr(&krc);
	raw_spin_lock(&krcp->lock);

	return krcp;
}

static inline void
krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags)
{
	raw_spin_unlock_irqrestore(&krcp->lock, flags);
}

static inline struct kvfree_rcu_bulk_data *
get_cached_bnode(struct kfree_rcu_cpu *krcp)
{
	if (!krcp->nr_bkv_objs)
		return NULL;

	WRITE_ONCE(krcp->nr_bkv_objs, krcp->nr_bkv_objs - 1);
	return (struct kvfree_rcu_bulk_data *)
		llist_del_first(&krcp->bkvcache);
}

static inline bool
put_cached_bnode(struct kfree_rcu_cpu *krcp,
	struct kvfree_rcu_bulk_data *bnode)
{
	// Check the limit.
	if (krcp->nr_bkv_objs >= rcu_min_cached_objs)
		return false;

	llist_add((struct llist_node *) bnode, &krcp->bkvcache);
	WRITE_ONCE(krcp->nr_bkv_objs, krcp->nr_bkv_objs + 1);
	return true;
}

static int
drain_page_cache(struct kfree_rcu_cpu *krcp)
{
	unsigned long flags;
	struct llist_node *page_list, *pos, *n;
	int freed = 0;

	if (!rcu_min_cached_objs)
		return 0;

	raw_spin_lock_irqsave(&krcp->lock, flags);
	page_list = llist_del_all(&krcp->bkvcache);
	WRITE_ONCE(krcp->nr_bkv_objs, 0);
	raw_spin_unlock_irqrestore(&krcp->lock, flags);

	llist_for_each_safe(pos, n, page_list) {
		free_page((unsigned long)pos);
		freed++;
	}

	return freed;
}

static void
kvfree_rcu_bulk(struct kfree_rcu_cpu *krcp,
	struct kvfree_rcu_bulk_data *bnode, int idx)
{
	unsigned long flags;
	int i;

	if (!WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&bnode->gp_snap))) {
		debug_rcu_bhead_unqueue(bnode);
		rcu_lock_acquire(&rcu_callback_map);
		if (idx == 0) { // kmalloc() / kfree().
			trace_rcu_invoke_kfree_bulk_callback(
				"slab", bnode->nr_records,
				bnode->records);

			kfree_bulk(bnode->nr_records, bnode->records);
		} else { // vmalloc() / vfree().
			for (i = 0; i < bnode->nr_records; i++) {
				trace_rcu_invoke_kvfree_callback(
					"slab", bnode->records[i], 0);

				vfree(bnode->records[i]);
			}
		}
		rcu_lock_release(&rcu_callback_map);
	}

	raw_spin_lock_irqsave(&krcp->lock, flags);
	if (put_cached_bnode(krcp, bnode))
		bnode = NULL;
	raw_spin_unlock_irqrestore(&krcp->lock, flags);

	if (bnode)
		free_page((unsigned long) bnode);

	cond_resched_tasks_rcu_qs();
}

static void
kvfree_rcu_list(struct rcu_head *head)
{
	struct rcu_head *next;

	for (; head; head = next) {
		void *ptr = (void *) head->func;
		unsigned long offset = (void *) head - ptr;

		next = head->next;
		debug_rcu_head_unqueue((struct rcu_head *)ptr);
		rcu_lock_acquire(&rcu_callback_map);
		trace_rcu_invoke_kvfree_callback("slab", head, offset);

		kvfree(ptr);

		rcu_lock_release(&rcu_callback_map);
		cond_resched_tasks_rcu_qs();
	}
}

/*
 * This function is invoked in workqueue context after a grace period.
 * It frees all the objects queued on ->bulk_head_free or ->head_free.
 */
static void kfree_rcu_work(struct work_struct *work)
{
	unsigned long flags;
	struct kvfree_rcu_bulk_data *bnode, *n;
	struct list_head bulk_head[FREE_N_CHANNELS];
	struct rcu_head *head;
	struct kfree_rcu_cpu *krcp;
	struct kfree_rcu_cpu_work *krwp;
	struct rcu_gp_oldstate head_gp_snap;
	int i;

	krwp = container_of(to_rcu_work(work),
		struct kfree_rcu_cpu_work, rcu_work);
	krcp = krwp->krcp;

	raw_spin_lock_irqsave(&krcp->lock, flags);
	// Channels 1 and 2.
	for (i = 0; i < FREE_N_CHANNELS; i++)
		list_replace_init(&krwp->bulk_head_free[i], &bulk_head[i]);

	// Channel 3.
	head = krwp->head_free;
	krwp->head_free = NULL;
	head_gp_snap = krwp->head_free_gp_snap;
	raw_spin_unlock_irqrestore(&krcp->lock, flags);

	// Handle the first two channels.
	for (i = 0; i < FREE_N_CHANNELS; i++) {
		// Start from the tail page, so a GP is likely passed for it.
		list_for_each_entry_safe(bnode, n, &bulk_head[i], list)
			kvfree_rcu_bulk(krcp, bnode, i);
	}

	/*
	 * This is used when the "bulk" path can not be used for the
	 * double-argument of kvfree_rcu(). This happens when the
	 * page-cache is empty, which means that objects are instead
	 * queued on a linked list through their rcu_head structures.
	 * This list is named "Channel 3".
	 */
	if (head && !WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&head_gp_snap)))
		kvfree_rcu_list(head);
}

static bool
need_offload_krc(struct kfree_rcu_cpu *krcp)
{
	int i;

	for (i = 0; i < FREE_N_CHANNELS; i++)
		if (!list_empty(&krcp->bulk_head[i]))
			return true;

	return !!READ_ONCE(krcp->head);
}

static bool
need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp)
{
	int i;

	for (i = 0; i < FREE_N_CHANNELS; i++)
		if (!list_empty(&krwp->bulk_head_free[i]))
			return true;

	return !!krwp->head_free;
}

static int krc_count(struct kfree_rcu_cpu *krcp)
{
	int sum = atomic_read(&krcp->head_count);
	int i;

	for (i = 0; i < FREE_N_CHANNELS; i++)
		sum += atomic_read(&krcp->bulk_count[i]);

	return sum;
}

static void
__schedule_delayed_monitor_work(struct kfree_rcu_cpu *krcp)
{
	long delay, delay_left;

	delay = krc_count(krcp) >= KVFREE_BULK_MAX_ENTR ? 1 : KFREE_DRAIN_JIFFIES;
	if (delayed_work_pending(&krcp->monitor_work)) {
		delay_left = krcp->monitor_work.timer.expires - jiffies;
		if (delay < delay_left)
			mod_delayed_work(rcu_reclaim_wq, &krcp->monitor_work, delay);
		return;
	}
	queue_delayed_work(rcu_reclaim_wq, &krcp->monitor_work, delay);
}

static void
schedule_delayed_monitor_work(struct kfree_rcu_cpu *krcp)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&krcp->lock, flags);
	__schedule_delayed_monitor_work(krcp);
	raw_spin_unlock_irqrestore(&krcp->lock, flags);
}

static void
kvfree_rcu_drain_ready(struct kfree_rcu_cpu *krcp)
{
	struct list_head bulk_ready[FREE_N_CHANNELS];
	struct kvfree_rcu_bulk_data *bnode, *n;
	struct rcu_head *head_ready = NULL;
	unsigned long flags;
	int i;

	raw_spin_lock_irqsave(&krcp->lock, flags);
	for (i = 0; i < FREE_N_CHANNELS; i++) {
		INIT_LIST_HEAD(&bulk_ready[i]);

		list_for_each_entry_safe_reverse(bnode, n, &krcp->bulk_head[i], list) {
			if (!poll_state_synchronize_rcu_full(&bnode->gp_snap))
				break;

			atomic_sub(bnode->nr_records, &krcp->bulk_count[i]);
			list_move(&bnode->list, &bulk_ready[i]);
		}
	}

	if (krcp->head && poll_state_synchronize_rcu(krcp->head_gp_snap)) {
		head_ready = krcp->head;
		atomic_set(&krcp->head_count, 0);
		WRITE_ONCE(krcp->head, NULL);
	}
	raw_spin_unlock_irqrestore(&krcp->lock, flags);

	for (i = 0; i < FREE_N_CHANNELS; i++) {
		list_for_each_entry_safe(bnode, n, &bulk_ready[i], list)
			kvfree_rcu_bulk(krcp, bnode, i);
	}

	if (head_ready)
		kvfree_rcu_list(head_ready);
}

/*
 * Return: %true if a work is queued, %false otherwise.
 */
static bool
kvfree_rcu_queue_batch(struct kfree_rcu_cpu *krcp)
{
	unsigned long flags;
	bool queued = false;
	int i, j;

	raw_spin_lock_irqsave(&krcp->lock, flags);

	// Attempt to start a new batch.
	for (i = 0; i < KFREE_N_BATCHES; i++) {
		struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);

		// Try to detach bulk_head or head and attach it, only when
		// all channels are free. A channel that is not free means
		// this krwp still has on-going RCU work handling its free
		// business.
		if (need_wait_for_krwp_work(krwp))
			continue;

		// kvfree_rcu_drain_ready() might handle this krcp, if so give up.
		if (need_offload_krc(krcp)) {
			// Channel 1 corresponds to the SLAB-pointer bulk path.
			// Channel 2 corresponds to vmalloc-pointer bulk path.
			for (j = 0; j < FREE_N_CHANNELS; j++) {
				if (list_empty(&krwp->bulk_head_free[j])) {
					atomic_set(&krcp->bulk_count[j], 0);
					list_replace_init(&krcp->bulk_head[j],
						&krwp->bulk_head_free[j]);
				}
			}

			// Channel 3 corresponds to both SLAB and vmalloc
			// objects queued on the linked list.
			if (!krwp->head_free) {
				krwp->head_free = krcp->head;
				get_state_synchronize_rcu_full(&krwp->head_free_gp_snap);
				atomic_set(&krcp->head_count, 0);
				WRITE_ONCE(krcp->head, NULL);
			}

			// One work is per one batch, so there are three
			// "free channels" the batch can handle. Break out
			// of the loop since we are done with this CPU, and
			// queuing an RCU work always succeeds here.
			queued = queue_rcu_work(rcu_reclaim_wq, &krwp->rcu_work);
			WARN_ON_ONCE(!queued);
			break;
		}
	}

	raw_spin_unlock_irqrestore(&krcp->lock, flags);
	return queued;
}

/*
 * This function is invoked after the KFREE_DRAIN_JIFFIES timeout.
 */
static void kfree_rcu_monitor(struct work_struct *work)
{
	struct kfree_rcu_cpu *krcp = container_of(work,
		struct kfree_rcu_cpu, monitor_work.work);

	// Drain ready for reclaim.
	kvfree_rcu_drain_ready(krcp);

	// Queue a batch for the rest.
	kvfree_rcu_queue_batch(krcp);

	// If there is nothing to detach, our job here is successfully
	// done. If at least one of the channels is still busy, rearm
	// the work to repeat the attempt, because previous batches are
	// still in progress.
	if (need_offload_krc(krcp))
		schedule_delayed_monitor_work(krcp);
}

static void fill_page_cache_func(struct work_struct *work)
{
	struct kvfree_rcu_bulk_data *bnode;
	struct kfree_rcu_cpu *krcp =
		container_of(work, struct kfree_rcu_cpu,
			page_cache_work.work);
	unsigned long flags;
	int nr_pages;
	bool pushed;
	int i;

	nr_pages = atomic_read(&krcp->backoff_page_cache_fill) ?
		1 : rcu_min_cached_objs;

	for (i = READ_ONCE(krcp->nr_bkv_objs); i < nr_pages; i++) {
		bnode = (struct kvfree_rcu_bulk_data *)
			__get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);

		if (!bnode)
			break;

		raw_spin_lock_irqsave(&krcp->lock, flags);
		pushed = put_cached_bnode(krcp, bnode);
		raw_spin_unlock_irqrestore(&krcp->lock, flags);

		if (!pushed) {
			free_page((unsigned long) bnode);
			break;
		}
	}

	atomic_set(&krcp->work_in_progress, 0);
	atomic_set(&krcp->backoff_page_cache_fill, 0);
}

// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
// state specified by flags. If can_alloc is true, the caller must
// be schedulable and not be holding any locks or mutexes that might be
// acquired by the memory allocator or anything that it might invoke.
// Returns true if ptr was successfully recorded, else the caller must
// use a fallback.
static inline bool
add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
	unsigned long *flags, void *ptr, bool can_alloc)
{
	struct kvfree_rcu_bulk_data *bnode;
	int idx;

	*krcp = krc_this_cpu_lock(flags);
	if (unlikely(!(*krcp)->initialized))
		return false;

	idx = !!is_vmalloc_addr(ptr);
	bnode = list_first_entry_or_null(&(*krcp)->bulk_head[idx],
		struct kvfree_rcu_bulk_data, list);

	/* Check if a new block is required. */
	if (!bnode || bnode->nr_records == KVFREE_BULK_MAX_ENTR) {
		bnode = get_cached_bnode(*krcp);
		if (!bnode && can_alloc) {
			krc_this_cpu_unlock(*krcp, *flags);

			// __GFP_NORETRY - allows a light-weight direct reclaim,
			// which is fine from the point of view of minimizing
			// how often the fallback path is hit. Apart from that
			// it forbids any OOM invocation, which is also
			// beneficial since we are about to release memory soon.
			//
			// __GFP_NOMEMALLOC - prevents consuming all the
			// memory reserves. Please note we have a fallback path.
			//
			// __GFP_NOWARN - it is expected that an allocation can
			// fail under low memory or high memory pressure
			// scenarios.
			bnode = (struct kvfree_rcu_bulk_data *)
				__get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
			raw_spin_lock_irqsave(&(*krcp)->lock, *flags);
		}

		if (!bnode)
			return false;

		// Initialize the new block and attach it.
		bnode->nr_records = 0;
		list_add(&bnode->list, &(*krcp)->bulk_head[idx]);
	}

	// Finally insert and update the GP for this page.
	bnode->nr_records++;
	bnode->records[bnode->nr_records - 1] = ptr;
	get_state_synchronize_rcu_full(&bnode->gp_snap);
	atomic_inc(&(*krcp)->bulk_count[idx]);

	return true;
}

static enum hrtimer_restart
schedule_page_work_fn(struct hrtimer *t)
{
	struct kfree_rcu_cpu *krcp =
		container_of(t, struct kfree_rcu_cpu, hrtimer);

	queue_delayed_work(system_highpri_wq, &krcp->page_cache_work, 0);
	return HRTIMER_NORESTART;
}

static void
run_page_cache_worker(struct kfree_rcu_cpu *krcp)
{
	// If cache disabled, bail out.
	if (!rcu_min_cached_objs)
		return;

	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
			!atomic_xchg(&krcp->work_in_progress, 1)) {
		if (atomic_read(&krcp->backoff_page_cache_fill)) {
			queue_delayed_work(rcu_reclaim_wq,
				&krcp->page_cache_work,
				msecs_to_jiffies(rcu_delay_page_cache_fill_msec));
		} else {
			hrtimer_setup(&krcp->hrtimer, schedule_page_work_fn, CLOCK_MONOTONIC,
				HRTIMER_MODE_REL);
			hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
		}
	}
}

void __init kfree_rcu_scheduler_running(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);

		if (need_offload_krc(krcp))
			schedule_delayed_monitor_work(krcp);
	}
}

/*
 * Queue a request for lazy invocation of the appropriate free routine
 * after a grace period. Please note that three paths are maintained,
 * two for the common case using arrays of pointers and a third one that
 * is used only when the main paths cannot be used, for example, due to
 * memory pressure.
 *
 * Each kvfree_call_rcu() request is added to a batch. The batch will be drained
 * every KFREE_DRAIN_JIFFIES number of jiffies. All the objects in the batch will
 * be freed in workqueue context. This allows us to batch requests together to
 * reduce the number of grace periods during heavy kfree_rcu()/kvfree_rcu() load.
 */
void kvfree_call_rcu(struct rcu_head *head, void *ptr)
{
	unsigned long flags;
	struct kfree_rcu_cpu *krcp;
	bool success;

	/*
	 * Please note there is a limitation for the head-less
	 * variant, that is why there is a clear rule for such
	 * objects: it can be used from might_sleep() context
	 * only. For other places please embed an rcu_head to
	 * your data.
	 */
	if (!head)
		might_sleep();

	// Queue the object but don't yet schedule the batch.
	if (debug_rcu_head_queue(ptr)) {
		// Probable double kfree_rcu(), just leak.
		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
			__func__, head);

		// Mark as success and leave.
		return;
	}

	kasan_record_aux_stack(ptr);
	success = add_ptr_to_bulk_krc_lock(&krcp, &flags, ptr, !head);
	if (!success) {
		run_page_cache_worker(krcp);

		if (head == NULL)
			// Inline if kvfree_rcu(one_arg) call.
			goto unlock_return;

		head->func = ptr;
		head->next = krcp->head;
		WRITE_ONCE(krcp->head, head);
		atomic_inc(&krcp->head_count);

		// Take a snapshot for this krcp.
		krcp->head_gp_snap = get_state_synchronize_rcu();
		success = true;
	}

	/*
	 * The kvfree_rcu() caller considers the pointer freed at this point
	 * and likely removes any references to it. Since the actual slab
	 * freeing (and kmemleak_free()) is deferred, tell kmemleak to ignore
	 * this object (no scanning or false positives reporting).
	 */
	kmemleak_ignore(ptr);

	// Set timer to drain after KFREE_DRAIN_JIFFIES.
	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING)
		__schedule_delayed_monitor_work(krcp);

unlock_return:
	krc_this_cpu_unlock(krcp, flags);

	/*
	 * Inline kvfree() after synchronize_rcu(). We can do
	 * it from might_sleep() context only, so the current
	 * CPU can pass the QS state.
	 */
	if (!success) {
		debug_rcu_head_unqueue((struct rcu_head *) ptr);
		synchronize_rcu();
		kvfree(ptr);
	}
}
EXPORT_SYMBOL_GPL(kvfree_call_rcu);

/**
 * kvfree_rcu_barrier - Wait until all in-flight kvfree_rcu() complete.
 *
 * Note that a single argument of kvfree_rcu() call has a slow path that
 * triggers synchronize_rcu() followed by freeing a pointer. It is done
 * before the return from the function. Therefore for any single-argument
 * call that will result in a kfree() to a cache that is to be destroyed
 * during module exit, it is the developer's responsibility to ensure that all
 * such calls have returned before the call to kmem_cache_destroy().
 */
void kvfree_rcu_barrier(void)
{
	struct kfree_rcu_cpu_work *krwp;
	struct kfree_rcu_cpu *krcp;
	bool queued;
	int i, cpu;

	/*
	 * Firstly we detach objects and queue them over an RCU-batch
	 * for all CPUs. Finally queued works are flushed for each CPU.
	 *
	 * Please note. If there are outstanding batches for a particular
	 * CPU, those have to be finished first, followed by queuing a new one.
	 */
	for_each_possible_cpu(cpu) {
		krcp = per_cpu_ptr(&krc, cpu);

		/*
		 * Check if this CPU has any objects which have been queued
		 * for a new GP completion. If not (meaning there is nothing
		 * to detach), we are done with it. If any batch is pending
		 * or running for this "krcp", the per-cpu flush_rcu_work()
		 * below waits for its completion (see the last step).
		 */
		if (!need_offload_krc(krcp))
			continue;

		while (1) {
			/*
			 * If we are not able to queue a new RCU work it means:
			 * - batches for this CPU are still in flight which should
			 *   be flushed first and then repeat;
			 * - no objects to detach, because of concurrency.
			 */
			queued = kvfree_rcu_queue_batch(krcp);

			/*
			 * Bail out, if there is no need to offload this "krcp"
			 * anymore. As noted earlier it can run concurrently.
			 */
			if (queued || !need_offload_krc(krcp))
				break;

			/* There are ongoing batches. */
			for (i = 0; i < KFREE_N_BATCHES; i++) {
				krwp = &(krcp->krw_arr[i]);
				flush_rcu_work(&krwp->rcu_work);
			}
		}
	}

	/*
	 * Now we guarantee that all objects are flushed.
	 */
	for_each_possible_cpu(cpu) {
		krcp = per_cpu_ptr(&krc, cpu);

		/*
		 * A monitor work can drain ready to reclaim objects
		 * directly. Wait for its completion if running or pending.
		 */
		cancel_delayed_work_sync(&krcp->monitor_work);

		for (i = 0; i < KFREE_N_BATCHES; i++) {
			krwp = &(krcp->krw_arr[i]);
			flush_rcu_work(&krwp->rcu_work);
		}
	}
}
EXPORT_SYMBOL_GPL(kvfree_rcu_barrier);

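A minimal teardown sketch of the ordering the note above asks for, assuming kernel context; the cache name and exit function are illustrative only:

static struct kmem_cache *my_obj_cache;

static void __exit my_module_exit(void)
{
	/* Wait for all queued kvfree_rcu(obj, rcu) callbacks to finish ... */
	kvfree_rcu_barrier();
	/* ... so no object is still pending a deferred free when the cache goes away. */
	kmem_cache_destroy(my_obj_cache);
}
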
static unsigned long
kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
{
	int cpu;
	unsigned long count = 0;

	/* Snapshot count of all CPUs */
	for_each_possible_cpu(cpu) {
		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);

		count += krc_count(krcp);
		count += READ_ONCE(krcp->nr_bkv_objs);
		atomic_set(&krcp->backoff_page_cache_fill, 1);
	}

	return count == 0 ? SHRINK_EMPTY : count;
}

static unsigned long
kfree_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
	int cpu, freed = 0;

	for_each_possible_cpu(cpu) {
		int count;
		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);

		count = krc_count(krcp);
		count += drain_page_cache(krcp);
		kfree_rcu_monitor(&krcp->monitor_work.work);

		sc->nr_to_scan -= count;
		freed += count;

		if (sc->nr_to_scan <= 0)
			break;
	}

	return freed == 0 ? SHRINK_STOP : freed;
}

void __init kvfree_rcu_init(void)
{
	int cpu;
	int i, j;
	struct shrinker *kfree_rcu_shrinker;

	rcu_reclaim_wq = alloc_workqueue("kvfree_rcu_reclaim",
			WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
	WARN_ON(!rcu_reclaim_wq);

	/* Clamp it to [0:100] seconds interval. */
	if (rcu_delay_page_cache_fill_msec < 0 ||
		rcu_delay_page_cache_fill_msec > 100 * MSEC_PER_SEC) {

		rcu_delay_page_cache_fill_msec =
			clamp(rcu_delay_page_cache_fill_msec, 0,
				(int) (100 * MSEC_PER_SEC));

		pr_info("Adjusting rcutree.rcu_delay_page_cache_fill_msec to %d ms.\n",
			rcu_delay_page_cache_fill_msec);
	}

	for_each_possible_cpu(cpu) {
		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);

		for (i = 0; i < KFREE_N_BATCHES; i++) {
			INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
			krcp->krw_arr[i].krcp = krcp;

			for (j = 0; j < FREE_N_CHANNELS; j++)
				INIT_LIST_HEAD(&krcp->krw_arr[i].bulk_head_free[j]);
		}

		for (i = 0; i < FREE_N_CHANNELS; i++)
			INIT_LIST_HEAD(&krcp->bulk_head[i]);

		INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
		INIT_DELAYED_WORK(&krcp->page_cache_work, fill_page_cache_func);
		krcp->initialized = true;
	}

	kfree_rcu_shrinker = shrinker_alloc(0, "slab-kvfree-rcu");
	if (!kfree_rcu_shrinker) {
		pr_err("Failed to allocate kfree_rcu() shrinker!\n");
		return;
	}

	kfree_rcu_shrinker->count_objects = kfree_rcu_shrink_count;
	kfree_rcu_shrinker->scan_objects = kfree_rcu_shrink_scan;

	shrinker_register(kfree_rcu_shrinker);
}

#endif /* CONFIG_KVFREE_RCU_BATCHED */