Loading modules with finit_module() can end up using vmalloc(), vmap()
and vmalloc() again, for a total of up to 3 separate allocations in the
worst case for a single module. We always kernel_read*() the module,
which is one vmalloc(). Then vmap() is used for module decompression,
in which case the read buffer is freed and the now decompressed
module buffer is used to stuff data into our copy of the module. The last
allocation is architecture specific, but it is generally a series
of vmalloc() calls, or a vmalloc() variant, to handle ELF sections with
special permissions.
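For orientation, a rough sketch of the allocations described above
(approximate, not the exact call chain):

finit_module(fd, ...)
	kernel_read_file_from_fd()	-> vmalloc()	(raw module image)
	module_decompress()		-> vmap()	(decompressed image, if compressed)
	layout_and_allocate()		-> vmalloc()-style allocations for ELF section groups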
Evaluation with the new stress-ng module support [1], with just 100 ops,
shows that you can easily end up using GiBs of data, even with all the
care the kernel and userspace take today to avoid loading modules
which are already loaded. 100 ops roughly resembles the sort of pressure a
system with about 400 CPUs can create on module loading. Although duplicate
module requests caused by each CPU incurring a new module request are silly,
and some of these are being fixed, we currently lack
proper tooling to easily diagnose what happened, when it happened
and who is likely to blame -- userspace or kernel module autoloading.
Provide an initial set of stats which use debugfs to let us easily scrape
post-boot information about failed loads. This sort of information can
be used on production workloads to try to optimize *avoiding* redundant
memory pressure from finit_module().
There are a few examples that can be provided:
A 255 vCPU system without the next patch in this series applied:
Startup finished in 19.143s (kernel) + 7.078s (userspace) = 26.221s
graphical.target reached after 6.988s in userspace
And 13.58 GiB of virtual memory space lost due to failed module loading:
root@big ~ # cat /sys/kernel/debug/modules/stats
Mods ever loaded 67
Mods failed on kread 0
Mods failed on decompress 0
Mods failed on becoming 0
Mods failed on load 1411
Total module size 11464704
Total mod text size 4194304
Failed kread bytes 0
Failed decompress bytes 0
Failed becoming bytes 0
Failed kmod bytes 14588526272
Virtual mem wasted bytes 14588526272
Average mod size 171115
Average mod text size 62602
Average fail load bytes 10339140
Duplicate failed modules:
module-name How-many-times Reason
kvm_intel 249 Load
kvm 249 Load
irqbypass 8 Load
crct10dif_pclmul 128 Load
ghash_clmulni_intel 27 Load
sha512_ssse3 50 Load
sha512_generic 200 Load
aesni_intel 249 Load
crypto_simd 41 Load
cryptd 131 Load
evdev 2 Load
serio_raw 1 Load
virtio_pci 3 Load
nvme 3 Load
nvme_core 3 Load
virtio_pci_legacy_dev 3 Load
virtio_pci_modern_dev 3 Load
t10_pi 3 Load
virtio 3 Load
crc32_pclmul 6 Load
crc64_rocksoft 3 Load
crc32c_intel 40 Load
virtio_ring 3 Load
crc64 3 Load
The following output, from a simple 8 vCPU, 8 GiB KVM guest with the
next patch in this series applied, shows 226.53 MiB wasted in virtual
memory allocations due to duplicate module requests during boot.
It also shows an average module memory size of 167.10 KiB and an
average module .text + .init.text size of 61.13 KiB. The end shows all
modules which were detected as duplicate requests and whether
they failed early, after just the first kernel_read*() call, or late,
after we had already allocated the private space for the module in
layout_and_allocate(). A system with module decompression would reveal
more wasted virtual memory space.
We should put effort now into identifying the source of these duplicate
module requests and trimming them down as much as possible. Larger systems
will obviously show many more wasted virtual memory allocations.
root@kmod ~ # cat /sys/kernel/debug/modules/stats
Mods ever loaded 67
Mods failed on kread 0
Mods failed on decompress 0
Mods failed on becoming 83
Mods failed on load 16
Total module size 11464704
Total mod text size 4194304
Failed kread bytes 0
Failed decompress bytes 0
Failed becoming bytes 228959096
Failed kmod bytes 8578080
Virtual mem wasted bytes 237537176
Average mod size 171115
Average mod text size 62602
Avg fail becoming bytes 2758544
Average fail load bytes 536130
Duplicate failed modules:
module-name How-many-times Reason
kvm_intel 7 Becoming
kvm 7 Becoming
irqbypass 6 Becoming & Load
crct10dif_pclmul 7 Becoming & Load
ghash_clmulni_intel 7 Becoming & Load
sha512_ssse3 6 Becoming & Load
sha512_generic 7 Becoming & Load
aesni_intel 7 Becoming
crypto_simd 7 Becoming & Load
cryptd 3 Becoming & Load
evdev 1 Becoming
serio_raw 1 Becoming
nvme 3 Becoming
nvme_core 3 Becoming
t10_pi 3 Becoming
virtio_pci 3 Becoming
crc32_pclmul 6 Becoming & Load
crc64_rocksoft 3 Becoming
crc32c_intel 3 Becoming
virtio_pci_modern_dev 2 Becoming
virtio_pci_legacy_dev 1 Becoming
crc64 2 Becoming
virtio 2 Becoming
virtio_ring 2 Becoming
[0] https://github.com/ColinIanKing/stress-ng.git
[1] echo 0 > /proc/sys/vm/oom_dump_tasks
./stress-ng --module 100 --module-name xfs
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Add a for_each_modinfo_entry() helper to make the code easier to read and use.
This produces no functional changes, but makes this code easier
to read, in the way we are used to with loops in the kernel, and trims a
few lines of code.
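For illustration, a sketch of what such a helper can look like, assuming
the existing get_modinfo()/get_next_modinfo() helpers (not necessarily the
exact macro added here):

#define for_each_modinfo_entry(entry, info, name) \
	for (entry = get_modinfo(info, name); entry; \
	     entry = get_next_modinfo(info, name, entry))

	/* usage: walk all "alias" entries in the module's .modinfo section */
	char *alias;

	for_each_modinfo_entry(alias, info, "alias")
		pr_debug("alias: %s\n", alias);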
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
This makes it clearer what it is doing. While at it,
make it available to code other than main.c.
This will be used in a subsequent patch and makes
the changes easier to read.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Bring dynamic debug in line with other subsystems by using the module
notifier callbacks. This results in a net decrease in core module
code.
Additionally, Jim Cromie has a new dynamic debug classmap feature,
which requires that jump labels be initialized prior to dynamic debug.
Specifically, the new feature toggles a jump label from the existing
dynamic_debug_setup() function. However, this does not currently work
properly, because jump labels are initialized via the
'module_notify_list' notifier chain, which is invoked after the
current call to dynamic_debug_setup(). Thus, this patch ensures that
jump labels are initialized prior to dynamic debug by setting the
dynamic debug notifier priority to 0, while jump labels have the
higher priority of 1.
Tested by Jim using his new test case, and I've verified the correct
printing via: # modprobe test_dynamic_debug dyndbg.
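A minimal sketch of the approach (the shape below is an assumption, not
the exact patch): dynamic debug hooks into the module notifier chain at a
lower priority than jump_label's notifier, so jump labels are patched first.

static int ddebug_module_notify(struct notifier_block *self,
				unsigned long val, void *data)
{
	struct module *mod = data;

	switch (val) {
	case MODULE_STATE_COMING:
		/* register the module's __dyndbg descriptors */
		break;
	case MODULE_STATE_GOING:
		/* drop the module's dynamic debug entries */
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block ddebug_module_nb = {
	.notifier_call = ddebug_module_notify,
	.priority = 0,	/* jump labels use priority 1, so they run first */
};

/* registration, e.g. from an early initcall:
 *	register_module_notifier(&ddebug_module_nb);
 */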
Link: https://lore.kernel.org/lkml/20230113193016.749791-21-jim.cromie@gmail.com/
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202302190427.9iIK2NfJ-lkp@intel.com/
Tested-by: Jim Cromie <jim.cromie@gmail.com>
Reviewed-by: Vincenzo Palazzo <vincenzopalazzodev@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
CC: Jim Cromie <jim.cromie@gmail.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
module_layout manages different types of memory (text, data, rodata, etc.)
in one allocation, which is problematic for a few reasons:
1. It is hard to enable CONFIG_STRICT_MODULE_RWX.
2. It is hard to use huge pages in modules (and not break strict rwx).
3. Many archs use module_layout for arch-specific data, but it is not
obvious how these data are used (are they RO, RX, or RW?)
Improve the scenario by replacing 2 (or 3) module_layout per module with
up to 7 module_memory per module:
MOD_TEXT,
MOD_DATA,
MOD_RODATA,
MOD_RO_AFTER_INIT,
MOD_INIT_TEXT,
MOD_INIT_DATA,
MOD_INIT_RODATA,
and allocating them separately. This adds slightly more entries to
mod_tree (from up to 3 entries per module, to up to 7 entries per
module). However, this at most adds a small constant overhead to
__module_address(), which is expected to be fast.
Various archs use module_layout for different data. These data are put
into different module_memory based on their location in module_layout.
IOW, data that used to go with text is allocated with MOD_TEXT;
data that used to go with data is allocated with MOD_DATA, etc.
module_memory simplifies quite a bit of the module code. For example,
ARCH_WANTS_MODULES_DATA_IN_VMALLOC is a lot cleaner, as it just uses a
different allocator for the data. kernel/module/strict_rwx.c is also
much cleaner with module_memory.
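A sketch of the resulting data structures (the field layout here is an
assumption based on the description above):

enum mod_mem_type {
	MOD_TEXT = 0,
	MOD_DATA,
	MOD_RODATA,
	MOD_RO_AFTER_INIT,
	MOD_INIT_TEXT,
	MOD_INIT_DATA,
	MOD_INIT_RODATA,
	MOD_MEM_NUM_TYPES,
};

struct module_memory {
	void *base;		/* separately allocated region for this type */
	unsigned int size;
#ifdef CONFIG_MODULES_TREE_LOOKUP
	struct mod_tree_node mtn;	/* entry in mod_tree for __module_address() */
#endif
};

/* struct module then carries one descriptor per memory type:
 *	struct module_memory mem[MOD_MEM_NUM_TYPES];
 */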
Signed-off-by: Song Liu <song@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
This new struct composes the linker-provided (vector,len) section,
and provides a place to add other __dyndbg[] state-data later:
descs - the vector of descriptors in __dyndbg section.
num_descs - length of the data/section.
Use it, in several different ways, as follows:
In lib/dynamic_debug.c:
ddebug_add_module(): Alter params-list, replacing 2 args (array,index)
with a struct _ddebug_info * containing them both, with room for
expansion. This helps future-proof the function prototype against the
looming addition of class-map info into the dyndbg-state, by providing
a place to add more member fields later.
NB: later add static struct _ddebug_info builtins_state declaration,
not needed yet.
ddebug_add_module() is called in 2 contexts:
In dynamic_debug_init(), declare, init a struct _ddebug_info di
auto-var to use as a cursor. Then iterate over the prdbg blocks of
the builtin modules, and update the di cursor before calling
_add_module for each.
It's called from kernel/module/main.c:load_info() for each loaded
module:
In internal.h, alter struct load_info, replacing the dyndbg array,len
fields with an embedded _ddebug_info containing them both; and
populate its members in find_module_sections().
The 2 calling contexts differ in that _init deals with contiguous
subranges of __dyndbgs[] section, packed together, while loadable
modules are added one at a time.
So rename ddebug_add_module() into outer/__inner fns, call __inner
from _init, and provide the offset into the builtin __dyndbgs[] where
the module's prdbgs reside. The cursor provides start, len of the
subrange for each. The offset will be used later to pack the results
of builtin __dyndbg_sites[] de-duplication, and is 0 and unneeded for
loadable modules.
Note:
kernel/module/main.c includes <dynamic_debug.h> for struct
_ddebug_info. This might be prone to include loops, since it's also
included by printk.h. Nothing has broken in robot-land on this.
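The new struct, roughly (a sketch of the shape described above):

struct _ddebug_info {
	struct _ddebug *descs;	/* vector of descriptors in the __dyndbg section */
	unsigned int num_descs;	/* length of the data/section */
};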
cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20220904214134.408619-12-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
No functional change.
With this patch a given module's state information (i.e. 'mod->state')
can be omitted from the specified buffer. Please note that this is in
preparation to include the last unloaded module's taint flag(s),
if available.
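A sketch of what the resulting interface could look like (the parameter
name here is an assumption):

/* When show_state is false, the state character derived from mod->state
 * is left out of the generated flags string.
 */
char *module_flags(struct module *mod, char *buf, bool show_state);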
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
cppcheck reports the following warnings:
kernel/module/main.c:1455:26: warning: Redundant assignment of 'mod->core_layout.size' to itself. [selfAssignment]
mod->core_layout.size = strict_align(mod->core_layout.size);
^
kernel/module/main.c:1489:26: warning: Redundant assignment of 'mod->init_layout.size' to itself. [selfAssignment]
mod->init_layout.size = strict_align(mod->init_layout.size);
^
kernel/module/main.c:1493:26: warning: Redundant assignment of 'mod->init_layout.size' to itself. [selfAssignment]
mod->init_layout.size = strict_align(mod->init_layout.size);
^
kernel/module/main.c:1504:26: warning: Redundant assignment of 'mod->init_layout.size' to itself. [selfAssignment]
mod->init_layout.size = strict_align(mod->init_layout.size);
^
kernel/module/main.c:1459:26: warning: Redundant assignment of 'mod->data_layout.size' to itself. [selfAssignment]
mod->data_layout.size = strict_align(mod->data_layout.size);
^
kernel/module/main.c:1463:26: warning: Redundant assignment of 'mod->data_layout.size' to itself. [selfAssignment]
mod->data_layout.size = strict_align(mod->data_layout.size);
^
kernel/module/main.c:1467:26: warning: Redundant assignment of 'mod->data_layout.size' to itself. [selfAssignment]
mod->data_layout.size = strict_align(mod->data_layout.size);
^
This is due to strict_align() being a no-op when
CONFIG_STRICT_MODULE_RWX is not selected.
Transform the strict_align() macro into an inline function. This allows
type checking and avoids the selfAssignment warning.
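A sketch of the resulting helper (assumed, mirroring the description; with
CONFIG_STRICT_MODULE_RWX=n it simply returns its input, so the assignment
is no longer a literal self-assignment):

static inline unsigned int strict_align(unsigned int size)
{
	if (IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
		return ALIGN(size, PAGE_SIZE);
	return size;
}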
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Currently, only the initial module that tainted the kernel is
recorded e.g. when an out-of-tree module is loaded.
The purpose of this patch is to allow the kernel to maintain a record of
each unloaded module that taints the kernel. So, in addition to
displaying a list of linked modules (see print_modules()) e.g. in the
event of a detected bad page, unloaded modules that carried a taint or
taints are displayed too. A tainted module unload count is maintained.
The number of tracked modules is not fixed. This feature is disabled by
default.
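A purely illustrative sketch of the kind of record kept per unloaded
tainted module (the names and fields here are assumptions, not the patch's):

struct unloaded_tainted_module {
	struct list_head list;
	char name[MODULE_NAME_LEN];
	unsigned long taints;	/* taint flag bitmap at unload time */
	u64 count;		/* how many times this module was unloaded */
};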
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch migrates module_assert_mutex_or_preempt() to internal.h,
so the aforementioned function can be used outside of the core module
code yet remain restricted to internal use only.
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
The purpose of this patch is to modify module_flags_taint() to accept
a module's taints bitmap as a parameter and to modify all users
accordingly. Furthermore, it is now possible to access a given
module's taint flags data outside of the core module code, while it
still remains for internal use only.
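A sketch of the adjusted prototype (assumed; previously it took a
struct module *):

/* Render taint flag characters for the given taints bitmap into buf,
 * returning the number of characters written.
 */
size_t module_flags_taint(unsigned long taints, char *buf);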
This is in preparation for module unload taint tracking support.
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Add CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC to allow architectures
to request having module data in the vmalloc area instead of the module
area.
This is required on powerpc book3s/32 in order to set data
non-executable, because executability cannot be set on a per-page
basis; it is set per 256 MB segment. The module area has exec
rights, while the vmalloc area is noexec.
This can also be useful on other powerpc/32 platforms in order to
maximize the chance of code being close enough to the kernel core to
avoid branch trampolines.
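A minimal sketch of the idea (an assumption, not the exact patch): pick a
different allocator for module data when the option is selected.

static void *module_alloc_data(unsigned int size)
{
	if (IS_ENABLED(CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC))
		return vzalloc(size);	/* data lands in the vmalloc area (noexec) */

	return module_alloc(size);	/* default: data stays in the module area */
}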
Cc: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[mcgrof: rebased in light of kernel/module/kdb.c move]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
In order to allow separation of data from text, add another layout,
called data_layout. For architectures requesting separation of text
and data, only text will go in core_layout and data will go in
data_layout.
For architectures which keep text and data together, make data_layout
an alias of core_layout, that way data_layout can be used for all
data manipulations, regardless of whether data is in core_layout or
data_layout.
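One plausible way to express the alias (a sketch, not necessarily the
exact mechanism used):

/* When an architecture keeps text and data together, data_layout simply
 * aliases core_layout, so all data manipulations can use data_layout.
 */
#ifndef CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
#define data_layout core_layout
#endif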
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
In order to separate text and data, we need to set up
two RB trees.
Modify the functions to take the tree as a parameter.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
In order to separate text and data, we need to set up
two RB trees.
This means that struct mod_tree_root is required even without
MODULES_TREE_LOOKUP.
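A sketch of the struct this implies (assumed shape): it now exists even
without MODULES_TREE_LOOKUP, so text and data can each track their own
address bounds (and, with tree lookup, their own latched RB tree).

struct mod_tree_root {
#ifdef CONFIG_MODULES_TREE_LOOKUP
	struct latch_tree_root root;	/* latched RB tree for fast lookup */
#endif
	unsigned long addr_min;		/* lowest address covered */
	unsigned long addr_max;		/* highest address covered */
};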
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
debug_align() was added by commit 84e1c6bb38 ("x86: Add RO/NX
protection for loadable kernel modules").
At that time the config item was CONFIG_DEBUG_SET_MODULE_RONX.
But nowadays it has changed to CONFIG_STRICT_MODULE_RWX, and
debug_align() is confusing because it has nothing to do with
DEBUG.
Rename it to strict_align().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Perform layout alignment verification up front, WARN_ON() on a
misaligned layout, and fail module loading instead of crashing the machine.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Move module_enable_x() together with module_enable_nx() and
module_enable_ro().
Those three functions go together; they are all used
to set up the correct page flags on the different sections.
As module_enable_x() is used independently of
CONFIG_STRICT_MODULE_RWX, build strict_rwx.c all the time and
use IS_ENABLED(CONFIG_STRICT_MODULE_RWX) when relevant.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
module_enable_x() has nothing to do with CONFIG_ARCH_HAS_STRICT_MODULE_RWX,
although by coincidence the architectures that need module_enable_x()
select CONFIG_ARCH_HAS_STRICT_MODULE_RWX.
Enable module_enable_x() for everyone, every time. If an architecture
already has module text set executable, it's a no-op.
Don't check text_size alignment. When CONFIG_STRICT_MODULE_RWX is set
the verification is already done in frob_rodata(). When
CONFIG_STRICT_MODULE_RWX is not set it is not a big deal to have the
start of data as executable. Just make sure we entirely get the last
page when the boundary is not aligned.
And don't BUG on misaligned base as some architectures like nios2
use kmalloc() for allocating modules. So just bail out in that case.
If that's a problem, a page fault will occur later anyway.
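A sketch of the resulting behaviour (assumed, following the description
above): bail out on a misaligned base instead of BUG(), and round the text
size up so a misaligned boundary still gets its last page made executable.

static void frob_text(const struct module_layout *layout,
		      int (*set_memory)(unsigned long start, int num_pages))
{
	set_memory((unsigned long)layout->base,
		   PAGE_ALIGN(layout->text_size) >> PAGE_SHIFT);
}

static void module_enable_x(const struct module *mod)
{
	if (!PAGE_ALIGNED(mod->core_layout.base) ||
	    !PAGE_ALIGNED(mod->init_layout.base))
		return;

	frob_text(&mod->core_layout, set_memory_x);
	frob_text(&mod->init_layout, set_memory_x);
}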
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch migrates module version support out of core code into
kernel/module/version.c, along with some simple code refactoring to
make this possible.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch migrates module sysfs support out of core code into
kernel/module/sysfs.c, along with some simple code refactoring to
make this possible.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch migrates the code that generates a list of loaded (or
linked) modules via /proc, when procfs support is enabled, into
kernel/module/procfs.c.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch migrates kallsyms code out of core module
code into kernel/module/kallsyms.c.
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch migrates kmemleak code out of core module
code into kernel/module/debug_kmemleak.c
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch migrates code that makes module text
and rodata memory read-only and non-text memory
non-executable from core module code into
kernel/module/strict_rwx.c.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch migrates module latched RB-tree support
(e.g. see __module_address()) from core module code
into kernel/module/tree_lookup.c.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch migrates livepatch support (i.e. used during module
load and deletion) from core module code into
kernel/module/livepatch.c. At the moment it only contains code to
persist ELF information about a given livepatch module.
The new file was added to MAINTAINERS.
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
This patch addresses the following warnings and style violations
generated by ./scripts/checkpatch.pl in strict mode:
WARNING: Use #include <linux/module.h> instead of <asm/module.h>
#10: FILE: kernel/module/internal.h:10:
+#include <asm/module.h>
CHECK: spaces preferred around that '-' (ctx:VxV)
#18: FILE: kernel/module/internal.h:18:
+#define INIT_OFFSET_MASK (1UL << (BITS_PER_LONG-1))
CHECK: Please use a blank line after function/struct/union/enum declarations
#69: FILE: kernel/module/internal.h:69:
+}
+static inline void module_decompress_cleanup(struct load_info *info)
^
CHECK: extern prototypes should be avoided in .h files
#84: FILE: kernel/module/internal.h:84:
+extern int mod_verify_sig(const void *mod, struct load_info *info);
WARNING: Missing a blank line after declarations
#116: FILE: kernel/module/decompress.c:116:
+ struct page *page = module_get_next_page(info);
+ if (!page) {
WARNING: Missing a blank line after declarations
#174: FILE: kernel/module/decompress.c:174:
+ struct page *page = module_get_next_page(info);
+ if (!page) {
CHECK: Please use a blank line after function/struct/union/enum declarations
#258: FILE: kernel/module/decompress.c:258:
+}
+static struct kobj_attribute module_compression_attr = __ATTR_RO(compression);
Note: Fortunately, the multiple-include optimisation found in
include/linux/module.h will prevent it from being included more than
once.
Fixes: f314dfea16 ("modsign: log module name in the event of an error")
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional change.
This patch makes it possible to move non-essential code
out of core module code.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
No functional changes.
This patch moves all module-related code into a separate directory,
renames each file and creates a new Makefile. Note: this effort
is in preparation for refactoring core module code.
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>