Commit graph

1582 commits

Author SHA1 Message Date
Linus Torvalds
352af6a011 Rust changes for v6.17
Toolchain and infrastructure:
 
  - Enable a set of Clippy lints: 'ptr_as_ptr', 'ptr_cast_constness',
    'as_ptr_cast_mut', 'as_underscore', 'cast_lossless' and 'ref_as_ptr'.
 
    These are intended to avoid type casts with the 'as' operator, which
    are quite powerful, into restricted variants that are less powerful
    and thus should help to avoid mistakes.
 
  - Remove the 'author' key now that most instances were moved to the
    plural one in the previous cycle.
 
 'kernel' crate:
 
  - New 'bug' module: add 'warn_on!' macro which reuses the existing
    'BUG'/'WARN' infrastructure, i.e. it respects the usual sysctls and
    kernel parameters:
 
        warn_on!(value == 42);
 
    To avoid duplicating the assembly code, the same strategy is followed
    as for the static branch code in order to share the assembly between
    both C and Rust. This required a few rearrangements on C arch headers
    -- the existing C macros should still generate the same outputs, thus
    no functional change expected there.
 
  - 'workqueue' module: add delayed work items, including a 'DelayedWork'
    struct, a 'impl_has_delayed_work!' macro and an 'enqueue_delayed'
    method, e.g.:
 
        /// Enqueue the struct for execution on the system workqueue,
        /// where its value will be printed 42 jiffies later.
        fn print_later(value: Arc<MyStruct>) {
            let _ = workqueue::system().enqueue_delayed(value, 42);
        }
 
  - New 'bits' module: add support for 'bit' and 'genmask' functions,
    with runtime- and compile-time variants, e.g.:
 
        static_assert!(0b00010000 == bit_u8(4));
        static_assert!(0b00011110 == genmask_u8(1..=4));
 
        assert!(checked_bit_u32(u32::BITS).is_none());
 
  - 'uaccess' module: add 'UserSliceReader::strcpy_into_buf', which reads
    NUL-terminated strings from userspace into a '&CStr'.
 
    Introduce 'UserPtr' newtype, similar in purpose to '__user' in C, to
    minimize mistakes handling userspace pointers, including mixing them
    up with integers and leaking them via the 'Debug' trait. Add it to
    the prelude, too.
 
  - Start preparations for the replacement of our custom 'CStr' type
    with the analogous type in the 'core' standard library. This will
    take place across several cycles to make it easier. For this one,
    it includes a new 'fmt' module, using upstream method names and some
    other cleanups.
 
    Replace 'fmt!' with a re-export, which helps Clippy lint properly,
    and clean up the found 'uninlined-format-args' instances.
 
  - 'dma' module:
 
    - Clarify wording and be consistent in 'coherent' nomenclature.
 
    - Convert the 'read!()' and 'write!()' macros to return a 'Result'.
 
    - Add 'as_slice()', 'write()' methods in 'CoherentAllocation'.
 
    - Expose 'count()' and 'size()' in 'CoherentAllocation' and add the
      corresponding type invariants.
 
    - Implement 'CoherentAllocation::dma_handle_with_offset()'.
 
  - 'time' module:
 
    - Make 'Instant' generic over clock source. This allows the compiler
      to assert that arithmetic expressions involving the 'Instant' use
      'Instants' based on the same clock source.
 
    - Make 'HrTimer' generic over the timer mode. 'HrTimer' timers take a
      'Duration' or an 'Instant' when setting the expiry time, depending
      on the timer mode. With this change, the compiler can check the
      type matches the timer mode.
 
    - Add an abstraction for 'fsleep'. 'fsleep' is a flexible sleep
      function that will select an appropriate sleep method depending on
      the requested sleep time.
 
    - Avoid 64-bit divisions on 32-bit hardware when calculating
      timestamps.
 
    - Seal the 'HrTimerMode' trait. This prevents users of the
      'HrTimerMode' from implementing the trait on their own types.
 
    - Pass the correct timer mode ID to 'hrtimer_start_range_ns()'.
 
  - 'list' module: remove 'OFFSET' constants, allowing to remove pointer
    arithmetic; now 'impl_list_item!' invokes 'impl_has_list_links!' or
    'impl_has_list_links_self_ptr!'. Other simplifications too.
 
  - 'types' module: remove 'ForeignOwnable::PointedTo' in favor of a
    constant, which avoids exposing the type of the opaque pointer, and
    require 'into_foreign' to return non-null.
 
    Remove the 'Either<L, R>' type as well. It is unused, and we want to
    encourage the use of custom enums for concrete use cases.
 
  - 'sync' module: implement 'Borrow' and 'BorrowMut' for 'Arc' types
    to allow them to be used in generic APIs.
 
  - 'alloc' module: implement 'Borrow' and 'BorrowMut' for 'Box<T, A>';
     and 'Borrow', 'BorrowMut' and 'Default' for 'Vec<T, A>'.
 
  - 'Opaque' type: add 'cast_from' method to perform a restricted cast
    that cannot change the inner type and use it in callers of
    'container_of!'. Rename 'raw_get' to 'cast_into' to match it.
 
  - 'rbtree' module: add 'is_empty' method.
 
  - 'sync' module: new 'aref' submodule to hold 'AlwaysRefCounted' and
    'ARef', which are moved from the too general 'types' module which we
    want to reduce or eventually remove. Also fix a safety comment in
    'static_lock_class'.
 
 'pin-init' crate:
 
  - Add 'impl<T, E> [Pin]Init<T, E> for Result<T, E>', so results are now
    (pin-)initializers.
 
  - Add 'Zeroable::init_zeroed()' that delegates to 'init_zeroed()'.
 
  - New 'zeroed()', a safe version of 'mem::zeroed()' and also provide
    it via 'Zeroable::zeroed()'.
 
  - Implement 'Zeroable' for 'Option<&T>', 'Option<&mut T>' and for
    'Option<[unsafe] [extern "abi"] fn(...args...) -> ret>' for '"Rust"'
    and '"C"' ABIs and up to 20 arguments.
 
  - Changed blanket impls of 'Init' and 'PinInit' from 'impl<T, E>
    [Pin]Init<T, E> for T' to 'impl<T> [Pin]Init<T> for T'.
 
  - Renamed 'zeroed()' to 'init_zeroed()'.
 
  - Upstream dev news: improve CI more to deny warnings, use
    '--all-targets'. Check the synchronization status of the two '-next'
    branches in upstream and the kernel.
 
 MAINTAINERS:
 
  - Add Vlastimil Babka, Liam R. Howlett, Uladzislau Rezki and Lorenzo
    Stoakes as reviewers (thanks everyone).
 
 And a few other cleanups and improvements.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEPjU5OPd5QIZ9jqqOGXyLc2htIW0FAmiOWREACgkQGXyLc2ht
 IW39Ig/9E0ExSiBgNKdkCOaULMq31wAxnu3iWoVVisFndlh/Inv+JlaLrmA57BCi
 xXgBwVZ1GoMsG8Fzt6gT+gyhGYi8waNd+5KXr/WJZVTaJ9v1KpdvxuCnSz0DjCbk
 GaKfAfxvJ5GAOEwiIIX8X0TFu6kx911DCJY387/VrqZQ7Msh1QSM3tcZeir/EV4w
 lPjUdlOh1FnLJLI9CGuW20d1IhQUP7K3pdoywgJPpCZV0I8QCyMlMqCEael8Tw2S
 r/PzRaQtiIzk5HTx06V8paK+nEn0K2vQXqW2kV56Y6TNm1Zcv6dES/8hCITsISs2
 nwney3vXEwvoZX+YkQRffZddY4i6YenWMrtLgVxZzdshBL3bn6eHqBL04Nfix+p7
 pQe3qMH3G8UBtX1lugBE7RrWGWcz9ARN8sK12ClmpAUnKJOwTpo97kpqXP7pDme8
 Buh/oV3voAMsqwooSbVBzuUUWnbGaQ5Oj6CiiosSadfNh6AxJLYLKHtRLKJHZEw3
 0Ob/1HhoWS6JSvYKVjMyD19qcH7O8ThZE+83CfMAkI4KphXJarWhpSmN4cHkFn/v
 0clQ7Y5m+up9v1XWTaEq0Biqa6CaxLQwm/qW5WU0Y/TiovmvxAFdCwsQqDkRoJNx
 9kNfMJRvNl78KQxrjEDz9gl7/ajgqX1KkqP8CQbGjv29cGzFlVE=
 =5Wt9
 -----END PGP SIGNATURE-----

Merge tag 'rust-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux

Pull Rust updates from Miguel Ojeda:
 "Toolchain and infrastructure:

   - Enable a set of Clippy lints: 'ptr_as_ptr', 'ptr_cast_constness',
     'as_ptr_cast_mut', 'as_underscore', 'cast_lossless' and
     'ref_as_ptr'

     These are intended to avoid type casts with the 'as' operator,
     which are quite powerful, into restricted variants that are less
     powerful and thus should help to avoid mistakes

   - Remove the 'author' key now that most instances were moved to the
     plural one in the previous cycle

  'kernel' crate:

   - New 'bug' module: add 'warn_on!' macro which reuses the existing
     'BUG'/'WARN' infrastructure, i.e. it respects the usual sysctls and
     kernel parameters:

         warn_on!(value == 42);

     To avoid duplicating the assembly code, the same strategy is
     followed as for the static branch code in order to share the
     assembly between both C and Rust

     This required a few rearrangements on C arch headers -- the
     existing C macros should still generate the same outputs, thus no
     functional change expected there

   - 'workqueue' module: add delayed work items, including a
     'DelayedWork' struct, a 'impl_has_delayed_work!' macro and an
     'enqueue_delayed' method, e.g.:

         /// Enqueue the struct for execution on the system workqueue,
         /// where its value will be printed 42 jiffies later.
         fn print_later(value: Arc<MyStruct>) {
             let _ = workqueue::system().enqueue_delayed(value, 42);
         }

   - New 'bits' module: add support for 'bit' and 'genmask' functions,
     with runtime- and compile-time variants, e.g.:

         static_assert!(0b00010000 == bit_u8(4));
         static_assert!(0b00011110 == genmask_u8(1..=4));

         assert!(checked_bit_u32(u32::BITS).is_none());

   - 'uaccess' module: add 'UserSliceReader::strcpy_into_buf', which
     reads NUL-terminated strings from userspace into a '&CStr'

     Introduce 'UserPtr' newtype, similar in purpose to '__user' in C,
     to minimize mistakes handling userspace pointers, including mixing
     them up with integers and leaking them via the 'Debug' trait. Add
     it to the prelude, too

   - Start preparations for the replacement of our custom 'CStr' type
     with the analogous type in the 'core' standard library. This will
     take place across several cycles to make it easier. For this one,
     it includes a new 'fmt' module, using upstream method names and
     some other cleanups

     Replace 'fmt!' with a re-export, which helps Clippy lint properly,
     and clean up the found 'uninlined-format-args' instances

   - 'dma' module:

      - Clarify wording and be consistent in 'coherent' nomenclature

      - Convert the 'read!()' and 'write!()' macros to return a 'Result'

      - Add 'as_slice()', 'write()' methods in 'CoherentAllocation'

      - Expose 'count()' and 'size()' in 'CoherentAllocation' and add
        the corresponding type invariants

      - Implement 'CoherentAllocation::dma_handle_with_offset()'

   - 'time' module:

      - Make 'Instant' generic over clock source. This allows the
        compiler to assert that arithmetic expressions involving the
        'Instant' use 'Instants' based on the same clock source

      - Make 'HrTimer' generic over the timer mode. 'HrTimer' timers
        take a 'Duration' or an 'Instant' when setting the expiry time,
        depending on the timer mode. With this change, the compiler can
        check the type matches the timer mode

      - Add an abstraction for 'fsleep'. 'fsleep' is a flexible sleep
        function that will select an appropriate sleep method depending
        on the requested sleep time

      - Avoid 64-bit divisions on 32-bit hardware when calculating
        timestamps

      - Seal the 'HrTimerMode' trait. This prevents users of the
        'HrTimerMode' from implementing the trait on their own types

      - Pass the correct timer mode ID to 'hrtimer_start_range_ns()'

   - 'list' module: remove 'OFFSET' constants, allowing to remove
     pointer arithmetic; now 'impl_list_item!' invokes
     'impl_has_list_links!' or 'impl_has_list_links_self_ptr!'. Other
     simplifications too

   - 'types' module: remove 'ForeignOwnable::PointedTo' in favor of a
     constant, which avoids exposing the type of the opaque pointer, and
     require 'into_foreign' to return non-null

     Remove the 'Either<L, R>' type as well. It is unused, and we want
     to encourage the use of custom enums for concrete use cases

   - 'sync' module: implement 'Borrow' and 'BorrowMut' for 'Arc' types
     to allow them to be used in generic APIs

   - 'alloc' module: implement 'Borrow' and 'BorrowMut' for 'Box<T, A>';
     and 'Borrow', 'BorrowMut' and 'Default' for 'Vec<T, A>'

   - 'Opaque' type: add 'cast_from' method to perform a restricted cast
     that cannot change the inner type and use it in callers of
     'container_of!'. Rename 'raw_get' to 'cast_into' to match it

   - 'rbtree' module: add 'is_empty' method

   - 'sync' module: new 'aref' submodule to hold 'AlwaysRefCounted' and
     'ARef', which are moved from the too general 'types' module which
     we want to reduce or eventually remove. Also fix a safety comment
     in 'static_lock_class'

  'pin-init' crate:

   - Add 'impl<T, E> [Pin]Init<T, E> for Result<T, E>', so results are
     now (pin-)initializers

   - Add 'Zeroable::init_zeroed()' that delegates to 'init_zeroed()'

   - New 'zeroed()', a safe version of 'mem::zeroed()' and also provide
     it via 'Zeroable::zeroed()'

   - Implement 'Zeroable' for 'Option<&T>', 'Option<&mut T>' and for
     'Option<[unsafe] [extern "abi"] fn(...args...) -> ret>' for
     '"Rust"' and '"C"' ABIs and up to 20 arguments

   - Changed blanket impls of 'Init' and 'PinInit' from 'impl<T, E>
     [Pin]Init<T, E> for T' to 'impl<T> [Pin]Init<T> for T'

   - Renamed 'zeroed()' to 'init_zeroed()'

   - Upstream dev news: improve CI more to deny warnings, use
     '--all-targets'. Check the synchronization status of the two
     '-next' branches in upstream and the kernel

  MAINTAINERS:

   - Add Vlastimil Babka, Liam R. Howlett, Uladzislau Rezki and Lorenzo
     Stoakes as reviewers (thanks everyone)

  And a few other cleanups and improvements"

* tag 'rust-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux: (76 commits)
  rust: Add warn_on macro
  arm64/bug: Add ARCH_WARN_ASM macro for BUG/WARN asm code sharing with Rust
  riscv/bug: Add ARCH_WARN_ASM macro for BUG/WARN asm code sharing with Rust
  x86/bug: Add ARCH_WARN_ASM macro for BUG/WARN asm code sharing with Rust
  rust: kernel: move ARef and AlwaysRefCounted to sync::aref
  rust: sync: fix safety comment for `static_lock_class`
  rust: types: remove `Either<L, R>`
  rust: kernel: use `core::ffi::CStr` method names
  rust: str: add `CStr` methods matching `core::ffi::CStr`
  rust: str: remove unnecessary qualification
  rust: use `kernel::{fmt,prelude::fmt!}`
  rust: kernel: add `fmt` module
  rust: kernel: remove `fmt!`, fix clippy::uninlined-format-args
  scripts: rust: emit path candidates in panic message
  scripts: rust: replace length checks with match
  rust: list: remove nonexistent generic parameter in link
  rust: bits: add support for bits/genmask macros
  rust: list: remove OFFSET constants
  rust: list: add `impl_list_item!` examples
  rust: list: use fully qualified path
  ...
2025-08-03 13:49:10 -07:00
Linus Torvalds
a6923c06a3 bpf-fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmiNNksACgkQ6rmadz2v
 bTrKRhAAnju4bbFRHU88Y68p6Meq/jxgjxHZAkTqZA0Nvbu2cItPRL7XHAAhTWE7
 OBEIm3UKCH4gs4fY8rDHiIgnnaQavXUmvXZblOIOjxnqRKJpU3px+wwJvGFq5Enq
 WP6UZV8tj+O2tNfNNYS+mgQvvIpUISHGpKimvx7ede3e1U3cJBkppbT3gooMHYuc
 5s1QtYHWaPY/1DpkHgqJ2UPGcbT9/HSPGMHRNaHKjQTcNcLcrj7RRjchgXqcc7Vs
 hVijvVrLiuK0MyU42ritmaqvjjgD6hKPZguRQe2/hAtrOo0Alf+4mXkMgam7simN
 iHfGc7nhw1xAFTPj4WXahja89G00FdDN5NR37Rgurm/i2fY7BuXAkMjiMiwGB3C3
 jk2wG3RSifYeC2rxhkYJdqcx8Cz6m+pjgyJ2o9Jy5dn426VXg/kzkUXpl6u5jaPZ
 SmKoo9Xu1r7xqTaUc9kk8pJI5Xt9vD5oQjF2KQuPZXxNidiwW6k2OGbW+wF26nEi
 Q6pfDu3pvHAd/UE6cD5yFe97o3Cc2XfGwI/Sv2k99UVPvNcvfAvVo9fsItHBhCPn
 zHkihW2S0zmbBlhcrB+PrLclNgLleP9JukFN+5scc0a9lbQxIm6v2TNKGlBfDQtO
 I+Kn266oqT4BEgnQGlCQquINnQAdmS8VMnnunGOu6+rwPUtkI7E=
 =XLHS
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

 - Fix kCFI failures in JITed BPF code on arm64 (Sami Tolvanen, Puranjay
   Mohan, Mark Rutland, Maxwell Bland)

 - Disallow tail calls between BPF programs that use different cgroup
   local storage maps to prevent out-of-bounds access (Daniel Borkmann)

 - Fix unaligned access in flow_dissector and netfilter BPF programs
   (Paul Chaignon)

 - Avoid possible use of uninitialized mod_len in libbpf (Achill
   Gilgenast)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Test for unaligned flow_dissector ctx access
  bpf: Improve ctx access verifier error message
  bpf: Check netfilter ctx accesses are aligned
  bpf: Check flow_dissector ctx accesses are aligned
  arm64/cfi,bpf: Support kCFI + BPF on arm64
  cfi: Move BPF CFI types and helpers to generic code
  cfi: add C CFI type macro
  libbpf: Avoid possible use of uninitialized mod_len
  bpf: Fix oob access in cgroup local storage
  bpf: Move cgroup iterator helpers to bpf.h
  bpf: Move bpf map owner out of common struct
  bpf: Add cookie object to bpf maps
2025-08-01 17:13:26 -07:00
Sami Tolvanen
f1befc82ad cfi: Move BPF CFI types and helpers to generic code
Instead of duplicating the same code for each architecture, move
the CFI type hash variables for BPF function types and related
helper functions to generic CFI code, and allow architectures to
override the function definitions if needed.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20250801001004.1859976-7-samitolvanen@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-31 18:23:53 -07:00
Linus Torvalds
beace86e61 Summary of significant series in this pull request:
- The 4 patch series "mm: ksm: prevent KSM from breaking merging of new
   VMAs" from Lorenzo Stoakes addresses an issue with KSM's
   PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for
   merging with existing adjacent VMAs.
 
 - The 4 patch series "mm/damon: introduce DAMON_STAT for simple and
   practical access monitoring" from SeongJae Park adds a new kernel module
   which simplifies the setup and usage of DAMON in production
   environments.
 
 - The 6 patch series "stop passing a writeback_control to swap/shmem
   writeout" from Christoph Hellwig is a cleanup to the writeback code
   which removes a couple of pointers from struct writeback_control.
 
 - The 7 patch series "drivers/base/node.c: optimization and cleanups"
   from Donet Tom contains largely uncorrelated cleanups to the NUMA node
   setup and management code.
 
 - The 4 patch series "mm: userfaultfd: assorted fixes and cleanups" from
   Tal Zussman does some maintenance work on the userfaultfd code.
 
 - The 5 patch series "Readahead tweaks for larger folios" from Ryan
   Roberts implements some tuneups for pagecache readahead when it is
   reading into order>0 folios.
 
 - The 4 patch series "selftests/mm: Tweaks to the cow test" from Mark
   Brown provides some cleanups and consistency improvements to the
   selftests code.
 
 - The 4 patch series "Optimize mremap() for large folios" from Dev Jain
   does that.  A 37% reduction in execution time was measured in a
   memset+mremap+munmap microbenchmark.
 
 - The 5 patch series "Remove zero_user()" from Matthew Wilcox expunges
   zero_user() in favor of the more modern memzero_page().
 
 - The 3 patch series "mm/huge_memory: vmf_insert_folio_*() and
   vmf_insert_pfn_pud() fixes" from David Hildenbrand addresses some warts
   which David noticed in the huge page code.  These were not known to be
   causing any issues at this time.
 
 - The 3 patch series "mm/damon: use alloc_migrate_target() for
   DAMOS_MIGRATE_{HOT,COLD" from SeongJae Park provides some cleanup and
   consolidation work in DAMON.
 
 - The 3 patch series "use vm_flags_t consistently" from Lorenzo Stoakes
   uses vm_flags_t in places where we were inappropriately using other
   types.
 
 - The 3 patch series "mm/memfd: Reserve hugetlb folios before
   allocation" from Vivek Kasireddy increases the reliability of large page
   allocation in the memfd code.
 
 - The 14 patch series "mm: Remove pXX_devmap page table bit and pfn_t
   type" from Alistair Popple removes several now-unneeded PFN_* flags.
 
 - The 5 patch series "mm/damon: decouple sysfs from core" from SeongJae
   Park implememnts some cleanup and maintainability work in the DAMON
   sysfs layer.
 
 - The 5 patch series "madvise cleanup" from Lorenzo Stoakes does quite a
   lot of cleanup/maintenance work in the madvise() code.
 
 - The 4 patch series "madvise anon_name cleanups" from Vlastimil Babka
   provides additional cleanups on top or Lorenzo's effort.
 
 - The 11 patch series "Implement numa node notifier" from Oscar Salvador
   creates a standalone notifier for NUMA node memory state changes.
   Previously these were lumped under the more general memory on/offline
   notifier.
 
 - The 6 patch series "Make MIGRATE_ISOLATE a standalone bit" from Zi Yan
   cleans up the pageblock isolation code and fixes a potential issue which
   doesn't seem to cause any problems in practice.
 
 - The 5 patch series "selftests/damon: add python and drgn based DAMON
   sysfs functionality tests" from SeongJae Park adds additional drgn- and
   python-based DAMON selftests which are more comprehensive than the
   existing selftest suite.
 
 - The 5 patch series "Misc rework on hugetlb faulting path" from Oscar
   Salvador fixes a rather obscure deadlock in the hugetlb fault code and
   follows that fix with a series of cleanups.
 
 - The 3 patch series "cma: factor out allocation logic from
   __cma_declare_contiguous_nid" from Mike Rapoport rationalizes and cleans
   up the highmem-specific code in the CMA allocator.
 
 - The 28 patch series "mm/migration: rework movable_ops page migration
   (part 1)" from David Hildenbrand provides cleanups and
   future-preparedness to the migration code.
 
 - The 2 patch series "mm/damon: add trace events for auto-tuned
   monitoring intervals and DAMOS quota" from SeongJae Park adds some
   tracepoints to some DAMON auto-tuning code.
 
 - The 6 patch series "mm/damon: fix misc bugs in DAMON modules" from
   SeongJae Park does that.
 
 - The 6 patch series "mm/damon: misc cleanups" from SeongJae Park also
   does what it claims.
 
 - The 4 patch series "mm: folio_pte_batch() improvements" from David
   Hildenbrand cleans up the large folio PTE batching code.
 
 - The 13 patch series "mm/damon/vaddr: Allow interleaving in
   migrate_{hot,cold} actions" from SeongJae Park facilitates dynamic
   alteration of DAMON's inter-node allocation policy.
 
 - The 3 patch series "Remove unmap_and_put_page()" from Vishal Moola
   provides a couple of page->folio conversions.
 
 - The 4 patch series "mm: per-node proactive reclaim" from Davidlohr
   Bueso implements a per-node control of proactive reclaim - beyond the
   current memcg-based implementation.
 
 - The 14 patch series "mm/damon: remove damon_callback" from SeongJae
   Park replaces the damon_callback interface with a more general and
   powerful damon_call()+damos_walk() interface.
 
 - The 10 patch series "mm/mremap: permit mremap() move of multiple VMAs"
   from Lorenzo Stoakes implements a number of mremap cleanups (of course)
   in preparation for adding new mremap() functionality: newly permit the
   remapping of multiple VMAs when the user is specifying MREMAP_FIXED.  It
   still excludes some specialized situations where this cannot be
   performed reliably.
 
 - The 3 patch series "drop hugetlb_free_pgd_range()" from Anthony Yznaga
   switches some sparc hugetlb code over to the generic version and removes
   the thus-unneeded hugetlb_free_pgd_range().
 
 - The 4 patch series "mm/damon/sysfs: support periodic and automated
   stats update" from SeongJae Park augments the present
   userspace-requested update of DAMON sysfs monitoring files.  Automatic
   update is now provided, along with a tunable to control the update
   interval.
 
 - The 4 patch series "Some randome fixes and cleanups to swapfile" from
   Kemeng Shi does what is claims.
 
 - The 4 patch series "mm: introduce snapshot_page" from Luiz Capitulino
   and David Hildenbrand provides (and uses) a means by which debug-style
   functions can grab a copy of a pageframe and inspect it locklessly
   without tripping over the races inherent in operating on the live
   pageframe directly.
 
 - The 6 patch series "use per-vma locks for /proc/pid/maps reads" from
   Suren Baghdasaryan addresses the large contention issues which can be
   triggered by reads from that procfs file.  Latencies are reduced by more
   than half in some situations.  The series also introduces several new
   selftests for the /proc/pid/maps interface.
 
 - The 6 patch series "__folio_split() clean up" from Zi Yan cleans up
   __folio_split()!
 
 - The 7 patch series "Optimize mprotect() for large folios" from Dev
   Jain provides some quite large (>3x) speedups to mprotect() when dealing
   with large folios.
 
 - The 2 patch series "selftests/mm: reuse FORCE_READ to replace "asm
   volatile("" : "+r" (XXX));" and some cleanup" from wang lian does some
   cleanup work in the selftests code.
 
 - The 3 patch series "tools/testing: expand mremap testing" from Lorenzo
   Stoakes extends the mremap() selftest in several ways, including adding
   more checking of Lorenzo's recently added "permit mremap() move of
   multiple VMAs" feature.
 
 - The 22 patch series "selftests/damon/sysfs.py: test all parameters"
   from SeongJae Park extends the DAMON sysfs interface selftest so that it
   tests all possible user-requested parameters.  Rather than the present
   minimal subset.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaIqcCgAKCRDdBJ7gKXxA
 jkVBAQCCn9DR1QP0CRk961ot0cKzOgioSc0aA03DPb2KXRt2kQEAzDAz0ARurFhL
 8BzbvI0c+4tntHLXvIlrC33n9KWAOQM=
 =XsFy
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "As usual, many cleanups. The below blurbiage describes 42 patchsets.
  21 of those are partially or fully cleanup work. "cleans up",
  "cleanup", "maintainability", "rationalizes", etc.

  I never knew the MM code was so dirty.

  "mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes)
     addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly
     mapped VMAs were not eligible for merging with existing adjacent
     VMAs.

  "mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park)
     adds a new kernel module which simplifies the setup and usage of
     DAMON in production environments.

  "stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig)
     is a cleanup to the writeback code which removes a couple of
     pointers from struct writeback_control.

  "drivers/base/node.c: optimization and cleanups" (Donet Tom)
     contains largely uncorrelated cleanups to the NUMA node setup and
     management code.

  "mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman)
     does some maintenance work on the userfaultfd code.

  "Readahead tweaks for larger folios" (Ryan Roberts)
     implements some tuneups for pagecache readahead when it is reading
     into order>0 folios.

  "selftests/mm: Tweaks to the cow test" (Mark Brown)
     provides some cleanups and consistency improvements to the
     selftests code.

  "Optimize mremap() for large folios" (Dev Jain)
     does that. A 37% reduction in execution time was measured in a
     memset+mremap+munmap microbenchmark.

  "Remove zero_user()" (Matthew Wilcox)
     expunges zero_user() in favor of the more modern memzero_page().

  "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand)
     addresses some warts which David noticed in the huge page code.
     These were not known to be causing any issues at this time.

  "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park)
     provides some cleanup and consolidation work in DAMON.

  "use vm_flags_t consistently" (Lorenzo Stoakes)
     uses vm_flags_t in places where we were inappropriately using other
     types.

  "mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy)
     increases the reliability of large page allocation in the memfd
     code.

  "mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple)
     removes several now-unneeded PFN_* flags.

  "mm/damon: decouple sysfs from core" (SeongJae Park)
     implememnts some cleanup and maintainability work in the DAMON
     sysfs layer.

  "madvise cleanup" (Lorenzo Stoakes)
     does quite a lot of cleanup/maintenance work in the madvise() code.

  "madvise anon_name cleanups" (Vlastimil Babka)
     provides additional cleanups on top or Lorenzo's effort.

  "Implement numa node notifier" (Oscar Salvador)
     creates a standalone notifier for NUMA node memory state changes.
     Previously these were lumped under the more general memory
     on/offline notifier.

  "Make MIGRATE_ISOLATE a standalone bit" (Zi Yan)
     cleans up the pageblock isolation code and fixes a potential issue
     which doesn't seem to cause any problems in practice.

  "selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park)
     adds additional drgn- and python-based DAMON selftests which are
     more comprehensive than the existing selftest suite.

  "Misc rework on hugetlb faulting path" (Oscar Salvador)
     fixes a rather obscure deadlock in the hugetlb fault code and
     follows that fix with a series of cleanups.

  "cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport)
     rationalizes and cleans up the highmem-specific code in the CMA
     allocator.

  "mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand)
     provides cleanups and future-preparedness to the migration code.

  "mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park)
     adds some tracepoints to some DAMON auto-tuning code.

  "mm/damon: fix misc bugs in DAMON modules" (SeongJae Park)
     does that.

  "mm/damon: misc cleanups" (SeongJae Park)
     also does what it claims.

  "mm: folio_pte_batch() improvements" (David Hildenbrand)
     cleans up the large folio PTE batching code.

  "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park)
     facilitates dynamic alteration of DAMON's inter-node allocation
     policy.

  "Remove unmap_and_put_page()" (Vishal Moola)
     provides a couple of page->folio conversions.

  "mm: per-node proactive reclaim" (Davidlohr Bueso)
     implements a per-node control of proactive reclaim - beyond the
     current memcg-based implementation.

  "mm/damon: remove damon_callback" (SeongJae Park)
     replaces the damon_callback interface with a more general and
     powerful damon_call()+damos_walk() interface.

  "mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes)
     implements a number of mremap cleanups (of course) in preparation
     for adding new mremap() functionality: newly permit the remapping
     of multiple VMAs when the user is specifying MREMAP_FIXED. It still
     excludes some specialized situations where this cannot be performed
     reliably.

  "drop hugetlb_free_pgd_range()" (Anthony Yznaga)
     switches some sparc hugetlb code over to the generic version and
     removes the thus-unneeded hugetlb_free_pgd_range().

  "mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park)
     augments the present userspace-requested update of DAMON sysfs
     monitoring files. Automatic update is now provided, along with a
     tunable to control the update interval.

  "Some randome fixes and cleanups to swapfile" (Kemeng Shi)
     does what is claims.

  "mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand)
     provides (and uses) a means by which debug-style functions can grab
     a copy of a pageframe and inspect it locklessly without tripping
     over the races inherent in operating on the live pageframe
     directly.

  "use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan)
     addresses the large contention issues which can be triggered by
     reads from that procfs file. Latencies are reduced by more than
     half in some situations. The series also introduces several new
     selftests for the /proc/pid/maps interface.

  "__folio_split() clean up" (Zi Yan)
     cleans up __folio_split()!

  "Optimize mprotect() for large folios" (Dev Jain)
     provides some quite large (>3x) speedups to mprotect() when dealing
     with large folios.

  "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian)
     does some cleanup work in the selftests code.

  "tools/testing: expand mremap testing" (Lorenzo Stoakes)
     extends the mremap() selftest in several ways, including adding
     more checking of Lorenzo's recently added "permit mremap() move of
     multiple VMAs" feature.

  "selftests/damon/sysfs.py: test all parameters" (SeongJae Park)
     extends the DAMON sysfs interface selftest so that it tests all
     possible user-requested parameters. Rather than the present minimal
     subset"

* tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits)
  MAINTAINERS: add missing headers to mempory policy & migration section
  MAINTAINERS: add missing file to cgroup section
  MAINTAINERS: add MM MISC section, add missing files to MISC and CORE
  MAINTAINERS: add missing zsmalloc file
  MAINTAINERS: add missing files to page alloc section
  MAINTAINERS: add missing shrinker files
  MAINTAINERS: move memremap.[ch] to hotplug section
  MAINTAINERS: add missing mm_slot.h file THP section
  MAINTAINERS: add missing interval_tree.c to memory mapping section
  MAINTAINERS: add missing percpu-internal.h file to per-cpu section
  mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info()
  selftests/damon: introduce _common.sh to host shared function
  selftests/damon/sysfs.py: test runtime reduction of DAMON parameters
  selftests/damon/sysfs.py: test non-default parameters runtime commit
  selftests/damon/sysfs.py: generalize DAMON context commit assertion
  selftests/damon/sysfs.py: generalize monitoring attributes commit assertion
  selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion
  selftests/damon/sysfs.py: test DAMOS filters commitment
  selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion
  selftests/damon/sysfs.py: test DAMOS destinations commitment
  ...
2025-07-31 14:57:54 -07:00
Linus Torvalds
63eb28bb14 ARM:
- Host driver for GICv5, the next generation interrupt controller for
   arm64, including support for interrupt routing, MSIs, interrupt
   translation and wired interrupts.
 
 - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on
   GICv5 hardware, leveraging the legacy VGIC interface.
 
 - Userspace control of the 'nASSGIcap' GICv3 feature, allowing
   userspace to disable support for SGIs w/o an active state on hardware
   that previously advertised it unconditionally.
 
 - Map supporting endpoints with cacheable memory attributes on systems
   with FEAT_S2FWB and DIC where KVM no longer needs to perform cache
   maintenance on the address range.
 
 - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the guest
   hypervisor to inject external aborts into an L2 VM and take traps of
   masked external aborts to the hypervisor.
 
 - Convert more system register sanitization to the config-driven
   implementation.
 
 - Fixes to the visibility of EL2 registers, namely making VGICv3 system
   registers accessible through the VGIC device instead of the ONE_REG
   vCPU ioctls.
 
 - Various cleanups and minor fixes.
 
 LoongArch:
 
 - Add stat information for in-kernel irqchip
 
 - Add tracepoints for CPUCFG and CSR emulation exits
 
 - Enhance in-kernel irqchip emulation
 
 - Various cleanups.
 
 RISC-V:
 
 - Enable ring-based dirty memory tracking
 
 - Improve perf kvm stat to report interrupt events
 
 - Delegate illegal instruction trap to VS-mode
 
 - MMU improvements related to upcoming nested virtualization
 
 s390x
 
 - Fixes
 
 x86:
 
 - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC,
   PIC, and PIT emulation at compile time.
 
 - Share device posted IRQ code between SVM and VMX and
   harden it against bugs and runtime errors.
 
 - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1)
   instead of O(n).
 
 - For MMIO stale data mitigation, track whether or not a vCPU has access to
   (host) MMIO based on whether the page tables have MMIO pfns mapped; using
   VFIO is prone to false negatives
 
 - Rework the MSR interception code so that the SVM and VMX APIs are more or
   less identical.
 
 - Recalculate all MSR intercepts from scratch on MSR filter changes,
   instead of maintaining shadow bitmaps.
 
 - Advertise support for LKGS (Load Kernel GS base), a new instruction
   that's loosely related to FRED, but is supported and enumerated
   independently.
 
 - Fix a user-triggerable WARN that syzkaller found by setting the vCPU
   in INIT_RECEIVED state (aka wait-for-SIPI), and then putting the vCPU
   into VMX Root Mode (post-VMXON).  Trying to detect every possible path
   leading to architecturally forbidden states is hard and even risks
   breaking userspace (if it goes from valid to valid state but passes
   through invalid states), so just wait until KVM_RUN to detect that
   the vCPU state isn't allowed.
 
 - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of
   APERF/MPERF reads, so that a "properly" configured VM can access
   APERF/MPERF.  This has many caveats (APERF/MPERF cannot be zeroed
   on vCPU creation or saved/restored on suspend and resume, or preserved
   over thread migration let alone VM migration) but can be useful whenever
   you're interested in letting Linux guests see the effective physical CPU
   frequency in /proc/cpuinfo.
 
 - Reject KVM_SET_TSC_KHZ for vm file descriptors if vCPUs have been
   created, as there's no known use case for changing the default
   frequency for other VM types and it goes counter to the very reason
   why the ioctl was added to the vm file descriptor.  And also, there
   would be no way to make it work for confidential VMs with a "secure"
   TSC, so kill two birds with one stone.
 
 - Dynamically allocation the shadow MMU's hashed page list, and defer
   allocating the hashed list until it's actually needed (the TDP MMU
   doesn't use the list).
 
 - Extract many of KVM's helpers for accessing architectural local APIC
   state to common x86 so that they can be shared by guest-side code for
   Secure AVIC.
 
 - Various cleanups and fixes.
 
 x86 (Intel):
 
 - Preserve the host's DEBUGCTL.FREEZE_IN_SMM when running the guest.
   Failure to honor FREEZE_IN_SMM can leak host state into guests.
 
 - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter to prevent
   L1 from running L2 with features that KVM doesn't support, e.g. BTF.
 
 x86 (AMD):
 
 - WARN and reject loading kvm-amd.ko instead of panicking the kernel if the
   nested SVM MSRPM offsets tracker can't handle an MSR (which is pretty
   much a static condition and therefore should never happen, but still).
 
 - Fix a variety of flaws and bugs in the AVIC device posted IRQ code.
 
 - Inhibit AVIC if a vCPU's ID is too big (relative to what hardware
   supports) instead of rejecting vCPU creation.
 
 - Extend enable_ipiv module param support to SVM, by simply leaving
   IsRunning clear in the vCPU's physical ID table entry.
 
 - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by
   erratum #1235, to allow (safely) enabling AVIC on such CPUs.
 
 - Request GA Log interrupts if and only if the target vCPU is blocking,
   i.e. only if KVM needs a notification in order to wake the vCPU.
 
 - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the
   vCPU's CPUID model.
 
 - Accept any SNP policy that is accepted by the firmware with respect to
   SMT and single-socket restrictions.  An incompatible policy doesn't put
   the kernel at risk in any way, so there's no reason for KVM to care.
 
 - Drop a superfluous WBINVD (on all CPUs!) when destroying a VM and
   use WBNOINVD instead of WBINVD when possible for SEV cache maintenance.
 
 - When reclaiming memory from an SEV guest, only do cache flushes on CPUs
   that have ever run a vCPU for the guest, i.e. don't flush the caches for
   CPUs that can't possibly have cache lines with dirty, encrypted data.
 
 Generic:
 
 - Rework irqbypass to track/match producers and consumers via an xarray
   instead of a linked list.  Using a linked list leads to O(n^2) insertion
   times, which is hugely problematic for use cases that create large
   numbers of VMs.  Such use cases typically don't actually use irqbypass,
   but eliminating the pointless registration is a future problem to
   solve as it likely requires new uAPI.
 
 - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a "void *",
   to avoid making a simple concept unnecessarily difficult to understand.
 
 - Decouple device posted IRQs from VFIO device assignment, as binding a VM
   to a VFIO group is not a requirement for enabling device posted IRQs.
 
 - Clean up and document/comment the irqfd assignment code.
 
 - Disallow binding multiple irqfds to an eventfd with a priority waiter,
   i.e.  ensure an eventfd is bound to at most one irqfd through the entire
   host, and add a selftest to verify eventfd:irqfd bindings are globally
   unique.
 
 - Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues
   related to private <=> shared memory conversions.
 
 - Drop guest_memfd's .getattr() implementation as the VFS layer will call
   generic_fillattr() if inode_operations.getattr is NULL.
 
 - Fix issues with dirty ring harvesting where KVM doesn't bound the
   processing of entries in any way, which allows userspace to keep KVM
   in a tight loop indefinitely.
 
 - Kill off kvm_arch_{start,end}_assignment() and x86's associated tracking,
   now that KVM no longer uses assigned_device_count as a heuristic for
   either irqbypass usage or MDS mitigation.
 
 Selftests:
 
 - Fix a comment typo.
 
 - Verify KVM is loaded when getting any KVM module param so that attempting
   to run a selftest without kvm.ko loaded results in a SKIP message about
   KVM not being loaded/enabled (versus some random parameter not existing).
 
 - Skip tests that hit EACCES when attempting to access a file, and rpint
   a "Root required?" help message.  In most cases, the test just needs to
   be run with elevated permissions.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmiKXMgUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMhMQf/QDhC/CP1aGXph2whuyeD2NMqPKiU
 9KdnDNST+ftPwjg9QxZ9mTaa8zeVz/wly6XlxD9OQHy+opM1wcys3k0GZAFFEEQm
 YrThgURdzEZ3nwJZgb+m0t4wjJQtpiFIBwAf7qq6z1VrqQBEmHXJ/8QxGuqO+BNC
 j5q/X+q6KZwehKI6lgFBrrOKWFaxqhnRAYfW6rGBxRXxzTJuna37fvDpodQnNceN
 zOiq+avfriUMArTXTqOteJNKU0229HjiPSnjILLnFQ+B3akBlwNG0jk7TMaAKR6q
 IZWG1EIS9q1BAkGXaw6DE1y6d/YwtXCR5qgAIkiGwaPt5yj9Oj6kRN2Ytw==
 =j2At
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "ARM:

   - Host driver for GICv5, the next generation interrupt controller for
     arm64, including support for interrupt routing, MSIs, interrupt
     translation and wired interrupts

   - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on
     GICv5 hardware, leveraging the legacy VGIC interface

   - Userspace control of the 'nASSGIcap' GICv3 feature, allowing
     userspace to disable support for SGIs w/o an active state on
     hardware that previously advertised it unconditionally

   - Map supporting endpoints with cacheable memory attributes on
     systems with FEAT_S2FWB and DIC where KVM no longer needs to
     perform cache maintenance on the address range

   - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the
     guest hypervisor to inject external aborts into an L2 VM and take
     traps of masked external aborts to the hypervisor

   - Convert more system register sanitization to the config-driven
     implementation

   - Fixes to the visibility of EL2 registers, namely making VGICv3
     system registers accessible through the VGIC device instead of the
     ONE_REG vCPU ioctls

   - Various cleanups and minor fixes

  LoongArch:

   - Add stat information for in-kernel irqchip

   - Add tracepoints for CPUCFG and CSR emulation exits

   - Enhance in-kernel irqchip emulation

   - Various cleanups

  RISC-V:

   - Enable ring-based dirty memory tracking

   - Improve perf kvm stat to report interrupt events

   - Delegate illegal instruction trap to VS-mode

   - MMU improvements related to upcoming nested virtualization

  s390x

   - Fixes

  x86:

   - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O
     APIC, PIC, and PIT emulation at compile time

   - Share device posted IRQ code between SVM and VMX and harden it
     against bugs and runtime errors

   - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups
     O(1) instead of O(n)

   - For MMIO stale data mitigation, track whether or not a vCPU has
     access to (host) MMIO based on whether the page tables have MMIO
     pfns mapped; using VFIO is prone to false negatives

   - Rework the MSR interception code so that the SVM and VMX APIs are
     more or less identical

   - Recalculate all MSR intercepts from scratch on MSR filter changes,
     instead of maintaining shadow bitmaps

   - Advertise support for LKGS (Load Kernel GS base), a new instruction
     that's loosely related to FRED, but is supported and enumerated
     independently

   - Fix a user-triggerable WARN that syzkaller found by setting the
     vCPU in INIT_RECEIVED state (aka wait-for-SIPI), and then putting
     the vCPU into VMX Root Mode (post-VMXON). Trying to detect every
     possible path leading to architecturally forbidden states is hard
     and even risks breaking userspace (if it goes from valid to valid
     state but passes through invalid states), so just wait until
     KVM_RUN to detect that the vCPU state isn't allowed

   - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling
     interception of APERF/MPERF reads, so that a "properly" configured
     VM can access APERF/MPERF. This has many caveats (APERF/MPERF
     cannot be zeroed on vCPU creation or saved/restored on suspend and
     resume, or preserved over thread migration let alone VM migration)
     but can be useful whenever you're interested in letting Linux
     guests see the effective physical CPU frequency in /proc/cpuinfo

   - Reject KVM_SET_TSC_KHZ for vm file descriptors if vCPUs have been
     created, as there's no known use case for changing the default
     frequency for other VM types and it goes counter to the very reason
     why the ioctl was added to the vm file descriptor. And also, there
     would be no way to make it work for confidential VMs with a
     "secure" TSC, so kill two birds with one stone

   - Dynamically allocation the shadow MMU's hashed page list, and defer
     allocating the hashed list until it's actually needed (the TDP MMU
     doesn't use the list)

   - Extract many of KVM's helpers for accessing architectural local
     APIC state to common x86 so that they can be shared by guest-side
     code for Secure AVIC

   - Various cleanups and fixes

  x86 (Intel):

   - Preserve the host's DEBUGCTL.FREEZE_IN_SMM when running the guest.
     Failure to honor FREEZE_IN_SMM can leak host state into guests

   - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter to
     prevent L1 from running L2 with features that KVM doesn't support,
     e.g. BTF

  x86 (AMD):

   - WARN and reject loading kvm-amd.ko instead of panicking the kernel
     if the nested SVM MSRPM offsets tracker can't handle an MSR (which
     is pretty much a static condition and therefore should never
     happen, but still)

   - Fix a variety of flaws and bugs in the AVIC device posted IRQ code

   - Inhibit AVIC if a vCPU's ID is too big (relative to what hardware
     supports) instead of rejecting vCPU creation

   - Extend enable_ipiv module param support to SVM, by simply leaving
     IsRunning clear in the vCPU's physical ID table entry

   - Disable IPI virtualization, via enable_ipiv, if the CPU is affected
     by erratum #1235, to allow (safely) enabling AVIC on such CPUs

   - Request GA Log interrupts if and only if the target vCPU is
     blocking, i.e. only if KVM needs a notification in order to wake
     the vCPU

   - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to
     the vCPU's CPUID model

   - Accept any SNP policy that is accepted by the firmware with respect
     to SMT and single-socket restrictions. An incompatible policy
     doesn't put the kernel at risk in any way, so there's no reason for
     KVM to care

   - Drop a superfluous WBINVD (on all CPUs!) when destroying a VM and
     use WBNOINVD instead of WBINVD when possible for SEV cache
     maintenance

   - When reclaiming memory from an SEV guest, only do cache flushes on
     CPUs that have ever run a vCPU for the guest, i.e. don't flush the
     caches for CPUs that can't possibly have cache lines with dirty,
     encrypted data

  Generic:

   - Rework irqbypass to track/match producers and consumers via an
     xarray instead of a linked list. Using a linked list leads to
     O(n^2) insertion times, which is hugely problematic for use cases
     that create large numbers of VMs. Such use cases typically don't
     actually use irqbypass, but eliminating the pointless registration
     is a future problem to solve as it likely requires new uAPI

   - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a
     "void *", to avoid making a simple concept unnecessarily difficult
     to understand

   - Decouple device posted IRQs from VFIO device assignment, as binding
     a VM to a VFIO group is not a requirement for enabling device
     posted IRQs

   - Clean up and document/comment the irqfd assignment code

   - Disallow binding multiple irqfds to an eventfd with a priority
     waiter, i.e. ensure an eventfd is bound to at most one irqfd
     through the entire host, and add a selftest to verify eventfd:irqfd
     bindings are globally unique

   - Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues
     related to private <=> shared memory conversions

   - Drop guest_memfd's .getattr() implementation as the VFS layer will
     call generic_fillattr() if inode_operations.getattr is NULL

   - Fix issues with dirty ring harvesting where KVM doesn't bound the
     processing of entries in any way, which allows userspace to keep
     KVM in a tight loop indefinitely

   - Kill off kvm_arch_{start,end}_assignment() and x86's associated
     tracking, now that KVM no longer uses assigned_device_count as a
     heuristic for either irqbypass usage or MDS mitigation

  Selftests:

   - Fix a comment typo

   - Verify KVM is loaded when getting any KVM module param so that
     attempting to run a selftest without kvm.ko loaded results in a
     SKIP message about KVM not being loaded/enabled (versus some random
     parameter not existing)

   - Skip tests that hit EACCES when attempting to access a file, and
     print a "Root required?" help message. In most cases, the test just
     needs to be run with elevated permissions"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (340 commits)
  Documentation: KVM: Use unordered list for pre-init VGIC registers
  RISC-V: KVM: Avoid re-acquiring memslot in kvm_riscv_gstage_map()
  RISC-V: KVM: Use find_vma_intersection() to search for intersecting VMAs
  RISC-V: perf/kvm: Add reporting of interrupt events
  RISC-V: KVM: Enable ring-based dirty memory tracking
  RISC-V: KVM: Fix inclusion of Smnpm in the guest ISA bitmap
  RISC-V: KVM: Delegate illegal instruction fault to VS mode
  RISC-V: KVM: Pass VMID as parameter to kvm_riscv_hfence_xyz() APIs
  RISC-V: KVM: Factor-out g-stage page table management
  RISC-V: KVM: Add vmid field to struct kvm_riscv_hfence
  RISC-V: KVM: Introduce struct kvm_gstage_mapping
  RISC-V: KVM: Factor-out MMU related declarations into separate headers
  RISC-V: KVM: Use ncsr_xyz() in kvm_riscv_vcpu_trap_redirect()
  RISC-V: KVM: Implement kvm_arch_flush_remote_tlbs_range()
  RISC-V: KVM: Don't flush TLB when PTE is unchanged
  RISC-V: KVM: Replace KVM_REQ_HFENCE_GVMA_VMID_ALL with KVM_REQ_TLB_FLUSH
  RISC-V: KVM: Rename and move kvm_riscv_local_tlb_sanitize()
  RISC-V: KVM: Drop the return value of kvm_riscv_vcpu_aia_init()
  RISC-V: KVM: Check kvm_riscv_vcpu_alloc_vector_context() return value
  KVM: arm64: selftests: Add FEAT_RAS EL2 registers to get-reg-list
  ...
2025-07-30 17:14:01 -07:00
Linus Torvalds
0b29600a30 Updates for interrupt chip drivers:
- Add support of forced affinity setting to yet offline CPUs for the
    MIPS-GIC to ensure that the affinity of per CPU interrupts can be set
    during the early bringup phase of a secondary CPU in the hotplug code
    before the CPU is set online and interrupts are enabled.\
 
  - Add support for the MIPS (RISC-V !?!?) P8700 SoC in the ACLINT_SSWI
    interrupt chip
 
  - Make the interrupt routing to RISV-V harts specification compliant so it
    supports arbitrary hart indices
 
  - Add a command line parameter and related handling to disable the generic
    RISCV IMSIC mechanism on platforms which use a trap-emulated IMSIC.
    Unfortunatly this is required because there is no mechanism available to
    discover this programatically.
 
  - Enable wakeup sources on the Renesas RZV2H driver
 
  - Convert interrupt chip drivers, which use a open coded variant of
    msi_create_parent_irq_domain() to use the new functionality
 
  - Convert interrupt chip drivers, which use the old style two level
    implementation of MSI support over to the MSI parent mechanism to
    prepare for removing at least one of the three PCI/MSI backend variants.
 
  - The usual cleanups and improvements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmiGj60THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoSh6D/9wY0G2dGz+EJeiDsldzB1n5jmf5I0k
 3XsI3o5j0Ma/Yy+nu9Re3fZq0+qzPFZZErxkBp5igCJbSoaIGheqOyXQDuQu/8tm
 s2t8Wx9k6er7Cywg9rU9pWKzJ6AFXFvcKOEvGG2q2+lFbJbIoGdAM93qPrOJhqeo
 a3NyhQv6kNl7xAjOVyEZmOlCZgCYotFwC0+K1TVQgGDGbwHWH3wad64gLTyyQQlK
 RvtbUBKfCBqqwLJ7Mww7Xclezjk/Hpgm/OppxBAglv5WyRd0e15u2dpdTdM9r4BC
 4wX5Old3ZgbqBQjdHGNlljthu4lO2S0PXU6j0EC8W2NiQjN+hPaKK/EPeerlAJCz
 UmxY0/E3HFNUk8ZHkiif3PGiOSvAbn0JwWi3+D6HuK1rlVTXNs07NIgUBk727Ty5
 S7r5JEuUA5s9dGta4pszxHGn/0Dqg/WvnMZGcbPNaV6POH47wNnPlO2mj14I1HLk
 SfG+deohJM34pVVq7fiqgGukLVPm6PfiJkXx90MK6l+BfE58uo7Oue9mm9pqT2dy
 b6K1gdNPRsZzG7AoAqkx3UrjQuD7maWIpDGb4VZeUW/34bthLygIDUY4OZhpdrUZ
 m33T8zv0PrmNuvnMdFt0RyoDTu8PC9rYS0XVvsIMqsMxJDE/URVGH2tCi5CVMiEg
 PbRWL56yGyT1NA==
 =5LGj
 -----END PGP SIGNATURE-----

Merge tag 'irq-drivers-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull interrupt chip driver updates from Thomas Gleixner:

 - Add support of forced affinity setting to yet offline CPUs for the
   MIPS-GIC to ensure that the affinity of per CPU interrupts can be set
   during the early bringup phase of a secondary CPU in the hotplug code
   before the CPU is set online and interrupts are enabled

 - Add support for the MIPS (RISC-V !?!?) P8700 SoC in the ACLINT_SSWI
   interrupt chip

 - Make the interrupt routing to RISV-V harts specification compliant so
   it supports arbitrary hart indices

 - Add a command line parameter and related handling to disable the
   generic RISCV IMSIC mechanism on platforms which use a trap-emulated
   IMSIC. Unfortunatly this is required because there is no mechanism
   available to discover this programatically.

 - Enable wakeup sources on the Renesas RZV2H driver

 - Convert interrupt chip drivers, which use a open coded variant of
   msi_create_parent_irq_domain() to use the new functionality

 - Convert interrupt chip drivers, which use the old style two level
   implementation of MSI support over to the MSI parent mechanism to
   prepare for removing at least one of the three PCI/MSI backend
   variants.

 - The usual cleanups and improvements all over the place

* tag 'irq-drivers-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
  irqchip/renesas-irqc: Convert to DEFINE_SIMPLE_DEV_PM_OPS()
  irqchip/renesas-intc-irqpin: Convert to DEFINE_SIMPLE_DEV_PM_OPS()
  irqchip/riscv-imsic: Add kernel parameter to disable IPIs
  irqchip/gic-v3: Fix GICD_CTLR register naming
  irqchip/ls-scfg-msi: Fix NULL dereference in error handling
  irqchip/ls-scfg-msi: Switch to use msi_create_parent_irq_domain()
  irqchip/armada-370-xp: Switch to msi_create_parent_irq_domain()
  irqchip/alpine-msi: Switch to msi_create_parent_irq_domain()
  irqchip/alpine-msi: Convert to __free
  irqchip/alpine-msi: Convert to lock guards
  irqchip/alpine-msi: Clean up whitespace style
  irqchip/sg2042-msi: Switch to msi_create_parent_irq_domain()
  irqchip/loongson-pch-msi.c: Switch to msi_create_parent_irq_domain()
  irqchip/imx-mu-msi: Convert to msi_create_parent_irq_domain() helper
  irqchip/riscv-imsic: Convert to msi_create_parent_irq_domain() helper
  irqchip/bcm2712-mip: Switch to msi_create_parent_irq_domain()
  irqdomain: Add device pointer to irq_domain_info and msi_domain_info
  irqchip/renesas-rzv2h: Remove unneeded includes
  irqchip/renesas-rzv2h: Enable SKIP_SET_WAKE and MASK_ON_SUSPEND
  irqchip/aslint-sswi: Resolve hart index
  ...
2025-07-29 13:26:05 -07:00
Paolo Bonzini
65164fd0f6 KVM/riscv changes for 6.17
- Enabled ring-based dirty memory tracking
 - Improved perf kvm stat to report interrupt events
 - Delegate illegal instruction trap to VS-mode
 - MMU related improvements for KVM RISC-V for upcoming
   nested virtualization
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmiIr0QACgkQrUjsVaLH
 LAf4yA/+KkfQCgMFhwpml6tIzFaa9yS+C9oqemnlWVT/wTADg18/+hraomqUnYhY
 HqOPdWo6O3vmH6E0jR6+AQXz5f4whYl8X8m+HdBHz8c1rF6ozoLMF4Qh3lsDDuZ0
 7pdIEzkNLCA3umBxXUzy7yexKWSzcGD3521toFMPADQHaZTwT8Om5/KLblHB+he3
 1DzsvErPWknWW1pynvSUJoD4zB41Qn364sJvyq4tAW6i8DmxLAmM/+Reh7GBP83r
 t7nAYVdnYikFj0oCb60NcFHqOQpk88mZTqCPMeZD1BoazEDXCPkdx0J44NsBRjun
 BhEpgBLIZDIwOF1A/DDIPrNuNOjSeeUAsAY1sK/yVEOkVZ1HPndyCil5SkE1FJHT
 dmsJBXq96dTlXYo9jBfExFUaUCI1mivLbX7uziIT1876IgLr5NlEJSbwk+TQ8VmR
 IS1PISi7yp5LeZcJEh6PBYgo02UE9gQ/C3tvvcaHbxXyQjVacB6Dw7EnaBArYMwv
 dbEPkxXem90Vup90ixgdLBW3nGCDckpogbsmlqoxV5m3MOknE5L8IuWxKXKB56Tp
 pxp7o1JywBV+Ym5w1BkpTvyEL/a2VDGFOfjryq/h8NAVxXPfwk2plyNhrwLt+Naz
 dYF7RprQL3XTtPnVOND6mMHyl2Sz9y0dc2tj6mpGamSBWPxlP/w=
 =ifud
 -----END PGP SIGNATURE-----

Merge tag 'kvm-riscv-6.17-2' of https://github.com/kvm-riscv/linux into HEAD

KVM/riscv changes for 6.17

- Enabled ring-based dirty memory tracking
- Improved perf kvm stat to report interrupt events
- Delegate illegal instruction trap to VS-mode
- MMU related improvements for KVM RISC-V for upcoming
  nested virtualization
2025-07-29 08:33:04 -04:00
Xu Lu
3729fe8cbb RISC-V: KVM: Delegate illegal instruction fault to VS mode
Delegate illegal instruction fault to VS mode by default to avoid such
exceptions being trapped to HS and redirected back to VS.

The delegation of illegal instruction fault is particularly important
to guest applications that use vector instructions frequently. In such
cases, an illegal instruction fault will be raised when guest user thread
uses vector instruction the first time and then guest kernel will enable
user thread to execute following vector instructions.

The fw pmu event counter remains undeleted so that guest can still query
illegal instruction events via sbi call. Guest will only see zero count
on illegal instruction faults and know 'firmware' has delegated it.

Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Xu Lu <luxu.kernel@bytedance.com>
Link: https://lore.kernel.org/r/20250714094554.89151-1-luxu.kernel@bytedance.com
Signed-off-by: Anup Patel <anup@brainfault.org>
2025-07-28 22:27:40 +05:30
Anup Patel
1f6d0eee54 RISC-V: KVM: Pass VMID as parameter to kvm_riscv_hfence_xyz() APIs
Currently, all kvm_riscv_hfence_xyz() APIs assume VMID to be the
host VMID of the Guest/VM which resticts use of these APIs only
for host TLB maintenance. Let's allow passing VMID as a parameter
to all kvm_riscv_hfence_xyz() APIs so that they can be re-used
for nested virtualization related TLB maintenance.

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-13-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
2025-07-28 22:27:32 +05:30
Anup Patel
dd82e35638 RISC-V: KVM: Factor-out g-stage page table management
The upcoming nested virtualization can share g-stage page table
management with the current host g-stage implementation hence
factor-out g-stage page table management as separate sources
and also use "kvm_riscv_mmu_" prefix for host g-stage functions.

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-12-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
2025-07-28 22:27:30 +05:30
Anup Patel
4c933f3a39 RISC-V: KVM: Add vmid field to struct kvm_riscv_hfence
Currently, the struct kvm_riscv_hfence does not have vmid field
and various hfence processing functions always pick vmid assigned
to the guest/VM. This prevents us from doing hfence operation on
arbitrary vmid hence add vmid field to struct kvm_riscv_hfence
and use it wherever applicable.

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-11-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
2025-07-28 22:27:27 +05:30
Anup Patel
f035b44b51 RISC-V: KVM: Introduce struct kvm_gstage_mapping
Introduce struct kvm_gstage_mapping which represents a g-stage
mapping at a particular g-stage page table level. Also, update
the kvm_riscv_gstage_map() to return the g-stage mapping upon
success.

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-10-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
2025-07-28 22:27:25 +05:30
Anup Patel
4ecbd3eb5b RISC-V: KVM: Factor-out MMU related declarations into separate headers
The MMU, TLB, and VMID management for KVM RISC-V already exists as
seprate sources so create separate headers along these lines. This
further simplifies asm/kvm_host.h header.

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-9-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
2025-07-28 22:27:23 +05:30
Anup Patel
ca539ba4bc RISC-V: KVM: Implement kvm_arch_flush_remote_tlbs_range()
The kvm_arch_flush_remote_tlbs_range() expected by KVM core can be
easily implemented for RISC-V using kvm_riscv_hfence_gvma_vmid_gpa()
hence provide it.

Also with kvm_arch_flush_remote_tlbs_range() available for RISC-V, the
mmu_wp_memory_region() can happily use kvm_flush_remote_tlbs_memslot()
instead of kvm_flush_remote_tlbs().

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-7-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
2025-07-28 22:27:18 +05:30
Anup Patel
7584eb611e RISC-V: KVM: Replace KVM_REQ_HFENCE_GVMA_VMID_ALL with KVM_REQ_TLB_FLUSH
The KVM_REQ_HFENCE_GVMA_VMID_ALL is same as KVM_REQ_TLB_FLUSH so
to avoid confusion let's replace KVM_REQ_HFENCE_GVMA_VMID_ALL with
KVM_REQ_TLB_FLUSH. Also, rename kvm_riscv_hfence_gvma_vmid_all_process()
to kvm_riscv_tlb_flush_process().

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-5-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
2025-07-28 22:27:13 +05:30
Anup Patel
b79bf2025d RISC-V: KVM: Rename and move kvm_riscv_local_tlb_sanitize()
The kvm_riscv_local_tlb_sanitize() deals with sanitizing current
VMID related TLB mappings when a VCPU is moved from one host CPU
to another.

Let's move kvm_riscv_local_tlb_sanitize() to VMID management
sources and rename it to kvm_riscv_gstage_vmid_sanitize().

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-4-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
2025-07-28 22:27:10 +05:30
Anup Patel
7c67de21ee RISC-V: KVM: Drop the return value of kvm_riscv_vcpu_aia_init()
The kvm_riscv_vcpu_aia_init() does not return any failure so drop
the return value which is always zero.

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250618113532.471448-3-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
2025-07-28 22:27:07 +05:30
Ryan Roberts
a9e056de66 mm: remove arch_flush_tlb_batched_pending() arch helper
Since commit 4b63491838 ("arm64/mm: Close theoretical race where stale
TLB entry remains valid"), all arches that use tlbbatch for reclaim
(arm64, riscv, x86) implement arch_flush_tlb_batched_pending() with a
flush_tlb_mm().

So let's simplify by removing the unnecessary abstraction and doing the
flush_tlb_mm() directly in flush_tlb_batched_pending().  This effectively
reverts commit db6c1f6f23 ("mm/tlbbatch: introduce
arch_flush_tlb_batched_pending()").

Link: https://lkml.kernel.org/r/20250609103132.447370-1-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Suggested-by: Will Deacon <will@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24 19:12:32 -07:00
Clément Léger
c046de827c RISC-V: KVM: add SBI extension reset callback
Currently, only the STA extension needed a reset function but that's
going to be the case for FWFT as well. Add a reset callback that can
be implemented by SBI extensions.

Signed-off-by: Clément Léger <cleger@rivosinc.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20250523101932.1594077-13-cleger@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
2025-07-23 17:19:44 +05:30
Clément Léger
cf648c400f RISC-V: KVM: add SBI extension init()/deinit() functions
The FWFT SBI extension will need to dynamically allocate memory and do
init time specific initialization. Add an init/deinit callbacks that
allows to do so.

Signed-off-by: Clément Léger <cleger@rivosinc.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Link: https://lore.kernel.org/r/20250523101932.1594077-12-cleger@rivosinc.com
Signed-off-by: Anup Patel <anup@brainfault.org>
2025-07-23 17:19:42 +05:30
FUJITA Tomonori
8ad470d4e3 riscv/bug: Add ARCH_WARN_ASM macro for BUG/WARN asm code sharing with Rust
Add new ARCH_WARN_ASM macro for BUG/WARN assembly code sharing with
Rust to avoid the duplication.

No functional changes.

Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com>
Link: https://lore.kernel.org/r/20250502094537.231725-3-fujita.tomonori@gmail.com
[ Remove ending newline in `ARCH_WARN_ASM` content to be closer to the
  original. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2025-07-23 00:04:53 +02:00
Linus Torvalds
414aaef153 RISC-V Fixes for 6.16-rc7
* Three fixes for unnecessary spew: an ACPI CPPC boot-time debug
   message, the link-time warnings for R_RISCV_NONE in binaries, and some
   compile-time warnings in __put_user_nocheck.
 * A fix for a race during text patching.
 * Interrupts are no longer disabled during exception handling.
 * A fix for a missing sign extension in the misaligned load handler.
 * A fix to avoid static ftrace being selected in Kconfig, as we have
   moved to dynamic ftrace.
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCAA3FiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmh6wioZHHBhbG1lcmRh
 YmJlbHRAZ29vZ2xlLmNvbQAKCRAuExnzX7sYiUhvD/4uTFU7CpUfbw0mlN4e2aJl
 NdJPzfZn/Le+mI2L4iq9IW7qousBAM9bcjd4SHRTUPSRQpsF315n7hf+1dI5hLHk
 vmbnzRuVS5FqL026t9tx61Kx5d8zPUzWoHbKHNsGyn4p6mnNyDpLrfecDR83VF27
 ZzwXYVh2xHcqK5NSpKqNm35E7KNHYJtL07Hb5s2XnuW+ML/mfaDoNDKrLvwUSpEo
 eS7kMyc6pKmia33b/Brb5WDUsArdYlfI6CHICjD7eDAYsj8KwjtKxRFGDavAIazw
 6RTTWD414+XbHNWByQRvroVihK1Orr2MP5TtCXr3b/7ehnViIkipL9qZvrpLqTJO
 Lq+lr7Uxw5sgMWvnwgf03OB6BbPEiAeDZ2xr4lgEnSBtJQ/fn4VAAHFJXgPwwrgL
 bOF0+/DDpRGo+VJ1n5fbRF9zRczXD5UnakvDdB3k7XPMyf5Y9+dzAa6XdrgwW35U
 tIPCP79l5lVa10Uzc77CA2+pDrSeJomnDrkfqdFXL6nf+8zddbuR2SDepkSDY6b1
 XHEjnZJF9yoIuf2boE7/CEmVgyn/JRJ6cPu1yAd6R6O9jg+cFs9kXWepU5NdaCwo
 Db/8EeXips/ktAbTgNHzLMWL956/afDS8Wl6PUiqTGrvbYywQsDtDgSOsYe0oubo
 NepL3StcPKi1OIjrRUuslw==
 =2jFG
 -----END PGP SIGNATURE-----

Merge tag 'riscv-for-linus-6.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:

 - Three fixes for unnecessary spew: an ACPI CPPC boot-time debug
   message, the link-time warnings for R_RISCV_NONE in binaries, and
   some compile-time warnings in __put_user_nocheck

 - A fix for a race during text patching

 - Interrupts are no longer disabled during exception handling

 - A fix for a missing sign extension in the misaligned load handler

 - A fix to avoid static ftrace being selected in Kconfig, as we have
   moved to dynamic ftrace

* tag 'riscv-for-linus-6.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: uaccess: Fix -Wuninitialized and -Wshadow in __put_user_nocheck
  riscv: Stop supporting static ftrace
  riscv: traps_misaligned: properly sign extend value in misaligned load handler
  riscv: Enable interrupt during exception handling
  riscv: ftrace: Properly acquire text_mutex to fix a race condition
  ACPI: RISC-V: Remove unnecessary CPPC debug message
  riscv: Stop considering R_RISCV_NONE as bad relocations
2025-07-18 15:31:46 -07:00
Nathan Chancellor
b65ca21835
riscv: uaccess: Fix -Wuninitialized and -Wshadow in __put_user_nocheck
After a recent change in clang to strengthen uninitialized warnings [1],
there is a warning from val being uninitialized in __put_user_nocheck
when called from futex_put_value():

  kernel/futex/futex.h:326:18: warning: variable 'val' is uninitialized when used within its own initialization [-Wuninitialized]
    326 |         unsafe_put_user(val, to, Efault);
        |         ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~
  arch/riscv/include/asm/uaccess.h:464:21: note: expanded from macro 'unsafe_put_user'
    464 |         __put_user_nocheck(x, (ptr), label)
        |         ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~
  arch/riscv/include/asm/uaccess.h:314:36: note: expanded from macro '__put_user_nocheck'
    314 |                 __inttype(x) val = (__inttype(x))x;                     \
        |                              ~~~                 ^

While not on by default, -Wshadow flags the same mistake:

  kernel/futex/futex.h:326:2: warning: declaration shadows a local variable [-Wshadow]
    326 |         unsafe_put_user(val, to, Efault);
        |         ^
  arch/riscv/include/asm/uaccess.h:464:2: note: expanded from macro 'unsafe_put_user'
    464 |         __put_user_nocheck(x, (ptr), label)
        |         ^
  arch/riscv/include/asm/uaccess.h:314:16: note: expanded from macro '__put_user_nocheck'
    314 |                 __inttype(x) val = (__inttype(x))x;                     \
        |                              ^
  kernel/futex/futex.h:320:48: note: previous declaration is here
    320 | static __always_inline int futex_put_value(u32 val, u32 __user *to)
        |                                                ^

Use a three underscore prefix for the val variable in __put_user_nocheck
to avoid clashing with either val or __val, which are both used within
the put_user macros, clearing up all warnings.

Closes: https://github.com/ClangBuiltLinux/linux/issues/2109
Fixes: ca1a66cdd6 ("riscv: uaccess: do not do misaligned accesses in get/put_user()")
Link: 2464313eef [1]
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20250715-riscv-uaccess-fix-self-init-val-v1-1-82b8e911f120@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-07-16 10:29:39 -07:00
Anup Patel
4cec89db80 RISC-V: KVM: Move HGEI[E|P] CSR access to IMSIC virtualization
Currently, the common AIA functions kvm_riscv_vcpu_aia_has_interrupts()
and kvm_riscv_aia_wakeon_hgei() lookup HGEI line using an array of VCPU
pointers before accessing HGEI[E|P] CSR which is slow and prone to race
conditions because there is a separate per-hart lock for the VCPU pointer
array and a separate per-VCPU rwlock for IMSIC VS-file (including HGEI
line) used by the VCPU. Due to these race conditions, it is observed
on QEMU RISC-V host that Guest VCPUs sleep in WFI and never wakeup even
with interrupt pending in the IMSIC VS-file because VCPUs were waiting
for HGEI wakeup on the wrong host CPU.

The IMSIC virtualization already keeps track of the HGEI line and the
associated IMSIC VS-file used by each VCPU so move the HGEI[E|P] CSR
access to IMSIC virtualization so that costly HGEI line lookup can be
avoided and likelihood of race-conditions when updating HGEI[E|P] CSR
is also reduced.

Reviewed-by: Atish Patra <atishp@rivosinc.com>
Tested-by: Atish Patra <atishp@rivosinc.com>
Tested-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com>
Fixes: 3385339296 ("RISC-V: KVM: Use IMSIC guest files when available")
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Link: https://lore.kernel.org/r/20250707035345.17494-3-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
2025-07-11 18:33:27 +05:30
Alistair Popple
d438d27341 mm: remove devmap related functions and page table bits
Now that DAX and all other reference counts to ZONE_DEVICE pages are
managed normally there is no need for the special devmap PTE/PMD/PUD page
table bits.  So drop all references to these, freeing up a software
defined page table bit on architectures supporting it.

Link: https://lkml.kernel.org/r/6389398c32cc9daa3dfcaa9f79c7972525d310ce.1750323463.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Will Deacon <will@kernel.org> # arm64
Acked-by: David Hildenbrand <david@redhat.com>
Suggested-by: Chunyan Zhang <zhang.lyra@gmail.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Björn Töpel <bjorn@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: John Groves <john@groves.net>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:42:18 -07:00
Linus Torvalds
867b9987a3 RISC-V Fixes for 5.16-rc4
* .rodata is no longer linkd into PT_DYNAMIC, it was not supposed to be
   there in the first place and resultst in invalid (but unused) entries.
   This manifests as at least warnings in llvm-readelf.
 * A fix for runtime constants with all-0 upper 32-bits.  This should
   only manifest on MMU=n kernels.
 * A fix for context save/restore on systems using the T-Head vector
   extensions.
 * A fix for a conflicting "+r"/"r" register constraint in the VDSO
   getrandom syscall wrapper, which is undefined behavior in clang.
 * A fix for a missing register clobber in the RVV raid6 implementation.
   This manifests as a NULL pointer reference on some compilers, but
   could trigger in other ways.
 * Misaligned accesses from userspace at faulting addresses are now
   handled correctly.
 * A fix for an incorrect optimization that allowed access_ok() to mark
   invalid addresses as accessible, which can result in userspace
   triggering BUG()s.
 * A few fixes for build warnings, and an update to Drew's email address.
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCAA3FiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmhe80kZHHBhbG1lcmRh
 YmJlbHRAZ29vZ2xlLmNvbQAKCRAuExnzX7sYicV6EACT/5384tdpYSQ6WQ4K2mT2
 XxPbrYTJ4jrhZMugnfe1LHBokeBGoGPRK11Dr/PyNJ71oeeDF7opv0kxAfqsiOO3
 QrwUE/4zhGgEzs7Z6D8UgYiqVDfb4aMU+oZ0qIfy+r+cB4F9M65TIejdVj99V6Hu
 V9cjJ4ABM9KfaZhD5BvoqflblYtwuSg/VYsUmZH6aolDyadzTy4rWcPk1jdFJDQt
 tIEsXjc92KNAKGSFe8DDZjjhM216Th/nUsZcxI2DLRQjjHPNEthkAgLNltQGocU9
 gJ8U3IqfazgnqcZAlrr7BXlWYlBFH/wGXVsxuBL5LPov19RcTkjl2PWH7T08yyuv
 lCGXrfkz3hSu+Sa9A40w4LptrKNWUEFJztaPkQ68gn1ZQP7KB/rsWp+82dCqhT35
 RNxmSznLyTsHFRXR2n9fZrWX/F/LwxY7vaH7cTZUDkMHI8F7WP/3tlihxPCQaUHD
 dIb+osch8puxG3YjO7H99WrpJamNNw3+L1l2lXtXTRmXdxE+x7fyatmHX98mY8IC
 7NXGOdNNIEvv4i9vzSphYQHBOT3tBVfz40z878qfSL3xYHG3ZLMIsWuynaWDMI73
 QprwAPmdFxdmJrHyIY6gIiyrscNHz5WLMjkG4K+jXlsBBmDxJMAY5zzNdFoeUVDz
 tjnDY4DYc4fCnteKSA/hpw==
 =42TO
 -----END PGP SIGNATURE-----

Merge tag 'riscv-for-linus-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V Fixes for 5.16-rc4

 - .rodata is no longer linkd into PT_DYNAMIC.

   It was not supposed to be there in the first place and resulted in
   invalid (but unused) entries. This manifests as at least warnings in
   llvm-readelf

 - A fix for runtime constants with all-0 upper 32-bits. This should
   only manifest on MMU=n kernels

 - A fix for context save/restore on systems using the T-Head vector
   extensions

 - A fix for a conflicting "+r"/"r" register constraint in the VDSO
   getrandom syscall wrapper, which is undefined behavior in clang

 - A fix for a missing register clobber in the RVV raid6 implementation.

   This manifests as a NULL pointer reference on some compilers, but
   could trigger in other ways

 - Misaligned accesses from userspace at faulting addresses are now
   handled correctly

 - A fix for an incorrect optimization that allowed access_ok() to mark
   invalid addresses as accessible, which can result in userspace
   triggering BUG()s

 - A few fixes for build warnings, and an update to Drew's email address

* tag 'riscv-for-linus-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: export boot_cpu_hartid
  Revert "riscv: Define TASK_SIZE_MAX for __access_ok()"
  riscv: Fix sparse warning in vendor_extensions/sifive.c
  Revert "riscv: misaligned: fix sleeping function called during misaligned access handling"
  MAINTAINERS: Update Drew Fustini's email address
  RISC-V: uaccess: Wrap the get_user_8 uaccess macro
  raid6: riscv: Fix NULL pointer dereference caused by a missing clobber
  RISC-V: vDSO: Correct inline assembly constraints in the getrandom syscall wrapper
  riscv: vector: Fix context save/restore with xtheadvector
  riscv: fix runtime constant support for nommu kernels
  riscv: vdso: Exclude .rodata from the PT_DYNAMIC segment
2025-06-27 20:22:18 -07:00
Vladimir Kondratiev
5fe331cdcf riscv: Helper to parse hart index
RISC-V APLIC specification defines "hart index" in [1]. Similar definitions
can be found for ACLINT in [2]

Quote from the APLIC specification:

Within a given interrupt domain, each of the domain’s harts has a unique
index number in the range 0 to 2^14 − 1 (= 16,383). The index number a
domain associates with a hart may or may not have any relationship to the
unique hart identifier (“hart ID”) that the RISC-V Privileged
Architecture assigns to the hart. Two different interrupt domains may
employ entirely different index numbers for the same set of harts.

Further, it says in "4.5 Memory-mapped control region for an interrupt
domain":

The array of IDC structures may include some for potential hart index
numbers that are not actual hart index numbers in the domain.  For example,
the first IDC structure is always for hart index 0, but 0 is not
necessarily a valid index number for any hart in the domain.

Support arbitrary hart indices specified in an optional property
"riscv,hart-indexes" which is specified as an array of u32 elements, one
per interrupt target, listing hart indexes in the same order as in
"interrupts-extended".

If this property is not specified, fall back to use logical hart indices
within the domain.

Signed-off-by: Vladimir Kondratiev <vladimir.kondratiev@mobileye.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250612143911.3224046-2-vladimir.kondratiev@mobileye.com
Link: https://github.com/riscv/riscv-aia [1]
Link: https://github.com/riscvarchive/riscv-aclint [2]
2025-06-26 16:06:40 +02:00
Nam Cao
890ba5be63
Revert "riscv: Define TASK_SIZE_MAX for __access_ok()"
This reverts commit ad5643cf2f ("riscv: Define TASK_SIZE_MAX for
__access_ok()").

This commit changes TASK_SIZE_MAX to be LONG_MAX to optimize access_ok(),
because the previous TASK_SIZE_MAX (default to TASK_SIZE) requires some
computation.

The reasoning was that all user addresses are less than LONG_MAX, and all
kernel addresses are greater than LONG_MAX. Therefore access_ok() can
filter kernel addresses.

Addresses between TASK_SIZE and LONG_MAX are not valid user addresses, but
access_ok() let them pass. That was thought to be okay, because they are
not valid addresses at hardware level.

Unfortunately, one case is missed: get_user_pages_fast() happily accepts
addresses between TASK_SIZE and LONG_MAX. futex(), for instance, uses
get_user_pages_fast(). This causes the problem reported by Robert [1].

Therefore, revert this commit. TASK_SIZE_MAX is changed to the default:
TASK_SIZE.

This unfortunately reduces performance, because TASK_SIZE is more expensive
to compute compared to LONG_MAX. But correctness first, we can think about
optimization later, if required.

Reported-by: <rtm@csail.mit.edu>
Closes: https://lore.kernel.org/linux-riscv/77605.1750245028@localhost/
Signed-off-by: Nam Cao <namcao@linutronix.de>
Cc: stable@vger.kernel.org
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Fixes: ad5643cf2f ("riscv: Define TASK_SIZE_MAX for __access_ok()")
Link: https://lore.kernel.org/r/20250619155858.1249789-1-namcao@linutronix.de
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-23 16:00:23 -07:00
Palmer Dabbelt
2aa5801ada
RISC-V: uaccess: Wrap the get_user_8 uaccess macro
I must have lost this rebasing things during the merge window, I know I
got it at some point but it's not here now.  Without this I get warnings
along the lines of

    include/linux/fs.h:3975:15: warning: label followed by a declaration is a C23 extension [-Wc23-extensions]
     3975 |         if (unlikely(get_user(c, path)))
          |                      ^
    arch/riscv/include/asm/uaccess.h:274:3: note: expanded from macro 'get_user'
      274 |                 __get_user((x), __p) :                          \
          |                 ^
    arch/riscv/include/asm/uaccess.h:244:2: note: expanded from macro '__get_user'
      244 |         __get_user_error(__gu_val, __gu_ptr, __gu_err);         \
          |         ^
    arch/riscv/include/asm/uaccess.h:207:2: note: expanded from macro '__get_user_error'
      207 |         __ge  LD [M]  net/802/psnap.ko
    t_user_nocheck(x, ptr, __gu_failed);                        \
          |         ^
    arch/riscv/include/asm/uaccess.h:196:3: note: expanded from macro '__get_user_nocheck'
      196 |                 __get_user_8((x), __gu_ptr, label);             \
          |                 ^
    arch/riscv/include/asm/uaccess.h:130:2: note: expanded from macro '__get_user_8'
      130 |         u32 __user *__ptr = (u32 __user *)(ptr);                \
          |         ^

Link: https://lore.kernel.org/r/20250610213058.24852-1-palmer@dabbelt.com
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: stable@vger.kernel.org
Fixes: f6bff7827a ("riscv: uaccess: use 'asm_goto_output' for get_user()")
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-12 12:41:46 -07:00
Palmer Dabbelt
5c5ecd1f34
Merge tag 'riscv-fixes-6.16-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/alexghiti/linux into fixes
riscv fixes for 6.16-rc1

- A fix for the newly introduced getrandom vdso where clang optimizes
  away a register variable which is both an input and an output
  parameter
- A fix for theadvector where we did not save all the vector registers,
  only a few of them

* tag 'riscv-fixes-6.16-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/alexghiti/linux:
  RISC-V: vDSO: Correct inline assembly constraints in the getrandom syscall wrapper
  riscv: vector: Fix context save/restore with xtheadvector
2025-06-12 12:14:06 -07:00
Xi Ruoyao
2b9518684f
RISC-V: vDSO: Correct inline assembly constraints in the getrandom syscall wrapper
As recently pointed out by Thomas, if a register is forced for two
different register variables, among them one is used as "+" (both input
and output) and another is only used as input, Clang would treat the
conflicting input parameters as undefined behaviour and optimize away
the argument assignment.

Instead use "=r" (only output) for the output parameter and "r" (only
input) for the input parameter.
While the example from the GCC documentation uses "0" for the input
parameter, this is not necessary as confirmed by the GCC developers and "r"
matches what the other architectures' vDSO implementations are using.

[ alex: Update log to match v2 (Thomas) ]

Link: https://lore.kernel.org/all/20250603-loongarch-vdso-syscall-v1-1-6d12d6dfbdd0@linutronix.de/
Link: https://gcc.gnu.org/onlinedocs/gcc-15.1.0/gcc/Local-Register-Variables.html
Link: https://gcc.gnu.org/pipermail/gcc-help/2025-June/144266.html
Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Fixes: ee0d03053e ("RISC-V: vDSO: Wire up getrandom() vDSO")
Link: https://lore.kernel.org/r/20250606092443.73650-2-xry111@xry111.site
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-12 12:13:57 -07:00
Han Gao
4262bd0d9c
riscv: vector: Fix context save/restore with xtheadvector
Previously only v0-v7 were correctly saved/restored,
and the context of v8-v31 are damanged.
Correctly save/restore v8-v31 to avoid breaking userspace.

Fixes: d863910eab ("riscv: vector: Support xtheadvector save/restore")
Cc: stable@vger.kernel.org
Signed-off-by: Han Gao <rabenda.cn@gmail.com>
Tested-by: Xiongchuan Tan <tanxiongchuan@isrc.iscas.ac.cn>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Yanteng Si <si.yanteng@linux.dev>
Reviewed-by: Andy Chiu <andybnac@gmail.com>
Link: https://lore.kernel.org/r/9b9eb2337f3d5336ce813721f8ebea51e0b2b553.1747994822.git.rabenda.cn@gmail.com
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-12 12:13:47 -07:00
Magnus Lindholm
403d1338a4 mm: pgtable: fix pte_swp_exclusive
Make pte_swp_exclusive return bool instead of int.  This will better
reflect how pte_swp_exclusive is actually used in the code.

This fixes swap/swapoff problems on Alpha due pte_swp_exclusive not
returning correct values when _PAGE_SWP_EXCLUSIVE bit resides in upper
32-bits of PTE (like on alpha).

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Magnus Lindholm <linmag7@gmail.com>
Cc: Sam James <sam@gentoo.org>
Link: https://lore.kernel.org/lkml/20250218175735.19882-2-linmag7@gmail.com/
Link: https://lore.kernel.org/lkml/20250602041118.GA2675383@ZenIV/
[ Applied as the 'sed' script Al suggested   - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-06-11 14:52:08 -07:00
Charles Mirabile
8d90d9872e
riscv: fix runtime constant support for nommu kernels
the `__runtime_fixup_32` function does not handle the case where `val` is
zero correctly (as might occur when patching a nommu kernel and referring
to a physical address below the 4GiB boundary whose upper 32 bits are all
zero) because nothing in the existing logic prevents the code from taking
the `else` branch of both nop-checks and emitting two `nop` instructions.

This leaves random garbage in the register that is supposed to receive the
upper 32 bits of the pointer instead of zero that when combined with the
value for the lower 32 bits yields an invalid pointer and causes a kernel
panic when that pointer is eventually accessed.

The author clearly considered the fact that if the `lui` is converted into
a `nop` that the second instruction needs to be adjusted to become an `li`
instead of an `addi`, hence introducing the `addi_insn_mask` variable, but
didn't follow that logic through fully to the case where the `else` branch
executes. To fix it just adjust the logic to ensure that the second `else`
branch is not taken if the first instruction will be patched to a `nop`.

Fixes: a44fb57221 ("riscv: Add runtime constant support")

Signed-off-by: Charles Mirabile <cmirabil@redhat.com>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Tested-by: Charlie Jenkins <charlie@rivosinc.com>
Link: https://lore.kernel.org/r/20250530211422.784415-2-cmirabil@redhat.com
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-10 18:19:33 -07:00
Linus Torvalds
119b1e61a7 RISC-V Patches for the 6.16 Merge Window, Part 1
* Support for the FWFT SBI extension, which is part of SBI 3.0 and a
   dependency for many new SBI and ISA extensions.
 * Support for getrandom() in the VDSO.
 * Support for mseal.
 * Optimized routines for raid6 syndrome and recovery calculations.
 * kexec_file() supports loading Image-formatted kernel binaries.
 * Improvements to the instruction patching framework to allow for atomic
   instruction patching, along with rules as to how systems need to
   behave in order to function correctly.
 * Support for a handful of new ISA extensions: Svinval, Zicbop, Zabha,
   some SiFive vendor extensions.
 * Various fixes and cleanups, including: misaligned access handling, perf
   symbol mangling, module loading, PUD THPs, and improved uaccess
   routines.
 -----BEGIN PGP SIGNATURE-----
 
 iQJNBAABCAA3FiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmhDLP8ZHHBhbG1lcmRh
 YmJlbHRAZ29vZ2xlLmNvbQAKCRAuExnzX7sYiZhFD/4+Zikkld812VjFb9dTF+Wj
 n/x9h86zDwAEFgf2BMIpUQhHru6vtdkO2l/Ky6mQblTPMWLafF4eK85yCsf84sQ0
 +RX4sOMLZ0+qvqxKX+aOFe9JXOWB0QIQuPvgBfDDOV4UTm60sglIxwqOpKcsBEHs
 2nplXXjiv0ckaMFLos8xlwu1uy4A/jMfT3Y9FDcABxYCqBoKOZ1frcL9ezJZbHbv
 BoOKLDH8ZypFxIG/eQ511lIXXtrnLas0l4jHWjrfsWu6pmXTgJasKtbGuH3LoLnM
 G/4qvHufR6lpVUOIL5L0V6PpsmYwDi/ciFIFlc8NH2oOZil3qiVaGSEbJIkWGFu9
 8lWTXQWnbinZbfg2oYbWp8GlwI70vKomtDyYNyB9q9Cq9jyiTChMklRNODr4764j
 ZiEnzc/l4KyvaxUg8RLKCT595lKECiUDnMytbIbunJu05HBqRCoGpBtMVzlQsyUd
 ybkRt3BA7eOR8/xFA7ZZQeJofmiu2yxkBs5ggMo8UnSragw27hmv/OA0mWMXEuaD
 aaWc4ZKpKqf7qLchLHOvEl5ORUhsisyIJgZwOqdme5rQoWorVtr51faA4AKwFAN4
 vcKgc5qJjK8vnpW+rl3LNJF9LtH+h4TgmUI853vUlukPoH2oqRkeKVGSkxG0iAze
 eQy2VjP1fJz6ciRtJZn9aw==
 =cZGy
 -----END PGP SIGNATURE-----

Merge tag 'riscv-for-linus-6.16-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V updates from Palmer Dabbelt:

 - Support for the FWFT SBI extension, which is part of SBI 3.0 and a
   dependency for many new SBI and ISA extensions

 - Support for getrandom() in the VDSO

 - Support for mseal

 - Optimized routines for raid6 syndrome and recovery calculations

 - kexec_file() supports loading Image-formatted kernel binaries

 - Improvements to the instruction patching framework to allow for
   atomic instruction patching, along with rules as to how systems need
   to behave in order to function correctly

 - Support for a handful of new ISA extensions: Svinval, Zicbop, Zabha,
   some SiFive vendor extensions

 - Various fixes and cleanups, including: misaligned access handling,
   perf symbol mangling, module loading, PUD THPs, and improved uaccess
   routines

* tag 'riscv-for-linus-6.16-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (69 commits)
  riscv: uaccess: Only restore the CSR_STATUS SUM bit
  RISC-V: vDSO: Wire up getrandom() vDSO implementation
  riscv: enable mseal sysmap for RV64
  raid6: Add RISC-V SIMD syndrome and recovery calculations
  riscv: mm: Add support for Svinval extension
  RISC-V: Documentation: Add enough title underlines to CMODX
  riscv: Improve Kconfig help for RISCV_ISA_V_PREEMPTIVE
  MAINTAINERS: Update Atish's email address
  riscv: uaccess: do not do misaligned accesses in get/put_user()
  riscv: process: use unsigned int instead of unsigned long for put_user()
  riscv: make unsafe user copy routines use existing assembly routines
  riscv: hwprobe: export Zabha extension
  riscv: Make regs_irqs_disabled() more clear
  perf symbols: Ignore mapping symbols on riscv
  RISC-V: Kconfig: Fix help text of CMDLINE_EXTEND
  riscv: module: Optimize PLT/GOT entry counting
  riscv: Add support for PUD THP
  riscv: xchg: Prefetch the destination word for sc.w
  riscv: Add ARCH_HAS_PREFETCH[W] support with Zicbop
  riscv: Add support for Zicbop
  ...
2025-06-06 18:05:18 -07:00
Palmer Dabbelt
51f1b16367
Merge patch series "riscv: add SBI FWFT misaligned exception delegation support"
Clément Léger <cleger@rivosinc.com> says:

The SBI Firmware Feature extension allows the S-mode to request some
specific features (either hardware or software) to be enabled. This
series uses this extension to request misaligned access exception
delegation to S-mode in order to let the kernel handle it. It also adds
support for the KVM FWFT SBI extension based on the misaligned access
handling infrastructure.

FWFT SBI extension is part of the SBI V3.0 specifications [1]. It can be
tested using the qemu provided at [2] which contains the series from
[3]. Upstream kvm-unit-tests can be used inside kvm to tests the correct
delegation of misaligned exceptions. Upstream OpenSBI can be used.

The tests can be run using the kselftest from series [4].

$ qemu-system-riscv64 \
        -cpu rv64,trap-misaligned-access=true,v=true \
        -M virt \
        -m 1024M \
        -bios fw_dynamic.bin \
        -kernel Image
 ...

 # ./misaligned
 TAP version 13
 1..23
 # Starting 23 tests from 1 test cases.
 #  RUN           global.gp_load_lh ...
 #            OK  global.gp_load_lh
 ok 1 global.gp_load_lh
 #  RUN           global.gp_load_lhu ...
 #            OK  global.gp_load_lhu
 ok 2 global.gp_load_lhu
 #  RUN           global.gp_load_lw ...
 #            OK  global.gp_load_lw
 ok 3 global.gp_load_lw
 #  RUN           global.gp_load_lwu ...
 #            OK  global.gp_load_lwu
 ok 4 global.gp_load_lwu
 #  RUN           global.gp_load_ld ...
 #            OK  global.gp_load_ld
 ok 5 global.gp_load_ld
 #  RUN           global.gp_load_c_lw ...
 #            OK  global.gp_load_c_lw
 ok 6 global.gp_load_c_lw
 #  RUN           global.gp_load_c_ld ...
 #            OK  global.gp_load_c_ld
 ok 7 global.gp_load_c_ld
 #  RUN           global.gp_load_c_ldsp ...
 #            OK  global.gp_load_c_ldsp
 ok 8 global.gp_load_c_ldsp
 #  RUN           global.gp_load_sh ...
 #            OK  global.gp_load_sh
 ok 9 global.gp_load_sh
 #  RUN           global.gp_load_sw ...
 #            OK  global.gp_load_sw
 ok 10 global.gp_load_sw
 #  RUN           global.gp_load_sd ...
 #            OK  global.gp_load_sd
 ok 11 global.gp_load_sd
 #  RUN           global.gp_load_c_sw ...
 #            OK  global.gp_load_c_sw
 ok 12 global.gp_load_c_sw
 #  RUN           global.gp_load_c_sd ...
 #            OK  global.gp_load_c_sd
 ok 13 global.gp_load_c_sd
 #  RUN           global.gp_load_c_sdsp ...
 #            OK  global.gp_load_c_sdsp
 ok 14 global.gp_load_c_sdsp
 #  RUN           global.fpu_load_flw ...
 #            OK  global.fpu_load_flw
 ok 15 global.fpu_load_flw
 #  RUN           global.fpu_load_fld ...
 #            OK  global.fpu_load_fld
 ok 16 global.fpu_load_fld
 #  RUN           global.fpu_load_c_fld ...
 #            OK  global.fpu_load_c_fld
 ok 17 global.fpu_load_c_fld
 #  RUN           global.fpu_load_c_fldsp ...
 #            OK  global.fpu_load_c_fldsp
 ok 18 global.fpu_load_c_fldsp
 #  RUN           global.fpu_store_fsw ...
 #            OK  global.fpu_store_fsw
 ok 19 global.fpu_store_fsw
 #  RUN           global.fpu_store_fsd ...
 #            OK  global.fpu_store_fsd
 ok 20 global.fpu_store_fsd
 #  RUN           global.fpu_store_c_fsd ...
 #            OK  global.fpu_store_c_fsd
 ok 21 global.fpu_store_c_fsd
 #  RUN           global.fpu_store_c_fsdsp ...
 #            OK  global.fpu_store_c_fsdsp
 ok 22 global.fpu_store_c_fsdsp
 #  RUN           global.gen_sigbus ...
 [12797.988647] misaligned[618]: unhandled signal 7 code 0x1 at 0x0000000000014dc0 in misaligned[4dc0,10000+76000]
 [12797.988990] CPU: 0 UID: 0 PID: 618 Comm: misaligned Not tainted 6.13.0-rc6-00008-g4ec4468967c9-dirty #51
 [12797.989169] Hardware name: riscv-virtio,qemu (DT)
 [12797.989264] epc : 0000000000014dc0 ra : 0000000000014d00 sp : 00007fffe165d100
 [12797.989407]  gp : 000000000008f6e8 tp : 0000000000095760 t0 : 0000000000000008
 [12797.989544]  t1 : 00000000000965d8 t2 : 000000000008e830 s0 : 00007fffe165d160
 [12797.989692]  s1 : 000000000000001a a0 : 0000000000000000 a1 : 0000000000000002
 [12797.989831]  a2 : 0000000000000000 a3 : 0000000000000000 a4 : ffffffffdeadbeef
 [12797.989964]  a5 : 000000000008ef61 a6 : 626769735f6e0000 a7 : fffffffffffff000
 [12797.990094]  s2 : 0000000000000001 s3 : 00007fffe165d838 s4 : 00007fffe165d848
 [12797.990238]  s5 : 000000000000001a s6 : 0000000000010442 s7 : 0000000000010200
 [12797.990391]  s8 : 000000000000003a s9 : 0000000000094508 s10: 0000000000000000
 [12797.990526]  s11: 0000555567460668 t3 : 00007fffe165d070 t4 : 00000000000965d0
 [12797.990656]  t5 : fefefefefefefeff t6 : 0000000000000073
 [12797.990756] status: 0000000200004020 badaddr: 000000000008ef61 cause: 0000000000000006
 [12797.990911] Code: 8793 8791 3423 fcf4 3783 fc84 c737 dead 0713 eef7 (c398) 0001
 #            OK  global.gen_sigbus
 ok 23 global.gen_sigbus
 # PASSED: 23 / 23 tests passed.
 # Totals: pass:23 fail:0 xfail:0 xpass:0 skip:0 error:0

With kvm-tools:

 # lkvm run -k sbi.flat -m 128
  Info: # lkvm run -k sbi.flat -m 128 -c 1 --name guest-97
  Info: Removed ghost socket file "/root/.lkvm//guest-97.sock".

 ##########################################################################
 #    kvm-unit-tests
 ##########################################################################

 ... [test messages elided]
 PASS: sbi: fwft: FWFT extension probing no error
 PASS: sbi: fwft: get/set reserved feature 0x6 error == SBI_ERR_DENIED
 PASS: sbi: fwft: get/set reserved feature 0x3fffffff error == SBI_ERR_DENIED
 PASS: sbi: fwft: get/set reserved feature 0x80000000 error == SBI_ERR_DENIED
 PASS: sbi: fwft: get/set reserved feature 0xbfffffff error == SBI_ERR_DENIED
 PASS: sbi: fwft: misaligned_deleg: Get misaligned deleg feature no error
 PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature invalid value error
 PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature invalid value error
 PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value no error
 PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value 0
 PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value no error
 PASS: sbi: fwft: misaligned_deleg: Set misaligned deleg feature value 1
 PASS: sbi: fwft: misaligned_deleg: Verify misaligned load exception trap in supervisor
 SUMMARY: 50 tests, 2 unexpected failures, 12 skipped

This series is available at [5].

[Palmer: slighyt commit text modification, as SBI-3.0 is merged now.
Also drop the KVM patches, as they're too late.]

* b4-shazam-merge:
  riscv: misaligned: add a function to check misalign trap delegability
  riscv: misaligned: move emulated access uniformity check in a function
  riscv: misaligned: declare misaligned_access_speed under CONFIG_RISCV_MISALIGNED
  riscv: misaligned: use on_each_cpu() for scalar misaligned access probing
  riscv: misaligned: request misaligned exception from SBI
  riscv: sbi: add SBI FWFT extension calls
  riscv: sbi: add FWFT extension interface
  riscv: sbi: add new SBI error mappings
  riscv: sbi: remove useless parenthesis
  riscv: sbi: add Firmware Feature (FWFT) SBI extensions definitions

Link: https://lore.kernel.org/r/20250523101932.1594077-1-cleger@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05 14:03:19 -07:00
Palmer Dabbelt
a921f0753a
Merge patch series "riscv: misaligned: fix misaligned accesses handling in put/get_user()"
Clément Léger <cleger@rivosinc.com> says:

While debugging a few problems with the misaligned access kselftest,
Alexandre discovered some crash with the current code. Indeed, some
misaligned access was done by the kernel using put_user(). This
was resulting in trap and a kernel crash since. The path was the
following:
user -> kernel -> access to user memory -> misaligned trap -> trap ->
kernel -> misaligned handling -> memcpy -> crash due to failed page fault
while in interrupt disabled section.

Last discussion about kernel misaligned handling and interrupt reenabling
were actually not to reenable interrupt when handling misaligned access
being done by kernel. The best solution being not to do any misaligned
accesses to userspace memory, we considered a few options:

- Remove any call to put/get_user() potentially doing misaligned
  accesses
- Do not do any misaligned accesses in put/get_user() itself

The second solution was the one chosen as there are too many callsites to
put/get_user() that could potentially do misaligned accesses. We tried
two approaches for that, either split access in two aligned accesses
(and do RMW for put_user()) or call copy_from/to_user() which does not
do any misaligned accesses. The later one was the simpler to implement
(although the performances are probably lower than split aligned
accesses but still way better than doing misaligned access emulation)
and allows to support what we wanted.

These commits are based on top of Alex dev/alex/get_user_misaligned_v1
branch.

[Palmer: No idea what that branch is, so I'm basing it on the uaccess
optimizations patch series which is the last thing to touch these.]

* b4-shazam-merge
  riscv: uaccess: do not do misaligned accesses in get/put_user()
  riscv: process: use unsigned int instead of unsigned long for put_user()
  riscv: make unsafe user copy routines use existing assembly routines

Link: https://lore.kernel.org/r/20250602193918.868962-1-cleger@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05 14:03:18 -07:00
Cyril Bur
265d6aba16
riscv: uaccess: Only restore the CSR_STATUS SUM bit
During switch to csrs will OR the value of the register into the
corresponding csr. In this case we're only interested in restoring the
SUM bit not the entire register.

Signed-off-by: Cyril Bur <cyrilbur@tenstorrent.com>
Link: https://lore.kernel.org/r/20250522160954.429333-1-cyrilbur@tenstorrent.com
Co-developed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Fixes: 788aa64c01 ("riscv: save the SR_SUM status over switches")
Link: https://lore.kernel.org/r/20250602121543.1544278-1-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05 14:03:17 -07:00
Palmer Dabbelt
2670a39b1e
Merge tag 'riscv-mw2-6.16-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/alexghiti/linux into for-next
riscv patches for 6.16-rc1, part 2

* Performance improvements
  - Add support for vdso getrandom
  - Implement raid6 calculations using vectors
  - Introduce svinval tlb invalidation

* Cleanup
  - A bunch of deduplication of the macros we use for manipulating instructions

* Misc
  - Introduce a kunit test for kprobes
  - Add support for mseal as riscv fits the requirements (thanks to Lorenzo for making sure of that :))

[Palmer: There was a rebase between part 1 and part 2, so I've had to do
some more git surgery here... at least two rounds of surgery...]

* alex-pr-2: (866 commits)
  RISC-V: vDSO: Wire up getrandom() vDSO implementation
  riscv: enable mseal sysmap for RV64
  raid6: Add RISC-V SIMD syndrome and recovery calculations
  riscv: mm: Add support for Svinval extension
  riscv: Add kprobes KUnit test
  riscv: kprobes: Remove duplication of RV_EXTRACT_ITYPE_IMM
  riscv: kprobes: Remove duplication of RV_EXTRACT_UTYPE_IMM
  riscv: kprobes: Remove duplication of RV_EXTRACT_RD_REG
  riscv: kprobes: Remove duplication of RVC_EXTRACT_BTYPE_IMM
  riscv: kprobes: Remove duplication of RVC_EXTRACT_C2_RS1_REG
  riscv: kproves: Remove duplication of RVC_EXTRACT_JTYPE_IMM
  riscv: kprobes: Remove duplication of RV_EXTRACT_BTYPE_IMM
  riscv: kprobes: Remove duplication of RV_EXTRACT_RS1_REG
  riscv: kprobes: Remove duplication of RV_EXTRACT_JTYPE_IMM
  riscv: kprobes: Move branch_funct3 to insn.h
  riscv: kprobes: Move branch_rs2_idx to insn.h
  Linux 6.15-rc6
  Input: xpad - fix xpad_device sorting
  Input: xpad - add support for several more controllers
  Input: xpad - fix Share button on Xbox One controllers
  ...
2025-06-05 14:03:16 -07:00
Xi Ruoyao
ee0d03053e
RISC-V: vDSO: Wire up getrandom() vDSO implementation
Hook up the generic vDSO implementation to the generic vDSO getrandom
implementation by providing the required __arch_chacha20_blocks_nostack
and getrandom_syscall implementations. Also wire up the selftests.

The benchmark result:

	vdso: 25000000 times in 2.466341333 seconds
	libc: 25000000 times in 41.447720005 seconds
	syscall: 25000000 times in 41.043926672 seconds

	vdso: 25000000 x 256 times in 162.286219353 seconds
	libc: 25000000 x 256 times in 2953.855018685 seconds
	syscall: 25000000 x 256 times in 2796.268546000 seconds

[ alex: - Fix dynamic relocation
        - Squash Nathan's fix https://lore.kernel.org/all/20250423-riscv-fix-compat_vdso-lld-v2-1-b7bbbc244501@kernel.org/
	- Add comment from Loongarch ]

Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Link: https://lore.kernel.org/r/20250411024600.16045-1-xry111@xry111.site
Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05 14:03:09 -07:00
Palmer Dabbelt
9d3da78275
Merge tag 'riscv-mw1-6.16-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/alexghiti/linux into for-next
riscv patches for 6.16-rc1

* Implement atomic patching support for ftrace which finally allows to
  get rid of stop_machine().
* Support for kexec_file_load() syscall
* Improve module loading time by changing the algorithm that counts the
  number of plt/got entries in a module.
* Zicbop is now used in the kernel to prefetch instructions

[Palmer: There's been two rounds of surgery on this one, so as a result
it's a bit different than the PR.]

* alex-pr: (734 commits)
  riscv: Improve Kconfig help for RISCV_ISA_V_PREEMPTIVE
  MAINTAINERS: Update Atish's email address
  riscv: hwprobe: export Zabha extension
  riscv: Make regs_irqs_disabled() more clear
  perf symbols: Ignore mapping symbols on riscv
  RISC-V: Kconfig: Fix help text of CMDLINE_EXTEND
  riscv: module: Optimize PLT/GOT entry counting
  riscv: Add support for PUD THP
  riscv: xchg: Prefetch the destination word for sc.w
  riscv: Add ARCH_HAS_PREFETCH[W] support with Zicbop
  riscv: Add support for Zicbop
  riscv: Introduce Zicbop instructions
  riscv/kexec_file: Fix comment in purgatory relocator
  riscv: kexec_file: Support loading Image binary file
  riscv: kexec_file: Split the loading of kernel and others
  riscv: Documentation: add a description about dynamic ftrace
  riscv: ftrace: support direct call using call_ops
  riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
  riscv: ftrace: support PREEMPT
  riscv: add a data fence for CMODX in the kernel mode
  ...

Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05 12:26:06 -07:00
Alexandre Ghiti
847689d2a0
Merge patch series "riscv: Add Zicbop & prefetchw support"
Alexandre Ghiti <alexghiti@rivosinc.com> says:

I found this lost series developed by Guo so here is a respin with the
comments on v2 applied.

This patch series adds Zicbop support and then enables the Linux
prefetch features.

* patches from https://lore.kernel.org/r/20250421142441.395849-1-alexghiti@rivosinc.com:
  riscv: xchg: Prefetch the destination word for sc.w
  riscv: Add ARCH_HAS_PREFETCH[W] support with Zicbop
  riscv: Add support for Zicbop
  riscv: Introduce Zicbop instructions

Link: https://lore.kernel.org/r/20250421142441.395849-1-alexghiti@rivosinc.com
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05 12:21:59 -07:00
Clément Léger
ca1a66cdd6
riscv: uaccess: do not do misaligned accesses in get/put_user()
Doing misaligned access to userspace memory would make a trap on
platform where it is emulated. Latest fixes removed the kernel
capability to do unaligned accesses to userspace memory safely since
interrupts are kept disabled at all time during that. Thus doing so
would crash the kernel.

Such behavior was detected with GET_UNALIGN_CTL() that was doing
a put_user() with an unsigned long* address that should have been an
unsigned int*. Reenabling kernel misaligned access emulation is a bit
risky and it would also degrade performances. Rather than doing that,
we will try to avoid any misaligned accessed by using copy_from/to_user()
which does not do any misaligned accesses. This can be done only for
!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS and thus allows to only generate
a bit more code for this config.

Signed-off-by: Clément Léger <cleger@rivosinc.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20250602193918.868962-4-cleger@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05 11:39:17 -07:00
Alexandre Ghiti
a434854633
riscv: make unsafe user copy routines use existing assembly routines
The current implementation is underperforming and in addition, it
triggers misaligned access traps on platforms which do not handle
misaligned accesses in hardware.

Use the existing assembly routines to solve both problems at once.

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20250602193918.868962-2-cleger@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05 11:39:15 -07:00
Tiezhu Yang
d7e0cce103
riscv: Make regs_irqs_disabled() more clear
The return value of regs_irqs_disabled() is true or false, so change
its type to reflect that and also make it always inline.

Suggested-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20250422113156.25742-1-yangtiezhu@loongson.cn
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05 11:10:17 -07:00
Palmer Dabbelt
9eb9ea31ff
Merge patch series "riscv: kexec_file: Support loading Image binary file"
Björn Töpel <bjorn@kernel.org> says:

From: Björn Töpel <bjorn@rivosinc.com>

Hi!

For over a year ago, Daniel and I was testing the V2 of Song's series.
I also promised to take the V2, that had been sitting on the lists for
too long, to rebase it on a new kernel, and re-test it.

One year later, here's the V3! ;-)

There are no changes from V2 other, than some simple checkpatch
cleanups.

Song's original cover:
  | This series makes the kexec_file_load() syscall support to load
  | Image binary file. At the same time, corresponding support for
  | kexec-tools had been pushed to my repo[2].
  |
  | Now, we can leverage that kexec-tools and this series to use the
  | kexec_load() or kexec_file_load() syscall to boot both vmlinux and
  | Image file, as seen in these combo tests:
  |
  | ```
  | 1. kexec -l vmlinux
  | 2. kexec -l Image
  | 3. kexec -s -l vmlinux
  | 4. kexec -s -l Image
  | ```

Notably, kexec-tools has still not made it upstream. I've prepared a
branch on my GH [3], that I indend to post ASAP. That branch is a
collection of fixes/features, including Song's userland Image loading.

The V2 is here [2], and V1 [1].

I've tested the kexec-file/Image on qemu-rv64, with following
combinations:
 * ACPI/UEFI
 * DT/UEFI
 * DT

both "regular" kexec (-s + -e), and crashkernels (-p).

Note that there are two purgatory patches that has to be present (part
of -rc1, so all good):
  commit 28093cfef5 ("riscv/kexec_file: Handle R_RISCV_64 in purgatory relocator")
  commit 3f7023171d ("riscv/purgatory: 4B align purgatory_start")

[1] https://lore.kernel.org/linux-riscv/20230914020044.1397356-1-songshuaishuai@tinylab.org/
[2] https://lore.kernel.org/linux-riscv/20231016092006.3347632-1-songshuaishuai@tinylab.org/
[3] https://github.com/bjoto/kexec-tools/tree/rv-on-master

* patches from https://lore.kernel.org/r/20250409193004.643839-1-bjorn@kernel.org:
  riscv: kexec_file: Support loading Image binary file
  riscv: kexec_file: Split the loading of kernel and others

Link: https://lore.kernel.org/r/20250409193004.643839-1-bjorn@kernel.org
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05 11:10:11 -07:00
Alexandre Ghiti
881dadf079
Merge patch series "riscv: ftrace: atmoic patching and preempt improvements"
Andy Chiu <andybnac@gmail.com> says:

This series makes atomic code patching in ftrace possible and eliminates
the need of the stop_machine dance. The major difference of this version
is that we merge the CALL_OPS support from Puranjay [1] and make direct
calls available for practical uses such as BPF. Thanks for the time
reviewing the series and suggestions, we hope this version gets a step
closer to happening in the upstream.

Please reference the link to v3 below for more introductory view of the
implementation [2]

Added patch: 2, 4, 10, 11, 12
Modified patch: 5, 6
Unchanged patch: 1, 3, 7, 8, 9
(1, 8 has commit msg modified)

Special thanks to Björn for his efforts on testing and guiding the
series!

[1]: https://lore.kernel.org/lkml/20240306165904.108141-1-puranjay12@gmail.com/
[2]: https://lore.kernel.org/linux-riscv/20241127172908.17149-1-andybnac@gmail.com/

* patches from https://lore.kernel.org/r/20250407180838.42877-1-andybnac@gmail.com:
  riscv: Documentation: add a description about dynamic ftrace
  riscv: ftrace: support direct call using call_ops
  riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
  riscv: ftrace: support PREEMPT
  riscv: add a data fence for CMODX in the kernel mode
  riscv: vector: Support calling schedule() for preemptible Vector
  riscv: ftrace: do not use stop_machine to update code
  riscv: ftrace: prepare ftrace for atomic code patching
  kernel: ftrace: export ftrace_sync_ipi
  riscv: ftrace: align patchable functions to 4 Byte boundary
  riscv: ftrace factor out code defined by !WITH_ARG
  riscv: ftrace: support fastcc in Clang for WITH_ARGS

Link: https://lore.kernel.org/r/20250407180838.42877-1-andybnac@gmail.com
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
2025-06-05 11:09:41 -07:00
Alexandre Ghiti
c3cc2a4a3a
riscv: Add support for PUD THP
Add the necessary page table functions to deal with PUD THP, this
enables the use of PUD pfnmap.

Link: https://lore.kernel.org/r/20250321123954.225097-1-alexghiti@rivosinc.com
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05 11:09:40 -07:00
Guo Ren
eb87e56d65
riscv: xchg: Prefetch the destination word for sc.w
The cost of changing a cacheline from shared to exclusive state can be
significant, especially when this is triggered by an exclusive store,
since it may result in having to retry the transaction.

This patch makes use of prefetch.w to prefetch cachelines for write
prior to lr/sc loops when using the xchg_small atomic routine.

This patch is inspired by commit 0ea366f5e1 ("arm64: atomics:
prefetch the destination word for write prior to stxr").

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Link: https://lore.kernel.org/r/20231231082955.16516-4-guoren@kernel.org
Tested-by: Andrea Parri <parri.andrea@gmail.com>
Link: https://lore.kernel.org/r/20250421142441.395849-5-alexghiti@rivosinc.com
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05 11:09:39 -07:00
Guo Ren
a5f947c731
riscv: Add ARCH_HAS_PREFETCH[W] support with Zicbop
Enable Linux prefetch and prefetchw primitives using Zicbop.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Link: https://lore.kernel.org/r/20231231082955.16516-3-guoren@kernel.org
Tested-by: Andrea Parri <parri.andrea@gmail.com>
Link: https://lore.kernel.org/r/20250421142441.395849-4-alexghiti@rivosinc.com
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05 11:09:38 -07:00