Commit graph

1359 commits

Author SHA1 Message Date
Linus Torvalds
ff7dcfedf9 Major ext4 changes for 6.17:
- Better scalability for ext4 block allocation
   - Fix insufficient credits when writing back large folios
 
 Miscellaneous bug fixes, especially when handling exteded attriutes,
 inline data, and fast commit.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmiIQEoACgkQ8vlZVpUN
 gaPB9wf/QursT7eLjx9Gz+4PYNWPKptBERQtmmDAnNYxDlEQ28+CHdMdEeiIPPoP
 IW1DIHfR7VaTI2K7gy6D5632VAhDDKiXBpIYu1yh3KPClAxjTZbhrif8J5UBXj1K
 ZwmCeLDF40jijua4rVKq3Fqf4iTJUyU2NqLpvcze7BZg7FwstXiNJrZ3DjAwi1BW
 j/5veWwh/KrNMzT5u0+RpMs4FBrdXQXvwSe/4pSx6d75r6WAdzhgUMy09os1wAWU
 3N0JU+R5hAG6iFfbWQRURB6oLMmmxl4x2F7r5BvM27uQtELNLNcxBKZhMW97HpiE
 uSwKgo/59DKpWX0xQ2x/yugQIzd62w==
 =oPHD
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Major ext4 changes for 6.17:

   - Better scalability for ext4 block allocation

   - Fix insufficient credits when writing back large folios

  Miscellaneous bug fixes, especially when handling exteded attriutes,
  inline data, and fast commit"

* tag 'ext4_for_linus_6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (39 commits)
  ext4: do not BUG when INLINE_DATA_FL lacks system.data xattr
  ext4: implement linear-like traversal across order xarrays
  ext4: refactor choose group to scan group
  ext4: convert free groups order lists to xarrays
  ext4: factor out ext4_mb_scan_group()
  ext4: factor out ext4_mb_might_prefetch()
  ext4: factor out __ext4_mb_scan_group()
  ext4: fix largest free orders lists corruption on mb_optimize_scan switch
  ext4: fix zombie groups in average fragment size lists
  ext4: merge freed extent with existing extents before insertion
  ext4: convert sbi->s_mb_free_pending to atomic_t
  ext4: fix typo in CR_GOAL_LEN_SLOW comment
  ext4: get rid of some obsolete EXT4_MB_HINT flags
  ext4: utilize multiple global goals to reduce contention
  ext4: remove unnecessary s_md_lock on update s_mb_last_group
  ext4: remove unnecessary s_mb_last_start
  ext4: separate stream goal hits from s_bal_goals for better tracking
  ext4: add ext4_try_lock_group() to skip busy groups
  ext4: initialize superblock fields in the kballoc-test.c kunit tests
  ext4: refactor the inline directory conversion and new directory codepaths
  ...
2025-07-31 10:02:44 -07:00
Jan Kara
91b8ca8b26 ext4: Make sure BH_New bit is cleared in ->write_end handler
Currently we clear BH_New bit in case of error and also in the standard
ext4_write_end() handler (in block_commit_write()). However
ext4_journalled_write_end() misses this clearing and thus we are leaving
stale BH_New bits behind. Generally ext4_block_write_begin() clears
these bits before any harm can be done but in case blocksize < pagesize
and we hit some error when processing a page with these stale bits,
we'll try to zero buffers with these stale BH_New bits and jbd2 will
complain (as buffers were not prepared for writing in this transaction).
Fix the problem by clearing BH_New bits in ext4_journalled_write_end()
and WARN if ext4_block_write_begin() sees stale BH_New bits.

Reported-by: Baolin Liu <liubaolin12138@163.com>
Reported-by: Zhi Long <longzhi@sangfor.com.cn>
Fixes: 3910b513fc ("ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250709084831.23876-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-17 23:25:21 -04:00
Taotao Chen
ae21c0c0ac
ext4: support uncached buffered I/O
Set FOP_DONTCACHE in ext4_file_operations to declare support for
uncached buffered I/O.

To handle this flag, update ext4_write_begin() and ext4_da_write_begin()
to use write_begin_get_folio(), which encapsulates FGP_DONTCACHE logic
based on iocb->ki_flags.

Part of a series refactoring address_space_operations write_begin and
write_end callbacks to use struct kiocb for passing write context and
flags.

Signed-off-by: Taotao Chen <chentaotao@didiglobal.com>
Link: https://lore.kernel.org/20250716093559.217344-6-chentaotao@didiglobal.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-16 14:48:18 +02:00
Taotao Chen
e9d8e2bf23
fs: change write_begin/write_end interface to take struct kiocb *
Change the address_space_operations callbacks write_begin() and
write_end() to take struct kiocb * as the first argument instead of
struct file *.

Update all affected function prototypes, implementations, call sites,
and related documentation across VFS, filesystems, and block layer.

Part of a series refactoring address_space_operations write_begin and
write_end callbacks to use struct kiocb for passing write context and
flags.

Signed-off-by: Taotao Chen <chentaotao@didiglobal.com>
Link: https://lore.kernel.org/20250716093559.217344-4-chentaotao@didiglobal.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-16 14:48:18 +02:00
Zhang Yi
b12f423d59 ext4: limit the maximum folio order
In environments with a page size of 64KB, the maximum size of a folio
can reach up to 128MB. Consequently, during the write-back of folios,
the 'rsv_blocks' will be overestimated to 1,577, which can make
pressure on the journal space where the journal is small. This can
easily exceed the limit of a single transaction. Besides, an excessively
large folio is meaningless and will instead increase the overhead of
traversing the bhs within the folio. Therefore, limit the maximum order
of a folio to 2048 filesystem blocks.

Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Joseph Qi <jiangqi903@gmail.com>
Closes: https://lore.kernel.org/linux-ext4/CA+G9fYsyYQ3ZL4xaSg1-Tt5Evto7Zd+hgNWZEa9cQLbahA1+xg@mail.gmail.com/
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-12-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-14 23:48:15 -04:00
Zhang Yi
5137d6c890 ext4: fix insufficient credits calculation in ext4_meta_trans_blocks()
The calculation of journal credits in ext4_meta_trans_blocks() should
include pextents, as each extent separately may be allocated from a
different group and thus need to update different bitmap and group
descriptor block.

Fixes: 0e32d86170 ("ext4: correct the journal credits calculations of allocating blocks")
Reported-by: Jan Kara <jack@suse.cz>
Closes: https://lore.kernel.org/linux-ext4/nhxfuu53wyacsrq7xqgxvgzcggyscu2tbabginahcygvmc45hy@t4fvmyeky33e/
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250707140814.542883-11-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:52 -04:00
Zhang Yi
57661f2875 ext4: replace ext4_writepage_trans_blocks()
After ext4 supports large folios, the semantics of reserving credits in
pages is no longer applicable. In most scenarios, reserving credits in
extents is sufficient. Therefore, introduce ext4_chunk_trans_extent()
to replace ext4_writepage_trans_blocks(). move_extent_per_page() is the
only remaining location where we are still processing extents in pages.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-10-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:52 -04:00
Zhang Yi
bbbf150f3f ext4: reserved credits for one extent during the folio writeback
After ext4 supports large folios, reserving journal credits for one
maximum-ordered folio based on the worst case cenario during the
writeback process can easily exceed the maximum transaction credits.
Additionally, reserving journal credits for one page is also no
longer appropriate.

Currently, the folio writeback process can either extend the journal
credits or initiate a new transaction if the currently reserved journal
credits are insufficient. Therefore, it can be modified to reserve
credits for only one extent at the outset. In most cases involving
continuous mapping, these credits are generally adequate, and we may
only need to perform some basic credit expansion. However, in extreme
cases where the block size and folio size differ significantly, or when
the folios are sufficiently discontinuous, it may be necessary to
restart a new transaction and resubmit the folios.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-9-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:52 -04:00
Zhang Yi
95ad8ee45c ext4: correct the reserved credits for extent conversion
Now, we reserve journal credits for converting extents in only one page
to written state when the I/O operation is complete. This is
insufficient when large folio is enabled.

Fix this by reserving credits for converting up to one extent per block in
the largest 2MB folio, this calculation should only involve extents index
and leaf blocks, so it should not estimate too many credits.

Fixes: 7ac67301e8 ("ext4: enable large folio for regular file")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250707140814.542883-8-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:52 -04:00
Zhang Yi
6b132759b0 ext4: enhance tracepoints during the folios writeback
After mpage_map_and_submit_extent() supports restarting handle if
credits are insufficient during allocating blocks, it is more likely to
exit the current mapping iteration and continue to process the current
processing partially mapped folio again. The existing tracepoints are
not sufficient to track this situation, so enhance the tracepoints to
track the writeback position and the return value before and after
submitting the folios.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-7-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:51 -04:00
Zhang Yi
e2c4c49dee ext4: restart handle if credits are insufficient during allocating blocks
After large folios are supported on ext4, writing back a sufficiently
large and discontinuous folio may consume a significant number of
journal credits, placing considerable strain on the journal. For
example, in a 20GB filesystem with 1K block size and 1MB journal size,
writing back a 2MB folio could require thousands of credits in the
worst-case scenario (when each block is discontinuous and distributed
across different block groups), potentially exceeding the journal size.
This issue can also occur in ext4_write_begin() and ext4_page_mkwrite()
when delalloc is not enabled.

Fix this by ensuring that there are sufficient journal credits before
allocating an extent in mpage_map_one_extent() and
ext4_block_write_begin(). If there are not enough credits, return
-EAGAIN, exit the current mapping loop, restart a new handle and a new
transaction, and allocating blocks on this folio again in the next
iteration.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-6-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:51 -04:00
Zhang Yi
2bddafea3d ext4: refactor the block allocation process of ext4_page_mkwrite()
The block allocation process and error handling in ext4_page_mkwrite()
is complex now. Refactor it by introducing a new helper function,
ext4_block_page_mkwrite(). It will call ext4_block_write_begin() to
allocate blocks instead of directly calling block_page_mkwrite().
Preparing to implement retry logic in a subsequent patch to address
situations where the reserved journal credits are insufficient.
Additionally, this modification will help prevent potential deadlocks
that may occur when waiting for folio writeback while holding the
transaction handle.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-5-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:51 -04:00
Zhang Yi
ded2d726a3 ext4: fix stale data if it bail out of the extents mapping loop
During the process of writing back folios, if
mpage_map_and_submit_extent() exits the extent mapping loop due to an
ENOSPC or ENOMEM error, it may result in stale data or filesystem
inconsistency in environments where the block size is smaller than the
folio size.

When mapping a discontinuous folio in mpage_map_and_submit_extent(),
some buffers may have already be mapped. If we exit the mapping loop
prematurely, the folio data within the mapped range will not be written
back, and the file's disk size will not be updated. Once the transaction
that includes this range of extents is committed, this can lead to stale
data or filesystem inconsistency.

Fix this by submitting the current processing partially mapped folio.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:51 -04:00
Zhang Yi
f922c8c246 ext4: move the calculation of wbc->nr_to_write to mpage_folio_done()
mpage_folio_done() should be a more appropriate place than
mpage_submit_folio() for updating the wbc->nr_to_write after we have
submitted a fully mapped folio. Preparing to make mpage_submit_folio()
allows to submit partially mapped folio that is still under processing.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250707140814.542883-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:51 -04:00
Zhang Yi
1bfe6354e0 ext4: process folios writeback in bytes
Since ext4 supports large folios, processing writebacks in pages is no
longer appropriate, it can be modified to process writebacks in bytes.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-13 23:41:51 -04:00
Dan Carpenter
c5da1f6694 ext4: remove unnecessary duplicate check in ext4_map_blocks()
The previous lines ensure that EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF is
set so remove this duplicate check.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/aDCdjUhpzxB64vkD@stanley.mountain
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-11 16:52:59 -04:00
Jinliang Zheng
b6f3801727 ext4: remove duplicate check for EXT4_FC_REPLAY
EXT4_FC_REPLAY will be checked in ext4_es_lookup_extent(). If it is
set, ext4_es_lookup_extent() will return 0.

Remove the repeated check for EXT4_FC_REPLAY in ext4_map_blocks()
to simplify the code.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250429111722.294975-1-alexjlzheng@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-07-10 21:33:03 -04:00
Matthew Wilcox (Oracle)
b39f7d75dc
fs: Remove three arguments from block_write_end()
block_write_end() looks like it can be used as a ->write_end()
implementation.  However, it can't as it does not unlock nor put
the folio.  Since it does not use the 'file', 'mapping' nor 'fsdata'
arguments, remove them.

Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/20250624132130.1590285-1-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-24 15:53:40 +02:00
Linus Torvalds
d87d73895f New ext4 features and performance improvements:
* Fast commit performance improvements
    * Multi-fsblock atomic write support for bigalloc file systems
    * Large folio support for regular files
 
 This last can result in really stupendous performance for the right
 workloads.  For example, see [1] where the Kernel Test Robot reported
 over 37% improvement on a large sequential I/O workload.
 
 [1] https://lore.kernel.org/all/202505161418.ec0d753f-lkp@intel.com/
 
 There are also the usual bug fixes and cleanups.  Of note are cleanups
 of the extent status tree to fix potential races that could result in
 the extent status tree getting corrupted under heavy siulatneous
 allocation and deallocation to a single file.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmg2GJgACgkQ8vlZVpUN
 gaOmmgf/fEh2OPDG6aAJpQ6hjy2WIbrxqTyuWC/+AFnyI5/Jy0Iskis3lHBiFdKP
 IFjgC1h9CB5ARVvOLd7NOgflPHSSHsnYqoTCS6J4tdWvFN4VRiHe2J3fdTZd/bea
 dzdWniHS3SAJiQm4wvbkhluFgecItBHYzDltapkHI0OGepxZt3thWVvbay6veO9R
 ChXQ7T7/9eUZa5N5IVUeJmWobgh0RD+DgtwCih59UDfnezGqiDr6/shpyNC6EvWV
 oZdvJw2+2DCPn5+DF4Ut77mLpKnxorQ4osNPOovZf59JnSyEcCmbBDuvyNfRldfC
 yQYoCFkOv0Fz8tgJbtoAN71+YXl66w==
 =fxDh
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "New ext4 features and performance improvements:

   - Fast commit performance improvements

   - Multi-fsblock atomic write support for bigalloc file systems

   - Large folio support for regular files

  This last can result in really stupendous performance for the right
  workloads. For example, see [1] where the Kernel Test Robot reported
  over 37% improvement on a large sequential I/O workload.

  There are also the usual bug fixes and cleanups. Of note are cleanups
  of the extent status tree to fix potential races that could result in
  the extent status tree getting corrupted under heavy simultaneous
  allocation and deallocation to a single file"

Link: https://lore.kernel.org/all/202505161418.ec0d753f-lkp@intel.com/ [1]

* tag 'ext4_for_linus-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (52 commits)
  ext4: Add a WARN_ON_ONCE for querying LAST_IN_LEAF instead
  ext4: Simplify flags in ext4_map_query_blocks()
  ext4: Rename and document EXT4_EX_FILTER to EXT4_EX_QUERY_FILTER
  ext4: Simplify last in leaf check in ext4_map_query_blocks
  ext4: Unwritten to written conversion requires EXT4_EX_NOCACHE
  ext4: only dirty folios when data journaling regular files
  ext4: Add atomic block write documentation
  ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc
  ext4: Add multi-fsblock atomic write support with bigalloc
  ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS
  ext4: Make ext4_meta_trans_blocks() non-static for later use
  ext4: Check if inode uses extents in ext4_inode_can_atomic_write()
  ext4: Document an edge case for overwrites
  jbd2: remove journal_t argument from jbd2_superblock_csum()
  jbd2: remove journal_t argument from jbd2_chksum()
  ext4: remove sb argument from ext4_superblock_csum()
  ext4: remove sbi argument from ext4_chksum()
  ext4: enable large folio for regular file
  ext4: make online defragmentation support large folios
  ext4: make the writeback path support large folios
  ...
2025-05-28 12:12:08 -07:00
Ritesh Harjani (IBM)
618320daa9 ext4: Simplify flags in ext4_map_query_blocks()
Now that we have EXT4_EX_QUERY_FILTER mask, let's use that to simplify
the filtering of flags for passing to ext4_ext_map_blocks() in
ext4_map_query_blocks() function. This allows us to kill the query_flags
local variable which is not needed anymore.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/4ae735e83e6f43341e53e2d289e59156a8360134.1747677758.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 14:21:00 -04:00
Ritesh Harjani (IBM)
9597376bdb ext4: Rename and document EXT4_EX_FILTER to EXT4_EX_QUERY_FILTER
Rename EXT4_EX_FILTER to EXT4_EX_QUERY_FILTER to better describe its
purpose as a filter mask used specifically in ext4_map_query_blocks().
Add a comment explaining that this macro is used to filter flags needed
when querying the on-disk extent tree.

We will later use EXT4_EX_QUERY_FILTER mask to add another
EXT4_GET_BLOCKS_QUERY needed to lookup in on-disk extent tree.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/51f05d0ba286372eb8693af95bd4b10194b53141.1747677758.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 14:21:00 -04:00
Ritesh Harjani (IBM)
797ac3ca24 ext4: Simplify last in leaf check in ext4_map_query_blocks
This simplifies the check for last in leaf in ext4_map_query_blocks()
and fixes this cocci warning.

cocci warnings: (new ones prefixed by >>)
>> fs/ext4/inode.c:573:49-51: WARNING !A || A && B is equivalent to !A || B

Fixes: 5bb12b1837 ("ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202505191524.auftmOwK-lkp@intel.com/
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/5fd5c806218c83f603c578c95997cf7f6da29d74.1747677758.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 14:21:00 -04:00
Brian Foster
e26268ff1d ext4: only dirty folios when data journaling regular files
fstest generic/388 occasionally reproduces a crash that looks as
follows:

BUG: kernel NULL pointer dereference, address: 0000000000000000
...
Call Trace:
 <TASK>
 ext4_block_zero_page_range+0x30c/0x380 [ext4]
 ext4_truncate+0x436/0x440 [ext4]
 ext4_process_orphan+0x5d/0x110 [ext4]
 ext4_orphan_cleanup+0x124/0x4f0 [ext4]
 ext4_fill_super+0x262d/0x3110 [ext4]
 get_tree_bdev_flags+0x132/0x1d0
 vfs_get_tree+0x26/0xd0
 vfs_cmd_create+0x59/0xe0
 __do_sys_fsconfig+0x4ed/0x6b0
 do_syscall_64+0x82/0x170
 ...

This occurs when processing a symlink inode from the orphan list. The
partial block zeroing code in the truncate path calls
ext4_dirty_journalled_data() -> folio_mark_dirty(). The latter calls
mapping->a_ops->dirty_folio(), but symlink inodes are not assigned an
a_ops vector in ext4, hence the crash.

To avoid this problem, update the ext4_dirty_journalled_data() helper to
only mark the folio dirty on regular files (for which a_ops is
assigned). This also matches the journaling logic in the ext4_symlink()
creation path, where ext4_handle_dirty_metadata() is called directly.

Fixes: d84c9ebdac ("ext4: Mark pages with journalled data dirty")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Link: https://patch.msgid.link/20250516173800.175577-1-bfoster@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
2025-05-20 10:31:13 -04:00
Ritesh Harjani (IBM)
b86629c2b2 ext4: Add multi-fsblock atomic write support with bigalloc
EXT4 supports bigalloc feature which allows the FS to work in size of
clusters (group of blocks) rather than individual blocks. This patch
adds atomic write support for bigalloc so that systems with bs = ps can
also create FS using -
    mkfs.ext4 -F -O bigalloc -b 4096 -C 16384 <dev>

With bigalloc ext4 can support multi-fsblock atomic writes. We will have to
adjust ext4's atomic write unit max value to cluster size. This can then support
atomic write of size anywhere between [blocksize, clustersize]. This
patch adds the required changes to enable multi-fsblock atomic write
support using bigalloc in the next patch.

In this patch for block allocation:
we first query the underlying region of the requested range by calling
ext4_map_blocks() call. Here are the various cases which we then handle
depending upon the underlying mapping type:
1. If the underlying region for the entire requested range is a mapped extent,
   then we don't call ext4_map_blocks() to allocate anything. We don't need to
   even start the jbd2 txn in this case.
2. For an append write case, we create a mapped extent.
3. If the underlying region is entirely a hole, then we create an unwritten
   extent for the requested range.
4. If the underlying region is a large unwritten extent, then we split the
   extent into 2 unwritten extent of required size.
5. If the underlying region has any type of mixed mapping, then we call
   ext4_map_blocks() in a loop to zero out the unwritten and the hole regions
   within the requested range. This then provide a single mapped extent type
   mapping for the requested range.

Note: We invoke ext4_map_blocks() in a loop with the EXT4_GET_BLOCKS_ZERO
flag only when the underlying extent mapping of the requested range is
not entirely a hole, an unwritten extent, or a fully mapped extent. That
is, if the underlying region contains a mix of hole(s), unwritten
extent(s), and mapped extent(s), we use this loop to ensure that all the
short mappings are zeroed out. This guarantees that the entire requested
range becomes a single, uniformly mapped extent. It is ok to do so
because we know this is being done on a bigalloc enabled filesystem
where the block bitmap represents the entire cluster unit.

Note having a single contiguous underlying region of type mapped,
unwrittn or hole is not a problem. But the reason to avoid writing on
top of mixed mapping region is because, atomic writes requires all or
nothing should get written for the userspace pwritev2 request. So if at
any point in time during the write if a crash or a sudden poweroff
occurs, the region undergoing atomic write should read either complete
old data or complete new data. But it should never have a mix of both
old and new data.
So, we first convert any mixed mapping region to a single contiguous
mapped extent before any data gets written to it. This is because
normally FS will only convert unwritten extents to written at the end of
the write in ->end_io() call. And if we allow the writes over a mixed
mapping and if a sudden power off happens in between, we will end up
reading mix of new data (over mapped extents) and old data (over
unwritten extents), because unwritten to written conversion never went
through.
So to avoid this and to avoid writes getting torned due to mixed
mapping, we first allocate a single contiguous block mapping and then
do the write.

Acked-by: Darrick J. Wong <djwong@kernel.org>
Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/c4965ac3407cbc773f0bc954d0966d9696f5038a.1747337952.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Ritesh Harjani (IBM)
5bb12b1837 ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS
There can be a case where there are contiguous extents on the adjacent
leaf nodes of on-disk extent trees. So when someone tries to write to
this contiguous range, ext4_map_blocks() call will split by returning
1 extent at a time if this is not already cached in extent_status tree
cache (where if these extents when cached can get merged since they are
contiguous).

This is fine for a normal write however in case of atomic writes, it
can't afford to break the write into two. Now this is also something
that will only happen in the slow write case where we call
ext4_map_blocks() for each of these extents spread across different leaf
nodes. However, there is no guarantee that these extent status cache
cannot be reclaimed before the last call to ext4_map_blocks() in
ext4_map_blocks_atomic_write_slow().

Hence this patch adds support of EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS.
This flag checks if the requested range can be fully found in extent
status cache and return. If not, it looks up in on-disk extent
tree via ext4_map_query_blocks(). If the found extent is the last entry
in the leaf node, then it goes and queries the next lblk to see if there
is an adjacent contiguous extent in the adjacent leaf node of the
on-disk extent tree.

Even though there can be a case where there are multiple adjacent extent
entries spread across multiple leaf nodes. But we only read an adjacent
leaf block i.e. in total of 2 extent entries spread across 2 leaf nodes.
The reason for this is that we are mostly only going to support atomic
writes with upto 64KB or maybe max upto 1MB of atomic write support.

Acked-by: Darrick J. Wong <djwong@kernel.org>
Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/6bb563e661f5fbd80e266a9e6ce6e29178f555f6.1747337952.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Ritesh Harjani (IBM)
255e7bc212 ext4: Make ext4_meta_trans_blocks() non-static for later use
Let's make ext4_meta_trans_blocks() non-static for use in later
functions during ->end_io conversion for atomic writes.
We will need this function to estimate journal credits for a special
case. Instead of adding another wrapper around it, let's make this
non-static.

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/23ce80d4286f792831ce99d13558182ee228fedb.1747337952.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Ritesh Harjani (IBM)
9fa6121684 ext4: Document an edge case for overwrites
ext4_iomap_overwrite_begin() clears the flag for IOMAP_WRITE before
calling ext4_iomap_begin(). Document this above ext4_map_blocks() call
as it is easy to miss it when focusing on write paths alone.

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/fd50ba05440042dff77d555e463a620a79f8d0e9.1747337952.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Eric Biggers
6cbab5f95e ext4: remove sbi argument from ext4_chksum()
Since ext4_chksum() no longer uses its sbi argument, remove it.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250513053809.699974-2-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Zhang Yi
7ac67301e8 ext4: enable large folio for regular file
Besides fsverity, fscrypt, and the data=journal mode, ext4 now supports
large folios for regular files. Enable this feature by default. However,
since we cannot change the folio order limitation of mappings on active
inodes, setting the journal=data mode via ioctl on an active inode will
not take immediate effect in non-delalloc mode.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-9-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Zhang Yi
cd9f76de6a ext4: make the writeback path support large folios
In mpage_map_and_submit_buffers(), the 'lblk' is now aligned to
PAGE_SIZE. Convert it to be aligned to folio size. Additionally, modify
the wbc->nr_to_write update to reduce the number of pages in a single
folio, ensuring that the entire writeback path can support large folios.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-7-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Zhang Yi
0e32d86170 ext4: correct the journal credits calculations of allocating blocks
The journal credits calculation in ext4_ext_index_trans_blocks() is
currently inadequate. It only multiplies the depth of the extents tree
and doesn't account for the blocks that may be required for adding the
leaf extents themselves.

After enabling large folios, we can easily run out of handle credits,
triggering a warning in jbd2_journal_dirty_metadata() on filesystems
with a 1KB block size. This occurs because we may need more extents when
iterating through each large folio in
ext4_do_writepages()->mpage_map_and_submit_extent(). Therefore, we
should modify ext4_ext_index_trans_blocks() to include a count of the
leaf extents in the worst case as well.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-6-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Zhang Yi
d6bf294773 ext4/jbd2: convert jbd2_journal_blocks_per_page() to support large folio
jbd2_journal_blocks_per_page() returns the number of blocks in a single
page. Rename it to jbd2_journal_blocks_per_folio() and make it returns
the number of blocks in the largest folio, preparing for the calculation
of journal credits blocks when allocating blocks within a large folio in
the writeback path.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-5-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Zhang Yi
2e9466fc5d ext4: make __ext4_block_zero_page_range() support large folio
The partial block zero range helper __ext4_block_zero_page_range()
currently only supports folios of PAGE_SIZE in size. The calculations
for the start block and the offset within a folio for the given range
are incorrect. Modify the implementation to use offset_in_folio()
instead of directly masking PAGE_SIZE - 1, which will be able to support
for large folios.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Zhang Yi
16705e52e6 ext4: make regular file's buffered write path support large folios
The current buffered write path in ext4 can only allocate and handle
folios of PAGE_SIZE size. To support larger folios, modify
ext4_da_write_begin() and ext4_write_begin() to allocate higher-order
folios, and trim the write length if it exceeds the folio size.
Additionally, in ext4_da_do_write_end(), use offset_in_folio() instead
of PAGE_SIZE.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250512063319.3539411-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-20 10:31:12 -04:00
Zhang Yi
1a77a028a3 ext4: ensure i_size is smaller than maxbytes
The inode i_size cannot be larger than maxbytes, check it while loading
inode from the disk.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250506012009.3896990-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2025-05-20 10:31:06 -04:00
Zhang Yi
29ec9bed23 ext4: fix incorrect punch max_end
For the extents based inodes, the maxbytes should be sb->s_maxbytes
instead of sbi->s_bitmap_maxbytes. Additionally, for the calculation of
max_end, the -sb->s_blocksize operation is necessary only for
indirect-block based inodes. Correct the maxbytes and max_end value to
correct the behavior of punch hole.

Fixes: 2da376228a ("ext4: limit length to bitmap_maxbytes - blocksize in punch_hole")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250506012009.3896990-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2025-05-20 10:30:52 -04:00
Zhang Yi
b5e58bcd79 ext4: fix out of bounds punch offset
Punching a hole with a start offset that exceeds max_end is not
permitted and will result in a negative length in the
truncate_inode_partial_folio() function while truncating the page cache,
potentially leading to undesirable consequences.

A simple reproducer:

  truncate -s 9895604649994 /mnt/foo
  xfs_io -c "pwrite 8796093022208 4096" /mnt/foo
  xfs_io -c "fpunch 8796093022213 25769803777" /mnt/foo

  kernel BUG at include/linux/highmem.h:275!
  Oops: invalid opcode: 0000 [#1] SMP PTI
  CPU: 3 UID: 0 PID: 710 Comm: xfs_io Not tainted 6.15.0-rc3
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
  RIP: 0010:zero_user_segments.constprop.0+0xd7/0x110
  RSP: 0018:ffffc90001cf3b38 EFLAGS: 00010287
  RAX: 0000000000000005 RBX: ffffea0001485e40 RCX: 0000000000001000
  RDX: 000000000040b000 RSI: 0000000000000005 RDI: 000000000040b000
  RBP: 000000000040affb R08: ffff888000000000 R09: ffffea0000000000
  R10: 0000000000000003 R11: 00000000fffc7fc5 R12: 0000000000000005
  R13: 000000000040affb R14: ffffea0001485e40 R15: ffff888031cd3000
  FS:  00007f4f63d0b780(0000) GS:ffff8880d337d000(0000)
  knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000000001ae0b038 CR3: 00000000536aa000 CR4: 00000000000006f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   truncate_inode_partial_folio+0x3dd/0x620
   truncate_inode_pages_range+0x226/0x720
   ? bdev_getblk+0x52/0x3e0
   ? ext4_get_group_desc+0x78/0x150
   ? crc32c_arch+0xfd/0x180
   ? __ext4_get_inode_loc+0x18c/0x840
   ? ext4_inode_csum+0x117/0x160
   ? jbd2_journal_dirty_metadata+0x61/0x390
   ? __ext4_handle_dirty_metadata+0xa0/0x2b0
   ? kmem_cache_free+0x90/0x5a0
   ? jbd2_journal_stop+0x1d5/0x550
   ? __ext4_journal_stop+0x49/0x100
   truncate_pagecache_range+0x50/0x80
   ext4_truncate_page_cache_block_range+0x57/0x3a0
   ext4_punch_hole+0x1fe/0x670
   ext4_fallocate+0x792/0x17d0
   ? __count_memcg_events+0x175/0x2a0
   vfs_fallocate+0x121/0x560
   ksys_fallocate+0x51/0xc0
   __x64_sys_fallocate+0x24/0x40
   x64_sys_call+0x18d2/0x4170
   do_syscall_64+0xa7/0x220
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fix this by filtering out cases where the punching start offset exceeds
max_end.

Fixes: 982bf37da0 ("ext4: refactor ext4_punch_hole()")
Reported-by: Liebes Wang <wanghaichi0403@gmail.com>
Closes: https://lore.kernel.org/linux-ext4/ac3a58f6-e686-488b-a9ee-fc041024e43d@huawei.com/
Tested-by: Liebes Wang <wanghaichi0403@gmail.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250506012009.3896990-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2025-05-20 10:30:44 -04:00
Arnd Bergmann
d612a07931 ext4: avoid -Wformat-security warning
check_igot_inode() prints a variable string, which causes a harmless
warning with 'make W=1':

fs/ext4/inode.c:4763:45: error: format string is not a string literal (potentially insecure) [-Werror,-Wformat-security]
 4763 |         ext4_error_inode(inode, function, line, 0, err_str);

Use a trivial "%s" format string instead.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250423164354.2780635-1-arnd@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-16 14:44:59 -04:00
Zhang Yi
1b4d2a0b79 ext4: check env when mapping and modifying extents
Add ext4_check_map_extents_env() to the places where loading extents,
mapping blocks, removing blocks, and modifying extents, excluding the
I/O writeback context. This function will verify whether the locking
mechanisms in place are adequate.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250423085257.122685-9-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-15 13:14:51 -04:00
Zhang Yi
7871da20d4 ext4: introduce ext4_check_map_extents_env() debug helper
Loading and modifying the extents tree and extent status tree without
holding the inode's i_rwsem or the mapping's invalidate_lock is not
permitted, except during the I/O writeback. Add a new debug helper
ext4_check_map_extents_env(), it will verify whether the extent
loading/modifying context is safe.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250423085257.122685-8-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14 10:42:13 -04:00
Zhang Yi
0b8e0bd450 ext4: factor out is_special_ino()
Factor out the helper is_special_ino() to facilitate the checking of
special inodes in the subsequent patches.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250423085257.122685-7-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14 10:42:12 -04:00
Zhang Yi
402e38e6b7 ext4: prevent stale extent cache entries caused by concurrent I/O writeback
Currently, in the I/O writeback path, ext4_map_blocks() may attempt to
cache additional unrelated extents in the extent status tree without
holding the inode's i_rwsem and the mapping's invalidate_lock. This can
lead to stale extent status entries remaining in certain scenarios,
potentially causing data corruption.

For example, when performing a collapse range in ext4_collapse_range(),
it clears the extent cache and dirty pages before removing blocks and
shifting extents. It also holds the i_data_sem during these two
operations. However, both ext4_ext_remove_space() and
ext4_ext_shift_extents() may briefly release the i_data_sem if journal
credits are insufficient (ext4_datasem_ensure_credits()). If another
writeback process writes dirty pages from other regions during this
interval, it may cache extents that are about to be modified. Unless
ext4_collapse_range() explicitly clears the extent cache again, these
cached entries can become stale and inconsistent with the actual
extents.

     0 a  n       b      c         m
     | |  |       |      |         |
    [www][wwwwww][wwwwwwww]...[wwwww][wwww]...
          |                           |
          N                           M

Assume that block a is dirty. The collapse range operation is removing
data from n to m and drops i_data_sem immediately after removing the
extent from b to c. At the same time, a concurrent writeback begins to
write back block a; it will reloads the extent from [n, b) into the
extent status tree since it does not hold the i_rwsem or the
invalidate_lock. After the collapse range operation, it left the stale
extent [n, b), which points logical block n to N, but the actual
physical block of n should be M.

Similarly, both ext4_insert_range() and ext4_truncate() have the same
problem. ext4_punch_hole() survived since it re-add a hole extent entry
after removing space since commit 9f1118223a ("ext4: add a hole extent
entry in cache after punch").

In most cases, during dirty page writeback, the block mapping
information is likely to be found in the extent cache, making it less
necessary to search for physical extents. Consequently, loading
unrelated extent caches during writeback appears to be ineffective.
Therefore, fix this by adds EXT4_EX_NOCACHE in the writeback path to
prevent caching of unrelated extents, eliminating this potential source
of corruption.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250423085257.122685-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14 10:42:12 -04:00
Harshad Shirwadkar
0b64fd74dd ext4: mark inode dirty before grabbing i_data_sem in ext4_setattr
Mark inode dirty first and then grab i_data_sem in ext4_setattr().

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://patch.msgid.link/20250508175908.1004880-4-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-08 21:56:17 -04:00
Harshad Shirwadkar
4d3266463e ext4: for committing inode, make ext4_fc_track_inode wait
If the inode that's being requested to track using ext4_fc_track_inode
is being committed, then wait until the inode finishes the
commit. Also, add calls to ext4_fc_track_inode at the right places.
With this patch, now calling ext4_reserve_inode_write() results in
inode being tracked for next fast commit. This ensures that by the
time ext4_reserve_inode_write() returns, it is ready to be modified
and won't be committed until the corresponding handle is open.

A subtle lock ordering requirement with i_data_sem (which is
documented in the code) requires that ext4_fc_track_inode() be called
before grabbing i_data_sem. So, this patch also adds explicit
ext4_fc_track_inode() calls in places where i_data_sem grabbed.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://patch.msgid.link/20250508175908.1004880-3-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-08 21:56:17 -04:00
John Garry
5d894321c4 fs: add atomic write unit max opt to statx
XFS will be able to support large atomic writes (atomic write > 1x block)
in future. This will be achieved by using different operating methods,
depending on the size of the write.

Specifically a new method of operation based in FS atomic extent remapping
will be supported in addition to the current HW offload-based method.

The FS method will generally be appreciably slower performing than the
HW-offload method. However the FS method will be typically able to
contribute to achieving a larger atomic write unit max limit.

XFS will support a hybrid mode, where HW offload method will be used when
possible, i.e. HW offload is used when the length of the write is
supported, and for other times FS-based atomic writes will be used.

As such, there is an atomic write length at which the user may experience
appreciably slower performance.

Advertise this limit in a new statx field, stx_atomic_write_unit_max_opt.

When zero, it means that there is no such performance boundary.

Masks STATX{_ATTR}_WRITE_ATOMIC can be used to get this new field. This is
ok for older kernels which don't support this new field, as they would
report 0 in this field (from zeroing in cp_statx()) already. Furthermore
those older kernels don't support large atomic writes - apart from block
fops, but there would be consistent performance there for atomic writes
in range [unit min, unit max].

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07 14:25:30 -07:00
Linus Torvalds
5aaaedb0cb A few more miscellaneous ext4 bug fixes and cleanups including some
syzbot failures and fixing a stale file handing refeencing an inode
 previously used as a regular file, but which has been deleted and
 reused as an ea_inode would result in ext4 erroneously consider this a
 case of fs corrupotion.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmf7r3YACgkQ8vlZVpUN
 gaPl9QgApwE5BAQdO6miW0sDMPj5b4sMc25aG4OPlfKhFqiIJB0Ub4zC2n0OFnaf
 HXk8P5oVeepH9ciTnYFF30X20Ythzjwmd9j5eyq2wsfYASQUjfcvmR9WovbqZtGQ
 3Zerd9QFp7SvZa+K4sADBhEb/7HAnxDGfiqSQptY6WQTwD+it1bnuhmzG0m6AH4m
 R1ItREDx7D2QrudDToFBd8XQ+FgRETZ8Qrs7PqIznw/dBNMdHRnAiw2eiyuoPU/S
 T8cmCxii3Z9sJ6LtohKYuWOmOmdxg951V5ZcekVRuaFSljSUsRsIplO7OlaMvQDs
 9vGVKiiZLdU2B0Wd90IeQUdJmP4xPg==
 =I8qx
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "A few more miscellaneous ext4 bug fixes and cleanups including some
  syzbot failures and fixing a stale file handing refeencing an inode
  previously used as a regular file, but which has been deleted and
  reused as an ea_inode would result in ext4 erroneously considering
  this a case of fs corruption"

* tag 'ext4_for_linus-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix off-by-one error in do_split
  ext4: make block validity check resistent to sb bh corruption
  ext4: avoid -Wflex-array-member-not-at-end warning
  Documentation: ext4: Add fields to ext4_super_block documentation
  ext4: don't treat fhandle lookup of ea_inode as FS corruption
2025-04-13 07:15:50 -07:00
Ojaswin Mujoo
ccad447a3d ext4: make block validity check resistent to sb bh corruption
Block validity checks need to be skipped in case they are called
for journal blocks since they are part of system's protected
zone.

Currently, this is done by checking inode->ino against
sbi->s_es->s_journal_inum, which is a direct read from the ext4 sb
buffer head. If someone modifies this underneath us then the
s_journal_inum field might get corrupted. To prevent against this,
change the check to directly compare the inode with journal->j_inode.

**Slight change in behavior**: During journal init path,
check_block_validity etc might be called for journal inode when
sbi->s_journal is not set yet. In this case we now proceed with
ext4_inode_block_valid() instead of returning early. Since systems zones
have not been set yet, it is okay to proceed so we can perform basic
checks on the blocks.

Suggested-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/0c06bc9ebfcd6ccfed84a36e79147bf45ff5adc1.1743142920.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-04-12 22:01:37 -04:00
Jann Horn
642335f3ea ext4: don't treat fhandle lookup of ea_inode as FS corruption
A file handle that userspace provides to open_by_handle_at() can
legitimately contain an outdated inode number that has since been reused
for another purpose - that's why the file handle also contains a generation
number.

But if the inode number has been reused for an ea_inode, check_igot_inode()
will notice, __ext4_iget() will go through ext4_error_inode(), and if the
inode was newly created, it will also be marked as bad by iget_failed().
This all happens before the point where the inode generation is checked.

ext4_error_inode() is supposed to only be used on filesystem corruption; it
should not be used when userspace just got unlucky with a stale file
handle. So when this happens, let __ext4_iget() just return an error.

Fixes: b3e6bcb945 ("ext4: add EA_INODE checking to ext4_iget()")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241129-ext4-ignore-ea-fhandle-v1-1-e532c0d1cee0@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-04-10 10:53:50 -04:00
Linus Torvalds
eb0ece1602 - The 6 patch series "Enable strict percpu address space checks" from
Uros Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.
 
   This has caused a small amount of fallout - two or three issues were
   reported.  In all cases the calling code was founf to be incorrect.
 
 - The 4 patch series "Some cleanup for memcg" from Chen Ridong
   implements some relatively monir cleanups for the memcontrol code.
 
 - The 17 patch series "mm: fixes for device-exclusive entries (hmm)"
   from David Hildenbrand fixes a boatload of issues which David found then
   using device-exclusive PTE entries when THP is enabled.  More work is
   needed, but this makes thins better - our own HMM selftests now succeed.
 
 - The 2 patch series "mm: zswap: remove z3fold and zbud" from Yosry
   Ahmed remove the z3fold and zbud implementations.  They have been
   deprecated for half a year and nobody has complained.
 
 - The 5 patch series "mm: further simplify VMA merge operation" from
   Lorenzo Stoakes implements numerous simplifications in this area.  No
   runtime effects are anticipated.
 
 - The 4 patch series "mm/madvise: remove redundant mmap_lock operations
   from process_madvise()" from SeongJae Park rationalizes the locking in
   the madvise() implementation.  Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.
 
 - The 12 patch series "Tiny cleanup and improvements about SWAP code"
   from Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.
 
 - The 2 patch series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak user-visible
   output.
 
 - The 2 patch series "mm/damon/paddr: fix large folios access and
   schemes handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.
 
 - The 3 patch series "mm/damon/core: fix wrong and/or useless
   damos_walk() behaviors" from SeongJae Park fixes a few issues with the
   accuracy of kdamond's walking of DAMON regions.
 
 - The 3 patch series "expose mapping wrprotect, fix fb_defio use" from
   Lorenzo Stoakes changes the interaction between framebuffer deferred-io
   and core MM.  No functional changes are anticipated - this is
   preparatory work for the future removal of page structure fields.
 
 - The 4 patch series "mm/damon: add support for hugepage_size DAMOS
   filter" from Usama Arif adds a DAMOS filter which permits the filtering
   by huge page sizes.
 
 - The 4 patch series "mm: permit guard regions for file-backed/shmem
   mappings" from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state.  The feature now covers shmem and
   file-backed mappings.
 
 - The 4 patch series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping for
   pte-mapped large folios.
 
 - The 18 patch series "reimplement per-vma lock as a refcount" from
   Suren Baghdasaryan puts the vm_lock back into the vma.  Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy.  This patchset provides small (0-10%) improvements on one
   microbenchmark.
 
 - The 5 patch series "Docs/mm/damon: misc DAMOS filters documentation
   fixes and improves" from SeongJae Park does some maintenance work on the
   DAMON docs.
 
 - The 27 patch series "hugetlb/CMA improvements for large systems" from
   Frank van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for unmapped
   pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the
   page's mapped/unmapped status.
 
 - The 19 patch series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.
 
 - The 12 patch series "selftests/mm: Some cleanups from trying to run
   them" from Brendan Jackman fixes a pile of unrelated issues which
   Brendan encountered while runnimg our selftests.
 
 - The 2 patch series "fs/proc/task_mmu: add guard region bit to pagemap"
   from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.
 
 - The 7 patch series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply wasn't
   being effective.
 
 - The 5 patch series "mm: cleanups for device-exclusive entries (hmm)"
   from David Hildenbrand implements a number of unrelated cleanups in this
   code.
 
 - The 5 patch series "mm: Rework generic PTDUMP configs" from Anshuman
   Khandual implements a number of preparatoty cleanups to the
   GENERIC_PTDUMP Kconfig logic.
 
 - The 8 patch series "mm/damon: auto-tune aggregation interval" from
   SeongJae Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval tuning.
 
 - The 5 patch series "Fix lazy mmu mode" from Ryan Roberts fixes some
   issues in powerpc, sparc and x86 lazy MMU implementations.  Ryan did
   this in preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.
 
 - The 2 patch series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the code
   easier to follow.
 
 - The 3 patch series "page_counter cleanup and size reduction" from
   Shakeel Butt cleans up the page_counter code and fixes a size increase
   which we accidentally added late last year.
 
 - The 3 patch series "Add a command line option that enables control of
   how many threads should be used to allocate huge pages" from Thomas
   Prescher does that.  It allows the careful operator to significantly
   reduce boot time by tuning the parallalization of huge page
   initialization.
 
 - The 3 patch series "Fix calculations in trace_balance_dirty_pages()
   for cgwb" from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.
 
 - The 9 patch series "mm/damon: make allow filters after reject filters
   useful and intuitive" from SeongJae Park improves the handling of allow
   and reject filters.  Behaviour is made more consistent and the
   documention is updated accordingly.
 
 - The 5 patch series "Switch zswap to object read/write APIs" from Yosry
   Ahmed updates zswap to the new object read/write APIs and thus permits
   the removal of some legacy code from zpool and zsmalloc.
 
 - The 6 patch series "Some trivial cleanups for shmem" from Baolin Wang
   does as it claims.
 
 - The 20 patch series "fs/dax: Fix ZONE_DEVICE page reference counts"
   from Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permittig the removal of a number of special-case
   checks.
 
 - The 4 patch series "refactor mremap and fix bug" from Lorenzo Stoakes
   is a preparatoty refactoring and cleanup of the mremap() code.
 
 - The 20 patch series "mm: MM owner tracking for large folios (!hugetlb)
   + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.
 
 - The 8 patch series "mm/damon: add sysfs dirs for managing DAMOS
   filters based on handling layers" from SeongJae Park adds a couple of
   new sysfs directories to ease the management of DAMON/DAMOS filters.
 
 - The 13 patch series "arch, mm: reduce code duplication in mem_init()"
   from Mike Rapoport consolidates many per-arch implementations of
   mem_init() into code generic code, where that is practical.
 
 - The 13 patch series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.
 
 - The 3 patch series "mm: page_ext: Introduce new iteration API" from
   Luiz Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.
 
 - The 8 patch series "Buddy allocator like (or non-uniform) folio split"
   from Zi Yan reworks the code to split a folio into smaller folios.  The
   main benefit is lessened memory consumption: fewer post-split folios are
   generated.
 
 - The 2 patch series "Minimize xa_node allocation during xarry split"
   from Zi Yan reduces the number of xarray xa_nodes which are generated
   during an xarray split.
 
 - The 2 patch series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.
 
 - The 3 patch series "Add tracepoints for lowmem reserves, watermarks
   and totalreserve_pages" from Martin Liu adds some more tracepoints to
   the page allocator code.
 
 - The 4 patch series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which SeongJae
   observed during his earlier madvise work.
 
 - The 3 patch series "mm/hwpoison: Fix regressions in memory failure
   handling" from Shuai Xue addresses two quite serious regressions which
   Shuai has observed in the memory-failure implementation.
 
 - The 5 patch series "mm: reliable huge page allocator" from Johannes
   Weiner makes huge page allocations cheaper and more reliable by reducing
   fragmentation.
 
 - The 5 patch series "Minor memcg cleanups & prep for memdescs" from
   Matthew Wilcox is preparatory work for the future implementation of
   memdescs.
 
 - The 4 patch series "track memory used by balloon drivers" from Nico
   Pache introduces a way to track memory used by our various balloon
   drivers.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for active
   pages" from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.
 
 - The 2 patch series "Adding Proactive Memory Reclaim Statistics" from
   Hao Jia separates the proactive reclaim statistics from the direct
   reclaim statistics.
 
 - The 2 patch series "mm/vmscan: don't try to reclaim hwpoison folio"
   from Jinjiang Tu fixes our handling of hwpoisoned pages within the
   reclaim code.
 -----BEGIN PGP SIGNATURE-----
 
 iHQEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nZaAAKCRDdBJ7gKXxA
 jsOWAPiP4r7CJHMZRK4eyJOkvS1a1r+TsIarrFZtjwvf/GIfAQCEG+JDxVfUaUSF
 Ee93qSSLR1BkNdDw+931Pu0mXfbnBw==
 =Pn2K
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - The series "Enable strict percpu address space checks" from Uros
   Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.

   This has caused a small amount of fallout - two or three issues were
   reported. In all cases the calling code was found to be incorrect.

 - The series "Some cleanup for memcg" from Chen Ridong implements some
   relatively monir cleanups for the memcontrol code.

 - The series "mm: fixes for device-exclusive entries (hmm)" from David
   Hildenbrand fixes a boatload of issues which David found then using
   device-exclusive PTE entries when THP is enabled. More work is
   needed, but this makes thins better - our own HMM selftests now
   succeed.

 - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
   remove the z3fold and zbud implementations. They have been deprecated
   for half a year and nobody has complained.

 - The series "mm: further simplify VMA merge operation" from Lorenzo
   Stoakes implements numerous simplifications in this area. No runtime
   effects are anticipated.

 - The series "mm/madvise: remove redundant mmap_lock operations from
   process_madvise()" from SeongJae Park rationalizes the locking in the
   madvise() implementation. Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.

 - The series "Tiny cleanup and improvements about SWAP code" from
   Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.

 - The series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak
   user-visible output.

 - The series "mm/damon/paddr: fix large folios access and schemes
   handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.

 - The series "mm/damon/core: fix wrong and/or useless damos_walk()
   behaviors" from SeongJae Park fixes a few issues with the accuracy of
   kdamond's walking of DAMON regions.

 - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
   Stoakes changes the interaction between framebuffer deferred-io and
   core MM. No functional changes are anticipated - this is preparatory
   work for the future removal of page structure fields.

 - The series "mm/damon: add support for hugepage_size DAMOS filter"
   from Usama Arif adds a DAMOS filter which permits the filtering by
   huge page sizes.

 - The series "mm: permit guard regions for file-backed/shmem mappings"
   from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state. The feature now covers shmem and
   file-backed mappings.

 - The series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping
   for pte-mapped large folios.

 - The series "reimplement per-vma lock as a refcount" from Suren
   Baghdasaryan puts the vm_lock back into the vma. Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy. This patchset provides small (0-10%) improvements on one
   microbenchmark.

 - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
   improves" from SeongJae Park does some maintenance work on the DAMON
   docs.

 - The series "hugetlb/CMA improvements for large systems" from Frank
   van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.

 - The series "mm/damon: introduce DAMOS filter type for unmapped pages"
   from SeongJae Park enables users of DMAON/DAMOS to filter my the
   page's mapped/unmapped status.

 - The series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.

 - The series "selftests/mm: Some cleanups from trying to run them" from
   Brendan Jackman fixes a pile of unrelated issues which Brendan
   encountered while runnimg our selftests.

 - The series "fs/proc/task_mmu: add guard region bit to pagemap" from
   Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.

 - The series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply
   wasn't being effective.

 - The series "mm: cleanups for device-exclusive entries (hmm)" from
   David Hildenbrand implements a number of unrelated cleanups in this
   code.

 - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
   implements a number of preparatoty cleanups to the GENERIC_PTDUMP
   Kconfig logic.

 - The series "mm/damon: auto-tune aggregation interval" from SeongJae
   Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval tuning.

 - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
   powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
   preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.

 - The series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the
   code easier to follow.

 - The series "page_counter cleanup and size reduction" from Shakeel
   Butt cleans up the page_counter code and fixes a size increase which
   we accidentally added late last year.

 - The series "Add a command line option that enables control of how
   many threads should be used to allocate huge pages" from Thomas
   Prescher does that. It allows the careful operator to significantly
   reduce boot time by tuning the parallalization of huge page
   initialization.

 - The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
   from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.

 - The series "mm/damon: make allow filters after reject filters useful
   and intuitive" from SeongJae Park improves the handling of allow and
   reject filters. Behaviour is made more consistent and the documention
   is updated accordingly.

 - The series "Switch zswap to object read/write APIs" from Yosry Ahmed
   updates zswap to the new object read/write APIs and thus permits the
   removal of some legacy code from zpool and zsmalloc.

 - The series "Some trivial cleanups for shmem" from Baolin Wang does as
   it claims.

 - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
   Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permittig the removal of a number of special-case
   checks.

 - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
   preparatoty refactoring and cleanup of the mremap() code.

 - The series "mm: MM owner tracking for large folios (!hugetlb) +
   CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.

 - The series "mm/damon: add sysfs dirs for managing DAMOS filters based
   on handling layers" from SeongJae Park adds a couple of new sysfs
   directories to ease the management of DAMON/DAMOS filters.

 - The series "arch, mm: reduce code duplication in mem_init()" from
   Mike Rapoport consolidates many per-arch implementations of
   mem_init() into code generic code, where that is practical.

 - The series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.

 - The series "mm: page_ext: Introduce new iteration API" from Luiz
   Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.

 - The series "Buddy allocator like (or non-uniform) folio split" from
   Zi Yan reworks the code to split a folio into smaller folios. The
   main benefit is lessened memory consumption: fewer post-split folios
   are generated.

 - The series "Minimize xa_node allocation during xarry split" from Zi
   Yan reduces the number of xarray xa_nodes which are generated during
   an xarray split.

 - The series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.

 - The series "Add tracepoints for lowmem reserves, watermarks and
   totalreserve_pages" from Martin Liu adds some more tracepoints to the
   page allocator code.

 - The series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which
   SeongJae observed during his earlier madvise work.

 - The series "mm/hwpoison: Fix regressions in memory failure handling"
   from Shuai Xue addresses two quite serious regressions which Shuai
   has observed in the memory-failure implementation.

 - The series "mm: reliable huge page allocator" from Johannes Weiner
   makes huge page allocations cheaper and more reliable by reducing
   fragmentation.

 - The series "Minor memcg cleanups & prep for memdescs" from Matthew
   Wilcox is preparatory work for the future implementation of memdescs.

 - The series "track memory used by balloon drivers" from Nico Pache
   introduces a way to track memory used by our various balloon drivers.

 - The series "mm/damon: introduce DAMOS filter type for active pages"
   from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.

 - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
   separates the proactive reclaim statistics from the direct reclaim
   statistics.

 - The series "mm/vmscan: don't try to reclaim hwpoison folio" from
   Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
   code.

* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
  mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
  x86/mm: restore early initialization of high_memory for 32-bits
  mm/vmscan: don't try to reclaim hwpoison folio
  mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
  cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
  mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
  selftests/mm: speed up split_huge_page_test
  selftests/mm: uffd-unit-tests support for hugepages > 2M
  docs/mm/damon/design: document active DAMOS filter type
  mm/damon: implement a new DAMOS filter type for active pages
  fs/dax: don't disassociate zero page entries
  MM documentation: add "Unaccepted" meminfo entry
  selftests/mm: add commentary about 9pfs bugs
  fork: use __vmalloc_node() for stack allocation
  docs/mm: Physical Memory: Populate the "Zones" section
  xen: balloon: update the NR_BALLOON_PAGES state
  hv_balloon: update the NR_BALLOON_PAGES state
  balloon_compaction: update the NR_BALLOON_PAGES state
  meminfo: add a per node counter for balloon drivers
  mm: remove references to folio in __memcg_kmem_uncharge_page()
  ...
2025-04-01 09:29:18 -07:00
Linus Torvalds
5c2a430e85 Ext4 bug fixes and cleanups, including:
* hardening against maliciously fuzzed file systems
   * backwards compatibility for the brief period when we attempted to
      ignore zero-width characters
   * avoid potentially BUG'ing if there is a file system corruption found
     during the file system unmount
   * fix free space reporting by statfs when project quotas are enabled
     and the free space is less than the remaining project quota
 
 Also improve performance when replaying a journal with a very large
 number of revoke records (applicable for Lustre volumes).
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmflfY4ACgkQ8vlZVpUN
 gaMx7Qf/akTELvyBZ7iPCCHh2HwayuO8qLhPNqrU0TmYMFvgwgYUPcQ3BLn8CE+/
 j5UeT8XxNaLU4GJn3z+q6yW6PnNHfqZqKry9j/iPc3s1mjTslntr/xENlgu6i4Bp
 Q58xc7Pj45vdmP+xmYhRnJcefgsZMvB/N1SEHxwIP8bntZqsEvP9pI82r9Ouc8SA
 ZLQ1/K4OADmk7f3GhlPr9AtgH7O0CjlAas30h/AW77DXBQl7ZgbDsGDlgTwaGqkR
 jHcvfr6hLnWy+MUVGmlNZ2HY6iUgBPItWlYCP/fsrUdnc+CONyl5E17JPSl1QQtR
 CLYlo4xV8j1+zJ094DjhDWMKI2G7jw==
 =oudL
 -----END PGP SIGNATURE-----

Merge tag 'ext4-for_linus-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Ext4 bug fixes and cleanups, including:

   - hardening against maliciously fuzzed file systems

   - backwards compatibility for the brief period when we attempted to
     ignore zero-width characters

   - avoid potentially BUG'ing if there is a file system corruption
     found during the file system unmount

   - fix free space reporting by statfs when project quotas are enabled
     and the free space is less than the remaining project quota

  Also improve performance when replaying a journal with a very large
  number of revoke records (applicable for Lustre volumes)"

* tag 'ext4-for_linus-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (71 commits)
  ext4: fix OOB read when checking dotdot dir
  ext4: on a remount, only log the ro or r/w state when it has changed
  ext4: correct the error handle in ext4_fallocate()
  ext4: Make sb update interval tunable
  ext4: avoid journaling sb update on error if journal is destroying
  ext4: define ext4_journal_destroy wrapper
  ext4: hash: simplify kzalloc(n * 1, ...) to kzalloc(n, ...)
  jbd2: add a missing data flush during file and fs synchronization
  ext4: don't over-report free space or inodes in statvfs
  ext4: clear DISCARD flag if device does not support discard
  jbd2: remove jbd2_journal_unfile_buffer()
  ext4: reorder capability check last
  ext4: update the comment about mb_optimize_scan
  jbd2: fix off-by-one while erasing journal
  ext4: remove references to bh->b_page
  ext4: goto right label 'out_mmap_sem' in ext4_setattr()
  ext4: fix out-of-bound read in ext4_xattr_inode_dec_ref_all()
  ext4: introduce ITAIL helper
  jbd2: remove redundant function jbd2_journal_has_csum_v2or3_feature
  ext4: remove redundant function ext4_has_metadata_csum
  ...
2025-03-27 13:27:08 -07:00