Commit graph

14430 commits

Author SHA1 Message Date
Naohiro Aota
1c34e71966 btrfs: pass struct btrfs_inode to btrfs_free_reserved_data_space_noquota()
As well as the last patch, pass struct btrfs_inode to the function and
let it distinguish which data space it is working on in a later patch.
There is no functional change with this commit.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:52 +02:00
Naohiro Aota
5d39fda880 btrfs: pass btrfs_space_info to btrfs_reserve_data_bytes()
Pass struct btrfs_space_info to btrfs_reserve_data_bytes() to allow
reserving the data from multiple data space_info candidates.

This is a preparation for the following commits and there is no functional
change.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:52 +02:00
Filipe Manana
66864101d1 btrfs: make extent unpinning more efficient when committing transaction
At btrfs_finish_extent_commit() we have this loop that keeps finding an
extent range to unpin in the transaction's pinned_extents io tree, caches
the extent state and then passes that cached extent state to
btrfs_clear_extent_dirty(), which will free that extent state since we
clear the only bit it can have set. So on each loop iteration we do a
full io tree search and the cached state is used only to avoid having
a tree search done by btrfs_clear_extent_dirty().

During the lifetime of a transaction we can pin many thousands of extents,
resulting in a large and deep rb tree that backs the io tree. For example,
for the following fs_mark run on a 12 cores boxes:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/nullb0
  FILES=100000
  THREADS=$(nproc --all)

  echo "performance" | \
      tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  OPTS="-S 0 -L 8 -n $FILES -s 0 -t $THREADS -k"
  for ((i = 1; i <= $THREADS; i++)); do
      OPTS="$OPTS -d $MNT/d$i"
  done

  fs_mark $OPTS

  umount $MNT

an histogram for the number of ranges (elements) in the pinned extents
io tree of a transaction was the following:

  Count: 76
  Range: 5440.000 - 51088.000; Mean: 27354.368; Median: 28312.000; Stddev: 9800.767
  Percentiles:  90th: 40486.000; 95th: 43322.000; 99th: 51088.000
   5440.000 -  6805.809:     1 ###
   6805.809 - 10652.034:     1 ###
  10652.034 - 13326.178:     3 ########
  13326.178 - 16671.590:     8 ######################
  16671.590 - 20856.773:     7 ####################
  20856.773 - 26092.528:    13 ####################################
  26092.528 - 32642.571:    19 #####################################################
  32642.571 - 40836.818:    17 ###############################################
  40836.818 - 51088.000:     7 ####################

We can improve on this by grabbing the next state before calling
btrfs_clear_extent_dirty(), avoiding a full tree search on the next
iteration which always has an O(log n) complexity while grabbing the next
element (rb_next() rbtree operation) is in the worst case O(log n) too,
but very often much less than that, making it more efficient.

Here follow histograms for the execution times, in nanoseconds, of
btrfs_finish_extent_commit() before and after applying this patch and all
the other patches in the same patchset.

Before patchset:

  Count: 32
  Range: 3925691.000 - 269990635.000; Mean: 133814526.906; Median: 122758052.000; Stddev: 65776550.375
  Percentiles:  90th: 228672087.000; 95th: 265187000.000; 99th: 269990635.000
    3925691.000 -   5993208.660:     1 ####
    5993208.660 -  75878537.656:     4 ##################
   75878537.656 - 115840974.514:    12 #####################################################
  115840974.514 - 176850157.761:     6 ###########################
  176850157.761 - 269990635.000:     9 ########################################

After patchset:

  Count: 32
  Range: 1849393.000 - 231491064.000; Mean: 126978584.625; Median: 123732897.000; Stddev: 58007821.806
  Percentiles:  90th: 203055491.000; 95th: 219952699.000; 99th: 231491064.000
    1849393.000 -   2997642.092:     1 ####
    2997642.092 -  88111637.071:     9 #####################################
   88111637.071 - 142818264.414:     9 #####################################
  142818264.414 - 231491064.000:    13 #####################################################

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:52 +02:00
Filipe Manana
93ef6c232a btrfs: remove variable to track trimmed bytes at btrfs_finish_extent_commit()
We don't need to keep track of discarded (trimmed) bytes at
btrfs_finish_extent_commit() but we are declaring a local variable for
that and passing a reference to the btrfs_discard_extent() calls when we
are processing delete block groups. So instead pass NULL to
btrfs_discard_extent() and remove that variable.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:52 +02:00
Filipe Manana
2598371392 btrfs: don't BUG_ON() when unpinning extents during transaction commit
In the final phase of a transaction commit, when unpinning extents at
btrfs_finish_extent_commit(), there's no need to BUG_ON() if we fail to
unpin an extent range. All that can happen is that we fail to return the
extent range to the in-memory free space cache, meaning no future space
allocations can reuse that extent range while the fs is mounted.

So instead return the error to the caller and make it abort the
transaction, so that the error is noticed and prevent misteriously leaking
space. We keep track of the first error we get while unpinning an extent
range and keep trying to unpin all the following extent ranges, so that
we attempt to do all discards. The transaction abort will deal with all
resource cleanups.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:52 +02:00
Filipe Manana
b2460c2aee btrfs: remove unnecessary NULL checks before freeing extent state
When manipulating extent bits for an io tree range, we have this pattern:

   if (prealloc)
          btrfs_free_extent_state(prealloc);

but this is not needed nowadays since btrfs_free_extent_state() ignores
a NULL pointer argument, following the common pattern of kernel and btrfs
freeing functions, as well as libc and other user space libraries.
So remove the NULL checks, reducing source code and object size.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:52 +02:00
Filipe Manana
aa2c80a9ae btrfs: avoid re-searching tree when setting bits in an extent range
When setting bits for an extent range (set_extent_bit()), if the current
extent state record starts after the target range, we always do a jump to
the 'search_again' label, which will cause us to do a full tree search for
the next state if the current state ends before the target range. Unless
we need to reschedule, we can just grab the next state and process it,
avoiding a full tree search, even if that next state is not contiguous, as
we'll allocate and insert a new prealloc state if needed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:52 +02:00
Filipe Manana
b61dd9b1cb btrfs: avoid repeated extent state processing when setting extent bits
When setting bits for an extent range, if we find an extent state with
its start offset greater than current start offset, we insert a new extent
state to cover the gap, with its end offset computed and stored in the
@this_end local variable, and after the insertion we update the current
start offset to @this_end + 1. However if the insert_state() call resulted
in an extent state merge then the end offset of the merged extent may be
greater than @this_end and if that's the case, since we jump to the
'search_again' label, we'll do a full tree search that will leave us in
the same extent state - this is harmless but wastes time by doing a
pointless tree search and extent state processing.

So improve on this by updating the current start offset to the end offset
of the inserted state plus 1. This also removes the use of the @this_end
variable and directly set the value in the prealloc extent state to avoid
any confusion and misuse in the future.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:52 +02:00
Filipe Manana
8faab454c5 btrfs: simplify last record detection at set_extent_bit()
There's no need to compare the current extent state's end offset to
(u64)-1 to check if we have the last possible record and to check as
as well if after updating the start offset to the end offset of the
current record plus one we are still inside the target range.

Instead we can simplify and exit if the current extent state ends at or
after the target range and then remove the check for the (u64)-1 as well
as the check to see if the updated start offset (to last_end + 1) is still
inside the target range. Besides the simplification, this also avoids
seaching for the next extent state record (through next_state()) when the
current extent state record ends at the same offset as our target range,
which is pointless and only wastes times iterating through the rb tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:52 +02:00
Filipe Manana
41d69d4d78 btrfs: exit after state split error at set_extent_bit()
If split_state() returned an error we call extent_io_tree_panic() which
will trigger a BUG() call. However if CONFIG_BUG is disabled, which is an
uncommon and exotic scenario, then we fallthrough and hit a use after free
when calling set_state_bits() since the extent state record which the
local variable 'prealloc' points to was freed by split_state().

So jump to the label 'out' after calling extent_io_tree_panic() and set
the 'prealloc' pointer to NULL since split_state() has already freed it
when it hit an error.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:52 +02:00
Filipe Manana
67f10a1018 btrfs: exit after state insertion failure at set_extent_bit()
If insert_state() state failed it returns an error pointer and we call
extent_io_tree_panic() which will trigger a BUG() call. However if
CONFIG_BUG is disabled, which is an uncommon and exotic scenario, then
we fallthrough and call cache_state() which will dereference the error
pointer, resulting in an invalid memory access.

So jump to the 'out' label after calling extent_io_tree_panic(), it also
makes the code more clear besides dealing with the exotic scenario where
CONFIG_BUG is disabled.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:51 +02:00
Filipe Manana
0edc1a5c54 btrfs: simplify last record detection at btrfs_convert_extent_bit()
There's no need to compare the current extent state's end offset to
(u64)-1 to check if we have the last possible record and to check as
as well if after updating the start offset to the end offset of the
current record plus one we are still inside the target range.

Instead we can simplify and exit if the current extent state ends at or
after the target range and then remove the check for the (u64)-1 as well
as the check to see if the updated start offset (to last_end + 1) is still
inside the target range.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:51 +02:00
Filipe Manana
be2270262f btrfs: avoid re-searching tree when converting bits in an extent range
When converting bits for an extent range (btrfs_convert_extent_bit()), if
the current extent state record starts after the target range, we always
do a jump to the 'search_again' label, which will cause us to do a full
tree search for the next state if the current state ends before the target
range. Unless we need to reschedule, we can just grab the next state and
process it, avoiding a full tree search, even if that next state is not
contiguous, as we'll allocate and insert a new prealloc state if needed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:51 +02:00
Filipe Manana
eeb808422f btrfs: avoid repeated extent state processing when converting extent bits
When converting bits for an extent range, if we find an extent state with
its start offset greater than current start offset, we insert a new extent
state to cover the gap, with its end offset computed and stored in the
@this_end local variable, and after the insertion we update the current
start offset to @this_end + 1. However if the insert_state() call resulted
in an extent state merge then the end offset of the merged extent may be
greater than @this_end and if that's the case, since we jump to the
'search_again' label, we'll do a full tree search that will leave us in
the same extent state - this is harmless but wastes time by doing a
pointless tree search and extent state processing.

So improve on this by updating the current start offset to the end offset
of the inserted state plus 1. This also removes the use of the @this_end
variable and directly set the value in the prealloc extent state to avoid
any confusion and misuse in the future.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:51 +02:00
Filipe Manana
240dd0e1bb btrfs: avoid unnecessary next node searches when clearing bits from extent range
When clearing bits for a range in an io tree, at clear_state_bit(), we
always go search for the next node (through next_state() -> rb_next()) and
return it. However if the current extent state record ends at or after the
target range passed to btrfs_clear_extent_bit_changeset() or
btrfs_convert_extent_bit(), we are just wasting time finding that next
node since we won't use it in those functions.

Improve on this by skipping the next node search if the current node ends
at or after the target range.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:51 +02:00
Filipe Manana
3bf179e36d btrfs: exit after state insertion failure at btrfs_convert_extent_bit()
If insert_state() state failed it returns an error pointer and we call
extent_io_tree_panic() which will trigger a BUG() call. However if
CONFIG_BUG is disabled, which is an uncommon and exotic scenario, then
we fallthrough and call cache_state() which will dereference the error
pointer, resulting in an invalid memory access.

So jump to the 'out' label after calling extent_io_tree_panic(), it also
makes the code more clear besides dealing with the exotic scenario where
CONFIG_BUG is disabled.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:51 +02:00
Filipe Manana
2a72dd9996 btrfs: exit after state split error at btrfs_convert_extent_bit()
If split_state() returned an error we call extent_io_tree_panic() which
will trigger a BUG() call. However if CONFIG_BUG is disabled, which is an
uncommon and exotic scenario, then we fallthrough and hit a use after free
when calling set_state_bits() since the extent state record which the
local variable 'prealloc' points to was freed by split_state().

So jump to the label 'out' after calling extent_io_tree_panic() and set
the 'prealloc' pointer to NULL since split_state() has already freed it
when it hit an error.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:51 +02:00
Filipe Manana
5f9c554a6c btrfs: remove duplicate error check at btrfs_convert_extent_bit()
There's no need to check if split_state() returned an error twice, instead
unify into a single if statement after setting 'prealloc' to NULL, because
on error split_state() frees the 'prealloc' extent state record.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:51 +02:00
Filipe Manana
f2a24bef55 btrfs: simplify last record detection at btrfs_clear_extent_bit_changeset()
Instead of checking for an end offset of (u64)-1 (U64_MAX) for the current
extent state's end, and then checking after updating the current start
offset if it's now beyond the range's end offset, we can simply stop if
the current extent state's end is greater than or equals to our range's
end offset. This helps remove one comparison under the 'next' label and
allows to remove the if statement that checks if the start offset is
greater than the end offset under the 'search_again' label.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:51 +02:00
Filipe Manana
6c28102f9a btrfs: avoid extra tree search at btrfs_clear_extent_bit_changeset()
When we find an extent state that starts before our range's start we
split it and jump into the 'search_again' label with our start offset
remaining the same, making us then go to the 'again' label and search
again for an extent state that contains the 'start' offset, and this
time it finds the same extent state but with its start offset set to
our range's start offset (due to the split). This is because we have
consumed the preallocated extent state record and we may need to split
again, and by jumping to 'again' we release the spinlock, allocate a new
prealloc state and restart the search.

However we may not need to restart and allocate a new extent state in
case we don't find extent states that cross our end offset, therefore
no need for further extent state splits, or we may be able to do an
atomic allocation (which is quick even if it fails). In these cases
it's a waste to restart the search.

So change the behaviour to do the restart only if we need to reschedule,
otherwise fall through - if we need to allocate an extent state for split
operations, we will try an atomic allocation and if that fails we will do
the restart as before.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:51 +02:00
Filipe Manana
c832378622 btrfs: use bools for local variables at btrfs_clear_extent_bit_changeset()
Several variables are defined as integers but used as booleans, and the
'delete' variable can be made const since it's not changed after being
declared. So change them to proper booleans and simplify setting their
value.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
Filipe Manana
5af1eae78d btrfs: add missing error return to btrfs_clear_extent_bit_changeset()
We have a couple error branches where we have an error stored in the 'err'
variable and then jump to the 'out' label, however we don't return that
error, we just return 0. Normally this is not a problem since those error
branches call extent_io_tree_panic() which triggers a BUG() call, however
it's possible to have rather exotic kernel config with CONFIG_BUG disabled
in which case the BUG() call does nothing and we fallthrough. So make sure
to return the error, not just to fix that exotic case but also to make the
code less confusing. While at it also rename the 'err' variable to 'ret'
since this is the style we prefer and use more widely.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
Filipe Manana
2187540b6f btrfs: exit after state split error at btrfs_clear_extent_bit_changeset()
If split_state() returned an error we call extent_io_tree_panic() which
will trigger a BUG() call. However if CONFIG_BUG is disabled, which is an
uncommon and exotic scenario, then we fallthrough and hit a use after free
when calling clear_state_bit() since the extent state record which the
local variable 'prealloc' points to was freed by split_state().

So jump to the label 'out' after calling extent_io_tree_panic() and set
the 'prealloc' pointer to NULL since split_state() has already freed it
when it hit an error.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
Filipe Manana
f389e7b982 btrfs: remove duplicate error check at btrfs_clear_extent_bit_changeset()
There's no need to check if split_state() returned an error twice, instead
unify into a single if statement after setting 'prealloc' to NULL, because
on error split_state() frees the 'prealloc' extent state record.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
Qu Wenruo
007fa63225 btrfs: get rid of btrfs_read_dev_super()
The function is introduced by commit a512bbf855 ("Btrfs: superblock
duplication") at the beginning of btrfs.

It leaved a comment saying we'd need a special mount option to read all
super blocks, but it's never been implemented and there was not
need/request for it. The check/rescue tools are able to start from a
specific copy and use it as primary eventually.

This means btrfs_read_dev_super() is always reading the first super
block, making all the code finding the latest super block unnecessary.

Just remove that function and replace all call sites with
btrfs_read_disk_super(bdev, 0, false).

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
Qu Wenruo
63f32b7b5d btrfs: merge btrfs_read_dev_one_super() into btrfs_read_disk_super()
We have two functions to read a super block from a block device:

- btrfs_read_dev_one_super()
  Exported from disk-io.c

- btrfs_read_disk_super()
  Local to volumes.c

And they have some minor differences:

- btrfs_read_dev_one_super() uses @copy_num
  Meanwhile btrfs_read_disk_super() relies on the physical and expected
  bytenr passed from the caller.

  The parameter list of btrfs_read_dev_one_super() is more user
  friendly.

- btrfs_read_disk_super() makes sure the label is NUL terminated

We do not need two different functions doing the same job, so merge the
behavior into btrfs_read_disk_super() by:

- Remove btrfs_read_dev_one_super()

- Export btrfs_read_disk_super()
  The name pairs with btrfs_release_disk_super() perfectly.

- Change the parameter list of btrfs_read_disk_super() to mimic
  btrfs_read_dev_one_super()
  All existing callers are calculating the physical address and expect
  bytenr before calling btrfs_read_disk_super() already.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
Daniel Vacek
13ae88706a btrfs: get rid of goto in alloc_test_extent_buffer()
The `free_eb` label is used only once. Simplify by moving the code inplace.

Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
Josef Bacik
5e121ae687 btrfs: use buffer xarray for extent buffer writeback operations
Currently we have this ugly back and forth with the btree writeback
where we find the folio, find the eb associated with that folio, and
then attempt to writeback.  This results in two different paths for
subpage ebs and >= page size ebs.

Clean this up by adding our own infrastructure around looking up tagged
ebs and writing the ebs out directly.  This allows us to unify the
subpage and >= pagesize IO paths, resulting in a much cleaner writeback
path for extent buffers.

I ran this through fsperf on a VM with 8 CPUs and 16GiB of RAM.  I used
smallfiles100k, but reduced the files to 1k to make it run faster, the
results are as follows, with the statistically significant improvements
marked with *, there were no regressions.  fsperf was run with -n 10 for
both runs, so the baseline is the average 10 runs and the test is the
average of 10 runs.

smallfiles100k results
      metric           baseline       current        stdev            diff
================================================================================
avg_commit_ms               68.58         58.44          3.35   -14.79% *
commits                    270.60        254.70         16.24    -5.88%
dev_read_iops                  48            48             0     0.00%
dev_read_kbytes              1044          1044             0     0.00%
dev_write_iops          866117.90     850028.10      14292.20    -1.86%
dev_write_kbytes      10939976.40   10605701.20     351330.32    -3.06%
elapsed                     49.30            33          1.64   -33.06% *
end_state_mount_ns    41251498.80   35773220.70    2531205.32   -13.28% *
end_state_umount_ns      1.90e+09      1.50e+09   14186226.85   -21.38% *
max_commit_ms                 139        111.60          9.72   -19.71% *
sys_cpu                      4.90          3.86          0.88   -21.29%
write_bw_bytes        42935768.20   64318451.10    1609415.05    49.80% *
write_clat_ns_mean      366431.69     243202.60      14161.98   -33.63% *
write_clat_ns_p50        49203.20         20992        264.40   -57.34% *
write_clat_ns_p99          827392     653721.60      65904.74   -20.99% *
write_io_kbytes           2035940       2035940             0     0.00%
write_iops               10482.37      15702.75        392.92    49.80% *
write_lat_ns_max         1.01e+08      90516129    3910102.06   -10.29% *
write_lat_ns_mean       366556.19     243308.48      14154.51   -33.62% *

As you can see we get about a 33% decrease runtime, with a 50%
throughput increase, which is pretty significant.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
Josef Bacik
4bc0a3cb75 btrfs: set DIRTY and WRITEBACK tags on the buffer_tree
In preparation for changing how we do writeout of extent buffers, start
tagging the extent buffer xarray with DIRTY and WRITEBACK to make it
easier to find extent buffers that are in either state.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
Josef Bacik
19d7f65f03 btrfs: convert the buffer_radix to an xarray
In order to fully utilize xarray tagging to improve writeback we need to
convert the buffer_radix to a proper xarray.  This conversion is
relatively straightforward as the radix code uses the xarray underneath.
Using xarray directly allows for quite a lot less code.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:50 +02:00
David Sterba
656e9f51de btrfs: rename btrfs_discard workqueue to btrfs-discard
We use the "btrfs-" prefix for our workqueues, the discard has
underscore instead of dash, so unify it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:49 +02:00
David Sterba
13d6d866e8 btrfs: on unknown chunk allocation policy fallback to regular
We have only two chunk allocation policies right now and the
switch/cases don't handle an unknown one properly. The error is in the
impossible category (the policy is stored only in memory), we don't have
to BUG(), falling back to regular policy should be safe.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:49 +02:00
David Sterba
3329d3d833 btrfs: reformat comments in acls_after_inode_item()
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:49 +02:00
David Sterba
f24d25544f btrfs: switch int dev_replace_is_ongoing variables/parameters to bool
Both the variable and the parameter are used as logical indicators so
convert them to bool.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:49 +02:00
David Sterba
f963e0128b btrfs: trivial conversion to return bool instead of int
Old code has a lot of int for bool return values, bool is recommended
and done in new code. Convert the trivial cases that do simple 0/false
and 1/true. Functions comment are updated if needed.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:49 +02:00
Qu Wenruo
73d6bcf41b btrfs: subpage: reject tree blocks which are not nodesize aligned
When btrfs subpage support (fs block < page size) was introduced, a
subpage filesystem will only reject tree blocks which cross page
boundaries.

This used to be a compromise to simplify the tree block handling and
still allowing subpage cases to read some old converted filesystems
which did not have proper chunk alignment.

But in practice, suppose we have the following unaligned tree block on a
64K page sized system:

  0                           32K           44K             60K  64K
  |                                         |///////////////|    |

Although btrfs has no problem reading the tree block at [44K, 60K), if
extent allocator is allocating another tree block, it may choose the
range [60K, 74K), as extent allocator has no awareness if it's a subpage
metadata request or not.

Then we'd get -EINVAL from the following sequence:

 btrfs_alloc_tree_block()
 |- btrfs_reserve_extent()
 |  Which returned range [60K, 74K)
 |- btrfs_init_new_buffer()
    |- btrfs_find_create_tree_block()
       |- alloc_extent_buffer()
          |- check_eb_alignment()
	     Which returned -EINVAL, because the range crosses page
	     boundary.

This situation will not fix itself and should mostly mark the fs
read-only.

Thankfully we didn't really get such reports in the real world because:

- The original unaligned tree block is only caused by older
  btrfs-convert
  It's before the btrfs-convert rework was done in v4.6, where converted
  btrfs filesystem can have metadata block groups which are not aligned
  to nodesize nor stripe size (64K).

  But after btrfs-progs v4.6, all chunks allocated will be stripe (64K)
  aligned, thus no more such problem.

Considering how old the fix is (v4.6 was released almost 10 years ago),
subpage support for btrfs was introduced in v5.15, it should be safe to
reject those unaligned tree blocks.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:49 +02:00
Daniel Vacek
406698623a btrfs: move folio initialization to one place in attach_eb_folio_to_filemap()
This is just a trivial change. The code looks a bit more readable this way, IMO.

Move initialization of existing_folio to the beginning of the retry loop
so it's set to NULL at one place.

Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:49 +02:00
David Sterba
c779b7980c btrfs: raid56: rename parameter err to status in endio helpers
Trivial renames to unify the naming of blk_status_t variables/parameters.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:49 +02:00
David Sterba
853b5727c9 btrfs: change return type of btrfs_alloc_dummy_sum() to int
The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_submit_chunk().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:49 +02:00
David Sterba
d2080c7a00 btrfs: rename ret2 to ret in btrfs_submit_compressed_read()
We can now rename 'ret2' to 'ret' and use it for generic errors.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:49 +02:00
David Sterba
a83134b48a btrfs: rename ret to status in btrfs_submit_compressed_read()
We're using 'status' for the blk_status_t variables, rename 'ret' so we can
use it for generic errors.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
79cbc151f9 btrfs: simplify reading bio status in end_compressed_writeback()
We don't need to have a separate variable to read the bio status, 'ret'
works for that just fine so remove 'error'.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
9c0b0807ec btrfs: rename error to ret in btrfs_submit_chunk()
We can now rename 'error' to 'ret' and use it for generic errors.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
beaa7cdb6a btrfs: rename ret to status in btrfs_submit_chunk()
We're using 'status' for the blk_status_t variables, rename 'ret' so we
can use it for proper return type.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
64c13195dd btrfs: change return type of btrfs_bio_csum() to int
The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
a24d185c36 btrfs: change return type of btree_csum_one_bio() to int
The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_bio_csum().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
9b20d242af btrfs: change return type of btrfs_csum_one_bio() to int
The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_bio_csum().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
6f6e7e98b0 btrfs: change return type of btrfs_lookup_bio_sums() to int
The type blk_status_t is from block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_submit_chunk().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
ae8ce87165 btrfs: drop redundant local variable in raid_wait_write_end_io()
The bio status is read only once, no variable needed for that.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00
David Sterba
c0ee55f796 btrfs: merge __setup_root() to btrfs_alloc_root()
There's only one caller of __setup_root() so merge it there.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:48 +02:00