Commit graph

13633 commits

Author SHA1 Message Date
Linus Torvalds
ed8fd8d5dd for-6.13-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmeI5MAACgkQxWXV+ddt
 WDtMIg//WwV2p/gQmFMfFu7txZQQrt6m6nMuq/2y0qAA+d2e9oXFr3jNLlho0RwL
 hvSM5IEsxzL1ujh67tqxUBojvRjq+CRwiI6fxFO13dLIMrlKoiINtuOEiIhgZr/Q
 K+weXC4yHCFSB4q82dcr5i++/GVdY8wBoLHGe4zdJ0TVtM0xHrYVQ/4PXAGMX97V
 Bmv038aAMpRJTxJ6rcObloNxOb+WoyM0DIRbJ2UWVmy+bOLHMEa/jSkkQDWwcuPV
 UJjQgXz21FXiJqhtcndM58mEsKfe1WV9gG3lihedyb/xNeCPggqkDbfSzZU55Lkm
 vRXSgklbkifK6RTqrozg7e38MV+rZMwwNzqysMxNtlvrC5R+0A6of4t2Ny1vaeGG
 flcVncbLi/ip9J4fMLUpVS0fbAZjQgIJca6fbgM7Ajdzvm9rIDGpR078GW1zDlg2
 20f7isQWqSkj3i9kPElGbDT65RvcxR5EpJQfwr78+bsgSsyBGwghhAJ5F7vqZh5L
 wdmJOYxkwq5r1YxeuEQPVv404WPrwZE2Gyk6TGXkzpA+l6isCViY3NTyQPBOMH9y
 2GEelW61QSKtc5BPp1h85Aw74wtSCpGkCY+t/v03oQ6PBssWIsVcXZ7NnjJh9+72
 U+JY0H7UPyyczEzicEGlwWYBWs2/KeBuCrladNCkv3S5TWgB5yQ=
 =sF6v
 -----END PGP SIGNATURE-----

Merge tag 'for-6.13-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:

 - handle d_path() errors when canonicalizing device mapper paths during
   device scan

* tag 'for-6.13-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: add the missing error handling inside get_canonical_dev_path
2025-01-16 08:54:33 -08:00
Qu Wenruo
fe4de594f7 btrfs: add the missing error handling inside get_canonical_dev_path
Inside function get_canonical_dev_path(), we call d_path() to get the
final device path.

But d_path() can return error, and in that case the next strscpy() call
will trigger an invalid memory access.

Add back the missing error handling for d_path().

Reported-by: Boris Burkov <boris@bur.io>
Fixes: 7e06de7c83 ("btrfs: canonicalize the device path before adding it")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 21:39:52 +01:00
Linus Torvalds
643e2e259c for-6.13-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmd/7dYACgkQxWXV+ddt
 WDuX7Q//UkrNtVh7UEiyNyujLjjvczfMXhpD1fAdVU0zMon6ux3RQ3JSs3xvAGrb
 jFFa9c9+Db8/kWzdWp5n1u9Q/+sy4XBaeKGuzPRLPPGT1yXfKEa4mrm1sCrWRJoS
 c8b07Kfuepldcim80x8WSa2qhr5gmDmSZBgvjKt63ppp5/jaNKCZg+d3BhwqhHbI
 XA9JjIk9j0ZsAYauYflQTwgUpkyvXV1a9YyeKv4U6mYA1r+rXl2aolcndNkS1U/D
 dDGuiDpOjKtIUecRi4YbOkt2zvwREDdQCbRV/QLsZajHxqeHV5QH0TBI/URikx2z
 1shwYMzLfLtQIW0+PhHCGKiftMIb4NliyMUxxviPdN78nCFmocrR/ZkPx+a5M9Io
 d7oqwS/8U3pFGeB4bAey8WvMzQI5BtCCYJY+3HreNTDkiubqcRtTCtJ9dNDTAMFH
 FMZ6DA8wTsqSA2e9Q8OwKNjvMCLAKevXn/4wiJi5b75Fiu5ZB/imTfJ+geEMUZCR
 3uq9oybFCKti7lestM0z06K19AKtmPWLoq5YJ1Hg69DsafS2aR3CBeYOi7uQ+56D
 7uwAFjVrGPrxOgGkCohYpPLCUikJ0y3Nl/k5fnybsnLPWr0cenLroUeP7Rao4fFU
 8hLzMSv3ImL+Io0RjH0XBAM8YLN+xO3CLYCv6D8d42AlQTgAIVw=
 =QYC1
 -----END PGP SIGNATURE-----

Merge tag 'for-6.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more fixes.

  Besides the one-liners in Btrfs there's fix to the io_uring and
  encoded read integration (added in this development cycle). The update
  to io_uring provides more space for the ongoing command that is then
  used in Btrfs to handle some cases.

   - io_uring and encoded read:
       - provide stable storage for io_uring command data
       - make a copy of encoded read ioctl call, reuse that in case the
         call would block and will be called again

   - properly initialize zlib context for hardware compression on s390

   - fix max extent size calculation on filesystems with non-zoned
     devices

   - fix crash in scrub on crafted image due to invalid extent tree"

* tag 'for-6.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zlib: fix avail_in bytes for s390 zlib HW compression path
  btrfs: zoned: calculate max_extent_size properly on non-zoned setup
  btrfs: avoid NULL pointer dereference if no valid extent tree
  btrfs: don't read from userspace twice in btrfs_uring_encoded_read()
  io_uring: add io_uring_cmd_get_async_data helper
  io_uring/cmd: add per-op data to struct io_uring_cmd_data
  io_uring/cmd: rename struct uring_cache to io_uring_cmd_data
2025-01-09 10:16:45 -08:00
Mikhail Zaslonko
0ee4736c00 btrfs: zlib: fix avail_in bytes for s390 zlib HW compression path
Since the input data length passed to zlib_compress_folios() can be
arbitrary, always setting strm.avail_in to a multiple of PAGE_SIZE may
cause read-in bytes to exceed the input range. Currently this triggers
an assert in btrfs_compress_folios() on the debug kernel (see below).
Fix strm.avail_in calculation for S390 hardware acceleration path.

  assertion failed: *total_in <= orig_len, in fs/btrfs/compression.c:1041
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/compression.c:1041!
  monitor event: 0040 ilc:2 [#1] PREEMPT SMP
  CPU: 16 UID: 0 PID: 325 Comm: kworker/u273:3 Not tainted 6.13.0-20241204.rc1.git6.fae3b21430ca.300.fc41.s390x+debug #1
  Hardware name: IBM 3931 A01 703 (z/VM 7.4.0)
  Workqueue: btrfs-delalloc btrfs_work_helper
  Krnl PSW : 0704d00180000000 0000021761df6538 (btrfs_compress_folios+0x198/0x1a0)
             R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
  Krnl GPRS: 0000000080000000 0000000000000001 0000000000000047 0000000000000000
             0000000000000006 ffffff01757bb000 000001976232fcc0 000000000000130c
             000001976232fcd0 000001976232fcc8 00000118ff4a0e30 0000000000000001
             00000111821ab400 0000011100000000 0000021761df6534 000001976232fb58
  Krnl Code: 0000021761df6528: c020006f5ef4        larl    %r2,0000021762be2310
             0000021761df652e: c0e5ffbd09d5        brasl   %r14,00000217615978d8
            #0000021761df6534: af000000            mc      0,0
            >0000021761df6538: 0707                bcr     0,%r7
             0000021761df653a: 0707                bcr     0,%r7
             0000021761df653c: 0707                bcr     0,%r7
             0000021761df653e: 0707                bcr     0,%r7
             0000021761df6540: c004004bb7ec        brcl    0,000002176276d518
  Call Trace:
   [<0000021761df6538>] btrfs_compress_folios+0x198/0x1a0
  ([<0000021761df6534>] btrfs_compress_folios+0x194/0x1a0)
   [<0000021761d97788>] compress_file_range+0x3b8/0x6d0
   [<0000021761dcee7c>] btrfs_work_helper+0x10c/0x160
   [<0000021761645760>] process_one_work+0x2b0/0x5d0
   [<000002176164637e>] worker_thread+0x20e/0x3e0
   [<000002176165221a>] kthread+0x15a/0x170
   [<00000217615b859c>] __ret_from_fork+0x3c/0x60
   [<00000217626e72d2>] ret_from_fork+0xa/0x38
  INFO: lockdep is turned off.
  Last Breaking-Event-Address:
   [<0000021761597924>] _printk+0x4c/0x58
  Kernel panic - not syncing: Fatal exception: panic_on_oops

Fixes: fd1e75d010 ("btrfs: make compression path to be subpage compatible")
CC: stable@vger.kernel.org # 6.12+
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-06 16:32:43 +01:00
Christoph Hellwig
7467bc5959 btrfs: zoned: calculate max_extent_size properly on non-zoned setup
Since commit 559218d43e ("block: pre-calculate max_zone_append_sectors"),
queue_limits's max_zone_append_sectors is default to be 0 and it is only
updated when there is a zoned device. So, we have
lim->max_zone_append_sectors = 0 when there is no zoned device in the
filesystem.

That leads to fs_info->max_zone_append_size and thus
fs_info->max_extent_size to be 0, which is wrong and can for example
lead to a divide by zero in count_max_extents().

Fix this by only capping fs_info->max_extent_size to
fs_info->max_zone_append_size when it is non-zero.

Based on a patch from Naohiro Aota <naohiro.aota@wdc.com>, from which
much of this commit message is stolen as well.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 559218d43e ("block: pre-calculate max_zone_append_sectors")
Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-06 16:32:35 +01:00
Qu Wenruo
6aecd91a5c btrfs: avoid NULL pointer dereference if no valid extent tree
[BUG]
Syzbot reported a crash with the following call trace:

  BTRFS info (device loop0): scrub: started on devid 1
  BUG: kernel NULL pointer dereference, address: 0000000000000208
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 106e70067 P4D 106e70067 PUD 107143067 PMD 0
  Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
  CPU: 1 UID: 0 PID: 689 Comm: repro Kdump: loaded Tainted: G           O       6.13.0-rc4-custom+ #206
  Tainted: [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
  RIP: 0010:find_first_extent_item+0x26/0x1f0 [btrfs]
  Call Trace:
   <TASK>
   scrub_find_fill_first_stripe+0x13d/0x3b0 [btrfs]
   scrub_simple_mirror+0x175/0x260 [btrfs]
   scrub_stripe+0x5d4/0x6c0 [btrfs]
   scrub_chunk+0xbb/0x170 [btrfs]
   scrub_enumerate_chunks+0x2f4/0x5f0 [btrfs]
   btrfs_scrub_dev+0x240/0x600 [btrfs]
   btrfs_ioctl+0x1dc8/0x2fa0 [btrfs]
   ? do_sys_openat2+0xa5/0xf0
   __x64_sys_ioctl+0x97/0xc0
   do_syscall_64+0x4f/0x120
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
   </TASK>

[CAUSE]
The reproducer is using a corrupted image where extent tree root is
corrupted, thus forcing to use "rescue=all,ro" mount option to mount the
image.

Then it triggered a scrub, but since scrub relies on extent tree to find
where the data/metadata extents are, scrub_find_fill_first_stripe()
relies on an non-empty extent root.

But unfortunately scrub_find_fill_first_stripe() doesn't really expect
an NULL pointer for extent root, it use extent_root to grab fs_info and
triggered a NULL pointer dereference.

[FIX]
Add an extra check for a valid extent root at the beginning of
scrub_find_fill_first_stripe().

The new error path is introduced by 42437a6386 ("btrfs: introduce
mount option rescue=ignorebadroots"), but that's pretty old, and later
commit b979547513 ("btrfs: scrub: introduce helper to find and fill
sector info for a scrub_stripe") changed how we do scrub.

So for kernels older than 6.6, the fix will need manual backport.

Reported-by: syzbot+339e9dbe3a2ca419b85d@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/67756935.050a0220.25abdd.0a12.GAE@google.com/
Fixes: 42437a6386 ("btrfs: introduce mount option rescue=ignorebadroots")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-06 16:32:31 +01:00
Mark Harmstone
c21b89d495 btrfs: don't read from userspace twice in btrfs_uring_encoded_read()
If we return -EAGAIN the first time because we need to block,
btrfs_uring_encoded_read() will get called twice. Take a copy of args,
the iovs, and the iter the first time, as by the time we are called the
second time these may have gone out of scope.

Reported-by: Jens Axboe <axboe@kernel.dk>
Fixes: 34310c442e ("btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)")
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-06 13:59:29 +01:00
Linus Torvalds
c059361673 for-6.13-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmdw8AgACgkQxWXV+ddt
 WDsL4w/+Ib5WGmd2Rjn1+1X9U5dzrEb+/072UBAhwwaqOOUTlBofeyRSdYqFB0oZ
 aucRMXdXPpVe1xrXsj0WsOZmPsuZT46Eh2ALqqZP5fO1sgBkJ2WmQF0Ei7uypfb+
 abQwiEO2IaMMwt2XgDNzbpZS7oVNGEXHzoHF0R/deL4FoBDNMsbCfRnW+L9++tWU
 dUSpafLhgMMwivJN07VJYwU4ZVXsBhmKv2qI8WpJ5w9kJb1ssN692CvBOVjhuSYd
 A8IMV84dW2KO37fmPqN36QAWotz4mKpv8yrhjJvrix7nAOcXe3TXFUhaFBh1Vmzg
 G5bhkqYcNP6UHT7CIcLZE1mdv6ZAKTp0zSNCh2Uu51+MJL2tIQVjTaUQhbkYLnLN
 9DS2dXz4ksm9ISrjr2tmPe4kgyNQIrp5TCdwXu3CYs+AaU7yKeEBukZ7mXcp/e/W
 TdLKvzPRLMED8mGlFBwg2QbOvcJJ663UW2esyv6DvC61F3tXyiV2RXSC/1qF+RyZ
 FBJvvEevensQlASn1NScuQV+iEQpMo2lMURnRjSG8dGhwMmHpW3wifa2TJDyBzWS
 AH0MriQA9nsYQTkPGPnqr46/BAhFG2vEfVlX20Sk9S0PTBLu8YRy/o2evcV67J8v
 zGaa5pa7fQPbEjRv4Rthdb4R2VIFkZTOtIZSZfjHkPDjtvS7ahU=
 =NwGH
 -----END PGP SIGNATURE-----

Merge tag 'for-6.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more fixes that accumulated over the last two weeks, fixing some
  user reported problems:

   - swapfile fixes:
       - conditional reschedule in the activation loop
       - fix race with memory mapped file when activating
       - make activation loop interruptible
       - rework and fix extent sharing checks

   - folio fixes:
       - in send, recheck folio mapping after unlock
       - in relocation, recheck folio mapping after unlock

   - fix waiting for encoded read io_uring requests

   - fix transaction atomicity when enabling simple quotas

   - move COW block trace point before the block gets freed

   - print various sizes in sysfs with correct endianity"

* tag 'for-6.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: sysfs: fix direct super block member reads
  btrfs: fix transaction atomicity bug when enabling simple quotas
  btrfs: avoid monopolizing a core when activating a swap file
  btrfs: allow swap activation to be interruptible
  btrfs: fix swap file activation failure due to extents that used to be shared
  btrfs: fix race with memory mapped writes when activating swap file
  btrfs: check folio mapping after unlock in put_file_data()
  btrfs: check folio mapping after unlock in relocate_one_folio()
  btrfs: fix use-after-free when COWing tree bock and tracing is enabled
  btrfs: fix use-after-free waiting for encoded read endios
2024-12-29 09:34:34 -08:00
Qu Wenruo
fca432e73d btrfs: sysfs: fix direct super block member reads
The following sysfs entries are reading super block member directly,
which can have a different endian and cause wrong values:

- sys/fs/btrfs/<uuid>/nodesize
- sys/fs/btrfs/<uuid>/sectorsize
- sys/fs/btrfs/<uuid>/clone_alignment

Thankfully those values (nodesize and sectorsize) are always aligned
inside the btrfs_super_block, so it won't trigger unaligned read errors,
just endian problems.

Fix them by using the native cached members instead.

Fixes: df93589a17 ("btrfs: export more from FS_INFO to sysfs")
CC: stable@vger.kernel.org
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-23 22:06:44 +01:00
Julian Sun
f2363e6fcc btrfs: fix transaction atomicity bug when enabling simple quotas
Set squota incompat bit before committing the transaction that enables
the feature.

With the config CONFIG_BTRFS_ASSERT enabled, an assertion
failure occurs regarding the simple quota feature.

  [5.596534] assertion failed: btrfs_fs_incompat(fs_info, SIMPLE_QUOTA), in fs/btrfs/qgroup.c:365
  [5.597098] ------------[ cut here ]------------
  [5.597371] kernel BUG at fs/btrfs/qgroup.c:365!
  [5.597946] CPU: 1 UID: 0 PID: 268 Comm: mount Not tainted 6.13.0-rc2-00031-gf92f4749861b #146
  [5.598450] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
  [5.599008] RIP: 0010:btrfs_read_qgroup_config+0x74d/0x7a0
  [5.604303]  <TASK>
  [5.605230]  ? btrfs_read_qgroup_config+0x74d/0x7a0
  [5.605538]  ? exc_invalid_op+0x56/0x70
  [5.605775]  ? btrfs_read_qgroup_config+0x74d/0x7a0
  [5.606066]  ? asm_exc_invalid_op+0x1f/0x30
  [5.606441]  ? btrfs_read_qgroup_config+0x74d/0x7a0
  [5.606741]  ? btrfs_read_qgroup_config+0x74d/0x7a0
  [5.607038]  ? try_to_wake_up+0x317/0x760
  [5.607286]  open_ctree+0xd9c/0x1710
  [5.607509]  btrfs_get_tree+0x58a/0x7e0
  [5.608002]  vfs_get_tree+0x2e/0x100
  [5.608224]  fc_mount+0x16/0x60
  [5.608420]  btrfs_get_tree+0x2f8/0x7e0
  [5.608897]  vfs_get_tree+0x2e/0x100
  [5.609121]  path_mount+0x4c8/0xbc0
  [5.609538]  __x64_sys_mount+0x10d/0x150

The issue can be easily reproduced using the following reproducer:

  root@q:linux# cat repro.sh
  set -e

  mkfs.btrfs -q -f /dev/sdb
  mount /dev/sdb /mnt/btrfs
  btrfs quota enable -s /mnt/btrfs
  umount /mnt/btrfs
  mount /dev/sdb /mnt/btrfs

The issue is that when enabling quotas, at btrfs_quota_enable(), we set
BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE at fs_info->qgroup_flags and persist
it in the quota root in the item with the key BTRFS_QGROUP_STATUS_KEY, but
we only set the incompat bit BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA after we
commit the transaction used to enable simple quotas.

This means that if after that transaction commit we unmount the filesystem
without starting and committing any other transaction, or we have a power
failure, the next time we mount the filesystem we will find the flag
BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE set in the item with the key
BTRFS_QGROUP_STATUS_KEY but we will not find the incompat bit
BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA set in the superblock, triggering an
assertion failure at:

  btrfs_read_qgroup_config() -> qgroup_read_enable_gen()

To fix this issue, set the BTRFS_FEATURE_INCOMPAT_SIMPLE_QUOTA flag
immediately after setting the BTRFS_QGROUP_STATUS_FLAG_SIMPLE_MODE.
This ensures that both flags are flushed to disk within the same
transaction.

Fixes: 182940f4f4 ("btrfs: qgroup: add new quota mode for simple quotas")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-23 22:05:05 +01:00
Filipe Manana
2c8507c63f btrfs: avoid monopolizing a core when activating a swap file
During swap activation we iterate over the extents of a file and we can
have many thousands of them, so we can end up in a busy loop monopolizing
a core. Avoid this by doing a voluntary reschedule after processing each
extent.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-23 22:04:48 +01:00
Filipe Manana
9a45022a0e btrfs: allow swap activation to be interruptible
During swap activation we iterate over the extents of a file, then do
several checks for each extent, some of which may take some significant
time such as checking if an extent is shared. Since a file can have
many thousands of extents, this can be a very slow operation and it's
currently not interruptible. I had a bug during development of a previous
patch that resulted in an infinite loop when iterating the extents, so
a core was busy looping and I couldn't cancel the operation, which is very
annoying and requires a reboot. So make the loop interruptible by checking
for fatal signals at the end of each iteration and stopping immediately if
there is one.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-23 22:04:38 +01:00
Filipe Manana
03018e5d85 btrfs: fix swap file activation failure due to extents that used to be shared
When activating a swap file, to determine if an extent is shared we use
can_nocow_extent(), which ends up at btrfs_cross_ref_exist(). That helper
is meant to be quick because it's used in the NOCOW write path, when
flushing delalloc and when doing a direct IO write, however it does return
some false positives, meaning it may indicate that an extent is shared
even if it's no longer the case. For the write path this is fine, we just
do a unnecessary COW operation instead of doing a more rigorous check
which would be too heavy (calling btrfs_is_data_extent_shared()).

However when activating a swap file, the false positives simply result
in a failure, which is confusing for users/applications. One particular
case where this happens is when a data extent only has 1 reference but
that reference is not inlined in the extent item located in the extent
tree - this happens when we create more than 33 references for an extent
and then delete those 33 references plus every other non-inline reference
except one. The function check_committed_ref() assumes that if the size
of an extent item doesn't match the size of struct btrfs_extent_item
plus the size of an inline reference (plus an owner reference in case
simple quotas are enabled), then the extent is shared - that is not the
case however, we can have a single reference but it's not inlined - the
reason we do this is to be fast and avoid inspecting non-inline references
which may be located in another leaf of the extent tree, slowing down
write paths.

The following test script reproduces the bug:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdi
   MNT=/mnt/sdi
   NUM_CLONES=50

   umount $DEV &> /dev/null

   run_test()
   {
        local sync_after_add_reflinks=$1
        local sync_after_remove_reflinks=$2

        mkfs.btrfs -f $DEV > /dev/null
        #mkfs.xfs -f $DEV > /dev/null
        mount $DEV $MNT

        touch $MNT/foo
        chmod 0600 $MNT/foo
   	# On btrfs the file must be NOCOW.
        chattr +C $MNT/foo &> /dev/null
        xfs_io -s -c "pwrite -b 1M 0 1M" $MNT/foo
        mkswap $MNT/foo

        for ((i = 1; i <= $NUM_CLONES; i++)); do
            touch $MNT/foo_clone_$i
            chmod 0600 $MNT/foo_clone_$i
            # On btrfs the file must be NOCOW.
            chattr +C $MNT/foo_clone_$i &> /dev/null
            cp --reflink=always $MNT/foo $MNT/foo_clone_$i
        done

        if [ $sync_after_add_reflinks -ne 0 ]; then
            # Flush delayed refs and commit current transaction.
            sync -f $MNT
        fi

        # Remove the original file and all clones except the last.
        rm -f $MNT/foo
        for ((i = 1; i < $NUM_CLONES; i++)); do
            rm -f $MNT/foo_clone_$i
        done

        if [ $sync_after_remove_reflinks -ne 0 ]; then
            # Flush delayed refs and commit current transaction.
            sync -f $MNT
        fi

        # Now use the last clone as a swap file. It should work since
        # its extent are not shared anymore.
        swapon $MNT/foo_clone_${NUM_CLONES}
        swapoff $MNT/foo_clone_${NUM_CLONES}

        umount $MNT
   }

   echo -e "\nTest without sync after creating and removing clones"
   run_test 0 0

   echo -e "\nTest with sync after creating clones"
   run_test 1 0

   echo -e "\nTest with sync after removing clones"
   run_test 0 1

   echo -e "\nTest with sync after creating and removing clones"
   run_test 1 1

Running the test:

   $ ./test.sh
   Test without sync after creating and removing clones
   wrote 1048576/1048576 bytes at offset 0
   1 MiB, 1 ops; 0.0017 sec (556.793 MiB/sec and 556.7929 ops/sec)
   Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
   no label, UUID=a6b9c29e-5ef4-4689-a8ac-bc199c750f02
   swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
   swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument

   Test with sync after creating clones
   wrote 1048576/1048576 bytes at offset 0
   1 MiB, 1 ops; 0.0036 sec (271.739 MiB/sec and 271.7391 ops/sec)
   Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
   no label, UUID=5e9008d6-1f7a-4948-a1b4-3f30aba20a33
   swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
   swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument

   Test with sync after removing clones
   wrote 1048576/1048576 bytes at offset 0
   1 MiB, 1 ops; 0.0103 sec (96.665 MiB/sec and 96.6651 ops/sec)
   Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
   no label, UUID=916c2740-fa9f-4385-9f06-29c3f89e4764

   Test with sync after creating and removing clones
   wrote 1048576/1048576 bytes at offset 0
   1 MiB, 1 ops; 0.0031 sec (314.268 MiB/sec and 314.2678 ops/sec)
   Setting up swapspace version 1, size = 1020 KiB (1044480 bytes)
   no label, UUID=06aab1dd-4d90-49c0-bd9f-3a8db4e2f912
   swapon: /mnt/sdi/foo_clone_50: swapon failed: Invalid argument
   swapoff: /mnt/sdi/foo_clone_50: swapoff failed: Invalid argument

Fix this by reworking btrfs_swap_activate() to instead of using extent
maps and checking for shared extents with can_nocow_extent(), iterate
over the inode's file extent items and use the accurate
btrfs_is_data_extent_shared().

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-23 22:04:17 +01:00
Filipe Manana
0525064bb8 btrfs: fix race with memory mapped writes when activating swap file
When activating the swap file we flush all delalloc and wait for ordered
extent completion, so that we don't miss any delalloc and extents before
we check that the file's extent layout is usable for a swap file and
activate the swap file. We are called with the inode's VFS lock acquired,
so we won't race with buffered and direct IO writes, however we can still
race with memory mapped writes since they don't acquire the inode's VFS
lock. The race window is between flushing all delalloc and locking the
whole file's extent range, since memory mapped writes lock an extent range
with the length of a page.

Fix this by acquiring the inode's mmap lock before we flush delalloc.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-23 22:03:43 +01:00
Boris Burkov
0fba7be1ca btrfs: check folio mapping after unlock in put_file_data()
When we call btrfs_read_folio() we get an unlocked folio, so it is possible
for a different thread to concurrently modify folio->mapping. We must
check that this hasn't happened once we do have the lock.

CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-23 22:00:07 +01:00
Boris Burkov
3e74859ee3 btrfs: check folio mapping after unlock in relocate_one_folio()
When we call btrfs_read_folio() to bring a folio uptodate, we unlock the
folio. The result of that is that a different thread can modify the
mapping (like remove it with invalidate) before we call folio_lock().
This results in an invalid page and we need to try again.

In particular, if we are relocating concurrently with aborting a
transaction, this can result in a crash like the following:

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP
  CPU: 76 PID: 1411631 Comm: kworker/u322:5
  Workqueue: events_unbound btrfs_reclaim_bgs_work
  RIP: 0010:set_page_extent_mapped+0x20/0xb0
  RSP: 0018:ffffc900516a7be8 EFLAGS: 00010246
  RAX: ffffea009e851d08 RBX: ffffea009e0b1880 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: ffffc900516a7b90 RDI: ffffea009e0b1880
  RBP: 0000000003573000 R08: 0000000000000001 R09: ffff88c07fd2f3f0
  R10: 0000000000000000 R11: 0000194754b575be R12: 0000000003572000
  R13: 0000000003572fff R14: 0000000000100cca R15: 0000000005582fff
  FS:  0000000000000000(0000) GS:ffff88c07fd00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 000000407d00f002 CR4: 00000000007706f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  PKRU: 55555554
  Call Trace:
  <TASK>
  ? __die+0x78/0xc0
  ? page_fault_oops+0x2a8/0x3a0
  ? __switch_to+0x133/0x530
  ? wq_worker_running+0xa/0x40
  ? exc_page_fault+0x63/0x130
  ? asm_exc_page_fault+0x22/0x30
  ? set_page_extent_mapped+0x20/0xb0
  relocate_file_extent_cluster+0x1a7/0x940
  relocate_data_extent+0xaf/0x120
  relocate_block_group+0x20f/0x480
  btrfs_relocate_block_group+0x152/0x320
  btrfs_relocate_chunk+0x3d/0x120
  btrfs_reclaim_bgs_work+0x2ae/0x4e0
  process_scheduled_works+0x184/0x370
  worker_thread+0xc6/0x3e0
  ? blk_add_timer+0xb0/0xb0
  kthread+0xae/0xe0
  ? flush_tlb_kernel_range+0x90/0x90
  ret_from_fork+0x2f/0x40
  ? flush_tlb_kernel_range+0x90/0x90
  ret_from_fork_asm+0x11/0x20
  </TASK>

This occurs because cleanup_one_transaction() calls
destroy_delalloc_inodes() which calls invalidate_inode_pages2() which
takes the folio_lock before setting mapping to NULL. We fail to check
this, and subsequently call set_extent_mapping(), which assumes that
mapping != NULL (in fact it asserts that in debug mode)

Note that the "fixes" patch here is not the one that introduced the
race (the very first iteration of this code from 2009) but a more recent
change that made this particular crash happen in practice.

Fixes: e7f1326cc2 ("btrfs: set page extent mapped after read_folio in relocate_one_page")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-23 22:00:07 +01:00
Filipe Manana
44f52bbe96 btrfs: fix use-after-free when COWing tree bock and tracing is enabled
When a COWing a tree block, at btrfs_cow_block(), and we have the
tracepoint trace_btrfs_cow_block() enabled and preemption is also enabled
(CONFIG_PREEMPT=y), we can trigger a use-after-free in the COWed extent
buffer while inside the tracepoint code. This is because in some paths
that call btrfs_cow_block(), such as btrfs_search_slot(), we are holding
the last reference on the extent buffer @buf so btrfs_force_cow_block()
drops the last reference on the @buf extent buffer when it calls
free_extent_buffer_stale(buf), which schedules the release of the extent
buffer with RCU. This means that if we are on a kernel with preemption,
the current task may be preempted before calling trace_btrfs_cow_block()
and the extent buffer already released by the time trace_btrfs_cow_block()
is called, resulting in a use-after-free.

Fix this by moving the trace_btrfs_cow_block() from btrfs_cow_block() to
btrfs_force_cow_block() before the COWed extent buffer is freed.
This also has a side effect of invoking the tracepoint in the tree defrag
code, at defrag.c:btrfs_realloc_node(), since btrfs_force_cow_block() is
called there, but this is fine and it was actually missing there.

Reported-by: syzbot+8517da8635307182c8a5@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6759a9b9.050a0220.1ac542.000d.GAE@google.com/
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-23 21:59:32 +01:00
Johannes Thumshirn
d29662695e btrfs: fix use-after-free waiting for encoded read endios
Fix a use-after-free in the I/O completion path for encoded reads by
using a completion instead of a wait_queue for synchronizing the
destruction of 'struct btrfs_encoded_read_private'.

Fixes: 1881fba89b ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-23 21:55:06 +01:00
Linus Torvalds
eabcdba3ad for-6.13-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmdhyQAACgkQxWXV+ddt
 WDuveg//bJSuXHrA7jkijst8rdoAFrceiUXuQPZ6bqb9QrSqlDZlP5/XQpdXZ3yU
 qJh/aE13cy0zWTQ2+fMcc770WSvU1cRW/f5BZ+fdXgvO8lS516suXGYd2Q06Cl9/
 DriAKGKtRfJn1BrEEv8+fjKS/chxZg6IR/W4kN6AinW31myY9jE5mEDAn+vyTDgQ
 8USZ/ar/3KuWo+wO5h5JzrvGnhzK0W0HRs/A0NZ3gG8J5T4yj+8zG0VJR4Gf93AL
 iBlsnAR8VzAYJOZCi36SD3j3/eDxJio5GhDYsdt28tk1bL8FqSuI4Yxt+LuiZ2Fg
 Cq/31lELEkyEH8AoVFm9pX3HNyRmV6JhpvDXiyofHaOUZ3VeivVE59gOShLUUMkn
 f9Pl/uh5/t/ioWWHBnCMyRpI9GZUGCvW24k7HjT7QZhsDGFLTm07diCiRgZ7eaOu
 LZRKMOL5jifAnfxNSvIJV19H4lQLTZfbdjmJyb6Il39tIU/1U9pXicgih3iyidW2
 N5n4pHf3OQFwG8kNw1mR1g1CPBALP62ja8kMv//IgH4YXXnm1Mo7B3CcJogAAmo4
 HB9f/gFqZ8kWaiuIUJKfPZkkLFt5x0TNZQyyOhVUd7V4mFdtEzVtZRWo3juYuLGk
 7Shp/MTlYokwnEropiWHU5ab3Bb9vLxlh8daGK/OmwBz01DaApI=
 =AAmb
 -----END PGP SIGNATURE-----

Merge tag 'for-6.13-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - tree-checker catches invalid number of inline extent references

 - zoned mode fixes:
    - enhance zone append IO command so it also detects emulated writes
    - handle bio splitting at sectorsize boundary

 - when deleting a snapshot, fix a condition for visiting nodes in reloc
   trees

* tag 'for-6.13-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: tree-checker: reject inline extent items with 0 ref count
  btrfs: split bios to the fs sector size boundary
  btrfs: use bio_is_zone_append() in the completion handler
  btrfs: fix improper generation check in snapshot delete
2024-12-18 14:17:21 -08:00
Qu Wenruo
dfb92681a1 btrfs: tree-checker: reject inline extent items with 0 ref count
[BUG]
There is a bug report in the mailing list where btrfs_run_delayed_refs()
failed to drop the ref count for logical 25870311358464 num_bytes
2113536.

The involved leaf dump looks like this:

  item 166 key (25870311358464 168 2113536) itemoff 10091 itemsize 50
    extent refs 1 gen 84178 flags 1
    ref#0: shared data backref parent 32399126528000 count 0 <<<
    ref#1: shared data backref parent 31808973717504 count 1

Notice the count number is 0.

[CAUSE]
There is no concrete evidence yet, but considering 0 -> 1 is also a
single bit flipped, it's possible that hardware memory bitflip is
involved, causing the on-disk extent tree to be corrupted.

[FIX]
To prevent us reading such corrupted extent item, or writing such
damaged extent item back to disk, enhance the handling of
BTRFS_EXTENT_DATA_REF_KEY and BTRFS_SHARED_DATA_REF_KEY keys for both
inlined and key items, to detect such 0 ref count and reject them.

CC: stable@vger.kernel.org # 5.4+
Link: https://lore.kernel.org/linux-btrfs/7c69dd49-c346-4806-86e7-e6f863a66f48@app.fastmail.com/
Reported-by: Frankie Fisher <frankie@terrorise.me.uk>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-17 19:54:32 +01:00
Christoph Hellwig
be691b5e59 btrfs: split bios to the fs sector size boundary
Btrfs like other file systems can't really deal with I/O not aligned to
it's internal block size (which strangely is called sector size in
btrfs, for historical reasons), but the block layer split helper doesn't
even know about that.

Round down the split boundary so that all I/Os are aligned.

Fixes: d5e4377d50 ("btrfs: split zone append bios in btrfs_submit_bio")
CC: stable@vger.kernel.org # 6.12
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-17 19:54:32 +01:00
Christoph Hellwig
6c3864e055 btrfs: use bio_is_zone_append() in the completion handler
Otherwise it won't catch bios turned into regular writes by the block
level zone write plugging. The additional test it adds is for emulated
zone append.

Fixes: 9b1ce7f0c6 ("block: Implement zone append emulation")
CC: stable@vger.kernel.org # 6.12
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-17 19:54:32 +01:00
Josef Bacik
d75d72a858 btrfs: fix improper generation check in snapshot delete
We have been using the following check

   if (generation <= root->root_key.offset)

to make decisions about whether or not to visit a node during snapshot
delete.  This is because for normal subvolumes this is set to 0, and for
snapshots it's set to the creation generation.  The idea being that if
the generation of the node is less than or equal to our creation
generation then we don't need to visit that node, because it doesn't
belong to us, we can simply drop our reference and move on.

However reloc roots don't have their generation stored in
root->root_key.offset, instead that is the objectid of their
corresponding fs root.  This means we can incorrectly not walk into
nodes that need to be dropped when deleting a reloc root.

There are a variety of consequences to making the wrong choice in two
distinct areas.

visit_node_for_delete()

1. False positive.  We think we are newer than the block when we really
   aren't.  We don't visit the node and drop our reference to the node
   and carry on.  This would result in leaked space.
2. False negative.  We do decide to walk down into a block that we
   should have just dropped our reference to.  However this means that
   the child node will have refs > 1, so we will switch to
   UPDATE_BACKREF, and then the subsequent walk_down_proc() will notice
   that btrfs_header_owner(node) != root->root_key.objectid and it'll
   break out of the loop, and then walk_up_proc() will drop our reference,
   so this appears to be ok.

do_walk_down()

1. False positive.  We are in UPDATE_BACKREF and incorrectly decide that
   we are done and don't need to update the backref for our lower nodes.
   This is another case that simply won't happen with relocation, as we
   only have to do UPDATE_BACKREF if the node below us was shared and
   didn't have FULL_BACKREF set, and since we don't own that node
   because we're a reloc root we actually won't end up in this case.
2. False negative.  Again this is tricky because as described above, we
   simply wouldn't be here from relocation, because we don't own any of
   the nodes because we never set btrfs_header_owner() to the reloc root
   objectid, and we always use FULL_BACKREF, we never actually need to
   set FULL_BACKREF on any children.

Having spent a lot of time stressing relocation/snapshot delete recently
I've not seen this pop in practice.  But this is objectively incorrect,
so fix this to get the correct starting generation based on the root
we're dropping to keep me from thinking there's a problem here.

CC: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-17 19:54:32 +01:00
Linus Torvalds
5a087a6b17 for-6.13-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmdYzmoACgkQxWXV+ddt
 WDv5GxAAnCsGctNax89x/VpCDZynRghrkxlzu/4kG/pqxsJyzlgXDFtzHAEewSMs
 MYL+WCZLYpeKB5FpZq98mDJVLGNMG+9wqkx1bH/xy2ajBGZTeQe5pnkXMNlv9U1O
 SX34t8nzOdTCENDnQeRc5I2vTcsQRhgHoVjJkAYdWdhcD9fs6xHKZRe+himlstSn
 46ioKzEKSR3ztEUW4ycPF379g7d4kTR0hkk3pu5Nxe7ER8iq+jNSWXj0mzKg7mpJ
 KxP56VgY0OrsiUcJr2qFZ1hQIp810puaAuM4C1lLgRplECHxtLbP9JvL9Rr7a8Ox
 68tuThyLEpQtR59078jIX3RK6CwVi15rKb/ZkLZkW19TNSAAfM5qrB146hLBUM4T
 16WaiJ0x9lVkH2oYQv8zbNZiqDxPhPUdS/JArNAcQYk9ma+C1hCsxPQ/N5yoWH/C
 OABJddNR83sm4VTXu3Nci1EB8QuEoOuihYO6CdRkJ3PPNDuQiG6gwnoA2zqSihhy
 L5fQaLSWAUsLczarHZrvAi9Y0rfG66QzqGR+A1K/8qMTQ8pSCupd+LfqVa21QpI1
 Awx/wVFzsAm7z9CrnPTRJe+JSlBDQdeXWX7pDhhkXgwbCsMVSf3dbBweCD3o1EiM
 BVI7SfEgImlbatd0QvDp9FcsnEqp90SCi+99U+zZCmQ1SW8CEC0=
 =+DUB
 -----END PGP SIGNATURE-----

Merge tag 'for-6.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more fixes. Apart from the one liners and updated bio splitting
  error handling there's a fix for subvolume mount with different flags.
  This was known and fixed for some time but I've delayed it to give it
  more testing.

   - fix unbalanced locking when swapfile activation fails when the
     subvolume gets deleted in the meantime

   - add btrfs error handling after bio_split() calls that got error
     handling recently

   - during unmount, flush delalloc workers at the right time before the
     cleaner thread is shut down

   - fix regression in buffered write folio conversion, explicitly wait
     for writeback as FGP_STABLE flag is currently a no-op on btrfs

   - handle race in subvolume mount with different flags, the conversion
     to the new mount API did not handle the case where multiple
     subvolumes get mounted in parallel, which is a distro use case"

* tag 'for-6.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: flush delalloc workers queue before stopping cleaner kthread during unmount
  btrfs: handle bio_split() errors
  btrfs: properly wait for writeback before buffered write
  btrfs: fix missing snapshot drew unlock when root is dead during swap activation
  btrfs: fix mount failure due to remount races
2024-12-10 18:18:01 -08:00
Filipe Manana
f10bef73fb btrfs: flush delalloc workers queue before stopping cleaner kthread during unmount
During the unmount path, at close_ctree(), we first stop the cleaner
kthread, using kthread_stop() which frees the associated task_struct, and
then stop and destroy all the work queues. However after we stopped the
cleaner we may still have a worker from the delalloc_workers queue running
inode.c:submit_compressed_extents(), which calls btrfs_add_delayed_iput(),
which in turn tries to wake up the cleaner kthread - which was already
destroyed before, resulting in a use-after-free on the task_struct.

Syzbot reported this with the following stack traces:

  BUG: KASAN: slab-use-after-free in __lock_acquire+0x78/0x2100 kernel/locking/lockdep.c:5089
  Read of size 8 at addr ffff8880259d2818 by task kworker/u8:3/52

  CPU: 1 UID: 0 PID: 52 Comm: kworker/u8:3 Not tainted 6.13.0-rc1-syzkaller-00002-gcdd30ebb1b9f #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
  Workqueue: btrfs-delalloc btrfs_work_helper
  Call Trace:
   <TASK>
   __dump_stack lib/dump_stack.c:94 [inline]
   dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
   print_address_description mm/kasan/report.c:378 [inline]
   print_report+0x169/0x550 mm/kasan/report.c:489
   kasan_report+0x143/0x180 mm/kasan/report.c:602
   __lock_acquire+0x78/0x2100 kernel/locking/lockdep.c:5089
   lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
   __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
   _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
   class_raw_spinlock_irqsave_constructor include/linux/spinlock.h:551 [inline]
   try_to_wake_up+0xc2/0x1470 kernel/sched/core.c:4205
   submit_compressed_extents+0xdf/0x16e0 fs/btrfs/inode.c:1615
   run_ordered_work fs/btrfs/async-thread.c:288 [inline]
   btrfs_work_helper+0x96f/0xc40 fs/btrfs/async-thread.c:324
   process_one_work kernel/workqueue.c:3229 [inline]
   process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3310
   worker_thread+0x870/0xd30 kernel/workqueue.c:3391
   kthread+0x2f0/0x390 kernel/kthread.c:389
   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
   </TASK>

  Allocated by task 2:
   kasan_save_stack mm/kasan/common.c:47 [inline]
   kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
   unpoison_slab_object mm/kasan/common.c:319 [inline]
   __kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:345
   kasan_slab_alloc include/linux/kasan.h:250 [inline]
   slab_post_alloc_hook mm/slub.c:4104 [inline]
   slab_alloc_node mm/slub.c:4153 [inline]
   kmem_cache_alloc_node_noprof+0x1d9/0x380 mm/slub.c:4205
   alloc_task_struct_node kernel/fork.c:180 [inline]
   dup_task_struct+0x57/0x8c0 kernel/fork.c:1113
   copy_process+0x5d1/0x3d50 kernel/fork.c:2225
   kernel_clone+0x223/0x870 kernel/fork.c:2807
   kernel_thread+0x1bc/0x240 kernel/fork.c:2869
   create_kthread kernel/kthread.c:412 [inline]
   kthreadd+0x60d/0x810 kernel/kthread.c:767
   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

  Freed by task 24:
   kasan_save_stack mm/kasan/common.c:47 [inline]
   kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
   kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:582
   poison_slab_object mm/kasan/common.c:247 [inline]
   __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
   kasan_slab_free include/linux/kasan.h:233 [inline]
   slab_free_hook mm/slub.c:2338 [inline]
   slab_free mm/slub.c:4598 [inline]
   kmem_cache_free+0x195/0x410 mm/slub.c:4700
   put_task_struct include/linux/sched/task.h:144 [inline]
   delayed_put_task_struct+0x125/0x300 kernel/exit.c:227
   rcu_do_batch kernel/rcu/tree.c:2567 [inline]
   rcu_core+0xaaa/0x17a0 kernel/rcu/tree.c:2823
   handle_softirqs+0x2d4/0x9b0 kernel/softirq.c:554
   run_ksoftirqd+0xca/0x130 kernel/softirq.c:943
   smpboot_thread_fn+0x544/0xa30 kernel/smpboot.c:164
   kthread+0x2f0/0x390 kernel/kthread.c:389
   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

  Last potentially related work creation:
   kasan_save_stack+0x3f/0x60 mm/kasan/common.c:47
   __kasan_record_aux_stack+0xac/0xc0 mm/kasan/generic.c:544
   __call_rcu_common kernel/rcu/tree.c:3086 [inline]
   call_rcu+0x167/0xa70 kernel/rcu/tree.c:3190
   context_switch kernel/sched/core.c:5372 [inline]
   __schedule+0x1803/0x4be0 kernel/sched/core.c:6756
   __schedule_loop kernel/sched/core.c:6833 [inline]
   schedule+0x14b/0x320 kernel/sched/core.c:6848
   schedule_timeout+0xb0/0x290 kernel/time/sleep_timeout.c:75
   do_wait_for_common kernel/sched/completion.c:95 [inline]
   __wait_for_common kernel/sched/completion.c:116 [inline]
   wait_for_common kernel/sched/completion.c:127 [inline]
   wait_for_completion+0x355/0x620 kernel/sched/completion.c:148
   kthread_stop+0x19e/0x640 kernel/kthread.c:712
   close_ctree+0x524/0xd60 fs/btrfs/disk-io.c:4328
   generic_shutdown_super+0x139/0x2d0 fs/super.c:642
   kill_anon_super+0x3b/0x70 fs/super.c:1237
   btrfs_kill_super+0x41/0x50 fs/btrfs/super.c:2112
   deactivate_locked_super+0xc4/0x130 fs/super.c:473
   cleanup_mnt+0x41f/0x4b0 fs/namespace.c:1373
   task_work_run+0x24f/0x310 kernel/task_work.c:239
   ptrace_notify+0x2d2/0x380 kernel/signal.c:2503
   ptrace_report_syscall include/linux/ptrace.h:415 [inline]
   ptrace_report_syscall_exit include/linux/ptrace.h:477 [inline]
   syscall_exit_work+0xc7/0x1d0 kernel/entry/common.c:173
   syscall_exit_to_user_mode_prepare kernel/entry/common.c:200 [inline]
   __syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
   syscall_exit_to_user_mode+0x24a/0x340 kernel/entry/common.c:218
   do_syscall_64+0x100/0x230 arch/x86/entry/common.c:89
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

  The buggy address belongs to the object at ffff8880259d1e00
   which belongs to the cache task_struct of size 7424
  The buggy address is located 2584 bytes inside of
   freed 7424-byte region [ffff8880259d1e00, ffff8880259d3b00)

  The buggy address belongs to the physical page:
  page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x259d0
  head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
  memcg:ffff88802f4b56c1
  flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
  page_type: f5(slab)
  raw: 00fff00000000040 ffff88801bafe500 dead000000000100 dead000000000122
  raw: 0000000000000000 0000000000040004 00000001f5000000 ffff88802f4b56c1
  head: 00fff00000000040 ffff88801bafe500 dead000000000100 dead000000000122
  head: 0000000000000000 0000000000040004 00000001f5000000 ffff88802f4b56c1
  head: 00fff00000000003 ffffea0000967401 ffffffffffffffff 0000000000000000
  head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: kasan: bad access detected
  page_owner tracks the page as allocated
  page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 12, tgid 12 (kworker/u8:1), ts 7328037942, free_ts 0
   set_page_owner include/linux/page_owner.h:32 [inline]
   post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1556
   prep_new_page mm/page_alloc.c:1564 [inline]
   get_page_from_freelist+0x3651/0x37a0 mm/page_alloc.c:3474
   __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4751
   alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265
   alloc_slab_page+0x6a/0x140 mm/slub.c:2408
   allocate_slab+0x5a/0x2f0 mm/slub.c:2574
   new_slab mm/slub.c:2627 [inline]
   ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3815
   __slab_alloc+0x58/0xa0 mm/slub.c:3905
   __slab_alloc_node mm/slub.c:3980 [inline]
   slab_alloc_node mm/slub.c:4141 [inline]
   kmem_cache_alloc_node_noprof+0x269/0x380 mm/slub.c:4205
   alloc_task_struct_node kernel/fork.c:180 [inline]
   dup_task_struct+0x57/0x8c0 kernel/fork.c:1113
   copy_process+0x5d1/0x3d50 kernel/fork.c:2225
   kernel_clone+0x223/0x870 kernel/fork.c:2807
   user_mode_thread+0x132/0x1a0 kernel/fork.c:2885
   call_usermodehelper_exec_work+0x5c/0x230 kernel/umh.c:171
   process_one_work kernel/workqueue.c:3229 [inline]
   process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3310
   worker_thread+0x870/0xd30 kernel/workqueue.c:3391
  page_owner free stack trace missing

  Memory state around the buggy address:
   ffff8880259d2700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880259d2780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  >ffff8880259d2800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                              ^
   ffff8880259d2880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880259d2900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ==================================================================

Fix this by flushing the delalloc workers queue before stopping the
cleaner kthread.

Reported-by: syzbot+b7cf50a0c173770dcb14@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/674ed7e8.050a0220.48a03.0031.GAE@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-06 15:04:18 +01:00
Johannes Thumshirn
c7c97ceff9 btrfs: handle bio_split() errors
Commit e546fe1da9 ("block: Rework bio_split() return value") changed
bio_split() so that it can return errors.

Add error handling for it in btrfs_split_bio() and ultimately
btrfs_submit_chunk(). As the bio is not submitted, the bio counter must
be decremented to pair btrfs_bio_counter_inc_blocked().

Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-06 15:04:13 +01:00
Qu Wenruo
c83d77eb0f btrfs: properly wait for writeback before buffered write
[BUG]
Before commit e820dbeb6a ("btrfs: convert btrfs_buffered_write() to
use folios"), function prepare_one_folio() will always wait for folio
writeback to finish before returning the folio.

However commit e820dbeb6a ("btrfs: convert btrfs_buffered_write() to
use folios") changed to use FGP_STABLE to do the writeback wait, but
FGP_STABLE is calling folio_wait_stable(), which only calls
folio_wait_writeback() if the address space has AS_STABLE_WRITES, which
is not set for btrfs inodes.

This means we will not wait for the folio writeback at all.

[CAUSE]
The cause is FGP_STABLE is not waiting for writeback unconditionally, but
only for address spaces with AS_STABLE_WRITES, normally such flag is set
when the super block has SB_I_STABLE_WRITES flag.

Such super block flag is set when the block device has hardware digest
support or has internal checksum requirement.

I'd argue btrfs should set such super block due to its default data
checksum behavior, but it is not set yet, so this means FGP_STABLE flag
will have no effect at all.

(For NODATASUM inodes, we can skip the waiting in theory but that should
be an optimization in the future.)

This can lead to data checksum mismatch, as we can modify the folio
while it's still under writeback, this will make the contents differ
from the contents at submission and checksum calculation.

[FIX]
Instead of fully relying on FGP_STABLE, manually do the folio writeback
waiting, until we set the address space or super flag.

Fixes: e820dbeb6a ("btrfs: convert btrfs_buffered_write() to use folios")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-06 15:04:07 +01:00
Filipe Manana
9c803c474c btrfs: fix missing snapshot drew unlock when root is dead during swap activation
When activating a swap file we acquire the root's snapshot drew lock and
then check if the root is dead, failing and returning with -EPERM if it's
dead but without unlocking the root's snapshot lock. Fix this by adding
the missing unlock.

Fixes: 60021bd754 ("btrfs: prevent subvol with swapfile from being deleted")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-03 20:27:02 +01:00
Qu Wenruo
951a3f59d2 btrfs: fix mount failure due to remount races
[BUG]
The following reproducer can cause btrfs mount to fail:

  dev="/dev/test/scratch1"
  mnt1="/mnt/test"
  mnt2="/mnt/scratch"

  mkfs.btrfs -f $dev
  mount $dev $mnt1
  btrfs subvolume create $mnt1/subvol1
  btrfs subvolume create $mnt1/subvol2
  umount $mnt1

  mount $dev $mnt1 -o subvol=subvol1
  while mount -o remount,ro $mnt1; do mount -o remount,rw $mnt1; done &
  bg=$!

  while mount $dev $mnt2 -o subvol=subvol2; do umount $mnt2; done

  kill $bg
  wait
  umount -R $mnt1
  umount -R $mnt2

The script will fail with the following error:

  mount: /mnt/scratch: /dev/mapper/test-scratch1 already mounted on /mnt/test.
        dmesg(1) may have more information after failed mount system call.
  umount: /mnt/test: target is busy.
  umount: /mnt/scratch/: not mounted

And there is no kernel error message.

[CAUSE]
During the btrfs mount, to support mounting different subvolumes with
different RO/RW flags, we need to detect that and retry if needed:

  Retry with matching RO flags if the initial mount fail with -EBUSY.

The problem is, during that retry we do not hold any super block lock
(s_umount), this means there can be a remount process changing the RO
flags of the original fs super block.

If so, we can have an EBUSY error during retry.  And this time we treat
any failure as an error, without any retry and cause the above EBUSY
mount failure.

[FIX]
The current retry behavior is racy because we do not have a super block
thus no way to hold s_umount to prevent the race with remount.

Solve the root problem by allowing fc->sb_flags to mismatch from the
sb->s_flags at btrfs_get_tree_super().

Then at the re-entry point btrfs_get_tree_subvol(), manually check the
fc->s_flags against sb->s_flags, if it's a RO->RW mismatch, then
reconfigure with s_umount lock hold.

Reported-by: Enno Gotthold <egotthold@suse.com>
Reported-by: Fabian Vogt <fvogt@suse.com>
[ Special thanks for the reproducer and early analysis pointing to btrfs. ]
Fixes: f044b31867 ("btrfs: handle the ro->rw transition for mounting different subvolumes")
Link: https://bugzilla.suse.com/show_bug.cgi?id=1231836
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-03 20:26:49 +01:00
Linus Torvalds
feffde684a for-6.13-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmdPMbMACgkQxWXV+ddt
 WDtFSA/+Kd61BPKaiZFF0yjOsjBOlu8WMHFLO5xa2ZRqV3HUzm6rdO4wUSq3Eyqg
 lrFszPfLA0REZIEdY2rqDKJduk1MZZg6NY7Pvn+/ByGOorM0Ym1BJoydtlUN5o2Y
 AWeaDxs6LBMwRqpai0+AkikufK/jK7QfhHci+Oo2XmOv1C39DQbkXO76SYh/yERt
 CcZNaSjm9DUlLxOOpkefxYvpPW4Uv4NSBfh9aymX/u1VxXqeuMuzPZqwZO+nwl/p
 M1yr9fcbrqh3yKC+JjhD7xmOJM3x4c2PmzSkQTdepOuAlQvuQ/iFD+Zc1YjY0XlT
 Fl938rdTKULgCarR5rdsXqdlFRnOprlgt0J1Pdf+GipTVU0EY3WU343HCz6h8pmG
 F/NPvlCahkEtk1UCguL92NOBeAf0adWhuYfKjkuxYuL5ZTvzsOl2ymF6Vlja0Y39
 VK4exjG4ilDESxweinJe53k3QLDwvUc2h0D291QVoo2X06dhXda8oOVK+6lR5mij
 zDtXqurjlCybT1W7op6RcWMY0TaA8IR3Bo9oU1YbVcctuc86/X8SERv1G8MQtXhh
 tevOVb/cKy7gEw9q73OtrH/J2EAklnsesnpH2sdt7WUEzOQPw9BIwEa/NRbjvoqn
 n2ts5KktBhNjM1s0elHBf7o4+OaTrTlTrp0HelXyONZucPrzDQ4=
 =4mFO
 -----END PGP SIGNATURE-----

Merge tag 'for-6.13-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - add lockdep annotations for io_uring/encoded read integration, inode
   lock is held when returning to userspace

 - properly reflect experimental config option to sysfs

 - handle NULL root in case the rescue mode accepts invalid/damaged tree
   roots (rescue=ibadroot)

 - regression fix of a deadlock between transaction and extent locks

 - fix pending bio accounting bug in encoded read ioctl

 - fix NOWAIT mode when checking references for NOCOW files

 - fix use-after-free in a rb-tree cleanup in ref-verify debugging tool

* tag 'for-6.13-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix lockdep warnings on io_uring encoded reads
  btrfs: ref-verify: fix use-after-free after invalid ref action
  btrfs: add a sanity check for btrfs root in btrfs_search_slot()
  btrfs: don't loop for nowait writes when checking for cross references
  btrfs: sysfs: advertise experimental features only if CONFIG_BTRFS_EXPERIMENTAL=y
  btrfs: fix deadlock between transaction commits and extent locks
  btrfs: fix use-after-free in btrfs_encoded_read_endio()
2024-12-03 11:02:17 -08:00
Mark Harmstone
22d2e48e31 btrfs: fix lockdep warnings on io_uring encoded reads
Lockdep doesn't like the fact that btrfs_uring_read_extent() returns to
userspace still holding the inode lock, even though we release it once
the I/O finishes. Add calls to rwsem_release() and rwsem_acquire_read() to
work round this.

Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
34310c442e ("btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)")
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-29 16:56:38 +01:00
Filipe Manana
7c4e39f9d2 btrfs: ref-verify: fix use-after-free after invalid ref action
At btrfs_ref_tree_mod() after we successfully inserted the new ref entry
(local variable 'ref') into the respective block entry's rbtree (local
variable 'be'), if we find an unexpected action of BTRFS_DROP_DELAYED_REF,
we error out and free the ref entry without removing it from the block
entry's rbtree. Then in the error path of btrfs_ref_tree_mod() we call
btrfs_free_ref_cache(), which iterates over all block entries and then
calls free_block_entry() for each one, and there we will trigger a
use-after-free when we are called against the block entry to which we
added the freed ref entry to its rbtree, since the rbtree still points
to the block entry, as we didn't remove it from the rbtree before freeing
it in the error path at btrfs_ref_tree_mod(). Fix this by removing the
new ref entry from the rbtree before freeing it.

Syzbot report this with the following stack traces:

   BTRFS error (device loop0 state EA):   Ref action 2, root 5, ref_root 0, parent 8564736, owner 0, offset 0, num_refs 18446744073709551615
      __btrfs_mod_ref+0x7dd/0xac0 fs/btrfs/extent-tree.c:2523
      update_ref_for_cow+0x9cd/0x11f0 fs/btrfs/ctree.c:512
      btrfs_force_cow_block+0x9f6/0x1da0 fs/btrfs/ctree.c:594
      btrfs_cow_block+0x35e/0xa40 fs/btrfs/ctree.c:754
      btrfs_search_slot+0xbdd/0x30d0 fs/btrfs/ctree.c:2116
      btrfs_insert_empty_items+0x9c/0x1a0 fs/btrfs/ctree.c:4314
      btrfs_insert_empty_item fs/btrfs/ctree.h:669 [inline]
      btrfs_insert_orphan_item+0x1f1/0x320 fs/btrfs/orphan.c:23
      btrfs_orphan_add+0x6d/0x1a0 fs/btrfs/inode.c:3482
      btrfs_unlink+0x267/0x350 fs/btrfs/inode.c:4293
      vfs_unlink+0x365/0x650 fs/namei.c:4469
      do_unlinkat+0x4ae/0x830 fs/namei.c:4533
      __do_sys_unlinkat fs/namei.c:4576 [inline]
      __se_sys_unlinkat fs/namei.c:4569 [inline]
      __x64_sys_unlinkat+0xcc/0xf0 fs/namei.c:4569
      do_syscall_x64 arch/x86/entry/common.c:52 [inline]
      do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
      entry_SYSCALL_64_after_hwframe+0x77/0x7f
   BTRFS error (device loop0 state EA):   Ref action 1, root 5, ref_root 5, parent 0, owner 260, offset 0, num_refs 1
      __btrfs_mod_ref+0x76b/0xac0 fs/btrfs/extent-tree.c:2521
      update_ref_for_cow+0x96a/0x11f0
      btrfs_force_cow_block+0x9f6/0x1da0 fs/btrfs/ctree.c:594
      btrfs_cow_block+0x35e/0xa40 fs/btrfs/ctree.c:754
      btrfs_search_slot+0xbdd/0x30d0 fs/btrfs/ctree.c:2116
      btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:411
      __btrfs_update_delayed_inode+0x1e7/0xb90 fs/btrfs/delayed-inode.c:1030
      btrfs_update_delayed_inode fs/btrfs/delayed-inode.c:1114 [inline]
      __btrfs_commit_inode_delayed_items+0x2318/0x24a0 fs/btrfs/delayed-inode.c:1137
      __btrfs_run_delayed_items+0x213/0x490 fs/btrfs/delayed-inode.c:1171
      btrfs_commit_transaction+0x8a8/0x3740 fs/btrfs/transaction.c:2313
      prepare_to_relocate+0x3c4/0x4c0 fs/btrfs/relocation.c:3586
      relocate_block_group+0x16c/0xd40 fs/btrfs/relocation.c:3611
      btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4081
      btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3377
      __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4161
      btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4538
   BTRFS error (device loop0 state EA):   Ref action 2, root 5, ref_root 0, parent 8564736, owner 0, offset 0, num_refs 18446744073709551615
      __btrfs_mod_ref+0x7dd/0xac0 fs/btrfs/extent-tree.c:2523
      update_ref_for_cow+0x9cd/0x11f0 fs/btrfs/ctree.c:512
      btrfs_force_cow_block+0x9f6/0x1da0 fs/btrfs/ctree.c:594
      btrfs_cow_block+0x35e/0xa40 fs/btrfs/ctree.c:754
      btrfs_search_slot+0xbdd/0x30d0 fs/btrfs/ctree.c:2116
      btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:411
      __btrfs_update_delayed_inode+0x1e7/0xb90 fs/btrfs/delayed-inode.c:1030
      btrfs_update_delayed_inode fs/btrfs/delayed-inode.c:1114 [inline]
      __btrfs_commit_inode_delayed_items+0x2318/0x24a0 fs/btrfs/delayed-inode.c:1137
      __btrfs_run_delayed_items+0x213/0x490 fs/btrfs/delayed-inode.c:1171
      btrfs_commit_transaction+0x8a8/0x3740 fs/btrfs/transaction.c:2313
      prepare_to_relocate+0x3c4/0x4c0 fs/btrfs/relocation.c:3586
      relocate_block_group+0x16c/0xd40 fs/btrfs/relocation.c:3611
      btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4081
      btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3377
      __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4161
      btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4538
   ==================================================================
   BUG: KASAN: slab-use-after-free in rb_first+0x69/0x70 lib/rbtree.c:473
   Read of size 8 at addr ffff888042d1af38 by task syz.0.0/5329

   CPU: 0 UID: 0 PID: 5329 Comm: syz.0.0 Not tainted 6.12.0-rc7-syzkaller #0
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
   Call Trace:
    <TASK>
    __dump_stack lib/dump_stack.c:94 [inline]
    dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
    print_address_description mm/kasan/report.c:377 [inline]
    print_report+0x169/0x550 mm/kasan/report.c:488
    kasan_report+0x143/0x180 mm/kasan/report.c:601
    rb_first+0x69/0x70 lib/rbtree.c:473
    free_block_entry+0x78/0x230 fs/btrfs/ref-verify.c:248
    btrfs_free_ref_cache+0xa3/0x100 fs/btrfs/ref-verify.c:917
    btrfs_ref_tree_mod+0x139f/0x15e0 fs/btrfs/ref-verify.c:898
    btrfs_free_extent+0x33c/0x380 fs/btrfs/extent-tree.c:3544
    __btrfs_mod_ref+0x7dd/0xac0 fs/btrfs/extent-tree.c:2523
    update_ref_for_cow+0x9cd/0x11f0 fs/btrfs/ctree.c:512
    btrfs_force_cow_block+0x9f6/0x1da0 fs/btrfs/ctree.c:594
    btrfs_cow_block+0x35e/0xa40 fs/btrfs/ctree.c:754
    btrfs_search_slot+0xbdd/0x30d0 fs/btrfs/ctree.c:2116
    btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:411
    __btrfs_update_delayed_inode+0x1e7/0xb90 fs/btrfs/delayed-inode.c:1030
    btrfs_update_delayed_inode fs/btrfs/delayed-inode.c:1114 [inline]
    __btrfs_commit_inode_delayed_items+0x2318/0x24a0 fs/btrfs/delayed-inode.c:1137
    __btrfs_run_delayed_items+0x213/0x490 fs/btrfs/delayed-inode.c:1171
    btrfs_commit_transaction+0x8a8/0x3740 fs/btrfs/transaction.c:2313
    prepare_to_relocate+0x3c4/0x4c0 fs/btrfs/relocation.c:3586
    relocate_block_group+0x16c/0xd40 fs/btrfs/relocation.c:3611
    btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4081
    btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3377
    __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4161
    btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4538
    btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3673
    vfs_ioctl fs/ioctl.c:51 [inline]
    __do_sys_ioctl fs/ioctl.c:907 [inline]
    __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7f996df7e719
   RSP: 002b:00007f996ede7038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   RAX: ffffffffffffffda RBX: 00007f996e135f80 RCX: 00007f996df7e719
   RDX: 0000000020000180 RSI: 00000000c4009420 RDI: 0000000000000004
   RBP: 00007f996dff139e R08: 0000000000000000 R09: 0000000000000000
   R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   R13: 0000000000000000 R14: 00007f996e135f80 R15: 00007fff79f32e68
    </TASK>

   Allocated by task 5329:
    kasan_save_stack mm/kasan/common.c:47 [inline]
    kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
    poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
    __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:394
    kasan_kmalloc include/linux/kasan.h:257 [inline]
    __kmalloc_cache_noprof+0x19c/0x2c0 mm/slub.c:4295
    kmalloc_noprof include/linux/slab.h:878 [inline]
    kzalloc_noprof include/linux/slab.h:1014 [inline]
    btrfs_ref_tree_mod+0x264/0x15e0 fs/btrfs/ref-verify.c:701
    btrfs_free_extent+0x33c/0x380 fs/btrfs/extent-tree.c:3544
    __btrfs_mod_ref+0x7dd/0xac0 fs/btrfs/extent-tree.c:2523
    update_ref_for_cow+0x9cd/0x11f0 fs/btrfs/ctree.c:512
    btrfs_force_cow_block+0x9f6/0x1da0 fs/btrfs/ctree.c:594
    btrfs_cow_block+0x35e/0xa40 fs/btrfs/ctree.c:754
    btrfs_search_slot+0xbdd/0x30d0 fs/btrfs/ctree.c:2116
    btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:411
    __btrfs_update_delayed_inode+0x1e7/0xb90 fs/btrfs/delayed-inode.c:1030
    btrfs_update_delayed_inode fs/btrfs/delayed-inode.c:1114 [inline]
    __btrfs_commit_inode_delayed_items+0x2318/0x24a0 fs/btrfs/delayed-inode.c:1137
    __btrfs_run_delayed_items+0x213/0x490 fs/btrfs/delayed-inode.c:1171
    btrfs_commit_transaction+0x8a8/0x3740 fs/btrfs/transaction.c:2313
    prepare_to_relocate+0x3c4/0x4c0 fs/btrfs/relocation.c:3586
    relocate_block_group+0x16c/0xd40 fs/btrfs/relocation.c:3611
    btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4081
    btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3377
    __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4161
    btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4538
    btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3673
    vfs_ioctl fs/ioctl.c:51 [inline]
    __do_sys_ioctl fs/ioctl.c:907 [inline]
    __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x77/0x7f

   Freed by task 5329:
    kasan_save_stack mm/kasan/common.c:47 [inline]
    kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
    kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579
    poison_slab_object mm/kasan/common.c:247 [inline]
    __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
    kasan_slab_free include/linux/kasan.h:230 [inline]
    slab_free_hook mm/slub.c:2342 [inline]
    slab_free mm/slub.c:4579 [inline]
    kfree+0x1a0/0x440 mm/slub.c:4727
    btrfs_ref_tree_mod+0x136c/0x15e0
    btrfs_free_extent+0x33c/0x380 fs/btrfs/extent-tree.c:3544
    __btrfs_mod_ref+0x7dd/0xac0 fs/btrfs/extent-tree.c:2523
    update_ref_for_cow+0x9cd/0x11f0 fs/btrfs/ctree.c:512
    btrfs_force_cow_block+0x9f6/0x1da0 fs/btrfs/ctree.c:594
    btrfs_cow_block+0x35e/0xa40 fs/btrfs/ctree.c:754
    btrfs_search_slot+0xbdd/0x30d0 fs/btrfs/ctree.c:2116
    btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:411
    __btrfs_update_delayed_inode+0x1e7/0xb90 fs/btrfs/delayed-inode.c:1030
    btrfs_update_delayed_inode fs/btrfs/delayed-inode.c:1114 [inline]
    __btrfs_commit_inode_delayed_items+0x2318/0x24a0 fs/btrfs/delayed-inode.c:1137
    __btrfs_run_delayed_items+0x213/0x490 fs/btrfs/delayed-inode.c:1171
    btrfs_commit_transaction+0x8a8/0x3740 fs/btrfs/transaction.c:2313
    prepare_to_relocate+0x3c4/0x4c0 fs/btrfs/relocation.c:3586
    relocate_block_group+0x16c/0xd40 fs/btrfs/relocation.c:3611
    btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4081
    btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3377
    __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4161
    btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4538
    btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3673
    vfs_ioctl fs/ioctl.c:51 [inline]
    __do_sys_ioctl fs/ioctl.c:907 [inline]
    __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x77/0x7f

   The buggy address belongs to the object at ffff888042d1af00
    which belongs to the cache kmalloc-64 of size 64
   The buggy address is located 56 bytes inside of
    freed 64-byte region [ffff888042d1af00, ffff888042d1af40)

   The buggy address belongs to the physical page:
   page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x42d1a
   anon flags: 0x4fff00000000000(node=1|zone=1|lastcpupid=0x7ff)
   page_type: f5(slab)
   raw: 04fff00000000000 ffff88801ac418c0 0000000000000000 dead000000000001
   raw: 0000000000000000 0000000000200020 00000001f5000000 0000000000000000
   page dumped because: kasan: bad access detected
   page_owner tracks the page as allocated
   page last allocated via order 0, migratetype Unmovable, gfp_mask 0x52c40(GFP_NOFS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP), pid 5055, tgid 5055 (dhcpcd-run-hook), ts 40377240074, free_ts 40376848335
    set_page_owner include/linux/page_owner.h:32 [inline]
    post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1541
    prep_new_page mm/page_alloc.c:1549 [inline]
    get_page_from_freelist+0x3649/0x3790 mm/page_alloc.c:3459
    __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4735
    alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265
    alloc_slab_page+0x6a/0x140 mm/slub.c:2412
    allocate_slab+0x5a/0x2f0 mm/slub.c:2578
    new_slab mm/slub.c:2631 [inline]
    ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3818
    __slab_alloc+0x58/0xa0 mm/slub.c:3908
    __slab_alloc_node mm/slub.c:3961 [inline]
    slab_alloc_node mm/slub.c:4122 [inline]
    __do_kmalloc_node mm/slub.c:4263 [inline]
    __kmalloc_noprof+0x25a/0x400 mm/slub.c:4276
    kmalloc_noprof include/linux/slab.h:882 [inline]
    kzalloc_noprof include/linux/slab.h:1014 [inline]
    tomoyo_encode2 security/tomoyo/realpath.c:45 [inline]
    tomoyo_encode+0x26f/0x540 security/tomoyo/realpath.c:80
    tomoyo_realpath_from_path+0x59e/0x5e0 security/tomoyo/realpath.c:283
    tomoyo_get_realpath security/tomoyo/file.c:151 [inline]
    tomoyo_check_open_permission+0x255/0x500 security/tomoyo/file.c:771
    security_file_open+0x777/0x990 security/security.c:3109
    do_dentry_open+0x369/0x1460 fs/open.c:945
    vfs_open+0x3e/0x330 fs/open.c:1088
    do_open fs/namei.c:3774 [inline]
    path_openat+0x2c84/0x3590 fs/namei.c:3933
   page last free pid 5055 tgid 5055 stack trace:
    reset_page_owner include/linux/page_owner.h:25 [inline]
    free_pages_prepare mm/page_alloc.c:1112 [inline]
    free_unref_page+0xcfb/0xf20 mm/page_alloc.c:2642
    free_pipe_info+0x300/0x390 fs/pipe.c:860
    put_pipe_info fs/pipe.c:719 [inline]
    pipe_release+0x245/0x320 fs/pipe.c:742
    __fput+0x23f/0x880 fs/file_table.c:431
    __do_sys_close fs/open.c:1567 [inline]
    __se_sys_close fs/open.c:1552 [inline]
    __x64_sys_close+0x7f/0x110 fs/open.c:1552
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x77/0x7f

   Memory state around the buggy address:
    ffff888042d1ae00: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
    ffff888042d1ae80: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
   >ffff888042d1af00: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
                                           ^
    ffff888042d1af80: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc
    ffff888042d1b000: 00 00 00 00 00 fc fc 00 00 00 00 00 fc fc 00 00

Reported-by: syzbot+7325f164162e200000c1@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/673723eb.050a0220.1324f8.00a8.GAE@google.com/T/#u
Fixes: fd708b81d9 ("Btrfs: add a extent ref verify tool")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-29 16:52:29 +01:00
Lizhi Xu
3ed51857a5 btrfs: add a sanity check for btrfs root in btrfs_search_slot()
Syzbot reports a null-ptr-deref in btrfs_search_slot().

The reproducer is using rescue=ibadroots, and the extent tree root is
corrupted thus the extent tree is NULL.

When scrub tries to search the extent tree to gather the needed extent
info, btrfs_search_slot() doesn't check if the target root is NULL or
not, resulting the null-ptr-deref.

Add sanity check for btrfs root before using it in btrfs_search_slot().

Reported-by: syzbot+3030e17bd57a73d39bd7@syzkaller.appspotmail.com
Fixes: 42437a6386 ("btrfs: introduce mount option rescue=ignorebadroots")
Link: https://syzkaller.appspot.com/bug?extid=3030e17bd57a73d39bd7
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: syzbot+3030e17bd57a73d39bd7@syzkaller.appspotmail.com
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-29 16:50:40 +01:00
Filipe Manana
ed67f2a913 btrfs: don't loop for nowait writes when checking for cross references
When checking for delayed refs when verifying if there are cross
references for a data extent, we stop if the path has nowait set and we
can't try lock the delayed ref head's mutex, returning -EAGAIN with the
goal of making a write fallback to a blocking context. However we ignore
the -EAGAIN at btrfs_cross_ref_exist() when check_delayed_ref() returns
it, and keep looping instead of immediately returning the -EAGAIN to the
caller.

Fix this by not looping if we get -EAGAIN and we have a nowait path.

Fixes: 26ce911446 ("btrfs: make can_nocow_extent nowait compatible")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-29 16:46:47 +01:00
Filipe Manana
b188ad7791 btrfs: sysfs: advertise experimental features only if CONFIG_BTRFS_EXPERIMENTAL=y
We are advertising experimental features through sysfs if
CONFIG_BTRFS_DEBUG is set, without looking if CONFIG_BTRFS_EXPERIMENTAL
is set. This is wrong as it will result in reporting experimental
features as supported when CONFIG_BTRFS_EXPERIMENTAL is not set but
CONFIG_BTRFS_DEBUG is set.

Fix this by checking for CONFIG_BTRFS_EXPERIMENTAL instead of
CONFIG_BTRFS_DEBUG.

Fixes: 67cd3f2217 ("btrfs: split out CONFIG_BTRFS_EXPERIMENTAL from CONFIG_BTRFS_DEBUG")
Reviewed-by: Neal Gompa <neal@gompa.dev>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-28 20:46:59 +01:00
Filipe Manana
7d6872ccbd btrfs: fix deadlock between transaction commits and extent locks
When running a workload with fsstress and duperemove (generic/561) we can
hit a deadlock related to transaction commits and locking extent ranges,
as described below.

Task A hanging during a transaction commit, waiting for all other writers
to complete:

  [178317.334817] INFO: task fsstress:555623 blocked for more than 120 seconds.
  [178317.335693]       Not tainted 6.12.0-rc6-btrfs-next-179+ #1
  [178317.336528] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [178317.337673] task:fsstress        state:D stack:0     pid:555623 tgid:555623 ppid:555620 flags:0x00004002
  [178317.337679] Call Trace:
  [178317.337681]  <TASK>
  [178317.337685]  __schedule+0x364/0xbe0
  [178317.337691]  schedule+0x26/0xa0
  [178317.337695]  btrfs_commit_transaction+0x5c5/0x1050 [btrfs]
  [178317.337769]  ? start_transaction+0xc4/0x800 [btrfs]
  [178317.337815]  ? __pfx_autoremove_wake_function+0x10/0x10
  [178317.337819]  btrfs_mksubvol+0x381/0x640 [btrfs]
  [178317.337878]  btrfs_mksnapshot+0x7a/0xb0 [btrfs]
  [178317.337935]  __btrfs_ioctl_snap_create+0x1bb/0x1d0 [btrfs]
  [178317.337995]  btrfs_ioctl_snap_create_v2+0x103/0x130 [btrfs]
  [178317.338053]  btrfs_ioctl+0x29b/0x2a90 [btrfs]
  [178317.338118]  ? kmem_cache_alloc_noprof+0x5f/0x2c0
  [178317.338126]  ? getname_flags+0x45/0x1f0
  [178317.338133]  ? _raw_spin_unlock+0x15/0x30
  [178317.338145]  ? __x64_sys_ioctl+0x88/0xc0
  [178317.338149]  __x64_sys_ioctl+0x88/0xc0
  [178317.338152]  do_syscall_64+0x4a/0x110
  [178317.338160]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [178317.338190] RIP: 0033:0x7f13c28e271b

Which corresponds to line 2361 of transaction.c:

  $ cat -n fs/btrfs/transaction.c
  (...)
  2162  int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
  2163  {
  (...)
  2349          spin_lock(&fs_info->trans_lock);
  2350          add_pending_snapshot(trans);
  2351          cur_trans->state = TRANS_STATE_COMMIT_DOING;
  2352          spin_unlock(&fs_info->trans_lock);
  2353
  2354          /*
  2355           * The thread has started/joined the transaction thus it holds the
  2356           * lockdep map as a reader. It has to release it before acquiring the
  2357           * lockdep map as a writer.
  2358           */
  2359          btrfs_lockdep_release(fs_info, btrfs_trans_num_writers);
  2360          btrfs_might_wait_for_event(fs_info, btrfs_trans_num_writers);
  2361          wait_event(cur_trans->writer_wait,
  2362                     atomic_read(&cur_trans->num_writers) == 1);
  (...)

The transaction is in the TRANS_STATE_COMMIT_DOING state and so it's
waiting for all other existing writers to complete and release their
transaction handle.

Task B is running ordered extent completion and blocked waiting to lock an
extent range in an inode's io tree:

  [178317.327411] INFO: task kworker/u48:8:554545 blocked for more than 120 seconds.
  [178317.328630]       Not tainted 6.12.0-rc6-btrfs-next-179+ #1
  [178317.329635] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [178317.330872] task:kworker/u48:8   state:D stack:0     pid:554545 tgid:554545 ppid:2      flags:0x00004000
  [178317.330878] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
  [178317.330944] Call Trace:
  [178317.330945]  <TASK>
  [178317.330947]  __schedule+0x364/0xbe0
  [178317.330952]  schedule+0x26/0xa0
  [178317.330955]  __lock_extent+0x337/0x3a0 [btrfs]
  [178317.331014]  ? __pfx_autoremove_wake_function+0x10/0x10
  [178317.331017]  btrfs_finish_one_ordered+0x47a/0xaa0 [btrfs]
  [178317.331074]  ? psi_group_change+0x132/0x2d0
  [178317.331078]  btrfs_work_helper+0xbd/0x370 [btrfs]
  [178317.331140]  process_scheduled_works+0xd3/0x460
  [178317.331144]  ? __pfx_worker_thread+0x10/0x10
  [178317.331146]  worker_thread+0x121/0x250
  [178317.331149]  ? __pfx_worker_thread+0x10/0x10
  [178317.331151]  kthread+0xe9/0x120
  [178317.331154]  ? __pfx_kthread+0x10/0x10
  [178317.331157]  ret_from_fork+0x2d/0x50
  [178317.331159]  ? __pfx_kthread+0x10/0x10
  [178317.331162]  ret_from_fork_asm+0x1a/0x30

This extent range locking happens after joining the current transaction,
so task A is waiting for task B to release its transaction handle
(decrementing the transaction's num_writers counter).

Task C while doing a fiemap it tries to join the current transaction:

  [242682.812815] task:pool            state:D stack:0     pid:560767 tgid:560724 ppid:555622 flags:0x00004006
  [242682.812827] Call Trace:
  [242682.812856]  <TASK>
  [242682.812864]  __schedule+0x364/0xbe0
  [242682.812879]  ? _raw_spin_unlock_irqrestore+0x23/0x40
  [242682.812897]  schedule+0x26/0xa0
  [242682.812909]  wait_current_trans+0xd6/0x130 [btrfs]
  [242682.813148]  ? __pfx_autoremove_wake_function+0x10/0x10
  [242682.813162]  start_transaction+0x3d4/0x800 [btrfs]
  [242682.813399]  btrfs_is_data_extent_shared+0xd2/0x440 [btrfs]
  [242682.813723]  fiemap_process_hole+0x2a2/0x300 [btrfs]
  [242682.813995]  extent_fiemap+0x9b8/0xb80 [btrfs]
  [242682.814249]  btrfs_fiemap+0x78/0xc0 [btrfs]
  [242682.814501]  do_vfs_ioctl+0x2db/0xa50
  [242682.814519]  __x64_sys_ioctl+0x6a/0xc0
  [242682.814531]  do_syscall_64+0x4a/0x110
  [242682.814544]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [242682.814556] RIP: 0033:0x7efff595e71b

It tries to join the current transaction, but it can't because the
transaction is in the TRANS_STATE_COMMIT_DOING state, so
join_transaction() returns -EBUSY to start_transaction() and makes it
wait for the current transaction to complete. And while it's waiting
for the transaction to complete, it's holding an extent range locked
in the same inode that task B is operating, which causes a deadlock
between these 3 tasks. The extent range for the inode was locked at
the start of the fiemap operation, early at extent_fiemap().

In short these tasks deadlock because:

1) Task A is waiting for task B to release its transaction handle;

2) Task B is waiting to lock an extent range for an inode while holding a
   transaction handle open;

3) Task C is waiting for the current transaction to complete (for task A
   to finish the transaction commit) while holding the extent range for
   the inode locked, so task B can't progress and release its transaction
   handle.

This results in an ABBA deadlock involving transaction commits and extent
locks. Extent locks are higher level locks, like inode VFS locks, and
should always be acquired before joining or starting a transaction, but
recently commit 2206265f41 ("btrfs: remove code duplication in ordered
extent finishing") accidentally changed btrfs_finish_one_ordered() to do
the transaction join before locking the extent range.

Fix this by making sure that btrfs_finish_one_ordered() always locks the
extent before joining a transaction and add an explicit comment about the
need for this order.

Fixes: 2206265f41 ("btrfs: remove code duplication in ordered extent finishing")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-28 20:46:40 +01:00
Johannes Thumshirn
05b36b04d7 btrfs: fix use-after-free in btrfs_encoded_read_endio()
Shinichiro reported the following use-after free that sometimes is
happening in our CI system when running fstests' btrfs/284 on a TCMU
runner device:

  BUG: KASAN: slab-use-after-free in lock_release+0x708/0x780
  Read of size 8 at addr ffff888106a83f18 by task kworker/u80:6/219

  CPU: 8 UID: 0 PID: 219 Comm: kworker/u80:6 Not tainted 6.12.0-rc6-kts+ #15
  Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
  Workqueue: btrfs-endio btrfs_end_bio_work [btrfs]
  Call Trace:
   <TASK>
   dump_stack_lvl+0x6e/0xa0
   ? lock_release+0x708/0x780
   print_report+0x174/0x505
   ? lock_release+0x708/0x780
   ? __virt_addr_valid+0x224/0x410
   ? lock_release+0x708/0x780
   kasan_report+0xda/0x1b0
   ? lock_release+0x708/0x780
   ? __wake_up+0x44/0x60
   lock_release+0x708/0x780
   ? __pfx_lock_release+0x10/0x10
   ? __pfx_do_raw_spin_lock+0x10/0x10
   ? lock_is_held_type+0x9a/0x110
   _raw_spin_unlock_irqrestore+0x1f/0x60
   __wake_up+0x44/0x60
   btrfs_encoded_read_endio+0x14b/0x190 [btrfs]
   btrfs_check_read_bio+0x8d9/0x1360 [btrfs]
   ? lock_release+0x1b0/0x780
   ? trace_lock_acquire+0x12f/0x1a0
   ? __pfx_btrfs_check_read_bio+0x10/0x10 [btrfs]
   ? process_one_work+0x7e3/0x1460
   ? lock_acquire+0x31/0xc0
   ? process_one_work+0x7e3/0x1460
   process_one_work+0x85c/0x1460
   ? __pfx_process_one_work+0x10/0x10
   ? assign_work+0x16c/0x240
   worker_thread+0x5e6/0xfc0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0x2c3/0x3a0
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x31/0x70
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>

  Allocated by task 3661:
   kasan_save_stack+0x30/0x50
   kasan_save_track+0x14/0x30
   __kasan_kmalloc+0xaa/0xb0
   btrfs_encoded_read_regular_fill_pages+0x16c/0x6d0 [btrfs]
   send_extent_data+0xf0f/0x24a0 [btrfs]
   process_extent+0x48a/0x1830 [btrfs]
   changed_cb+0x178b/0x2ea0 [btrfs]
   btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
   _btrfs_ioctl_send+0x117/0x330 [btrfs]
   btrfs_ioctl+0x184a/0x60a0 [btrfs]
   __x64_sys_ioctl+0x12e/0x1a0
   do_syscall_64+0x95/0x180
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

  Freed by task 3661:
   kasan_save_stack+0x30/0x50
   kasan_save_track+0x14/0x30
   kasan_save_free_info+0x3b/0x70
   __kasan_slab_free+0x4f/0x70
   kfree+0x143/0x490
   btrfs_encoded_read_regular_fill_pages+0x531/0x6d0 [btrfs]
   send_extent_data+0xf0f/0x24a0 [btrfs]
   process_extent+0x48a/0x1830 [btrfs]
   changed_cb+0x178b/0x2ea0 [btrfs]
   btrfs_ioctl_send+0x3bf9/0x5c20 [btrfs]
   _btrfs_ioctl_send+0x117/0x330 [btrfs]
   btrfs_ioctl+0x184a/0x60a0 [btrfs]
   __x64_sys_ioctl+0x12e/0x1a0
   do_syscall_64+0x95/0x180
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

  The buggy address belongs to the object at ffff888106a83f00
   which belongs to the cache kmalloc-rnd-07-96 of size 96
  The buggy address is located 24 bytes inside of
   freed 96-byte region [ffff888106a83f00, ffff888106a83f60)

  The buggy address belongs to the physical page:
  page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888106a83800 pfn:0x106a83
  flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
  page_type: f5(slab)
  raw: 0017ffffc0000000 ffff888100053680 ffffea0004917200 0000000000000004
  raw: ffff888106a83800 0000000080200019 00000001f5000000 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff888106a83e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
   ffff888106a83e80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
  >ffff888106a83f00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
                              ^
   ffff888106a83f80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
   ffff888106a84000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ==================================================================

Further analyzing the trace and the crash dump's vmcore file shows that
the wake_up() call in btrfs_encoded_read_endio() is calling wake_up() on
the wait_queue that is in the private data passed to the end_io handler.

Commit 4ff47df40447 ("btrfs: move priv off stack in
btrfs_encoded_read_regular_fill_pages()") moved 'struct
btrfs_encoded_read_private' off the stack.

Before that commit one can see a corruption of the private data when
analyzing the vmcore after a crash:

*(struct btrfs_encoded_read_private *)0xffff88815626eec8 = {
	.wait = (wait_queue_head_t){
		.lock = (spinlock_t){
			.rlock = (struct raw_spinlock){
				.raw_lock = (arch_spinlock_t){
					.val = (atomic_t){
						.counter = (int)-2005885696,
					},
					.locked = (u8)0,
					.pending = (u8)157,
					.locked_pending = (u16)40192,
					.tail = (u16)34928,
				},
				.magic = (unsigned int)536325682,
				.owner_cpu = (unsigned int)29,
				.owner = (void *)__SCT__tp_func_btrfs_transaction_commit+0x0 = 0x0,
				.dep_map = (struct lockdep_map){
					.key = (struct lock_class_key *)0xffff8881575a3b6c,
					.class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
					.name = (const char *)0xffff88815626f100 = "",
					.wait_type_outer = (u8)37,
					.wait_type_inner = (u8)178,
					.lock_type = (u8)154,
				},
			},
			.__padding = (u8 [24]){ 0, 157, 112, 136, 50, 174, 247, 31, 29 },
			.dep_map = (struct lockdep_map){
				.key = (struct lock_class_key *)0xffff8881575a3b6c,
				.class_cache = (struct lock_class *[2]){ 0xffff8882a71985c0, 0xffffea00066f5d40 },
				.name = (const char *)0xffff88815626f100 = "",
				.wait_type_outer = (u8)37,
				.wait_type_inner = (u8)178,
				.lock_type = (u8)154,
			},
		},
		.head = (struct list_head){
			.next = (struct list_head *)0x112cca,
			.prev = (struct list_head *)0x47,
		},
	},
	.pending = (atomic_t){
		.counter = (int)-1491499288,
	},
	.status = (blk_status_t)130,
}

Here we can see several indicators of in-memory data corruption, e.g. the
large negative atomic values of ->pending or
->wait->lock->rlock->raw_lock->val, as well as the bogus spinlock magic
0x1ff7ae32 (decimal 536325682 above) instead of 0xdead4ead or the bogus
pointer values for ->wait->head.

To fix this, change atomic_dec_return() to atomic_dec_and_test() to fix the
corruption, as atomic_dec_return() is defined as two instructions on
x86_64, whereas atomic_dec_and_test() is defined as a single atomic
operation. This can lead to a situation where counter value is already
decremented but the if statement in btrfs_encoded_read_endio() is not
completely processed, i.e. the 0 test has not completed. If another thread
continues executing btrfs_encoded_read_regular_fill_pages() the
atomic_dec_return() there can see an already updated ->pending counter and
continues by freeing the private data. Continuing in the endio handler the
test for 0 succeeds and the wait_queue is woken up, resulting in a
use-after-free.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Suggested-by: Damien Le Moal <Damien.LeMoal@wdc.com>
Fixes: 1881fba89b ("btrfs: add BTRFS_IOC_ENCODED_READ ioctl")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-28 20:45:43 +01:00
Linus Torvalds
563cb0b1e7 cxl changes for v6.13
- Constify range_contains() input parameters to prevent changes.
 - Add support for displaying RCD capabilities in sysfs to support lspci for CXL device.
 - Downgrade warning message to debug in cxl_probe_component_regs().
 - Add support for adding a printf specifier '$pra' to emit 'struct range' content.
   - Add sanity tests for 'struct resource'.
   - Add documentation for special case.
   - Add %pra for 'struct range'.
   - Add %pra usage in CXL code.
 - Add preparation code for DCD support
   - Add range_overlaps().
   - Add CDAT DSMAS table shared and read only flag in ACPICA.
   - Add documentation to 'struct dev_dax_range'.
   - Delay event buffer allocation in CXL PCI code until needed.
   - Use guard() in cxl_dpa_set_mode().
   - Refactor create region code to consolidate common code.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAmc84dMACgkQYGjFFmlT
 OEoGTg//cSJlQ9X7+xZDbngnzpJwcLzQkR/FXDfe3obtmgs7woDJgNNcYnKSlgyf
 wal47Q0UM/1Hv8Dtfrt62Ay1fmOvDL2GSpey35NVJGCEpIsfOqqk1zTCgfgwRHTO
 MZJLnOSFUIlDYlVz8ljLNHnNqPjr7dCoUh9tdBefvkw59FqbkHNcWI8hG1lh1SR4
 2frtJcqVg54S6vJa2eeWmNVpxz7RZvPFrb8TJzhdrGM8PkTMNFA2oJINAf0j00Ev
 8/T6HXTxXvFtNhBH0dtMO1MFh1d6Qr/zFnX/gmrnPWl1l/12HFDMBIZIzq/Whjpo
 +7hQ5xK3cwkMevFgFrAhwdZMj8maR84x1dbFItoThaoeDIQ4sGfyQEMPsbkZP/Sc
 67i5hQFIBZc+ORLB0W+z9Da52ZFGyVw/xsCmDRzXCw4s7N2twpydIoA7Pvu9NN1X
 3JVF35NrsRZ+PyuGWEitNjo0Rj6swNpBC5Xv/T1mgFtSgvVuk1T2QtSHJcPoQyzQ
 zbijsCKmvJYbdJBnPiotdrBs1BUxBsP9dBT9IxWzMy6lcEpTJrYpUheRCk2tSHFa
 Kk8O8IYNiBKZaSpN9UHKaGzr43H8gNbLf4svSIiu1lZJTSSdtWqfZZYjXFBgB1Vb
 l2gBCDmPJ0y7WKZSCa53UmQiOusr+l3Pi+OflZEfCy6JxbSqTTM=
 =GNlu
 -----END PGP SIGNATURE-----

Merge tag 'cxl-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

Pull cxl updates from Dave Jiang:

 - Constify range_contains() input parameters to prevent changes

 - Add support for displaying RCD capabilities in sysfs to support lspci
   for CXL device

 - Downgrade warning message to debug in cxl_probe_component_regs()

 - Add support for adding a printf specifier '%pra' to emit 'struct
   range' content:
     - Add sanity tests for 'struct resource'
     - Add documentation for special case
     - Add %pra for 'struct range'
     - Add %pra usage in CXL code

 - Add preparation code for DCD support:
     - Add range_overlaps()
     - Add CDAT DSMAS table shared and read only flag in ACPICA
     - Add documentation to 'struct dev_dax_range'
     - Delay event buffer allocation in CXL PCI code until needed
     - Use guard() in cxl_dpa_set_mode()
     - Refactor create region code to consolidate common code

* tag 'cxl-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl/region: Refactor common create region code
  cxl/hdm: Use guard() in cxl_dpa_set_mode()
  cxl/pci: Delay event buffer allocation
  dax: Document struct dev_dax_range
  ACPI/CDAT: Add CDAT/DSMAS shared and read only flag values
  range: Add range_overlaps()
  cxl/cdat: Use %pra for dpa range outputs
  printf: Add print format (%pra) for struct range
  Documentation/printf: struct resource add start == end special case
  test printf: Add very basic struct resource tests
  cxl: downgrade a warning message to debug level in cxl_probe_component_regs()
  cxl/pci: Add sysfs attribute for CXL 1.1 device link status
  cxl/core/regs: Add rcd_pcie_cap initialization
  kernel/range: Const-ify range_contains parameters
2024-11-22 12:33:52 -08:00
Linus Torvalds
77a0cfafa9 for-6.13/block-20241118
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmc7S40QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjHVD/43rDZ8ehs+IAAr6S0RemNX1SRG0mK2UOEb
 kMoNogS7StO/c4JYW3JuzCyLRn5ZsgeWV/muqxwDEWQrmTGrvi+V45KikrZPwm3k
 p0ump33qV9EU2jiR1MKZjtwK2P0CI7/DD3W8ww6IOvKbTT7RcqQcdHznvXArFBtc
 xCuQPpayFG7ZasC+N9VaBwtiUEVgU3Ek9AFT7UVZRWajjHPNalQwaooJWayO0rEG
 KdoW5yG0ryLrgCY2ACSvRLS+2s14EJtb8hgT08WKHTNgd5LxhSKxfsTapamua+7U
 FdVS6Ij0tEkgu2jpvgj7QKO0Uw10Cnep2gj7RHts/LVewvkliS6XcheOzqRS1jWU
 I2EI+UaGOZ11OUiw52VIveEVS5zV/NWhgy5BSP9LYEvXw0BUAHRDYGMem8o5G1V1
 SWqjIM1UWvcQDlAnMF9FDVzojvjVUmYWvcAlFFztO8J0B7SavHR3NcfHwEf57reH
 rNoUbi/9c4/wjJJF33gejiR5pU+ewy/Mk75GrtX3xpEqlztfRbf9/FbPCMEAO1KR
 DF/b3lkUV9i2/BRW6a0SpZ5RDSmSYMnateel6TrPyVSRnpiSSFO8FrbynwUOa17b
 6i49YDFWzzXOrR1YWDg6IEtTrcmBEmvi7F6aoDs020qUnL0hwLn1ZuoIxuiFEpor
 Z0iFF1B/nw==
 =PWTH
 -----END PGP SIGNATURE-----

Merge tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe updates via Keith:
      - Use uring_cmd helper (Pavel)
      - Host Memory Buffer allocation enhancements (Christoph)
      - Target persistent reservation support (Guixin)
      - Persistent reservation tracing (Guixen)
      - NVMe 2.1 specification support (Keith)
      - Rotational Meta Support (Matias, Wang, Keith)
      - Volatile cache detection enhancment (Guixen)

 - MD updates via Song:
      - Maintainers update
      - raid5 sync IO fix
      - Enhance handling of faulty and blocked devices
      - raid5-ppl atomic improvement
      - md-bitmap fix

 - Support for manually defining embedded partition tables

 - Zone append fixes and cleanups

 - Stop sending the queued requests in the plug list to the driver
   ->queue_rqs() handle in reverse order.

 - Zoned write plug cleanups

 - Cleanups disk stats tracking and add support for disk stats for
   passthrough IO

 - Add preparatory support for file system atomic writes

 - Add lockdep support for queue freezing. Already found a bunch of
   issues, and some fixes for that are in here. More will be coming.

 - Fix race between queue stopping/quiescing and IO queueing

 - ublk recovery improvements

 - Fix ublk mmap for 64k pages

 - Various fixes and cleanups

* tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux: (118 commits)
  MAINTAINERS: Update git tree for mdraid subsystem
  block: make struct rq_list available for !CONFIG_BLOCK
  block/genhd: use seq_put_decimal_ull for diskstats decimal values
  block: don't reorder requests in blk_mq_add_to_batch
  block: don't reorder requests in blk_add_rq_to_plug
  block: add a rq_list type
  block: remove rq_list_move
  virtio_blk: reverse request order in virtio_queue_rqs
  nvme-pci: reverse request order in nvme_queue_rqs
  btrfs: validate queue limits
  block: export blk_validate_limits
  nvmet: add tracing of reservation commands
  nvme: parse reservation commands's action and rtype to string
  nvmet: report ns's vwc not present
  md/raid5: Increase r5conf.cache_name size
  block: remove the ioprio field from struct request
  block: remove the write_hint field from struct request
  nvme: check ns's volatile write cache not present
  nvme: add rotational support
  nvme: use command set independent id ns if available
  ...
2024-11-18 16:50:08 -08:00
Linus Torvalds
c14a8a4c04 for-6.13-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmc0zT4ACgkQxWXV+ddt
 WDtThRAAhzSSiHcJqTfCL5nHh7w85MNEVw28o1ETgXSYJmx0JOWLE7Znlp2FV7jj
 IbYkFfF2gXJzYvRZkcXB/TAHV9KJG5yZIBZfccbM+9db9f8xkImVKMuqQRXPU41R
 ppSCmqZTeujtt8ucsaJkMpm6pzECKJCJaGOsMJ8fiqKpo89dKO3eGAVboSbpPF4C
 r0YmppiBwSP/cCXQCqWxZRbqPGN+lUgZpIGNRi157kehfmRHlVVJTO1pgqK8PCXb
 uIT09Kulppfez8+1A10CPcniDTyinLik/qLTNlzdWoDBL4iNJMg0A0wsA04AJVf0
 PdOS0REusiv3QcEIO6PefuRFRRfXcSLPpPDUceltJT5O0uM2gUqf2C7dEHXUGU3o
 TdgYlbQpsJWpZ7VGWQDZeGGV04lOPQvu0LGLPgEerUQd5H9ABa0dX8Fn0sPhKsa8
 whpAcdfE4rdNxB2OJFnqQeFq0z3cSjP/rvKlluCmAj97QYI+kiu3QyhemcT1YSC9
 U7n5Ya9IzIYCN3ml54q3hEgyD0IVGGG20GuUmqC9XSP9mrQRC8I1g7v26AiOTrrk
 VhgSdtMmphDxXudifsnYMaQ0Z1QqiUrW1SM/prAEOnBYCo75+HDsTgrq9ithgHoI
 4xz4YXJyMRs18qfTJctXC1wmGuz5plTdQrwarHdNsELN5HEyqX4=
 =aAcf
 -----END PGP SIGNATURE-----

Merge tag 'for-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Changes outside of btrfs: add io_uring command flag to track a dying
  task (the rest will go via the block git tree).

  User visible changes:

   - wire encoded read (ioctl) to io_uring commands, this can be used on
     itself, in the future this will allow 'send' to be asynchronous. As
     a consequence, the encoded read ioctl can also work in non-blocking
     mode

   - new ioctl to wait for cleaned subvolumes, no need to use the
     generic and root-only SEARCH_TREE ioctl, will be used by "btrfs
     subvol sync"

   - recognize different paths/symlinks for the same devices and don't
     report them during rescanning, this can be observed with LVM or DM

   - seeding device use case change, the sprout device (the one
     capturing new writes) will not clear the read-only status of the
     super block; this prevents accumulating space from deleted
     snapshots

  Performance improvements:

   - reduce lock contention when traversing extent buffers

   - reduce extent tree lock contention when searching for inline
     backref

   - switch from rb-trees to xarray for delayed ref tracking,
     improvements due to better cache locality, branching factors and
     more compact data structures

   - enable extent map shrinker again (prevent memory exhaustion under
     some types of IO load), reworked to run in a single worker thread
     (there used to be problems causing long stalls under memory
     pressure)

  Core changes:

   - raid-stripe-tree feature updates:
       - make device replace and scrub work
       - implement partial deletion of stripe extents
       - new selftests

   - split the config option BTRFS_DEBUG and add EXPERIMENTAL for
     features that are experimental or with known problems so we don't
     misuse debugging config for that

   - subpage mode updates (sector < page):
       - update compression implementations
       - update writepage, writeback

   - continued folio API conversions:
       - buffered writes

   - make buffered write copy one page at a time, preparatory work for
     future integration with large folios, may cause performance drop

   - proper locking of root item regarding starting send

   - error handling improvements

   - code cleanups and refactoring:
       - dead code removal
       - unused parameter reduction
       - lockdep assertions"

* tag 'for-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (119 commits)
  btrfs: send: check for read-only send root under critical section
  btrfs: send: check for dead send root under critical section
  btrfs: remove check for NULL fs_info at btrfs_folio_end_lock_bitmap()
  btrfs: fix warning on PTR_ERR() against NULL device at btrfs_control_ioctl()
  btrfs: fix a typo in btrfs_use_zone_append
  btrfs: avoid superfluous calls to free_extent_map() in btrfs_encoded_read()
  btrfs: simplify logic to decrement snapshot counter at btrfs_mksnapshot()
  btrfs: remove hole from struct btrfs_delayed_node
  btrfs: update stale comment for struct btrfs_delayed_ref_node::add_list
  btrfs: add new ioctl to wait for cleaned subvolumes
  btrfs: simplify range tracking in cow_file_range()
  btrfs: remove conditional path allocation in btrfs_read_locked_inode()
  btrfs: push cleanup into btrfs_read_locked_inode()
  io_uring/cmd: let cmds to know about dying task
  btrfs: add struct io_btrfs_cmd as type for io_uring_cmd_to_pdu()
  btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)
  btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages()
  btrfs: don't sleep in btrfs_encoded_read() if IOCB_NOWAIT is set
  btrfs: change btrfs_encoded_read() so that reading of extent is done by caller
  btrfs: remove pointless iocb::ki_pos addition in btrfs_encoded_read()
  ...
2024-11-18 16:37:41 -08:00
Linus Torvalds
0f25f0e4ef the bulk of struct fd memory safety stuff
Making sure that struct fd instances are destroyed in the same
 scope where they'd been created, getting rid of reassignments
 and passing them by reference, converting to CLASS(fd{,_pos,_raw}).
 
 We are getting very close to having the memory safety of that stuff
 trivial to verify.
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZzdikAAKCRBZ7Krx/gZQ
 69nJAQCmbQHK3TGUbQhOw6MJXOK9ezpyEDN3FZb4jsu38vTIdgEA6OxAYDO2m2g9
 CN18glYmD3wRyU6Bwl4vGODouSJvDgA=
 =gVH3
 -----END PGP SIGNATURE-----

Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull 'struct fd' class updates from Al Viro:
 "The bulk of struct fd memory safety stuff

  Making sure that struct fd instances are destroyed in the same scope
  where they'd been created, getting rid of reassignments and passing
  them by reference, converting to CLASS(fd{,_pos,_raw}).

  We are getting very close to having the memory safety of that stuff
  trivial to verify"

* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
  deal with the last remaing boolean uses of fd_file()
  css_set_fork(): switch to CLASS(fd_raw, ...)
  memcg_write_event_control(): switch to CLASS(fd)
  assorted variants of irqfd setup: convert to CLASS(fd)
  do_pollfd(): convert to CLASS(fd)
  convert do_select()
  convert vfs_dedupe_file_range().
  convert cifs_ioctl_copychunk()
  convert media_request_get_by_fd()
  convert spu_run(2)
  switch spufs_calls_{get,put}() to CLASS() use
  convert cachestat(2)
  convert do_preadv()/do_pwritev()
  fdget(), more trivial conversions
  fdget(), trivial conversions
  privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget()
  o2hb_region_dev_store(): avoid goto around fdget()/fdput()
  introduce "fd_pos" class, convert fdget_pos() users to it.
  fdget_raw() users: switch to CLASS(fd_raw)
  convert vmsplice() to CLASS(fd)
  ...
2024-11-18 12:24:06 -08:00
Linus Torvalds
56be9aaf98 vfs-6.13.pagecache
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcUQAAKCRCRxhvAZXjc
 onEpAQCUdwIBHpwmSIFvJFA9aNGpbLzi0dDSEIxuWYtp5qVuogD+ImccwqpG3kEi
 Zq9vokdPpB1zbahxKl1mkvBG4G0GFQE=
 =LbP6
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs pagecache updates from Christian Brauner:
 "Cleanup filesystem page flag usage: This continues the work to make
  the mappedtodisk/owner_2 flag available to filesystems which don't use
  buffer heads. Further patches remove uses of Private2. This brings us
  very close to being rid of it entirely"

* tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  migrate: Remove references to Private2
  ceph: Remove call to PagePrivate2()
  btrfs: Switch from using the private_2 flag to owner_2
  mm: Remove PageMappedToDisk
  nilfs2: Convert nilfs_copy_buffer() to use folios
  fs: Move clearing of mappedtodisk to buffer.c
2024-11-18 09:54:32 -08:00
Linus Torvalds
70e7730c2a vfs-6.13.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcToAAKCRCRxhvAZXjc
 osL9AP948FFumJRC28gDJ4xp+X4eohNOfkgoEG8FTbF2zU6ulwD+O0pr26FqpFli
 pqlG+38UdATImpfqqWjPbb72sBYcfQg=
 =wLUh
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "Features:

   - Fixup and improve NLM and kNFSD file lock callbacks

     Last year both GFS2 and OCFS2 had some work done to make their
     locking more robust when exported over NFS. Unfortunately, part of
     that work caused both NLM (for NFS v3 exports) and kNFSD (for
     NFSv4.1+ exports) to no longer send lock notifications to clients

     This in itself is not a huge problem because most NFS clients will
     still poll the server in order to acquire a conflicted lock

     It's important for NLM and kNFSD that they do not block their
     kernel threads inside filesystem's file_lock implementations
     because that can produce deadlocks. We used to make sure of this by
     only trusting that posix_lock_file() can correctly handle blocking
     lock calls asynchronously, so the lock managers would only setup
     their file_lock requests for async callbacks if the filesystem did
     not define its own lock() file operation

     However, when GFS2 and OCFS2 grew the capability to correctly
     handle blocking lock requests asynchronously, they started
     signalling this behavior with EXPORT_OP_ASYNC_LOCK, and the check
     for also trusting posix_lock_file() was inadvertently dropped, so
     now most filesystems no longer produce lock notifications when
     exported over NFS

     Fix this by using an fop_flag which greatly simplifies the problem
     and grooms the way for future uses by both filesystems and lock
     managers alike

   - Add a sysctl to delete the dentry when a file is removed instead of
     making it a negative dentry

     Commit 681ce86235 ("vfs: Delete the associated dentry when
     deleting a file") introduced an unconditional deletion of the
     associated dentry when a file is removed. However, this led to
     performance regressions in specific benchmarks, such as
     ilebench.sum_operations/s, prompting a revert in commit
     4a4be1ad3a ("Revert "vfs: Delete the associated dentry when
     deleting a file""). This reintroduces the concept conditionally
     through a sysctl

   - Expand the statmount() system call:

       * Report the filesystem subtype in a new fs_subtype field to
         e.g., report fuse filesystem subtypes

       * Report the superblock source in a new sb_source field

       * Add a new way to return filesystem specific mount options in an
         option array that returns filesystem specific mount options
         separated by zero bytes and unescaped. This allows caller's to
         retrieve filesystem specific mount options and immediately pass
         them to e.g., fsconfig() without having to unescape or split
         them

       * Report security (LSM) specific mount options in a separate
         security option array. We don't lump them together with
         filesystem specific mount options as security mount options are
         generic and most users aren't interested in them

         The format is the same as for the filesystem specific mount
         option array

   - Support relative paths in fsconfig()'s FSCONFIG_SET_STRING command

   - Optimize acl_permission_check() to avoid costly {g,u}id ownership
     checks if possible

   - Use smp_mb__after_spinlock() to avoid full smp_mb() in evict()

   - Add synchronous wakeup support for ep_poll_callback.

     Currently, epoll only uses wake_up() to wake up task. But sometimes
     there are epoll users which want to use the synchronous wakeup flag
     to give a hint to the scheduler, e.g., the Android binder driver.
     So add a wake_up_sync() define, and use wake_up_sync() when sync is
     true in ep_poll_callback()

  Fixes:

   - Fix kernel documentation for inode_insert5() and iget5_locked()

   - Annotate racy epoll check on file->f_ep

   - Make F_DUPFD_QUERY associative

   - Avoid filename buffer overrun in initramfs

   - Don't let statmount() return empty strings

   - Add a cond_resched() to dump_user_range() to avoid hogging the CPU

   - Don't query the device logical blocksize multiple times for hfsplus

   - Make filemap_read() check that the offset is positive or zero

  Cleanups:

   - Various typo fixes

   - Cleanup wbc_attach_fdatawrite_inode()

   - Add __releases annotation to wbc_attach_and_unlock_inode()

   - Add hugetlbfs tracepoints

   - Fix various vfs kernel doc parameters

   - Remove obsolete TODO comment from io_cancel()

   - Convert wbc_account_cgroup_owner() to take a folio

   - Fix comments for BANDWITH_INTERVAL and wb_domain_writeout_add()

   - Reorder struct posix_acl to save 8 bytes

   - Annotate struct posix_acl with __counted_by()

   - Replace one-element array with flexible array member in freevxfs

   - Use idiomatic atomic64_inc_return() in alloc_mnt_ns()"

* tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
  statmount: retrieve security mount options
  vfs: make evict() use smp_mb__after_spinlock instead of smp_mb
  statmount: add flag to retrieve unescaped options
  fs: add the ability for statmount() to report the sb_source
  writeback: wbc_attach_fdatawrite_inode out of line
  writeback: add a __releases annoation to wbc_attach_and_unlock_inode
  fs: add the ability for statmount() to report the fs_subtype
  fs: don't let statmount return empty strings
  fs:aio: Remove TODO comment suggesting hash or array usage in io_cancel()
  hfsplus: don't query the device logical block size multiple times
  freevxfs: Replace one-element array with flexible array member
  fs: optimize acl_permission_check()
  initramfs: avoid filename buffer overrun
  fs/writeback: convert wbc_account_cgroup_owner to take a folio
  acl: Annotate struct posix_acl with __counted_by()
  acl: Realign struct posix_acl to save 8 bytes
  epoll: Add synchronous wakeup support for ep_poll_callback
  coredump: add cond_resched() to dump_user_range
  mm/page-writeback.c: Fix comment of wb_domain_writeout_add()
  mm/page-writeback.c: Update comment for BANDWIDTH_INTERVAL
  ...
2024-11-18 09:35:30 -08:00
Linus Torvalds
6ac81fd55e vfs-6.13.mgtime
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZzcScQAKCRCRxhvAZXjc
 oj+5AP4k822a77wc/3iPFk379naIvQ4dsrgemh0/Pb6ZvzvkFQEAi3vFCfzCDR2x
 SkJF/RwXXKZv6U31QXMRt2Qo6wfBuAc=
 =nVlm
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs multigrain timestamps from Christian Brauner:
 "This is another try at implementing multigrain timestamps. This time
  with significant help from the timekeeping maintainers to reduce the
  performance impact.

  Thomas provided a base branch that contains the required timekeeping
  interfaces for the VFS. It serves as the base for the multi-grain
  timestamp work:

   - Multigrain timestamps allow the kernel to use fine-grained
     timestamps when an inode's attributes is being actively observed
     via ->getattr(). With this support, it's possible for a file to get
     a fine-grained timestamp, and another modified after it to get a
     coarse-grained stamp that is earlier than the fine-grained time. If
     this happens then the files can appear to have been modified in
     reverse order, which breaks VFS ordering guarantees.

     To prevent this, a floor value is maintained for multigrain
     timestamps. Whenever a fine-grained timestamp is handed out, record
     it, and when later coarse-grained stamps are handed out, ensure
     they are not earlier than that value. If the coarse-grained
     timestamp is earlier than the fine-grained floor, return the floor
     value instead.

     The timekeeper changes add a static singleton atomic64_t into
     timekeeper.c that is used to keep track of the latest fine-grained
     time ever handed out. This is tracked as a monotonic ktime_t value
     to ensure that it isn't affected by clock jumps. Because it is
     updated at different times than the rest of the timekeeper object,
     the floor value is managed independently of the timekeeper via a
     cmpxchg() operation, and sits on its own cacheline.

     Two new public timekeeper interfaces are added:

      (1) ktime_get_coarse_real_ts64_mg() fills a timespec64 with the
          later of the coarse-grained clock and the floor time

      (2) ktime_get_real_ts64_mg() gets the fine-grained clock value,
          and tries to swap it into the floor. A timespec64 is filled
          with the result.

   - The VFS has always used coarse-grained timestamps when updating the
     ctime and mtime after a change. This has the benefit of allowing
     filesystems to optimize away a lot metadata updates, down to around
     1 per jiffy, even when a file is under heavy writes.

     Unfortunately, this has always been an issue when we're exporting
     via NFSv3, which relies on timestamps to validate caches. A lot of
     changes can happen in a jiffy, so timestamps aren't sufficient to
     help the client decide when to invalidate the cache. Even with
     NFSv4, a lot of exported filesystems don't properly support a
     change attribute and are subject to the same problems with
     timestamp granularity. Other applications have similar issues with
     timestamps (e.g backup applications).

     If we were to always use fine-grained timestamps, that would
     improve the situation, but that becomes rather expensive, as the
     underlying filesystem would have to log a lot more metadata
     updates.

     This adds a way to only use fine-grained timestamps when they are
     being actively queried. Use the (unused) top bit in
     inode->i_ctime_nsec as a flag that indicates whether the current
     timestamps have been queried via stat() or the like. When it's set,
     we allow the kernel to use a fine-grained timestamp iff it's
     necessary to make the ctime show a different value.

     This solves the problem of being able to distinguish the timestamp
     between updates, but introduces a new problem: it's now possible
     for a file being changed to get a fine-grained timestamp. A file
     that is altered just a bit later can then get a coarse-grained one
     that appears older than the earlier fine-grained time. This
     violates timestamp ordering guarantees.

     This is where the earlier mentioned timkeeping interfaces help. A
     global monotonic atomic64_t value is kept that acts as a timestamp
     floor. When we go to stamp a file, we first get the latter of the
     current floor value and the current coarse-grained time. If the
     inode ctime hasn't been queried then we just attempt to stamp it
     with that value.

     If it has been queried, then first see whether the current coarse
     time is later than the existing ctime. If it is, then we accept
     that value. If it isn't, then we get a fine-grained time and try to
     swap that into the global floor. Whether that succeeds or fails, we
     take the resulting floor time, convert it to realtime and try to
     swap that into the ctime.

     We take the result of the ctime swap whether it succeeds or fails,
     since either is just as valid.

     Filesystems can opt into this by setting the FS_MGTIME fstype flag.
     Others should be unaffected (other than being subject to the same
     floor value as multigrain filesystems)"

* tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: reduce pointer chasing in is_mgtime() test
  tmpfs: add support for multigrain timestamps
  btrfs: convert to multigrain timestamps
  ext4: switch to multigrain timestamps
  xfs: switch to multigrain timestamps
  Documentation: add a new file documenting multigrain timestamps
  fs: add percpu counters for significant multigrain timestamp events
  fs: tracepoints around multigrain timestamp events
  fs: handle delegated timestamps in setattr_copy_mgtime
  timekeeping: Add percpu counter for tracking floor swap events
  timekeeping: Add interfaces for handling timestamps with a floor value
  fs: have setattr_copy handle multigrain timestamps appropriately
  fs: add infrastructure for multigrain timestamps
2024-11-18 09:15:39 -08:00
Linus Torvalds
c9dd4571ad for-6.12-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmc2aGIACgkQxWXV+ddt
 WDu0AQ/9FLfC/e3X2GjZ0auna/7A/rF8MPoATUdAyHn75Md6Hc8PXpi1YvMph+ba
 pvoufOqrU66/g0UNeacgsp6rF4rKJHg0q9Id+7wueLnDr41g9paXjsLYItq4j26w
 GusDiZvUDwuDmb70vlTXrgAfnjooIdSwqJTlIzJxvl4wrNzOiUlUJtTMzmUrwn/9
 Lf/iByWlGcPKKBc+1ZzFz4HlVOZZSt9YePeJw2/Aul2OMtuI3RTTAL/NtjaFIlYc
 pb+NHVqFrrfgC+xo68hLBmnsBfS41EGR58rYRjEuQo0+hARa8WbxL3DNA/E/Vi5X
 dsq/wQVlD7IVIWCoF9J94/iyDdwlDOGFMoL6FUrJwDtPGN/v/xxtA6ruvuC7k5zy
 bHCR8ZVrJWVaxE7u0Gtl+hFPpDTwNTR7SfvK69gxPfci1cN0m2wCNK02SEUJwV09
 N82N2ENGGwyWS+nOl/ERB+7A0QxViMr3JpUrPzSYqsmn8bwDvovSjK2fFouJoSey
 bpAzbFWj+OS0O9nnRqabTJDM/Tk9O0s0Ye76aUS+Vfk9d5EuVfAg6pHiOBcFDhsK
 UEG9QbPltfh6LPDHCdV93HOOsC0uNxCTCSpbQ9LFGKBICQsPIX/vZeV45fNFJDLX
 j5kEtHFVU3snU+jA97nvYXPRANDnnNx/EzXv7zo0Ye8L+plecBs=
 =ssYj
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "One more fix that seems urgent and good to have in 6.12 final.

  It could potentially lead to unexpected transaction aborts, due to
  wrong comparison and order of processing of delayed refs"

* tag 'for-6.12-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix incorrect comparison for delayed refs
2024-11-15 09:45:32 -08:00
Josef Bacik
7d493a5ecc btrfs: fix incorrect comparison for delayed refs
When I reworked delayed ref comparison in cf4f04325b ("btrfs: move
->parent and ->ref_root into btrfs_delayed_ref_node"), I made a mistake
and returned -1 for the case where ref1->ref_root was > than
ref2->ref_root.  This is a subtle bug that can result in improper
delayed ref running order, which can result in transaction aborts.

Fixes: cf4f04325b ("btrfs: move ->parent and ->ref_root into btrfs_delayed_ref_node")
CC: stable@vger.kernel.org # 6.10+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-14 16:11:02 +01:00
Christoph Hellwig
e559ee0226 btrfs: validate queue limits
Call blk_validate_limits on the queue limits used for zone append
splitting so that calculated values get filled in and any stacking
conflicts get cought.

Without this there isn't a max_zone_append_sectors limits as of commit
559218d43e ("block: pre-calculate max_zone_append_sectors").

Fixes: 559218d43e ("block: pre-calculate max_zone_append_sectors")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241113084541.34315-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13 11:40:11 -07:00
Filipe Manana
e82c936293 btrfs: send: check for read-only send root under critical section
We're checking if the send root is read-only without being under the
protection of the root's root_item_lock spinlock, which is what protects
the root's flags when clearing the read-only flag, done at
btrfs_ioctl_subvol_setflags(). Furthermore, it should be done in the
same critical section that increments the root's send_in_progress counter,
as btrfs_ioctl_subvol_setflags() clears the read-only flag in the same
critical section that checks the counter's value.

So fix this by moving the read-only check under the critical section
delimited by the root's root_item_lock which also increments the root's
send_in_progress counter.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:23 +01:00
Filipe Manana
dc058f5fda btrfs: send: check for dead send root under critical section
We're checking if the send root is dead without the protection of the
root's root_item_lock spinlock, which is what protects the root's flags.
The inverse, setting the dead flag on a root, is done under the protection
of that lock, at btrfs_delete_subvolume(). Also checking and updating the
root's send_in_progress counter is supposed to be done in the same
critical section as checking for or setting the root dead flag, so that
these operations are done atomically as a single step (which is correctly
done by btrfs_delete_subvolume()).

So fix this by checking if the send root is dead in the same critical
section that updates the send_in_progress counter, which is protected by
the root's root_item_lock spinlock.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:23 +01:00
Filipe Manana
722d343f12 btrfs: remove check for NULL fs_info at btrfs_folio_end_lock_bitmap()
Smatch complains about possibly dereferencing a NULL fs_info at
btrfs_folio_end_lock_bitmap():

  fs/btrfs/subpage.c:332 btrfs_folio_end_lock_bitmap() warn: variable dereferenced before check 'fs_info' (see line 326)

because we access fs_info to set the 'start_bit' variable before doing the
check for a NULL fs_info.

However fs_info is never NULL, since in the only caller of
btrfs_folio_end_lock_bitmap() is extent_writepage(), where we have an
inode which always as a non-NULL fs_info.

So remove the check for a NULL fs_info at btrfs_folio_end_lock_bitmap().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:22 +01:00