linux/fs
Filipe Manana f13e01b89d btrfs: ensure fast fsync waits for ordered extents after a write failure
If a write path in COW mode fails, either before submitting a bio for the
new extents or an actual IO error happens, we can end up allowing a fast
fsync to log file extent items that point to unwritten extents.

This is because dropping the extent maps happens when completing ordered
extents, at btrfs_finish_one_ordered(), and the completion of an ordered
extent is executed in a work queue.

This can result in a fast fsync to start logging file extent items based
on existing extent maps before the ordered extents complete, therefore
resulting in a log that has file extent items that point to unwritten
extents, resulting in a corrupt file if a crash happens after and the log
tree is replayed the next time the fs is mounted.

This can happen for both direct IO writes and buffered writes.

For example consider a direct IO write, in COW mode, that fails at
btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an
error:

1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter
   set to false, meaning an error happened;

2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR
   flag;

3) btrfs_finish_ordered_extent() queues the completion of the ordered
   extent - so that btrfs_finish_one_ordered() will be executed later in
   a work queue. That function will drop extent maps in the range when
   it's executed, since the extent maps point to unwritten locations
   (signaled by the BTRFS_ORDERED_IOERR flag);

4) After calling btrfs_finish_ordered_extent() we keep going down the
   write path and unlock the inode;

5) After that a fast fsync starts and locks the inode;

6) Before the work queue executes btrfs_finish_one_ordered(), the fsync
   task sees the extent maps that point to the unwritten locations and
   logs file extent items based on them - it does not know they are
   unwritten, and the fast fsync path does not wait for ordered extents
   to complete, which is an intentional behaviour in order to reduce
   latency.

For the buffered write case, here's one example:

1) A fast fsync begins, and it starts by flushing delalloc and waiting for
   the writeback to complete by calling filemap_fdatawait_range();

2) Flushing the dellaloc created a new extent map X;

3) During the writeback some IO error happened, and at the end io callback
   (end_bbio_data_write()) we call btrfs_finish_ordered_extent(), which
   sets the BTRFS_ORDERED_IOERR flag in the ordered extent and queues its
   completion;

4) After queuing the ordered extent completion, the end io callback clears
   the writeback flag from all pages (or folios), and from that moment the
   fast fsync can proceed;

5) The fast fsync proceeds sees extent map X and logs a file extent item
   based on extent map X, resulting in a log that points to an unwritten
   data extent - because the ordered extent completion hasn't run yet, it
   happens only after the logging.

To fix this make btrfs_finish_ordered_extent() set the inode flag
BTRFS_INODE_NEEDS_FULL_SYNC in case an error happened for a COW write,
so that a fast fsync will wait for ordered extent completion.

Note that this issues of using extent maps that point to unwritten
locations can not happen for reads, because in read paths we start by
locking the extent range and wait for any ordered extents in the range
to complete before looking for extent maps.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-28 16:35:12 +02:00
..
9p fs/9p: mitigate inode collisions 2024-04-22 15:34:27 +00:00
adfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
affs affs: remove SLAB_MEM_SPREAD flag usage 2024-02-26 11:36:28 +01:00
afs afs: Fix occasional rmdir-then-VNOVNODE with generic/011 2024-03-14 12:13:21 +01:00
autofs dcache stuff for this cycle 2024-01-11 20:11:35 -08:00
bcachefs bcachefs: fix integer conversion bug 2024-04-28 21:34:29 -04:00
befs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
bfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
btrfs btrfs: ensure fast fsync waits for ordered extents after a write failure 2024-05-28 16:35:12 +02:00
cachefiles cachefiles: fix memory leak in cachefiles_add_cache() 2024-02-20 09:46:07 +01:00
ceph ceph: switch to use cap_delay_lock for the unlink delay list 2024-04-11 22:56:28 +02:00
coda mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
configfs
cramfs fs,block: yield devices early 2024-03-27 13:17:15 +01:00
crypto fscrypt updates for 6.9 2024-03-12 13:17:36 -07:00
debugfs debugfs: fix wait/cancellation handling during remove 2024-03-07 22:08:15 +00:00
devpts fs: Remove the now superfluous sentinel elements from ctl_table array 2023-12-28 04:57:57 -08:00
dlm dlm for 6.9 2024-03-18 15:39:48 -07:00
ecryptfs Merge tag 'exportfs-6.9' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/cel/linux 2024-01-23 17:56:30 +01:00
efivarfs efivarfs: Drop 'duplicates' bool parameter on efivar_init() 2024-02-25 09:43:39 +01:00
efs efs: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:33 +01:00
erofs erofs: reliably distinguish block based and fscache mode 2024-04-28 20:36:52 +08:00
exfat Description for this pull request: 2024-03-21 09:47:12 -07:00
exportfs fs: Create a generic is_dot_dotdot() utility 2024-01-23 10:58:56 -05:00
ext2 \n 2024-03-13 14:30:58 -07:00
ext4 fs,block: yield devices early 2024-03-27 13:17:15 +01:00
f2fs fs,block: yield devices early 2024-03-27 13:17:15 +01:00
fat - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min 2024-03-14 18:03:09 -07:00
freevxfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
fuse cuse: add kernel-doc comments to cuse_process_init_reply() 2024-04-15 11:02:10 +02:00
gfs2 gfs2 fix 2024-03-25 10:53:39 -07:00
hfs hfs: really remove hfs_writepage 2023-12-29 11:58:34 -08:00
hfsplus vfs-6.9.misc 2024-03-11 09:38:17 -07:00
hostfs hostfs: use d_splice_alias() calling conventions to simplify failure exits 2023-12-21 12:51:00 -05:00
hpfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
hugetlbfs vfs-6.9.misc 2024-03-11 09:38:17 -07:00
iomap vfs-6.9.rw_hint 2024-03-04 18:35:21 +01:00
isofs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
jbd2 jbd2: abort journal when detecting metadata writeback error of fs dev 2024-01-04 23:42:21 -05:00
jffs2 mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
jfs fs,block: yield devices early 2024-03-27 13:17:15 +01:00
kernfs kernfs: annotate different lockdep class for of->mutex of writable files 2024-04-14 06:55:46 -04:00
lockd NFSD 6.9 Release Notes 2024-03-12 14:27:37 -07:00
minix minix: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:32 +01:00
netfs netfs: Fix the pre-flush when appending to a file in writethrough mode 2024-04-26 14:56:18 +02:00
nfs NFS client bugfixes for Linux 6.9 2024-04-29 12:07:37 -07:00
nfs_common
nfsd nfsd-6.9 fixes: 2024-04-29 14:22:24 -07:00
nilfs2 nilfs2: fix OOB in nilfs_set_de_type 2024-04-16 15:39:52 -07:00
nls
notify fanotify: allow freeze when waiting response for permission events 2024-03-07 12:59:51 +01:00
ntfs3 ntfs3: add legacy ntfs file operations 2024-04-23 09:39:07 +02:00
ocfs2 - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min 2024-03-14 18:03:09 -07:00
omfs
openpromfs openpromfs: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:32 +01:00
orangefs Julia Lawall reported this null pointer dereference, this should fix it. 2024-02-14 15:57:53 -05:00
overlayfs ovl: relax WARN_ON in ovl_verify_area() 2024-03-17 15:59:41 +02:00
proc mm: support page_mapcount() on page_has_type() pages 2024-04-24 19:34:26 -07:00
pstore pstore/zone: Don't clear memory twice 2024-03-09 12:33:22 -08:00
qnx4 mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
qnx6 qnx6: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:32 +01:00
quota mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
ramfs ramfs: Initialize security of in-memory inodes 2024-01-26 09:08:16 -08:00
reiserfs fs,block: yield devices early 2024-03-27 13:17:15 +01:00
romfs fs,block: yield devices early 2024-03-27 13:17:15 +01:00
smb smb3: fix lock ordering potential deadlock in cifs_sync_mid_result 2024-04-25 12:49:50 -05:00
squashfs Squashfs: check the inode number is not the invalid value of zero 2024-04-16 15:39:50 -07:00
sysfs fs: sysfs: Fix reference leak in sysfs_break_active_protection() 2024-04-11 15:16:48 +02:00
sysv sysv: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:31 +01:00
tracefs eventfs: Have "events" directory get permissions from its parent 2024-05-04 04:25:37 -04:00
ubifs This pull request contains updates for UBI and UBIFS: 2024-03-21 15:09:29 -07:00
udf mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
ufs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
unicode
vboxsf vboxsf: explicitly deny setlease attempts 2024-04-03 16:06:39 +02:00
verity Networking changes for 6.9. 2024-03-12 17:44:08 -07:00
xfs Bug fixes for 6.9-rc3: 2024-04-06 09:14:18 -07:00
zonefs zonefs: Use str_plural() to fix Coccinelle warning 2024-04-10 07:23:47 +09:00
aio.c aio: Fix null ptr deref in aio_complete() wakeup 2024-04-05 11:20:28 +02:00
anon_inodes.c
attr.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
backing-file.c fs: Use KMEM_CACHE instead of kmem_cache_create 2024-02-02 13:11:50 +01:00
bad_inode.c
binfmt_elf.c
binfmt_elf_fdpic.c binfmt: replace deprecated strncpy 2024-03-21 20:20:52 -07:00
binfmt_elf_test.c
binfmt_flat.c
binfmt_misc.c
binfmt_script.c
buffer.c vfs-6.9.iomap 2024-03-11 10:07:03 -07:00
char_dev.c
compat_binfmt_elf.c
coredump.c iov_iter: get rid of 'copy_mc' flag 2024-03-06 10:52:12 +01:00
d_path.c
dax.c
dcache.c vfs-6.9.misc 2024-03-11 09:38:17 -07:00
direct-io.c block, fs: Restore the per-bio/request data lifetime fields 2024-02-06 14:31:05 +01:00
drop_caches.c
eventfd.c eventfd: strictly check the count parameter of eventfd_write to avoid inputting illegal strings 2024-02-08 10:12:26 +01:00
eventpoll.c epoll: be better about file lifetimes 2024-05-05 14:00:48 -07:00
exec.c execve fixes for v6.9-rc2 2024-03-27 09:57:30 -07:00
fcntl.c vfs-6.9.iomap 2024-03-11 10:07:03 -07:00
fhandle.c do_sys_name_to_handle(): use kzalloc() to fix kernel-infoleak 2024-01-22 15:33:38 +01:00
file.c file: remove __receive_fd() 2023-12-12 14:24:14 +01:00
file_table.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
filesystems.c
fs-writeback.c writeback: move wb_wakeup_delayed defination to fs-writeback.c 2024-01-22 15:33:38 +01:00
fs_context.c
fs_parser.c __fs_parse: Correct a documentation comment 2024-02-02 13:11:50 +01:00
fs_pin.c
fs_struct.c
fs_types.c
fsopen.c
init.c
inode.c bcachefs updates for 6.9 2024-03-15 09:00:09 -07:00
internal.h pidfs: remove config option 2024-03-13 12:53:53 -07:00
ioctl.c fs: Return ENOTTY directly if FS_IOC_GETUUID or FS_IOC_GETFSSYSFSPATH fail 2024-04-09 12:03:49 +02:00
Kconfig - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames 2024-03-14 17:43:30 -07:00
Kconfig.binfmt
kernel_read_file.c
libfs.c pidfs: remove config option 2024-03-13 12:53:53 -07:00
locks.c filelock: fix deadlock detection in POSIX locking 2024-02-20 09:53:33 +01:00
Makefile vfs-6.9.pidfd 2024-03-11 10:21:06 -07:00
mbcache.c vfs: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:31 +01:00
mnt_idmapping.c fs/mnt_idmapping.c: Return -EINVAL when no map is written 2024-02-08 10:12:37 +01:00
mount.h
mpage.c block, fs: Restore the per-bio/request data lifetime fields 2024-02-06 14:31:05 +01:00
namei.c security: Place security_path_post_mknod() where the original IMA call was 2024-04-03 10:21:32 -07:00
namespace.c fs: relax mount_setattr() permission checks 2024-02-07 21:16:29 +01:00
nsfs.c pidfs: remove config option 2024-03-13 12:53:53 -07:00
open.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
pidfs.c pidfs: remove config option 2024-03-13 12:53:53 -07:00
pipe.c fs/pipe: Convert to lockdep_cmp_fn 2024-02-02 13:11:49 +01:00
pnode.c
pnode.h
posix_acl.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
proc_namespace.c
read_write.c fsnotify: optionally pass access range in file permission hooks 2023-12-12 16:20:02 +01:00
readdir.c fsnotify: optionally pass access range in file permission hooks 2023-12-12 16:20:02 +01:00
remap_range.c remap_range: merge do_clone_file_range() into vfs_clone_file_range() 2024-02-06 17:07:21 +01:00
select.c fs/select: rework stack allocation hack for clang 2024-02-20 09:23:52 +01:00
seq_file.c
signalfd.c
splice.c fs: use splice_copy_file_range() inline helper 2023-12-12 16:20:02 +01:00
stack.c
stat.c vfs-6.8.mount 2024-01-08 10:57:34 -08:00
statfs.c
super.c fs,block: yield devices early 2024-03-27 13:17:15 +01:00
sync.c
sysctls.c fs: Remove the now superfluous sentinel elements from ctl_table array 2023-12-28 04:57:57 -08:00
timerfd.c
userfaultfd.c userfaultfd: use per-vma locks in userfaultfd operations 2024-02-22 15:27:20 -08:00
utimes.c
xattr.c evm: Move to LSM infrastructure 2024-02-15 23:43:47 -05:00