Commit graph

37 commits

Author SHA1 Message Date
Kent Overstreet
1f8aede70d bcachefs: fix bch2_journal_keys_peek_prev_min() underflow
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-24 18:58:18 -04:00
Kent Overstreet
17c3395e25 bcachefs: opts.journal_rewind
Add a mount option for rewinding the journal, bringing the entire
filesystem to where it was at a previous point in time.

This is for extreme disaster recovery scenarios - it's not intended as
an undelete operation.

The option takes a journal sequence number; the desired sequence number
can be determined with 'bcachefs list_journal'.

Caveats:

- The 'journal_transaction_names' option must have been enabled (it's on
  by default). The option controls emitting of extra debug info in the
  journal, so we can see what individual transactions were doing.
  It also enables journalling of keys being overwritten, which is what
  we rely on here.

- A full fsck run will be automatically triggered since alloc info will
  be inconsistent. Only leaf node updates to non-alloc btrees are
  rewound, since rewinding interior btree updates isn't possible or
  desirable.

- We can't do anything about data that was deleted and overwritten.

  Lots of metadata updates after the point in time we're rewinding to
  shouldn't cause a problem, since we segregate data and metadata
  allocations (this is in order to make repair by btree node scan
  practical on larger filesystems; there's a small 64-bit per device
  bitmap in the superblock of device ranges with btree nodes, and we try
  to keep this small).

  However, having discards enabled will cause problems, since buckets
  are discarded as soon as they become empty (this is why we don't
  implement fstrim: we don't need it).

  Hopefully, this feature will be a one-off thing that's never used
  again: this was implemented for recovering from the "vfs i_nlink 0 ->
  subvol deletion" bug, and that bug was unusually disastrous and
  additional safeguards have since been implemented.

  But if it does turn out that we need this more in the future, I'll
  have to implement an option so that empty buckets aren't discarded
  immediately - lagging by perhaps 1% of device capacity.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16 19:03:52 -04:00
Kent Overstreet
0e62fca2a6 bcachefs: Fix bch2_journal_keys_peek_prev_min()
this code is rarely invoked, so - we had a few bugs left from basing it
off of bch2_journal_keys_peek_max()...

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15 22:11:55 -04:00
Kent Overstreet
09b9c72bd4 bcachefs: bch_err_throw()
Add a tracepoint for any time we return an error and unwind.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-02 12:16:35 -04:00
Kent Overstreet
18dad454cd bcachefs: Replace rcu_read_lock() with guards
The new guard(), scoped_guard() allow for more natural code.

Some of the uses with creative flow control have been left.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-01 00:03:12 -04:00
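
To illustrate why guards give "more natural code" in 18dad454cd, here is a minimal userspace sketch. It assumes GCC or Clang, since it relies on the cleanup attribute that the kernel's guard() is built on. fake_read_lock(), fake_read_unlock() and read_guard() are hypothetical stand-ins, not kernel or bcachefs APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a read-side lock: we only track nesting depth. */
static int read_lock_depth;

static void fake_read_lock(void)   { read_lock_depth++; }
static void fake_read_unlock(void) { read_lock_depth--; }

/* Called automatically when the guard variable goes out of scope. */
static void read_guard_cleanup(int *unused)
{
	(void) unused;
	fake_read_unlock();
}

/* Take the lock now, drop it on every exit path from the enclosing scope. */
#define read_guard()							\
	int _read_guard							\
	__attribute__((cleanup(read_guard_cleanup), unused)) =		\
		(fake_read_lock(), 0)

static const int table[4] = { 10, 20, 30, 40 };

static int table_lookup(size_t idx)
{
	read_guard();

	if (idx >= 4)
		return -1;	/* early return: the guard still unlocks */

	assert(read_lock_depth > 0);
	return table[idx];
}
```

The point of the pattern is that early returns no longer need a matching unlock or a goto-based unwind path.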
Kent Overstreet
9a4a858c9b bcachefs: Use bch2_kvmalloc() for journal keys array
We can hit this limit fairly easily when we have to reconstruct large
amounts of alloc info on large filesystems.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 22:42:17 -04:00
Kent Overstreet
55fd97fbc4 bcachefs: Use sort_nonatomic() instead of sort()
Fixes "task out to lunch" warnings during recovery on large machines
with lots of dirty data in the journal.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-06 19:33:53 -04:00
Kent Overstreet
d0fb2a266a bcachefs: cond_resched() in journal_key_sort_cmp()
Fixes "task out to lunch" warnings during recovery on large machines
with lots of dirty data in the journal.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-26 16:26:35 -04:00
Kent Overstreet
511ddcdb2d bcachefs: fix bch2_journal_key_insert_take() seq
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:21 -05:00
Kent Overstreet
eae6c4a625 bcachefs: fix O(n^2) issue with whiteouts in journal keys
The journal_keys array can't be substantially modified after we go RW,
because lookups need to be able to check it locklessly - thus we're
limited on what we can do when a key in the journal has been
overwritten.

This is a problem when there's many overwrites to skip over for peek()
operations. To fix this, add tracking of ranges of overwrites: we create
a range entry when there's more than one contiguous whiteout.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
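
The overwrite-range idea from eae6c4a625 can be sketched roughly as follows. This is an illustration only, not the bcachefs implementation: struct jkey, struct ow_range and the helper names are hypothetical, and the entry array is assumed to stay fixed once ranges are built.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct jkey     { int pos; bool overwritten; };
struct ow_range { size_t start, end; };	/* entries [start, end) are all overwritten */

/* Record every run of two or more overwritten entries; returns number of ranges. */
static size_t build_ranges(const struct jkey *k, size_t nr,
			   struct ow_range *r, size_t max)
{
	size_t nr_ranges = 0;

	for (size_t i = 0; i < nr; ) {
		size_t j = i;

		while (j < nr && k[j].overwritten)
			j++;
		if (j - i > 1 && nr_ranges < max)
			r[nr_ranges++] = (struct ow_range) { i, j };
		i = j > i ? j : i + 1;
	}
	return nr_ranges;
}

/* Binary search for the range containing idx, or NULL if there is none. */
static const struct ow_range *range_lookup(const struct ow_range *r,
					   size_t nr, size_t idx)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (idx >= r[mid].end)
			lo = mid + 1;
		else if (idx < r[mid].start)
			hi = mid;
		else
			return &r[mid];
	}
	return NULL;
}

/* First live entry at or after idx, or nr if none; long runs are jumped over. */
static size_t next_live(const struct jkey *k, size_t nr,
			const struct ow_range *r, size_t nr_ranges, size_t idx)
{
	while (idx < nr && k[idx].overwritten) {
		const struct ow_range *range = range_lookup(r, nr_ranges, idx);

		idx = range ? range->end : idx + 1;
	}
	return idx;
}
```

A single isolated whiteout gets no range entry and is skipped one step at a time, matching the "more than one contiguous whiteout" condition in the commit message.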
Kent Overstreet
854724d116 bcachefs: btree_and_journal_iter: don't iterate over too many whiteouts when prefetching
To help ameliorate issues with peek operations having to skip over
deletions in the journal - just bail out if all we're doing is
prefetching btree nodes.

Since btree node prefetching runs every time we iterate to a new node,
and has to sequentially scan ahead, this avoids another O(n^2).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
06d7a56fe0 bcachefs: journal keys: sort keys for interior nodes first
There's an unavoidable issue with btree lookups when we're overlaying
journal keys and the journal has many deletions for keys present in the
btree - peek operations will have to iterate over all those deletions to
find the next live key to return.

This is mainly a problem for lookups in interior nodes, if we have to
traverse to a leaf. Looking up an insert position in a leaf (for journal
replay) doesn't have to find the next live key, but walking down the
btree does.

So to ameliorate this, change journal key sort ordering so that we
replay keys from roots and interior nodes first.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
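
One way to picture the sort-order change in 06d7a56fe0 is a comparator that puts higher btree levels first. This is a sketch under assumptions: the struct layout and the exact field order (level, then btree ID, then position) are made up for illustration and may not match the real comparison function.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified journal key: level 0 is a leaf. */
struct jkey {
	unsigned		btree_id;
	unsigned		level;
	unsigned long long	pos;
};

static int cmp_u64(unsigned long long a, unsigned long long b)
{
	return (a > b) - (a < b);
}

/* Higher levels (roots, interior nodes) sort first, then btree ID, then pos. */
static int jkey_cmp(const void *_l, const void *_r)
{
	const struct jkey *l = _l, *r = _r;
	int ret;

	ret = cmp_u64(r->level, l->level);	/* descending by level */
	if (ret)
		return ret;
	ret = cmp_u64(l->btree_id, r->btree_id);
	if (ret)
		return ret;
	return cmp_u64(l->pos, r->pos);
}

static void sort_journal_keys(struct jkey *keys, size_t nr)
{
	qsort(keys, nr, sizeof(keys[0]), jkey_cmp);
}
```

With this ordering, replaying the array front to back applies interior node updates before any leaf updates.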
Kent Overstreet
57026c41c9 bcachefs: kill bch2_journal_entries_free()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
Kent Overstreet
7e5b8e00e2 bcachefs: Implement bch2_btree_iter_prev_min()
A user contributed a filesystem dump, where the dump was actually
corrupted (due to being taken while the filesystem was online), but
which exposed an interesting bug in fsck - reconstruct_inode().

When iterating in BTREE_ITER_filter_snapshots mode, it's required to
give an end position for the iteration and it can't span inode numbers;
continuing into the next inode might mean we start seeing keys from a
different snapshot tree, that the is_ancestor() checks always filter,
thus we're never able to return a key and stop iterating.

Backwards iteration never implemented the end position because nothing
else needed it - except for reconstruct_inode().

Additionally, backwards iteration is now able to overlay keys from the
journal, which will be useful if we ever decide to start doing journal
replay in the background.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:18 -05:00
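
As a rough picture of backwards iteration with an end position (7e5b8e00e2), the sketch below searches a sorted array for the last key at or before pos and refuses to return anything below min. struct key and peek_prev_min() here are hypothetical simplifications, not the btree iterator API.

```c
#include <assert.h>
#include <stddef.h>

struct key { unsigned long long inode, offset; };

static int key_cmp(const struct key *l, const struct key *r)
{
	if (l->inode != r->inode)
		return l->inode < r->inode ? -1 : 1;
	if (l->offset != r->offset)
		return l->offset < r->offset ? -1 : 1;
	return 0;
}

/*
 * Index of the last key <= pos, provided that key is >= min; -1 otherwise.
 * The array must be sorted in ascending key_cmp() order.
 */
static long peek_prev_min(const struct key *k, size_t nr,
			  struct key pos, struct key min)
{
	size_t lo = 0, hi = nr;

	/* Find the first key strictly greater than pos. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (key_cmp(&k[mid], &pos) <= 0)
			lo = mid + 1;
		else
			hi = mid;
	}

	/* Check before stepping back, so the unsigned index cannot underflow. */
	if (lo == 0)
		return -1;
	/* Stop at the end position instead of wandering into another inode. */
	if (key_cmp(&k[lo - 1], &min) < 0)
		return -1;
	return (long) (lo - 1);
}
```

Passing min as the first position of the current inode keeps the search from continuing into the previous inode's keys.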
Kent Overstreet
000fe8d573 bcachefs: Rename btree_iter_peek_upto() -> btree_iter_peek_max()
We'll be introducing btree_iter_peek_prev_min(), so rename for
consistency.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:17 -05:00
Kent Overstreet
db514cf677 bcachefs: Avoid bch2_btree_id_str()
Prefer bch2_btree_id_to_text() - it prints out the integer ID when
unknown.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21 01:36:15 -05:00
Kent Overstreet
ec36573dcd bcachefs: Add a cond_resched() to __journal_keys_sort()
Without this, we'd potentially sort multiple times without a
cond_resched(), leading to hung task warnings on larger systems.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09 09:41:46 -04:00
Kent Overstreet
9dec2a473b bcachefs: Accumulate accounting keys in journal replay
Until accounting keys hit the btree, they are deltas, not new versions
of the existing key; this means we have to teach journal replay to
accumulate them.

Additionally, the journal doesn't track precisely which entries have
been flushed to the btree; it only tracks a range of entries that may
possibly still need to be flushed.

That means we need to compare accounting keys against the version in the
btree and only flush updates that are newer.

There's another wrinkle with the write buffer: if the write buffer
starts flushing accounting keys before journal replay has finished
flushing accounting keys, journal replay will see the version number
from the new updates and updates from the journal will be lost.

To avoid this, journal replay has to flush accounting keys first, and
we'll be adding a flag so that write buffer flush knows to hold
accounting keys until then.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
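
The accumulate-and-compare rule described in 9dec2a473b can be sketched as below. The types and the single-counter model are assumptions made for illustration; real accounting keys carry more state than one version and one value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical journal entry: a delta tagged with the version that produced it. */
struct acct_delta {
	unsigned long long	version;
	long long		delta;
};

/* Hypothetical btree copy of the same counter. */
struct acct_btree {
	unsigned long long	version;
	long long		value;
};

/*
 * Sum all journal deltas newer than the btree version and apply them once.
 * Deltas at or below the btree version were already flushed and are skipped.
 * Returns the number of deltas applied.
 */
static size_t replay_accounting(struct acct_btree *b,
				const struct acct_delta *j, size_t nr)
{
	unsigned long long newest = b->version;
	long long sum = 0;
	size_t applied = 0;

	for (size_t i = 0; i < nr; i++) {
		if (j[i].version <= b->version)
			continue;
		sum += j[i].delta;
		if (j[i].version > newest)
			newest = j[i].version;
		applied++;
	}

	b->value  += sum;
	b->version = newest;
	return applied;
}
```

The sketch also shows why ordering against the write buffer matters: if something else bumped the btree version first, the journal deltas would compare as old and be dropped.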
Kent Overstreet
6ab71b4a8e bcachefs: bch2_journal_keys_dump()
debug helper

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-05-08 17:29:17 -04:00
Kent Overstreet
1189bdda6c bcachefs: Fix __bch2_btree_and_journal_iter_init_node_iter()
We weren't respecting trans->journal_replay_not_finished - we shouldn't
be searching the journal keys unless we have a ref on them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-10 22:28:36 -04:00
Kent Overstreet
30e615a2ce bcachefs: Fix gap buffer bug in bch2_journal_key_insert_take()
Multiple bug fixes for journal iters:

 - When the journal keys gap buffer is resized, we have to adjust the
   iterators for moving the gap to the end
 - We don't want to rewind iterators to point to the key we just
   inserted if it's not for the correct btree/level

Also, add some new assertions.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-07 02:22:28 -04:00
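
For readers unfamiliar with the data structure behind 30e615a2ce, here is a minimal fixed-size gap buffer sketch. It covers only the gap movement and insertion, not resizing or the iterator fixups the commit is about, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GB_SIZE 8

/*
 * Live elements occupy [0, gap) and [gap + gap_len, GB_SIZE);
 * the unused slots in between are the gap.
 */
struct gap_buf {
	int	data[GB_SIZE];
	size_t	nr;	/* number of live elements */
	size_t	gap;	/* index where the gap starts */
};

static size_t gap_len(const struct gap_buf *b)
{
	return GB_SIZE - b->nr;
}

/* Move the gap so that it starts at logical index new_gap. */
static void move_gap(struct gap_buf *b, size_t new_gap)
{
	size_t len = gap_len(b);

	assert(new_gap <= b->nr);

	if (new_gap < b->gap)
		/* Shift [new_gap, gap) up past the gap. */
		memmove(b->data + new_gap + len, b->data + new_gap,
			(b->gap - new_gap) * sizeof(b->data[0]));
	else if (new_gap > b->gap)
		/* Shift the elements just after the gap down into it. */
		memmove(b->data + b->gap, b->data + b->gap + len,
			(new_gap - b->gap) * sizeof(b->data[0]));
	b->gap = new_gap;
}

/* Insert v at logical index idx; returns -1 if the buffer is full. */
static int gb_insert(struct gap_buf *b, size_t idx, int v)
{
	if (b->nr == GB_SIZE)
		return -1;
	move_gap(b, idx);
	b->data[b->gap++] = v;
	b->nr++;
	return 0;
}

/* Translate a logical index to a physical slot, skipping the gap. */
static int gb_get(const struct gap_buf *b, size_t idx)
{
	assert(idx < b->nr);
	return b->data[idx < b->gap ? idx : idx + gap_len(b)];
}
```

Because physical slots after the gap shift whenever the gap moves or the buffer grows, anything that caches a physical index has to be adjusted at the same time, which is the class of bug the commit addresses.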
Kent Overstreet
bdbf953b3c bcachefs: bch2_shoot_down_journal_keys()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-03 14:44:18 -04:00
Kent Overstreet
40cb26233a bcachefs: Be careful about btree node splits during journal replay
Don't pick a pivot that's going to be deleted.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
048f47e83f bcachefs: btree_and_journal_iter now respects trans->journal_replay_not_finished
btree_and_journal_iter is now safe to use at runtime, not just during
recovery before journal keys have been freed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-31 20:36:11 -04:00
Kent Overstreet
2cce3752ce bcachefs: split out ignore_blacklisted, ignore_not_dirty
prep work for replaying the journal backwards

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
69426613cd bcachefs: improve move_gap()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
95ffc7fb8c bcachefs: journal_keys now uses darray helpers
nice bit of code cleanup

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
894d062254 bcachefs: Rename journal_keys.d -> journal_keys.data
This will let us use some darray helpers in the next patch.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
0b5961b0d8 bcachefs: jset_entry for loops declare loop iter
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 21:22:25 -04:00
Kent Overstreet
cb6fc943b6 bcachefs: kill kvpmalloc()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-13 18:39:12 -04:00
Kent Overstreet
5f43b0134e bcachefs: btree node prefetching in check_topology
btree_and_journal_iter is old code that we want to get rid of, but we're
not ready to yet.

lack of btree node prefetching is, it turns out, a real performance
issue for fsck on spinning rust, so - add it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet
fc634d8e46 bcachefs: btree_and_journal_iter.trans
we now always have a btree_trans when using a btree_and_journal_iter;
prep work for adding prefetching to btree_and_journal_iter

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet
f412392f6e bcachefs: __journal_keys_sort() refactoring
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-05 23:24:19 -05:00
Kent Overstreet
3c471b6588 bcachefs: convert bch_fs_flags to x-macro
Now we can print out filesystem flags in sysfs, useful for debugging
various "what's my filesystem doing" issues.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:38 -05:00
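
The x-macro technique named in 3c471b6588 generates an enum and a matching string table from one list, which is what makes printing flags by name cheap. The sketch below uses made-up flag names and a hypothetical flags_to_text() helper; it is not the bcachefs flag list or its sysfs code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Single source of truth: add a flag here and both tables pick it up. */
#define FS_FLAGS()	\
	x(started)	\
	x(rw)		\
	x(error)

enum fs_flags {
#define x(n)	FS_FLAG_##n,
	FS_FLAGS()
#undef x
	FS_FLAG_NR
};

static const char * const fs_flag_strs[] = {
#define x(n)	#n,
	FS_FLAGS()
#undef x
	NULL
};

/* Write the names of all set flags into buf, separated by spaces. */
static void flags_to_text(char *buf, size_t size, unsigned long flags)
{
	size_t off = 0;

	assert(size > 0);
	buf[0] = '\0';

	for (unsigned i = 0; i < FS_FLAG_NR; i++) {
		int n;

		if (!(flags & (1UL << i)))
			continue;
		n = snprintf(buf + off, size - off, "%s%s",
			     off ? " " : "", fs_flag_strs[i]);
		if (n < 0 || (size_t) n >= size - off)
			break;	/* output truncated; stop here */
		off += (size_t) n;
	}
}
```

Since the enum and the strings come from the same list, they cannot drift out of sync when a flag is added or removed.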
Kent Overstreet
ad9c7992eb bcachefs: Kill btree_iter->journal_pos
For BTREE_ITER_WITH_JOURNAL, we memoize lookups in the journal keys, to
avoid the binary search overhead.

Previously we stashed the pos of the last key returned from the journal,
in order to force the lookup to be redone when rewinding.

Now bch2_journal_keys_peek_upto() handles rewinding itself when
necessary - so we can slim down btree_iter.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-01-01 11:47:37 -05:00
Kent Overstreet
8a443d3ea1 bcachefs: Proper refcounting for journal_keys
The btree iterator code overlays keys from the journal until journal
replay is finished; since we're now starting copygc/rebalance etc.
before replay is finished, this is multithreaded access and thus needs
refcounting.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-24 02:43:12 -05:00
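
A minimal sketch of the refcounting scheme described in 8a443d3ea1, using C11 atomics in userspace. The struct and the keys_init(), keys_get() and keys_put() helpers are hypothetical; the assumption is that recovery holds the initial reference and each concurrent reader takes its own.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct journal_keys {
	atomic_int	ref;
	int		*data;
	size_t		nr;
};

/* The initial reference belongs to whoever runs journal replay. */
static void keys_init(struct journal_keys *k, size_t nr)
{
	atomic_init(&k->ref, 1);
	k->data = calloc(nr, sizeof(*k->data));
	k->nr   = k->data ? nr : 0;
}

/* Take a reference unless the keys have already been released. */
static bool keys_get(struct journal_keys *k)
{
	int old = atomic_load(&k->ref);

	while (old > 0)
		if (atomic_compare_exchange_weak(&k->ref, &old, old + 1))
			return true;
	return false;
}

/* Drop a reference; the last one out frees the array. */
static void keys_put(struct journal_keys *k)
{
	if (atomic_fetch_sub(&k->ref, 1) == 1) {
		free(k->data);
		k->data = NULL;
		k->nr   = 0;
	}
}
```

A reader such as a background thread would call keys_get() before overlaying journal keys and keys_put() afterwards, so replay finishing cannot free the array underneath it.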
Kent Overstreet
401585fe87 bcachefs: btree_journal_iter.c
Split out a new file from recovery.c for managing the list of keys we
read from the journal: before journal replay finishes the btree iterator
code needs to be able to iterate over and return keys from the journal
as well, so there's a fair bit of code here.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:10:10 -04:00