Previously, calling bch2_btree_has_scanned_nodes() when btree node
scan hadn't actually run would erroneously return false - causing us to
think a btree was entirely gone.
This fixes a 6.16 regression from moving the scheduling of btree node
scan out of bch2_btree_lost_data() (fixing the bug where we'd schedule
it persistently in the superblock) and only scheduling it when
check_topology() is asking for scanned btree nodes.
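The shape of the fix, as a sketch - the flag and helper names below are
hypothetical; the point is that the predicate must distinguish "scan ran
and found nothing" from "scan never ran":

  bool bch2_btree_has_scanned_nodes(struct bch_fs *c, enum btree_id btree)
  {
          /*
           * If btree node scan never ran, we can't say whether the
           * btree has nodes on disk; claiming it doesn't would make
           * callers think the btree is entirely gone.
           */
          if (!test_bit(BCH_FS_btree_node_scan_done, &c->flags)) /* hypothetical flag */
                  return true;

          return found_btree_nodes_for(c, btree);                /* hypothetical helper */
  }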
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The next patch will add logging of the specific error being corrected in
repair paths to the journal; this means __bch2_fsck_err() can return
transaction restarts in places that previously weren't expecting them.
check_topology() is old code that doesn't use btree iterators for btree
node locking - it'll have to be rewritten in the future to work online.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The new guard() and scoped_guard() helpers allow for more natural code.
Some of the uses with creative flow control have been left as-is.
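For illustration - the example functions here are invented; guard() and
scoped_guard() are the generic helpers from <linux/cleanup.h>:

  #include <linux/cleanup.h>
  #include <linux/mutex.h>

  /* Before: manual unlock on every path */
  static int example_before(struct mutex *lock, int v)
  {
          int ret = 0;

          mutex_lock(lock);
          if (v < 0)
                  ret = -EINVAL;
          mutex_unlock(lock);
          return ret;
  }

  /* After: the guard releases the lock on any return from the scope */
  static int example_after(struct mutex *lock, int v)
  {
          guard(mutex)(lock);

          if (v < 0)
                  return -EINVAL;
          return 0;
  }

  /* scoped_guard() limits the critical section to a block */
  static void example_scoped(struct mutex *lock, int *counter)
  {
          scoped_guard(mutex, lock)
                  (*counter)++;
  }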
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We'd like users to be able to debug without building custom kernels, so
this will help us get rid of CONFIG_BCACHEFS_DEBUG, at least for most
things.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a sentinel value for devices that have been removed, but whose index
we don't want to reuse until a fsck has completed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
_init_early() is for initialization that cannot fail, and often must
happen for teardown partway through initialization to work.
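A sketch of the convention, with hypothetical names: _init_early() only
sets up state that teardown relies on, so _exit() is safe to call at any
point after it:

  void bch2_foo_init_early(struct bch_fs *c)
  {
          /* Cannot fail: just the locks/lists that exit relies on */
          mutex_init(&c->foo_lock);
          INIT_LIST_HEAD(&c->foo_list);
  }

  int bch2_foo_init(struct bch_fs *c)
  {
          /* May fail: allocations etc. */
          c->foo_buf = kzalloc(FOO_BUF_SIZE, GFP_KERNEL);
          if (!c->foo_buf)
                  return -ENOMEM;
          return 0;
  }

  void bch2_foo_exit(struct bch_fs *c)
  {
          /* Safe even if bch2_foo_init() never ran or failed */
          kfree(c->foo_buf);
  }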
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If losing a btree won't cause data loss - i.e. it's an alloc btree, or
we can easily reconstruct it - we shouldn't require user action to
continue repair.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This was planned to be done ages ago, now finally completed; there are
places where we have quite a few btree_trans objects on the stack, so
this reduces stack usage somewhat.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add the new helper printbuf_indent_add_nextline(), and use it in
__bch2_fsck_err() to centralize setting the indentation of multiline
fsck errors.
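A sketch of the effect (message text invented): the added indentation
takes effect at the next newline, so the first line prints unindented
while continuation lines of a multiline error line up beneath it:

  printbuf_indent_add_nextline(out, 2);
  prt_printf(out, "invalid bkey in btree %s:\n", btree_name);
  prt_printf(out, "reason for the error\n");    /* printed indented by 2 */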
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
state_lock guards against devices coming or leaving, changing state, or
the filesystem changing between ro <-> rw.
But it's not necessary for running recovery passes, and holding it
blocks asynchronous events that would cause us to go RO or kick out
devices.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The btree node read error path already calls topology error, so this is
entirely redundant, and we're not specific enough about our error codes
- this was triggering for bucket_ref_update() errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Originally, we ran insert triggers before overwrite triggers so that if
an extent was being moved (by fallocate insert/collapse range), the
bucket sector count wouldn't hit 0 partway through, and we wouldn't
trigger state changes caused by that too soon.
But this is better solved by just moving the data type change to the
alloc trigger itself, where it's already called.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can't hold mark_lock while calling fsck_err() - that's a deadlock,
mark_lock is meant to be a leaf node lock.
It's also unnecessary for gc_bucket() and bucket_gen(); rcu suffices,
since the bucket_gens array carries its own size, and we can't race with
device removal or resize during gc/fsck since those paths take state_lock.
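Schematically (layout simplified), the size travels with the
RCU-protected array, so a reader can validate its index with only
rcu_read_lock() held:

  struct bucket_gens {
          struct rcu_head rcu;
          u64             nbuckets;
          u8              b[];    /* one gen per bucket */
  };

  /* caller must hold rcu_read_lock() */
  static inline u8 *bucket_gen(struct bch_dev *ca, size_t b)
  {
          struct bucket_gens *gens = rcu_dereference(ca->bucket_gens);

          if (b >= gens->nbuckets)
                  return NULL;
          return &gens->b[b];
  }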
Reported-by: syzbot+38641fcbda1aaffefdd4@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
These repair paths are well tested; we can repair these errors without
explicit user intervention.
This also tweaks bch2_topology_error() so that we run topology repair if
we're in recovery, not just fsck.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We unconditionally go read-write, if we're going to do so, before
journal replay: lazy_rw is obsolete.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is ugly:
We may discover in alloc_write_key that the data type we calculated is
wrong, because BCH_DATA_need_discard is checked/set elsewhere, and the
disk accounting counters we calculated need to be updated.
But bch2_alloc_key_to_dev_counters(..., BTREE_TRIGGER_gc) is not safe
w.r.t. transaction restarts, so we need to propagate the fixup back to
our gc state in case we take a transaction restart.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The fragmentation_lru field hasn't been needed since we reworked the LRU
btrees to use the btree write buffer; previously it was used to resolve
collisions, but the revised LRU btree uses the backpointer (the bucket)
as part of the key.
It should have been deleted at the time of the LRU rework; since it
wasn't, it left places for bugs to hide in the check/repair code.
This fixes LRU fsck on a filesystem image helpfully provided by a user
who disappeared before I could get his name for the reported-by.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
check_topology doesn't need the srcu lock and doesn't use normal btree
transactions - we can just drop the srcu lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We run this in full RW mode now, so we have to guard against the
superblock buffer being reallocated.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
gc_lock is now only for synchronization between check_alloc_info and
interior btree updates - nothing else.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Break up the percpu counter allocations into individual allocations for
each disk accounting counter; this fixes an issue on large systems where
we have too many replica entries to fit within the percpu allocator's
maximum practical allocation size.
Also, use just one eytzinger tree for the normal set of counters and the
gc counters; this simplifies accounting_gc_done() where we need the same
set of counters to be present in both tables.
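Roughly - struct layout approximate - each accounting key now owns its
own small percpu allocation, and an entry carries both counter sets so
one eytzinger tree serves normal and gc lookups:

  struct accounting_mem_entry {
          struct bpos     pos;            /* the accounting key */
          unsigned        nr_counters;
          u64 __percpu    *v[2];          /* [0] = normal, [1] = gc */
  };

  static int accounting_mem_entry_init(struct accounting_mem_entry *e)
  {
          /* one small percpu allocation per counter, not one huge one */
          e->v[0] = __alloc_percpu(e->nr_counters * sizeof(u64), sizeof(u64));
          if (!e->v[0])
                  return -ENOMEM;
          /* gc counters are allocated only while gc is running */
          return 0;
  }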
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
fsck_err() now optionally takes a btree_trans; if the current thread has
one, it must be passed.
The next patch will use this to unlock when waiting for user input.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Make things more consistent and ensure that we're using u64 bitfields -
we already have close to 32 key types and btree ids, so 32-bit bitfields
are nearly exhausted.
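i.e. masks over these id spaces need 64-bit math; for example (the
helper is hypothetical):

  #include <linux/bits.h>

  static u64 mark_btree_lost(u64 btrees_lost, enum btree_id btree)
  {
          /* BIT() is unsigned long - 32 bits on 32-bit archs; use BIT_ULL() */
          return btrees_lost | BIT_ULL(btree);
  }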
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Needed for online fsck; we need the trigger to initialize newly
allocated buckets and to pick up generation number changes while gc is
running.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Next change will move gc_alloc_start initialization into the alloc
trigger, so we have to mark those first.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Blocking the journal was needed to finish checking old style accounting,
but that code is gone and it's not needed in the alloc rewrite;
mark_lock is sufficient for synchronization.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Rewrite fsck/gc for the new accounting scheme.
This adds a second set of in-memory accounting counters for gc to use;
like with other parts of gc we run all triggers in TRIGGER_GC mode, then
compare what we calculated to existing in-memory accounting at the end.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reading disk accounting now requires an eytzinger lookup (see:
bch2_accounting_mem_read()), but the per-device counters are used
frequently enough that we'd like to still be able to read them with just
a percpu sum, as in the old code.
This patch special cases the device counters; when we update in-memory
accounting we also update the old style percpu counters if it's a device
counter update.
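In sketch form, with hypothetical helper names: the apply path peels off
dev counter updates and mirrors them into the old-style per-device
percpu counters, so reading them remains a plain percpu sum with no
eytzinger lookup:

  static void accounting_mem_add(struct bch_fs *c, struct bpos pos,
                                 const u64 *d, unsigned nr)
  {
          /* normal path: eytzinger lookup + percpu add (not shown) */
          __accounting_mem_add(c, pos, d, nr);

          /* fast-path mirror for per-device counters */
          if (accounting_pos_is_dev_counter(pos)) {
                  struct bch_dev *ca = bch2_dev_rcu(c, accounting_pos_dev(pos));

                  if (ca)
                          for (unsigned i = 0; i < nr; i++)
                                  this_cpu_add(ca->usage->counters[i], d[i]);
          }
  }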
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Main part of the disk accounting rewrite.
This is a wholesale rewrite of the existing disk space accounting, which
relies on percpu counters that are sharded by journal buffer, and
rolled up and added to each journal write.
With the new scheme, every set of counters is a distinct key in the
accounting btree; this fixes scaling limitations of the old scheme,
where counters took up space in each journal entry and required multiple
percpu counters.
Now, in-memory accounting requires a single set of percpu counters - not
multiple sets, one per in-flight journal buffer - and in the future
we'll probably also have counters that don't use in-memory percpu
counters; they're not strictly required.
An accounting update is now a normal btree update, using the btree write
buffer path. At transaction commit time, we apply accounting updates to
the in-memory counters, which are percpu counters indexed in an
eytzinger tree by the accounting key.
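In sketch form, a counter read is then an eytzinger search over the
sorted accounting positions plus a percpu sum (simplified; see
bch2_accounting_mem_read() for the real interface):

  static int accounting_mem_read(struct bch_fs *c, struct bpos pos,
                                 u64 *v, unsigned nr)
  {
          struct bch_accounting_mem *acc = &c->accounting;
          unsigned idx = eytzinger0_find(acc->k.data, acc->k.nr,
                                         sizeof(acc->k.data[0]),
                                         accounting_pos_cmp, &pos);

          if (idx >= acc->k.nr)
                  return -ENOENT;

          struct accounting_mem_entry *e = &acc->k.data[idx];

          for (unsigned i = 0; i < min(nr, e->nr_counters); i++)
                  v[i] = percpu_u64_get(e->v + i);
          return 0;
  }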
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a separate counter to bch_alloc_v4 for amount of striped data; this
lets us separately track striped and unstriped data in a bucket, which
lets us see when erasure coding has failed to update extents with stripe
pointers, and also find buckets to continue updating if we crash mid way
through creating a new stripe.
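Schematically - field placement approximate:

  struct bch_alloc_v4 {
          /* ... */
          __u32   dirty_sectors;
          __u32   cached_sectors;
          __u32   stripe_sectors; /* new: sectors in this bucket with stripe ptrs */
          /* ... */
  };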
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree_root_lock is for the root keys in btree_root, not the pointers to
the nodes themselves; this fixes a lock ordering issue between
btree_root_lock and btree node locks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>